
Kolmogorov-Smirnov Test for Two Samples

If you’re reading this post, you’ve probably heard of the Kolmogorov-Smirnov test for normality testing, and maybe even of its two-sample version. But how is the test actually derived? This question came up in one of the homework assignments as I worked my way through MIT’s 18.650. This post loosely follows that assignment, with some additional commentary and the code I wrote to implement the algorithms described here.

Problem Statement

Consider two independent samples $X_1, \dots, X_n$, and $Y_1, \dots, Y_m$ of independent, real-valued, continuous random variables, and assume that the $X_i$’s are i.i.d. with some cdf $F$ and that the $Y_i$’s are i.i.d. with some cdf $G$. Note that the two samples may have different sizes (if $n \ne m$). We want to test whether $F = G$. Consider the following hypotheses:

\[\begin{align*} H_0 \colon ``F = G" \\ H_1 \colon ``F \ne G" \end{align*}\]

For simplicity, we will assume that $F$ and $G$ are continuous and increasing.

Example Experiment

One setting in which we might want to test whether two samples come from the same distribution is a lab experiment with two measurement devices, where we wish to determine whether the devices’ errors share the same distribution for the purposes of our analysis.

CDF Distributions

Let

\[\begin{align*} U_i &= F(X_i), \quad \forall i = 1, \dots, n, \\ V_j &= G(Y_j), \quad \forall j = 1, \dots, m. \end{align*}\]

Proposition 1. If $X$ is a continuous random variable with cdf $F$, then $F(X)$ is uniformly distributed on $[0, 1]$.

Proof. The distributions of $U_i$ and $V_j$ can be determined by finding their cdfs. The cdf of $U_i$ is defined by $F_U(t) \coloneqq \mathbb{P}\left[U_i \le t\right]$. Since $F$ and $G$ are continuous and increasing, they are invertible, and it follows that for $t \in [0, 1]$

\[\begin{align*} \mathbb{P}\left[U_i \le t\right] &= \mathbb{P}\left[F(X_i) \le t\right] &\quad&\text{(definition of $U_i$)} \\ &= \mathbb{P}\left[X_i \le F^{-1}(t)\right] \\ &= F(F^{-1}(t)) &\quad&\text{(definition of cdf)} \\ &= t \\ \therefore F_U(t) &= t \\ \implies U_i &\sim \mathcal{U}\left(\left[ 0, 1 \right]\right) \tag*{◻} \end{align*}\]

Likewise, $V_j \sim \mathcal{U}\left(\left[ 0, 1 \right]\right)$.
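Proposition 1 is just the probability integral transform, and it is easy to check numerically. The sketch below is my own illustration (it assumes NumPy and SciPy are available): draw a sample from a normal distribution, push it through its own cdf, and test the result against a uniform distribution.

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Draw a sample from an arbitrary continuous distribution...
x = rng.normal(loc=2.0, scale=3.0, size=10_000)

# ...and evaluate its own cdf at each sample point: U = F(X)
u = stats.norm.cdf(x, loc=2.0, scale=3.0)

# U should be indistinguishable from a Uniform([0, 1]) sample
print(u.min(), u.max())            # all values lie within [0, 1]
print(stats.kstest(u, 'uniform'))  # large p-value => consistent with uniform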

Empirical CDFs

Let $F_n$ be the empirical cdf of $\{X_1, \dots, X_n\}$ and $G_m$ be the empirical cdf of $\{Y_1, \dots, Y_m\}$.

The Test Statistic

Let the test statistic be

\[T_{n,m} = \sup_{t \in \mathbb{R}} \left| F_n(t) - G_m(t) \right|.\]

Proposition 2. The test statistic $T_{n,m}$ can be written as the maximum value of a finite set of numbers.

Proof. Recall that the indicator function is defined as \(\mathbb{1}\!\left\{X \le t\right\} = \begin{cases} 1 & \text{if } X \le t \\ 0 & \text{otherwise} \end{cases}.\) By definition, the cdf is

\[\begin{align*} F(t) &= \mathbb{P}\left[X \le t\right] \quad \forall t \in \mathbb{R}\\ &= \mathbb{E}\left[\mathbb{1}\!\left\{X \le t\right\}\right].\\ \end{align*}\]

By the Law of Large Numbers, the expectation can be approximated by the sample average, so we can define the empirical cdf as

\[\begin{align} F_n(t) &= \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\!\left\{X_i \le t\right\} \label{1}\tag{1} \end{align}\]

Likewise,

\[\begin{align} G_m(t) &= \frac{1}{m}\sum_{j=1}^{m} \mathbb{1}\!\left\{Y_j \le t\right\}. \label{2}\tag{2} \end{align}\] \[\therefore T_{n,m} = \sup_{t \in \mathbb{R}} \left| \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\!\left\{X_i \le t\right\} - \frac{1}{m}\sum_{j=1}^{m} \mathbb{1}\!\left\{Y_j \le t\right\} \right|.\]

The empirical cdfs $\eqref{1}$ and $\eqref{2}$ can also be written

\[\begin{align*} F_n(t) &= \#\{i=1, \dots, n \colon X_i \le t\} \cdot \frac{1}{n} \\ G_m(t) &= \#\{j=1, \dots, m \colon Y_j \le t\} \cdot \frac{1}{m}, \end{align*}\]

so the empirical cdfs can only take values in the discrete sets

\[\begin{align*} F_n(t) &\in \left\{ \tfrac{i}{n} \colon i = 0, \dots, n \right\}, \\ G_m(t) &\in \left\{ \tfrac{j}{m} \colon j = 0, \dots, m \right\}. \end{align*}\]

Therefore, the test statistic can be rewritten as the maximum value of a finite set of numbers:

\[\begin{split} T_{n,m} = \max_{i=0,\dots,n} \Bigg[ &\max_{j=0,\dots,m} \left| \frac{i}{n} - \frac{j}{m} \right| \mathbb{1}\!\left\{Y^{(j)} \le X^{(i)} < Y^{(j+1)}\right\}, \\ &\max_{k=j+1, \dots, m} \left| \frac{i}{n} - \frac{k}{m} \right| \mathbb{1}\!\left\{Y^{(k)} \le X^{(i+1)}\right\} \Bigg] \end{split}\]

where $X^{(i)}$ is the $i^\text{th}$ value in the ordered sample $X^{(1)} \le \cdots \le X^{(n)}$. The value $X^{(0)} \coloneqq -\infty$ is prepended to the (otherwise finite) ordered sample to simplify the computation.
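Before moving on, it may help to see $\eqref{1}$ in code: the empirical cdf is just the fraction of sample points at or below $t$. The helper below is a minimal sketch of my own; the name `ecdf` is not used elsewhere in this post.

import numpy as np

def ecdf(sample, t):
    """Evaluate the empirical cdf of `sample` at each point in `t`."""
    sample = np.asarray(sample)
    # Fraction of sample points less than or equal to each value of t
    return np.array([np.mean(sample <= ti) for ti in np.atleast_1d(t)])

# Example: F_n evaluated at a few points for a standard normal sample
rng = np.random.default_rng(42)
data = rng.normal(size=1000)
print(ecdf(data, [-1.0, 0.0, 1.0]))  # roughly [0.16, 0.50, 0.84]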

The following algorithm calculates the KS test statistic for two given samples.

Algorithm 1. Calculate the KS test statistic $T_{n,m}$ for two samples.

Require: $X, Y$ are vectors of real numbers.

Ensure: $0 \le T_{n,m} \le 1$.

Procedure KS2Sample($X, Y$)

$X_s \gets \{-\infty,$ Sort($X$)$\}$

$Y_s \gets$ Sort($Y$)

$n \gets \dim X_s - 1$ ▷ $X_s$ has $n + 1$ entries, including $X^{(0)} = -\infty$

$m \gets \dim Y_s$

$T_v \gets$ empty array of size $n+1$

$j \gets 0$

for all $i \in \{0, \dots, n\}$ do

$j \gets j$ + Rank($\{Y_s^{(\ell)}\}_{\ell=j}^m, X_s^{(i)}$) ▷ Only search remaining $j$ values

$k \gets j$ + Rank($\{Y_s^{(\ell)}\}_{\ell=j}^m, X_s^{(\min(i+1, n))}$)

$\displaystyle{T_v^{(i)} \gets \max\left(\left|\frac{i}{n} - \frac{j}{m}\right|, \left|\frac{i}{n} - \frac{k}{m}\right|\right)}$

end for

Return $\max_i T_v$

end procedure

Function Rank($A, k$)

Assert $A$ is sorted in ascending order.

Return $\#\{i=1,\dots,\dim A \colon A_i < k\}$

end function

The following subroutine is an implementation of Algorithm 1. It returns an array of values $T_v^{(i)}$, one for each point of the padded, sorted sample, along with the corresponding counts of $Y$ values. The test statistic $T_{n,m}$ is the maximum of the $T_v^{(i)}$.

The entire source code for the figures and algorithms in this post is available here.

import numpy as np


def _ks_2samp(X, Y):
    """Compute the Kolmogorov-Smirnov statistic on 2 samples.

    Parameters
    ----------
    X, Y : (N,), (M,) array_like
        Two arrays of sample observations assumed to be drawn from a continuous
        distribution, sample sizes can differ.

    Returns
    -------
    Tv : (N+1,) ndarray
        Maximum difference in CDFs for each value of X.
    js : (N+1,) ndarray
        Count of sorted `Y` values at or below the point where each `Tv`
        value is attained.

    .. note:: The KS statistic itself is the maximum of these `Tv` values, but
        use this helper function for debugging.
    """
    n = len(X)
    m = len(Y)
    # Sort copies of the data
    Xs = np.hstack([-np.inf, np.sort(X)])  # pad extra point
    Ys = np.sort(Y)
    # Calculate the maximum difference in the empirical CDFs
    Tv = np.zeros(n+1)  # extra value at Fn = 0 (Xs -> -infty)
    js = np.zeros(n+1, dtype=int)
    j = 0
    for i in range(n+1):
        # Find greatest Ys point j s.t. Ys[j] <= Xs[i] and Xs[i] < Ys[j+1]
        j += _rank(Ys[j:], Xs[i])  # only search remaining values

        test_lo = np.abs(i/n - j/m)
        j_lo = j

        # Find next greatest Ys point k s.t. Ys[k] < X[i+1]
        k = _rank(Ys[j:], Xs[min(i+1, n)]) + j
        test_hi = np.abs(i/n - k/m)
        j_hi = k

        # Take the maximum distance, and corresponding index
        Tv[i] = np.max((test_lo, test_hi))
        js[i] = j_lo if np.argmax((test_lo, test_hi)) == 0 else j_hi

    return Tv, js

def _rank(A, k):
    """Return the number of keys in `A` strictly less than `k`.

    Parameters
    ----------
    A : (M,) array_like
        Array of values, must be sorted in ascending order.
    k : comparable
        Key for which to search in `A`. Must be comparable to values in `A`.
        `k` need not be in the array.

    Returns
    -------
    result : int
        Number of keys in `A` strictly less than `k`. 0 if `k` is less than all
        elements of `A`. `len(A)` if `k` is greater than all elements of `A`.
    """
    assert all(A == sorted(A))
    lo = 0
    hi = len(A) - 1
    while lo <= hi:
        mid = (hi + lo) // 2
        if k < A[mid]:
            hi = mid - 1
        elif k > A[mid]:
            lo = mid + 1
        else:  # k == A[mid]
            return mid
    return lo
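As a quick sanity check (not part of the original assignment), the statistic computed by `_ks_2samp` can be compared against `scipy.stats.ks_2samp` on the same data; the seed and sample sizes below are arbitrary choices.

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
X = rng.normal(loc=0.0, scale=1.0, size=100)
Y = rng.normal(loc=0.0, scale=np.sqrt(2.0), size=70)

Tv_obs, js_obs = _ks_2samp(X, Y)
T_nm = Tv_obs.max()

# The two statistics should agree to floating-point precision
print(T_nm, stats.ks_2samp(X, Y).statistic)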

An example two-sample KS-test is shown in Figure 1.

Figure 1. The empirical cdfs of two independent random samples from \(\mathcal{N}\left( 0, 1 \right)\) and \(\mathcal{N}\left( 0, 2 \right)\). The test statistic \(T_{n,m}\) is shown by the double arrow.

The Null Hypothesis

Proposition 3. If $H_0$ is true, then the test statistic

\[T_{n,m} = \sup_{0 \le x \le 1} \left| \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\!\left\{U_i \le x\right\} - \frac{1}{m}\sum_{j=1}^{m} \mathbb{1}\!\left\{V_j \le x\right\} \right|.\]

Proof. By $\eqref{1}$ and $\eqref{2}$,

\[\begin{equation} \label{3}\tag{3} T_{n,m} = \sup_{t \in \mathbb{R}} \left| \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\!\left\{X_i \le t\right\} - \frac{1}{m}\sum_{j=1}^{m} \mathbb{1}\!\left\{Y_j \le t\right\} \right|. \end{equation}\]

To show the proposition is true, we make a change of variable. Let

\[x = F(t).\]

Then,

\[t \in \mathbb{R}\implies x \in [0, 1].\]

Since $F$ and $G$ are continuous and monotonically increasing,

\[\begin{align*} X_i \le t &\iff F(X_i) \le F(t) \\ &\iff U_i \le x &\quad&\text{(definition)}. \end{align*}\]

Similarly,

\[\begin{align*} Y_j \le t &\iff G(Y_j) \le G(t) \\ &\iff G(Y_j) \le F(t) &\quad&\text{(under $H_0$)} \\ &\iff V_j \le x &\quad&\text{(definition)}. \end{align*}\]

Substitution of these expressions into $\eqref{3}$ completes the proof. 

The Joint Distribution of the Samples

Proposition 4. If $H_0$ is true, then the $(n+m)$ random variables $U_1, \dots, U_n, V_1, \dots, V_m$ are i.i.d. uniform on $[0, 1]$.

Proof.

\[\begin{align*} \mathbb{P}\left[U_i \le t\right] &= \mathbb{P}\left[F(X_i) \le t\right] \\ &= \mathbb{P}\left[F(X_1) \le t\right] &\quad&\text{(i.i.d.)} \\ &= \mathbb{P}\left[G(X_1) \le t\right] &\quad&\text{(under $H_0$)} \\ &= \mathbb{P}\left[G(Y_1) \le t\right] &\quad&\text{($X_1 \overset{d}{=} Y_1$ under $H_0$)} \\ &= \mathbb{P}\left[V_1 \le t\right] &\quad&\text{(definition)} \\ \end{align*}\]

These probabilities can be rearranged, exactly as in the proof of Proposition 1, to find the common cdf of the $U_i$’s and $V_j$’s:

\[\begin{align*} \mathbb{P}\left[F(X_1) \le t\right] &= \mathbb{P}\left[X_1 \le F^{-1}(t)\right] \\ &= F(F^{-1}(t)) &\quad&\text{(definition of cdf)} \\ &= t \\ \therefore F_U(t) &= G_V(t) = t \\ \implies U_i, V_j &\sim \mathcal{U}\left(\left[ 0, 1 \right]\right) \tag*{◻} \end{align*}\]

The Test Statistic is Pivotal

Proposition 4 shows that, under the null hypothesis $H_0$, the distributions of the $U_i$ and $V_j$ are $\mathcal{U}\left(\left[ 0, 1 \right]\right)$ regardless of the distributions of the underlying samples $X_i$, $Y_j$. Combined with Proposition 3, we conclude that $T_{n,m}$ is pivotal, i.e. its distribution under $H_0$ does not depend on the unknown distributions of the samples.

Quantiles of the Test Statistic

Let $\alpha \in (0, 1)$ and let $q_\alpha$ be the $(1 - \alpha)$-quantile of the distribution of $T_{n,m}$ under $H_0$. Writing $F_T$ for the cdf of $T_{n,m}$ under $H_0$ (not to be confused with $F$, the cdf of the $X_i$’s), the quantile $q_\alpha$ is given by

\[\begin{align*} q_\alpha &= F_T^{-1}(1-\alpha) \\ &= \inf\{x \colon F_T(x) \ge 1 - \alpha\} \\ &\approx \min\{x \colon \widehat{F}_{T,M}(x) \ge 1 - \alpha\}, \quad M < \infty \\ \implies q_\alpha \approx \hat{q}_\alpha &= \min_i \left\{ T_{n,m}^{(i)} \colon \tfrac{i}{M} \ge 1 - \alpha \right\} \end{align*}\]

where $M \in \mathbb{N}$ is large, $T_{n,m}^{(i)}$ is the $i^\text{th}$ value in a sorted sample of $M$ simulated test statistics, and $\widehat{F}_{T,M}$ is the empirical cdf of that sample. Thus, $q_\alpha$ can be approximated by choosing $i = \ceil{M(1 - \alpha)}$. An algorithm to approximate $q_\alpha$ given $\alpha$ is as follows.

Algorithm 2. Approximate $q_\alpha$, the $(1 - \alpha)$-quantile of the distribution of $T_{n,m}$ under $H_0$.

Require: $n = \dim X$, $m = \dim Y$, $M \in \mathbb{N}$, and $\alpha \in (0, 1)$.

Ensure: $q_\alpha \in [0, 1]$.

Procedure KSQuantile($n, m, M, \alpha$)

$T_v \gets$ empty array of size $M$

for all $i \in \{1,\dots,M\}$ do

$X_s \gets$ sample of size $n$ from $\mathcal{N}\left( 0, 1 \right)$.

$Y_s \gets$ sample of size $m$ from $\mathcal{N}\left( 0, 1 \right)$.

$T_v^{(i)} \gets$ KS2Sample($X_s, Y_s$) ▷ defined in Algorithm 1

end for

$T_{vs} \gets$ Sort($T_v$)

$j \gets \ceil{M(1 - \alpha)}$

Return $T_{vs}^{(j)}$

end procedure
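A direct translation of Algorithm 2 into Python might look like the sketch below, which reuses `_ks_2samp` from above. The function name `ks_quantile`, the default `M`, and the seed are my own choices for illustration.

import numpy as np

def ks_quantile(n, m, alpha, M=1000, seed=42):
    """Approximate the (1 - alpha)-quantile of T_{n,m} under H0 by simulation."""
    rng = np.random.default_rng(seed)
    Tv = np.zeros(M)
    for i in range(M):
        # Under H0 the statistic is pivotal, so any continuous distribution
        # will do; draw both samples from a standard normal.
        Xs = rng.normal(size=n)
        Ys = rng.normal(size=m)
        Tv[i] = _ks_2samp(Xs, Ys)[0].max()
    Tvs = np.sort(Tv)
    j = int(np.ceil(M * (1 - alpha)))  # 1-based index from the text
    return Tvs[j - 1]                  # convert to 0-based indexing

# Example: estimate the 95% quantile for samples of size n = 100, m = 70
# q_hat = ks_quantile(100, 70, alpha=0.05)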

A plot of the distribution of

\[\frac{T_{n,m}^M - \overline{T}_{n,m}^M}{\sqrt{\operatorname{Var}\left(T_{n,m}^M\right)}}\]

is shown in Figure 2 in comparison to a standard normal. The peak of the test statistic distribution sits to the left of the standard normal’s, and it has a longer right tail. Since the distribution of $T_{n,m}$ is not easily derived in closed form, we rely on simulation via Algorithm 2 to estimate its quantiles.

Figure 2. Empirical distribution of samples of the test statistic \(T_{n,m}\). We can see from the plot on the left that the peak of the empirical distribution sits to the left of the standard normal’s. The QQ-plot on the right also shows that the empirical distribution has a shorter tail to the left of the mean (blue data points closer to zero than the normal line), and a longer tail to the right of the mean (blue data points farther toward infinity than the normal line).
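For reference, a comparison like Figure 2 can be generated by simulating the test statistic under $H_0$, standardizing, and plotting against a standard normal. The sketch below is my own illustration (it reuses `_ks_2samp` and assumes matplotlib is available), not the exact code used to produce the figure.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Simulate M test statistics under H0, as in Algorithm 2
rng = np.random.default_rng(42)
M, n, m = 1000, 100, 70
Tv = np.array([_ks_2samp(rng.normal(size=n), rng.normal(size=m))[0].max()
               for _ in range(M)])

# Standardize the simulated statistics and compare to N(0, 1)
Z = (Tv - Tv.mean()) / Tv.std()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
ax1.hist(Z, bins=50, density=True, label='standardized $T_{n,m}$')
t = np.linspace(-4, 4, 200)
ax1.plot(t, stats.norm.pdf(t), label=r'$\mathcal{N}(0, 1)$')
ax1.legend()
stats.probplot(Z, dist='norm', plot=ax2)  # QQ-plot against the normal
plt.show()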

The Hypothesis Test

Given the approximation \(\hat{q}_\alpha\) of \(q_\alpha\) from Algorithm 2, we define a test with non-asymptotic level $\alpha$ of $H_0$ vs. $H_1$:

\[\delta_\alpha = \mathbb{1}\!\left\{T_{n,m} > \hat{q}_\alpha^{(n, M)}\right\}\]

where $T_{n,m}$ is found by Algorithm 1. The p-value for this test is

\[\begin{align*} \text{p-value} &\coloneqq \mathbb{P}\left[Z \ge T_{n,m}\right] \\ &\approx \frac{\#\{j = 1, \dots, M \colon T_{n,m}^{(j)} \ge T_{n,m}\}}{M} \end{align*}\]

where $Z$ is a random variable distributed as $T_{n,m}$.
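Putting the pieces together, the decision rule and the Monte Carlo p-value can be computed in a few lines. The sketch below reuses the samples `X` and `Y` from the earlier cross-check example and the simulated null statistics `Tv` from the Figure 2 sketch; it is an illustration, not the exact code used to produce the numbers that follow.

import numpy as np

alpha = 0.05
M = len(Tv)

T_nm = _ks_2samp(X, Y)[0].max()                          # observed statistic
q_hat = np.sort(Tv)[int(np.ceil(M * (1 - alpha))) - 1]   # estimated (1 - alpha)-quantile

reject = T_nm > q_hat            # delta_alpha = 1, i.e. reject H0 at level alpha
p_value = np.mean(Tv >= T_nm)    # Monte Carlo p-value
print(reject, p_value)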

For the example shown in Figure 1 and Figure 2, let $\alpha = 0.05$. The test statistic, its estimated $(1-\alpha)$-quantile, and the p-value are

\[\begin{align*} T_{100, 70} &= 0.2614, \\ \hat{q}_\alpha &= 0.2057, \\ \text{p-value} &= 0.005. \end{align*}\]

Since $T_{100, 70} > \hat{q}_\alpha$, and the p-value $< \alpha$, we reject the null hypothesis at the 5% level.

Conclusions

Well, what have we learned? We’ve shown that the K-S test is distribution-free, since the test statistic does not rely on the underlying cdfs of the samples. This fact means that we can use the K-S test with no prior knowledge of the system that produced the data… which is entirely the point of performing a test! Satisfying. If you know (or assume) some more information about the underlying distributions, you can use the Anderson-Darling Test.

As we’ve seen from the algorithms and code, it is also (relatively) easy to compute the test statistic and to estimate its quantiles. There are some practical limitations to the K-S test, but overall it is a handy tool to keep on your belt.

The entire source code for the figures and algorithms in this post is available here.