Kolmogorov-Smirnov Test for Two Samples
February 3, 2021 · 24 min read

If you’re reading this post, you’ve probably heard of the Kolmogorov-Smirnov test for normality testing, and maybe even of its two-sample version. But how is this result derived? As I worked my way through MIT’s 18.650, this question came up in one of the homework assignments. This post loosely follows that assignment, with some additional commentary and code I wrote to implement the algorithms described herein.
Problem Statement
Consider two independent samples $X_1, \dots, X_n$, and $Y_1, \dots, Y_m$ of independent, real-valued, continuous random variables, and assume that the $X_i$’s are i.i.d. with some cdf $F$ and that the $Y_i$’s are i.i.d. with some cdf $G$. Note that the two samples may have different sizes (if $n \ne m$). We want to test whether $F = G$. Consider the following hypotheses:
\[\begin{align*} H_0 \colon ``F = G" \\ H_1 \colon ``F \ne G" \end{align*}\]For simplicity, we will assume that $F$ and $G$ are continuous and increasing.
Example Experiment
One setting where it is of interest to test whether two samples come from the same distribution is a lab experiment with two measurement devices, where we wish to determine whether their measurement errors have the same distribution before proceeding with our analysis.
CDF Distributions
Let
\[\begin{align*} U_i &= F(X_i), \quad \forall i = 1, \dots, n, \\ V_j &= G(Y_j), \quad \forall j = 1, \dots, m. \end{align*}\]Proposition 1. The cdf of a continuous random variable, evaluated at that random variable, is uniformly distributed on $[0, 1]$.
Proof. The distributions of $U_i$ and $V_j$ can be determined by finding their cdfs. The cdf of $U_i$ is defined by $F_U(t) \coloneqq \mathbb{P}\left[U_i \le t\right]$. Since $F$ and $G$ are continuous and increasing, they are invertible, and it follows that
\[\begin{align*} \mathbb{P}\left[U_i \le t\right] &= \mathbb{P}\left[F(X_i) \le t\right] &\quad&\text{(definition of $U_i$)} \\ &= \mathbb{P}\left[X_i \le F^{-1}(t)\right] \\ &= F(F^{-1}(t)) &\quad&\text{(definition of cdf)} \\ &= t \\ \therefore F_U(t) &= t \quad \forall t \in [0, 1] \\ \implies U_i &\sim \mathcal{U}\left(\left[ 0, 1 \right]\right) \tag*{◻} \end{align*}\]Likewise, $V_j \sim \mathcal{U}\left(\left[ 0, 1 \right]\right)$.
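Proposition 1 is easy to check numerically. A minimal simulation sketch, assuming NumPy and SciPy are available (the seed, sample size, and choice of a normal distribution are arbitrary): the deciles of $F(X)$ should land near $0.1, 0.2, \dots, 0.9$.

import numpy as np
from scipy import stats

rng = np.random.default_rng(565656)

# Draw a sample from an arbitrary continuous distribution...
x = rng.normal(loc=2.0, scale=3.0, size=100_000)

# ...and push the sample through its own cdf F.
u = stats.norm.cdf(x, loc=2.0, scale=3.0)

# If Proposition 1 holds, the deciles of F(X) should be close to 0.1, ..., 0.9.
print(np.quantile(u, np.linspace(0.1, 0.9, 9)).round(3))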
Empirical CDFs
Let $F_n$ be the empirical cdf of ${X_1, \dots, X_n}$ and $G_m$ be the empirical cdf of ${Y_1, \dots, Y_m}$.
The Test Statistic
Let the test statistic be
\[T_{n,m} = \sup_{t \in \mathbb{R}} \left| F_n(t) - G_m(t) \right|.\]Proposition 2. The test statistic $T_{n,m}$ can be written as the maximum value of a finite set of numbers.
Proof. The indicator function is defined as \(\mathbb{1}\!\left\{X \le t\right\} = \begin{cases} 1 & \text{if } X \le t \\ 0 & \text{otherwise} \end{cases}.\) By definition, the cdf
\[\begin{align*} F(t) &= \mathbb{P}\left[X \le t\right] \quad \forall t \in \mathbb{R}\\ &= \mathbb{E}\left[\mathbb{1}\!\left\{X \le t\right\}\right].\\ \end{align*}\]By the Law of Large Numbers, the expectation can be approximated by the sample average, so we can define the empirical cdf as
\[\begin{align} F_n(t) &= \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\!\left\{X_i \le t\right\} \label{1}\tag{1} \end{align}\]Likewise,
\[\begin{align} G_m(t) &= \frac{1}{m}\sum_{j=1}^{m} \mathbb{1}\!\left\{Y_j \le t\right\}. \label{2}\tag{2} \end{align}\] \[\therefore T_{n,m} = \sup_{t \in \mathbb{R}} \left| \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\!\left\{X_i \le t\right\} - \frac{1}{m}\sum_{j=1}^{m} \mathbb{1}\!\left\{Y_j \le t\right\} \right|.\]The empirical cdfs $\eqref{1}$ and $\eqref{2}$ can also be written
\[\begin{align*} F_n(t) &= \#\{i=1, \dots, n \colon X_i \le t\} \cdot \frac{1}{n} \\ G_m(t) &= \#\{j=1, \dots, m \colon Y_j \le t\} \cdot \frac{1}{m}, \end{align*}\]so the only values that the empirical cdfs can take are the discrete sets
\[\begin{align*} F_n &\in \left\{ \tfrac{i}{n} \colon i = 0, 1, \dots, n \right\} \\ G_m &\in \left\{ \tfrac{j}{m} \colon j = 0, 1, \dots, m \right\}. \end{align*}\]Therefore, the test statistic can be rewritten as the maximum value of a finite set of numbers:
\[\begin{split} T_{n,m} = \max_{i=0,\dots,n} \Bigg[ &\max_{j=0,\dots,m} \left| \frac{i}{n} - \frac{j}{m} \right| \mathbb{1}\!\left\{Y^{(j)} \le X^{(i)} < Y^{(j+1)}\right\}, \\ &\max_{k=j+1, \dots, m} \left| \frac{i}{n} - \frac{k}{m} \right| \mathbb{1}\!\left\{Y^{(k)} \le X^{(i+1)}\right\} \Bigg] \end{split}\]where $X^{(i)}$ is the $i^\text{th}$ value in the ordered set of data $X^{(1)} \le \cdots \le X^{(n)}$. The value $X^{(0)} \coloneqq -\infty$ is prepended to the otherwise finite realizations to simplify the computation. ◻
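Since both empirical cdfs are step functions that only jump at observed data points, the supremum is attained at one of the pooled sample values. Before turning to a more efficient algorithm, here is a brute-force NumPy sketch of this fact (the helper name ks_stat_brute is my own, hypothetical, and not part of the algorithms below):

import numpy as np

def ks_stat_brute(X, Y):
    """Brute-force T_{n,m}: evaluate both empirical cdfs at every pooled
    data point and take the largest absolute difference."""
    X, Y = np.asarray(X), np.asarray(Y)
    pooled = np.sort(np.concatenate([X, Y]))
    # searchsorted(..., side='right') counts how many sample points are <= t,
    # which is exactly n * F_n(t) (resp. m * G_m(t)).
    Fn = np.searchsorted(np.sort(X), pooled, side='right') / len(X)
    Gm = np.searchsorted(np.sort(Y), pooled, side='right') / len(Y)
    return np.max(np.abs(Fn - Gm))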
The following algorithm calculates the KS test statistic for two given samples.
Require: $X, Y$ are vectors of real numbers.
Ensure: $0 \le T_{n,m} \le 1$.
Procedure KS2Sample($X, Y$)
$X_s \gets \{-\infty,$ Sort($X$)$\}$
$Y_s \gets$ Sort($Y$)
$n \gets \dim X$
$m \gets \dim Y$
$T_v \gets$ empty array of size $n+1$
$j \gets 0$
for all $i \in \{0, \dots, n\}$ do
$j \gets j$ + Rank($\{Y_s^{(\ell)}\}_{\ell=j}^m, X_s^{(i)}$) ▷ Only search the remaining values of $Y_s$
$k \gets j$ + Rank($\{Y_s^{(\ell)}\}_{\ell=j}^m, X_s^{(\min(i+1, n))}$)
$\displaystyle{T_v^{(i)} \gets \max\left(\left|\frac{i}{n} - \frac{j}{m}\right|, \left|\frac{i}{n} - \frac{k}{m}\right|\right)}$
end for
Return $\max_i T_v$
end procedure
Function Rank($A, k$)
Assert $A$ is sorted in ascending order.
Return $\#\{i=1,\dots,\dim A \colon A_i < k\}$
end function
The following subroutine is an implementation of Algorithm 1. It computes an array of values $T_v^{(i)}$, one for each value of $X^{(i)}$ (including the padded point at $-\infty$), along with the indices into the sorted $Y$ values at which each difference is attained. The test statistic $T_{n,m}$ is the maximum of the $T_v$ values.
The entire source code for the figures and algorithms in this post is available here.
import numpy as np


def _ks_2samp(X, Y):
    """Compute the Kolmogorov-Smirnov statistic on 2 samples.

    Parameters
    ----------
    X, Y : (N,), (M,) array_like
        Two arrays of sample observations assumed to be drawn from a continuous
        distribution, sample sizes can differ.

    Returns
    -------
    Tv : (N+1,) ndarray
        Maximum difference in CDFs for each value of X.
    js : (N+1,) ndarray
        Index into the sorted Y values at which each difference is attained.

    .. note:: The KS statistic itself is the maximum of these `Tv` values;
       the full array is returned to aid debugging.
    """
    n = len(X)
    m = len(Y)
    # Sort copies of the data
    Xs = np.hstack([-np.inf, np.sort(X)])  # pad extra point
    Ys = np.sort(Y)
    # Calculate the maximum difference in the empirical CDFs
    Tv = np.zeros(n+1)  # extra value at Fn = 0 (Xs -> -infty)
    js = np.zeros(n+1, dtype=int)
    j = 0
    for i in range(n+1):
        # Find greatest Ys point j s.t. Ys[j] <= Xs[i] and Xs[i] < Ys[j+1]
        j += _rank(Ys[j:], Xs[i])  # only search remaining values
        test_lo = np.abs(i/n - j/m)
        j_lo = j
        # Find next greatest Ys point k s.t. Ys[k] < Xs[i+1]
        k = _rank(Ys[j:], Xs[min(i+1, n)]) + j
        test_hi = np.abs(i/n - k/m)
        j_hi = k
        # Take the maximum distance, and corresponding index
        Tv[i] = np.max((test_lo, test_hi))
        js[i] = j_lo if np.argmax((test_lo, test_hi)) == 0 else j_hi
    return Tv, js
def _rank(A, k):
    """Return the number of keys in `A` strictly less than `k`.

    Parameters
    ----------
    A : (M,) array_like
        Array of values, must be sorted in ascending order.
    k : comparable
        Key for which to search in `A`. Must be comparable to values in `A`.
        `k` need not be in the array.

    Returns
    -------
    result : int
        Number of keys in `A` strictly less than `k`. 0 if `k` is less than all
        elements of `A`. `len(A)` if `k` is greater than all elements of `A`.
    """
    assert all(A == sorted(A))
    lo = 0
    hi = len(A) - 1
    while lo <= hi:
        mid = (hi + lo) // 2
        if k < A[mid]:
            hi = mid - 1
        elif k > A[mid]:
            lo = mid + 1
        else:  # k == A[mid]
            return mid
    return lo
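A minimal usage sketch of `_ks_2samp` (the seed, sizes, and distributions are arbitrary choices; `scipy.stats.ks_2samp` is used only as an independent cross-check of the statistic):

import numpy as np
from scipy import stats

rng = np.random.default_rng(565656)
X = rng.normal(0, 1, size=100)
Y = rng.normal(0, 2, size=70)

Tv, js = _ks_2samp(X, Y)
T = Tv.max()

print(f"T_{{n,m}} = {T:.4f}")
print(f"scipy    = {stats.ks_2samp(X, Y).statistic:.4f}")  # should agree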
An example two-sample KS-test is shown in Figure 1.
The Null Hypothesis
Proposition 3. If $H_0$ is true, then the test statistic
\[T_{n,m} = \sup_{0 \le x \le 1} \left| \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\!\left\{U_i \le x\right\} - \frac{1}{m}\sum_{j=1}^{m} \mathbb{1}\!\left\{V_j \le x\right\} \right|.\]Proof. By $\eqref{1}$ and $\eqref{2}$,
\[\begin{equation} \label{3}\tag{3} T_{n,m} = \sup_{t \in \mathbb{R}} \left| \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\!\left\{X_i \le t\right\} - \frac{1}{m}\sum_{j=1}^{m} \mathbb{1}\!\left\{Y_j \le t\right\} \right|. \end{equation}\]To show the proposition is true, we make a change of variable. Let
\[x = F(t).\]Then,
\[t \in \mathbb{R}\implies x \in [0, 1].\]Since $F$ and $G$ are continuous and monotonically increasing,
\[\begin{align*} X_i \le t &\iff F(X_i) \le F(t) \\ &\iff U_i \le x &\quad&\text{(definition)}. \end{align*}\]Similarly,
\[\begin{align*} Y_j \le t &\iff G(Y_j) \le G(t) \\ &\iff G(Y_j) \le F(t) &\quad&\text{(under $H_0$)} \\ &\iff V_j \le x &\quad&\text{(definition)}. \end{align*}\]Substitution of these expressions into $\eqref{3}$ completes the proof. ◻
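The change of variables can also be checked numerically: applying the common cdf $F = G$ to both samples leaves the value of the statistic unchanged, since the statistic depends only on the ordering of the data. A sketch, reusing the hypothetical ks_stat_brute helper from above, with $F = G$ the standard normal cdf:

import numpy as np
from scipy import stats

rng = np.random.default_rng(565656)
X = rng.normal(size=100)   # F = G = standard normal cdf
Y = rng.normal(size=70)

T_raw = ks_stat_brute(X, Y)                                    # statistic on X, Y
T_unif = ks_stat_brute(stats.norm.cdf(X), stats.norm.cdf(Y))   # statistic on U, V

assert np.isclose(T_raw, T_unif)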
The Joint Distribution of the Samples
Proposition 4. If $H_0$ is true, then the $n + m$ random variables $U_1, \dots, U_n, V_1, \dots, V_m$ are i.i.d. with uniform distribution on $[0, 1]$.
Proof.
\[\begin{align*} \mathbb{P}\left[U_i \le t\right] &= \mathbb{P}\left[F(X_i) \le t\right] &\quad&\text{(definition)} \\ &= \mathbb{P}\left[F(X_1) \le t\right] &\quad&\text{($X_i$ i.i.d.)} \\ &= \mathbb{P}\left[G(X_1) \le t\right] &\quad&\text{(under $H_0$, $F = G$)} \\ &= \mathbb{P}\left[G(Y_1) \le t\right] &\quad&\text{($X_1 \overset{d}{=} Y_1$ under $H_0$)} \\ &= \mathbb{P}\left[V_1 \le t\right] &\quad&\text{(definition)} \\ \end{align*}\]These probabilities can be rearranged to find the cdfs of the $U_i$'s and $V_j$'s:
\[\begin{align*} \mathbb{P}\left[F(X_1) \le t\right] &= \mathbb{P}\left[X_1 \le F^{-1}(t)\right] \\ &= F(F^{-1}(t)) &\quad&\text{(definition of cdf)} \\ &= t \\ \therefore F_U(t) &= G_V(t) = t \\ \implies U_i, V_j &\sim \mathcal{U}\left(\left[ 0, 1 \right]\right). \end{align*}\]Since the $X_i$'s and $Y_j$'s are mutually independent, so are the $U_i$'s and $V_j$'s, and the joint distribution is therefore that of $n + m$ i.i.d. $\mathcal{U}\left(\left[ 0, 1 \right]\right)$ random variables. ◻
The Test Statistic is Pivotal
Since Proposition 4 holds under the null hypothesis $H_0$, and the distributions of $U_i$ and $V_j$ are $\mathcal{U}\left(\left[ 0, 1 \right]\right)$ regardless of the underlying distributions of the samples $X_i$, $Y_j$, we conclude from Proposition 3 that $T_{n,m}$ is pivotal, i.e. its distribution under $H_0$ does not depend on the unknown distributions of the samples.
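As a sanity check on this distribution-freeness, a short simulation (reusing the hypothetical ks_stat_brute helper sketched earlier; the normal and exponential choices, sample sizes, and number of replications are arbitrary) shows that the estimated null quantiles barely depend on the common distribution of the samples:

import numpy as np

rng = np.random.default_rng(565656)
n, m, M = 100, 70, 2000

# Simulate the null distribution of T under two different common distributions.
T_norm = [ks_stat_brute(rng.normal(size=n), rng.normal(size=m)) for _ in range(M)]
T_expo = [ks_stat_brute(rng.exponential(size=n), rng.exponential(size=m)) for _ in range(M)]

# The estimated 95th percentiles should be close to each other.
print(np.quantile(T_norm, 0.95), np.quantile(T_expo, 0.95))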
Quantiles of the Test Statistic
Let $\alpha \in (0, 1)$ and $q_\alpha$ be the $(1 - \alpha)$-quantile of the distribution of $T_{n,m}$ under $H_0$. The quantile $q_\alpha$ is given by
\[\begin{align*} q_\alpha &= F_T^{-1}(1-\alpha) \\ &= \inf\{x \colon F_T(x) \ge 1 - \alpha\} \\ &\approx \min\{x \colon F_{T,M}(x) \ge 1 - \alpha\}, \quad M < \infty \\ \implies q_\alpha \approx \hat{q}_\alpha &= \min_i \left\{ T_{n,m}^{(i)} \colon \tfrac{i}{M} \ge 1 - \alpha \right\} \end{align*}\]where $F_T$ is the cdf of $T_{n,m}$ under $H_0$, $F_{T,M}$ is its empirical counterpart, $M \in \mathbb{N}$ is large, and $T_{n,m}^{(i)}$ is the $i^\text{th}$ value in a sorted sample of $M$ simulated test statistics. Thus, $q_\alpha$ can be approximated by choosing $i = \lceil M(1 - \alpha) \rceil$. An algorithm to approximate $q_\alpha$ given $\alpha$ is as follows.
Require: $n = \dim X$, $m = \dim Y$, $M \in \mathbb{N}$, and $\alpha \in (0, 1)$.
Ensure: $q_\alpha \in [0, 1]$.
Procedure KSQuantile($n, m, M, \alpha$)
$T_v \gets$ empty array of size $M$
for all $i \in \{1,\dots,M\}$ do
$X_s \gets$ sample of size $n$ from $\mathcal{N}\left( 0, 1 \right)$.
$Y_s \gets$ sample of size $m$ from $\mathcal{N}\left( 0, 1 \right)$.
$T_v^{(i)} \gets$ KS2Sample($X_s, Y_s$) ▷ defined in Algorithm 1
end for
$T_{vs} \gets$ Sort($T_v$)
$j \gets \lceil M(1 - \alpha) \rceil$
Return $T_{vs}^{(j)}$
end procedure
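Below is a Python sketch of Algorithm 2 in the same style as the `_ks_2samp` helper above. The name `ks_quantile` and the `rng` argument are my own additions, and, as a small departure from the pseudocode, it also returns the sorted simulated statistics so they can be reused for the p-value estimate below; drawing the simulated samples from $\mathcal{N}(0, 1)$ is an arbitrary choice, since the statistic is pivotal.

import numpy as np

def ks_quantile(n, m, M, alpha, rng=None):
    """Estimate the (1 - alpha)-quantile of T_{n,m} under H_0 by simulation."""
    rng = np.random.default_rng() if rng is None else rng
    Tv = np.empty(M)
    for i in range(M):
        Xs = rng.normal(size=n)   # any continuous distribution works (pivotal)
        Ys = rng.normal(size=m)
        Tv[i] = _ks_2samp(Xs, Ys)[0].max()
    Tvs = np.sort(Tv)
    j = int(np.ceil(M * (1 - alpha)))   # order statistic of the (1 - alpha) quantile
    return Tvs[j - 1], Tvs              # also return the sorted statistics for the p-value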
A plot of the distribution of
\[\frac{T_{n,m}^M - \overline{T}_{n,m}^M}{\sqrt{\operatorname{Var}\left(T_{n,m}^M\right)}}\]is shown in Figure 2 in comparison to a standard normal. The distribution of the test statistic is skewed: the bulk of its mass lies to the left, with a longer right tail than the standard normal. Since the finite-sample distribution of the test statistic is not easily derived in closed form, we rely on simulation via Algorithm 2 to estimate the quantiles.
The Hypothesis Test
Given the approximation \(\hat{q}_\alpha\) of \(q_\alpha\) from Algorithm 2, we define a test with non-asymptotic level $\alpha$ for $H_0$ vs. $H_1$:
\[\delta_\alpha = \mathbb{1}\!\left\{T_{n,m} > \hat{q}_\alpha^{(n, M)}\right\}\]where $T_{n,m}$ is found by Algorithm 1. The p-value for this test is
\[\begin{align*} \text{p-value} &\coloneqq \mathbb{P}\left[Z \ge T_{n,m}\right] \\ &\approx \frac{\#\{j = 1, \dots, M \colon T_{n,m}^{(j)} \ge T_{n,m}\}}{M} \end{align*}\]where $Z$ is a random variable distributed as $T_{n,m}$ under $H_0$, and the $T_{n,m}^{(j)}$ are the simulated test statistics from Algorithm 2.
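Putting the pieces together, here is a sketch of the full testing procedure using the hypothetical helpers above (the samples, seed, $M$, and $\alpha$ below are placeholder choices, not the data behind Figures 1 and 2):

import numpy as np

rng = np.random.default_rng(565656)
alpha = 0.05

# Two samples that genuinely differ (in scale).
X = rng.normal(0, 1, size=100)
Y = rng.normal(0, 2, size=70)

T = _ks_2samp(X, Y)[0].max()
q_hat, Tvs = ks_quantile(len(X), len(Y), M=1000, alpha=alpha, rng=rng)

reject = T > q_hat               # the test delta_alpha
p_value = np.mean(Tvs >= T)      # fraction of simulated statistics >= T

print(f"T = {T:.4f}, q_hat = {q_hat:.4f}, p-value = {p_value:.3f}, reject H0: {reject}")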
For the example shown in Figure 1 and Figure 2, let $\alpha = 0.05$. The test statistic, its estimated $(1-\alpha)$-quantile, and the p-value are
\[\begin{align*} T_{100, 70} &= 0.2614, \\ \hat{q}_\alpha &= 0.2057, \\ \text{p-value} &= 0.005. \end{align*}\]Since $T_{100, 70} > \hat{q}_\alpha$, and the p-value $< \alpha$, we reject the null hypothesis at the 5% level.
Conclusions
Well, what have we learned? We’ve shown that the K-S test is distribution-free, since the distribution of the test statistic under $H_0$ does not rely on the underlying cdfs of the samples. This fact means that we can use the K-S test with no prior knowledge of the system that produced the data… which is entirely the point of performing a test! Satisfying. If you know (or assume) more about the underlying distributions, you can reach for an alternative such as the Anderson-Darling test.
As we’ve seen from the algorithms and code, it is also (relatively) easy to compute the test statistic and to estimate its quantiles. There are some limitations to the K-S test in practice, but overall it is a handy tool to hang on your belt.