\(k\)-Sample Test
In this tutorial, we explore:
The theoretical formulation of the \(k\)-Sample test
The implementation of the \(k\)-Sample test in mgcpy
Theory
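The \(k\)-Sample test is a test for sameness of distributions. For \(k = 2\), the test is written as follows. Suppose we observe two samples:
\[U_1, \ldots, U_n \sim F_U, \qquad V_1, \ldots, V_m \sim F_V.\]
We wish to test:
\[H_0: F_U = F_V \qquad \text{versus} \qquad H_A: F_U \neq F_V.\]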
Note that the random variables \(U\) and \(V\) must be defined over the same space, usually \(\mathbb{R}^p\), for the test to make sense. Additionally, the sample sizes \(n\) and \(m\) can differ, and the samples are unpaired.
The 2-Sample Transform
A 2-Sample test can be written as an independence test with the following transform. Let \(X_i = U_i\) and \(Y_i = 0\) for \(i = 1, ..., n\). Similarly, let \(X_i = V_{i-n}\) and \(Y_i = 1\) for \(i = n+1, ..., n+m\). We now have a sample \(\{(X_i, Y_i)\}_{i=1}^{n+m}\) on which to run an independence test. The intuition is that if the observations \(X_i\) depend on their sample labels \(Y_i\), then \(U\) and \(V\) come from different distributions [1].
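The transform described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: `two_sample_transform` is a hypothetical helper written for this example, not the mgcpy function, and it assumes both samples are 2-D arrays with one observation per row.

```python
import numpy as np

def two_sample_transform(U, V):
    """Hypothetical helper: stack two samples and attach 0/1 sample labels."""
    U = np.asarray(U, dtype=float)
    V = np.asarray(V, dtype=float)
    # Both samples must live in the same space (same number of columns).
    if U.ndim != 2 or V.ndim != 2 or U.shape[1] != V.shape[1]:
        raise ValueError("U and V must be 2-D with the same number of columns")
    n, m = U.shape[0], V.shape[0]
    # X_i = U_i for i <= n, and X_i = V_{i-n} for i > n.
    X = np.concatenate((U, V), axis=0)
    # Y_i = 0 for the first n rows, Y_i = 1 for the remaining m rows.
    Y = np.concatenate((np.zeros(n), np.ones(m))).reshape(-1, 1)
    return X, Y

rng = np.random.default_rng(0)
U = rng.normal(size=(5, 2))  # n = 5 observations in R^2
V = rng.normal(size=(3, 2))  # m = 3 observations in R^2
X, Y = two_sample_transform(U, V)
print(X.shape, Y.shape)  # (8, 2) (8, 1)
```

The sample sizes differ here on purpose: the transform only requires that the two samples share the same dimension, not the same length.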
Generalization to \(k\)-Samples
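The \(k\)-Sample problem is a natural extension. In this scenario, we have for \(k = 1, ..., K\):
\[U^{(k)}_1, \ldots, U^{(k)}_{n_k} \sim F_k.\]
We wish to test:
\[H_0: F_1 = F_2 = \cdots = F_K \qquad \text{versus} \qquad H_A: F_j \neq F_{j'} \text{ for some } j \neq j'.\]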
The \(k\)-Sample transform is computed similarly, by concatenating the individual samples into a single data set of size \(N = \sum_k n_k\), with labels \(Y_i\) taking values in \(\{1, ..., K\}\). The final transformed data set \(\{(X_i, Y_i)\}_{i=1}^N\) can be run through an independence test.
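As a rough sketch of this generalization, the NumPy snippet below stacks an arbitrary number of samples and labels each row with the index of the sample it came from. The function name `k_sample_stack` is hypothetical and chosen for this example only; it is not part of mgcpy, and it assumes each sample is a 2-D array with one observation per row.

```python
import numpy as np

def k_sample_stack(*samples):
    """Hypothetical helper: concatenate K samples and label rows 1, ..., K."""
    arrays = [np.asarray(s, dtype=float) for s in samples]
    if len(arrays) < 2:
        raise ValueError("need at least two samples")
    if any(a.ndim != 2 for a in arrays):
        raise ValueError("each sample must be a 2-D array")
    if len({a.shape[1] for a in arrays}) != 1:
        raise ValueError("all samples must have the same number of columns")
    sizes = [a.shape[0] for a in arrays]  # n_1, ..., n_K
    X = np.concatenate(arrays, axis=0)    # N = sum of n_k rows
    # Repeat label k exactly n_k times, for k = 1, ..., K.
    labels = np.repeat(np.arange(1, len(arrays) + 1), sizes)
    return X, labels.reshape(-1, 1)

rng = np.random.default_rng(0)
samples = [rng.normal(size=(n_k, 2)) for n_k in (4, 3, 2)]
X, Y = k_sample_stack(*samples)
print(X.shape)    # (9, 2)
print(Y.ravel())  # [1 1 1 1 2 2 2 3 3]
```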
Using the \(k\)-Sample Transform
[1]:
import numpy as np
from mgcpy.hypothesis_tests.transforms import k_sample_transform
from mgcpy.benchmarks.simulations import w_sim
Below, we simulate W-shaped data to form one sample, and rotate it to form another sample. We then convert the data into an input for an independence test.
[2]:
n_U = 60
n_V = 40
Q = np.array([[0, -1], [1, 0]])  # 90-degree rotation matrix.
# Simulate 2-dimensional W-shaped data.
u1, u2 = w_sim(num_samp=n_U, num_dim=1, noise=1)
U = np.concatenate((u1, u2), axis=1)
# Rotate U by 90 degrees and keep the first n_V rows as the second sample.
V = np.dot(U, Q)[:n_V, :]
print("The shape of U is:", U.shape)
print("The shape of V is:", V.shape)
The shape of U is: (60, 2)
The shape of V is: (40, 2)
[3]:
X, Y = k_sample_transform(U, V)
print("The shape of X is: ", X.shape)
print("The shape of Y is: ", Y.shape)
The shape of X is: (100, 2)
The shape of Y is: (100, 1)
At this point, many of the independence tests in mgcpy can be used on this data.
[4]:
from mgcpy.independence_tests.dcorr import DCorr
from mgcpy.independence_tests.mgc import MGC
dcorr = DCorr(which_test='biased')
mgc = MGC()
print("The p-value of DCorr for the 2-Sample test is: %.3f" % dcorr.p_value(X, Y)[0])
print("The p-value of MGC for the 2-Sample test is: %.3f" % mgc.p_value(X, Y)[0])
The p-value of DCorr for the 2-Sample test is: 0.001
The p-value of MGC for the 2-Sample test is: 0.001
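To give a sense of what such a p-value measures, here is a minimal, self-contained sketch of a permutation test on transformed data, using a biased sample distance covariance as the test statistic. It is an illustration only and is not how mgcpy computes its p-values; the function names, the default number of permutations, and the simulated Gaussian samples are all assumptions made for this example.

```python
import numpy as np

def centered_distances(Z):
    """Pairwise Euclidean distance matrix of the rows of Z, double-centered."""
    D = np.sqrt(((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1))
    return (D - D.mean(axis=0, keepdims=True)
              - D.mean(axis=1, keepdims=True) + D.mean())

def dcov_permutation_p_value(X, Y, reps=200, seed=0):
    """Hypothetical helper: permutation p-value for independence of X and Y."""
    rng = np.random.default_rng(seed)
    A = centered_distances(np.asarray(X, dtype=float))
    B = centered_distances(np.asarray(Y, dtype=float))
    observed = (A * B).mean()  # biased sample distance covariance (squared)
    exceed = 0
    for _ in range(reps):
        # Shuffling the labels breaks any dependence between X and Y.
        perm = rng.permutation(B.shape[0])
        if (A * B[np.ix_(perm, perm)]).mean() >= observed:
            exceed += 1
    # Add-one correction so the estimate is never exactly zero.
    return (exceed + 1) / (reps + 1)

rng = np.random.default_rng(1)
U = rng.normal(loc=0.0, size=(30, 2))
V = rng.normal(loc=3.0, size=(20, 2))  # shifted mean: a different distribution
X = np.concatenate((U, V), axis=0)
Y = np.concatenate((np.zeros(30), np.ones(20))).reshape(-1, 1)
p = dcov_permutation_p_value(X, Y)
print("Permutation p-value: %.3f" % p)
```

With the add-one correction, the smallest p-value this sketch can report is 1 / (reps + 1), so the resolution of the result depends on the number of permutations.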