{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# $k$-Sample Test"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this tutorial, we explore\n",
    "\n",
    "- The theoretical formulation of the $k$-Sample test\n",
    "- The implementation of the $k$-Sample test in `mgcpy`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Theory"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The $k$-Sample test asks whether $k$ samples were drawn from the same distribution. For $k = 2$, the setup is as follows.\n",
    "\n",
    "$$\\begin{align*}\n",
    "    U_1, ..., U_n &\\sim F_U \\text{ i.i.d.}\\\\\n",
    "    V_1, ..., V_m &\\sim F_V \\text{ i.i.d.}\\\\\n",
    "\\end{align*}$$\n",
    "\n",
    "We wish to test:\n",
    "\n",
    "$$\\begin{align*}\n",
    "    H_0: F_U &= F_V\\\\\n",
    "    H_A: F_U &\\neq F_V\n",
    "\\end{align*}$$\n",
    "\n",
    "Note that the random variables $U$ and $V$ must be defined over the same space, usually $\\mathbb{R}^p$, for the test to make sense. Additionally, the sample sizes $n$ and $m$ can differ, and the samples are unpaired."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### The 2-Sample Transform\n",
    "A 2-Sample test can be written as an independence test with the following transform. Let $X_i = U_i$ and $Y_i = 0$ for $i = 1, ..., n$. Similarly, let $X_i = V_{i-n}$ and $Y_i = 1$ for $i = n+1, ..., n+m$. We now have a sample $\\{(X_i, Y_i)\\}_{i=1}^{n+m}$ on which to run an independence test. The intuition is that if the observations depend on their sample label, then $U$ and $V$ come from different distributions [[1]](https://arxiv.org/abs/1806.05514)."
   ]
  },
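  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a small worked illustration of the transform above (the stacked-matrix notation is our own shorthand, assuming each observation is a row vector in $\\mathbb{R}^p$), the transformed inputs can be written as:\n",
    "\n",
    "$$X = \\begin{bmatrix} U_1 \\\\ \\vdots \\\\ U_n \\\\ V_1 \\\\ \\vdots \\\\ V_m \\end{bmatrix} \\in \\mathbb{R}^{(n+m) \\times p}, \\qquad Y = \\begin{bmatrix} 0 \\\\ \\vdots \\\\ 0 \\\\ 1 \\\\ \\vdots \\\\ 1 \\end{bmatrix} \\in \\{0, 1\\}^{n+m}$$\n",
    "\n",
    "Here the first $n$ entries of $Y$ are $0$ and the last $m$ are $1$, so $X$ and $Y$ always have the same number of rows even when $n \\neq m$."
   ]
  },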
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Generalization to $k$-Samples\n",
    "The $k$-Sample problem is a natural extension. In this scenario, we have $K$ samples, where for each $k = 1, ..., K$:\n",
    "$$U^{(k)}_1, ..., U^{(k)}_{n_k} \\sim F_{U^{(k)}} \\text{ i.i.d.}$$\n",
    "\n",
    "We wish to test:\n",
    "$$\\begin{align*}\n",
    "    H_0: F_{U^{(k)}} &= F_{U^{(j)}} \\text{ for all } j \\neq k\\\\\n",
    "    H_A: F_{U^{(k)}} &\\neq F_{U^{(j)}} \\text{ for some } j \\neq k\n",
    "\\end{align*}$$\n",
    "\n",
    "The $k$-Sample transform is computed similarly, by concatenating the individual samples into a data set of size $N = \\sum_k n_k$, with labels $Y_i$ taking values in $\\{1, ..., K\\}$ that indicate which sample each observation came from. The final transformed dataset $\\{(X_i, Y_i)\\}_{i=1}^N$ can then be run through an independence test."
   ]
  },
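  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For instance (a hypothetical toy case, not taken from the simulations in this notebook), with $K = 3$ samples of sizes $n_1 = 2$, $n_2 = 3$, and $n_3 = 1$, the pooled data set has $N = 6$ observations, ordered and labeled as:\n",
    "\n",
    "$$\\begin{align*}\n",
    "    (X_1, ..., X_6) &= \\left(U^{(1)}_1, U^{(1)}_2, U^{(2)}_1, U^{(2)}_2, U^{(2)}_3, U^{(3)}_1\\right)\\\\\n",
    "    (Y_1, ..., Y_6) &= (1, 1, 2, 2, 2, 3)\n",
    "\\end{align*}$$\n",
    "\n",
    "Each label only records which sample an observation came from, so the sample sizes $n_k$ do not need to match."
   ]
  },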
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using the $k$-Sample Transform"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from mgcpy.hypothesis_tests.transforms import k_sample_transform\n",
    "from mgcpy.benchmarks.simulations import w_sim"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Below, we simulate W-shaped data to form one sample, and rotate a subset of it by 90 degrees to form a second sample of a different size. We then use `k_sample_transform` to convert the two samples into an input for an independence test."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The shape of U is: (60, 2)\n",
      "The shape of V is: (40, 2)\n"
     ]
    }
   ],
   "source": [
    "n_U = 60\n",
    "n_V = 40\n",
    "Q = np.array([[0, -1], [1, 0]])  # 90-degree rotation matrix.\n",
    "\n",
    "# Simulate 2-dimensional W-shaped data.\n",
    "u1, u2 = w_sim(num_samp=n_U, num_dim=1, noise=1)\n",
    "U = np.concatenate((u1, u2), axis=1)\n",
    "# Rotate U by 90 degrees and keep the first n_V rows as the second sample.\n",
    "V = np.dot(U, Q)[:n_V, :]\n",
    "print(\"The shape of U is:\", U.shape)\n",
    "print(\"The shape of V is:\", V.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The shape of X is:  (100, 2)\n",
      "The shape of Y is:  (100, 1)\n"
     ]
    }
   ],
   "source": [
    "X, Y = k_sample_transform(U, V)\n",
    "print(\"The shape of X is: \", X.shape)\n",
    "print(\"The shape of Y is: \", Y.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
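    "At this point, many of the independence tests in `mgcpy` can be applied to the transformed data. A small p-value is evidence that the two samples come from different distributions. Below, we use `DCorr` and `MGC`."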
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The p-value of DCorr for the 2-Sample test is: 0.001\n",
      "The p-value of MGC for the 2-Sample test is: 0.001\n"
     ]
    }
   ],
   "source": [
    "from mgcpy.independence_tests.dcorr import DCorr\n",
    "from mgcpy.independence_tests.mgc import MGC\n",
    "\n",
    "dcorr = DCorr(which_test='biased')\n",
    "mgc = MGC()\n",
    "\n",
    "print(\"The p-value of DCorr for the 2-Sample test is: %.3f\" % dcorr.p_value(X, Y)[0])\n",
    "print(\"The p-value of MGC for the 2-Sample test is: %.3f\" % mgc.p_value(X, Y)[0])"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}