{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Kendall's $\\tau$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this tutorial, we explore\n", "\n", "- The theory behind the Kendall test statistic and p-value\n", "- The features of the implementation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Theory" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following description is adapted from [[1]](https://arxiv.org/abs/1907.02088):\n", "\n", "To formulate Kendall [[2]](https://academic.oup.com/biomet/article-abstract/30/1-2/81/176907?redirectedFrom=fulltext), define a pair of observations $(x_i, y_i)$ and $(x_j, y_j)$ as concordant if the ranks agree: either $x_i > x_j$ and $y_i > y_j$, or $x_i < x_j$ and $y_i < y_j$. The pair is discordant if the ranks disagree: either $x_i > x_j$ and $y_i < y_j$, or $x_i < x_j$ and $y_i > y_j$. If $x_i = x_j$ or $y_i = y_j$, the pair is said to be tied and is neither concordant nor discordant. Let $n_c$ and $n_d$ be the numbers of concordant and discordant pairs, respectively, and let $n_0 = n (n - 1) / 2$ be the total number of pairs among $n$ samples. In the case of no ties, the test statistic is defined as\n", "\n", "$$\\mathrm{Kendall}_n = \\frac{n_c - n_d}{n_0}.$$\n", "\n", "To handle ties, further define\n", "\n", "$$n_1 = \\sum_i \\frac{t_i (t_i - 1)}{2},$$\n", "$$n_2 = \\sum_j \\frac{u_j (u_j - 1)}{2},$$\n", "\n", "where $t_i$ is the number of tied values in the $i$th group of ties in the first quantity and $u_j$ is the number of tied values in the $j$th group of ties in the second quantity.\n", "\n", "In the case of ties, the statistic is calculated as in [[3]](https://onlinelibrary.wiley.com/doi/book/10.1002/9780470594001)\n", "\n", "$$\\mathrm{Kendall}_n = \\frac{n_c - n_d}{\\sqrt{(n_0 - n_1) (n_0 - n_2)}}.$$\n", "\n", "This implementation wraps `scipy.stats.kendalltau` [[4]](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kendalltau.html) to conform to the `mgcpy` API." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using Kendall's $\\tau$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before making any function calls, let's import some useful functions and set the random seed so that the examples are reproducible:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import numpy as np\n", "import matplotlib.pyplot as plt; plt.style.use('classic')\n", "import seaborn as sns; sns.set(style=\"white\")\n", "\n", "from mgcpy.independence_tests.kendall_spearman import KendallSpearman\n", "from mgcpy.benchmarks import simulations as sims\n", "\n", "# Fix the random seed so the simulated data are reproducible\n", "np.random.seed(12345678)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To start, let's simulate some linear data:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "x, y = sims.linear_sim(num_samp=100, num_dim=1, noise=0.1)\n", "\n", "fig = plt.figure(figsize=(8,8))\n", "fig.suptitle(\"Linear Simulation\", fontsize=17)\n", "ax = sns.scatterplot(x=x[:,0], y=y[:,0])\n", "ax.set_xlabel('Simulated X', fontsize=15)\n", "ax.set_ylabel('Simulated Y', fontsize=15) \n", "plt.axis('equal')\n", "plt.xticks(fontsize=15)\n", "plt.yticks(fontsize=15)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The test statistic and p-value can be called by creating the `KendallSpearman` object and simply calling the corresponding test statistic and p-value methods. When creating the object, it is necessary to define the `which_test` parameter so that the correct test is run (Kendall in this case)." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Kendall test statistic: 0.8905050505050507\n", "P Value: 2.2897821932369628e-39\n" ] } ], "source": [ "kendall = KendallSpearman(which_test=\"kendall\")\n", "kendall_statistic, independence_test_metadata = kendall.test_statistic(x, y)\n", "p_value, _ = kendall.p_value(x, y)\n", "\n", "print(\"Kendall test statistic:\", kendall_statistic)\n", "print(\"P Value:\", p_value)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note that Kendall only operates on univariate data.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Spearman's $\\rho$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this tutorial, we explore\n", "\n", "- The theory behind the Spearman test statistic and p-value\n", "- The features of the implementation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Theory" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Spearman and Kendall are closely related because they both operate on univariate ranked data. 
The following description is adapted from [[1]](https://arxiv.org/abs/1907.02088):\n", "\n", "Spearman can be thought of as Pearson's product-moment correlation applied to ranked data [[5]](https://www.jstor.org/stable/1412159?origin=crossref&seq=1#metadata_info_tab_contents). Suppose that $\\mathrm{rg}_{x_i}$ and $\\mathrm{rg}_{y_i}$ are the respective ranks of $n$ raw scores $x_i$ and $y_i$, $\\rho$ denotes Pearson's correlation coefficient applied to the rank variables, $\\hat{\\mathrm{cov}} (\\mathrm{rg}_{\\mathbf{x}}, \\mathrm{rg}_{\\mathbf{y}})$ denotes the covariance of the rank variables, and $\\hat{\\sigma}_{\\mathrm{rg}_{\\mathbf{x}}}$ and $\\hat{\\sigma}_{\\mathrm{rg}_{\\mathbf{y}}}$ denote the standard deviations of the rank variables. The statistic is\n", "\n", "$$\\mathrm{Spearman}_s = \\rho_{\\mathrm{rg}_{\\mathbf{x}}, \\mathrm{rg}_{\\mathbf{y}}} = \\frac{\\hat{\\mathrm{cov}} (\\mathrm{rg}_{\\mathbf{x}}, \\mathrm{rg}_{\\mathbf{y}})}{\\hat{\\sigma}_{\\mathrm{rg}_{\\mathbf{x}}} \\hat{\\sigma}_{\\mathrm{rg}_{\\mathbf{y}}}}.$$\n", "\n", "This implementation wraps `scipy.stats.spearmanr` [[4]](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html) to conform to the `mgcpy` API." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using Spearman's $\\rho$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The test statistic and p-value can be computed by creating a `KendallSpearman` object and calling its `test_statistic` and `p_value` methods. When creating the object, set the `which_test` parameter so that the correct test is run (Spearman in this case). 
Using the same linear relationship as before:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Spearman test statistic: 0.982214221422142\n", "P Value: 5.3309467589776e-73\n" ] } ], "source": [ "spearman = KendallSpearman(which_test=\"spearman\")\n", "spearman_statistic, independence_test_metadata = spearman.test_statistic(x, y)\n", "p_value, _ = spearman.p_value(x, y)\n", "\n", "print(\"Spearman test statistic:\", spearman_statistic)\n", "print(\"P Value:\", p_value)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note that Spearman only operates on univariate data.**" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }