This is one of the 100+ free recipes of the IPython Cookbook, Second Edition, by Cyrille Rossant, a guide to numerical computing and data science in the Jupyter Notebook. The ebook and printed book are available for purchase at Packt Publishing.
▶ Text on GitHub with a CC-BY-NC-ND license
▶ Code on GitHub with a MIT license
Chapter 7: Statistical Data Analysis
Statistical hypothesis testing allows us to make decisions in the presence of incomplete data. By definition, these decisions are uncertain. Statisticians have developed rigorous methods to evaluate this risk. Nevertheless, some subjectivity is always involved in the decision-making process. The theory is just a tool that helps us make decisions in an uncertain world.
Here, we introduce the most basic ideas behind statistical hypothesis testing. We will follow a particularly simple example: coin tossing. More precisely, we will show how to perform a z-test, and we will briefly explain the mathematical ideas underlying it. This kind of method (also called the frequentist method), although widely used in science, is not without flaws and interpretation difficulties. We will show another approach based on Bayesian theory later in this chapter. It is very helpful to understand both approaches.
You need to have a basic knowledge of probability theory for this recipe (random variables, distributions, expectancy, variance, central limit theorem, and so on).
Many frequentist methods for hypothesis testing roughly involve the following steps:
- Writing down the hypotheses, notably the null hypothesis, which is the opposite of the hypothesis we want to prove (with a certain degree of confidence).
- Computing a test statistic, a mathematical formula depending on the test type, the model, the hypotheses, and the data.
- Using the computed value to reject the null hypothesis with a given level of confidence, or fail to conclude (and, consequently, provisionally accept the null hypothesis until future studies reject it).
For example, to test the efficacy of a new drug, doctors may consider, as a null hypothesis, that the drug has no statistically significant effect on a group of patients compared to a control group of patients who do not take the drug. If studies reject the null hypothesis, it is an argument in favor of the efficacy of the drug (but it is not definitive proof).
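To make these steps concrete (an illustrative sketch added here, not part of the original recipe), here is how such a comparison could look with SciPy's two-sample t-test, using simulated numbers in place of real patient outcomes:

import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)  # arbitrary seed; the data below is made up
control = rng.normal(loc=50., scale=10., size=100)  # outcomes without the drug
treated = rng.normal(loc=54., scale=10., size=100)  # outcomes with the drug
# Null hypothesis: both groups have the same mean outcome.
t, pval = st.ttest_ind(treated, control)
print(pval)  # if below the significance level, we reject the null hypothesis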
Here, we flip a coin $n$ times and we observe $h$ heads; we want to know whether the coin is fair (the null hypothesis). We denote the Bernoulli distribution by $B(q)$, with the unknown parameter $q$. A Bernoulli variable is:

- 0 (tail) with probability $1-q$
- 1 (head) with probability $q$
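For illustration (an addition, not in the original text), such Bernoulli variables are easy to simulate with NumPy:

import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
q = .5  # probability of obtaining a head
rng.binomial(1, q, size=10)  # ten Bernoulli(q) draws: 1 means head, 0 means tail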
Here are the steps required to conduct a simple statistical z-test:
- Let's suppose that after $n=100$ flips, we get $h=61$ heads. We choose a significance level of 0.05: is the coin fair or not? Our null hypothesis is: the coin is fair ($q = 1/2$). We set these variables:
import numpy as np
import scipy.stats as st
import scipy.special as sp
n = 100 # number of coin flips
h = 61 # number of heads
q = .5 # null-hypothesis of fair coin
- Let's compute the z-score, which is defined by the following formula (xbar is the estimated average of the distribution). We will explain this formula in the next section, How it works...

$$z = (\overline{x} - q) \sqrt{\frac{n}{q(1-q)}}$$
xbar = float(h) / n
z = (xbar - q) * np.sqrt(n / (q * (1 - q)))
# We don't want to display more than 4 decimals.
z
2.2000
- Now, from the z-score, we can compute the p-value as follows:
pval = 2 * (1 - st.norm.cdf(z))
pval
0.0278
- This p-value is less than 0.05, so we reject the null hypothesis and conclude that the coin is probably not fair.
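As a sanity check (an addition, not part of the original recipe), we can compare this Gaussian approximation with the exact binomial test provided by scipy.stats.binomtest (available in SciPy 1.7 and later), reusing the variables defined above:

res = st.binomtest(h, n, p=q)  # exact two-sided test of the fair-coin hypothesis
res.pvalue  # about 0.035: close to the z-test value, leading to the same conclusion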
How it works...

The coin tossing experiment is modeled as a sequence of $n$ independent random variables $x_i \in \{0, 1\}$ following the Bernoulli distribution $B(q)$. Each $x_i$ represents one coin flip. After our experiment, we get actual values (samples) for these variables.

The following formula gives the sample mean (proportion of heads here):

$$\overline{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

Knowing the expectancy $\mu = q$ and variance $\sigma^2 = q(1-q)$ of the distribution $B(q)$, we compute:

$$E[\overline{x}] = \mu = q, \qquad \mathrm{var}(\overline{x}) = \frac{\sigma^2}{n} = \frac{q(1-q)}{n}$$

The z-test is the normalized version of $\overline{x}$: we subtract its expected mean and divide by its standard deviation under the null hypothesis, so that the result has mean 0 and standard deviation 1 under that hypothesis:

$$z = \frac{\overline{x} - q}{\mathrm{std}(\overline{x})} = (\overline{x} - q) \sqrt{\frac{n}{q(1-q)}}$$

Under the null hypothesis, what is the probability of obtaining a z-test higher (in absolute value) than some quantity $z_0$? This probability is called the (two-sided) p-value. According to the central limit theorem, the z-test approximately follows a standard Gaussian distribution $N(0, 1)$ for large $n$, so we get:

$$p = P[|z| > z_0] = 2 P[z > z_0] \simeq 2(1 - \Phi(z_0))$$

The following diagram illustrates the z-score and the p-value:

[Figure omitted: the standard normal density, with the two tails beyond the observed z-score shaded; their total area is the p-value.]

In this formula, $\Phi$ is the cumulative distribution function of the standard normal distribution, available in SciPy as scipy.stats.norm.cdf. So, given the z-test computed from the data, we compute the p-value: the probability of observing a z-test more extreme than the observed test, under the null hypothesis.
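To connect this formula with the earlier computation (a brief addition, not part of the original recipe), we can check the analytical p-value against a brute-force Monte Carlo estimate, reusing the variables n, q, z, and pval defined above:

# Monte Carlo check: simulate many experiments under the null hypothesis
# and count how often the z-test is at least as extreme as the observed one.
rng = np.random.default_rng(2)  # arbitrary seed for reproducibility
xbar_sim = rng.binomial(n, q, size=100000) / n
z_sim = (xbar_sim - q) * np.sqrt(n / (q * (1 - q)))
print(pval, (np.abs(z_sim) >= abs(z)).mean())  # both of the order of 0.03

The small gap between the two numbers reflects the Gaussian approximation of a discrete binomial distribution.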
If the p-value is less than five percent (a frequently chosen significance level, for arbitrary and historical reasons), we conclude that either:
- The null hypothesis is false, thus we conclude that the coin is unfair.
- The null hypothesis is true, and it's just bad luck if we obtained these values. We cannot make a conclusion.
We cannot disambiguate between these two options in this framework, but typically the first option is chosen. We hit the limits of frequentist statistics, although there are ways to mitigate this problem (for example, by conducting several independent studies and looking at all of their conclusions).
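The second option is not just a theoretical worry (the following simulation is an addition, not part of the original recipe): by construction, a test at the 0.05 level wrongly rejects a true null hypothesis roughly 5% of the time:

# Simulate many experiments with a genuinely fair coin and apply the z-test.
rng = np.random.default_rng(3)  # arbitrary seed
heads = rng.binomial(n, .5, size=100000)
z_fair = (heads / n - .5) * np.sqrt(n / (.5 * .5))
pvals = 2 * (1 - st.norm.cdf(np.abs(z_fair)))
# Fraction of fair coins that the test nevertheless declares unfair:
print((pvals < .05).mean())  # roughly 0.05 (not exactly: the binomial is discrete)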
There's more...

There are many statistical tests that follow this pattern. Reviewing all those tests is largely beyond the scope of this book, but you can take a look at the reference at https://en.wikipedia.org/wiki/Statistical_hypothesis_testing.
As a p-value is not easy to interpret, it can lead to wrong conclusions, even in peer-reviewed scientific publications. For an in-depth treatment of the subject, see http://www.refsmmat.com/statistics/.
See also

- The Getting started with Bayesian methods recipe