# A brief introduction to hypothesis testing

- A brief introduction to hypothesis testing
- Introduction
- Parametric hypothesis testing
- T-test
- Z-test
- F-test (ANOVA)
- Tukey Test
- Chi-Squared test
- Non-Parametric hypothesis testing
- Mann-Whitney U-test — Wilcoxon rank-sum test
- Wilcoxon signed-rank test
- Kruskal–Wallis test
- Foreshadowing: Advanced Topics—but for another day

# Introduction

Hypothesis testing is fundamental to the study of statistics because it allows us to draw conclusions and inferences about populations based on a sampled dataset. It is one of those concepts that most scientists use daily, but very few take the time to carefully understand. As such, it is prone to a form of abuse known as p-hacking, which is at the heart of the replication crisis in the social sciences.

In short, hypothesis testing provides a structured approach for evaluating claims, called hypotheses, based on statistical evidence. It typically involves assessing the plausibility of two competing hypotheses: the null hypothesis (H0) and the alternative hypothesis (HA).

The null hypothesis is the status quo or default assumption, which indicates no significant difference or relationship between variables, or that any observed difference is purely due to chance. The alternative hypothesis represents the opposite: that there is a significant difference or relationship between variables, suggesting that the observed data is not solely attributable to chance.

To evaluate these hypotheses, we use statistical tests that measure the strength of evidence against the null hypothesis via a test statistic. This evidence is quantified by the p-value, which corresponds to the probability of observing the data, or more extreme results, under the assumption that the null hypothesis is true. The p-value is compared to a pre-defined significance level, typically denoted $\alpha$, which determines the threshold for rejecting the null hypothesis—usually $\alpha = 0.05$. If the p-value is lower than $\alpha$, we reject the null hypothesis, i.e. the observed result is unlikely to be due to chance alone.

Hypothesis testing provides a framework for making data-driven decisions and drawing meaningful conclusions. By formulating clear null and alternative hypotheses and defining an appropriate significance level, we can assess the statistical evidence and make informed judgments about the relationships, differences, or effects of interest in various domains of study.

# Parametric hypothesis testing

Parametric tests are usually any student's introduction to hypothesis testing. Put briefly, parametric hypothesis testing assumes that the underlying population we are sampling from is approximately normally distributed. There are several commonly used tests, which we will briefly cover here.

### T-test

The t-test is employed when we want to compare the means of one or two independent groups, in what are called one-sample or two-sample t-tests. It assumes that the population data follows a normal distribution and that the number of samples is low, $n \sim \mathcal{O}(10)$. For example, the one-sample t-test calculates the test statistic

$t = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$

where $\bar{X}$ is the sample mean of the data, $\sigma$ is the sample standard deviation, $\mu$ is the population mean, and $n$ is the number of samples. The two-sample test is identical, but we replace $\mu$ with the second sample mean and $\sigma$ with the pooled standard deviation. By comparing the t-value to a critical value from the t-distribution, or by calculating the p-value, we can determine if the means are significantly different (reject the null H0). Intuitively, this translates to "how many standard errors is the sample mean away from the population mean?"

It’s very easy to implement in Python, assuming we have two datasets of men’s and women’s heights (with known population means):

```
from scipy.stats import ttest_1samp, ttest_ind
# calculate one-sample t-test
t_statistic_1, p_value_1 = ttest_1samp(mens_height, womens_population_mean)
# calculate two-sample t-test
t_statistic_2, p_value_2 = ttest_ind(mens_height, womens_height)
```
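As a sanity check on the formula above, we can compute the one-sample t-statistic by hand and confirm it matches scipy. The heights below are made up purely for illustration:

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(0)
# made-up sample of 15 heights (cm) and a hypothetical population mean
heights = rng.normal(loc=178, scale=7, size=15)
population_mean = 165

# t = (sample mean - population mean) / (sample std / sqrt(n))
n = len(heights)
t_manual = (heights.mean() - population_mean) / (heights.std(ddof=1) / np.sqrt(n))

t_scipy, p_value = ttest_1samp(heights, population_mean)
print(t_manual, t_scipy)  # the two agree
```

Note the `ddof=1` in the standard deviation: the t-test uses the Bessel-corrected sample standard deviation.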

### Z-test

The z-test is similar to the t-test but is used when the population is approximately normally distributed and its standard deviation is known (or when you have a large sample size, $n\sim \mathcal{O}(100)$). The z-test also permits both a one-sample and a two-sample test, in exactly the same fashion as the t-test.

Again, it is very straightforward to implement in Python using the statsmodels package:

```
from statsmodels.stats.weightstats import ztest
# calculate one sample z-test
z_statistic_1, p_value_1 = ztest(mens_height, value=mean_womens_height)
# calculate two sample z-test
z_statistic_2, p_value_2 = ztest(mens_height, x2=womens_height)
```

### F-test (ANOVA)

The F-test, also known as analysis of variance (ANOVA), is used to compare means across multiple groups. For example, suppose we have $K$ different groups: the null hypothesis represents the belief that all $K$ groups have the same mean, whereas the alternative represents the belief that *one or more* of the means is different. The F-test calculates the ratio of between-group variability to within-group variability, defined as

$F = \frac{\sigma_{\rm explained}^{2}}{\sigma_{\rm unexplained}^{2}}$

where $\sigma_{\rm explained}^{2} = \sum_{i=1}^{K}n_{i}(\bar{X}_{i} - \bar{X})^{2}/(K-1)$, with $\bar{X}$ the mean of the whole dataset, $\bar{X}_{i}$ the group means, and $n_{i}$ the number of samples in each group. The denominator $\sigma_{\rm unexplained}^{2} = \sum_{i=1}^{K}\sum_{j=1}^{n_{i}}(X_{ij} - \bar{X}_{i})^{2}/(N-K)$ is built from the sum of squared errors within each group, where $X_{ij}$ is the $j$-th observation in group $i$ and $N$ is the total number of samples.
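To make these definitions concrete, here is a sketch (with made-up group data) that computes $F$ by hand and checks it against scipy's `f_oneway`:

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(1)
# three made-up groups, purely for illustration
groups = [rng.normal(10, 2, 30), rng.normal(12, 2, 25), rng.normal(11, 2, 35)]

K = len(groups)
N = sum(len(g) for g in groups)
grand_mean = np.concatenate(groups).mean()

# between-group (explained) variance
explained = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups) / (K - 1)
# within-group (unexplained) variance
unexplained = sum(((g - g.mean()) ** 2).sum() for g in groups) / (N - K)

F_manual = explained / unexplained
F_scipy, p_value = f_oneway(*groups)
print(F_manual, F_scipy)  # the two agree
```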

We can visualise this using a simple example where we create a dataset showing the distribution of car prices given their colour as shown below. In this example, red and black are very similar but clearly different from blue and green.

We can compute the F-test using the following snippet of code. In this example we would reject the null, since at least one colour has a significantly different mean from the others.

```
from scipy.stats import f_oneway
import pandas as pd
# we have a dataframe df with all our data
grps = pd.unique(df['color'].values)
# retain only the price and color columns
df_anova = df[['price', 'color']]
d_data = {grp:df_anova['price'][df_anova['color'] == grp] for grp in grps}
# run the one-way ANOVA across the four colour groups
F_statistic, p_value = f_oneway(d_data['black'], d_data['red'], d_data['blue'], d_data['green'])
```

### Tukey Test

Tukey's test, also referred to as the Tukey-Kramer test, is a post hoc test performed after an ANOVA to determine which specific group means differ significantly from each other. It compares all possible pairs of means and calculates a test statistic based on the t-test. By comparing the test statistic to the critical value or calculating the p-value, we can identify significant pairwise differences.

Using the same example as above, we can compute the pairwise p-values as shown in the correlation matrix above. Implementing this in Python is again very straightforward:

```
import numpy as np
import pandas as pd
from scipy.stats import tukey_hsd
# compute tukey test
m_comp = tukey_hsd(*d_data.values())
# create pairwise tuples from the colour values
pairs = [(i, j) for i in d_data.keys() for j in d_data.keys()]
# create a dataframe with one column as the colour pairs and another column as the p-values
p_values = np.round(m_comp.pvalue.flatten(),6)
tukey_summary = pd.DataFrame(data=p_values, index=pairs, columns=['p_value'])
```

### Chi-Squared test

Lastly, we will consider the chi-squared test, which is used to determine if there is a significant relationship between two categorical variables. For example, you might consider voting patterns between men and women and whether they tend to vote left or right. Here the null hypothesis is that there is no relationship between the two variables, while the alternative hypothesis is that there is one; that is to say, sex and voting patterns are not independent but are correlated. In short, you can think of the chi-squared test as determining whether the association between two categorical variables is significantly different from zero.
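Sticking with the voting example, here is a sketch using scipy's `chi2_contingency` on a contingency table of counts (the numbers are invented purely for illustration):

```python
from scipy.stats import chi2_contingency

# hypothetical contingency table of vote counts
#             left  right
observed = [[200,  150],   # men
            [250,  120]]   # women

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value, dof)
```

With these made-up counts the p-value comes out well below 0.05, so we would reject the null that sex and voting preference are independent.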

# Non-Parametric hypothesis testing

Where parametric hypothesis testing assumes the underlying population distribution is normal, non-parametric hypothesis testing is used when the population probability distribution is not assumed to have any specific form. Here I will cover a couple of the most commonly used tests!

### Mann-Whitney U-test — Wilcoxon rank-sum test

The Mann-Whitney U test, sometimes called the Wilcoxon rank-sum test, assumes the null hypothesis that two independent probability distributions $P_{1}(x)$ and $P_{2}(x)$ are the same. It does not assume any specific distributional shape for the data. The test ranks all the observations from both groups together; that is, it collates the data into a single list and ranks it. It then separates the data back into the two groups, calculates the sum of ranks for each group, $S_{1}$ and $S_{2}$, and compares these sums. The idea is pretty simple: if the two distributions are equivalent, then these two sums should come out roughly equal. The test statistic is then computed as

$U = {\rm min}(U_{1},U_{2})\,,\qquad U_{i} = n_{1}n_{2} + \frac{n_{i}(n_{i}+1)}{2} - S_{i}$

where $n_{i}$ is the number of samples in each group. The final test statistic is then given by

$z = \frac{U - \mu_{U}}{\sigma_{U}}\,, \quad \mu_{U} = \frac{n_{1}n_{2}}{2}\,, \quad \sigma_{U} = \sqrt{\frac{n_{1}n_{2}(n_{1}+n_{2}+1)}{12}}$

There is a way to deal with the case when you have tied ranks, but we won’t cover it here. Armed with this test statistic, we can then compute the p-values accordingly.
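The ranking procedure is easy to sketch directly with made-up data, and we can check the hand-computed $U$ against scipy's `mannwhitneyu`. One caveat: scipy reports the $U$ belonging to the first sample, so we compare against the minimum over both groups:

```python
import numpy as np
from scipy.stats import rankdata, mannwhitneyu

rng = np.random.default_rng(2)
# two made-up continuous samples (so there are no ties)
x, y = rng.normal(0, 1, 20), rng.normal(0.5, 1, 25)
n1, n2 = len(x), len(y)

# rank all observations together, then split the ranks back out
ranks = rankdata(np.concatenate([x, y]))
S1, S2 = ranks[:n1].sum(), ranks[n1:].sum()

U1 = n1 * n2 + n1 * (n1 + 1) / 2 - S1
U2 = n1 * n2 + n2 * (n2 + 1) / 2 - S2
U = min(U1, U2)

u_scipy = mannwhitneyu(x, y).statistic
print(U, min(u_scipy, n1 * n2 - u_scipy))  # the two agree
```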

We can apply this to the example from my Random Forest page, where we compared the underlying distributions of a training dataset and a test dataset of house prices. There we wanted to check that our Random Forest model was recreating the underlying distribution of house prices, given a test set that was similar in structure to the training set.

We can easily test this in Python via the ranksums method (which returns $z$) or the mannwhitneyu method (which returns $U$):

```
from scipy import stats
# ranksums
z_stat, p_val = stats.ranksums(np.exp(train_y), np.exp(test_y), alternative='two-sided')
# Mann-Whitney
u_stat, p_val = stats.mannwhitneyu(np.exp(train_y), np.exp(test_y), alternative='two-sided')
```

The result from this test leads us to fail to reject the null, i.e. the two distributions are consistent with being the same.

### Wilcoxon signed-rank test

The Wilcoxon signed-rank test is used to assess whether there is a significant difference between paired observations in a sample. It focuses on the differences between paired values: it ranks the absolute differences while keeping track of their signs, and tests whether the median difference is significantly different from zero. If it is, then we would reject the null.

It’s most illuminating to compare how this differs from the rank-sum test, which on face value sounds similar. Suppose you had two treatments: if you gave both treatments (in a randomised order) to every patient and then recorded the results, you would use the signed-rank test. If, on the other hand, you split the patients into two groups and gave one group one treatment and the other group the other treatment, you would use the rank-sum test. This is because the signed-rank test handles distributions that are potentially dependent, whereas the rank-sum test assumes independence.

We can also implement this very easily in Python:

```
from scipy import stats
# here df is a dataframe where every patient received both treatments
w_stat, p_value = stats.wilcoxon(df['treatment_1_results'], df['treatment_2_results'])
```

### Kruskal–Wallis test

The Kruskal-Wallis test is to the Mann-Whitney U-test as the F-test is to the t-test; that is, it is a non-parametric one-way ANOVA. It is suitable when the data doesn’t meet the assumption of normality. The test ranks the observations from all groups combined, calculates the sum of ranks for each group, and compares these sums. The test statistic is the H-value, which measures the variability between the groups relative to the variability within the groups. The test assumes that the groups have similarly shaped distributions, differing only in their medians.

You would use it in effectively the same way as an ANOVA test, provided you knew your underlying distributions weren’t normal. In fact, the distributions we used in our car example above for the ANOVA were actually log-normal, so we really should have been using the Kruskal-Wallis test. We can implement this in Python, again using scipy:

```
from scipy.stats import kruskal
import pandas as pd
# we have a dataframe df with all our data
grps = pd.unique(df['color'].values)
# retain only the price and color columns
df_anova = df[['price', 'color']]
d_data = {grp:df_anova['price'][df_anova['color'] == grp] for grp in grps}
# run the Kruskal-Wallis test across the four colour groups
H_statistic, p_value = kruskal(d_data['black'], d_data['red'], d_data['blue'], d_data['green'])
```

# Foreshadowing: Advanced Topics—but for another day

This post is getting a bit long, so I’ll wrap it up here by noting that there are some more advanced topics I might cover in the future, such as multiple hypothesis testing (e.g., the Bonferroni correction), power analysis, and Bayesian hypothesis testing. The latter is actually very interesting to combine with probabilistic programming packages such as PyMC. To briefly summarise: multiple hypothesis testing methods adjust the significance level to account for the increased chance of Type I errors when conducting multiple tests. Power analysis helps determine the required sample size to detect a desired effect size, ensuring studies have sufficient statistical power. Bayesian hypothesis testing incorporates prior beliefs and updates them based on observed data, providing a framework for quantifying evidence and refining hypotheses. I’ll definitely do another post on this in the future.
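As a small taste of the multiple-testing idea, the Bonferroni correction simply multiplies each p-value by the number of tests performed (capped at 1). A sketch with made-up p-values:

```python
import numpy as np

# made-up p-values from five hypothetical tests
p_values = np.array([0.005, 0.04, 0.03, 0.20, 0.50])
alpha = 0.05

# Bonferroni: scale each p-value by the number of tests, capping at 1
p_corrected = np.minimum(p_values * len(p_values), 1.0)
reject = p_corrected < alpha
print(reject)  # only the first test survives the correction
```

Note that 0.04 and 0.03 would individually pass at $\alpha = 0.05$, but neither survives once we account for having run five tests.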

To summarise: this post provided a brief introduction to hypothesis testing in statistics, covering the null and alternative hypotheses, statistical tests, and significance levels. It covered parametric tests such as the t-test, the F-test, and the chi-squared test, as well as non-parametric tests such as the Mann-Whitney U-test and the Kruskal-Wallis test. I also added a few short snippets of code and examples showing how to implement these in Python!