Student's Paired (Dependent or Repeated Measures) t-Test for the Difference in Means with SciPy, statsmodels, and pingouin

For a long time now I've wanted to catalogue some of the more common (and less common) statistical tests used for hypothesis testing, focusing primarily on tests used in the context of Experimental Design and A/B Testing. Rather than dive deep into the math behind these tests, I'll try to focus on the code from an end-to-end business solution perspective and compare implementations from various packages. Please have a look in my GitHub repo for all of the relevant code and other statistical tests (a work in progress).

GitHub Repo & Notebook: https://github.com/sam-tritto/statistical-tests/blob/main/Paired_T_Test_for_Means.ipynb

OUTLINE

The Student's Paired (Dependent or Repeated Measures) t-Test for the Difference in Means
Other Important Considerations
Setting up the Python Environment
Power Analysis
Synthetic Data
Check Assumptions
Student's Paired t-Test (SciPy)
Student's Paired t-Test (pingouin)
Difference in Means and 95% Confidence Interval
Effect Size
Visualization
Further Reading & Resources

Student's Paired t-Test for the Difference in Means

Student's Paired t-Test, also known as the Dependent or Repeated Measures t-Test, is a statistical test used to determine whether there is a significant difference in means for one group measured twice (once before and once after an intervention). This is a flexible test, well suited to smaller samples that are symmetrical and approximately Normal. As the sample grows, the results will resemble those of a Z-Test, which relies on the Normal distribution. The variance of the group before and after the intervention should also be roughly equal.

This is a very common test with a rich history and many different flavors (Unpaired/Paired, One/Two Sample). The test was devised in 1908 by William Gosset, who was working for the Guinness Brewery in Dublin, Ireland. He was trying to model the chemical processes of barley and devised this distribution to work for smaller sample sizes. Rather than publish his findings under his real name, he decided to use the pen name "Student" so that Guinness' competitors wouldn't know they were using this distribution. Cheers!

Here, I'll be using the test to compare the test scores of one group of students measured twice, once before an intervention and once after, where the intervention is a change in teaching method.

Other Important Considerations

For A/B Tests and much of Experimental Design the statistical tests are often only one small step of the process. Here I'll outline some other considerations that I'm not accounting for in this tutorial.

Randomization - Randomly assigning participants to the different groups to reduce bias and ensure the groups being compared are similar in all ways except the treatment.

Multiple Tests & Correcting p-values - If the same hypothesis is being tested multiple times or simultaneously, then we need to apply a correction such as the Bonferroni Correction to control the family-wise error rate and reduce the likelihood of false positives.

One-Tailed vs Two-Tailed - In a One-Tailed test we're interested in detecting an effect in a specific direction, either positive or negative, while in a Two-Tailed test we're interested in detecting any significant difference, regardless of the direction. I'll stick to using Two-Tailed tests unless otherwise noted.

Guardrail Metrics - Monitor additional metrics to ensure there are no unintended consequences of implementing the new changes. These metrics act as safeguards to protect both the business and users.

Decision to Launch - Consider both statistical and practical significance before determining to launch the changes.

Setting Up the Python Environment

Sometimes setting up a Python environment with different packages can be tricky. Here's a snippet of the packages I'll be using and their pip install commands. If you're comfortable with it, you should use a virtual environment. I'm using an Anaconda environment but chose to pip install anyway. This test is available only in SciPy and pingouin, but I'll use statsmodels for the Power Analysis.
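As a rough sketch of this setup step: the exact package list and versions from the notebook aren't reproduced here, so the list below is an assumption built from the packages named in this post, plus matplotlib, which is my own addition for plotting. The pip command is left as a comment, and the small helper `installed_versions` is a hypothetical name used only for this check.

```python
# Install once from a terminal (ideally inside a virtual environment):
#   pip install numpy scipy statsmodels pingouin matplotlib
from importlib import metadata

# Assumed package list: the four named in this post plus matplotlib for plots.
PACKAGES = ["numpy", "scipy", "statsmodels", "pingouin", "matplotlib"]


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Print a quick report so missing packages are obvious before running anything.
for name, version in installed_versions(PACKAGES).items():
    print(f"{name:12s} {version or 'NOT INSTALLED'}")
```

Running this before the rest of the code makes it easy to spot a missing package without triggering an import error halfway through.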

Power Analysis

Before collecting your data, a common first step is to determine the number of samples you'll need. This is done by solving an equation in what's called a Power Analysis, and there are different equations for different statistical tests. For the Dependent Paired t-Test I'll be using the TTestPower() class from statsmodels, first to plot the samples needed for various effect sizes. Then I'll circle back, choose an Effect Size, and calculate the number of paired observations needed.

The variables in the equation are:

alpha = The significance level or probability of rejecting the Null Hypothesis when it's actually true.

A value between 0.0 and 1.0, typically 0.05 or 5% is used.

power = The statistical power or probability of correctly rejecting the Null Hypothesis when it's false.

A value between 0.0 and 1.0, typically 0.8 or 80% is used.

effect size = The magnitude of the difference in means that we'd like to detect, expressed in standard deviation units (Cohen's d).

A positive value that is not capped at 1.0; as a rule of thumb Cohen's d = 0.2 is a small effect, 0.5 is a moderate effect, and 0.8 is a large effect.

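samples = The number of paired observations needed (the nobs argument in statsmodels). Each participant is measured twice, so this is also the size of the single group.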

One of these variables must be None to solve the equation. To solve for the number of samples needed, you should set that variable to None, or simply omit it as I've done here.

And finally, once we understand the relationship between effect size and sample size, we can estimate the sample size we'll need for our experiment.
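Here is a minimal sketch of that calculation, assuming the conventional alpha of 0.05 and power of 0.8 described above. The grid of effect sizes and the helper name `pairs_needed` are hypothetical choices for illustration, not the values used in the notebook.

```python
import math

import numpy as np
from statsmodels.stats.power import TTestPower

ALPHA = 0.05   # significance level
POWER = 0.80   # desired statistical power


def pairs_needed(effect_size, alpha=ALPHA, power=POWER):
    """Solve the power equation for the number of pairs (nobs is left unset)."""
    nobs = TTestPower().solve_power(
        effect_size=effect_size,
        alpha=alpha,
        power=power,
        alternative="two-sided",
    )
    # solve_power may return a float or a 1-element array; round up to whole pairs.
    return math.ceil(float(np.ravel(nobs)[0]))


# Hypothetical grid of effect sizes (Cohen's d), for illustration only.
for d in (0.1, 0.2, 0.5, 0.8):
    print(f"Cohen's d = {d:.1f} -> {pairs_needed(d)} pairs")
```

The required sample size grows quickly as the effect size shrinks, which is why choosing a realistic effect size up front matters so much.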

Synthetic Data

Here I'll use numpy to generate synthetic student test data with slightly different means before and after the intervention. Even though the data only needs to be semi-Normal, I'll use the Normal distribution to generate some baseline data and then add data from another Normal distribution to mimic an increase in test scores. Did you know that if you add independent draws from two Normal distributions, the resulting data is also Normal? And we already know how many samples we'd like to gather from our Power Analysis (787).
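A sketch of how such data could be generated. The sample size of 787 comes from the text above, but the seed, means, and standard deviations are hypothetical values I picked for illustration, so the numbers this produces will not match the results quoted later in the post.

```python
import numpy as np

N_PAIRS = 787   # sample size quoted above from the Power Analysis
SEED = 42       # hypothetical seed, only for reproducibility

# Hypothetical parameters, not the author's values.
BASELINE_MEAN, BASELINE_SD = 70.0, 10.0   # scores before the intervention
GAIN_MEAN, GAIN_SD = 5.0, 3.0             # per-student change in score

rng = np.random.default_rng(SEED)
before = rng.normal(BASELINE_MEAN, BASELINE_SD, size=N_PAIRS)
gain = rng.normal(GAIN_MEAN, GAIN_SD, size=N_PAIRS)
after = before + gain   # sum of independent Normal draws, so still Normal

print(f"mean before: {before.mean():.2f}")
print(f"mean after:  {after.mean():.2f}")
print(f"mean gain:   {(after - before).mean():.2f}")
```

Building the "after" scores as "before" plus a gain keeps the two measurements tied to the same student, which is exactly the pairing the test relies on.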

Check Assumptions

Here's some code to check the assumptions of the Dependent Paired t-Test. Since Student's t-Test only requires the data to approximate a Normal distribution, the p-values from these checks should be taken with a grain of salt. So it might be worth increasing the alpha to 0.1, or just using your best judgement.

To ensure that the data is appropriate for the paired t-Test, you should consider the following:

Approximate Normality: The data (strictly, the paired differences) should be approximately Normal
Homogeneity of Variances: The group before and after the intervention should have roughly the same variance
Dependence of Observations: The two samples should be dependent on each other, meaning each "before" score is paired with the same student's "after" score
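The three checks above could be sketched as follows. The specific tests are my assumption (Shapiro-Wilk for Normality, Levene for equal variances, and a Pearson correlation as a rough sanity check on the pairing), since the notebook's own choices aren't shown here, and the data is hypothetical synthetic data. Dependence is really a property of the study design, so the correlation is only a hint, not a proof.

```python
import numpy as np
from scipy import stats

# Hypothetical synthetic data standing in for the real scores.
rng = np.random.default_rng(0)
before = rng.normal(70.0, 10.0, size=787)
after = before + rng.normal(5.0, 3.0, size=787)
diff = after - before

ALPHA = 0.05  # threshold for the assumption checks

# Approximate Normality: Shapiro-Wilk on the paired differences (H0: Normal).
shapiro_stat, shapiro_p = stats.shapiro(diff)

# Homogeneity of variances: Levene's test (H0: equal variances).
levene_stat, levene_p = stats.levene(before, after)

# Dependence: paired measurements on the same students should be correlated.
r, r_p = stats.pearsonr(before, after)

print(f"Shapiro-Wilk p = {shapiro_p:.3f} -> "
      f"{'no evidence against Normality' if shapiro_p > ALPHA else 'Normality is questionable'}")
print(f"Levene p       = {levene_p:.3f} -> "
      f"{'no evidence of unequal variances' if levene_p > ALPHA else 'variances may differ'}")
print(f"Pearson r      = {r:.3f} (before vs after)")
```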

t-Test (SciPy)

Now for the fun stuff. This is the SciPy implementation of the paired t-Test. The ttest_rel() function from SciPy defaults to a Two-Sided test, so you really only need to pass in the two sets of measurements. The result holds the test statistic, p-value, and degrees of freedom (dof); we compare the p-value against our pre-determined alpha. We can also call the confidence_interval() method on the result to calculate the 95% confidence interval for the difference in means.
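A minimal sketch of the SciPy call, run on hypothetical synthetic data rather than the notebook's. Note that the `df` attribute and the `confidence_interval()` method need a reasonably recent SciPy (1.10 or later, as far as I know).

```python
import numpy as np
from scipy import stats

# Hypothetical synthetic data standing in for the real scores.
rng = np.random.default_rng(0)
before = rng.normal(70.0, 10.0, size=787)
after = before + rng.normal(5.0, 3.0, size=787)

ALPHA = 0.05

# Paired t-Test; the default alternative is two-sided.
result = stats.ttest_rel(after, before)
ci = result.confidence_interval(confidence_level=0.95)

print(f"t statistic: {result.statistic:.3f}")
print(f"p-value:     {result.pvalue:.3g}")
print(f"dof:         {result.df}")
print(f"95% CI for the mean difference: [{ci.low:.2f}, {ci.high:.2f}]")

if result.pvalue < ALPHA:
    print("Reject the Null Hypothesis: the means differ.")
else:
    print("Fail to reject the Null Hypothesis.")
```

Passing `after` first means a positive statistic corresponds to an increase in scores; swapping the arguments only flips the sign.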

t-Test (pingouin)

The pingouin implementation of the paired t-Test is by far my favorite. There's a lot of bang for the buck here; who doesn't love a good one-liner? Simply set the paired parameter to True to run a paired t-Test. This implementation returns all of the relevant info you'd need for your study: degrees of freedom, confidence intervals, effect size, Bayes factor, and power. Whoa!

Difference in Means and 95% Confidence Interval

Ok... so now that we know there is a significant difference, what else can we gather? A natural place to start is investigating what the difference in means actually is. After that we can calculate a 95% confidence interval around this difference to quantify the uncertainty. The pingouin package has already given us this information, but we can also code it up ourselves. Using the t.ppf() function from SciPy to access the t-distribution, we can see that the difference in means between our before and after measurements is 5.28, with a 95% confidence interval of [5.07, 5.49]. A quick check validates both the SciPy and pingouin results as well as our code here.
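A sketch of the hand-rolled calculation: the interval is the mean of the paired differences plus or minus the critical t value times the standard error. The data is hypothetical, so the numbers printed will not be the 5.28 and [5.07, 5.49] quoted above.

```python
import numpy as np
from scipy import stats

# Hypothetical synthetic data standing in for the real scores.
rng = np.random.default_rng(0)
before = rng.normal(70.0, 10.0, size=787)
after = before + rng.normal(5.0, 3.0, size=787)

CONFIDENCE = 0.95

diff = after - before                   # paired differences
n = diff.size
mean_diff = diff.mean()
se = diff.std(ddof=1) / np.sqrt(n)      # standard error of the mean difference

# Two-sided critical value from the t-distribution with n - 1 dof.
t_crit = stats.t.ppf(1 - (1 - CONFIDENCE) / 2, df=n - 1)

ci_low = mean_diff - t_crit * se
ci_high = mean_diff + t_crit * se

print(f"mean difference: {mean_diff:.2f}")
print(f"95% CI:          [{ci_low:.2f}, {ci_high:.2f}]")
```

Because the test works on the differences, this is the same interval a one-sample t-Test on `diff` would produce.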

Effect Size

We know there is a significant difference between the group's before and after means, but the Effect Size will tell us how strong the effect of the intervention is. The appropriate Effect Size to use for this test is Cohen's d, which when calculated yields 0.52. There are a few different rules of thumb for interpreting Cohen's d, but here I'm using a granular interpretation. This translates to a moderate difference in means, and we already know it is statistically significant.
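There is more than one way to compute Cohen's d for paired data, and this post doesn't say which variant produced 0.52, so the sketch below shows two common ones on hypothetical synthetic data. The labelling helper `label_effect` is hypothetical too: it uses the 0.2 / 0.5 / 0.8 rule of thumb from the Power Analysis section, not the more granular scale mentioned above.

```python
import numpy as np

# Hypothetical synthetic data standing in for the real scores.
rng = np.random.default_rng(0)
before = rng.normal(70.0, 10.0, size=787)
after = before + rng.normal(5.0, 3.0, size=787)
diff = after - before

# Variant 1: mean difference scaled by the average of the two variances.
d_av = diff.mean() / np.sqrt((before.var(ddof=1) + after.var(ddof=1)) / 2)

# Variant 2: mean difference scaled by the SD of the paired differences.
d_z = diff.mean() / diff.std(ddof=1)


def label_effect(d):
    """Label |d| with the 0.2 / 0.5 / 0.8 rule of thumb."""
    size = abs(d)
    if size < 0.2:
        return "negligible"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "moderate"
    return "large"


print(f"d_av = {d_av:.2f} ({label_effect(d_av)})")
print(f"d_z  = {d_z:.2f} ({label_effect(d_z)})")
```

The two variants can differ a lot when the before and after scores are strongly correlated, so it is worth stating which one you report.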

Visualization

I'll end with an appropriate visualization for comparing the means between 2 groups - an overlaid KDE (Kernel Density Estimation) distribution plot for each group. I've clipped these scores at 0 and 100 since they are test scores - no extra credit! A vertical line at the mean of each group also makes them easy to compare.
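One way to draw such a plot is sketched below. The plotting library used in the notebook isn't shown here, so matplotlib with SciPy's `gaussian_kde` is my assumption, as are the colors and the synthetic data. The clipping is approximated by clipping the scores and evaluating the density only on the 0 to 100 range.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical synthetic data standing in for the real scores.
rng = np.random.default_rng(0)
before = np.clip(rng.normal(70.0, 10.0, size=787), 0, 100)
after = np.clip(before + rng.normal(5.0, 3.0, size=787), 0, 100)

grid = np.linspace(0, 100, 500)  # evaluate densities on valid scores only

fig, ax = plt.subplots(figsize=(8, 4))
for scores, name, color in ((before, "Before", "tab:blue"),
                            (after, "After", "tab:orange")):
    density = gaussian_kde(scores)(grid)
    ax.plot(grid, density, color=color, label=f"{name} (mean {scores.mean():.1f})")
    ax.fill_between(grid, density, color=color, alpha=0.2)
    ax.axvline(scores.mean(), color=color, linestyle="--")  # group mean

ax.set_xlim(0, 100)
ax.set_xlabel("Test score")
ax.set_ylabel("Density")
ax.set_title("Test scores before and after the intervention")
ax.legend()
fig.tight_layout()
```

To see the figure, call `fig.savefig(...)` with a path of your choice, or remove the backend line and use `plt.show()` in a notebook.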