Student's Unpaired (Independent or Two-Sample) t-Test for the Difference in Means with SciPy, statsmodels, and pingouin

For a long time now I've wanted to catalogue some of the more common (and less common) statistical tests used for hypothesis testing, primarily focusing on tests used in the context of Experimental Design and A/B Testing. Rather than dive deep into the math behind these tests, I'll focus on the code from an end-to-end business solution perspective and compare implementations from various packages. Please have a look in my GitHub repo for all of the relevant code and other statistical tests (a work in progress).

GitHub Repo & Notebook: https://github.com/sam-tritto/statistical-tests/blob/main/Unpaired_T_Test_for_Means.ipynb

OUTLINE

The Student's Unpaired (Independent or Two Sample) t-Test for the Difference in Means
Other Important Considerations
Setting up the Python Environment
Power Analysis
Synthetic Data
Check Assumptions
Student's Unpaired t-Test (SciPy)
Student's Unpaired t-Test (statsmodels)
Student's Unpaired t-Test (pingouin)
Difference in Means and 95% Confidence Interval
Effect Size
Visualization
Further Reading & Resources

Student's Unpaired t-Test for the Difference in Means

Student's Unpaired t-Test, also known as the Two-Sample or Independent t-Test, is a statistical test used to determine whether there is a significant difference in means between two independent groups. This is a flexible test well suited to smaller samples that are symmetrical and approximately Normal. As the sample size grows, the t-distribution approaches the Normal distribution and the results will resemble those of a Z-Test. The variances of the two groups should also be roughly equal.

This is a very common test with a rich history and many different flavors (Unpaired/Paired, One/Two Sample). The test was devised by William Gosset, who was working for the Guinness Brewery in Dublin, Ireland in 1908. He was trying to model the chemical processes of barley and devised this distribution to work for smaller sample sizes. Rather than publish his findings under his real name, he decided to use the pen name "Student" so that Guinness's competitors wouldn't know they were using this distribution. Cheers!

Here, I'll be using the test to compare the test scores of two independent groups of similar students who have been taught with two different teaching methods.

Other Important Considerations

For A/B Tests and much of Experimental Design the statistical tests are often only one small step of the process. Here I'll outline some other considerations that I'm not accounting for in this tutorial.

Randomization - Randomly assigning participants to the different groups to reduce bias and ensure the groups being compared are similar in all ways except the treatment.

Multiple Tests & Correcting p-values - If the same hypothesis is being tested multiple times or simultaneously then we need to apply corrections such as a Bonferroni Correction to control the family-wise error rate and reduce the likelihood of false positives.

One-Tailed vs Two-Tailed - In a One-Tailed test we're interested in detecting an effect in a specific direction, either positive or negative, while in a Two-Tailed test we're interested in detecting any significant difference, regardless of the direction. I'll stick to using Two-Tailed tests unless otherwise noted.

Guardrail Metrics - Monitor additional metrics to ensure there are no unintended consequences of implementing the new changes. These metrics act as safeguards to protect both the business and users.

Decision to Launch - Consider both statistical and practical significance before determining to launch the changes.

Setting Up the Python Environment

Sometimes setting up a Python environment with different packages can be tricky. Here's a snippet of the packages I'll be using and their pip install commands. If you're comfortable with it you should use a virtual environment. I'm using an Anaconda environment but chose to pip install anyway. This test is available in SciPy, statsmodels, and pingouin.
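As a rough sketch of that setup step, the snippet below checks which packages are importable and prints a pip install command for any that are missing. The package list is my assumption about what a walkthrough like this needs (the three testing packages named above plus numpy and matplotlib for data and plots), and missing_packages() is a hypothetical helper, not part of any library.

```python
import importlib.util

# Assumed package list for this tutorial: import name -> pip name.
PACKAGES = {
    "numpy": "numpy",
    "scipy": "scipy",
    "statsmodels": "statsmodels",
    "pingouin": "pingouin",
    "matplotlib": "matplotlib",
}


def missing_packages(packages=PACKAGES):
    """Return the pip names of packages that cannot be found in this environment."""
    return [
        pip_name
        for import_name, pip_name in packages.items()
        if importlib.util.find_spec(import_name) is None
    ]


missing = missing_packages()
if missing:
    # Run the printed command in a terminal (ideally inside a virtual environment).
    print("pip install " + " ".join(missing))
else:
    print("All packages are already installed.")
```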

Power Analysis

Before collecting your data, a common first step is to determine the number of samples you'll need to collect for each group. This is done by solving an equation in what's called a Power Analysis. There are different equations for different statistical tests. For the Independent Unpaired t-Test I'll be using the TTestIndPower class from statsmodels, first to plot the samples needed for various effect sizes. Then I'll circle back, choose an Effect Size, and calculate the sample size needed per group.

The variables in the equation are:

alpha = The significance level or probability of rejecting the Null Hypothesis when it's actually true.

A value between 0.0 and 1.0, typically 0.05 or 5% is used.

power = The statistical power or probability of correctly rejecting the Null Hypothesis when it's false.

A value between 0.0 and 1.0, typically 0.8 or 80% is used.

effect size = The magnitude of the difference between groups that we'd like to detect, expressed as a standardized difference in means (Cohen's d).

A non-negative value that is not capped at 1.0; by Cohen's rule of thumb d = 0.2 is a small effect, 0.5 is a moderate effect, and 0.8 is a large effect.

samples = The number of samples needed per group.

One of these variables must be None to solve the equation. To solve for the number of samples needed, you should set that variable to None, or simply omit it as I've done here.

And finally, once we understand the relationship between effect size and sample size, we can estimate the sample size we'll need for our experiment.
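A minimal sketch of the power analysis might look like the following. The alpha and power values are the typical ones listed above; the effect size of 0.1 is my assumption, chosen because it gives a per-group sample size close to the figure used for the synthetic data later on. Instead of a plot, the loop simply prints the sample size needed at a few effect sizes.

```python
import math

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

alpha = 0.05  # significance level
power = 0.80  # desired statistical power

# Sample size per group for a few illustrative effect sizes (Cohen's d).
# nobs1 is omitted (left as None), so solve_power solves for it.
for d in (0.1, 0.2, 0.5, 0.8):
    n = analysis.solve_power(
        effect_size=d, alpha=alpha, power=power, ratio=1.0, alternative="two-sided"
    )
    print(f"Cohen's d = {d:.1f} -> {math.ceil(float(n))} samples per group")

# Assumed effect size for this experiment (not stated in the original text).
chosen_effect_size = 0.1
n_per_group = math.ceil(
    float(
        analysis.solve_power(
            effect_size=chosen_effect_size,
            alpha=alpha,
            power=power,
            ratio=1.0,
            alternative="two-sided",
        )
    )
)
print(f"Samples needed per group: {n_per_group}")
```

Smaller effects need far more data: halving the effect size roughly quadruples the required sample size.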

Synthetic Data

Here I'll use numpy to generate synthetic data with slightly different means for each group. Even though the data only needs to be approximately Normal, I'll use the Normal distribution to generate it. And we already know how many samples we'd like to gather from our Power Analysis (1571 per group).
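One way to generate such data is sketched below. The group means (70 and 75), the standard deviation (10), and the seed are hypothetical values I picked for illustration, so the numbers they produce will not match the results quoted later in this post. Scores are clipped to the 0 to 100 range since they represent test scores.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # fixed seed so the example is reproducible

n_per_group = 1571  # per-group sample size from the Power Analysis

# Hypothetical parameters, not the ones used in the original notebook.
mean_a, mean_b, sd = 70.0, 75.0, 10.0

# Draw Normal scores for each teaching method and clip them to valid test scores.
group_a = np.clip(rng.normal(loc=mean_a, scale=sd, size=n_per_group), 0, 100)
group_b = np.clip(rng.normal(loc=mean_b, scale=sd, size=n_per_group), 0, 100)

print(f"Group A: mean = {group_a.mean():.2f}, sd = {group_a.std(ddof=1):.2f}")
print(f"Group B: mean = {group_b.mean():.2f}, sd = {group_b.std(ddof=1):.2f}")
```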

Check Assumptions

Here's some code to check the assumptions of the Independent Unpaired t-test. We can use a few standard statistical tests to help us; however, testing for Independence is difficult. Since Student's test only requires the data to approximate a Normal distribution, the p-values from these assumption checks should be taken with a grain of salt. So it might be worth increasing the alpha to 0.1, or just use your best judgement.

To ensure that the data is appropriate for the t-test, you should consider the following:

Approximate Normality: The data must be approximately Normal
Homogeneity of Variances: The two groups should have roughly the same variance
Independence of Observations: The samples should be independent of each other
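The first two checks in the list above can be sketched with SciPy. I'm assuming the Shapiro-Wilk test for Normality and Levene's test for equal variances here; the original notebook may use different tests. Independence is not tested in code since it comes from the experimental design. The data parameters are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical synthetic data (means, sd, and seed are illustrative assumptions).
rng = np.random.default_rng(seed=42)
group_a = np.clip(rng.normal(70.0, 10.0, 1571), 0, 100)
group_b = np.clip(rng.normal(75.0, 10.0, 1571), 0, 100)

alpha = 0.05

# Approximate Normality: Shapiro-Wilk test on each group.
# Null hypothesis: the sample was drawn from a Normal distribution.
shapiro_a = stats.shapiro(group_a)
shapiro_b = stats.shapiro(group_b)
for name, res in (("A", shapiro_a), ("B", shapiro_b)):
    verdict = "looks Normal" if res.pvalue > alpha else "deviates from Normal"
    print(f"Shapiro-Wilk group {name}: W = {res.statistic:.4f}, p = {res.pvalue:.4f} ({verdict})")

# Homogeneity of Variances: Levene's test (median-centered, robust to non-Normality).
# Null hypothesis: both groups have equal variance.
levene = stats.levene(group_a, group_b, center="median")
verdict = "variances look equal" if levene.pvalue > alpha else "variances differ"
print(f"Levene: W = {levene.statistic:.4f}, p = {levene.pvalue:.4f} ({verdict})")
```

With samples this large, Normality tests can flag tiny, harmless deviations, so a histogram or Q-Q plot is a useful companion to the p-values.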

Student's Unpaired t-Test (SciPy)

Now for the fun stuff. This is the SciPy implementation of the t-Test. The ttest_ind() function from SciPy has all of its parameters set up for the Independent Two-Sided t-test by default, so you really only need to pass in the data for the two groups. There is an optional parameter, equal_var, which defaults to True but when set to False will perform Welch's t-test for unequal variances. The function outputs the test statistic and p-value, and we can compare the p-value against our pre-determined alpha.
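A short sketch of the SciPy call, using hypothetical synthetic data (the means, standard deviation, and seed are my assumptions, so the output will differ from the original notebook):

```python
import numpy as np
from scipy import stats

# Hypothetical synthetic data (illustrative assumptions).
rng = np.random.default_rng(seed=42)
group_a = np.clip(rng.normal(70.0, 10.0, 1571), 0, 100)
group_b = np.clip(rng.normal(75.0, 10.0, 1571), 0, 100)

alpha = 0.05

# equal_var=True gives Student's t-test; equal_var=False would give Welch's t-test.
result = stats.ttest_ind(group_a, group_b, equal_var=True)

print(f"t statistic = {result.statistic:.4f}")
print(f"p-value     = {result.pvalue:.4g}")

if result.pvalue < alpha:
    print("Reject the Null Hypothesis: the group means differ significantly.")
else:
    print("Fail to reject the Null Hypothesis: no significant difference detected.")
```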

Student's Unpaired t-Test (statsmodels)

The statsmodels implementation of the t-Test is similar to SciPy but offers more interesting parameter options. In place of SciPy's equal_var, statsmodels offers a usevar parameter which defaults to 'pooled'. If set to 'unequal' this becomes Welch's t-test. There are also options to weight the observations in each group as well as to set the hypothesized difference in means under the Null Hypothesis to something other than 0. I won't use these options here. The statsmodels output, along with the test statistic and p-value, also returns the degrees of freedom, which can be used later on for confidence intervals and effect size.
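Here is a sketch of the statsmodels version, again on hypothetical synthetic data. I'm using ttest_ind from statsmodels.stats.weightstats, with usevar and value left at the defaults described above but spelled out for clarity.

```python
import numpy as np
from statsmodels.stats.weightstats import ttest_ind

# Hypothetical synthetic data (illustrative assumptions).
rng = np.random.default_rng(seed=42)
group_a = np.clip(rng.normal(70.0, 10.0, 1571), 0, 100)
group_b = np.clip(rng.normal(75.0, 10.0, 1571), 0, 100)

# usevar="pooled" is Student's t-test; usevar="unequal" would be Welch's t-test.
# value=0 is the hypothesized difference in means under the Null Hypothesis.
t_stat, p_value, dof = ttest_ind(
    group_a,
    group_b,
    alternative="two-sided",
    usevar="pooled",
    value=0,
)

print(f"t statistic        = {t_stat:.4f}")
print(f"p-value            = {p_value:.4g}")
print(f"degrees of freedom = {dof:.0f}")
```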

Student's Unpaired t-Test (pingouin)

The pingouin implementation of the t-Test is by far my favorite. There's a lot of bang for the buck here; who doesn't love a good one-liner? This implementation returns all of the relevant info you'd need for your study: degrees of freedom, confidence intervals, effect size, Bayes factor, and power. Whoa!

Difference in Means and 95% Confidence Interval

Ok... so now that we know there is a significant difference, what else can we gather? A natural place to start is investigating what the difference in means actually is. After that we can calculate a 95% confidence interval around this difference to quantify the uncertainty. The pingouin package has already given us this information, but we can also code it up ourselves. Using the t.ppf() function from SciPy to access the t-distribution, we can see that the difference in means between our two groups is -5.29, with a 95% confidence interval of [-5.98, -4.61]. A quick check validates both the pingouin result and our code here.
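A hand-rolled version of this calculation could look like the sketch below. It uses the pooled-variance standard error that matches Student's test and t.ppf() for the critical value. The data is hypothetical, so the printed numbers will not reproduce the -5.29 and [-5.98, -4.61] quoted above.

```python
import numpy as np
from scipy import stats

# Hypothetical synthetic data (illustrative assumptions).
rng = np.random.default_rng(seed=42)
group_a = np.clip(rng.normal(70.0, 10.0, 1571), 0, 100)
group_b = np.clip(rng.normal(75.0, 10.0, 1571), 0, 100)

confidence = 0.95
n_a, n_b = len(group_a), len(group_b)

# Point estimate: difference in sample means (A minus B).
diff = group_a.mean() - group_b.mean()

# Pooled variance and the standard error of the difference.
dof = n_a + n_b - 2
pooled_var = ((n_a - 1) * group_a.var(ddof=1) + (n_b - 1) * group_b.var(ddof=1)) / dof
std_err = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# Two-sided critical value from the t-distribution.
t_crit = stats.t.ppf(1 - (1 - confidence) / 2, dof)

ci_low = diff - t_crit * std_err
ci_high = diff + t_crit * std_err

print(f"Difference in means = {diff:.2f}")
print(f"{confidence:.0%} CI = [{ci_low:.2f}, {ci_high:.2f}]")
```

If the interval excludes 0, that agrees with a significant two-sided test at the matching alpha.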

Effect Size

We know there is a significant difference between the two groups' means, but the Effect Size will tell us how large that difference is relative to the spread of the data. The appropriate Effect Size to use for this test is Cohen's d, which when calculated yields a magnitude of 0.46. There are a few different rules of thumb for interpreting Cohen's d, but here I'm using a granular interpretation. This translates to a small difference in means, and we already know it is statistically significant.
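Cohen's d can be computed by hand as the difference in means divided by the pooled standard deviation, as in this sketch. The interpret_cohens_d() helper and its cut-offs (0.2, 0.5, 0.8) are my assumption based on Cohen's classic rule of thumb, and may not be the exact granular scale used in the original notebook. The data is hypothetical as well.

```python
import numpy as np

# Hypothetical synthetic data (illustrative assumptions).
rng = np.random.default_rng(seed=42)
group_a = np.clip(rng.normal(70.0, 10.0, 1571), 0, 100)
group_b = np.clip(rng.normal(75.0, 10.0, 1571), 0, 100)


def cohens_d(x, y):
    """Cohen's d for two independent samples using the pooled standard deviation."""
    n_x, n_y = len(x), len(y)
    pooled_var = ((n_x - 1) * np.var(x, ddof=1) + (n_y - 1) * np.var(y, ddof=1)) / (
        n_x + n_y - 2
    )
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)


def interpret_cohens_d(d):
    """Hypothetical helper: label |d| using Cohen's classic cut-offs."""
    magnitude = abs(d)
    if magnitude < 0.2:
        return "negligible"
    if magnitude < 0.5:
        return "small"
    if magnitude < 0.8:
        return "medium"
    return "large"


d = cohens_d(group_a, group_b)
print(f"Cohen's d = {d:.2f} ({interpret_cohens_d(d)} effect)")
```

The sign of d only reflects the order in which the groups are passed in; the magnitude is what gets interpreted.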

Visualization

I'll end with an appropriate visualization for comparing the means between two groups - an overlaid KDE (Kernel Density Estimation) distribution plot for each group. I've clipped the scores at 0 and 100 since they are test scores - no extra credit! I've also added a vertical line at the mean of each group to make them easy to compare.
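A plot along these lines can be sketched with matplotlib and SciPy's gaussian_kde. This is one possible approach and not necessarily the plotting library used in the original notebook; the group labels, colors, and data parameters are hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical synthetic data (illustrative assumptions).
rng = np.random.default_rng(seed=42)
group_a = np.clip(rng.normal(70.0, 10.0, 1571), 0, 100)
group_b = np.clip(rng.normal(75.0, 10.0, 1571), 0, 100)

fig, ax = plt.subplots(figsize=(8, 4))

# Evaluate each density only on the valid score range (0 to 100).
grid = np.linspace(0, 100, 500)

for scores, label, color in (
    (group_a, "Method A", "tab:blue"),
    (group_b, "Method B", "tab:orange"),
):
    density = gaussian_kde(scores)(grid)
    ax.plot(grid, density, color=color, label=f"{label} (mean = {scores.mean():.1f})")
    ax.fill_between(grid, density, color=color, alpha=0.2)
    # Dashed vertical line at the group mean for easy comparison.
    ax.axvline(scores.mean(), color=color, linestyle="--")

ax.set_xlim(0, 100)
ax.set_xlabel("Test score")
ax.set_ylabel("Density")
ax.set_title("Test score distributions by teaching method")
ax.legend()
fig.tight_layout()
```

In a notebook the figure renders automatically; in a script, call plt.show() or fig.savefig() with a file name of your choice afterwards.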