The Z-Test for Proportions with SciPy and statsmodels
For a long time now I've wanted to catalogue some of the more common (and less common) statistical tests used for hypothesis testing, primarily focusing on tests used in the context of Experimental Design and A/B Testing. Rather than just provide code for the statistical test itself, I'll try to provide code that's useful from end to end, as well as compare implementations from various packages. Please have a look at my GitHub repo for all of the relevant code and other statistical tests (a work in progress).
GitHub Repo & Notebook: https://github.com/sam-tritto/statistical-tests/blob/main/Z_Test.ipynb
OUTLINE
The Z-Test for Proportions
Other Important Considerations
Setting up the Python Environment
Power Analysis
Synthetic Data
Check Assumptions
The Z-Test (statsmodels)
Difference in Proportions and 95% Confidence Interval
Effect Size
Odds Ratio
Visualization
Further Reading & Resources
The Z-Test for Proportions
The Z-Test is a hypothesis test specifically designed for binary (success/failure) data. It is used to determine whether the proportion of successes in a binary outcome differs significantly between two groups. This test works well for large sample sizes, but only works with two distinct independent groups (unlike the Chi-Squared Test, which can be run with multiple independent groups). At the heart of this test is the Normal distribution, which explains why we need a large sample size to run it. In practice it's not that common of a test to run: since a Chi-Squared distribution with one degree of freedom is just the square of a standard Normal distribution, this test will return the same p-value as a Chi-Squared test for proportions in the two-tailed case with two independent groups. This test is also equivalent to a Z-test for means when the data is encoded as binary variables; however, it doesn't require knowledge of the population's standard deviation, since a proportion determines its own variance!
Here, I'll be using the test to compare proportions of the Purchase Rate between two independent groups.
Other Important Considerations
For A/B Tests and much of Experimental Design the statistical tests are often only one small step of the process. Here I'll outline some other considerations that I'm not accounting for in this tutorial.
Randomization - Randomly assigning participants to the different groups to reduce bias and ensure the groups being compared are similar in all ways except the treatment.
Multiple Tests & Correcting p-values - If the same hypothesis is being tested multiple times or simultaneously, then we need to apply corrections such as a Bonferroni Correction to control the family-wise error rate and reduce the likelihood of false positives.
One-Tailed vs Two-Tailed - In a One-Tailed test we're interested in detecting an effect in a specific direction, either positive or negative, while in a Two-Tailed test we're interested in detecting any significant difference, regardless of the direction. I'll stick to using Two-Tailed tests unless otherwise noted.
Guardrail Metrics - Monitor additional metrics to ensure there are no unintended consequences of implementing the new changes. These metrics act as safeguards to protect both the business and users.
Decision to Launch - Consider both statistical and practical significance before deciding to launch the changes.
Setting up the Python Environment
Sometimes setting up a Python environment with different packages can be tricky. Here's a snippet of the packages I'll be using and their pip install commands. If you're comfortable with it, you should use a virtual environment. I'm using an Anaconda environment but chose to pip install anyway. This test is only available through statsmodels, but I'll also use SciPy a little bit for working with the Normal distribution.
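As a rough sketch, the installs and imports might look something like this (matplotlib is my own assumption here, for the visualization at the end):

```python
# pip install numpy pandas scipy statsmodels matplotlib

import numpy as np
import pandas as pd
from scipy.stats import norm
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportions_ztest, proportion_effectsize
import matplotlib.pyplot as plt
```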
Power Analysis
Before collecting your data, often a first step is to determine the number of samples you'll need to collect for each group. This is done by balancing an equation in what's called a Power Analysis. There are different equations for different statistical tests; for the Z-Test I'll be using the NormalIndPower() class from statsmodels, first to plot the samples needed at various effect sizes. Then I'll circle back, choose an Effect Size, and calculate the appropriate sample size needed per group. This test can be used with unequal sample sizes, so I'm going to estimate that I'll be able to grab 1.2 times as many samples for Group A as for Group B.
The variables in the equation are:
alpha = The significance level or probability of rejecting the Null Hypothesis when it's actually true.
A value between 0.0 and 1.0, typically 0.05 or 5% is used.
power = The statistical power or probability of correctly rejecting the Null Hypothesis when it's false.
A value between 0.0 and 1.0, typically 0.8 or 80% is used.
effect size = The magnitude of the difference between groups that we'd like to measure (Cohen's h).
A value between 0.0 and 1.0; Cohen's h = 0.2 is a small effect, 0.5 is a moderate effect, and 0.8 is a large effect.
samples = The number of samples needed per group.
One of these variables must be set to None, and the equation will then be solved for it. Since we want the number of samples needed per group, that's the variable we leave as None, as sketched below.
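Here's a minimal sketch of the solve step, assuming statsmodels' NormalIndPower. One thing to keep in mind: statsmodels defines ratio as nobs2 / nobs1 and solve_power() returns nobs1, so which group's sample size you get back depends on how you map the two groups onto nobs1 and nobs2:

```python
import numpy as np
from statsmodels.stats.power import NormalIndPower

# Solve for sample size at a very small effect size (Cohen's h = 0.05).
# statsmodels defines ratio = nobs2 / nobs1, so with ratio = 1.2 the group
# mapped to nobs2 gets 1.2x the samples of the group mapped to nobs1.
power_analysis = NormalIndPower()
nobs1 = power_analysis.solve_power(effect_size=0.05,       # Cohen's h
                                   nobs1=None,             # solve for this
                                   alpha=0.05,             # significance level
                                   power=0.8,              # statistical power
                                   ratio=1.2,              # nobs2 / nobs1
                                   alternative='two-sided')
print(f"nobs1: {np.ceil(nobs1):,.0f}, nobs2: {np.ceil(1.2 * nobs1):,.0f}")
```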
So in order to measure an Effect Size of 0.05 (very small), I'll need to gather about 5,755 samples for Group A and 4,795 for Group B, given a 0.05 alpha and 0.8 statistical power. Or in other words: in order to measure a very small difference (Effect Size) between groups, I'll need to gather 5,755 samples for Group A and 4,795 for Group B, given a 5% probability of saying there is a difference in proportions when there really is none, and an 80% probability of saying there is a difference in proportions when there really is one.
Synthetic Data
Here I'll use numpy to generate binary data with slightly different success probabilities for each group, since we already know how many samples we'd like to gather for each. Because the sample sizes are uneven, I'll fill the empty rows with Nulls.
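A sketch of that generation step; the DataFrame layout, column names, seed, and the "true" purchase rates of 0.50 and 0.55 are all hypothetical choices of mine, picked only so the groups differ slightly:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)    # arbitrary seed for reproducibility

n_A, n_B = 5755, 4795              # sample sizes from the power analysis
purchases_A = rng.binomial(n=1, p=0.50, size=n_A)   # hypothetical rate for A
purchases_B = rng.binomial(n=1, p=0.55, size=n_B)   # hypothetical rate for B

# Pad the smaller group with NaN so both columns fit in one DataFrame
df = pd.DataFrame({
    "group_A": purchases_A.astype(float),
    "group_B": np.concatenate([purchases_B.astype(float),
                               np.full(n_A - n_B, np.nan)]),
})
```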
Check Assumptions
Here we'll just check that there are at least 30 data points in each group and that the data is binary. The Independence assumption is a little more difficult to check programmatically.
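A simple sketch of those checks, using the df from the previous snippet:

```python
# Check each group is binary and has at least 30 observations
for col in ["group_A", "group_B"]:
    values = df[col].dropna()
    assert set(values.unique()) <= {0.0, 1.0}, f"{col} is not binary"
    assert len(values) >= 30, f"{col} has fewer than 30 data points"
print("Assumption checks passed")
```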
The Z-Test (statsmodels)
Now for the fun stuff. This is the statsmodels implementation of the Z-Test. The proportions_ztest() function requires the total number of successes for Groups A and B and the total number of samples for Groups A and B, both as numpy arrays. You can see in the results that, as expected, the low p-value indicates the proportions of our two groups are significantly different.
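A sketch of that call, again assuming the df built above:

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Total successes and total samples per group, as numpy arrays
successes = np.array([df["group_A"].sum(), df["group_B"].sum()])
n_obs = np.array([df["group_A"].count(), df["group_B"].count()])

z_stat, p_value = proportions_ztest(count=successes, nobs=n_obs,
                                    alternative='two-sided')
print(f"z-statistic: {z_stat:.3f}, p-value: {p_value:.4f}")
```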
Difference in Proportions and 95% Confidence Interval
Ok... so now that we know there is a significant difference, what else can we gather? A natural place to start is investigating what the difference in proportions actually is. After that we can calculate a 95% confidence interval around this difference to quantify the uncertainty. Using the norm() function from SciPy to access the Normal distribution, we can see that the difference in proportions between our two groups is -0.054 or -5.4%, and lies in the 95% confidence interval [-0.07, -0.038].
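A sketch of that calculation, using norm.ppf() from SciPy for the two-sided critical value:

```python
import numpy as np
from scipy.stats import norm

p_A, p_B = df["group_A"].mean(), df["group_B"].mean()  # .mean() skips the NaN padding
n_A, n_B = df["group_A"].count(), df["group_B"].count()

diff = p_A - p_B
se = np.sqrt(p_A * (1 - p_A) / n_A + p_B * (1 - p_B) / n_B)
z_crit = norm.ppf(0.975)   # two-sided 95% critical value, ~1.96

lower, upper = diff - z_crit * se, diff + z_crit * se
print(f"Difference: {diff:.3f}, 95% CI: [{lower:.3f}, {upper:.3f}]")
```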
Effect Size
We know there is a significant difference between the two groups' proportions, but the Effect Size will tell us how strong the association is. The appropriate Effect Size for this test is Cohen's h, which when calculated yields 0.09. By the usual rule of thumb this translates to no difference in proportions, which seems to contradict our Z-Test results; rather than interpret this as "no difference" or "no effect", under a different rule of thumb it might be seen as a "very small" effect size.
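statsmodels exposes Cohen's h for proportions through proportion_effectsize(); a sketch, reusing p_A and p_B from the previous snippet:

```python
from statsmodels.stats.proportion import proportion_effectsize

# Cohen's h: the difference between arcsine-transformed proportions
h = proportion_effectsize(p_A, p_B)   # p_A, p_B from the previous snippet
print(f"Cohen's h: {abs(h):.2f}")
```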
Odds Ratio
The Odds Ratio is another Effect Size that is appropriate for binary data. Since our data is binary we can use it here to get a relative measure of the distance between the two groups. We can interpret it as: the odds of a Purchase in Group B are 1.25 times the odds in Group A.
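Since the odds ratio is just the ratio of each group's odds of success, it can be computed directly from the sample proportions; a sketch:

```python
# Odds of a Purchase in each group, then their ratio (B relative to A)
odds_A = p_A / (1 - p_A)
odds_B = p_B / (1 - p_B)
odds_ratio = odds_B / odds_A
print(f"Odds Ratio (B vs A): {odds_ratio:.2f}")
```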
Visualization
I'll end with an appropriate visualization for comparing the proportions between two groups - a stacked percentage bar for each group. I'm not sure I love this one since the two groups have different sample sizes, but because each bar is normalized to 100%, it makes clear that we're comparing proportions rather than raw counts.
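Here's a sketch of one way to draw that with matplotlib, reusing the sample proportions from above:

```python
import matplotlib.pyplot as plt

# Stacked 100% bars: Purchase vs No Purchase share within each group
groups = ["Group A", "Group B"]
purchase_pct = [p_A * 100, p_B * 100]          # from the earlier snippet
no_purchase_pct = [100 - pct for pct in purchase_pct]

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(groups, purchase_pct, label="Purchase")
ax.bar(groups, no_purchase_pct, bottom=purchase_pct, label="No Purchase")
ax.set_ylabel("Percentage of Group (%)")
ax.set_title("Purchase Rate by Group")
ax.legend()
plt.tight_layout()
plt.show()
```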