The Z-Test for Proportions with SciPy and statsmodels

For a long time now I've wanted to catalogue some of the more common (and less common) statistical tests used for hypothesis testing, focusing primarily on tests used in Experimental Design and A/B Testing. Rather than only provide code for the statistical test itself, I'll also try to provide code that is useful from end to end and compare implementations from various packages. Please have a look at my GitHub repo for all of the relevant code and other statistical tests (a work in progress).

GitHub Repo & Notebook:



The Z-Test for Proportions

The Z-Test for Proportions is a hypothesis test designed for binary (success/failure) data. It is used to determine whether the proportion of successes in a binary outcome differs significantly between two groups. In the form used here it compares exactly two independent groups (unlike the Chi-Squared Test, which can be run with multiple independent groups). At the heart of this test is the Normal distribution: the test relies on the sample proportions being approximately Normally distributed, which is why it needs a large sample size. This is not that common of a test to run. Since a Chi-Squared distribution with one degree of freedom is the distribution of a squared standard Normal variable, this test returns the same p-value as a Chi-Squared test for proportions (without continuity correction) in the two-tailed case for two independent groups. It is also equivalent to a Z-test for means applied to data encoded as binary (0/1) variables; however, it doesn't require prior knowledge of the population's standard deviation, because for binary data the variance is determined by the proportion itself.
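
To make the link to the Chi-Squared test concrete, here is a minimal sketch that runs both tests on the same 2x2 table and compares the results. The success counts and sample sizes are hypothetical values chosen purely for illustration, and the comparison assumes the Chi-Squared test is run without the continuity correction.

```python
import numpy as np
from scipy.stats import chi2_contingency
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical counts, for illustration only (not the article's data)
successes = np.array([1727, 1678])   # purchases in Group A, Group B
n_obs = np.array([5755, 4795])       # samples in Group A, Group B

# Two-sided Z-test for two proportions (pooled variance by default)
z_stat, z_p = proportions_ztest(count=successes, nobs=n_obs)

# Same data as a 2x2 contingency table: rows = groups, cols = success/failure
table = np.column_stack([successes, n_obs - successes])
chi2_result = chi2_contingency(table, correction=False)
chi2_stat, chi2_p = chi2_result[0], chi2_result[1]

print(f"z^2 = {z_stat**2:.4f}, chi2 = {chi2_stat:.4f}")
print(f"z-test p = {z_p:.3e}, chi2 p = {chi2_p:.3e}")
```

The squared z statistic should match the Chi-Squared statistic, and the two p-values should agree up to floating-point error.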

Here, I'll be using the test to compare proportions of the Purchase Rate between two independent groups.

Other Important Considerations

For A/B Tests and much of Experimental Design, the statistical test is often only one small step of the process. Here I'll outline some other considerations that I'm not accounting for in this tutorial.

Setting up the Python Environment

Sometimes setting up a Python environment with different packages can be tricky. Here's a snippet of the packages I'll be using and their pip install commands. If you're comfortable with it, you should use a virtual environment. I'm using an Anaconda environment but chose to pip install anyway. This test is only available through statsmodels, but I'll also use SciPy a little for working with the Normal distribution.
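
As a rough sketch of such a setup, the snippet below lists a pip command in a comment and then imports the packages and prints their versions. NumPy, SciPy and statsmodels are the packages named in this article; including pandas is an assumption made for the data-handling examples.

```python
# Install from a shell first (package names as published on PyPI):
#   pip install numpy pandas scipy statsmodels
import numpy as np
import pandas as pd
import scipy
import statsmodels

# Print the installed versions so results can be reproduced later
versions = {
    "numpy": np.__version__,
    "pandas": pd.__version__,
    "scipy": scipy.__version__,
    "statsmodels": statsmodels.__version__,
}
for name, version in versions.items():
    print(f"{name}: {version}")
```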

Power Analysis

Before collecting your data, a common first step is to determine the number of samples you'll need to collect for each group. This is done by solving an equation in what's called a Power Analysis. There are different equations for different statistical tests. For the Z-Test I'll be using the NormalIndPower class from statsmodels, first to plot the samples needed for various effect sizes. Then I'll circle back, choose an Effect Size, and calculate the sample sizes needed per group. This test can be used with unequal sample sizes, so I'm going to estimate that I'll be able to collect 1.2 times as many samples for Group A as for Group B.

The variables in the equation are:

Significance level (alpha): A value between 0.0 and 1.0; typically 0.05 (5%) is used.

Statistical power (power): A value between 0.0 and 1.0; typically 0.8 (80%) is used.

Effect Size (effect_size): Cohen's h, where 0.2 is a small effect, 0.5 is a moderate effect, and 0.8 is a large effect.

One of these variables must be None to solve the equation. To solve for the number of samples needed (nobs1), set that variable to None.

So in order to detect an Effect Size of 0.05 (very small), I'll need to gather about 5,755 samples for Group A and 4,795 for Group B, given an alpha of 0.05 and a statistical power of 0.8. In other words, to detect a very small difference between groups, I'll need about 5,755 samples for Group A and 4,795 for Group B, accepting a 5% probability of saying there is a difference in proportions when there really is none, and an 80% probability of saying there is a difference in proportions when there really is one.
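
The sample-size calculation described above can be sketched roughly as follows. Two things here are assumptions rather than the article's own code: Group A is treated as "sample 1", and the ratio is passed as 1 / 1.2 because statsmodels defines ratio as nobs2 / nobs1. The exact figures depend on which group is passed as sample 1 and how ratio is set, so they may not match the numbers quoted above.

```python
from statsmodels.stats.power import NormalIndPower

effect_size = 0.05   # Cohen's h we want to be able to detect
alpha = 0.05         # significance level
power = 0.8          # desired statistical power
a_to_b = 1.2         # Group A is expected to have 1.2x the samples of Group B

analysis = NormalIndPower()

# Assumption: Group A is "sample 1". statsmodels defines ratio = nobs2 / nobs1,
# so with Group B as "sample 2" the ratio is 1 / 1.2.
ratio = 1 / a_to_b
n_a = analysis.solve_power(
    effect_size=effect_size,
    nobs1=None,            # the unknown we are solving for
    alpha=alpha,
    power=power,
    ratio=ratio,
    alternative="two-sided",
)
n_b = n_a * ratio

# Round-trip check: the power achieved with these sizes should be close to the target
achieved = analysis.power(
    effect_size=effect_size, nobs1=n_a, alpha=alpha,
    ratio=ratio, alternative="two-sided",
)

print(f"Group A: {float(n_a):.1f}, Group B: {float(n_b):.1f}")
print(f"Achieved power: {float(achieved):.3f}")
```

Re-computing the power from the solved sample sizes is a cheap way to confirm that the ratio was specified in the intended direction.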

Synthetic Data

Here I'll use NumPy to generate binary data with slightly different success probabilities for each group, using the sample sizes we already calculated. Since the sample sizes are uneven, I'll fill the empty rows of the shorter column with NaN values.
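
A minimal sketch of this step is below. The purchase probabilities (0.30 and 0.35), the random seed, and the column names are hypothetical choices for illustration; the sample sizes are the ones quoted in the power analysis.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)   # fixed seed so the example is repeatable

n_a, n_b = 5755, 4795             # sample sizes quoted in the power analysis
p_a, p_b = 0.30, 0.35             # hypothetical purchase probabilities

# Each observation is a single Bernoulli trial: 1 = purchase, 0 = no purchase
group_a = rng.binomial(n=1, p=p_a, size=n_a)
group_b = rng.binomial(n=1, p=p_b, size=n_b)

# Series of different lengths are aligned on their index, so the shorter
# column is padded with NaN automatically
df = pd.DataFrame({
    "group_a": pd.Series(group_a),
    "group_b": pd.Series(group_b),
})

print(df.describe().loc[["count", "mean"]])
```

Note that the NaN padding turns both columns into floats, so any later counting should drop the missing values first.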

Check Assumptions

Here we'll just check that there are at least 30 data points in each of the categories and that the data is binary. The Independence assumption is a little more difficult to check programmatically.
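
These two checks could look something like the sketch below. The helper name check_ztest_assumptions and its return format are made up for this example, and "each of the categories" is read here as the successes and the failures within a group.

```python
import numpy as np
import pandas as pd

def check_ztest_assumptions(series, min_count=30):
    """Hypothetical helper: check that one group's data is binary and large enough."""
    values = series.dropna()                       # ignore NaN padding
    is_binary = bool(values.isin([0, 1]).all())
    successes = int((values == 1).sum())
    failures = int((values == 0).sum())
    return {
        "is_binary": is_binary,
        "successes": successes,
        "failures": failures,
        "enough_data": successes >= min_count and failures >= min_count,
    }

# Small illustrative series: 40 successes, 40 failures, 5 padded NaN rows
example = pd.Series([1, 0] * 40 + [np.nan] * 5)
report = check_ztest_assumptions(example)
print(report)
```

In practice the helper would be called once per group column.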

The Z-Test (statsmodels)

Now for the fun stuff. This is the statsmodels implementation of the Z-Test. The proportions_ztest() function requires the total number of successes for Group A and B and the total number of samples for Group A and B, both as NumPy arrays. In the results, the low p-value indicates that the proportions differ significantly between our two groups, which is expected given how the data was generated.
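
Here is a minimal sketch of that call. The success counts are hypothetical and only serve to show the shape of the inputs and outputs; they are not the article's data.

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical counts, for illustration only
successes = np.array([1727, 1678])   # purchases in Group A, Group B
n_obs = np.array([5755, 4795])       # total samples in Group A, Group B

# Two-sided test of H0: the two proportions are equal
z_stat, p_value = proportions_ztest(
    count=successes, nobs=n_obs, alternative="two-sided"
)

alpha = 0.05
print(f"z = {z_stat:.3f}, p-value = {p_value:.3e}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

The sign of the z statistic follows the order of the arrays: a negative value means the first group's proportion is lower than the second's.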

Difference in Proportions and 95% Confidence Interval

Ok... so now that we know there is a significant difference, what else can we gather? A natural place to start is investigating what the difference in proportions actually is. After that we can calculate a 95% confidence interval around this difference to quantify the uncertainty. Using the norm distribution object from SciPy to access the Normal distribution, we can see that the difference in proportions between our two groups is -0.054 (-5.4 percentage points), with a 95% confidence interval of [-0.070, -0.038].
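
One common way to compute such an interval is the Wald interval with an unpooled standard error, sketched below. Whether the figures above were produced with exactly this formula is an assumption, and the counts are hypothetical, so the output will not reproduce those numbers.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical counts, for illustration only
succ_a, n_a = 1727, 5755
succ_b, n_b = 1678, 4795

p_a, p_b = succ_a / n_a, succ_b / n_b
diff = p_a - p_b                      # difference in proportions (A minus B)

# Unpooled standard error of the difference (Wald interval)
se = np.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)

confidence = 0.95
z_crit = norm.ppf(1 - (1 - confidence) / 2)   # about 1.96 for 95%

ci_low, ci_high = diff - z_crit * se, diff + z_crit * se
print(f"difference = {diff:.4f}, 95% CI = [{ci_low:.4f}, {ci_high:.4f}]")
```

If the interval excludes zero, that is consistent with a significant two-sided test at the matching alpha.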

Effect Size

We know there is a significant difference between the two groups' proportions, but the Effect Size tells us how large that difference is. The appropriate Effect Size for this test is Cohen's h, which here comes out to 0.09. Under Cohen's rule of thumb this falls below the 0.2 threshold for a small effect. That does not contradict the Z-test result: with samples this large, even a very small difference can be statistically significant. Rather than reading it as "no difference" or "no effect", it is better described as a "very small" effect size.
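
Cohen's h is the difference between the arcsine-transformed proportions, and it can be sketched as below, both by hand and with the statsmodels helper proportion_effectsize. The proportions are hypothetical, and label_cohens_h is a made-up helper that applies the 0.2 / 0.5 / 0.8 rule of thumb.

```python
import numpy as np
from statsmodels.stats.proportion import proportion_effectsize

p_a, p_b = 0.30, 0.35   # hypothetical proportions, for illustration only

# Cohen's h by hand: difference of arcsine-transformed proportions
h_manual = 2 * np.arcsin(np.sqrt(p_a)) - 2 * np.arcsin(np.sqrt(p_b))

# The same quantity from statsmodels
h_statsmodels = proportion_effectsize(p_a, p_b)

def label_cohens_h(h):
    """Hypothetical helper: map |h| onto Cohen's rule-of-thumb labels."""
    size = abs(h)
    if size < 0.2:
        return "below small"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"

print(f"h (manual) = {h_manual:.4f}, h (statsmodels) = {h_statsmodels:.4f}")
print(f"rule of thumb: {label_cohens_h(h_manual)}")
```

The sign only reflects the order of the two groups; the rule of thumb is applied to the magnitude.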

Odds Ratio

The Odds Ratio is another Effect Size that is appropriate for binary data, and it gives a relative comparison between the two groups. We can interpret it as follows: the odds of a Purchase in Group B are 1.25 times the odds in Group A. Note that this is a ratio of odds, not of probabilities, so it should not be read as Group B being 1.25 times as likely to Purchase.
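
To illustrate the distinction between odds and probabilities, the sketch below computes both the Odds Ratio and the ratio of the proportions themselves (the risk ratio). The proportions are hypothetical and chosen only for illustration.

```python
p_a, p_b = 0.30, 0.35   # hypothetical purchase rates, for illustration only

def odds(p):
    """Convert a probability into odds: successes per failure."""
    return p / (1 - p)

# Odds Ratio: odds of a purchase in Group B relative to Group A
odds_ratio = odds(p_b) / odds(p_a)

# Risk ratio: the ratio of the purchase probabilities themselves
risk_ratio = p_b / p_a

print(f"odds A = {odds(p_a):.4f}, odds B = {odds(p_b):.4f}")
print(f"odds ratio = {odds_ratio:.4f}, risk ratio = {risk_ratio:.4f}")
```

The two ratios are close only when both proportions are small; for purchase rates like these, the Odds Ratio is noticeably larger than the risk ratio.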


I'll end with a visualization suited to comparing proportions between two groups: a stacked percentage bar for each group. I'm not sure I love this one, since percentage bars hide the fact that the two groups have different sample sizes, but the difference in proportions itself is clear to see.
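
A chart along these lines could be sketched with matplotlib as follows. Using matplotlib is an assumption (the plotting library is not named in this article), and the purchase rates are hypothetical. Putting each group's sample size in its tick label is one way to keep the unequal sizes visible.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical values, for illustration only
n_obs = np.array([5755, 4795])
purchase_pct = np.array([0.30, 0.35]) * 100
no_purchase_pct = 100 - purchase_pct

# Show the sample size in each label, since percentages hide it
labels = [f"Group A (n={n_obs[0]:,})", f"Group B (n={n_obs[1]:,})"]

fig, ax = plt.subplots(figsize=(5, 4))
ax.bar(labels, purchase_pct, label="Purchase")
# Stack the second segment on top of the first so each bar totals 100%
ax.bar(labels, no_purchase_pct, bottom=purchase_pct, label="No purchase")

ax.set_ylabel("Share of group (%)")
ax.set_ylim(0, 100)
ax.set_title("Purchase rate by group")
ax.legend(loc="upper right")
fig.tight_layout()
```

Call plt.show() afterwards, or fig.savefig() with a file name, to display or export the figure.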