The Binomial Test for Proportions with SciPy, statsmodels, and pingouin

For a long time now I've wanted to catalogue some of the more common (and less common) statistical tests used for hypothesis testing, primarily focusing on tests used in the context of Experimental Design and A/B Testing. Rather than just provide code for the statistical test, I'll also try to provide code that is useful from end to end, and compare implementations from various packages. Please have a look in my GitHub repo for all of the relevant code and other statistical tests (a work in progress).

GitHub Repo & Notebook: https://github.com/sam-tritto/statistical-tests/blob/main/Binomial_Test.ipynb

OUTLINE

The Binomial Test for Proportions
Other Important Considerations
Setting up the Python Environment
Power Analysis
Synthetic Data
Check Assumptions
Binomial Test (SciPy)
Binomial Test (statsmodels)
Binomial Test - Bayes Factor (pingouin)
Difference in Proportions and 95% Confidence Interval
Effect Size
Odds Ratio
Visualization
Further Reading & Resources

The Binomial Test for Proportions

The Binomial Test is a hypothesis test specifically designed for binary (success/failure) data. It is used to determine whether the proportion of successes in a binary outcome differs significantly from a known historical or hypothesized value. Because it is an exact test, it works well for small sample sizes (unlike the Z-test and Chi-Squared Test, which rely on large-sample approximations). It only handles an outcome with 2 categories, and it compares a single sample against a fixed reference proportion (unlike the Chi-Squared Test, which can be run with multiple independent groups).

Here, I'll focus on using the test to compare proportions in an A/B testing situation: comparing the Click Through Rate (CTR) from a new ad campaign to the known Historical CTR.

Other Important Considerations

For A/B Tests and much of Experimental Design the statistical tests are often only one small step of the process. Here I'll outline some other considerations that I'm not accounting for in this tutorial.

Randomization - Randomly assigning participants to the different groups to reduce bias and ensure the groups being compared are similar in all ways except the treatment.

Multiple Tests & Correcting p-values - If the same hypothesis is being tested multiple times or simultaneously, then we need to apply a correction such as the Bonferroni Correction to control the family-wise error rate and reduce the likelihood of false positives.

One-Tailed vs Two-Tailed - In a One-Tailed test we're interested in detecting an effect in a specific direction, either positive or negative, while in a Two-Tailed test we're interested in detecting any significant difference, regardless of the direction. I'll stick to using Two-Tailed tests unless otherwise noted.

Guardrail Metrics - Monitor additional metrics to ensure there are no unintended consequences of implementing the new changes. These metrics act as safeguards to protect both the business and users.

Decision to Launch - Consider both statistical and practical significance before determining to launch the changes.

Setting up the Python Environment

Sometimes setting up a Python environment with different packages can be tricky. Here's a snippet of the packages I'll be using and their pip install commands. If you're comfortable with it, you should use a virtual environment. I'm using an Anaconda environment but chose to pip install anyway. As I'll point out later, there is an error loading binom_test() from statsmodels due to a dependency error with SciPy. You might want to try an earlier version of SciPy.
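Since version mismatches are the source of the statsmodels issue, it can help to print what is actually installed before running anything else. This is a minimal sketch using only the standard library; the package list is my assumption of what the notebook needs (pandas and matplotlib are not named explicitly above).

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed package list (pip distribution names); adjust to your own environment.
packages = ["numpy", "pandas", "scipy", "statsmodels", "pingouin", "matplotlib"]

versions = {}
for name in packages:
    try:
        versions[name] = version(name)
    except PackageNotFoundError:
        versions[name] = None  # package is not installed in this environment

for name, installed in versions.items():
    print(f"{name:12s} {installed or 'not installed'}")
```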

Power Analysis

Since the main idea behind the Binomial Test is testing whether a new success probability is different than a known historical proportion, I'll start by first defining a Historical Click Through Rate.

Before collecting your data, a first step will often be to determine the number of samples you'll need to collect for each group. This is done by solving an equation for the sample size in what's called a Power Analysis. There are different equations for different statistical tests. For the Binomial Test I'll be using the Normal approximation, implemented in a custom function that I've made - you won't find this anywhere else on the internet.

Since the one benefit of using a Binomial Test over the Z-Test is that it can handle small sample sizes, it seems silly to perform a Power Analysis to determine the number of samples needed, because if you have the ability to collect many samples you'd most likely use a Z-Test anyway. I found this example in an old textbook section: Hypothesis Testing: Categorical Data - Estimation of Sample Size and Power for Comparing Two Binomial Proportions, in Bernard Rosner's Fundamentals of Biostatistics.

I've updated his formula to run on numpy arrays, allowing the exploration of multiple differences at once. This implementation also allows for different sample sizes per group. This mimics real life, where you often have far more historical data, and new data can either cost too much or take too long to collect. I've set this to generate twice as many historical samples.
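To make the idea concrete, here is a rough sketch of a Normal-approximation sample size calculation for two proportions with unequal group sizes. It follows the standard textbook form of the formula as I understand it, and it is not necessarily identical to the function in the notebook, so the numbers it returns may differ slightly from the ones quoted in this post. The function name, the defaults (alpha of 0.05, power of 0.80), and the allocation ratio k are all assumptions made for illustration.

```python
import numpy as np
from scipy.stats import norm


def sample_size_two_proportions(p1, p2, k=1.0, alpha=0.05, power=0.80):
    """Normal-approximation sample sizes for a two-sided test of two proportions.

    p1, p2 : proportions in group 1 and group 2 (scalars or numpy arrays)
    k      : allocation ratio n2 / n1 (k=2 means group 2 is twice as large)
    Returns (n1, n2) as integers, rounded up.
    """
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    q1, q2 = 1 - p1, 1 - p2

    # Pooled proportion, weighted by the allocation ratio
    p_bar = (p1 + k * p2) / (1 + k)
    q_bar = 1 - p_bar

    z_alpha = norm.ppf(1 - alpha / 2)  # critical value for the two-sided test
    z_beta = norm.ppf(power)           # quantile for the desired power

    numerator = (
        np.sqrt(p_bar * q_bar * (1 + 1 / k)) * z_alpha
        + np.sqrt(p1 * q1 + p2 * q2 / k) * z_beta
    ) ** 2
    n1 = np.ceil(numerator / (p1 - p2) ** 2)
    n2 = np.ceil(k * n1)
    return n1.astype(int), n2.astype(int)


# Explore several anticipated new-campaign rates against a 14% historical rate,
# with twice as many historical samples (group 2) as new samples (group 1).
anticipated = np.array([0.17, 0.20, 0.23])
n_new, n_historical = sample_size_two_proportions(anticipated, 0.14, k=2)
for rate, a, b in zip(anticipated, n_new, n_historical):
    print(f"new rate {rate:.2f}: {a} new samples, {b} historical samples")
```

Passing an array for one of the proportions is what makes it possible to scan a range of anticipated differences in a single call.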

So according to the above chart, the greater the difference in proportions I'm expecting, the fewer samples I'll need to collect. A similar function will allow me to plug in my anticipated proportion for the New Ad Campaign group. If the Historical group's Click Through Rate was 14%, I'll anticipate a lift of 6 percentage points to 20%. Plugging this in yields that I'll need 910 Historical samples and 455 New Ad Campaign samples.

Synthetic Data

Here I'll use numpy to generate binary data with slightly different probabilities for each group, and we already know we'd like to see 910 samples for the Historical group and 455 for the New Ad Campaign. I'll fill the empty rows with Nulls since the samples are uneven.
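A minimal sketch of this step might look like the following. The click probabilities (0.14 and 0.20), the seed, and the column names are assumptions for illustration, so the resulting proportions will not match the figures reported later in this post.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)  # arbitrary seed, only for reproducibility

# Assumed click probabilities and the sample sizes from the power analysis
historical_ctr, new_ctr = 0.14, 0.20
n_historical, n_new = 910, 455

# One Bernoulli draw per impression: 1 = click, 0 = no click
historical = rng.binomial(n=1, p=historical_ctr, size=n_historical)
new_campaign = rng.binomial(n=1, p=new_ctr, size=n_new)

# Building the frame from Series of unequal length pads the shorter column with NaN
df = pd.DataFrame({
    "historical": pd.Series(historical),
    "new_campaign": pd.Series(new_campaign),
})

print(df.count())  # non-null rows per group
print(df.mean())   # observed click through rate per group (NaN rows are skipped)
```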

Check Assumptions

Here we'll just check that there are at least 5 data points in each of the categories and that the data is binary. The Independence assumption is a little more difficult to check programmatically.
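These two checks can be sketched in a few lines. The helper name and the returned dictionary are hypothetical, and the example data below is made up purely to exercise the function.

```python
import numpy as np


def check_binomial_assumptions(values, min_count=5):
    """Check that the data is binary and that each category has enough observations."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]  # ignore the padded null rows

    is_binary = bool(np.isin(values, [0, 1]).all())
    n_success = int((values == 1).sum())
    n_failure = int((values == 0).sum())

    return {
        "is_binary": is_binary,
        "n_success": n_success,
        "n_failure": n_failure,
        "enough_per_category": n_success >= min_count and n_failure >= min_count,
    }


# Made-up example: 8 clicks, 12 non-clicks, and two padded nulls
example = [1] * 8 + [0] * 12 + [np.nan, np.nan]
print(check_binomial_assumptions(example))
```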

The Binomial Test (SciPy)

Now for the fun stuff. This is the SciPy implementation of the Binomial Test. The binomtest() function requires the total number of successes for the New Ad Campaign, the total number of samples for the New Ad Campaign, and the Historical Click Through Rate. The low p-value in the results suggests that the New Ad Campaign's Click Through Rate is significantly different from the Historical rate, which is as expected.
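As a sketch of what the call looks like, the snippet below runs binomtest() on a hypothetical count of 91 clicks out of 455 impressions against a 14% historical rate. The counts are made up for illustration and are not the notebook's data.

```python
from scipy.stats import binomtest

historical_ctr = 0.14  # the known reference proportion
n_new = 455            # samples in the new ad campaign
clicks_new = 91        # hypothetical number of clicks (20% of 455)

result = binomtest(k=clicks_new, n=n_new, p=historical_ctr, alternative="two-sided")

# Exact (Clopper-Pearson) confidence interval for the new campaign's proportion
ci = result.proportion_ci(confidence_level=0.95, method="exact")

print(f"observed proportion: {clicks_new / n_new:.4f}")
print(f"p-value: {result.pvalue:.6f}")
print(f"95% CI: [{ci.low:.4f}, {ci.high:.4f}]")
```

Note that only the new campaign's counts go into the test; the historical rate enters as a fixed number rather than as a second sample.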

The Binomial Test (statsmodels)

It seems that statsmodels is having some trouble loading the correct binomtest() function from SciPy. It's trying to load binom_test(), the older SciPy function that has since been deprecated and removed in favor of binomtest(). Since statsmodels simply wraps the SciPy implementation, I'm going to skip over this one; I don't think working through the dependency error will add anything to this notebook.
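For readers who want to try it anyway, here is a guarded sketch. Whether it runs depends on the installed statsmodels and SciPy versions, so the call is wrapped in a try/except; the counts are the same hypothetical ones (91 clicks out of 455) and are not from the notebook.

```python
# Guarded attempt at the statsmodels wrapper; this may fail on some version combinations.
try:
    from statsmodels.stats.proportion import binom_test

    p_value = binom_test(count=91, nobs=455, prop=0.14, alternative="two-sided")
    print(f"statsmodels p-value: {p_value:.6f}")
except (ImportError, AttributeError) as exc:
    # ImportError: statsmodels is missing; AttributeError: SciPy no longer has binom_test
    p_value = None
    print(f"statsmodels binom_test unavailable: {exc}")
```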

The Binomial Test - Bayes Factor (pingouin)

This is a cool one. Rather than provide a typical hypothesis test, pingouin provides the Bayes Factor between the two hypotheses. The Bayes Factor quantifies the evidence in favor of the Alternative Hypothesis relative to the Null Hypothesis, where the null hypothesis is that the random variable is binomially distributed with base probability p (the Historical Click Through Rate). This leads to an interesting conclusion that differs from the SciPy test.
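To show what such a Bayes Factor measures, here is a small sketch that computes it by hand with SciPy, assuming a uniform prior on the success probability under the Alternative Hypothesis. My understanding is that this matches the default behavior of pingouin's bayesfactor_binom(), but treat that as an assumption; the helper name is hypothetical and the optional comparison only runs if pingouin is installed.

```python
from scipy.stats import binom


def bayes_factor_binom_uniform(k, n, p):
    """BF10 for H0: theta = p versus H1: theta ~ Uniform(0, 1)."""
    # Under a uniform prior, the binomial likelihood integrates to 1 / (n + 1)
    marginal_h1 = 1.0 / (n + 1)
    likelihood_h0 = binom.pmf(k, n, p)
    return marginal_h1 / likelihood_h0


# Classic illustration: 115 successes in 200 trials against p = 0.5
bf10 = bayes_factor_binom_uniform(k=115, n=200, p=0.5)
print(f"BF10 (by hand): {bf10:.4f}")

# Optional comparison with pingouin, if it is available
try:
    import pingouin as pg

    print(f"BF10 (pingouin): {float(pg.bayesfactor_binom(115, 200, 0.5)):.4f}")
except ImportError:
    print("pingouin is not installed; skipping the comparison")
```

A BF10 below 1 means the data are more likely under the Null Hypothesis than under the Alternative, and values well above 1 mean the opposite.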

Difference in Proportions and 95% Confidence Interval

This is another Normal approximation implementation, but this time of a 95% confidence interval around the difference in proportions, to quantify the uncertainty. We can see that the difference in proportions between our 2 groups is 0.0165, or 1.65 percentage points, and that the 95% confidence interval is [-0.019, 0.052]. Notice that 0 is contained in the interval, which signals that there might not be any significant difference in proportions. This is more in line with our conclusion from the pingouin implementation than with the SciPy implementation.
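A Normal-approximation (Wald) interval for a difference in proportions can be sketched as below. The function name is hypothetical and the counts in the usage example are made up, so the output will not reproduce the interval quoted above.

```python
import math

from scipy.stats import norm


def diff_in_proportions_ci(successes_a, n_a, successes_b, n_b, confidence=0.95):
    """Wald confidence interval for the difference p_a - p_b."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    diff = p_a - p_b

    # Unpooled standard error of the difference
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = norm.ppf(1 - (1 - confidence) / 2)
    return diff, diff - z * se, diff + z * se


# Hypothetical counts: 91/455 clicks in the new campaign, 127/910 in the historical data
diff, lower, upper = diff_in_proportions_ci(91, 455, 127, 910)
print(f"difference: {diff:.4f}, 95% CI: [{lower:.4f}, {upper:.4f}]")
```

Unlike the Binomial Test, this interval treats the historical data as a sample with its own uncertainty, which is one reason the two approaches can disagree.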

Effect Size

We don't really think there is a significant difference between the 2 groups' proportions, but the Effect Size will tell us how large the difference is. The appropriate Effect Size to use for this test is Cohen's h, which when calculated yields 0.044. This is well below the conventional threshold of 0.2 for a small effect, so the difference in proportions between the two groups is negligible.
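Cohen's h is the difference between the arcsine-transformed proportions, which takes only a couple of lines to compute. The function name is hypothetical and the example proportions are the planned rates from the power analysis, not the observed ones.

```python
import math


def cohens_h(p1, p2):
    """Cohen's h: difference between arcsine-transformed proportions."""
    phi1 = 2 * math.asin(math.sqrt(p1))
    phi2 = 2 * math.asin(math.sqrt(p2))
    return phi1 - phi2


# Planned rates from the power analysis: 20% new campaign vs 14% historical
h = cohens_h(0.20, 0.14)
print(f"Cohen's h: {h:.3f}")
```

By Cohen's conventions, an absolute value of about 0.2 is a small effect, 0.5 a medium effect, and 0.8 a large effect.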

Odds Ratio

The Odds Ratio is another Effect Size that is appropriate for binary data. Since our data is binary, we can use it here to get a relative measure of the difference between the two groups. We can interpret it as follows: the odds of a Click in Group A are 1.125 times the odds of a Click in Group B. Note that this is a statement about odds, not about probabilities.
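The calculation itself is short: convert each proportion to odds, then take the ratio. The function name is hypothetical and the inputs are the planned rates from the power analysis rather than the observed data.

```python
def odds_ratio(p_a, p_b):
    """Ratio of the odds of success in group A to the odds of success in group B."""
    odds_a = p_a / (1 - p_a)
    odds_b = p_b / (1 - p_b)
    return odds_a / odds_b


# Planned rates from the power analysis: 20% new campaign vs 14% historical
ratio = odds_ratio(0.20, 0.14)
print(f"odds ratio: {ratio:.3f}")
```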

Visualization

I'll end with a semi-appropriate visualization for comparing the proportions between 2 groups - a stacked percentage bar for each group. I'm not sure I love this one since the two groups have different sample sizes, but you can see clearly that we're comparing the proportions of two sample groups with different sample sizes.
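A stacked percentage bar chart along these lines could be sketched with matplotlib as follows. The function name, labels, and counts are all hypothetical; the sample size is added to each tick label so the difference in group sizes stays visible.

```python
import matplotlib.pyplot as plt
import numpy as np


def stacked_percentage_bars(groups):
    """Plot one 100% stacked bar per group.

    groups: dict mapping a group label to a (clicks, n_samples) tuple.
    """
    labels = [f"{name}\n(n={n})" for name, (_, n) in groups.items()]
    click_pct = np.array([100 * clicks / n for clicks, n in groups.values()])
    no_click_pct = 100 - click_pct

    fig, ax = plt.subplots(figsize=(6, 4))
    ax.bar(labels, click_pct, label="Click")
    ax.bar(labels, no_click_pct, bottom=click_pct, label="No click")

    # Annotate the click share inside each bar
    for i, pct in enumerate(click_pct):
        ax.text(i, pct / 2, f"{pct:.1f}%", ha="center", va="center")

    ax.set_ylim(0, 100)
    ax.set_ylabel("Share of samples (%)")
    ax.set_title("Click Through Rate by group")
    ax.legend(loc="upper right")
    return fig, ax


# Hypothetical counts for illustration only
fig, ax = stacked_percentage_bars({
    "Historical": (127, 910),
    "New Ad Campaign": (91, 455),
})
fig.tight_layout()
```

Call plt.show() or fig.savefig() afterwards to display or save the figure.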