The Chi-Squared Test for Proportions with SciPy, statsmodels, and pingouin
For a long time now I've wanted to catalogue some of the more common (and less common) statistical tests used for hypothesis testing, primarily focusing on tests used in the context of Experimental Design and A/B Testing. Rather than just provide code for the statistical test, I'll also try to provide some code that will be useful from end to end, as well as compare implementations from various packages. Please have a look in my GitHub repo for all of the relevant code and other statistical tests (a work in progress).
GitHub Repo & Notebook: https://github.com/sam-tritto/statistical-tests/blob/main/Chi_Squared_Test.ipynb
OUTLINE
The Chi-Squared Test for Proportions (and Independence and Goodness of Fit)
Other Important Considerations
Setting up the Python Environment
Power Analysis
Synthetic Data
Check Assumptions
Chi-Squared Test (SciPy)
Chi-Squared Test (statsmodels)
Chi-Squared Test (pingouin)
Difference in Proportions and 95% Confidence Interval
Effect Size
Odds Ratio
Visualization
Further Reading & Resources
The Chi-Squared Test for Proportions (and Independence and Goodness of Fit)
The Chi-Squared Test is a very versatile test and can be used for comparing the difference in proportions, testing for independence, and testing for goodness of fit. When comparing proportions this test can be useful for A/B Test situations. When testing for independence it can be useful for feature selection and importance. When testing for goodness of fit it can be useful for testing whether observed data follows a known distribution. At the heart of the test is the Chi-Squared distribution. The Chi-Squared distribution with k degrees of freedom is the distribution of the sum of the squares of k independent standard Normal random variables.
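To make that definition concrete, here is a small simulation sketch. The choice of k = 3, the seed, and the number of draws are my own illustrative assumptions. It sums the squares of k standard Normal draws many times and compares the simulated mean and variance with the theoretical values for a Chi-Squared distribution (k and 2k).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
k = 3                            # degrees of freedom (illustrative choice)
n_draws = 200_000                # number of simulated sums (illustrative choice)

# Each row holds k independent standard Normal draws; square them and sum across the row.
z = rng.standard_normal(size=(n_draws, k))
sum_of_squares = (z ** 2).sum(axis=1)

# Simulated moments versus the theoretical Chi-Squared(k) moments (mean k, variance 2k).
sim_mean, sim_var = sum_of_squares.mean(), sum_of_squares.var()
theory_mean, theory_var = stats.chi2.stats(df=k, moments="mv")

print(f"mean:     simulated {sim_mean:.3f} vs theoretical {float(theory_mean):.3f}")
print(f"variance: simulated {sim_var:.3f} vs theoretical {float(theory_var):.3f}")
```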
Here, I'll focus on using the test to compare proportions in an A/B testing situation: comparing the satisfaction levels (Satisfied/Not Satisfied) of two user groups experiencing two different versions of a website. Used this way, the test compares the proportions of categorical data between two or more independent groups. The data is typically entered as a contingency table, and the Chi-Squared Test requires that each cell in the contingency table has at least 5 data points (the rule of thumb is usually stated in terms of expected counts), but the test is more generally appropriate when the data is large. For smaller data sets Fisher's Exact test is often used. Another popular test for proportions is the Z-test, which also requires large sample sizes. The Z-test is appropriate for binary data rather than multi-level categorical data, although binary data is simply categorical data with two levels. The Z-test also relies on the sample proportions being approximately Normally distributed, which is why it needs large samples. The Chi-Squared test can also be applied when there are more than 2 groups, again making it a very versatile test.
Other Important Considerations
For A/B Tests and much of Experimental Design the statistical tests are often only one small step of the process. Here I'll outline some other considerations that I'm not accounting for in this tutorial.
Randomization - Randomly assigning participants to the different groups to reduce bias and ensure the groups being compared are similar in all ways except the treatment.
Multiple Tests & Correcting p-values - If several hypotheses are being tested, or the same hypothesis is tested repeatedly, then we need to apply a correction such as the Bonferroni Correction to control the family-wise error rate and reduce the likelihood of false positives.
One-Tailed vs Two-Tailed - In a One-Tailed test we're interested in detecting an effect in a specific direction, either positive or negative, while in a Two-Tailed test we're interested in detecting any significant difference, regardless of the direction. I'll stick to using Two-Tailed tests unless otherwise noted.
Guardrail Metrics - Monitor additional metrics to ensure there are no unintended consequences of implementing the new changes. These metrics act as safeguards to protect both the business and users.
Decision to Launch - Consider both statistical and practical significance before determining to launch the changes.
Setting up the Python Environment
Sometimes setting up a Python environment with different packages can be tricky. Here's a snippet of the packages I'll be using and their pip install commands. If you're comfortable with it, you should use a virtual environment. I'm using an Anaconda environment but chose to pip install anyway.
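The original snippet isn't reproduced here, so the following is a sketch of what the setup might look like. The package list is my assumption: SciPy, statsmodels, pingouin, and numpy are named in this post, while pandas and matplotlib are guesses based on the dataframe and plotting steps. The pip commands appear as comments, and the code reports which of those packages are installed and at what version.

```python
# Install commands (run these in a terminal, ideally inside a virtual environment):
#   pip install numpy pandas scipy statsmodels pingouin matplotlib

from importlib.metadata import PackageNotFoundError, version

# Assumed package list for this tutorial (see the note above).
PACKAGES = ["numpy", "pandas", "scipy", "statsmodels", "pingouin", "matplotlib"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name:12s} {ver if ver is not None else 'NOT INSTALLED'}")
```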
Power Analysis
Before collecting your data, a common first step is to determine the number of samples you'll need to collect for each group. This is done by balancing an equation in what's called a Power Analysis. There are different equations for different statistical tests. For the Chi-Squared test I'll be using the GofChisquarePower() class from statsmodels. The variables in the equation are:
alpha = The significance level or probability of rejecting the Null Hypothesis when it's actually true.
A value between 0.0 and 1.0, typically 0.05 or 5% is used.
power = The statistical power or probability of correctly rejecting the Null Hypothesis when it's false.
A value between 0.0 and 1.0, typically 0.8 or 80% is used.
effect size = The magnitude of the difference between groups that we'd like to measure (Cramer's V, Cohen's w, Cohen's h, or Phi).
A value between 0.0 and 1.0; for Phi, 0.1 is a small effect, 0.3 is a moderate effect, and 0.5 is a large effect.
samples = The number of samples needed per group.
One of these variables must be None to solve the equation. To solve for the number of samples needed, you should set that variable to None. However, to visualize this equation, I'll create an array of Effect Sizes to use as an x-axis. This way I'll be able to determine how many samples I'll need in order to measure each corresponding Effect Size. You can see that to measure a small effect, you'll need more samples.
So in order to measure an Effect Size of 0.1 I'll need to gather about 785 samples per group, given a 0.05 alpha, and 0.8 statistical power. Or in other words in order to measure a small difference (Effect Size) between groups, I'll need to gather 785 samples per group, given a 5% probability of saying there is a difference in proportions when there really is none, and an 80% probability of saying there is a difference in proportions when there really is.
Here's the SciPy implementation:
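One way to reproduce the sample size with SciPy alone is through the noncentral Chi-Squared distribution. This is a sketch under my own assumptions (one degree of freedom, as in a 2x2 table, and a noncentrality parameter of n times the squared Effect Size), not necessarily the notebook's code. The helper names chi2_power and required_n are hypothetical.

```python
import math

from scipy import optimize, stats


def chi2_power(n, effect_size, alpha=0.05, dof=1):
    """Power of a Chi-Squared test with n observations and effect size w (Cohen's w)."""
    crit = stats.chi2.ppf(1 - alpha, dof)                   # rejection threshold under the Null
    return stats.ncx2.sf(crit, dof, n * effect_size ** 2)   # P(reject) under the alternative


def required_n(effect_size, alpha=0.05, power=0.8, dof=1):
    """Smallest (real-valued) n at which the test reaches the requested power."""
    def gap(n):
        return chi2_power(n, effect_size, alpha, dof) - power

    # Power grows with n, so a bracketing root finder is enough here.
    return optimize.brentq(gap, 2, 1e5)


n_needed = required_n(0.1)
print(math.ceil(n_needed))
```

Rounding up with math.ceil keeps the achieved power at or above the target.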
And here's the statsmodels implementation:
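A minimal sketch with GofChisquarePower, solving for the number of samples over a handful of Effect Sizes. The particular grid of Effect Sizes is my own choice for illustration. One caveat worth flagging: power calculations based on Cohen's w are usually stated in terms of the total number of observations, so using the result as a per-group target, as this post does, errs on the side of collecting more data.

```python
import math

import numpy as np
from statsmodels.stats.power import GofChisquarePower

analysis = GofChisquarePower()
effect_sizes = np.array([0.1, 0.2, 0.3, 0.4, 0.5])  # illustrative grid

# Leave nobs as None so solve_power solves for the sample size.
samples_needed = [
    float(analysis.solve_power(effect_size=w, nobs=None, alpha=0.05, power=0.8, n_bins=2))
    for w in effect_sizes
]

for w, n in zip(effect_sizes, samples_needed):
    print(f"effect size {w:.1f}: about {math.ceil(n)} samples")
```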
And here's the pingouin implementation:
Once we know the sample size, we can repurpose the Power Analysis equation to look at the trade off between statistical power and the Effect Size.
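As a sketch of that trade-off, the snippet below fixes the sample size at 785 and evaluates the power of the test over a grid of Effect Sizes. The grid itself is an illustrative assumption.

```python
import numpy as np
from statsmodels.stats.power import GofChisquarePower

analysis = GofChisquarePower()
nobs = 785                                              # sample size from the Power Analysis
effect_sizes = np.round(np.linspace(0.02, 0.20, 10), 2)  # 0.02, 0.04, ..., 0.20

# Power achieved at each Effect Size with the sample size held fixed.
powers = [
    float(analysis.power(effect_size=w, nobs=nobs, alpha=0.05, n_bins=2))
    for w in effect_sizes
]

for w, pw in zip(effect_sizes, powers):
    print(f"effect size {w:.2f}: power {pw:.3f}")
```

Smaller effects than the one the test was sized for are detected less often, which is the trade-off in question.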
Synthetic Data
Here I'll use numpy to generate categorical data with slightly different probabilities for each group, and we already know we'd like to see 785 samples per group. In a real A/B testing scenario, you'd want to make sure that you are using random assignment to create these groups in order to reduce bias and help ensure that the groups are similar in all ways except the treatment.
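A sketch of how such data could be generated. The satisfaction probabilities (0.75 and 0.65), the seed, and the column names are hypothetical values of mine, not the ones behind the results quoted later in this post.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)        # fixed seed for repeatability
n_per_group = 785                      # from the Power Analysis
p_satisfied = {"A": 0.75, "B": 0.65}   # hypothetical satisfaction rates per group

frames = []
for group, p in p_satisfied.items():
    # Draw one categorical outcome per user in this group.
    outcome = rng.choice(["Satisfied", "Not Satisfied"], size=n_per_group, p=[p, 1 - p])
    frames.append(pd.DataFrame({"group": group, "satisfaction": outcome}))

df = pd.concat(frames, ignore_index=True)

# 2x2 contingency table: rows are groups, columns are satisfaction levels.
contingency = pd.crosstab(df["group"], df["satisfaction"])
print(contingency)
```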
Check Assumptions
Kind of silly to do this for this test, but I'm trying to build a template with similar steps for multiple statistical tests. Here we'll just check that there are at least 5 data points in each cell of the 2x2 contingency table. The Independence assumption is a little more difficult to check programmatically.
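A minimal sketch of the check, using hypothetical counts rather than the data from this post. Since the rule of thumb is usually stated for expected counts, the sketch checks both the observed and the expected cells; expected_freq from scipy.stats.contingency computes the latter.

```python
import numpy as np
from scipy.stats.contingency import expected_freq

# Hypothetical counts: rows are groups A and B, columns are Satisfied / Not Satisfied.
observed = np.array([[590, 195],
                     [510, 275]])

# Expected counts under the Null Hypothesis that group and satisfaction are independent.
expected = expected_freq(observed)

observed_ok = bool((observed >= 5).all())
expected_ok = bool((expected >= 5).all())
print(f"observed cells >= 5: {observed_ok}, expected cells >= 5: {expected_ok}")
```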
The Chi-Squared Test (SciPy)
Now for the fun stuff. This is the SciPy implementation of the Chi-Squared test. The chi2_contingency() function requires the data to be in a contingency table, which I've built here with numpy arrays. The correction parameter controls Yates' Continuity Correction, which accounts for the fact that our data is discrete; SciPy only applies it when the table has one degree of freedom, as a 2x2 table does. The usual rule of thumb is that the correction is needed when any cell in the contingency table is less than 5. The low p-value in the results indicates that the proportions differ significantly between our two groups, which is as expected.
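A sketch of the call, again on hypothetical counts rather than the data behind the numbers in this post. The correction is switched off here because every cell is well above 5.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: rows are groups A and B, columns are Satisfied / Not Satisfied.
table = np.array([[590, 195],
                  [510, 275]])

# correction=False skips Yates' Continuity Correction.
chi2, p_value, dof, expected = chi2_contingency(table, correction=False)

print(f"chi2 = {chi2:.4f}, p-value = {p_value:.6f}, dof = {dof}")
```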
The Chi-Squared Test (statsmodels)
A very similar implementation with statsmodels (which uses SciPy under the hood). Here, rather than pass a contingency table you can simply pass a list of positive values for each group and the number of observations per group separately. Again, we get a low p-value indicating a significant difference in proportions between the two groups.
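A sketch using proportions_chisquare from statsmodels, with hypothetical counts of satisfied users and 785 observations per group. The function returns the statistic, the p-value, and a tuple holding the contingency table it built along with the expected counts.

```python
import numpy as np
from statsmodels.stats.proportion import proportions_chisquare

satisfied = np.array([590, 510])   # hypothetical number of satisfied users in groups A and B
nobs = np.array([785, 785])        # observations per group

chi2, p_value, (table, expected) = proportions_chisquare(count=satisfied, nobs=nobs)

print(f"chi2 = {chi2:.4f}, p-value = {p_value:.6f}")
```

With two groups and no continuity correction this matches the uncorrected Chi-Squared statistic from a 2x2 contingency table of the same counts.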
The Chi-Squared Test (pingouin)
A slightly different implementation with pingouin. First I'll melt the dataframe into long format before passing in the data, but that's the only real difference. The summary table reports several tests from the power divergence family, one for each value of lambda (Pearson's Chi-Squared among them). It also provides Cramer's V Effect Size as well as the statistical power of the test, all done without asking.
Difference in Proportions and 95% Confidence Interval
Ok... so now that we know there is a significant difference, what else can we gather? A natural place to start is investigating what the difference in proportions actually is. The next move would be to calculate a 95% confidence interval around this difference to quantify the uncertainty. Using the Chi-Squared statistic gathered from our modeling, along with a critical value, we can see that the difference in proportions between our 2 groups is -0.113, or 11.3 percentage points, and lies in the 95% confidence interval [-0.153, -0.076]. I'm also noting that 0 is not contained in the interval, which signals that the difference is truly negative.
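As an illustration, here is a sketch of a Wald-style interval for a difference in proportions. This is a common textbook approach that I'm swapping in, not necessarily the calculation used for the numbers above, and the counts are hypothetical. The helper name diff_in_proportions_ci is also my own.

```python
import math

from scipy.stats import norm


def diff_in_proportions_ci(x1, n1, x2, n2, confidence=0.95):
    """Wald interval for p1 - p2, where x is the success count and n the group size."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    # Unpooled standard error of the difference between two independent proportions.
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = norm.ppf(1 - (1 - confidence) / 2)   # two-sided critical value
    return diff, diff - z * se, diff + z * se


# Hypothetical counts: group B (510 of 785 satisfied) minus group A (590 of 785 satisfied).
diff, lower, upper = diff_in_proportions_ci(510, 785, 590, 785)
print(f"difference = {diff:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```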
Effect Size
We already know there is a significant difference between the 2 groups, but the Effect Size will tell us how strong the association is. There are a few different Effect Sizes we can use for the Chi-Squared test, and for a 2x2 contingency table the Phi statistic is appropriate. We can use the Chi-Squared statistic to calculate Phi. A Phi of 0.19 is a small to moderate effect or association between the 2 groups.
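For a 2x2 table, Phi is the square root of the Chi-Squared statistic divided by the total number of observations. A sketch on hypothetical counts (so the value will differ from the 0.19 quoted above):

```python
import math

import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: rows are groups A and B, columns are Satisfied / Not Satisfied.
table = np.array([[590, 195],
                  [510, 275]])

# Uncorrected Chi-Squared statistic; only the first return value is needed here.
chi2, _, _, _ = chi2_contingency(table, correction=False)

n_total = table.sum()
phi = math.sqrt(chi2 / n_total)
print(f"phi = {phi:.4f}")
```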
We can also calculate Cohen's h, which is a measure of distance between two proportions. We get the same conclusion that there is a small difference between the two groups.
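Cohen's h is the difference between the arcsine-transformed proportions, h = 2·asin(√p1) − 2·asin(√p2). The sketch below computes it by hand on hypothetical proportions and cross-checks the result against proportion_effectsize from statsmodels, which implements the same transformation.

```python
import math

from statsmodels.stats.proportion import proportion_effectsize


def cohens_h(p1, p2):
    """Cohen's h: difference between arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


# Hypothetical satisfaction rates for groups A and B.
p_a, p_b = 590 / 785, 510 / 785

h = cohens_h(p_a, p_b)
h_statsmodels = proportion_effectsize(p_a, p_b)
print(f"Cohen's h = {h:.4f} (statsmodels: {h_statsmodels:.4f})")
```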
Odds Ratio
The Odds Ratio is another Effect Size that is appropriate for binary data. Since our Chi-Squared test only has two categories we can use it here to get a relative distance between the two groups. We can interpret it as follows: the odds of being satisfied in Group A are about twice (2.0175 times) the odds in Group B. Note that this is a statement about odds, not about probabilities.
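A sketch of the calculation on hypothetical counts (so the result differs from the 2.0175 above). It also adds a Wald interval on the log scale, which is a standard approach but an addition of mine rather than something from the original analysis.

```python
import math

import numpy as np
from scipy.stats import norm

# Hypothetical counts: rows are groups A and B, columns are Satisfied / Not Satisfied.
table = np.array([[590, 195],
                  [510, 275]])
(a, b), (c, d) = table

odds_a = a / b                 # odds of being satisfied in group A
odds_b = c / d                 # odds of being satisfied in group B
odds_ratio = odds_a / odds_b   # equivalent to (a * d) / (b * c)

# Wald 95% interval computed on the log odds ratio, then mapped back.
se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
z = norm.ppf(0.975)
ci_lower = math.exp(math.log(odds_ratio) - z * se_log)
ci_upper = math.exp(math.log(odds_ratio) + z * se_log)

print(f"odds ratio = {odds_ratio:.4f}, 95% CI = [{ci_lower:.3f}, {ci_upper:.3f}]")
```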
Visualization
I'll end with an appropriate visualization for comparing the proportions between 2 groups - a stacked percentage bar for each group.
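A sketch of such a chart with matplotlib, using hypothetical counts. The labels and title are my own choices, and the Agg backend line is only there so the script also runs without a display; in a notebook you can drop it and call plt.show() instead.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

groups = ["A", "B"]
# Hypothetical counts: columns are Satisfied / Not Satisfied.
table = np.array([[590, 195],
                  [510, 275]])

# Convert each row to percentages so both bars reach 100%.
pct = table / table.sum(axis=1, keepdims=True) * 100

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(groups, pct[:, 0], label="Satisfied")
ax.bar(groups, pct[:, 1], bottom=pct[:, 0], label="Not Satisfied")  # stacked on top

ax.set_xlabel("Group")
ax.set_ylabel("Percent of group")
ax.set_ylim(0, 100)
ax.set_title("Satisfaction by group")
ax.legend(loc="lower right")
fig.tight_layout()
```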