McNemar's Test for Proportions with pingouin and statsmodels

For a long time now I've wanted to catalogue some of the more common (and less common) statistical tests used for hypothesis testing, focusing primarily on tests used in the context of Experimental Design and A/B Testing. Rather than just provide code for the statistical test, I'll also try to provide code that is useful from end to end and compare implementations from various packages. Please have a look at my GitHub repo for all of the relevant code and other statistical tests (a work in progress).

GitHub Repo & Notebook: https://github.com/sam-tritto/statistical-tests/blob/main/McNemars_Test.ipynb

OUTLINE

The McNemar's Test for Proportions
Other Important Considerations
Setting up the Python Environment
Power Analysis
Synthetic Data
Check Assumptions
McNemar's Test (statsmodels)
McNemar's Test (pingouin)
Difference in Proportions and 95% Confidence Interval
Effect Size
Odds Ratio
Visualization
Further Reading & Resources

The McNemar's Test for Proportions

McNemar's Test is a statistical test used to determine whether there is a significant difference in proportions between two related or dependent groups, typically for categorical data in a binary outcome scenario. It is specifically designed for paired data, where each observation in one group is paired with an observation in the other group. This type of paired test is appropriate for before and after A/B testing scenarios, where there is typically some type of intervention done on a sample group.

There are two versions of this test: the "exact" Binomial version for smaller sample sizes and the "approximate" Chi-Squared version for larger sample sizes (the labels in quotes refer to the p-values generated).

Here, I'll be using the test to compare the proportion of students who pass an exam under the traditional teaching method with the proportion who pass after a new teaching method has been implemented for the same students.

Other Important Considerations

For A/B Tests and much of Experimental Design the statistical tests are often only one small step of the process. Here I'll outline some other considerations that I'm not accounting for in this tutorial.

Randomization - Randomly assigning participants to the different groups to reduce bias and ensure the groups being compared are similar in all ways except the treatment.

Multiple Tests & Correcting p-values - If the same hypothesis is being tested multiple times or simultaneously, then we need to apply corrections such as a Bonferroni Correction to control the family-wise error rate and reduce the likelihood of false positives.

One-Tailed vs Two-Tailed - In a One-Tailed test we're interested in detecting an effect in a specific direction, either positive or negative, while in a Two-Tailed test we're interested in detecting any significant difference, regardless of the direction. I'll stick to using Two-Tailed tests unless otherwise noted.

Guardrail Metrics - Monitor additional metrics to ensure there are no unintended consequences of implementing the new changes. These metrics act as safeguards to protect both the business and users.

Decision to Launch - Consider both statistical and practical significance before deciding to launch the changes.

Setting up the Python Environment

Sometimes setting up a Python environment with different packages can be tricky. Here's a snippet of the packages I'll be using and their pip install commands. If you're comfortable with it, you should use a virtual environment. I'm using an Anaconda environment but chose to pip install anyway. Of the packages used here, only statsmodels and pingouin implement this test.

Power Analysis

Before collecting your data, a first step will often be to determine the number of samples you'll need to collect for each group. This is done by balancing an equation in what's called a Power Analysis. There are different equations for different statistical tests. For McNemar's Test there have been a few proposed algorithms and equations over the years. I've found this one in Connor R. J. (1987), "Sample size for testing differences in proportions for the paired-sample design", Biometrics 43(1):207-211 (page 209), and I'm going to modify it slightly so we can better visualize how the required sample sizes change with the changing proportions.

The calculation is built on the difference between two proportions: pairs that succeed in Group A but fail in Group B, and pairs that succeed in Group B but fail in Group A. If that's a little hard to wrap your head around, it might help to first visualize a contingency table. The first table below has the raw counts per distinct category. The second contingency table has been normalized, and the two proportions that I'll use for the Power Analysis are in the bottom left and upper right corners. These two corners of a contingency table are also referred to as discordants. Specifically, I'll be looking at the difference between them: the smaller the difference, the more samples I'll need to preserve the statistical power and significance of the test.

Below you can see that the smaller the difference between these two proportions, the greater the number of samples we'll need. To find the number of samples needed, it is necessary to estimate these two proportions in the contingency table. I'll estimate 25% for the success of the New Teaching Method but failure of the Traditional Teaching Method and 15% for the success of the Traditional Teaching Method but the failure of the New Teaching Method - a 10 percentage point difference. This equates to 246 samples, given a 5% probability of saying there is a difference in proportions when there really is none, and an 80% probability of saying there is a difference in proportions when there really is.
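As a rough sketch of the calculation described above, here is the sample size formula as I understand it from Connor (1987), written with only the standard library. The function name and the `two_sided` switch are my own additions for illustration, not something taken from the notebook. With the 25% and 15% estimates, the one-sided critical value reproduces the 246 pairs quoted above, while a two-sided critical value asks for more pairs.

```python
from math import ceil, sqrt
from statistics import NormalDist


def connor_sample_size(p10, p01, alpha=0.05, power=0.80, two_sided=False):
    """Approximate number of pairs for McNemar's test (Connor 1987 style).

    p10 and p01 are the two expected discordant proportions.
    The two_sided flag is an illustrative assumption, not part of the source.
    """
    psi = p10 + p01          # total discordant proportion
    delta = abs(p10 - p01)   # difference the test should detect
    tail = alpha / 2 if two_sided else alpha
    z_alpha = NormalDist().inv_cdf(1 - tail)
    z_beta = NormalDist().inv_cdf(power)
    n = (z_alpha * sqrt(psi) + z_beta * sqrt(psi - delta ** 2)) ** 2 / delta ** 2
    return ceil(n)


n_one_sided = connor_sample_size(0.25, 0.15)
n_two_sided = connor_sample_size(0.25, 0.15, two_sided=True)
print(n_one_sided, n_two_sided)
```

Calling the function over a grid of discordant proportions is one way to see how the required number of pairs grows as the two proportions get closer together.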

Synthetic Data

Here I'll use numpy to generate test score data with slightly different probabilities for each paired group, and we already know how many samples we'd like to gather. Since these are continuous test scores and not binary, I'll then convert them to binary according to a pre-determined passing score threshold. Then we will be able to identify whether each student passed, both pre and post intervention. Notice that each row is paired data and the first column represents the unique Student ID.

60 seems like an appropriate passing score.
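A minimal sketch of that data generation step might look like the following. The seed, the normal distributions, their means and spreads, and the column names are all assumptions chosen for illustration, so the resulting counts will not match the figures quoted in this post.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seed chosen only for reproducibility
n_students = 246                 # from the power analysis
passing_score = 60

# Hypothetical score distributions: the new method shifts the mean up slightly.
scores = pd.DataFrame({
    "student_id": np.arange(1, n_students + 1),
    "traditional_score": rng.normal(loc=62, scale=12, size=n_students).clip(0, 100),
    "new_score": rng.normal(loc=66, scale=12, size=n_students).clip(0, 100),
})

# Convert continuous scores to binary pass (1) / fail (0) outcomes.
scores["traditional_pass"] = (scores["traditional_score"] >= passing_score).astype(int)
scores["new_pass"] = (scores["new_score"] >= passing_score).astype(int)

print(scores.head())
```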

And we can take a look at the data as contingency tables. The Null hypothesis of McNemar's test is that the two discordants are equal. We can see that 23.98% is quite a ways from 15.85%, but is this difference statistically significant?
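The two contingency tables can be sketched with pandas crosstab(). Here the discordant counts (59 and 39) are back-calculated from the percentages above and 246 pairs, while the split of the remaining pairs between the two concordant cells (100 and 48) is hypothetical, since the text does not give it.

```python
import pandas as pd

# (traditional_pass, new_pass, count) with 1 = pass, 0 = fail.
# 59 and 39 follow from 23.98% and 15.85% of 246; 100 and 48 are hypothetical.
cells = [
    (1, 1, 100),
    (0, 1, 59),
    (1, 0, 39),
    (0, 0, 48),
]
rows = [(trad, new) for trad, new, count in cells for _ in range(count)]
paired = pd.DataFrame(rows, columns=["traditional_pass", "new_pass"])

# Raw counts, then the same table normalized over all pairs.
counts = pd.crosstab(paired["traditional_pass"], paired["new_pass"])
proportions = pd.crosstab(
    paired["traditional_pass"], paired["new_pass"], normalize="all"
)

print(counts)
print((proportions * 100).round(2))
```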

Check Assumptions

Here we'll just check that there are at least 5 data points in each of the cells of the contingency table and that the data is binary.
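One way to sketch both checks with the standard library is shown below. The helper name and the tiny example data are hypothetical, and the threshold of 5 is simply the one mentioned above.

```python
from collections import Counter


def check_mcnemar_assumptions(pre, post, min_count=5):
    """Return (is_binary, cells_ok, cell_counts) for paired 0/1 outcomes."""
    if len(pre) != len(post):
        raise ValueError("pre and post must have one value per pair")
    # Every outcome must be 0 or 1.
    is_binary = set(pre) <= {0, 1} and set(post) <= {0, 1}
    # Count pairs in each of the four cells of the 2x2 table.
    cell_counts = Counter(zip(pre, post))
    cells_ok = all(
        cell_counts[(a, b)] >= min_count for a in (0, 1) for b in (0, 1)
    )
    return is_binary, cells_ok, cell_counts


# Tiny hypothetical example: 6 pairs in each of the four cells.
pre = [1, 1, 0, 0] * 6
post = [1, 0, 1, 0] * 6
is_binary, cells_ok, cell_counts = check_mcnemar_assumptions(pre, post)
print(is_binary, cells_ok, dict(cell_counts))
```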

McNemar's Test (statsmodels)

Now for the fun stuff. This is the statsmodels implementation of McNemar's Test. The mcnemar() function requires the data to be entered as a contingency table, so I'll first set my pre-determined alpha value and then use pandas crosstab() to get the data into its needed format. Then I'll use some basic conditional logic to choose the right parameters: whether we have enough data to use the approximate Chi-Squared version or need the exact Binomial version of the test, and whether our data is small enough to qualify for the continuity correction proposed by Edwards (Edwards A. 1948).
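Here is a hedged sketch of that logic. The contingency table reuses the discordant counts implied by the percentages above with a hypothetical concordant split, and the cutoff of 25 discordant pairs for switching to the exact test is a common rule of thumb that I am assuming, not necessarily what the notebook uses. The text does not state the cutoff for the continuity correction, so it is simply switched off here.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

alpha = 0.05

# Rows: traditional (fail, pass); columns: new (fail, pass).
# Discordants 59 and 39 come from the text; 48 and 100 are hypothetical.
table = np.array([[48, 59],
                  [39, 100]])

n_discordant = table[0, 1] + table[1, 0]
use_exact = bool(n_discordant < 25)  # assumed rule of thumb for small samples
use_correction = False               # assumption: correction cutoff not given in the text

result = mcnemar(table, exact=use_exact, correction=use_correction)

print(f"statistic = {result.statistic:.4f}, p-value = {result.pvalue:.4f}")
if result.pvalue < alpha:
    print("Reject the null: the discordant proportions differ.")
else:
    print("Fail to reject the null.")
```

Only the two discordant cells enter the statistic, so the hypothetical concordant counts do not affect the result.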

It looks like the p-value from the test came in just below our alpha, so we can conclude there is a significant difference between the two proportions.

McNemar's Test (pingouin)

Now a pingouin implementation with similar logic for the continuity correction. By default, pingouin returns both the exact and approximate p-values, which is nice.

Difference in Proportions and 95% Confidence Interval

Ok... so now that we know there is a significant difference, what else can we gather? A natural place to start is investigating what the difference in proportions actually is. After that we can calculate a 95% confidence interval around this difference to quantify the uncertainty. Since the Null and Alternative Hypotheses for McNemar's Test are focused on the discordant proportions, I'll focus on the difference between the two discordant proportions. If you were interested in the difference in overall proportions, you could easily update the code for that. Using the norm() function from SciPy to access the Normal distribution, we can see that the difference in proportions between our 2 groups is 0.0813 or 8.13% and lies in the 95% confidence interval [0.011, 0.1515].
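The interval above can be sketched as follows. I use the standard library's NormalDist in place of SciPy's norm so the snippet has no dependencies, and the counts are back-calculated from the percentages in this post. The standard error here treats the two discordant proportions as if they were independent; that is my assumption, made because it matches the interval quoted above, and I can't confirm it is exactly what the notebook does.

```python
from math import sqrt
from statistics import NormalDist

n_pairs = 246
new_only = 59          # passed with the new method only (23.98%)
traditional_only = 39  # passed with the traditional method only (15.85%)

p_new_only = new_only / n_pairs
p_trad_only = traditional_only / n_pairs
diff = p_new_only - p_trad_only

# Wald standard error treating the two proportions as independent (assumption).
se = sqrt(
    p_new_only * (1 - p_new_only) / n_pairs
    + p_trad_only * (1 - p_trad_only) / n_pairs
)
z = NormalDist().inv_cdf(0.975)  # two-sided 95% critical value
ci_low, ci_high = diff - z * se, diff + z * se

print(f"difference = {diff:.4f}, 95% CI = [{ci_low:.4f}, {ci_high:.4f}]")
```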

Effect Size

We know there is a significant difference between the two groups' proportions, but the Effect Size will tell us how strong the association is. The appropriate Effect Size to use for this test is Cohen's h, which when calculated yields 0.2045. This translates to a small difference in proportions, and we already know it is statistically significant. Again, I'm referring to the difference in discordant proportions, in line with the Null and Alternative Hypotheses of McNemar's Test. Here I'm using proportion_effectsize() from statsmodels, but you can also calculate it directly if needed.
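For the direct calculation, Cohen's h is the difference between the arcsine-transformed proportions. A small sketch with the standard library, using the discordant counts back-calculated from the percentages in this post (the function name is my own):

```python
from math import asin, sqrt


def cohens_h(p1, p2):
    """Cohen's h: difference of arcsine-transformed proportions."""
    return 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))


n_pairs = 246
p_new_only = 59 / n_pairs   # 23.98%
p_trad_only = 39 / n_pairs  # 15.85%

h = cohens_h(p_new_only, p_trad_only)
print(f"Cohen's h = {h:.4f}")
```

By Cohen's usual benchmarks, values around 0.2 are read as small, 0.5 as medium, and 0.8 as large.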

Odds Ratio

The Odds Ratio is another Effect Size that is appropriate for binary data. Since our data is binary, we can use it here to get a relative distance between the two groups. We can interpret it as follows: the odds of passing for Group A (the new method) are about 1.5 times the odds of passing for Group B (the traditional method).
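For paired binary data, the odds ratio is commonly taken as the ratio of the two discordant counts. A short sketch, assuming that is the calculation behind the 1.5 above and using counts back-calculated from the percentages in this post:

```python
new_only = 59          # passed with the new method, failed with the traditional one
traditional_only = 39  # passed with the traditional method, failed with the new one

# Matched-pairs odds ratio: ratio of the discordant counts.
odds_ratio = new_only / traditional_only
print(f"odds ratio = {odds_ratio:.2f}")
```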

Visualization

I'll end with an appropriate visualization for comparing the proportions between 2 groups - a stacked percentage bar for each group.
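A stacked percentage bar chart along those lines could be sketched with matplotlib as below. The overall pass counts depend on the concordant cells, which the text does not give, so the 100 pairs passing under both methods is a hypothetical value; the output filename and labels are also my own choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt

n_pairs = 246
# Hypothetical: 100 students pass under both methods; 39 and 59 are the discordants.
pass_counts = {"Traditional": 100 + 39, "New": 100 + 59}

labels = list(pass_counts)
pass_pct = [100 * count / n_pairs for count in pass_counts.values()]
fail_pct = [100 - pct for pct in pass_pct]

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(labels, pass_pct, label="Pass")
ax.bar(labels, fail_pct, bottom=pass_pct, label="Fail")  # stack fails on top of passes
ax.set_ylabel("Percent of students")
ax.set_title("Pass rate by teaching method")
ax.legend()
fig.tight_layout()
fig.savefig("pass_rates.png", dpi=150)
```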