Fisher's Exact Test (and Barnard's and Boschloo's) for Proportions with SciPy

For a long time now I've wanted to catalogue some of the more common (and less common) statistical tests used for hypothesis testing, primarily focusing on tests used in the context of Experimental Design and A/B Testing. Rather than dive deep into the math behind these tests, I'll focus on the code from an end-to-end business solution perspective and compare implementations from various packages. Please have a look at my GitHub repo for all of the relevant code and other statistical tests (a work in progress).

GitHub Repo & Notebook: https://github.com/sam-tritto/statistical-tests/blob/main/Fishers_Exact_Test.ipynb

 

OUTLINE

Fisher's Exact Test for Proportions
Other Important Considerations
Setting up the Python Environment
Power Analysis
Synthetic Data
Check Assumptions
Fisher's Test (SciPy)
Barnard's Test (SciPy)
Boschloo's Test (SciPy)
Difference in Proportions and 95% Confidence Interval
Effect Size
Odds Ratio
Visualization


Fisher's Exact Test for Proportions


Fisher's Exact Test is a statistical test used to determine whether there is a significant difference in proportions between two independent groups with categorical data. It is similar to the Chi-Squared Test, but the Chi-Squared Test relies on a sampling distribution that only approximates the correct distribution; the approximation (and therefore the p-value) becomes more accurate as the cell counts in the table increase. This limitation makes chi-squared p-values unreliable when you have small cell counts. Fisher's exact test, in contrast, doesn't rely on an approximate sampling distribution. Instead, it considers all possible contingency tables with the same row and column totals as the observed table, then determines the p-value as the proportion of those tables that are at least as extreme as the observed one. Although suitable for all sample sizes, Fisher's test is typically used for smaller sample sizes because the number of possible tables grows quickly and the calculation becomes computationally intense.


There are also two other exact tests available from SciPy, Barnard's and Boschloo's, and both are considered improvements over Fisher's test. While Fisher's test conditions on the marginal totals of the contingency table, Barnard's test does not, which gives it more flexibility and generally more statistical power. Boschloo's test builds on Fisher's test by using Fisher's p-value as its test statistic within that same unconditional framework, which further increases the statistical power of the test. To sum it up:

Fisher's Exact Test is appropriate for before and after A/B testing scenarios, where there is typically some type of intervention done on a sample group. Here, I'll be using the test to compare the proportion of website users who are satisfied with two versions of the same website. 


Other Important Considerations


For A/B Tests and much of Experimental Design, the statistical test itself is often only one small step of the process. Here I'll outline some other considerations that I'm not accounting for in this tutorial.







Setting up the Python Environment


Sometimes setting up a Python environment with different packages can be tricky. Here's a snippet of the packages I'll be using and their pip install commands. If you're comfortable with it, you should use a virtual environment. I'm using an Anaconda environment but chose to pip install anyway. This test is only available through SciPy, but I'll use statsmodels for the effect size calculation.
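A minimal setup might look something like the following; the exact package list and versions are my assumption, so check the notebook for what was actually used:

```python
# Suggested installs (run from your terminal, ideally inside a virtual environment):
#   pip install numpy pandas scipy statsmodels matplotlib

import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.proportion import proportion_effectsize
import matplotlib.pyplot as plt
```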


Power Analysis


For Fisher's exact test, power analysis can be challenging because the test is typically used for small sample sizes, and there isn't a direct analytical solution like with other statistical tests. The test calculates exact probabilities based on the specific arrangement of the data in a 2x2 contingency table. This exact nature makes power analysis more complex than for tests like t-tests or Z-tests.

However, you can use simulation-based approaches to estimate the power for Fisher's exact test. This involves simulating data with different sample sizes to determine how many samples are needed to achieve a certain power level at a given significance level. Here's a simplified approach to conduct a simulation-based power analysis for Fisher's exact test in Python.

First, we can create the function to simulate Fisher's test.
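Below is one possible sketch of such a function. It assumes two independent binomial samples of equal size and simply counts how often fisher_exact() rejects the null hypothesis at the chosen alpha; the function name and defaults are mine, not necessarily those used in the notebook.

```python
import numpy as np
from scipy.stats import fisher_exact

def fisher_power_simulation(p_control, p_treatment, n_per_group,
                            alpha=0.05, n_sims=2000, seed=42):
    """Estimate the power of Fisher's exact test by simulation (rough sketch)."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        # Simulate the number of "satisfied" users in each independent group
        control_success = rng.binomial(n_per_group, p_control)
        treatment_success = rng.binomial(n_per_group, p_treatment)
        table = [[control_success, n_per_group - control_success],
                 [treatment_success, n_per_group - treatment_success]]
        _, p_value = fisher_exact(table, alternative="two-sided")
        rejections += (p_value < alpha)
    return rejections / n_sims
```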

Then, we can use this function to visualize the sample sizes needed for various effect sizes. We'll need the historical or baseline proportion from our first control group, our desired alpha, and our desired statistical power.
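As a rough illustration, using the simulation function sketched above; the baseline proportion of 0.50, the grid of effect sizes, and the step size are all placeholder assumptions:

```python
import numpy as np
import matplotlib.pyplot as plt

baseline_p = 0.50                 # hypothetical historical / control proportion
alpha, target_power = 0.05, 0.80

effect_sizes = np.arange(0.05, 0.30, 0.05)   # differences in proportions to explore
required_n = []

for effect in effect_sizes:
    n = 20
    # Increase the per-group sample size until simulated power reaches the target
    while fisher_power_simulation(baseline_p, baseline_p + effect, n, alpha, n_sims=500) < target_power:
        n += 10
    required_n.append(n)

plt.plot(effect_sizes, required_n, marker="o")
plt.xlabel("Effect size (difference in proportions)")
plt.ylabel("Required sample size per group")
plt.title("Simulated power analysis for Fisher's exact test")
plt.show()
```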


And finally, once we understand the relationship between effect size and sample size, we can use the historical or baseline proportion, along with an estimate of what we think the new proportion will be, to estimate the sample size we'll need for our experiment.
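Continuing the sketch above, with purely hypothetical baseline and expected proportions:

```python
# Hypothetical values: the historical satisfaction rate and what we expect
# the new website version to achieve
baseline_p, expected_p = 0.50, 0.65

n = 20
while fisher_power_simulation(baseline_p, expected_p, n, alpha=0.05, n_sims=1000) < 0.80:
    n += 10

print(f"Estimated sample size needed per group: {n}")
```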


Synthetic Data


Here I'll use numpy to generate website satisfaction data with slightly different probabilities for each user group. We already know how many samples we'd like to gather from our Power Analysis (160, but I'll use 200), but this test is typically used when you have some sort of sampling limitation and cannot gather many samples per group (or even when one group has a very small sample size).
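Something along these lines, where the two satisfaction probabilities (0.65 and 0.50) are hypothetical stand-ins for the values used in the notebook:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_per_group = 200   # a bit more than the ~160 suggested by the power analysis

# Hypothetical satisfaction probabilities for the two website versions
df = pd.DataFrame({
    "group": ["A"] * n_per_group + ["B"] * n_per_group,
    "satisfied": np.concatenate([
        rng.binomial(1, 0.65, n_per_group),   # Group A: new website version
        rng.binomial(1, 0.50, n_per_group),   # Group B: control / historical
    ]),
})

print(df.groupby("group")["satisfied"].mean())
```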


One common mistake I often see is using pd.concat() to get contingency tables. Using pd.concat() essentially treats your data as paired samples (dependent data), but we want our data to be independent samples. So don't use pd.concat(); you can use pd.melt() instead to get the counts per group independently.

Another hot tip is to reorder the contingency table so that your control/historical group is on top. That will help the SciPy function, fisher_exact(), get the directionality correct, especially for the Odds Ratio it returns.
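For reference, here's one way the table could be built from the synthetic data above; the notebook uses a melt-based approach, so treat pd.crosstab() here as a rough stand-in:

```python
# One possible way to build the 2x2 contingency table from the raw data
table = pd.crosstab(df["group"], df["satisfied"])

# Reorder the rows so the control/historical group (B) is on top, which helps
# fisher_exact() report the odds ratio in the direction we expect
table = table.reindex(["B", "A"])
print(table)
```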

Check Assumptions



Fisher's exact test is a statistical test used to determine if there are nonrandom associations between two categorical variables in a contingency table. It is typically used for small sample sizes and does not rely on large sample assumptions like some other statistical tests (e.g., the chi-squared test). Fisher's exact test has no specific assumptions regarding the underlying distribution of the data, which is why it is a robust test for categorical data. However, there are general assumptions to consider when preparing the data for this test.

To ensure that the data is appropriate for Fisher's exact test, you should consider the following:



Fisher's Test (SciPy)


Now for the fun stuff. This is the SciPy implementation of Fisher's Test. The fisher_exact() function from SciPy requires the data to be entered as a contingency table; however, make sure to see my note above and use pd.melt() rather than pd.concat(). The function outputs the exact p-value and odds ratio, and we can compare the p-value against our pre-determined alpha.
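A minimal call might look like this, using the contingency table sketched earlier:

```python
from scipy.stats import fisher_exact

odds_ratio, p_value = fisher_exact(table.to_numpy(), alternative="two-sided")
print(f"Odds ratio: {odds_ratio:.3f}")
print(f"p-value:    {p_value:.4f}")

alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: the satisfaction proportions differ.")
else:
    print("Fail to reject the null hypothesis.")
```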


Barnard's Test (SciPy)


This is the SciPy implementation of Barnard's Test. The barnard_exact() function from SciPy requires the data to be entered as a contingency table; again, make sure to see my note above and use pd.melt() rather than pd.concat(). The function outputs the exact p-value and the Wald test statistic, and we can compare the p-value against our pre-determined alpha. With this function you can specify whether you'd like to use the pooled variance to compute the Wald test statistic (as in Student's t-test) or the unpooled variance (as in Welch's t-test). Here, I'll assume that both test groups have the same variance and use the pooled variance. The n parameter is the number of sampling points used to construct the test. The higher the number, the more computationally expensive this function will be; 32 is the recommended minimum value.
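A sketch of the call, reusing the same contingency table:

```python
from scipy.stats import barnard_exact

res = barnard_exact(table.to_numpy(), alternative="two-sided", pooled=True, n=32)
print(f"Wald statistic: {res.statistic:.3f}")
print(f"p-value:        {res.pvalue:.4f}")
```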


Boschloo's Test (SciPy)


This is the SciPy implementation of Boschloo's Test. The boschloo_exact() function from SciPy requires the data to be entered as a contingency table; again, make sure to see my note above and use pd.melt() rather than pd.concat(). The function outputs the exact p-value and test statistic, and we can compare the p-value against our pre-determined alpha. The n parameter is the number of sampling points used to construct the test. The higher the number, the more computationally expensive this function will be; 32 is the recommended minimum value.
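And the corresponding sketch for Boschloo's test:

```python
from scipy.stats import boschloo_exact

res = boschloo_exact(table.to_numpy(), alternative="two-sided", n=32)
print(f"Statistic: {res.statistic:.3f}")
print(f"p-value:   {res.pvalue:.4f}")
```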

Difference in Proportions and 95% Confidence Interval


Ok... so now that we know there is a significant difference, what else can we gather? A natural place to start is investigating what the difference in proportions actually is. After that, we can calculate a 95% confidence interval around this difference to quantify the uncertainty. Using the norm() function from SciPy to access the Normal distribution, we can see that the difference in proportions between our 2 groups is 0.135, or 13.5%, and lies in the 95% confidence interval [0.05, 0.22].
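A sketch of that calculation, using a Normal approximation for the standard error of the difference (the exact helper code in the notebook may differ):

```python
import numpy as np
from scipy.stats import norm

# Observed proportions of satisfied users in each group (from the synthetic data)
p_a = df.loc[df["group"] == "A", "satisfied"].mean()
p_b = df.loc[df["group"] == "B", "satisfied"].mean()
n_a = (df["group"] == "A").sum()
n_b = (df["group"] == "B").sum()

diff = p_a - p_b
se = np.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
z = norm.ppf(0.975)   # critical value for a two-sided 95% interval

ci_low, ci_high = diff - z * se, diff + z * se
print(f"Difference in proportions: {diff:.3f}")
print(f"95% CI: [{ci_low:.3f}, {ci_high:.3f}]")
```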


Effect Size


We know there is a significant difference between the 2 groups' proportions, but the Effect Size will tell us how strong the association is. The appropriate Effect Size to use for this test is Cohen's h, which when calculated yields 0.313. This translates to a small difference in proportions, and we already know it is statistically significant. Here I'm using proportion_effectsize() from statsmodels, but you can also calculate it directly if needed.
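Using the proportions computed above, the calculation is a one-liner:

```python
from statsmodels.stats.proportion import proportion_effectsize

# Cohen's h for the difference between the two proportions
cohens_h = proportion_effectsize(p_a, p_b)
print(f"Cohen's h: {cohens_h:.3f}")
```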


Odds Ratio


The Odds Ratio is another Effect Size that is appropriate for categorical or binary data. It is already given by default from the fisher_exact() function, however we can compute it ourselves. Since our data is categorical we can use it here to get a relative distance between the two groups. And we can interpret it as Group A (the new group) being about 2 times as likely to be satisfied as Group B (the control group).
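A quick manual version, again using the proportions from the synthetic data:

```python
# Odds of satisfaction in each group, then their ratio (new group vs control)
odds_a = p_a / (1 - p_a)
odds_b = p_b / (1 - p_b)
print(f"Odds ratio (A vs B): {odds_a / odds_b:.2f}")
```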


Visualization


I'll end with an appropriate visualization for comparing the proportions between 2 groups - a stacked percentage bar for each group.
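Something along these lines; the layout and labels are just one possible choice:

```python
import matplotlib.pyplot as plt

# Percentage of satisfied vs not satisfied users within each group
pct = pd.crosstab(df["group"], df["satisfied"], normalize="index") * 100

fig, ax = plt.subplots(figsize=(6, 4))
pct[[1, 0]].plot(kind="bar", stacked=True, ax=ax, rot=0)
ax.set_xlabel("Group")
ax.set_ylabel("Percentage of users")
ax.legend(["Satisfied", "Not satisfied"], bbox_to_anchor=(1.02, 1), loc="upper left")
ax.set_title("Website satisfaction by group")
plt.tight_layout()
plt.show()
```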