GitHub Repo:
You can find this notebook in my GitHub Repo here:
https://github.com/sam-tritto/statistical-tests/blob/main/Mann_Whitney_U_Test_for_Groups.ipynb
OUTLINE
Introduction to Mann-Whitney U-test
Other Important Considerations
Import Libraries
Determining Sample Size (Power Analysis)
Synthetic Data
Validating Assumptions
Running Mann-Whitney U-test (SciPy)
Running Mann-Whitney U-test (pingouin)
Effect Size
Visualization
In the realm of statistical hypothesis testing, we often encounter situations where we need to compare two independent groups of data. When the assumptions of parametric tests, such as the t-test, are not met (e.g., data is not normally distributed or the variances are unequal), non-parametric alternatives become invaluable. One such powerful and widely used non-parametric test is the Mann-Whitney U-test (also known as the Wilcoxon rank-sum test but NOT the Wilcoxon signed-rank test).
Overview:
The Mann-Whitney U-test is a non-parametric test that assesses whether two independent samples come from the same distribution. Instead of focusing on the means (like the t-test), it examines the ranks of the data points across both groups. Essentially, it determines if one group tends to have larger values than the other. The test calculates a U statistic (or equivalently, a W statistic in some implementations) which is then used to determine the p-value. A significant p-value suggests that there is a statistically significant difference in the distributions of the two groups.
When to Use the Mann-Whitney U-Test:
The Mann-Whitney U-test is particularly useful in the following scenarios:
Ordinal Data: When your data is ordinal, meaning it has a natural order but the intervals between values are not necessarily equal (e.g., satisfaction levels like "Very Dissatisfied," "Dissatisfied," "Neutral," "Satisfied," "Very Satisfied").
Non-Normally Distributed Data: When your data in one or both groups significantly deviates from a normal distribution. Parametric tests rely on the assumption of normality, and violating this assumption can lead to unreliable results.
Small Sample Sizes: While parametric tests can sometimes be robust to deviations from normality with larger sample sizes (due to the Central Limit Theorem), the Mann-Whitney U-test is a more appropriate choice when dealing with small datasets where normality cannot be reasonably assumed.
Unequal Variances: Unlike the independent samples t-test which has a variant (Welch's t-test) to handle unequal variances, the Mann-Whitney U-test doesn't make assumptions about the equality of variances (homoscedasticity).
Assumptions of the Mann-Whitney U-Test:
While being non-parametric, the Mann-Whitney U-test still has a few key assumptions:
Independent Samples: The observations in one group must be independent of the observations in the other group.
Independent Observations Within Each Sample: The observations within each group must be independent of each other.
Ordinal or Continuous Data: The data should be at least ordinal, meaning the values can be ranked. It can also be used with continuous data.
Identical Distribution Under the Null Hypothesis: The null hypothesis assumes that the two populations have the same distribution. The test detects if one population tends to have values that are systematically larger or smaller than the other.
Example Use Case: Comparing User Satisfaction Levels
Let's imagine we are evaluating the satisfaction levels of users with two different versions of a software application (Version A and Version B). We collected satisfaction ratings on a 5-point Likert scale:
1: Very Dissatisfied
2: Dissatisfied
3: Neutral
4: Satisfied
5: Very Satisfied
We have the following satisfaction scores from a sample of users for each version:
Version A Users: 4, 3, 2, 5, 4, 3, 4, 2
Version B Users: 5, 4, 5, 3, 5, 4, 4, 5, 3
Our goal is to determine if there is a statistically significant difference in the satisfaction levels between users of Version A and Version B using the Mann-Whitney U-test. Since the satisfaction levels are ordinal and we might not be able to assume a normal distribution, this test is a suitable choice.
In the sections that follow, I'll demonstrate how to perform this Mann-Whitney U-test in Python using libraries like scipy.stats. This example provides a clear context for understanding the application of the test.
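As a quick preview, here's a minimal sketch of that comparison with scipy.stats.mannwhitneyu (the variable names are my own):

```python
from scipy.stats import mannwhitneyu

# Satisfaction scores from the example above
version_a = [4, 3, 2, 5, 4, 3, 4, 2]
version_b = [5, 4, 5, 3, 5, 4, 4, 5, 3]

# Two-sided test: do the two distributions differ in either direction?
u_stat, p_value = mannwhitneyu(version_a, version_b, alternative="two-sided")
print(f"U = {u_stat}, p = {p_value:.4f}")
```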
Other Important Considerations
For A/B Tests and much of Experimental Design the statistical tests are often only one small step of the process. Here I'll outline some other considerations that I'm not accounting for in this tutorial.
Randomization - Randomly assigning participants to the different groups to reduce bias and ensure the groups being compared are similar in all ways except the treatment.
Multiple Tests & Correcting p-values - If the same hypothesis is being tested multiple times or simultaneously then we need to apply corrections such as a Bonferroni Correction to control the family wise error rate and reduce the likelihood of false positives.
One-Tailed vs Two-Tailed - In a One-Tailed test we're interested in detecting an effect in a specific direction, either positive or negative, while in a Two-Tailed test we're interested in detecting any significant difference, regardless of the direction. I'll stick to using Two-Tailed tests unless otherwise noted.
Guardrail Metrics - Monitor additional metrics to ensure there are no unintended consequences of implementing the new changes. These metrics act as safeguards to protect both the business and users.
Decision to Launch - Consider both statistical and practical significance before determining to launch the changes.
As Professor Allen Downey would say... There is only one test! - If you can mimic the data generating process with a simulation then there's no real need for a statistical test. http://allendowney.blogspot.com/2011/05/there-is-only-one-test.html
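As a quick illustration of the Bonferroni correction mentioned above (the raw p-values here are hypothetical):

```python
# Bonferroni correction: multiply each raw p-value by the number of tests,
# capping at 1.0. This controls the family-wise error rate.
raw_p_values = [0.01, 0.04, 0.03]  # hypothetical p-values from 3 tests
m = len(raw_p_values)
adjusted = [min(1.0, p * m) for p in raw_p_values]
print(adjusted)
```

Equivalently, you can leave the p-values alone and compare each one to alpha / m.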
Import Libraries
There are a few libraries we can use for the Mann-Whitney U-test. I'll show an implementation using SciPy (a tried and true statistical library) as well as pingouin (a slightly newer library with a lot of extra bells and whistles). These are the imports and versions you'll need to follow along.
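A minimal import block might look like the following; your exact versions will vary, and the pingouin import is guarded since it's an optional extra:

```python
# Core libraries used throughout this tutorial
import numpy as np
import pandas as pd
import scipy
from scipy.stats import mannwhitneyu

print("numpy:", np.__version__)
print("pandas:", pd.__version__)
print("scipy:", scipy.__version__)

# pingouin is optional, but provides richer output later in the tutorial
try:
    import pingouin as pg
    print("pingouin:", pg.__version__)
except ImportError:
    print("pingouin not installed -- run: pip install pingouin")
```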
Determining Sample Size (Power Analysis)
Why Calculate Sample Size?
Before conducting an experiment or study where you plan to use the Mann-Whitney U-test (or any hypothesis test), it's crucial to estimate the required sample size. This process, often called a priori power analysis, helps ensure your study has a good chance of detecting a statistically significant result if a true effect of a certain magnitude exists.
Statistical Power: This is the probability (1−β) of correctly rejecting the null hypothesis when the alternative hypothesis is true. In simpler terms, it's the probability of finding a significant difference if a real difference of a specific size actually exists. Commonly desired power levels are 80% (0.8) or 90% (0.9).
Significance Level (α): This is the probability of making a Type I error – rejecting the null hypothesis when it's actually true (a "false positive"). This is typically set at 5% (0.05).
Effect Size: This quantifies the magnitude of the difference you expect or want to be able to detect between the two groups (e.g., the difference between mean1 and mean2, often standardized).
Data Variability: The more variable the data, the larger the sample needed to detect a given effect. Note also that the sample sizes of the two groups need not be equal for this test, though they often are.
Calculating sample size helps you balance resources: avoiding studies that are too small (underpowered) and thus likely to miss real effects, and avoiding studies that are unnecessarily large (overpowered), wasting resources or potentially exposing more participants than needed to experimental conditions.
Challenges for Mann-Whitney U-Test:
This function and plot are based on an approximation using t-test power analysis and the true sample size requirements for the Mann-Whitney U-test can be more complex and might require more specialized methods or software. This might not be the most appropriate way to calculate sample sizes when the sample sizes are very small or when the distributions are vastly different. A more precise method might involve simulations.
Exploring the Sample Size and Effect Size Relationship:
When you run this code, you will see a plot showing how the approximate required sample size per group increases as the target effect size decreases. This visually demonstrates the inverse relationship between effect size and the sample size needed to achieve a certain level of statistical power. Let's explore this relationship assuming equal sample sizes.
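One way to sketch this computation, assuming the normal-approximation formula for a two-sample t-test inflated by the U-test's asymptotic relative efficiency (3/pi), is:

```python
import numpy as np
from scipy.stats import norm

def approx_n_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate n per group for a two-sided Mann-Whitney U-test.

    Uses the standard two-sample normal-approximation formula for a
    t-test, then inflates by the asymptotic relative efficiency
    (ARE = 3/pi) of the U-test relative to the t-test.
    """
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    n_t = 2 * ((z_alpha + z_beta) / effect_size) ** 2  # t-test approximation
    return int(np.ceil(n_t / (3 / np.pi)))             # ARE correction

# Smaller target effects require (much) larger samples per group
for d in [0.8, 0.5, 0.2]:
    print(f"effect size {d}: ~{approx_n_per_group(d)} per group")
```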
Calculating Sample Size with the Desired Effect:
Playing around with these expected values can help you determine how many samples you will need to collect per group. Using a function that takes in the expected group means and variability, we can estimate approximately how many samples we will need to achieve our desired statistical significance and power.
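A hypothetical helper along those lines (the function name, inputs, and defaults are my own) might look like:

```python
import numpy as np
from scipy.stats import norm

def n_from_expected_means(mean1, mean2, sd, alpha=0.05, power=0.8):
    """Estimate n per group from expected group means and a pooled SD.

    Converts the expected difference into a standardized effect size,
    applies the two-sample normal-approximation formula, then inflates
    by the U-test's asymptotic relative efficiency (3/pi).
    """
    effect_size = abs(mean1 - mean2) / sd  # Cohen's d style effect
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    n_t = 2 * (z / effect_size) ** 2
    return int(np.ceil(n_t / (3 / np.pi)))

# e.g. expecting mean satisfaction of 3.2 vs 3.8 with SD around 1.1
print(n_from_expected_means(3.2, 3.8, 1.1))
```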
Synthetic Data
Use Case Context: Before we can perform the Mann-Whitney U-test in Python, we need data to work with! Often in tutorials, we generate synthetic data. This is useful because it allows us to:
Control the exact characteristics of our sample data (like the mean, standard deviation, and sample size).
Specifically create a scenario that highlights the strengths of the test we're demonstrating. In this case, we want to create data where the two groups are non-Normal and have possibly unequal variances, making the Mann-Whitney U-test an appropriate choice over other tests.
Code: This code block simulates collecting data for two independent groups (A and B) from populations that are not normally distributed but have different means and possibly different variances. I've weighted the scores differently for each group. The data is then organized into a convenient Pandas DataFrame, ready for analysis. We can use the sample size we found in the previous section for each group.
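Here's one way such data could be generated; the group size and probability weights below are illustrative choices, not necessarily the exact ones used in the notebook:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seed for reproducibility
scores = [1, 2, 3, 4, 5]

# Different probability weights per group give skewed, non-Normal
# distributions with different central tendencies
group_a = rng.choice(scores, size=66, p=[0.10, 0.25, 0.30, 0.25, 0.10])
group_b = rng.choice(scores, size=66, p=[0.05, 0.10, 0.20, 0.35, 0.30])

# Organize into a tidy DataFrame, ready for analysis
df = pd.DataFrame({
    "group": ["A"] * 66 + ["B"] * 66,
    "score": np.concatenate([group_a, group_b]),
})
print(df.groupby("group")["score"].describe())
```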
Validating Assumptions
There aren't many assumptions that the Mann-Whitney U-test makes. The data basically needs to be Independent and Identically Distributed (or I.I.D.). It also should be continuous or ordinal. So we can talk about the assumptions the test does NOT make.
No Assumption of Normality: This is the primary reason for using the Mann-Whitney U-test when the normality assumption of the t-test is violated.
No Assumption of Homoscedasticity (Equal Variances): Unlike the independent samples t-test, the Mann-Whitney U-test does not require the variances of the two groups to be equal.
The Python code snippet below implements functions to check whether each group is normally distributed, confirming that a non-parametric test is the right choice:
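A sketch of such a check using SciPy's Shapiro-Wilk test (plus Levene's test for equal variances), run on hypothetical data:

```python
from scipy.stats import shapiro, levene

def check_normality(group, name, alpha=0.05):
    """Shapiro-Wilk test: a small p-value is evidence against normality."""
    stat, p = shapiro(group)
    verdict = "looks non-Normal" if p < alpha else "no evidence against normality"
    print(f"{name}: W = {stat:.3f}, p = {p:.4f} -> {verdict}")
    return p

# Hypothetical ordinal satisfaction scores for each group
group_a = [4, 3, 2, 5, 4, 3, 4, 2, 1, 3, 2, 4]
group_b = [5, 4, 5, 3, 5, 4, 4, 5, 3, 5, 4, 5]

p_a = check_normality(group_a, "Group A")
p_b = check_normality(group_b, "Group B")

# Levene's test for equal variances (another assumption we do NOT need)
stat, p_var = levene(group_a, group_b)
print(f"Levene's test: p = {p_var:.4f}")
```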
Running Mann-Whitney U-test (SciPy)
This Python code defines a function run_mann_whitney that performs a Mann-Whitney U-test using SciPy. It prints the test statistic and p-value, then interprets the result by comparing the p-value to a specified significance level (alpha, defaulting to 0.05) to determine if there's a statistically significant difference between the groups.
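A minimal version of that function might look like this (the defaults and print format are my assumptions):

```python
from scipy.stats import mannwhitneyu

def run_mann_whitney(group_a, group_b, alpha=0.05):
    """Two-sided Mann-Whitney U-test with a plain-English interpretation."""
    u_stat, p_value = mannwhitneyu(group_a, group_b, alternative="two-sided")
    print(f"U statistic: {u_stat:.1f}")
    print(f"p-value:     {p_value:.4f}")
    if p_value < alpha:
        print("Reject the null: the two distributions differ significantly.")
    else:
        print("Fail to reject the null: no significant difference detected.")
    return u_stat, p_value

# Example satisfaction scores from earlier in the tutorial
u, p = run_mann_whitney([4, 3, 2, 5, 4, 3, 4, 2],
                        [5, 4, 5, 3, 5, 4, 4, 5, 3])
```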
Interpreting the Results:
We then interpret the obtained p_value in relation to a chosen significance level (α, again, typically 0.05).
If the p-value is less than α: This provides evidence against the null hypothesis. We reject the null hypothesis and conclude that there is a statistically significant difference between the distributions of Group A and Group B. In other words, the observed difference between the two samples is unlikely to have occurred by random chance alone.
If the p-value is greater than or equal to α: We fail to reject the null hypothesis. This means that based on our data, there is not enough statistical evidence to conclude that there is a significant difference between the distributions of Group A and Group B. It's important to note that this does not necessarily mean the distributions are identical, only that we haven't found sufficient evidence to say they are different.
Running Mann-Whitney U-test (pingouin)
The pingouin library offers a streamlined and statistically rich way to conduct various statistical tests, including the Mann-Whitney U-test. The benefit of using pingouin is that it computes effect sizes (the rank-biserial correlation and the common language effect size) along with the U statistic and p-value. These are all very handy to have and would have required much more code if using SciPy or statsmodels. The run_mann_whitney_u_test_pingouin function encapsulates this process:
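A sketch of that function's core, with a fallback to SciPy in case pingouin isn't installed (the data here is hypothetical):

```python
group_a = [4, 3, 2, 5, 4, 3, 4, 2]
group_b = [5, 4, 5, 3, 5, 4, 4, 5, 3]

try:
    import pingouin as pg
    # pg.mwu returns a one-row DataFrame including U-val, p-val,
    # RBC (rank-biserial correlation), and CLES
    results = pg.mwu(group_a, group_b, alternative="two-sided")
    print(results)
    p_value = results["p-val"].iloc[0]
except ImportError:
    # Fallback so the example still runs without pingouin installed
    from scipy.stats import mannwhitneyu
    _, p_value = mannwhitneyu(group_a, group_b, alternative="two-sided")

print(f"p-value: {p_value:.4f}")
```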
Interpreting the P-Value:
p_value = results['p-val'].iloc[0]: This line extracts the p-value from the results DataFrame. The p-value is typically located in the first row of the 'p-val' column.
The subsequent if statement performs the hypothesis test:
if p_value < alpha: If the p-value is less than the chosen significance level (alpha), we reject the null hypothesis. This suggests that there is a statistically significant difference between the distributions of Group A and Group B.
else: If the p-value is greater than or equal to alpha, we fail to reject the null hypothesis. This indicates that there is not enough statistical evidence to conclude that the distributions of the two groups are significantly different.
Effect Size
You can measure the effect size, or the magnitude of the difference between the two groups, after performing a Mann-Whitney U-test using several methods. Here are some common approaches and how you can implement them in Python:
1. Rank-Biserial Correlation (r):
This is a commonly reported effect size measure for the Mann-Whitney U-test. It estimates the correlation between the group membership and the ranked outcome variable.
It ranges from -1 to +1, where values closer to the extremes indicate a larger effect.
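Computed directly from the U statistic, a sketch looks like this (using the example satisfaction data from earlier):

```python
from scipy.stats import mannwhitneyu

group_a = [4, 3, 2, 5, 4, 3, 4, 2]
group_b = [5, 4, 5, 3, 5, 4, 4, 5, 3]

u_stat, _ = mannwhitneyu(group_a, group_b, alternative="two-sided")
n1, n2 = len(group_a), len(group_b)

# Rank-biserial correlation: r = 1 - 2U / (n1 * n2),
# where U is the statistic for the first sample
rank_biserial = 1 - (2 * u_stat) / (n1 * n2)
print(f"rank-biserial r = {rank_biserial:.3f}")
```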
2. Common Language Effect Size (CLES) or Probability of Superiority:
CLES represents the probability that a randomly chosen value from one group will be greater than a randomly chosen value from the other group.
A CLES of 0.5 indicates no difference, while values closer to 0 or 1 indicate a larger separation between the groups.
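A direct pairwise computation of CLES (counting ties as one half), again with the example satisfaction data:

```python
import numpy as np

group_a = [4, 3, 2, 5, 4, 3, 4, 2]
group_b = [5, 4, 5, 3, 5, 4, 4, 5, 3]

# CLES: probability that a random draw from B exceeds a random draw
# from A, with ties counted as one half
a = np.array(group_a)
b = np.array(group_b)
wins = (b[:, None] > a[None, :]).sum()
ties = (b[:, None] == a[None, :]).sum()
cles = (wins + 0.5 * ties) / (len(a) * len(b))
print(f"CLES = {cles:.3f}")
```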
3. Cohen's d Approximation:
If your sample size is reasonably large (n > 20 in each group), the Mann-Whitney U statistic can be approximated by a normal distribution. You can calculate a standardized z-score and then relate it to a Cohen's d-like effect size.
Caution: This conversion to Cohen's d is an approximation and should be interpreted with care, as the Mann-Whitney U-test is fundamentally based on ranks, not means and standard deviations.
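A sketch of that conversion, skipping tie corrections for simplicity (note the toy samples here are smaller than the n > 20 guideline, so treat the result as illustrative only):

```python
import numpy as np
from scipy.stats import mannwhitneyu

group_a = [4, 3, 2, 5, 4, 3, 4, 2]
group_b = [5, 4, 5, 3, 5, 4, 4, 5, 3]
n1, n2 = len(group_a), len(group_b)

u_stat, _ = mannwhitneyu(group_a, group_b, alternative="two-sided")

# Normal approximation of U (tie correction omitted for simplicity)
mu_u = n1 * n2 / 2
sigma_u = np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
z = (u_stat - mu_u) / sigma_u

# z -> correlation r -> Cohen's d style effect size
r = z / np.sqrt(n1 + n2)
d = 2 * r / np.sqrt(1 - r**2)
print(f"z = {z:.3f}, r = {r:.3f}, d (approx) = {d:.3f}")
```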
Choosing the Right Effect Size Measure:
Rank-Biserial Correlation (r): Directly related to the U statistic and provides a correlation-like measure.
CLES: Offers an intuitive probability-based interpretation of the difference between the groups.
Cohen's d approximation: Should be used with caution as it tries to map a non-parametric test result onto a parametric effect size metric.
Visualization
A great way to visually compare the distribution of responses between two groups is a side-by-side bar plot, with one color for each group (a percentage bar chart is another great option). First we need to gather the data in a way that's easy to visualize.
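One way to prepare and plot the data, using pandas crosstab for the counts (the data here is hypothetical, and plotting is guarded as optional):

```python
import pandas as pd

# Hypothetical Likert responses per group
df = pd.DataFrame({
    "group": ["A"] * 8 + ["B"] * 9,
    "score": [4, 3, 2, 5, 4, 3, 4, 2] + [5, 4, 5, 3, 5, 4, 4, 5, 3],
})

# Counts per group and score; normalize="columns" gives percentages
counts = pd.crosstab(df["score"], df["group"])
pcts = pd.crosstab(df["score"], df["group"], normalize="columns") * 100
print(pcts.round(1))

# Side-by-side bars, one color per group
try:
    import matplotlib
    matplotlib.use("Agg")  # non-interactive backend
    import matplotlib.pyplot as plt
    counts.plot(kind="bar", rot=0, title="Satisfaction by Version")
    plt.xlabel("Satisfaction score")
    plt.ylabel("Count")
    plt.savefig("satisfaction_by_group.png")
except ImportError:
    pass  # plotting is optional; the data prep above is the key step
```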