GitHub Repo:
You can find this notebook in my GitHub Repo here:
https://github.com/sam-tritto/statistical-tests/blob/main/Wilcoxon_Signed_Rank_Test_for_Medians.ipynb
OUTLINE
Introduction to Wilcoxon Signed-Rank Test
Other Important Considerations
Import Libraries
Determining Sample Size (Power Analysis)
Synthetic Data
Validating Assumptions
Wilcoxon Signed-Rank Test (SciPy)
Wilcoxon Signed-Rank Test (pingouin)
Confidence Intervals
Visualization
Introduction to Wilcoxon Signed-Rank Test
The Wilcoxon Signed-Rank test is a powerful non-parametric statistical method for comparing two related samples. Unlike its parametric counterpart, the paired t-test, the Wilcoxon Signed-Rank test does not assume that your data follows a normal distribution. This makes it a valuable tool when analyzing paired data that may violate the normality assumption, or when dealing with ordinal data where precise interval measurements are not available.
The Wilcoxon Signed-Rank test assesses whether there is a statistically significant difference between two paired groups by considering both the magnitude and the direction of the differences within each pair. It does this by:
Calculating the difference between each pair of observations.
Ranking the absolute values of these differences.
Assigning the sign of the original difference to each rank.
Summing the ranks for the positive differences and the ranks for the negative differences.
The test statistic (often denoted as W) is typically the smaller of these two sums (or sometimes defined differently depending on the software). This statistic is then compared to a critical value or used to calculate a p-value to determine if the observed difference between the paired groups is statistically significant.
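To make these steps concrete, here's a minimal sketch of the mechanics using a small set of hypothetical before/after scores (the values are made up purely for illustration):

```python
import numpy as np
from scipy.stats import rankdata

# Hypothetical paired scores (before and after some intervention)
before = np.array([72, 65, 80, 58, 90, 75, 68, 83])
after = np.array([75, 70, 78, 66, 92, 80, 70, 85])

# 1. Calculate the difference for each pair (zero differences are dropped by convention)
diffs = after - before
diffs = diffs[diffs != 0]

# 2. Rank the absolute values of the differences (ties receive average ranks)
ranks = rankdata(np.abs(diffs))

# 3. & 4. Attach the original signs and sum the positive and negative ranks
w_plus = ranks[diffs > 0].sum()
w_minus = ranks[diffs < 0].sum()

# The test statistic is conventionally the smaller of the two sums
W = min(w_plus, w_minus)
print(f"W+ = {w_plus}, W- = {w_minus}, W = {W}")
```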
While less restrictive than parametric tests, the Wilcoxon Signed-Rank test does have several key assumptions that should be considered:
Paired Data: The data must consist of paired observations. This means that each measurement in one group has a direct corresponding measurement in the other group (e.g., measurements taken from the same subject at two different time points, or matched pairs).
Ordinal Scale: The dependent variable (or the differences between the paired observations) should be measured on at least an ordinal scale, allowing for ranking of the differences.
Symmetry of Differences (around the median): The distribution of the differences between the paired observations is assumed to be symmetric around its median. While the test is more robust to violations of normality than the paired t-test, significant asymmetry in the differences can affect the test's power and the interpretation of the p-value as a test of the median difference.
Independence of Pairs: Each pair of observations must be independent of every other pair. The measurements within a pair are related, but the pairs themselves should not influence each other.
The Wilcoxon Signed-Rank test is particularly useful in the following situations:
Paired Data: When you have measurements from the same subjects or matched pairs under two different conditions.
Non-Normal Data: When the distribution of the differences between the paired observations significantly deviates from a normal distribution, making the paired t-test inappropriate.
Ordinal Data: When your data is on an ordinal scale (ranked data) but not necessarily interval or ratio, and you want to assess if there's a consistent direction of difference.
Small Sample Sizes: While the paired t-test can be used with small samples if normality is met, the Wilcoxon Signed-Rank test provides a robust alternative when normality cannot be assumed in small datasets.
Imagine a company wants to evaluate the effectiveness of a new training program designed to improve employee performance on a specific task. They measure the performance scores of 20 employees before they undergo the training and then measure their scores again after they have completed the program. The data collected for each employee represents a pair of scores (before training, after training).
In this scenario, we have paired data (each employee has two scores). We might suspect that the performance scores are not perfectly normally distributed, or we might want to use a non-parametric test as a more conservative approach. The Wilcoxon Signed-Rank test would be an appropriate statistical method to determine if there is a significant change in performance scores after the training program, considering both the direction and the magnitude of the improvement (or decline) for each employee.
In the subsequent sections of this tutorial, we will use Python to analyze such a dataset and perform the Wilcoxon Signed-Rank test to draw meaningful conclusions about the training program's impact.
Other Important Considerations
For A/B tests and much of experimental design, the statistical test is often only one small step of the process. Here I'll outline some other considerations that I'm not accounting for in this tutorial.
Randomization - Randomly assigning participants to the different groups to reduce bias and ensure the groups being compared are similar in all ways except the treatment.
Multiple Tests & Correcting p-values - If the same hypothesis is being tested multiple times or simultaneously, then we need to apply corrections such as a Bonferroni Correction to control the family-wise error rate and reduce the likelihood of false positives (see the sketch after this list).
One-Tailed vs Two-Tailed - In a One-Tailed test we're interested in detecting an effect in a specific direction, either positive or negative, while in a Two-Tailed test we're interested in detecting any significant difference, regardless of the direction. I'll stick to using Two-Tailed tests unless otherwise noted.
Guardrail Metrics - Monitor additional metrics to ensure there are no unintended consequences of implementing the new changes. These metrics act as safeguards to protect both the business and users.
Decision to Launch - Consider both statistical and practical significance before determining to launch the changes.
As Professor Allen Downey would say... There is only one test! - If you can mimic the data generating process with a simulation then there's no real need for a statistical test. http://allendowney.blogspot.com/2011/05/there-is-only-one-test.html
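As an example of the multiple-testing point above, here's a minimal sketch of a Bonferroni correction using statsmodels (the p-values are hypothetical):

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical raw p-values from three simultaneous tests of the same hypothesis
p_values = [0.04, 0.01, 0.03]

# Bonferroni multiplies each p-value by the number of tests (capped at 1)
reject, p_corrected, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
print(p_corrected)  # [0.12, 0.03, 0.09]
print(reject)       # [False, True, False]
```

Note that after correction, only the second test remains significant at α = 0.05, even though all three raw p-values were below it.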
Import Libraries
There are a few libraries we can use for the Wilcoxon Signed-Rank test. I'll show an implementation using SciPy (a well-known and robust statistical library) as well as one using pingouin (a slightly newer library with a lot of extra bells and whistles). These are the imports and versions you'll need to follow along.
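Here's a minimal set of imports covering everything used below; your installed versions may differ from the ones used in the notebook:

```python
import numpy as np
import pandas as pd
import scipy
from scipy import stats
import pingouin as pg
import matplotlib.pyplot as plt
import seaborn as sns

# Print the versions installed, for reproducibility
for lib in (np, pd, scipy, pg, sns):
    print(f"{lib.__name__}: {lib.__version__}")
```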
Determining Sample Size (Power Analysis)
Before conducting an experiment or study where you plan to use the Wilcoxon Signed-Rank test (or any hypothesis test), it's crucial to estimate the required sample size. This process, often related to a priori power analysis, helps ensure your study has a good chance of detecting a statistically significant result if a true effect of a certain magnitude exists in the paired differences.
Statistical Power: This is the probability (1−β) of correctly rejecting the null hypothesis when the alternative hypothesis is true. In simpler terms, it's the probability of finding a significant difference between the paired conditions if a real difference of a specific size actually exists in the population of paired differences. Commonly desired power levels are 80% (0.8) or 90% (0.9).
Significance Level (α): This is the probability of making a Type I error – rejecting the null hypothesis when it's actually true (a "false positive"). This is typically set at 5% (0.05).
Effect Size: For the Wilcoxon Signed-Rank test, effect size quantifies the magnitude of the difference you expect or want to be able to detect between the paired conditions. While Cohen's d is commonly used for parametric tests, for non-parametric tests like the Wilcoxon Signed-Rank test, standardized effect size measures based on ranks (e.g., Cliff's delta or rank-biserial correlation) are more appropriate. However, for power analysis approximations, a standardized mean difference (like Cohen's d of the differences) is often used as a proxy.
Number of Pairs: The sample size for the Wilcoxon Signed-Rank test refers to the number of pairs of observations in your study.
Calculating sample size helps you balance resources: avoiding studies that are too small (underpowered) and thus likely to miss real effects in the paired differences, and avoiding studies that are unnecessarily large (overpowered), wasting resources or potentially exposing more participants than needed to experimental conditions.
Challenges for Wilcoxon Signed-Rank Test Sample Size Estimation:
Direct analytical sample size calculation for the Wilcoxon Signed-Rank test can be complex and is not as straightforward as it is for parametric tests like the t-test. The function provided here uses an approximation based on the power analysis for a paired t-test applied to the differences.
This approximation relies on the idea that the Wilcoxon Signed-Rank test is often used as a non-parametric alternative to the paired t-test. Therefore, we can estimate the required sample size by considering the standardized effect size of the differences we want to detect.
Important Considerations:
This method is an approximation and might not be perfectly accurate, especially for very small sample sizes or when the distribution of the differences is highly non-normal or asymmetric.
More precise sample size calculations for non-parametric tests often involve simulations or specialized statistical software designed for power analysis of non-parametric tests.
The effect size used in this approximation (often a proxy like Cohen's d of the differences) needs to be carefully considered based on prior research, theoretical expectations, or a pilot study.
Despite these challenges, using a paired t-test power analysis on the expected differences provides a useful starting point for estimating the sample size needed for a study employing the Wilcoxon Signed-Rank test. Remember to interpret the results as an approximation and be aware of the potential limitations.
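Here's a sketch of what such an approximation function might look like, using statsmodels' TTestPower solver (which covers the one-sample/paired t-test case); the function name is my own:

```python
import math
from statsmodels.stats.power import TTestPower

def approx_wilcoxon_sample_size(effect_size, alpha=0.05, power=0.80):
    """Approximate the number of pairs for a Wilcoxon Signed-Rank test by
    solving the power equation for a paired t-test on the differences.
    effect_size is Cohen's d of the paired differences."""
    n_pairs = TTestPower().solve_power(
        effect_size=effect_size, alpha=alpha, power=power, alternative="two-sided"
    )
    return math.ceil(n_pairs)
```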
Exploring the Sample Size and Effect Size Relationship:
When you run this code, you will see a plot showing how the approximate required number of pairs increases as the target effect size decreases. This visually demonstrates the inverse relationship between effect size and the sample size needed to achieve a certain level of statistical power. Let's explore this relationship.
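A sketch of what that exploration could look like, reusing the approximation function from the sketch above:

```python
import numpy as np
import matplotlib.pyplot as plt

# Sweep a range of target effect sizes and record the required number of pairs
effect_sizes = np.arange(0.1, 1.01, 0.05)
sample_sizes = [approx_wilcoxon_sample_size(d) for d in effect_sizes]

plt.plot(effect_sizes, sample_sizes, marker="o")
plt.xlabel("Effect size (Cohen's d of the differences)")
plt.ylabel("Approximate number of pairs")
plt.title("Required sample size vs. effect size (alpha=0.05, power=0.80)")
plt.show()
```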
Calculating Sample Size with the Desired Effect:
Playing around with these expected values can help you determine how many samples you will need to collect. Now we can estimate approximately how many pairs we will need for our desired statistical significance and power, given the effect we anticipate measuring.
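For example, assuming we expect a medium effect of roughly d = 0.5 in the differences:

```python
# Approximate number of pairs for a medium effect at alpha=0.05 and 80% power
n_pairs = approx_wilcoxon_sample_size(effect_size=0.5, alpha=0.05, power=0.80)
print(f"Approximate number of pairs needed: {n_pairs}")  # about 34 for these inputs
```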
Synthetic Data
Use Case Context: Before we can perform the Wilcoxon Signed-Rank Test in Python, we need data to work with! Often in tutorials, we generate synthetic data. This is useful because it allows us to:
Control the exact characteristics of our sample data (like the mean, standard deviation, and sample size).
Specifically create a scenario that highlights the strengths of the test we're demonstrating. In this case, we want to create data where the differences between the before and after measurements are likely non-normal, making the Wilcoxon Signed-Rank Test the appropriate choice over other tests.
Code: This code block simulates collecting data for one group before and after an intervention (A and B) from a population that is not necessarily normally distributed. We can use the sample size we found in the previous section. You can see that I'm using the Exponential distribution to skew the data, as well as clipping the values at 0 and 100 since they represent a score from 0 to 100.
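A sketch of one way to generate such data; the distribution parameters here are my own choices for illustration and may differ from the notebook's:

```python
import numpy as np

rng = np.random.default_rng(42)
n_pairs = 34  # the number of pairs from the power analysis above

# Skewed baseline scores via an Exponential distribution, clipped to the 0-100 scale
before = np.clip(40 + rng.exponential(scale=15, size=n_pairs), 0, 100)

# "After" scores: baseline plus a small, also skewed, shift
# (centered so some scores decline rather than all improving)
after = np.clip(before + rng.exponential(scale=6, size=n_pairs) - 3.0, 0, 100)
```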
Validating Assumptions
The key assumptions for the Wilcoxon Signed-Rank test are:
Paired Data: The data must come from paired observations (e.g., before-and-after measurements on the same subjects, or matched pairs). This is a design assumption and cannot be checked with Python code; it's about how your data was collected.
Ordinal Scale: The dependent variable (or the differences between pairs) should be measured on at least an ordinal scale. This is also a measurement assumption.
Symmetry of Differences (around the median): The distribution of the differences between the paired observations should be symmetric. This is the most important statistical assumption for the Wilcoxon Signed-Rank test. If the distribution of differences is highly skewed, the test might not be robust, and its interpretation (as a test of the median difference) could be misleading.
Independence of Pairs: Each pair of observations must be independent of every other pair. This is another design assumption.
Of these, the symmetry of differences is the one you can most directly investigate using Python on your actual data. While the Wilcoxon Signed-Rank test is often considered robust to minor deviations from symmetry, severe asymmetry can affect its validity.
The Python code snippet below offers one way to assess the symmetry of the paired differences, both visually and numerically:
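A sketch, assuming the before and after arrays from the synthetic data step:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

differences = after - before

# Visual check: the histogram of differences should look roughly symmetric
plt.hist(differences, bins=15, edgecolor="black")
plt.axvline(np.median(differences), color="red", linestyle="--", label="median")
plt.legend()
plt.title("Distribution of paired differences")
plt.show()

# Numerical check: sample skewness near 0 is consistent with symmetry
print(f"Skewness of differences: {stats.skew(differences):.3f}")
```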
Wilcoxon Signed-Rank Test (SciPy)
This code defines a function that takes paired data, calculates the differences between the pairs, performs the Wilcoxon Signed-Rank test using scipy.stats.wilcoxon, prints the test statistic and p-value, and then interprets the result based on a significance level of 0.05.
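A sketch of what that function might look like (the function name is my own; only scipy.stats.wilcoxon comes from the library):

```python
import numpy as np
from scipy import stats

def run_wilcoxon_signed_rank_test(data_before, data_after, alpha=0.05):
    """Run the Wilcoxon Signed-Rank test on the paired differences."""
    differences = np.asarray(data_after) - np.asarray(data_before)
    statistic, p_value = stats.wilcoxon(differences, alternative="two-sided")
    print(f"W statistic: {statistic:.3f}")
    print(f"p-value: {p_value:.4f}")
    if p_value < alpha:
        print("Reject the null hypothesis: there is a significant difference.")
    else:
        print("Fail to reject the null hypothesis: no significant difference detected.")
    return statistic, p_value

run_wilcoxon_signed_rank_test(before, after)
```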
Interpreting the Results:
We then interpret the obtained p_value in relation to a chosen significance level (α, again, typically 0.05).
If the p-value is less than α: This provides evidence against the null hypothesis. We reject the null hypothesis and conclude that there is a statistically significant difference between the paired conditions; that is, the median of the paired differences is unlikely to be zero. In other words, the observed differences are unlikely to have occurred by random chance alone.
If the p-value is greater than or equal to α: We fail to reject the null hypothesis. This means that based on our data, there is not enough statistical evidence to conclude that there is a significant difference between the paired conditions. It's important to note that this does not necessarily mean the median difference is zero, only that we haven't found sufficient evidence to say it differs from zero.
Wilcoxon Signed-Rank Test (pingouin)
This Python function run_wilcoxon_signed_rank_test_pingouin takes two paired datasets (data_before, data_after) and an optional alternative hypothesis. It uses the pingouin.wilcoxon function to perform the Wilcoxon Signed-Rank test, displaying the results (including the W-statistic, p-value, and effect sizes like RBC and CLES). It then interprets the p-value against a significance level of 0.05, printing whether to reject or fail to reject the null hypothesis of no difference between the paired conditions. Under the hood, SciPy's method parameter, left at its default of "auto", lets the library decide whether to use the exact distribution or a normal approximation based on the sample size.
You can see that we get the same results as from SciPy. One of the things that makes this library so nice is that the effect sizes (RBC and CLES) are computed automatically for you. If we were relying solely on SciPy, we'd have to compute those next. The one thing pingouin leaves out here is the confidence interval, which we will look at next. Pingouin typically includes confidence intervals for other tests, but not for Wilcoxon.
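Here's a sketch of what that function might look like, built around pingouin.wilcoxon:

```python
import pingouin as pg

def run_wilcoxon_signed_rank_test_pingouin(data_before, data_after,
                                           alternative="two-sided", alpha=0.05):
    """Wilcoxon Signed-Rank test via pingouin, returning the results DataFrame."""
    # pingouin passes extra keyword arguments through to scipy.stats.wilcoxon,
    # where method="auto" is the default behavior
    results = pg.wilcoxon(data_before, data_after, alternative=alternative)
    print(results)  # includes W-val, p-val, RBC, and CLES columns
    p_value = results["p-val"].iloc[0]
    if p_value < alpha:
        print("Reject the null hypothesis of no difference between the paired conditions.")
    else:
        print("Fail to reject the null hypothesis.")
    return results

run_wilcoxon_signed_rank_test_pingouin(before, after)
```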
Confidence Intervals
Sometimes it's nice to know your bounds. This code snippet calculates a bootstrapped confidence interval for the median of the differences between two paired observations.
Why is this needed?
While the Wilcoxon Signed-Rank test tells you if there's a statistically significant difference between paired groups, it doesn't directly quantify the range of plausible values for the median of that difference in the population. A confidence interval provides this crucial information.
Bootstrapping is used here as a non-parametric method to estimate this confidence interval. It works by repeatedly resampling the observed differences with replacement and calculating the median for each resampled set. The confidence interval is then derived from the distribution of these bootstrapped medians. The function supports many statistics out of the box, such as means and correlations, but for the median we can pass in a custom function, np.median.
This approach is particularly useful when the distribution of the differences might not be normal, as it doesn't rely on parametric assumptions. It gives a more robust estimate of the uncertainty around the median difference compared to methods that assume a specific distribution. The paired=True argument in pg.compute_bootci is important here as it correctly handles the dependency between the 'before' and 'after' measurements during the resampling process (though for a univariate function like the median of the differences, the direct application to the differences array already captures this pairing). The confidence=1-alpha sets the confidence level (e.g., 0.95 for a 95% interval).
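A sketch of that calculation, assuming the same before/after arrays (the n_boot and seed values are my own choices):

```python
import numpy as np
import pingouin as pg

alpha = 0.05
differences = np.asarray(after) - np.asarray(before)

# Bootstrap the median of the differences: resample with replacement many times,
# then derive the interval from the distribution of bootstrapped medians
ci = pg.compute_bootci(differences, func=np.median,
                       confidence=1 - alpha, n_boot=10000, seed=42)
print(f"95% bootstrapped CI for the median difference: {ci}")
```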
Visualization
A great way to visually compare the distributions of before and after observations from continuous data is a KDE plot. First we need to gather the data in a shape that's easy to visualize.
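A sketch of the reshaping and plotting, assuming the same before/after arrays:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Reshape to long format: one row per observation, with a column for the condition
df_long = pd.DataFrame({"Before": before, "After": after}).melt(
    var_name="Condition", value_name="Score"
)

# Overlaid KDE plots make shifts in the score distributions easy to see
sns.kdeplot(data=df_long, x="Score", hue="Condition", fill=True, alpha=0.4)
plt.title("Before vs. After score distributions")
plt.show()
```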