GitHub Repo:
You can find this notebook in my GitHub Repo here:
https://github.com/sam-tritto/statistical-tests/blob/main/One_Way_ANOVA.ipynb
OUTLINE
Introduction to the One-Way ANOVA Test for Comparing Multiple Means
Formulating the Hypotheses
Import Libraries
Determining Sample Size (Power Analysis)
Synthetic Data
Validating Assumptions
The One-Way ANOVA Test
Post-Hoc Tests
Visualization
ANOVA is a powerful statistical tool that allows us to compare the means of two or more independent groups to determine if there are any statistically significant differences between them. In the context of A/B/C/D testing, this becomes invaluable for understanding which variation of a particular element (like a website landing page, advertisement, or email subject line) performs significantly better than others.
Imagine you're running an A/B/C/D test. This common practice in data-driven decision-making involves presenting different versions of something to distinct groups of users and then measuring their responses to see which version yields the best results. Our focus here is on scenarios where the outcome we're measuring is a continuous variable, such as time spent on a website or revenue generated. (Conversion rates are proportions; they can be analyzed with ANOVA after certain transformations or with other methods entirely, but we'll stick to continuous metrics for this introduction.)
ANOVA shines in situations where we have more than two variations to compare. While a simple t-test can compare the means of two groups, ANOVA provides a more efficient and statistically sound way to analyze the differences when we have three or more groups. By analyzing the variance within each group and the variance between the group means, ANOVA helps us determine if the observed differences are likely due to a real effect of the variations or simply random chance.
In our A/B/C/D testing context, ANOVA will help us answer the crucial question: Is there a statistically significant difference in the average performance metric (e.g., time spent on site) across the four different variations (A, B, C, and D)? Then we will dive into Post-hoc tests to determine which group is significantly different from the rest, and also how different it is.
Formulating the Hypotheses
In statistical hypothesis testing, we set up two competing statements: the null hypothesis and the alternative hypothesis.
Null Hypothesis (H₀): The null hypothesis states that there is no significant difference in the population means of the groups being compared. In our landing page example, the null hypothesis would be that the average time spent on the site is the same for all four landing page variations (A, B, C, and D). Mathematically, this can be represented as: μA = μB = μC = μD where μA, μB, μC, and μD represent the population mean time spent on the site for landing pages A, B, C, and D, respectively.
Alternative Hypothesis (H₁): The alternative hypothesis states that there is a significant difference in the population means of at least one pair of groups. In our example, this means that at least one of the landing page variations has a different average time spent on the site compared to one or more of the other variations. Mathematically, this can be represented as: At least one μᵢ ≠ μⱼ for some i ≠ j, where i, j ∈ {A, B, C, D} It's important to note that the alternative hypothesis doesn't specify which groups are different, only that at least one difference exists.
ANOVA will help us decide whether to reject the null hypothesis in favor of the alternative hypothesis based on the observed data.
Import Libraries
There are a few libraries we can use for ANOVA. I'll show an implementation using SciPy (a well-known and robust statistical library) as well as using pingouin (a slightly newer library with a lot of extra bells and whistles). These are the imports and versions you'll need to follow along.
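The exact versions are listed in the notebook; since yours may differ, a minimal sketch of the imports used throughout this walkthrough, with a version printout for reproducibility, might look like this:

```python
# Core libraries for data handling, power analysis, the tests themselves, and plotting
import numpy as np
import pandas as pd
import scipy
from scipy import stats
import pingouin as pg
import statsmodels
from statsmodels.stats.power import FTestAnovaPower
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

# Print the versions you're running so others can follow along
for lib in (np, pd, scipy, pg, statsmodels, matplotlib, sns):
    print(lib.__name__, lib.__version__)
```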
Determining Sample Size (Power Analysis)
First, I'm going to perform a power analysis to strategically plan our ANOVA test. This is a crucial step to determine the necessary sample size for your study before you collect any data. The goal is to ensure that if a real difference exists between the groups you are comparing, your test will have a high probability of detecting it (this probability is your desired "power").
My approach here involves exploring how the required sample size changes depending on the magnitude of the effect we are trying to find. By calculating the necessary sample size for a range of potential effect sizes and visualizing this relationship, we can make a more informed decision about how much data we'll need to collect for our ANOVA test to be meaningful and reliable. This helps us balance the desire to detect even small effects with the practical limitations of data collection.
Cohen's f
Cohen's f is essentially a standardized measure of the variability of the group means around the overall mean.
It's analogous to Cohen's d, which measures the standardized difference between two group means. Cohen's f extends this concept to situations with more than two groups.
A larger Cohen's f indicates a greater degree of difference between the group means relative to the variability within each group. This suggests a stronger "treatment effect" or a more substantial association between the independent variable (group membership) and the dependent variable.
Jacob Cohen (the statistician who developed this measure) provided widely used guidelines for interpreting the magnitude of Cohen's f:
Small effect: f = 0.10
Medium effect: f = 0.25
Large effect: f = 0.40
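To make that concrete, Cohen's f is the standard deviation of the group means around the grand mean divided by the common within-group standard deviation. Here's a small sketch using purely hypothetical planning values (the means and standard deviation below are illustrative assumptions, not values from the notebook):

```python
import numpy as np

# Hypothetical planning values: anticipated mean time-on-page (seconds)
# for each variation, plus a common within-group standard deviation
group_means = np.array([120.0, 125.0, 130.0, 122.0])
sd_within = 30.0

# Cohen's f = (std. dev. of group means around the grand mean) / (within-group std. dev.)
grand_mean = group_means.mean()
sigma_means = np.sqrt(np.mean((group_means - grand_mean) ** 2))
cohens_f = sigma_means / sd_within
print(f"Cohen's f: {cohens_f:.3f}")
```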
Exploring the Sample Size and Effect Size Relationship:
When you run this code, you will see a plot showing how the approximate required sample size per group increases as the target effect size decreases. This visually demonstrates the inverse relationship between effect size and the sample size needed to achieve a certain level of statistical power. Let's explore this relationship.
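The exact code lives in the notebook, but a sketch along these lines using statsmodels' FTestAnovaPower should reproduce the idea. Note that solve_power() returns the total sample size across all groups, so we divide by the number of groups; the alpha, power, and range of effect sizes are assumptions you'd tune for your own test.

```python
# Sweep a range of effect sizes (Cohen's f) and solve for the required
# sample size at alpha = 0.05 and power = 0.80, assuming 4 groups
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.stats.power import FTestAnovaPower

alpha, power, k_groups = 0.05, 0.80, 4
effect_sizes = np.linspace(0.10, 0.40, 25)  # small to large effects

analysis = FTestAnovaPower()
# solve_power() returns the required TOTAL sample size; divide by k for per-group
n_per_group = [
    analysis.solve_power(effect_size=f, alpha=alpha, power=power, k_groups=k_groups) / k_groups
    for f in effect_sizes
]

plt.plot(effect_sizes, n_per_group)
plt.xlabel("Effect size (Cohen's f)")
plt.ylabel("Approx. required sample size per group")
plt.title("Sample size vs. effect size (alpha=0.05, power=0.80, k=4)")
plt.show()
```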
Calculating Sample Size with the Desired Effect:
Playing around with these expected values can help you determine how many samples you will need to collect. Now we can calculate approximately how many samples we'll need for our chosen significance level and power, given the effect size we anticipate measuring.
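For example, to see roughly how many samples we'd need to detect a medium effect (f = 0.25) at alpha = 0.05 with 80% power across four variations (these targets are just the conventional defaults; swap in whatever matters for your test):

```python
# Solve for the sample size needed to detect a "medium" effect (f = 0.25)
# at alpha = 0.05 with 80% power across 4 variations
import math
from statsmodels.stats.power import FTestAnovaPower

effect_size, alpha, power, k_groups = 0.25, 0.05, 0.80, 4

total_n = FTestAnovaPower().solve_power(
    effect_size=effect_size, alpha=alpha, power=power, k_groups=k_groups
)
print(f"Total sample size: {math.ceil(total_n)}")
print(f"Per group: {math.ceil(total_n / k_groups)}")
```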
Better Approaches or Considerations:
Define a Meaningful Effect Size: Instead of using a generic "medium" effect size, try to estimate the smallest effect size that would be practically significant for your business. This might involve looking at historical data, considering the cost of implementing changes, or the potential uplift in your key metric.
Consider Variability: The effect size in ANOVA power analysis is often expressed in terms of Cohen's f, which relates the standard deviation of the means to the common within-group standard deviation. A more informed power analysis might involve estimating the expected variability within your groups based on past data.
Iterative Power Analysis: You could perform a sensitivity analysis by trying different effect sizes and desired power levels to see how they impact the required sample size. This can help you understand the trade-offs.
Power Analysis for Post-Hoc Tests: If the overall ANOVA is significant, you'll likely want to perform post-hoc tests to see which specific groups differ. Power analysis can also be considered for these pairwise comparisons, although it's often more complex and might involve adjusting the alpha level (e.g., using Bonferroni correction).
Sequential Testing: For online A/B/C/D tests, you might consider sequential testing methods that allow you to stop the experiment earlier if a significant result is achieved, potentially reducing the overall sample size needed. However, these methods require careful planning and analysis to control error rates.
Bayesian Methods: Bayesian A/B testing offers an alternative framework that focuses on the probability of one variation being better than another, rather than strict null hypothesis testing. Power analysis in a Bayesian context is different and often involves assessing the probability of reaching a desired level of certainty.
Synthetic Data
Consider a company aiming to improve user engagement on their website. They hypothesize that the design and content of their landing page significantly influence how long visitors stay on the site. To test this, they create four distinct versions of the landing page:
Variation A: A minimalist design with a clear call to action.
Variation B: A more visually rich page with multiple images and interactive elements.
Variation C: A long-form page with detailed information and customer testimonials.
Variation D: A video-centric page with a prominent introductory video.
They randomly assign incoming website visitors to one of these four variations. After a week of running the test, they collect data on the time (in seconds) each user spent on the landing page they were exposed to. The goal is to use ANOVA to determine if there's a statistically significant difference in the average time spent on the site across these four different landing page variations.
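The notebook generates its own synthetic data; a minimal sketch of what that could look like is below. The group means, common standard deviation, and per-group sample size are all assumptions chosen for illustration, with variation C deliberately set a bit higher so there's an effect to find. The long-format "group"/"value" table is what pingouin and the plots will use later.

```python
# Simulate time-on-page (seconds) for the four landing page variations
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_per_group = 45  # roughly what the power analysis suggested

groups = {
    "A": rng.normal(loc=120, scale=30, size=n_per_group),
    "B": rng.normal(loc=125, scale=30, size=n_per_group),
    "C": rng.normal(loc=135, scale=30, size=n_per_group),
    "D": rng.normal(loc=122, scale=30, size=n_per_group),
}

# Stack into long format with 'group' and 'value' columns
df = pd.DataFrame(
    [(g, v) for g, values in groups.items() for v in values],
    columns=["group", "value"],
)
print(df.groupby("group")["value"].describe())
```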
Validating Assumptions
Before applying ANOVA, it's crucial to understand and check if the underlying assumptions of the test are reasonably met. Violations of these assumptions can affect the validity of the ANOVA results. The key assumptions are:
Normality: The data within each of the groups being compared should be approximately normally distributed. This means that if you were to plot the distribution of the data for each variation, it should resemble a bell-shaped curve. While ANOVA is somewhat robust to minor deviations from normality, significant departures can impact the test's accuracy, especially with small sample sizes. We can use statistical tests (like the Shapiro-Wilk test) or visual inspections (like histograms and Q-Q plots) to assess normality.
Homogeneity of Variances (Homoscedasticity): The variance (a measure of the spread or dispersion) of the data should be roughly equal across all the groups being compared. If the variances are significantly different (heteroscedasticity), it can lead to an increased risk of Type I or Type II errors. We can assess this assumption using statistical tests like Levene's test or Bartlett's test.
Independence of Observations: The observations within each group should be independent of each other, and the observations between different groups should also be independent. This means that one user's time spent on landing page A should not influence another user's time spent on landing page A or any other page. Random sampling and assignment of users to different variations help ensure this independence.
It's important to note that in real-world scenarios, these assumptions might not always be perfectly met. Understanding the degree of violation and its potential impact on the ANOVA results is crucial for drawing valid conclusions.
Q-Q plots can help us determine whether any group deviates from a Normal distribution. Here's the SciPy implementation.
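Assuming the groups dictionary from the synthetic data sketch above, something like this draws a Q-Q plot per group with scipy.stats.probplot; I've also tacked on the Shapiro-Wilk test mentioned earlier for a numeric check:

```python
# Shapiro-Wilk test and Q-Q plot for each variation using SciPy
from scipy import stats
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, (name, values) in zip(axes, groups.items()):
    stat, p = stats.shapiro(values)               # Shapiro-Wilk normality test
    stats.probplot(values, dist="norm", plot=ax)  # Q-Q plot against a normal distribution
    ax.set_title(f"Variation {name} (Shapiro-Wilk p={p:.3f})")
plt.tight_layout()
plt.show()
```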
pingouin offers a similar normality check. Here's the pingouin implementation; notice that the results come back in a nice pandas dataframe by default.
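A sketch of the pingouin version, run against the long-format DataFrame from the synthetic data sketch; pingouin's normality() defaults to the Shapiro-Wilk test, and it also provides its own qqplot() helper:

```python
# Shapiro-Wilk test per group; returns a tidy DataFrame with W, p-value, and a normality flag
import pingouin as pg
import matplotlib.pyplot as plt

print(pg.normality(df, dv="value", group="group"))

# pingouin can also draw the Q-Q plots directly
for name, values in groups.items():
    fig, ax = plt.subplots()
    pg.qqplot(values, dist="norm", ax=ax)
    ax.set_title(f"Variation {name}")
plt.show()
```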
Levene's test can help us determine whether any group's variance is unlike the rest. Here's the SciPy implementation.
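Again assuming the groups dictionary from above, the SciPy version is a one-liner (center="median" is the robust default, sometimes called the Brown-Forsythe variant):

```python
# Levene's test for equal variances across the four groups
from scipy import stats

stat, p = stats.levene(groups["A"], groups["B"], groups["C"], groups["D"], center="median")
print(f"Levene statistic: {stat:.3f}, p-value: {p:.3f}")
```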
And here's the pingouin implementation.
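A sketch of the pingouin version on the long-format DataFrame:

```python
# pingouin's Levene test; returns W, p-value, and an equal_var flag in a DataFrame
import pingouin as pg

print(pg.homoscedasticity(df, dv="value", group="group", method="levene"))
```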
Independence Assumption
Checking for independence is usually done by considering the experimental design:
Random Assignment: Were participants or experimental units randomly assigned to each variation? Random assignment helps ensure that the groups are comparable and that observations across groups are independent.
Within-Group Independence: Are the observations within each group independent of each other? For example, if you are measuring the time spent by users on a website, each user's session should ideally be independent of other users' sessions. If there are dependencies (e.g., repeated measures on the same user without accounting for it), ANOVA might not be appropriate for the raw data.
No Systematic Bias: Was the data collection process free from any systematic biases that could introduce dependencies between observations?
The One-Way ANOVA Test
The f_oneway() function is very straightforward. You simply pass in the test groups and in return get the test statistic and p-value. If the p-value is less than your pre-determined alpha, then you can infer that at least one of the group means is likely different from the rest. To find out which group, you will need a post-hoc test.
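Assuming the groups dictionary from the synthetic data sketch, the SciPy call might look like this:

```python
# One-way ANOVA with SciPy: each group's values are passed as a separate argument
from scipy import stats

f_stat, p_value = stats.f_oneway(groups["A"], groups["B"], groups["C"], groups["D"])
print(f"F-statistic: {f_stat:.3f}, p-value: {p_value:.4f}")

alpha = 0.05  # pre-determined significance level
if p_value < alpha:
    print("Reject H0: at least one group mean likely differs from the rest.")
else:
    print("Fail to reject H0: no evidence of a difference in means.")
```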
The anova() function from pingouin is also simple; however, we first need to stack the data for each group into one long-format table with 'group' and 'value' columns, and then specify those column names in the function. In return, we can evaluate the p-value in the same way.
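With the long-format DataFrame from the synthetic data sketch, the pingouin version might look like this:

```python
# One-way ANOVA with pingouin; returns the ANOVA table as a pandas DataFrame
import pingouin as pg

aov = pg.anova(data=df, dv="value", between="group", detailed=True)
print(aov)
print("p-value:", aov.loc[0, "p-unc"])
```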
Post-Hoc Tests
If the ANOVA test yields a statistically significant result (i.e., we reject the null hypothesis), it tells us that there is evidence to suggest that at least one of the group means is different from the others. However, ANOVA itself does not tell us which specific groups are different. This is where post-hoc tests come into play.
Post-hoc tests are pairwise comparisons performed after a significant ANOVA result to identify which specific pairs of group means are significantly different from each other. Several post-hoc tests are available, each with its own assumptions and strengths:
Tukey's Honestly Significant Difference (HSD) Test: A widely used post-hoc test that provides a single step-down procedure for comparing all pairwise means. It controls the family-wise error rate, meaning the probability of making at least one Type I error (falsely rejecting the null hypothesis) across all comparisons. Tukey's HSD is generally a good choice when you have equal or approximately equal sample sizes across groups.
Bonferroni Correction: A more conservative approach that adjusts the significance level (α) for each individual pairwise comparison by dividing it by the total number of comparisons being made. While it effectively controls the family-wise error rate, it can be overly conservative, potentially leading to a higher chance of Type II errors (failing to detect a real difference).
Other Post-Hoc Tests: Other options include Sidak's correction, Scheffé's method, and Dunnett's test (used when comparing multiple groups to a control group). The choice of post-hoc test depends on the specific research question, the number of comparisons, and the characteristics of the data.
In the context of our landing page experiment, if ANOVA reveals a significant difference in average time spent, we would then use a post-hoc test (like Tukey's HSD) to determine which specific landing page variations (e.g., A vs. B, B vs. C, etc.) have significantly different average engagement times. This allows us to pinpoint which landing page design is truly outperforming the others.
Since we have equal sample sizes we can use Tukey's HSD test. Passing the data into the tukey_hsd() function is again pretty straightforward. In return we get a nice summary with p-values and confidence intervals for each group comparison. My only complaint here is that the results are hard to iterate over, since it's a summary print-out.
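A sketch with the groups dictionary from above (scipy.stats.tukey_hsd is available in SciPy 1.8+). Printing the result object gives the summary table; the pairwise p-values and confidence intervals are also exposed on the result if you do want to work with them programmatically:

```python
# Tukey's HSD with SciPy; the result prints as a summary of pairwise comparisons
from scipy import stats

res = stats.tukey_hsd(groups["A"], groups["B"], groups["C"], groups["D"])
print(res)                            # pairwise differences, p-values, and confidence intervals
print(res.pvalue)                     # p-values as a k x k matrix
print(res.confidence_interval(0.95))  # lower/upper bounds as matrices
```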
For fun, there's also an implementation in statsmodels, which has a nice dataframe-style summary and tells you whether to reject the null hypothesis for each group comparison.
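A sketch using statsmodels' pairwise_tukeyhsd on the long-format DataFrame; the reject column flags each significant comparison:

```python
# Tukey's HSD with statsmodels; "reject" tells you whether to reject H0 for each pair
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

tukey = pairwise_tukeyhsd(endog=df["value"], groups=df["group"], alpha=0.05)
print(tukey.summary())

# The summary table converts easily to a pandas DataFrame for filtering
summary_df = pd.DataFrame(tukey.summary().data[1:], columns=tukey.summary().data[0])
print(summary_df)
print(tukey.reject)  # boolean array of reject decisions, one per comparison
```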
And finally in pingouin, also with a nice dataframe summary. This summary doesn't contain the confidence intervals, but it does include the Hedges' g effect size for us, which is really nice. Hedges' g tells you how many standard deviations apart the means of your two groups are, providing a standardized measure of the magnitude of the difference. Because it corrects for small-sample bias, it's often preferred over Cohen's d in situations with smaller sample sizes.
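A sketch of the pingouin version on the long-format DataFrame; by default pairwise_tukey() reports Hedges' g in the hedges column:

```python
# pingouin's Tukey HSD; returns a DataFrame with pairwise p-values and Hedges' g effect sizes
import pingouin as pg

posthoc = pg.pairwise_tukey(data=df, dv="value", between="group", effsize="hedges")
print(posthoc[["A", "B", "diff", "p-tukey", "hedges"]])
```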
Interpretation of Hedges' g:
Around 0.2: Small effect size - the difference between the group means is about 0.2 standard deviations.
Around 0.5: Medium effect size - the difference is about 0.5 standard deviations. This is often considered a practically visible difference.
Around 0.8: Large effect size - the difference is about 0.8 or more standard deviations, indicating a substantial difference between the groups.
Visualization
A great way to visually compare the distributions of means from different groups is a KDE plot.
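The notebook's plot may differ, but one way to visualize the distributions of the group means is to bootstrap each group's mean and overlay KDEs of those bootstrap distributions with seaborn (the number of resamples and the seed below are arbitrary choices):

```python
# Bootstrap each group's mean and compare the resulting sampling distributions with a KDE plot
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(0)
n_boot = 5000

boot_means = {
    name: [rng.choice(values, size=len(values), replace=True).mean() for _ in range(n_boot)]
    for name, values in groups.items()
}

for name, means in boot_means.items():
    sns.kdeplot(means, fill=True, alpha=0.3, label=f"Variation {name}")

plt.xlabel("Bootstrapped mean time on page (seconds)")
plt.title("Bootstrap distributions of the group means")
plt.legend()
plt.show()
```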