GitHub Repo:
You can find this notebook in my GitHub Repo here:
https://github.com/sam-tritto/statistical-tests/blob/main/Mixed_ANOVA.ipynb
OUTLINE
Introduction to the Mixed ANOVA
Formulating the Hypotheses & Interpreting Results
Import Libraries
Determining Sample Size (Power Analysis)
Synthetic Data
Validating Assumptions
Mixed ANOVA Test
Post-Hoc Tests
Visualization
Introduction to the Mixed ANOVA
The Mixed ANOVA (Analysis of Variance) is a statistical test used to compare means when you have at least one within-subjects factor (repeated measures on the same subjects) and at least one between-subjects factor (independent groups). This allows us to examine how different groups respond over time or across different conditions. Sometimes we want to show that an effect is truly a result of the interaction between a group and time, not a result of time alone (sometimes metrics just trend favorably) or of a high-performing group (winners are gonna win). This is a great test to convince yourself and stakeholders of any such effects.
Use Case: Evaluating User Satisfaction with Repeated Survey Exposure
Imagine a user experience research team wants to understand how user satisfaction with a new software feature changes over time as users are repeatedly exposed to a survey about the feature. They also want to see if different user segments (e.g., novice vs. experienced users) show different patterns of satisfaction change.
Factors:
Within-Subjects Factor (Repeated Measures): Time of Survey Exposure (e.g., Survey 1, Survey 2, Survey 3). Each user completes the survey at all three time points.
Between-Subjects Factor: User Experience Level (e.g., Novice Users, Experienced Users). Users are assigned to one of these two groups and remain in that group throughout the study.
Dependent Variable: Continuous metric (e.g., User Satisfaction Score on a scale of 1 to 10).
When to Use this Test:
When you have at least one categorical between-subjects independent variable (groups that are independent of each other).
When you have at least one categorical within-subjects independent variable (the same subjects are measured multiple times under different conditions).
When your dependent variable is continuous.
When you want to examine the main effects of both between-subjects and within-subjects factors, as well as their interaction on the dependent variable.
Formulating the Hypotheses & Interpreting Results
Null Hypotheses (H0):
Main Effect of Between-Subjects Factor: There is no significant difference in the mean satisfaction scores between novice and experienced users, averaged across all time points. H0(Between):μNovice=μExperienced
Main Effect of Within-Subjects Factor: There is no significant difference in the mean satisfaction scores across the different time points, averaged across both user groups. H0(Within):μTime1=μTime2=μTime3
Interaction Effect: There is no significant interaction between user experience level and time of survey exposure on user satisfaction scores. This means the pattern of change in satisfaction over time is the same for both novice and experienced users.
Alternative Hypotheses (H1):
Main Effect of Between-Subjects Factor: There is a significant difference in the mean satisfaction scores between novice and experienced users, averaged across all time points. H1(Between): μNovice ≠ μExperienced
Main Effect of Within-Subjects Factor: There is a significant difference in the mean satisfaction scores across at least two of the different time points, averaged across both user groups. H1(Within):At least one μTime is different
Interaction Effect: There is a significant interaction between user experience level and time of survey exposure on user satisfaction scores. This means the pattern of change in satisfaction over time differs between novice and experienced users.
Interpretation:
After conducting the Mixed ANOVA, you will obtain p-values for each of the main effects and the interaction effect:
P-value for the Between-Subjects Factor (Experience Level):
If p<α (your chosen significance level, e.g., 0.05), you reject the null hypothesis. This suggests that there is a significant difference in average satisfaction scores between novice and experienced users. Post-hoc tests (e.g., t-tests with Bonferroni correction) can be used to determine which specific groups differ.
P-value for the Within-Subjects Factor (Time):
If p<α, you reject the null hypothesis. This suggests that there is a significant change in average satisfaction scores over the different time points. Follow-up analyses (e.g., paired t-tests with Bonferroni correction) can identify which specific time points differ significantly. Remember to check for and address potential violations of sphericity.
P-value for the Interaction Effect (Experience Level × Time):
If p<α, you reject the null hypothesis. This is the most crucial finding in a Mixed ANOVA. It indicates that the way satisfaction changes over time is different for novice and experienced users. To understand this interaction, you would typically examine plots of the means over time for each group and potentially conduct further analyses (e.g., simple effects analyses) to see where the groups diverge.
Import Libraries
There are a few libraries we can use for a Mixed ANOVA. I'll show an implementation using statsmodels (a well-known and robust statistical library built on SciPy) as well as pingouin (a slightly newer library with a lot of extra bells and whistles). Typically I'd rely on statsmodels and SciPy for the power analysis and for testing assumptions, but pingouin has a robust suite of tools for these as well, so in this tutorial I'll mostly work with pingouin. These are the imports and versions you'll need to follow along.
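Here's a minimal set of imports covering everything used below; printing the versions is just a convenience so you can compare against your own environment (any reasonably recent versions should work).

```python
import numpy as np
import pandas as pd
import scipy
import statsmodels
import statsmodels.formula.api as smf
import pingouin as pg
import seaborn as sns
import matplotlib.pyplot as plt

# Print library versions for reproducibility
for lib in (np, pd, scipy, statsmodels, pg, sns):
    print(lib.__name__, lib.__version__)
```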
Determining Sample Size (Power Analysis)
Before conducting a Mixed ANOVA, a power analysis is crucial for determining the minimum sample size needed to detect a statistically significant effect if one truly exists. Statistical power is the probability of correctly rejecting a false null hypothesis. In simpler terms, it's the ability of your study to find a real effect. Without sufficient power, your study might fail to detect a meaningful difference, leading to a Type II error (a false negative). By performing a power analysis a priori (before data collection), you can estimate the necessary number of participants to achieve a desired level of power (typically 0.80 or higher) for a given significance level (alpha, usually 0.05) and an expected effect size. This proactive step helps researchers avoid underpowered studies that waste resources and may lead to inconclusive or misleading results.
For a Mixed ANOVA, the power analysis is tricky because there are both between- and within-subjects factors. My approach here involves focusing on the within-subjects factor (the repeated measures). This is a sensible strategy because repeated measures designs are generally more powerful than between-subjects designs for detecting within-group changes or effects over time due to the reduction of error variance associated with individual differences.
First we need to define some parameters: our desired power and alpha. The correlation between repeated measurements is a unique and important factor in repeated measures designs, as higher correlations reduce the required sample size for a given level of power. We could estimate it from historical data or a pre-test.
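A sketch of those parameters, intended for pingouin's repeated measures ANOVA power function, power_rm_anova(); the specific values here are illustrative assumptions:

```python
alpha = 0.05   # significance level
power = 0.80   # desired statistical power
corr = 0.50    # assumed correlation between repeated measurements
m = 3          # number of repeated measurements (survey years)
```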
Exploring the Sample Size and Effect Size Relationship:
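Here's a minimal sketch of that exploration: looping over a range of eta-squared values and letting power_rm_anova() solve for the sample size (leaving n unspecified tells pingouin to solve for it).

```python
effect_sizes = np.arange(0.02, 0.15, 0.01)  # eta-squared values to explore
sample_sizes = [pg.power_rm_anova(eta_squared=es, m=m, power=power,
                                  alpha=alpha, corr=corr)
                for es in effect_sizes]

plt.plot(effect_sizes, sample_sizes, marker='o')
plt.xlabel('Effect size (eta-squared)')
plt.ylabel('Approx. required sample size per group')
plt.title('Required Sample Size vs. Effect Size (power = 0.80)')
plt.show()
```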
When you run this code, you will see a plot showing how the approximate required sample size per group increases as the target effect size decreases. This visually demonstrates the inverse relationship between effect size and the sample size needed to achieve a certain level of statistical power using pingouin's repeated measures ANOVA power function. Let's explore this relationship.
Eta-squared (η2) is a common measure of effect size in ANOVA. It quantifies the proportion of variance in the dependent variable that is explained by an independent variable or a factor in your model. In simpler terms, it tells you how much of the total variability in your outcome can be attributed to the effect you are investigating.
Partial Eta-Squared (ηp2) is the effect size measure typically reported by statistical software for factorial ANOVA designs (like Mixed ANOVA). It represents the proportion of variance attributable to a factor out of the variance that remains after accounting for the other factors in the model.
Here's how to interpret the values of eta-squared or partial eta-squared, generally following Cohen's guidelines:
η2 < 0.01: Negligible effect. The independent variable explains less than 1% of the variance in the dependent variable.
0.01 ≤ η2 < 0.06: Small effect. The independent variable explains between 1% and 5.9% of the variance.
0.06 ≤ η2 < 0.14: Medium effect. The independent variable explains between 6% and 13.9% of the variance.
η2 ≥ 0.14: Large effect. The independent variable explains 14% or more of the variance in the dependent variable.
Calculating Sample Size with the Desired Effect:
Playing around with these expected values can help you determine how many samples you will need to collect. Now we can estimate approximately how many samples we will need for our chosen significance level and power, given the effect size we anticipate measuring.
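For example, assuming we anticipate a medium within-subjects effect (eta-squared around 0.06):

```python
n_required = pg.power_rm_anova(eta_squared=0.06, m=m, power=power,
                               alpha=alpha, corr=corr)
print(f"Approximate sample size per group: {int(np.ceil(n_required))}")
```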
Better Approaches or Considerations:
This approach however doesn't explicitly model the power needed to detect the interaction effect between the within- and between-subjects factors. Power analysis for interactions in Mixed ANOVA can be complex and might require more specialized techniques or software. We are implicitly assuming that by having sufficient power to detect within-subject effects, we'll also have reasonable power for other effects, but this isn't always guaranteed, especially if the interaction effect is small.
Synthetic Data
This Python code (sketched after the summary list below) synthetically generates data suitable for a Mixed ANOVA. It simulates satisfaction scores of participants with two experience levels (Novice, Experienced) across three time points (2023, 2024, 2025).
In essence, this code creates a fictional dataset where:
Participants are measured repeatedly over time.
There are two distinct groups of participants (based on experience).
The satisfaction scores are designed to show different patterns of change over time for the two experience groups, making it suitable for testing for main effects of experience and time, as well as their interaction using a Mixed ANOVA.
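Here's one way to generate such data; the baseline means, yearly trends, and noise levels below are illustrative assumptions chosen so the two groups diverge over time (the notebook's exact values may differ):

```python
rng = np.random.default_rng(42)
n_per_group = 30

# Illustrative parameters: Experienced users start higher and improve,
# while Novice users start lower and decline, creating an interaction effect
baseline    = {'Experienced': 8.1, 'Novice': 4.8}
yearly_gain = {'Experienced': 0.7, 'Novice': -0.95}

rows = []
uid = 0
for level in ['Experienced', 'Novice']:
    for _ in range(n_per_group):
        uid += 1
        user_offset = rng.normal(0, 0.5)  # stable individual difference
        for t, year in enumerate(['2023', '2024', '2025']):
            score = (baseline[level] + yearly_gain[level] * t
                     + user_offset + rng.normal(0, 0.6))
            rows.append({'User_ID': uid,
                         'Experience_Level': level,
                         'Survey_Year': year,
                         'Satisfaction_Score': np.clip(score, 1, 10)})

df = pd.DataFrame(rows)
```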
Validating Assumptions
Mixed ANOVA has several key assumptions that should be met for the results to be valid:
Normality: The residuals (the difference between the observed values and the values predicted by the model) are normally distributed for each combination of the levels of the between-subjects and within-subjects factors.
Homogeneity of Variances (Between-Subjects): The variances of the residuals are equal across the levels of the between-subjects factor.
Homogeneity of Covariances (Sphericity - Within-Subjects): The variances of the differences between all pairs of repeated measures are equal. If sphericity is violated, corrections like Greenhouse-Geisser or Huynh-Feldt can be applied.
Independence: Observations between different subjects are independent of each other. However, the observations within the same subject are, by design, related.
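Normality
Below is a sketch of the normality check, assuming the df built above. A mixed model with a per-user random intercept stands in for the Mixed ANOVA when extracting residuals:

```python
# Fit a mixed model with a random intercept per user to account for the repeated measures
model = smf.mixedlm("Satisfaction_Score ~ C(Experience_Level) * C(Survey_Year)",
                    data=df, groups=df["User_ID"])
result = model.fit()

# Shapiro-Wilk test on the residuals (H0: residuals are normally distributed)
print(pg.normality(result.resid))
```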
In summary, this code segment fits a Mixed ANOVA model to the data, correctly accounting for the repeated measures. Then it obtains the residuals from this model. Next it uses the Shapiro-Wilk test from pingouin to formally assess whether the distribution of these residuals significantly deviates from a normal distribution. The test results indicate whether the normality assumption is likely met or potentially violated.
We can also visualize the residuals in a histogram and QQ plot to assess normality, as sketched below. This data looks really good.
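Something like the following, reusing the residuals from the model fitted above:

```python
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(result.resid, bins=20)
axes[0].set_title('Histogram of Residuals')
pg.qqplot(result.resid, dist='norm', ax=axes[1])  # QQ plot against a normal distribution
plt.show()
```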
Homogeneity of Variances (Between-Subjects)
This assumption requires that the variance of the residuals should be roughly equal across the different groups of your between-subjects factor. Levene's test is a statistical test used to assess the equality of variances for a variable calculated for two or more groups. It's less sensitive to departures from normality compared to other tests like Bartlett's test, making it a more robust choice when the normality assumption might be violated. I'm using pingouin's homoscedasticity() function with the method set to "levene".
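A sketch of that call on our data, grouping by the between-subjects factor:

```python
print(pg.homoscedasticity(df, dv='Satisfaction_Score',
                          group='Experience_Level', method='levene'))
```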
Homogeneity of Covariances (Sphericity - Within-Subjects)
This code snippet focuses on validating the assumption of sphericity, which is a crucial requirement for repeated measures ANOVA (and therefore the within-subjects part of a Mixed ANOVA) when there are more than two levels of the within-subjects factor (Survey_Year). Sphericity refers to the condition where the variances of the differences between all possible pairs of within-subject conditions are equal. Here, I'm using the sphericity() function from pingouin to check this assumption.
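A sketch of that check (pingouin's sphericity() runs Mauchly's test by default):

```python
if df['Survey_Year'].nunique() > 2:
    spher, W, chi2, dof, pval = pg.sphericity(df, dv='Satisfaction_Score',
                                              subject='User_ID',
                                              within='Survey_Year')
    print(f"Sphericity met: {spher}, W = {W:.3f}, p = {pval:.4f}")
else:
    print("Only two levels -- sphericity is automatically satisfied.")
```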
The code first checks if the number of unique levels in the Survey_Year column (our within-subjects factor) is greater than 2. Sphericity is only a concern when there are three or more repeated measures. If there are only two levels, the assumption of sphericity is automatically met because there's only one pairwise difference.
Independence Assumption
Checking the independence assumption in a Mixed ANOVA is more about the design of your study than a statistical test you can run on the data itself. The independence assumption states that the observations should not influence each other.
Here's how to think about and assess the independence assumption in the context of your Mixed ANOVA design:
Understanding the Data Collection Process:
Between-Subjects Factor (Experience Level): The groups in your between-subjects factor (Novice and Experienced) should be independent. This means that one participant being in the 'Novice' group should not affect whether another participant is in the 'Experienced' group, or their responses. This is typically ensured through random assignment to groups (if it's an experimental design) or by sampling distinct individuals for each group (if it's an observational design).
Within-Subjects Factor (Survey Year): The independence assumption primarily applies between participants, not within the same participant across different time points. By definition, the repeated measures from the same individual are dependent – their satisfaction scores at '2023' are likely related to their scores in '2024' and '2025'. This is why we use repeated measures ANOVA, which is designed to handle this dependency.
No Cross-Contamination: Ensure that the measurements of one participant are not influenced by the measurements or experiences of other participants. For example, if participants in different experience levels interacted and discussed the surveys, their responses might not be independent.
Mixed ANOVA Test
Here I'm using the mixed_anova() test from pingouin. It's pretty straightforward to implement. If your data hasn't passed the sphericity assumption, you can run the test with the correction parameter set to True. Even though our synthetic data passed, I'm using the correction here to show that the output contains information relevant to this assumption, including the corrected p-value.
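A sketch of the call, using the column names from our synthetic data:

```python
aov = pg.mixed_anova(data=df, dv='Satisfaction_Score',
                     within='Survey_Year', subject='User_ID',
                     between='Experience_Level', correction=True)
print(aov.round(4))
```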
Output Table:
SS: This stands for Sum of Squares. It represents the total variability attributed to each source of variance. Larger SS values indicate more variability explained by that factor or interaction.
MS: This stands for Mean Square. It's calculated by dividing the Sum of Squares (SS) by its corresponding Degrees of Freedom (DF) (i.e., MS = SS / DF). MS represents the variance explained by each source.
F: This is the F-statistic. It's the test statistic for ANOVA, calculated by dividing the Mean Square of the effect by the Mean Square of the appropriate error term (e.g., F = MS_effect / MS_error). A larger F-statistic suggests a greater difference between the means relative to the variability within the groups.
p-unc: This is the uncorrected p-value. It represents the probability of observing the data (or more extreme data) if the null hypothesis for that effect were true.
If the p-value is less than your chosen significance level (alpha, typically 0.05), you would reject the null hypothesis and conclude that there is a statistically significant effect for that source of variance.
If the p-value is greater than alpha, you would fail to reject the null hypothesis.
np2: This stands for partial eta-squared (ηp2). As discussed earlier, this is a measure of effect size, indicating the proportion of variance in the dependent variable attributable to each factor or interaction, after partialling out the variance associated with other factors in the model. It ranges from 0 to 1, with higher values indicating a larger effect. You can use Cohen's guidelines (small ≈ 0.01, medium ≈ 0.06, large ≈ 0.14) to interpret the magnitude of the effect.
Inference:
Between-Subjects Main Effect: If the p-value for Experience_Level is significant, there's a significant overall difference in satisfaction scores between novice and experienced users. For us, most of the explained variance is coming from this variable.
Within-Subjects Main Effect: If the p-value for Survey_Year (and its corrected versions if sphericity is violated) is significant, there's a significant overall change in satisfaction scores across the three survey years, when averaging across both experience levels. For us, this is the least impactful variable. We can also see that the p-value is just past our threshold, but not by much.
Interaction Effect (Experience_Level * Survey_Year): If the p-value for the interaction (and its corrected versions if sphericity is violated) is significant, the way satisfaction scores change over the survey years is different for novice and experienced users. This is often the most interesting effect in a Mixed ANOVA. This is the variable we were hoping to show as being the most impactful, and while it is impactful, most of the explained variance comes from the Experience_Level group.
Consider the Effect Sizes (np2): Even if an effect is statistically significant, look at the partial eta-squared to understand the practical significance or magnitude of the effect. A small but significant effect might not be as meaningful as a large and significant effect.
To see a different implementation, now I'm going to use the statsmodels library to perform a Mixed Linear Model (MLM) analysis, which is equivalent to a Mixed ANOVA for the design specified. It aims to examine how Satisfaction_Score is influenced by the between-subjects factor Experience_Level and the within-subjects factor Survey_Year, while accounting for the repeated measurements within each User_ID by treating User_ID as a random effect. The mixedlm() function fits this model, and its summary then displays a comprehensive statistical summary of the model's results, including fixed effects coefficients, random effects variance, and overall model fit statistics.
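A minimal version of that model fit; the formula syntax treats both factors as categorical and includes their interaction:

```python
# Random intercept per user captures the repeated-measures dependency
mlm = smf.mixedlm("Satisfaction_Score ~ C(Experience_Level) * C(Survey_Year)",
                  data=df, groups=df["User_ID"])
mlm_fit = mlm.fit()
print(mlm_fit.summary())
```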
Output Table:
This table presents the results for the fixed effects in your model – Experience_Level and Survey_Year, as well as their interaction.
Coef.: The estimated coefficient for each predictor. These represent the change in Satisfaction_Score associated with a one-unit change in the predictor (or the difference between a level and the reference level for categorical predictors), holding other predictors constant.
Intercept (8.088): This is the estimated mean Satisfaction_Score for the reference group. By default, statsmodels uses the first level alphabetically or as it appears in your data. Assuming 'Experienced' is the first level for Experience_Level and '2023' is the first for Survey_Year, this is the estimated mean satisfaction for Experienced users in 2023.
C(Experience_Level)[T.Novice] (-3.296): This is the estimated difference in mean Satisfaction_Score between Novice users and the reference group (Experienced users), averaged across all survey years. Novice users score about 3.3 points lower than Experienced users.
C(Survey_Year)[T.2024] (0.845): This is the estimated difference in mean Satisfaction_Score in 2024 compared to the reference year (2023), averaged across both experience levels. Satisfaction scores are about 0.85 points higher in 2024 compared to 2023.
C(Survey_Year)[T.2025] (1.356): This is the estimated difference in mean Satisfaction_Score in 2025 compared to 2023, averaged across both experience levels. Satisfaction scores are about 1.36 points higher in 2025 compared to 2023.
C(Experience_Level)[T.Novice]:C(Survey_Year)[T.2024] (-1.417): This is the interaction effect. It represents the additional difference in the effect of Survey Year 2024 (compared to 2023) for Novice users compared to Experienced users. The increase in satisfaction from 2023 to 2024 is about 1.42 points smaller for Novice users than for Experienced users.
C(Experience_Level)[T.Novice]:C(Survey_Year)[T.2025] (-3.285): This interaction effect shows that the additional difference in the effect of Survey Year 2025 (compared to 2023) for Novice users compared to Experienced users is about -3.29 points. The increase in satisfaction from 2023 to 2025 is about 3.29 points smaller for Novice users than for Experienced users.
Std.Err.: The standard error of the estimated coefficient. It measures the precision of the estimate. Smaller standard errors indicate more precise estimates.
z: The z-statistic. This is the coefficient divided by its standard error (z = Coef. / Std.Err.). It tests the null hypothesis that the coefficient is equal to zero. In large samples, this approximates a standard normal distribution.
P>|z|: The p-value associated with the z-statistic. It's the probability of observing a z-statistic as extreme as (or more extreme than) the one calculated if the true coefficient were zero.
If the p-value is less than your chosen significance level (e.g., 0.05), you reject the null hypothesis and conclude that the coefficient is significantly different from zero.
All the p-values here are 0.000, indicating that all the fixed effects in the model (including the intercept, main effects of Experience and Survey Year, and their interaction) are statistically significant.
[0.025 0.975]: The 95% confidence interval for the estimated coefficient. It provides a range of values within which the true coefficient is likely to lie with 95% confidence. If the confidence interval does not include zero, it's another indication that the coefficient is statistically significant at the 0.05 level. All confidence intervals here do not include zero.
Inference:
There is a significant main effect of Experience_Level on Satisfaction_Score. Novice users have significantly lower satisfaction scores than Experienced users on average.
There is a significant main effect of Survey_Year on Satisfaction_Score. Satisfaction scores significantly increased from 2023 to 2024 and further to 2025, on average across both experience levels.
There is a significant interaction effect between Experience_Level and Survey_Year. The pattern of change in satisfaction scores over the survey years is different for Novice and Experienced users. The increase in satisfaction over time is less pronounced (or even negative relative to the Experienced group's trend) for Novice users.
The estimated variance of the random intercepts for User_ID is very close to zero, suggesting minimal variability in baseline satisfaction after accounting for the fixed effects. This warrants further scrutiny to ensure it's a valid result.
Post-Hoc Tests
Now that we've run the Mixed ANOVA test, we can dive deeper into the between-subjects effect, the within-subjects effect, and the interaction.
This test will help to further examine the significant main effect of the between-subjects factor, Experience Level. Since the Mixed ANOVA results indicated a significant overall difference in satisfaction scores between the 'Novice' and 'Experienced' groups, this Tukey's HSD test is used to determine which specific pairs of experience levels are significantly different from each other. With more groups this test would make even more sense for multiple comparisons, but we can still use it for the two groups we have.
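A sketch with statsmodels' pairwise_tukeyhsd(). As a simplifying assumption, I first average each user's scores across years so that Tukey's HSD sees one independent observation per subject:

```python
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# One observation per user: the mean score across the three survey years
per_user = (df.groupby(['User_ID', 'Experience_Level'], as_index=False)
              ['Satisfaction_Score'].mean())

tukey = pairwise_tukeyhsd(endog=per_user['Satisfaction_Score'],
                          groups=per_user['Experience_Level'], alpha=0.05)
print(tukey)
```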
The output of pairwise_tukeyhsd shows a comparison of all of the different groups (we only had 2), the difference in their mean Satisfaction Scores, a confidence interval for this difference, and whether the difference between the means is statistically significant at the specified alpha level (0.05 in this case).
In essence, this code helps pinpoint exactly where the significant differences lie within the levels of your between-subjects factor after finding an overall significant effect in the ANOVA.
Now we can make post-hoc pairwise comparisons for the within-subjects factor (Survey Year), separately for each level of the between-subjects factor (Experience Level). I'm using paired t-tests because the data for different survey years come from the same participants. To account for the increased risk of Type I errors due to multiple comparisons, I'm going to apply a Bonferroni correction to the p-values.
This post-hoc test aims to answer the question: Within each experience level (Novice and Experienced), are there significant differences in satisfaction scores between specific pairs of survey years (2023 vs 2024, 2023 vs 2025, 2024 vs 2025)? The Bonferroni correction adjusts the significance threshold to maintain an overall alpha level of 0.05 across all these comparisons. The output will show the original p-value, the Bonferroni-corrected p-value, and whether the null hypothesis of no difference between the means for each pair of survey years within each experience level is rejected after the correction.
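A sketch of that loop using scipy's paired t-test; the Bonferroni correction simply multiplies each raw p-value by the total number of comparisons:

```python
from itertools import combinations
from scipy import stats

years = sorted(df['Survey_Year'].unique())
pairs = list(combinations(years, 2))
n_tests = len(pairs) * df['Experience_Level'].nunique()  # total comparisons

for level in df['Experience_Level'].unique():
    # Reshape to wide format: one row per user, one column per survey year
    wide = (df[df['Experience_Level'] == level]
            .pivot(index='User_ID', columns='Survey_Year',
                   values='Satisfaction_Score'))
    for y1, y2 in pairs:
        t, p = stats.ttest_rel(wide[y1], wide[y2])
        p_bonf = min(p * n_tests, 1.0)  # Bonferroni-corrected p-value
        print(f"{level}: {y1} vs {y2}  t = {t:.2f}  p = {p:.4f}  "
              f"p_bonf = {p_bonf:.4f}  reject H0: {p_bonf < 0.05}")
```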
The pingouin library has some nice implementations of statistical tests which come with many bells and whistles. I'll use a few for these post-hoc tests. First I'll start with the between-subjects comparisons. With this function you don't get the confidence intervals, but you do get Hedges' g as an effect size.
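A sketch of the between-subjects comparison (in recent pingouin versions the function is pairwise_tests(); in older versions it was named pairwise_ttests()):

```python
between_posthoc = pg.pairwise_tests(data=df, dv='Satisfaction_Score',
                                    between='Experience_Level',
                                    effsize='hedges')
print(between_posthoc)
```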
Again, we could look at Survey Year, the within-subjects effect, but it did not show as significant in the Mixed ANOVA test (it was just over our alpha, at about 0.06), so we don't need to perform anything here. But we can see that this function has a padjust parameter where we can specify a Bonferroni correction, which is awesome.
And finally, the interaction effect, where I can use the pairwise_tests() function. Here, with many comparisons, it's important that we use a Bonferroni correction. Notice how it knows which comparison is paired and which isn't; it will use either a paired or an independent samples t-test as appropriate. We also get a Bayes Factor and Hedges' g by default. Many of the comparisons are significant, but the effect sizes help show the magnitude.
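A sketch of that call, passing both factors so pingouin also produces the interaction contrasts:

```python
posthoc = pg.pairwise_tests(data=df, dv='Satisfaction_Score',
                            within='Survey_Year', between='Experience_Level',
                            subject='User_ID', padjust='bonf',
                            effsize='hedges')
print(posthoc.round(4))
```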
BF10: The Bayes Factor. This provides evidence for the alternative hypothesis over the null hypothesis. A BF10 > 1 suggests evidence for the alternative, and the larger the value, the stronger the evidence.
Hedges: Hedges' g, a measure of effect size for the pairwise difference, similar to Cohen's d but with a correction for small sample sizes.
Visualization
A great way to visually compare the means from different groups in a longitudinal study is a point plot. You can clearly see the trend of each group here. The vertical bars are error bars, set here to represent one standard deviation around each group's mean.
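A sketch with seaborn (the errorbar='sd' argument requires seaborn 0.12 or later; older versions use ci='sd' instead):

```python
ax = sns.pointplot(data=df, x='Survey_Year', y='Satisfaction_Score',
                   hue='Experience_Level', dodge=True, errorbar='sd')
ax.set_ylabel('Mean Satisfaction Score')
ax.set_title('Satisfaction Over Time by Experience Level')
plt.show()
```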