GitHub Repo:
You can find this notebook in my GitHub Repo here:
https://github.com/sam-tritto/statistical-tests/blob/main/MANOVA.ipynb
OUTLINE
Introduction to the MANOVA for Comparing Multiple Means
Formulating the Hypotheses & Interpreting Results
Import Libraries
Determining Sample Size (Power Analysis)
Synthetic Data
Validating Assumptions
MANOVA Test
Post-Hoc Tests
Visualization
MANOVA (Multivariate Analysis of Variance) is a statistical test that extends the Analysis of Variance (ANOVA) to situations where there are two or more continuous dependent variables. Instead of examining the means of a single dependent variable across different groups, MANOVA simultaneously examines the means of multiple dependent variables.
When is MANOVA the Appropriate Test to Use?
MANOVA is appropriate when you have:
One or more categorical independent variables (factors): These define the groups you want to compare (e.g., different marketing campaigns, treatment types).
Two or more continuous dependent variables: These are the outcomes you are measuring and want to see if they differ across the groups (e.g., Click-Through Rate, Spending, Time on Site).
Interest in the overall effect of the independent variable(s) on the combination of dependent variables: MANOVA tests whether there are statistically significant differences between the group means on the vector of dependent variables.
What MANOVA Tells Us:
Overall Multivariate Effect: MANOVA tells us whether there is a statistically significant difference between the groups when considering all the dependent variables together. A significant result indicates that the groups differ in their mean vectors (the collection of means for all dependent variables).
What MANOVA Doesn't Tell Us:
Which Dependent Variables are Different: A significant MANOVA result does not tell you which specific dependent variables are driving the overall difference. It only indicates that there is a difference in the pattern of means across the dependent variables.
Which Groups are Different: Similarly, a significant MANOVA result with more than two groups doesn't tell you which specific pairs of groups are significantly different.
Formulating the Hypotheses & Interpreting Results
Null Hypothesis (H0) in MANOVA:
There is no significant difference in the mean vectors of the dependent variables across the different levels of Factor A, Factor B, or their interaction.
Instead of just the "means of the dependent variable," MANOVA considers the collection of means for all the dependent variables simultaneously. This collection of means for a group forms a "mean vector." The null hypothesis states that these mean vectors are the same across all the levels of each factor and their interactions.
Alternative Hypothesis (H1) in MANOVA:
At least one of the mean vectors of the dependent variables is different across the different levels of Factor A, Factor B, or their interaction.
The alternative hypothesis suggests that for at least one factor or interaction, the groups differ significantly on the pattern of means across the set of dependent variables. This doesn't necessarily mean that the groups differ on every single dependent variable, but rather on some combination of them.
Interpretation of MANOVA Results:
If the p-value associated with a multivariate test statistic (e.g., Pillai's Trace, Wilks' Lambda, Hotelling's Trace, Roy's Largest Root) is below the chosen significance level (e.g., 0.05), reject the null hypothesis.
MANOVA uses multivariate test statistics to assess the overall differences in mean vectors. The p-value associated with these statistics indicates the probability of observing the data (or more extreme data) if the null hypothesis of no difference in mean vectors were true. A low p-value suggests that the observed differences are unlikely to have occurred by chance.
Significant p-values indicate that there is evidence of a significant difference in the multivariate means, either due to Factor A, Factor B, or their interaction.
A significant MANOVA result tells you that the independent variable(s) have a statistically significant effect on the combination of dependent variables. It implies that the groups formed by the factors differ in their profiles across the measured outcomes. However, as mentioned before, it doesn't pinpoint which specific dependent variables or group comparisons are responsible for this overall multivariate effect. This is where post hoc tests (like univariate ANOVAs or specific multivariate post hoc tests) become necessary to further investigate the nature of these differences.
Import Libraries
There are a few libraries we can use for MANOVA. I'll show an implementation using statsmodels (a well-known and robust statistical library built on SciPy) as well as one using pingouin (a slightly newer library with a lot of extra bells and whistles). These are the imports and versions you'll need to follow along.
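Here's a minimal set of imports that covers everything below; if anything behaves differently for you, the pinned versions in the notebook are the safest reference.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pingouin as pg
from scipy import stats
from statsmodels.multivariate.manova import MANOVA
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.stats.power import FTestAnovaPower
```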
Determining Sample Size (Power Analysis)
First I'm going to be performing a power analysis to strategically plan our MANOVA test. This is a crucial step to determine the necessary sample size for your study before you collect any data. The goal is to ensure that if a real difference exists between the groups you are comparing, your test will have a high probability of detecting it (this probability is your desired "power").
My approach here involves exploring how the required sample size changes depending on the magnitude of the effect we are trying to detect. By calculating the necessary sample size for a range of potential effect sizes and visualizing this relationship, we can make a more informed decision about how much data we'll need to collect for our MANOVA test to be meaningful and reliable. This helps us balance the desire to detect even small effects against the practical limitations of data collection. Technically this is a power analysis for a One-Way ANOVA test, but we can repurpose it for MANOVA: if the MANOVA reveals a difference in one of the groups or metrics, we will likely follow up with univariate ANOVAs for each comparison anyway, so we can base our sample size on the power for those.
Cohen's f
Cohen's f is essentially a standardized measure of the variability of the group means around the overall mean: f = σ_means / σ_within, the standard deviation of the group means divided by the common within-group standard deviation.
It's analogous to Cohen's d, which measures the standardized difference between two group means. Cohen's f extends this concept to situations with more than two groups.
A larger Cohen's f indicates a greater degree of difference between the group means relative to the variability within each group. This suggests a stronger "treatment effect" or a more substantial association between the independent variable (group membership) and some dependent variable.
Jacob Cohen (the statistician who developed this measure) provided widely used guidelines for interpreting the magnitude of Cohen's f, with a quick worked example after the list:
Small effect: f = 0.10
Medium effect: f = 0.25
Large effect: f = 0.40
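To make the definition concrete, here's a short sketch that converts a hypothesized set of group means and a common within-group standard deviation into Cohen's f. All the numbers are invented for illustration:

```python
import numpy as np

# Hypothesized group means and common within-group SD (made-up values)
group_means = np.array([2.0, 2.5, 3.0])   # e.g., anticipated CTRs per campaign
within_sd = 0.5                           # assumed common within-group SD

# Cohen's f = SD of the group means / within-group SD
f = np.std(group_means) / within_sd
print(f"Cohen's f = {f:.2f}")
```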
Exploring the Sample Size and Effect Size Relationship:
When you run this code, you will see a plot showing how the approximate required sample size per group increases as the target effect size decreases. This visually demonstrates the inverse relationship between effect size and the sample size needed to achieve a certain level of statistical power. Let's explore this relationship.
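A sketch of that exploration using statsmodels' FTestAnovaPower; note that solve_power() works in terms of the total sample size, so we divide by the number of groups to get a per-group figure.

```python
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.stats.power import FTestAnovaPower

alpha, power, k_groups = 0.05, 0.80, 3
effect_sizes = np.linspace(0.10, 0.60, 50)   # range of Cohen's f values

analysis = FTestAnovaPower()
# solve_power() returns the TOTAL n; divide by k_groups for per-group n
n_per_group = [
    analysis.solve_power(effect_size=f, alpha=alpha, power=power,
                         k_groups=k_groups) / k_groups
    for f in effect_sizes
]

plt.plot(effect_sizes, n_per_group)
plt.xlabel("Effect size (Cohen's f)")
plt.ylabel("Approx. required sample size per group")
plt.title("Sample Size vs. Effect Size (alpha = 0.05, power = 0.80)")
plt.show()
```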
Calculating Sample Size with the Desired Effect:
Playing around with these expected values can help you determine how many samples you will need to collect. Now we can calculate approximately how many samples we'll need for our chosen significance level and power, given the effect we anticipate measuring.
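For a single anticipated effect, the same solver gives a point estimate. Here I'm assuming a medium effect (f = 0.25), an alpha of 0.05, and 80% power across our three campaign groups:

```python
import numpy as np
from statsmodels.stats.power import FTestAnovaPower

effect_size = 0.25   # "medium" per Cohen's guidelines
alpha = 0.05
power = 0.80
k_groups = 3

# Total sample size needed across all groups
n_total = FTestAnovaPower().solve_power(effect_size=effect_size, alpha=alpha,
                                        power=power, k_groups=k_groups)
print(f"Total n: {np.ceil(n_total):.0f} "
      f"(~{np.ceil(n_total / k_groups):.0f} per group)")
```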
Better Approaches or Considerations:
Define a Meaningful Effect Size: Instead of using a generic "medium" effect size, try to estimate the smallest effect size that would be practically significant for your business. This might involve looking at historical data, considering the cost of implementing changes, or the potential uplift in your key metric.
Consider Variability: The effect size in MANOVA / ANOVA power analysis is often expressed in terms of Cohen's f, which relates the standard deviation of the means to the common within-group standard deviation. A more informed power analysis might involve estimating the expected variability within your groups based on past data.
Iterative Power Analysis: You could perform a sensitivity analysis by trying different effect sizes and desired power levels to see how they impact the required sample size. This can help you understand the trade-offs.
Power Analysis for Post-Hoc Tests: If the overall MANOVA / ANOVA is significant, you'll likely want to perform post-hoc tests to see which specific groups differ. Power analysis can also be considered for these pairwise comparisons, although it's often more complex and might involve adjusting the alpha level (e.g., using Bonferroni correction).
Sequential Testing: For online A/B/C/D tests, you might consider sequential testing methods that allow you to stop the experiment earlier if a significant result is achieved, potentially reducing the overall sample size needed. However, these methods require careful planning and analysis to control error rates.
Bayesian Methods: Bayesian A/B testing offers an alternative framework that focuses on the probability of one variation being better than another, rather than strict null hypothesis testing. Power analysis in a Bayesian context is different and often involves assessing the probability of reaching a desired level of certainty.
Synthetic Data
Example Use Case Scenario:
A marketing team wants to evaluate the effectiveness of three different online advertising campaigns (Group A, Group B, Group C) on key website metrics. They measure the following for a sample of users exposed to each campaign (simulated in the sketch after the list):
CTR (Click-Through Rate): The percentage of users who clicked on the ad.
Spend: The amount of money (in dollars) spent by the user on the website after seeing the ad.
Time: The average time (in minutes) the user spent on the website after seeing the ad.
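One way to simulate this scenario; the group means and standard deviations below are invented just to give the campaigns mildly different profiles, so the notebook's exact values may differ.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 60  # per group, roughly in line with the power analysis above

# (CTR %, Spend $, Time min) means per campaign -- made-up values
group_params = {"A": (2.0, 25.0, 3.0),
                "B": (2.5, 30.0, 3.5),
                "C": (3.0, 28.0, 4.0)}

frames = []
for name, (ctr_mu, spend_mu, time_mu) in group_params.items():
    frames.append(pd.DataFrame({
        "Intervention": name,
        "CTR":   rng.normal(ctr_mu,   0.5, n),
        "Spend": rng.normal(spend_mu, 5.0, n),
        "Time":  rng.normal(time_mu,  0.8, n),
    }))

df = pd.concat(frames, ignore_index=True)
print(df.head())
```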
Validating Assumptions
MANOVA relies on several key assumptions:
Independence of Observations: The data points should be independent of each other.
Multivariate Normality: The dependent variables, when considered together, should follow a multivariate normal distribution within each group. This is a more complex assumption to check than univariate normality.
Homogeneity of Covariance Matrices: The variance-covariance matrices of the dependent variables should be equal across all groups. This is analogous to the homogeneity of variances assumption in ANOVA.
Linearity: There should be linear relationships between the dependent variables within each group.
Absence of Multicollinearity: The dependent variables should not be too highly correlated with each other, as this can make it difficult to isolate the effect of the independent variable.
The Henze-Zirkler test is a statistical test that evaluates the distance between the empirical characteristic function of the data and the characteristic function of the multivariate normal distribution. A larger distance suggests a deviation from multivariate normality. The pingouin package has an implementation, and we can see the data are consistent with multivariate normality.
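With pingouin this is a single call per group; the df, metric columns, and Intervention factor here follow the synthetic data above.

```python
import pingouin as pg

# Henze-Zirkler multivariate normality test within each intervention group
for name, g in df.groupby("Intervention"):
    hz = pg.multivariate_normality(g[["CTR", "Spend", "Time"]], alpha=0.05)
    print(f"Group {name}: HZ = {hz.hz:.3f}, p = {hz.pval:.3f}, normal = {hz.normal}")
```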
Homogeneity of Covariance Matrices
This code snippet specifically checks whether the spread (variance) of the values is the same across the three different intervention groups (A, B, and C). We will have to run this test separately for each metric we are monitoring. Strictly speaking, MANOVA assumes equal covariance matrices across groups (which Box's M can test); the univariate check here covers the variance part of that assumption and is the crucial analogue of ANOVA's homogeneity of variances. If this assumption is violated for a particular dependent variable, the results of the main analysis might be less reliable, and alternative methods or transformations might be considered.
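A sketch of both checks, again assuming the synthetic df from above: Levene's test per metric for the univariate version, and pingouin's Box's M for the full covariance-matrix assumption.

```python
import pingouin as pg
from scipy import stats

# Levene's test for equal variances, one metric at a time
for metric in ["CTR", "Spend", "Time"]:
    samples = [g[metric].values for _, g in df.groupby("Intervention")]
    w, p = stats.levene(*samples)
    print(f"{metric}: W = {w:.3f}, p = {p:.3f}")

# Box's M tests equality of the full covariance matrices in one shot
print(pg.box_m(df, dvs=["CTR", "Spend", "Time"], group="Intervention"))
```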
Independence Assumption
Checking for independence is usually done by considering the experimental design:
Random Assignment: Were participants or experimental units randomly assigned to each variation? Random assignment helps ensure that the groups are comparable and that observations across groups are independent.
Within-Group Independence: Are the observations within each group independent of each other? For example, if you are measuring the time spent by users on a website, each user's session should ideally be independent of other users' sessions. If there are dependencies (e.g., repeated measures on the same user without accounting for it), MANOVA might not be appropriate for the raw data.
No Systematic Bias: Was the data collection process free from any systematic biases that could introduce dependencies between observations?
The MANOVA.from_formula() function is fairly straightforward. You simply pass in the linear formula in patsy style along with the data. In the Intervention section of the output, if any p-value is less than your pre-determined alpha, then you can infer that at least one group's mean for one of the metrics is likely different from the rest. To then find out which group and which metric, you will need a post-hoc test.
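Here's what that looks like with the column names from our synthetic data: the three metrics go on the left of the ~ and the factor on the right.

```python
from statsmodels.multivariate.manova import MANOVA

# Fit the MANOVA and print the four multivariate test statistics
maov = MANOVA.from_formula("CTR + Spend + Time ~ Intervention", data=df)
print(maov.mv_test())
```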
The listed multivariate test statistics – Wilks' Lambda (λ), Pillai's Trace (V), Hotelling's Trace (T-squared), and Roy's Largest Root (R) – are all used in MANOVA to assess the overall significance of the effect of the independent variables on the set of dependent variables.
Wilks' Lambda (λ):
Measures the proportion of variance in dependent variables not explained by independent variables.
Smaller values indicate a stronger effect.
Associated p-value tests the null hypothesis of no differences among group means.
Pillai's Trace (V):
Another measure of the amount of variance explained by independent variables.
Larger values indicate a stronger effect.
Hotelling's Trace (T-squared):
Measures the overall significance of the multivariate effect.
High T-squared value with a low p-value suggests a significant effect.
Roy's Largest Root (R):
Another statistic used to test the significance of the multivariate effect; it is based only on the largest eigenvalue (the strongest linear combination of the dependent variables), which makes it powerful but the least robust of the four.
To determine which specific dependent variables and which specific groups are significantly different after a significant MANOVA, you need to perform post hoc tests. These are follow-up tests conducted after the main analysis and can tell us two things:
Which dependent variables are significantly affected by the independent variable. If the MANOVA is significant, it implies at least one dependent variable shows group differences. The univariate post-hocs help identify which ones.
Which specific pairs of groups have significantly different means for each of those affected dependent variables. Tukey's HSD is designed for pairwise comparisons and controls the family-wise error rate when making multiple comparisons.
Common approaches include:
Univariate ANOVAs: Performing separate ANOVAs for each dependent variable. However, you need to be cautious about inflating the Type I error rate due to multiple comparisons.
Adjusted Alpha Levels (e.g., Bonferroni correction): Reducing the significance level (α) for each univariate ANOVA to control the overall Type I error rate.
Multivariate Post Hoc Tests: Specific tests designed for MANOVA, such as:
Stepdown Analysis: Examines the dependent variables in a hierarchical order.
Pairwise Multivariate Tests: Directly compares pairs of group mean vectors, often with adjustments for multiple comparisons.
Here I'm iterating through each dependent variable and performing Tukey's HSD separately. When we reject the null hypothesis for a pair, we are suggesting that those groups likely differ on that dependent variable. The nice thing here is that the confidence intervals are returned automatically.
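A sketch of that loop with statsmodels' pairwise_tukeyhsd, assuming the same df and metric names as before:

```python
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# One Tukey's HSD per dependent variable; the summary table includes
# the confidence interval for each pairwise difference
for metric in ["CTR", "Spend", "Time"]:
    tukey = pairwise_tukeyhsd(endog=df[metric], groups=df["Intervention"],
                              alpha=0.05)
    print(f"\n--- {metric} ---")
    print(tukey.summary())
```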
And again iterating through the dependent variables, but this time with pingouin, which returns a really nice DataFrame summary. This summary doesn't contain the confidence intervals but does include the Hedges' g effect size for us. Hedges' g tells you how many standard deviations apart the means of your two groups are, providing a standardized measure of the magnitude of the difference. Because it corrects for small-sample bias, it's often preferred over Cohen's d in situations with smaller sample sizes.
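The pingouin equivalent; pairwise_tukey() returns a tidy DataFrame per metric with a hedges column for the effect size.

```python
import pingouin as pg

# Tukey's HSD per dependent variable, with Hedges' g for each pair
for metric in ["CTR", "Spend", "Time"]:
    res = pg.pairwise_tukey(data=df, dv=metric, between="Intervention",
                            effsize="hedges")
    print(f"\n--- {metric} ---")
    print(res.round(3))
```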
Interpretation of Hedges' g:
Around 0.2: Small effect size - the difference between the group means is about 0.2 standard deviations.
Around 0.5: Medium effect size - the difference is about 0.5 standard deviations. This is often considered a practically visible difference.
Around 0.8: Large effect size - the difference is about 0.8 or more standard deviations, indicating a substantial difference between the groups.
A great way to visually compare the distributions of each metric across the different groups is a KDE plot... but of course one for each dependent variable.
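A minimal sketch with seaborn, one panel per dependent variable, assuming the same df as above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# One KDE per dependent variable, overlaying the three groups
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, metric in zip(axes, ["CTR", "Spend", "Time"]):
    sns.kdeplot(data=df, x=metric, hue="Intervention", fill=True, ax=ax)
    ax.set_title(metric)
plt.tight_layout()
plt.show()
```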