GitHub Repo:
You can find this notebook in my GitHub Repo here:
https://github.com/sam-tritto/statistical-tests/blob/main/Welchs_T_Test_for_Means.ipynb
OUTLINE
Introduction to Welch's t-Test
Other Important Considerations
Import Libraries
Determining Sample Size (Power Analysis)
Synthetic Data
Validating Assumptions
Running Welch's t-Test (SciPy)
Running Welch's t-Test (pingouin)
Visualization
Overview: Welch's t-test is a statistical hypothesis test used to determine if there is a significant difference between the means of two independent groups. It's an adaptation of the more commonly known Student's t-test. What makes Welch's t-test unique and particularly valuable is that it does not assume the two populations from which the samples are drawn have equal variances. This makes it a more robust and often preferred alternative when comparing two means, especially when you are unsure about the equality of variances or have evidence suggesting they differ.
Use Cases & When to Use: Welch's t-test is widely applicable in fields like medicine, biology, social sciences, engineering, and business analytics – essentially anywhere you need to compare the average value of a continuous variable between two distinct, unrelated groups. You should specifically choose Welch's t-test when:
You are comparing the means of two independent groups (the observations in one group are not related to the observations in the other).
You suspect or know that the variances (a measure of spread or dispersion) of the outcome variable might be different between the two populations the groups represent.
Even if the variances are equal, Welch's t-test performs similarly to Student's t-test, making it a safe default choice in many situations.
Assumptions: For the results of Welch's t-test to be valid, certain assumptions should ideally be met:
Independence: The observations within each group must be independent of each other, and the two groups themselves must be independent.
Normality: The data within each group should be approximately normally distributed. However, Welch's t-test is reasonably robust to violations of this assumption, especially if the sample sizes in both groups are sufficiently large (e.g., > 30), thanks to the Central Limit Theorem.
No Assumption of Equal Variances: Critically, unlike Student's t-test, Welch's t-test does not require the assumption of homogeneity of variances (equal variances) between the two groups. It explicitly accounts for potential differences in variance in its calculations, primarily by adjusting the degrees of freedom.
Example Scenario: Imagine clinical researchers want to investigate if a new medication effectively lowers systolic blood pressure compared to a standard treatment. They recruit two independent groups of patients: one receives the new medication, and the other receives the standard treatment. After a set period, they measure the systolic blood pressure for all participants. The researchers might suspect that the new drug affects individuals differently, potentially leading to a wider range (higher variance) of blood pressure readings in that group compared to the standard treatment group. In this scenario, where the goal is to compare the average systolic blood pressure between the two treatment groups and there's a possibility of unequal variances, Welch's t-test is the appropriate statistical tool to use.
Other Important Considerations
For A/B Tests and much of Experimental Design the statistical tests are often only one small step of the process. Here I'll outline some other considerations that I'm not accounting for in this tutorial.
Randomization - Randomly assigning participants to the different groups to reduce bias and ensure the groups being compared are similar in all ways except the treatment.
Multiple Tests & Correcting p-values - If the same hypothesis is being tested multiple times or simultaneously then we need to apply corrections such as a Bonferroni Correction to control the family wise error rate and reduce the likelihood of false positives.
One-Tailed vs Two-Tailed - In a One-Tailed test we're interested in detecting an effect in a specific direction, either positive or negative, while in a Two-Tailed test we're interested in detecting any significant difference, regardless of the direction. I'll stick to using Two-Tailed tests unless otherwise noted.
Guardrail Metrics - Monitor additional metrics to ensure there are no unintended consequences of implementing the new changes. These metrics act as safeguards to protect both the business and users.
Decision to Launch - Consider both statistical and practical significance before determining to launch the changes.
As Professor Allen Downey would say... There is only one test! - If you can mimic the data generating process with a simulation then there's no real need for a statistical test. http://allendowney.blogspot.com/2011/05/there-is-only-one-test.html
Import Libraries
There are a few libraries we can use for Welch's t-Test. I'll show an implementation using SciPy (a tried and true statistical library) as well as using pingouin (a slightly newer library with a lot of extra bells and whistles). These are the imports and versions you'll need to follow along.
Determining Sample Size (Power Analysis)
Why Calculate Sample Size?
Before conducting an experiment or study where you plan to use Welch's t-test (or any hypothesis test), it's crucial to estimate the required sample size. This process, often called a priori power analysis, helps ensure your study has a good chance of detecting a statistically significant result if a true effect of a certain magnitude exists.
Statistical Power: This is the probability (1−β) of correctly rejecting the null hypothesis when the alternative hypothesis is true. In simpler terms, it's the probability of finding a significant difference if a real difference of a specific size actually exists. Commonly desired power levels are 80% (0.8) or 90% (0.9).
Significance Level (α): This is the probability of making a Type I error – rejecting the null hypothesis when it's actually true (a "false positive"). This is typically set at 5% (0.05).
Effect Size: This quantifies the magnitude of the difference you expect or want to be able to detect between the two groups (e.g., the difference between mean1 and mean2, often standardized).
Data Variability: The spread of the data in each group, represented by the standard deviations (sd1, sd2). Higher variability generally requires larger sample sizes.
Calculating sample size helps you balance resources: avoiding studies that are too small (underpowered) and thus likely to miss real effects, and avoiding studies that are unnecessarily large (overpowered), wasting resources or potentially exposing more participants than needed to experimental conditions.
Challenges for Welch's t-Test:
Exact sample size calculation for Welch's t-test is complex because the test's degrees of freedom depend on the sample variances, which are unknown before collecting the data. Therefore, sample size calculations for Welch's t-test usually rely on approximations. The function provided uses one such approach, often leveraging formulas adapted from Z-tests or Student's t-tests.
Exploring the Sample Size and Effect Size Relationship:
When we expect or anticipate a large effect size, we need fewer samples per group, since the difference in the group means will be more obvious. Likewise, if we expect only a small effect size, we will need more samples to detect it. You can see this relationship played out below with the expected means from both groups.
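One way to sketch this relationship is with statsmodels' power solver (an assumption on my part; the notebook may compute it differently). Here I standardize each candidate mean difference by the root of the average of the two group variances to get an approximate Cohen's d, then solve for the per-group sample size:

```python
import numpy as np
from statsmodels.stats.power import TTestIndPower

sd1, sd2 = 20, 5  # assumed group standard deviations for this tutorial
pooled_sd = np.sqrt((sd1**2 + sd2**2) / 2)  # standardizer for Cohen's d

# Larger expected mean differences -> larger effect sizes -> fewer samples
for mean_diff in [2, 5, 7, 10, 15]:
    d = mean_diff / pooled_sd
    n = TTestIndPower().solve_power(effect_size=d, alpha=0.05, power=0.8,
                                    alternative='two-sided')
    print(f"mean diff {mean_diff:>2}: d = {d:.2f}, n per group = {int(np.ceil(n))}")
```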
Calculating Sample Size with the Desired Effect:
Playing around with these expected values can help you determine how many samples you will need to collect per group. Using a function that takes in the expected means and standard deviations, we can approximate how many samples per group we will need to achieve our desired statistical significance and power.
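Wrapping that logic in a function lets you plug in expected means and standard deviations directly. The function name `welch_sample_size` and the example values (means 125 vs 118, standard deviations 20 and 5) are illustrative assumptions; as noted above, this is an approximation rather than an exact Welch power calculation:

```python
import numpy as np
from statsmodels.stats.power import TTestIndPower

def welch_sample_size(mean1, mean2, sd1, sd2, alpha=0.05, power=0.8):
    """Approximate per-group sample size for Welch's t-test.

    Standardizes the expected mean difference by the root of the
    average of the two variances (an approximate Cohen's d), then
    solves the standard two-sample power equation. Exact Welch power
    depends on the sample variances, which are unknown in advance.
    """
    pooled_sd = np.sqrt((sd1**2 + sd2**2) / 2)
    effect_size = abs(mean1 - mean2) / pooled_sd
    n = TTestIndPower().solve_power(effect_size=effect_size, alpha=alpha,
                                    power=power, alternative='two-sided')
    return int(np.ceil(n)), effect_size

n_per_group, d = welch_sample_size(mean1=125, mean2=118, sd1=20, sd2=5)
print(f"Effect size d = {d:.2f}, need ~{n_per_group} samples per group")
```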
Synthetic Data
Use Case Context: Before we can perform Welch's t-test in Python, we need data to work with! Often in tutorials, we generate synthetic data. This is useful because it allows us to:
Control the exact characteristics of our sample data (like the mean, standard deviation, and sample size).
Specifically create a scenario that highlights the strengths of the test we're demonstrating. In this case, we want to create data where the two groups likely have unequal variances, making Welch's t-test the appropriate choice over Student's t-test.
Code: This code block simulates collecting data for two independent groups (A and B) from populations that are normally distributed but have different means (125 and 118) and, importantly, different spreads (standard deviations of 20 and 5). The data is then organized into a convenient Pandas DataFrame, ready for analysis using Welch's t-test. We can use the sample size we found in the previous section.
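A sketch of that simulation might look like the following (the seed, the per-group sample size of 70, and the variable names are my own assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 70  # per-group sample size, rounded up from the power analysis (assumed)

# Group A: higher mean AND much higher spread -> unequal variances by design
group_a = rng.normal(loc=125, scale=20, size=n)
group_b = rng.normal(loc=118, scale=5, size=n)

data = pd.DataFrame({'Group A': group_a, 'Group B': group_b})
print(data.describe().round(2))
```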
Validating Assumptions
Before applying Welch's t-test, it's prudent to check if the underlying assumptions of the test are reasonably met. While Welch's t-test is more robust to violations of the equal variance assumption compared to the Student's t-test, it still assumes that the data within each group is approximately normally distributed.
The Python code snippet below implements functions to assess these assumptions:
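A minimal version of such an assumption-checking function, using scipy.stats, might look like this (the function name and the synthetic data at the bottom are illustrative assumptions):

```python
import numpy as np
import pandas as pd
from scipy import stats

def validate_assumptions(data, alpha=0.05):
    """Check per-group normality (Shapiro-Wilk) and equal variances (Levene)."""
    # Shapiro-Wilk: H0 = the sample comes from a normal distribution
    for group in ['Group A', 'Group B']:
        stat, p = stats.shapiro(data[group])
        verdict = "likely NOT normal" if p < alpha else "approximately normal"
        print(f"{group}: Shapiro-Wilk stat={stat:.3f}, p={p:.3f} -> {verdict}")

    # Levene: H0 = the two groups have equal variances
    stat, p = stats.levene(data['Group A'], data['Group B'])
    verdict = "variances differ" if p < alpha else "variances look homogeneous"
    print(f"Levene's test: stat={stat:.3f}, p={p:.3f} -> {verdict}")

# Example with synthetic data (means and SDs assumed from this tutorial)
rng = np.random.default_rng(42)
data = pd.DataFrame({'Group A': rng.normal(125, 20, 70),
                     'Group B': rng.normal(118, 5, 70)})
validate_assumptions(data)
```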
Let's break down what this code does:
Normality Assessment (Shapiro-Wilk Test):
The shapiro() function from the scipy.stats module is used to test the null hypothesis that the data in each group ('Group A' and 'Group B') comes from a normal distribution.
For each group, the function returns a test statistic and a p-value.
We then interpret the p-value against a chosen significance level (α, commonly 0.05).
If the p-value is less than α, we reject the null hypothesis and conclude that the data for that group is likely not normally distributed.
Conversely, if the p-value is greater than α, we fail to reject the null hypothesis, suggesting that the data is approximately normally distributed.
Homogeneity of Variances Assessment (Levene's Test):
The levene() function, also from scipy.stats, is employed to test the null hypothesis that the variances of the two groups are equal.
Similar to the Shapiro-Wilk test, levene() returns a test statistic and a p-value.
We compare the p-value to our significance level (α).
If the p-value is less than α, we reject the null hypothesis, indicating that the variances of the two groups are significantly different (heterogeneous).
If the p-value is greater than α, we fail to reject the null hypothesis, suggesting that the variances are reasonably homogeneous.
Interpretation and Output:
The code prints the results of both tests, including the test statistic and the p-value.
Crucially, it provides a clear interpretation of these results in the context of the assumptions of Welch's t-test.
Important Considerations:
Normality Assumption: While Welch's t-test is more forgiving regarding unequal variances, substantial deviations from normality, especially in small sample sizes, can still affect the test's reliability. Visual inspections using histograms or Q-Q plots can complement the Shapiro-Wilk test.
Homogeneity of Variances: Welch's t-test is specifically designed to handle situations where the variances are unequal, so a rejection of the null hypothesis in Levene's test is not necessarily a reason to avoid using Welch's t-test. In fact, it's the scenario where Welch's t-test is particularly valuable.
Significance Level (α): The choice of α influences the outcome of the hypothesis tests. A common value is 0.05, but this can be adjusted based on the specific context of your analysis.
By running this assumption validation code, you gain valuable insights into the characteristics of your data and can be more confident in the appropriateness and interpretation of the results from Welch's t-test.
Once you have an understanding of your data and have (optionally) assessed the underlying assumptions, you can proceed with running Welch's independent samples t-test. This test allows us to determine if there is a statistically significant difference between the means of two independent groups, even when their variances are unequal.
The Python code snippet below demonstrates how to perform Welch's t-test using the scipy.stats module:
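Based on the description that follows, a sketch of the `run_welch_t_test` function might look like this (the synthetic data at the bottom is an assumption for demonstration):

```python
import numpy as np
import pandas as pd
from scipy import stats

def run_welch_t_test(data, alpha=0.05):
    # equal_var=False is what makes this Welch's (not Student's) t-test
    stat, p_value = stats.ttest_ind(data['Group A'], data['Group B'],
                                    equal_var=False)
    print(f"t-statistic: {stat:.3f}, p-value: {p_value:.4f}")
    if p_value < alpha:
        print("Reject the null hypothesis: the group means differ significantly.")
    else:
        print("Fail to reject the null hypothesis: no significant difference detected.")
    return stat, p_value

# Example with the tutorial's assumed synthetic data
rng = np.random.default_rng(42)
data = pd.DataFrame({'Group A': rng.normal(125, 20, 70),
                     'Group B': rng.normal(118, 5, 70)})
run_welch_t_test(data)
```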
Let's dissect this code:
Performing the Test (ttest_ind):
The core of the Welch's t-test is performed by the ttest_ind() function from the scipy.stats module. This function is designed to calculate the t-statistic and the p-value for the comparison of two independent samples.
We pass two main arguments to this function:
data['Group A']: This provides the data for the first group.
data['Group B']: This provides the data for the second group.
The crucial parameter for running Welch's t-test specifically is equal_var=False. By setting this to False, we instruct the function to perform Welch's t-test, which does not assume equal population variances. If this parameter were set to True (or left as the default), the function would perform the Student's independent samples t-test, which does assume equal variances.
The function returns two values:
stat: This is the calculated t-statistic. It's a measure of the difference between the sample means relative to the variability within the samples.
p_value: This is the probability of observing a test statistic as extreme as, or more extreme than, the one calculated, assuming the null hypothesis is true (i.e., there is no real difference between the means of the two populations).
Interpreting the Results:
We then interpret the obtained p_value in relation to a chosen significance level (α, again, typically 0.05).
If the p-value is less than α: This provides evidence against the null hypothesis. We reject the null hypothesis and conclude that there is a statistically significant difference between the means of Group A and Group B. In other words, the observed difference between the sample means is unlikely to have occurred by random chance alone.
If the p-value is greater than or equal to α: We fail to reject the null hypothesis. This means that based on our data, there is not enough statistical evidence to conclude that there is a significant difference between the means of Group A and Group B. It's important to note that this does not necessarily mean that the means are equal, only that we haven't found sufficient evidence to say they are different.
Output:
The code prints the calculated t-statistic and the p-value, providing the key results of the test.
It also presents a clear interpretation of these results in the context of the null hypothesis.
By running this run_welch_t_test function with your data, you can obtain the statistical evidence needed to make inferences about the differences between the two groups you are comparing, while appropriately accounting for potential differences in their variances.
The pingouin library offers a streamlined and statistically rich way to conduct various statistical tests, including the independent samples t-test with Welch's correction. The benefit of using pingouin is that it computes the confidence intervals, effect size, and Bayes factor along with the p-value. These are all very handy to have and would have required much more code if using SciPy or statsmodels. The run_welch_t_test_pingouin function encapsulates this process:
Let's dissect this function step by step:
Defining the Function run_welch_t_test_pingouin:
This function takes two arguments:
data (dict): It expects a Python dictionary where the keys are the names of your groups (in this case, assumed to be 'Group A' and 'Group B') and the values are the lists or NumPy arrays containing the data for each group.
alpha (float, optional): This is the significance level for the test. It defaults to 0.05, which is a common value. This value determines the threshold for rejecting the null hypothesis.
Performing Welch's t-Test with pg.ttest():
results = pg.ttest(data['Group A'], data['Group B'], correction='auto'): This is the core of the function. It uses pingouin's ttest() function to perform the independent samples t-test.
The first two arguments, data['Group A'] and data['Group B'], provide the data for the two groups you want to compare.
The crucial argument here is correction='auto'. This tells pingouin when to apply Welch's correction for unequal variances. With 'auto', pingouin applies the correction automatically when the two sample sizes are unequal (following Zimmerman, 2004); otherwise it performs the standard Student's t-test. If you want to guarantee Welch's t-test regardless of sample sizes, pass correction=True instead. The 'auto' setting is convenient because it handles the choice for you, but be aware that with equal group sizes it will fall back to Student's test.
Displaying the Results:
print("Welch's t-Test Results:") and display(results): These lines print a descriptive header and then display the results DataFrame. The results DataFrame from pingouin.ttest() contains a wealth of information, including:
T: The calculated t-statistic.
dof: The degrees of freedom (which will be adjusted downward if Welch's correction is applied).
p-val: The p-value associated with the t-statistic.
CI95%: The 95% confidence interval for the difference in means.
cohen-d: Cohen's d, an estimate of the effect size.
BF10: The Bayes factor, providing evidence for the alternative hypothesis.
power: The statistical power of the test (given the sample size and effect size).
Interpreting the P-Value:
p_value = results['p-val'].iloc[0]: This line extracts the p-value from the results DataFrame. The p-value is typically located in the first row of the 'p-val' column.
The subsequent if statement performs the hypothesis test:
if p_value < alpha:: If the p-value is less than the chosen significance level (alpha), we reject the null hypothesis. This suggests that there is a statistically significant difference between the means of Group A and Group B.
else:: If the p-value is greater than or equal to alpha, we fail to reject the null hypothesis. This indicates that there is not enough statistical evidence to conclude that the means of the two groups are significantly different.
A great way to visually compare the means between two groups is an overlaid KDE (Kernel Density Estimation) distribution plot for each group. First we need to gather the data in a shape that's easy to visualize.