Analysis of Variance (ANOVA) is a statistical test used to determine whether the means of two or more populations differ. When comparing exactly two population means, ANOVA produces the same result as a t-test (F = t squared). The analysis also allows us to handle multiple independent variables (IVs), both separately and in terms of their interactions.
ANOVA compares estimates of the population variance. The "within groups" variance is based on the variance of the individual scores in each group around that group's mean. The "between groups" variance is based on the variance of the group means around the grand mean. "Within groups" may be referred to by some as "error," meaning the error in measurements of participants in a certain group. "Between groups" is sometimes referred to as "treatment," meaning the measurement of treatment effects.
The F-statistic is obtained by dividing the mean square (sum of squares divided by degrees of freedom) "between groups" by the mean square "within groups." The obtained value of F is compared to a critical value of F, and each critical value of F is related to a p-value. We then assess the p-value, and if it is less than our predetermined cutoff point (a p-value of .05 is standard, though some use .01), we reject the null hypothesis. Essentially, the F ratio tests how likely it is that our two estimates of the same population variance would differ so widely if the null hypothesis were true and the assumptions were met.
Note: this obtained F-statistic is a "blob test," meaning that it tells us there is a difference between means without telling us where the difference lies. We must conduct deeper analyses to determine which groups actually differ. Also note: an overall F may be insignificant, but upon deeper analysis, differences between specific groups may in fact exist. First we must understand basic ANOVA and an overall F test. Later, we will get into specific (and more important) group comparisons.
Before continuing, let’s hypothetically run an experiment determining the effects of sleep prior to a Statistics examination. Let’s say that we have three groups, each composed of 10 participants. Group 1 slept for ten hours. Group 2 slept for six hours. Group 3 slept for two hours. Each of the 30 participants took a Statistics exam and the mean scores for each group respectively were: 100, 90, and 85. We can see that 100, 90, and 85 are different scores, but we will use ANOVA to determine if these differences are statistically significant.
We begin by calculating the F-statistic, which means dividing the variance "between groups" by the variance "within groups." This is the same as dividing the Mean Square Between-Groups by the Mean Square Within-Groups.
The MS Between-Groups is calculated by taking the sum of squares between groups and dividing by the degrees of freedom. In this case, we first calculate the grand mean – (100+90+85)/3 – and get 91.67. The sum of squares between groups is the squared difference between each group mean and the grand mean, weighted by the number of participants in that group, and summed: 10(100-91.67) squared + 10(90-91.67) squared + 10(85-91.67) squared. Rounding gives us 10(69.39) + 10(2.79) + 10(44.49), which equals 1166.67. This is the SS Between-Groups. To convert it to MS Between-Groups, we divide by the degrees of freedom, which in this example is 2 (k - 1 = 2). This provides us with the numerator: 583.33.
To get the denominator, we take the sum of squares within groups: for every one of the 30 individual scores, we take the difference between that score and its group mean, square it, and sum across all scores; we then divide by the degrees of freedom. This can be a lengthy calculation, so I will do it for you. SS Within-Groups in our hypothetical example comes out to 335.6. The df is N - k = 27, giving us an MS Within-Groups of 12.43.
583.33/12.43 = 46.93, which is our F Calculated. Looking at the alpha = .05 chart for F Critical, viewing 2 degrees of freedom in the numerator and rounding down to 26 degrees of freedom in the denominator (27 df is not listed on the chart I am viewing, and it is best to be conservative, so we round down), we see an F Critical value of 3.37. Because our F Calculated of 46.93 is greater than our F Critical, we reject the null hypothesis that states there is no statistically significant difference between the three testing groups. (Note: include F Calculated and the corresponding p-value when reporting results.)
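The whole calculation can be sketched in a few lines from the summary values used in this hypothetical example (group means of 100, 90, and 85, n = 10 per group, and the given SS Within-Groups of 335.6). Note that each group mean's squared deviation from the grand mean is weighted by its group size:

```python
# Reproducing the hypothetical sleep-study ANOVA from summary statistics.
# Group means, n per group, and SS within come from the example in the text.
group_means = [100, 90, 85]
n_per_group = 10
k = len(group_means)          # number of groups
N = k * n_per_group           # total participants

grand_mean = sum(group_means) / k   # ~91.67 (equal group sizes)

# Each squared deviation of a group mean from the grand mean is
# weighted by that group's size.
ss_between = sum(n_per_group * (m - grand_mean) ** 2 for m in group_means)
df_between = k - 1            # 2
ms_between = ss_between / df_between

ss_within = 335.6             # given in the text
df_within = N - k             # 27
ms_within = ss_within / df_within

f_stat = ms_between / ms_within
print(round(ss_between, 2), round(f_stat, 2))
```

This mirrors the hand calculation: SS Between-Groups of about 1166.67 and an F well above the .05 critical value of 3.37.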
When running an ANOVA, the first thing we do is identify the null and alternative hypotheses. The null hypothesis states that the mean of group 1 = mean of group 2 = mean of group 3, etc. The alternative hypothesis simply states that the null is incorrect.
We make three assumptions with ANOVA. We assume there is random, independent sampling from each population. There is no way to verify this without examining the experimental design, so for the purposes of a Stats class we will simply take it as given. The other two assumptions we can test: normal population distributions and equal variances within each population.
We test the assumption of normal population distributions by graphing or plotting the data. The most common analysis is a scatterplot (dots) or a boxplot. This will highlight any extreme scores or outliers. Generally, an outlier is three or more standard deviations from the mean; however, pay attention to the data, as this is very case dependent. Next, look at the skew and kurtosis. The skew should be between -1 and +1. Kurtosis should fall within that range as well, but we are much less concerned with negative kurtosis, as only positive kurtosis indicates outliers.
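A quick way to run this screen outside of a plot is to compute skew and excess kurtosis directly; here is a minimal sketch using NumPy on synthetic, assumed-normal exam scores (substitute your own data):

```python
# Normality screen: skew and excess kurtosis via the moment formulas.
# The data here are synthetic (seeded normal draws), purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(loc=90, scale=10, size=1000)

centered = scores - scores.mean()
std = scores.std()
skew = np.mean(centered ** 3) / std ** 3
excess_kurtosis = np.mean(centered ** 4) / std ** 4 - 3  # 0 for a normal curve

# Rule of thumb from the text: both should fall between -1 and +1.
print(abs(skew) < 1 and abs(excess_kurtosis) < 1)
```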
Next, we test the third and final assumption of ANOVA: homogeneity of variance. To test this assumption, we tend to use the Levene statistic in SPSS. The null hypothesis for Levene's statistic is that the variances are equal for all populations. If the significance value (listed as "Sig." in SPSS; this is the p-value, not a critical value) is less than .05, we reject the null.
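The same test is available outside SPSS; a minimal sketch with SciPy's implementation follows. The three groups here are hypothetical and are deliberately shifted copies of one another, so their spreads are identical and the test cannot reject equal variances:

```python
# Levene's test for homogeneity of variance (scipy's implementation,
# which centers on the median by default). Data are hypothetical.
from scipy import stats

g1 = [82, 90, 95, 88, 79, 91, 85, 87, 93, 84]
g2 = [x + 5 for x in g1]   # same spread, higher mean
g3 = [x - 3 for x in g1]   # same spread, lower mean

stat, p = stats.levene(g1, g2, g3)
print(p > 0.05)  # fail to reject: no evidence of unequal variances
```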
If the data are not normally distributed or the variances are not equal, we should complete a transformation. Standard transformations (listed from most powerful to least powerful) are the reciprocal, the log, and the square root (or the square root of the variable plus one); these are best performed on data with a smooth skew. With growth data, such as biological or financial data, a log transformation is best. The square root is best for frequency data. However, if there is a negative skew, any of these transformations will make the skew and kurtosis worse.
Another way to normalize distributions or variances is to trim or Winsorize the data. Trimming is just as it sounds – we cut out data. Standard trimming may involve cutting off 20% of the data on either end of the distribution. This affects the power of our analysis, as it decreases the sample size and also the degrees of freedom. Winsorizing data means that we take the extreme outlier(s) and replace it/them with the next most extreme remaining score. This keeps our sample size and degrees of freedom the same.
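The difference between the two approaches is easy to see on a small, hypothetical sample with one high outlier:

```python
# Trimming vs. Winsorizing a hypothetical sample with one high outlier.
import numpy as np

scores = np.array([61, 64, 65, 66, 68, 70, 71, 73, 74, 99])  # 99 is the outlier

# Trimming: drop the most extreme value on each end (sample shrinks).
trimmed = np.sort(scores)[1:-1]

# Winsorizing: replace the extremes with the next-most-extreme scores
# (sample size and degrees of freedom are preserved).
winsorized = np.sort(scores).astype(float)
winsorized[0] = winsorized[1]    # 61 -> 64
winsorized[-1] = winsorized[-2]  # 99 -> 74

print(len(trimmed), len(winsorized), winsorized.max())  # 8 10 74.0
```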
A final note on the assumptions: small to moderate departures from homogeneity do little to affect ANOVA, especially if the sample size is quite large. The “triple whammy” occurs when we have unequal within-group variances (greater than 2:1), unequal sample sizes (greater than 2:1), and at least one small sample (less than 10). If the greatest variance is in the smallest sample size, this could be quite problematic.
After assumptions are met, we can run the analysis and determine the F statistic and the p-value (significance). The p-value is the probability that we would get these or more extreme results given that the null is true and the assumptions are met: P(evidence | null hypothesis) – the probability of the evidence occurring given the null hypothesis.
Power is the probability that you detect an effect in a particular study, given that an effect of a particular size exists. We estimate effects with a margin of error, and if we want to cut that margin of error in half, we must quadruple the sample size. A larger sample gives more sensitivity, but it does not increase the effect size itself.
- B = beta error rate, where power = 1 – beta
- E = effect size
- A = alpha error rate
- N = sample size
A common error is to set the effect size at a value that is expected in the population rather than the minimum value that is considered meaningful.
Failure to attain statistical significance does not necessarily mean that the study was underpowered; the sample size may have been large enough to give an acceptable probability of detecting a meaningful effect if one existed.
Again, if you want to cut the margin of error in half, take 4 times the sample (4n); you will get twice the precision.
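The quadrupling rule follows from the fact that the standard error shrinks with the square root of n; a two-line check (with an arbitrary illustrative sigma and n):

```python
# Standard error scales with 1/sqrt(n), so quadrupling n halves it.
# sigma and n below are arbitrary illustrative values.
import math

sigma = 10.0
n = 30
se_n = sigma / math.sqrt(n)        # standard error at sample size n
se_4n = sigma / math.sqrt(4 * n)   # standard error at sample size 4n

print(se_n / se_4n)  # 2.0: twice the precision
```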
Alpha inflation refers to the fact that the more comparisons you conduct, the more likely you are to make a Type I error. For example, at alpha = .05, we are likely to make a Type I error (to claim a treatment effect when there is none) 5% of the time. If we make 10 pair-wise comparisons at alpha = .05, we have a 40.1% chance of making at least one Type I error.
alpha* = 1 – (1 – alpha)^10 = 1 – .599 = .401
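The formula above is easy to evaluate for any number of comparisons:

```python
# Family-wise Type I error rate for m independent tests at alpha = .05.
alpha = 0.05
for m in (1, 5, 10):
    familywise = 1 - (1 - alpha) ** m
    print(m, round(familywise, 3))
# At 10 comparisons the family-wise rate is about .401, matching the text.
```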
Therefore, when making a series of comparisons, we need to control for alpha inflation. We have several options, depending on the situation. Below are a few of the most commonly used tests; however, this list is far from comprehensive. In general, choose the test that provides you with the most power.
Bonferroni Test: This is a very simple adjustment, made by dividing alpha (usually .05) by the number of pair-wise tests that you're calculating. Therefore, the total alpha across all tests will be .05, rather than each individual test being run at an alpha of .05. The adjusted alpha for each comparison is alpha divided by the number of comparisons.
Tukey’s HSD (Honestly Significant Difference) Method: Probably the most commonly used of the tests, Tukey’s HSD controls family-wise error and tests all possible pair-wise comparisons.
Holm: This test is an adjustment to the Bonferroni test that makes it more liberal and more powerful. Rank the p-values from your pair-wise comparisons from smallest to largest; the smallest is tested against alpha divided by the number of comparisons, and the threshold successively loosens as you move down the ordered list (with m comparisons, the i-th smallest p-value is tested against alpha/(m - i + 1), stopping at the first non-significant result).
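A small sketch on four hypothetical p-values shows why Holm is more powerful than Bonferroni: both control the family-wise rate at alpha, but Holm can reject hypotheses that Bonferroni retains.

```python
# Bonferroni vs. Holm on four hypothetical p-values at alpha = .05.
alpha = 0.05
pvals = [0.001, 0.010, 0.020, 0.040]  # hypothetical, already sorted
m = len(pvals)

# Bonferroni: every p-value faces the same threshold alpha/m (= .0125).
bonf_rejects = [p <= alpha / m for p in pvals]

# Holm: the i-th smallest p-value faces alpha/(m - i); stop at the
# first failure and retain all later hypotheses.
holm_rejects = []
failed = False
for i, p in enumerate(sorted(pvals)):
    if failed or p > alpha / (m - i):
        failed = True
        holm_rejects.append(False)
    else:
        holm_rejects.append(True)

print(sum(bonf_rejects), sum(holm_rejects))  # Bonferroni rejects 2, Holm 4
```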
Dunnett’s Test: Dunnett’s is best used when comparing a control group against every other group, in pair-wise tests.
Scheffe Test: Scheffe’s is the most conservative post hoc test. It tests all possible sets of contrasts. Do not use this test a priori, for pair-wise tests, or in SPSS for one-way ANOVA. In this test, you will compare the calculated F to the Scheffe F.
A priori: comparisons selected before data collection
Post hoc: comparisons made after data collection
Eta-squared is a measure of strength of an experimental effect, sometimes referred to as the correlation ratio. (Eta is the cursive-looking “n.”) It is calculated by dividing SS treatment by SS total. Omega-squared is another way to measure the strength of an experimental effect. The formula is a little more complicated than eta-squared and the result is more powerful and less biased. (Omega is the cursive-looking “w.”) The major difference is that omega-squared is an estimate of the ratio of treatment variance to total variance in the population, while eta-squared is the ratio of treatment variance to total variance in the sample.
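Using the hypothetical sleep-study numbers from earlier (with SS between groups weighted by group size), both effect sizes can be computed directly; the omega-squared formula in the code is the standard one for a one-way between-subjects design:

```python
# Effect sizes for the hypothetical sleep-study ANOVA.
k, n, N = 3, 10, 30                     # groups, per-group n, total n
group_means = (100, 90, 85)
grand_mean = sum(group_means) / k

ss_between = n * sum((m - grand_mean) ** 2 for m in group_means)  # ~1166.67
ss_within = 335.6                        # given in the text
ss_total = ss_between + ss_within
ms_within = ss_within / (N - k)

# eta^2: ratio of treatment SS to total SS in the sample.
eta_squared = ss_between / ss_total

# omega^2: less biased estimate of the same ratio in the population.
omega_squared = (ss_between - (k - 1) * ms_within) / (ss_total + ms_within)

print(round(eta_squared, 3), round(omega_squared, 3))
```

As expected, omega-squared comes out slightly smaller than eta-squared, reflecting its correction for sample bias.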
When contrasts are orthogonal, members of a set of contrasts are independent of one another (meaning knowing the outcome of one does not give us any information about the outcome of another). Given that sample sizes are equal, for a set of contrasts to be orthogonal, the coefficients of each contrast must sum to zero, the products of corresponding coefficients for each pair of contrasts must sum to zero, and the number of comparisons should equal the number of df for treatments. The easiest way to build such a set is to break the groups into pieces and keep dividing until there are no pieces left. For example, if we have (1, 2, 3, 4, 5):
- 1 df: compare (1, 2) to (3, 4, 5)
- 1 df: compare (1) to (2)
- 1 df: compare (3) to (4, 5)
- 1 df: compare (4) to (5)
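The four comparisons above can be written as coefficient vectors and checked for orthogonality directly (equal group sizes assumed):

```python
# Coefficient vectors for the four contrasts listed above.
contrasts = [
    (3, 3, -2, -2, -2),   # (1, 2) vs (3, 4, 5)
    (1, -1, 0, 0, 0),     # (1) vs (2)
    (0, 0, 2, -1, -1),    # (3) vs (4, 5)
    (0, 0, 0, 1, -1),     # (4) vs (5)
]

# Each contrast's coefficients sum to zero...
sums_ok = all(sum(c) == 0 for c in contrasts)

# ...and every pair of contrasts has a zero dot product.
pairs_ok = all(
    sum(a * b for a, b in zip(contrasts[i], contrasts[j])) == 0
    for i in range(len(contrasts))
    for j in range(i + 1, len(contrasts))
)

print(sums_ok and pairs_ok)  # True: the set is orthogonal
```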
A 95% confidence interval means that the probability is .95 that an interval constructed this way will include the true difference between the population means.
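As a sketch, here is a 95% confidence interval for the difference between two group means from the hypothetical sleep study (group 1 mean 100 vs. group 3 mean 85), using MS Within-Groups as the pooled variance estimate on its 27 degrees of freedom:

```python
# 95% CI for a difference of two group means, pooled via MS within.
# Numbers come from the hypothetical sleep-study example in the text.
import math
from scipy import stats

ms_within = 335.6 / 27        # ~12.43, on 27 df
n = 10                        # participants per group
diff = 100 - 85               # group 1 mean minus group 3 mean

se = math.sqrt(ms_within * (1 / n + 1 / n))
t_crit = stats.t.ppf(0.975, 27)          # two-tailed, alpha = .05
lo, hi = diff - t_crit * se, diff + t_crit * se

print(round(lo, 2), round(hi, 2))  # interval excludes 0
```

Because the interval excludes zero, it agrees with the significant overall F for these data.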