One of the most prominent classical statistical techniques is the Analysis of Variance (ANOVA). ANOVA is an especially important tool in experimental analysis, where it is used as an omnibus test of the null hypothesis that mean outcomes across all groups are equal (or, stated differently, that the outcome variance between groups is no larger than the outcome variance within groups). This tutorial walks through the basics of using ANOVA in R. We'll start with some fake data from an imaginary four-group experiment:
set.seed(100)
# four treatment groups with 30 observations each
tr <- rep(1:4, each = 30)
y <- numeric(length = 120)
# group outcomes drawn from normal distributions with differing means and SDs
y[tr == 1] <- rnorm(30, 5, 1)
y[tr == 2] <- rnorm(30, 4, 2)
y[tr == 3] <- rnorm(30, 4, 5)
y[tr == 4] <- rnorm(30, 1, 2)
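A quick check confirms that the design is balanced, with 30 observations in each of the four groups:
table(tr)
## tr
##  1  2  3  4
## 30 30 30 30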
The principal use of ANOVA is to partition the sum of squares in the data and test whether the variance across groups is larger than the variance within groups. The function to do this in R is aov. (Note: this should not be confused with the anova function, which is a model-comparison tool for regression models.)
ANOVA models can be expressed as formulae (like in regression, since the techniques are analogous):
aov(y ~ tr)
## Call:
## aov(formula = y ~ tr)
##
## Terms:
## tr Residuals
## Sum of Squares 251.2 1196.4
## Deg. of Freedom 1 118
##
## Residual standard error: 3.184
## Estimated effects may be unbalanced
The default output of the aov function is surprisingly uninformative, and we should instead use summary to see more meaningful output. Note that this time we wrap tr in factor(): because tr is stored as a numeric vector, the call above treated it as a continuous predictor (hence the single degree of freedom), whereas we want it treated as a categorical grouping variable:
summary(aov(y ~ factor(tr)))
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(tr) 3 297 99.0 9.98 6.6e-06 ***
## Residuals 116 1151 9.9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This output is precisely what we would expect. It shows the “between” and “within” sums of squares, the F-statistic, and the p-value associated with that statistic. If the test is significant (as it is here), we also see significance stars on the right-hand side.
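If we need these results programmatically (say, to extract the p-value in a script), note that summary returns a list whose first element is the ANOVA table as a data frame; a minimal sketch:
s <- summary(aov(y ~ factor(tr)))
# first list element holds the ANOVA table; the first row is the treatment term
s[[1]][["Pr(>F)"]][1]  # about 6.6e-06 here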
Another way to see essentially the same output is with the oneway.test function. It conducts one-way ANOVAs only, whereas aov accommodates more complex experimental designs:
oneway.test(y ~ tr)
##
## One-way analysis of means (not assuming equal variances)
##
## data: y and tr
## F = 39.38, num df = 3.00, denom df = 54.25, p-value = 1.191e-13
The oneway.test function allows us to control whether equal variances are assumed across groups with the var.equal argument:
oneway.test(y ~ factor(tr), var.equal = TRUE)
##
## One-way analysis of means
##
## data: y and factor(tr)
## F = 9.983, num df = 3, denom df = 116, p-value = 6.634e-06
I always feel like the F-statistic is a bit of a letdown. It's a lot of calculation reduced to a single number, which really doesn't tell you much on its own. Instead, we need to actually summarize the data - with a table or figure - in order to see what that F-statistic means in practice.
As a non-parametric alternative to ANOVA, which invokes a normality assumption about the residuals, one can use the Kruskal-Wallis analysis of variance. This test does not assume normality of residuals, but it does assume that the treatment group outcome distributions have identical shapes (differing only by a location shift). To implement it, we simply use kruskal.test:
kruskal.test(y ~ tr)
##
## Kruskal-Wallis rank sum test
##
## data: y by tr
## Kruskal-Wallis chi-squared = 36.96, df = 3, p-value = 4.702e-08
The output of this test is somewhat simpler than that from aov, presenting the test statistic and associated p-value immediately.
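As an aside, kruskal.test also accepts a list of per-group sample vectors instead of a formula, so splitting y by treatment group gives an equivalent call:
# equivalent call: pass a list of samples, one per group
kruskal.test(split(y, tr))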
For more details on assumptions about distributions, look at the tutorial on variance tests.
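For a quick check of the equal-variance assumption in this example, base R's bartlett.test can be applied directly (though keep in mind that Bartlett's test itself assumes normality):
bartlett.test(y ~ factor(tr))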
Post-hoc comparisons are possible in R. The TukeyHSD function is available in the base stats package, but the multcomp add-on package offers much more. Other options include the psych package and the car package. In all, it's too much to cover in detail here. We'll look at the TukeyHSD function, which estimates Tukey's Honestly Significant Differences (for all pairwise group comparisons in an aov object):
TukeyHSD(aov(y ~ factor(tr)))
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = y ~ factor(tr))
##
## $`factor(tr)`
## diff lwr upr p adj
## 2-1 -0.8437 -2.963 1.2759 0.7278
## 3-1 -1.2482 -3.368 0.8714 0.4200
## 4-1 -4.1787 -6.298 -2.0590 0.0000
## 3-2 -0.4045 -2.524 1.7151 0.9595
## 4-2 -3.3350 -5.455 -1.2153 0.0004
## 4-3 -2.9304 -5.050 -0.8108 0.0026
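The stats package also supplies a plot method for the object returned by TukeyHSD, which draws each pairwise mean difference with its family-wise confidence interval:
plot(TukeyHSD(aov(y ~ factor(tr))))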
One can always fall back on the trusty t-test (implemented with t.test) to compare treatment groups pairwise:
t.test(y[tr %in% 1:2] ~ tr[tr %in% 1:2])
##
## Welch Two Sample t-test
##
## data: y[tr %in% 1:2] by tr[tr %in% 1:2]
## t = 1.896, df = 34.2, p-value = 0.06646
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.06051 1.74794
## sample estimates:
## mean in group 1 mean in group 2
## 5.029 4.185
t.test(y[tr %in% c(2, 4)] ~ tr[tr %in% c(2, 4)])
##
## Welch Two Sample t-test
##
## data: y[tr %in% c(2, 4)] by tr[tr %in% c(2, 4)]
## t = 5.98, df = 56.41, p-value = 1.6e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.218 4.452
## sample estimates:
## mean in group 2 mean in group 4
## 4.1851 0.8502
But the user should, of course, be aware of the problems posed by multiple comparisons.
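One simple safeguard is to adjust p-values for the number of comparisons made. Base R's pairwise.t.test runs all pairwise group comparisons with such a correction; a brief sketch using the default Holm adjustment:
# all pairwise comparisons with Holm-adjusted p-values
pairwise.t.test(y, tr, p.adjust.method = "holm")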
The easiest way to summarize the information underlying an ANOVA procedure is to look at the treatment group means and variances (or standard deviations). Luckily, R makes it very easy to calculate these statistics for each group using the by function. If we want the mean of y for each level of tr, we simply call:
by(y, tr, FUN = mean)
## tr: 1
## [1] 5.029
## --------------------------------------------------------
## tr: 2
## [1] 4.185
## --------------------------------------------------------
## tr: 3
## [1] 3.781
## --------------------------------------------------------
## tr: 4
## [1] 0.8502
The result is an output that shows each treatment level and its associated mean. We can also obtain the same information in a slightly different format using tapply:
tapply(y, tr, FUN = mean)
## 1 2 3 4
## 5.0289 4.1851 3.7806 0.8502
This returns a named array (one element per group), which is perhaps easier to work with. We can do the same for the treatment group standard deviations:
tapply(y, tr, FUN = sd)
## 1 2 3 4
## 0.702 2.334 5.464 1.970
And we could even bind them together:
out <- cbind(tapply(y, tr, FUN = mean), tapply(y, tr, FUN = sd))
colnames(out) <- c("mean", "sd")
out
## mean sd
## 1 5.0289 0.702
## 2 4.1851 2.334
## 3 3.7806 5.464
## 4 0.8502 1.970
The result is a nice matrix showing the mean and standard deviation for each group. If there were some other statistic we wanted to calculate for each group, we could easily use by or tapply to obtain it.
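For instance, to calculate the standard error of the mean for each group, we can pass an anonymous function to tapply:
# per-group standard error of the mean
tapply(y, tr, FUN = function(x) sd(x)/sqrt(length(x)))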
A perhaps more convenient way to see our data is to plot it. We can use plot to produce a simple scatterplot, and we can use our out matrix to highlight the treatment group means:
plot(y ~ tr, col = rgb(1, 0, 0, 0.5), pch = 16)
# highlight the means:
points(1:4, out[, 1], col = "blue", bg = "blue", pch = 23, cex = 2)
This is nice because it shows the distribution of the data, but we can also use a boxplot to summarize precisely where key points of each distribution lie. Specifically, a boxplot draws the five-number summary for each treatment group:
tapply(y, tr, fivenum)
## $`1`
## [1] 3.842 4.562 5.093 5.319 7.310
##
## $`2`
## [1] -0.5439 2.9554 3.9527 6.1308 7.7949
##
## $`3`
## [1] -6.3720 0.4349 3.7029 7.1900 16.9098
##
## $`4`
## [1] -1.9160 -0.5528 0.3361 1.8270 5.8914
boxplot(y ~ tr)
Another approach is to use our out object, containing treatment group means and standard deviations, to draw a dotchart. We'll first divide our standard deviations by sqrt(30) to convert them to standard errors of the mean:
out[, 2] <- out[, 2]/sqrt(30)
dotchart(out[, 1], xlim = c(0, 6), xlab = "y", main = "Treatment group means",
pch = 23, bg = "black")
segments(out[, 1] - out[, 2], 1:4, out[, 1] + out[, 2], 1:4, lwd = 2)
segments(out[, 1] - 2 * out[, 2], 1:4, out[, 1] + 2 * out[, 2], 1:4, lwd = 1)
This plot nicely shows the means along with both 1- and 2-standard-error intervals around each mean.
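If a vertical layout is preferred, the same information can be sketched with plot and arrows, where angle = 90 and code = 3 draw flat-capped error bars at both ends:
plot(1:4, out[, 1], ylim = c(0, 6), xaxt = "n", xlab = "tr", ylab = "y",
    pch = 19, main = "Treatment group means")
axis(1, at = 1:4)
# 2-standard-error bars around each mean (out[, 2] already holds standard errors)
arrows(1:4, out[, 1] - 2 * out[, 2], 1:4, out[, 1] + 2 * out[, 2],
    angle = 90, code = 3, length = 0.05)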