Initialize R with the following commands:
options(digits=4,width=80)
options(contrasts=c("contr.sum","contr.poly") ) # set definition of contrasts
load(url("http://pnb.mcmaster.ca/bennett/psy710/datasets/tetrahymena.rda") )
load(url("http://pnb.mcmaster.ca/bennett/psy710/datasets/leprosy.rda") )
Much of the material in this section was based on Section 10.7 in Introductory Statistics with R by Peter Dalgaard.
An experiment was conducted to examine the factors that influence growth in Tetrahymena cells. What is/are Tetrahymena? From Wikipedia:
Tetrahymena is a genus of free-living ciliates that can also switch from commensalistic to pathogenic modes of survival. They are common in freshwater ponds. Tetrahymena species used as model organisms in biomedical research are T. thermophila and T. pyriformis.
The experiment grew Tetrahymena cells in two groups of cultures that
either did or did not include glucose. The primary research question was
whether glucose affected the diameter of the cells. At the start of the
study, the investigators measured the cell concentration (count per ml)
for each cell culture. The data are stored in the data frame
tetrahymena, which contains the two-level factor glucose and the
numeric variables diameter, conc (cell concentration), and logConc
( log(cell concentration) ).
Evaluate the effect of glucose on diameter.
options(contrasts=c("contr.sum","contr.poly"))
tetra.aov.01 <- aov(diameter~glucose,data=tetrahymena)
summary(tetra.aov.01)
## Df Sum Sq Mean Sq F value Pr(>F)
## glucose 1 22 22.01 7.51 0.0086 **
## Residuals 49 144 2.93
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Evaluate the effect of glucose on diameter after controlling for the
effect of cell concentration. Conduct two ANCOVAs: one that uses conc
as the covariate, and another that uses logConc as the covariate.
tetra.aov.02 <- aov(diameter~conc+glucose,data=tetrahymena)
summary(tetra.aov.02)
## Df Sum Sq Mean Sq F value Pr(>F)
## conc 1 100.7 100.7 136.3 1.3e-15 ***
## glucose 1 29.5 29.5 39.9 8.1e-08 ***
## Residuals 48 35.5 0.7
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tetra.aov.03 <- aov(diameter~logConc+glucose,data=tetrahymena)
summary(tetra.aov.03)
## Df Sum Sq Mean Sq F value Pr(>F)
## logConc 1 129.5 129.5 613 < 2e-16 ***
## glucose 1 26.0 26.0 123 7.6e-15 ***
## Residuals 48 10.2 0.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Answer: I think the ANCOVA that uses log-transformed concentration is better because the linear association between the dependent variable and the covariate is stronger.
par(cex=1.25,mfrow=c(1,2))
plot(diameter~conc,data=tetrahymena,type="p",pch=as.numeric(tetrahymena$glucose))
abline(lm(diameter~conc,data=tetrahymena))
plot(diameter~logConc,data=tetrahymena,type="p",pch=as.numeric(tetrahymena$glucose))
abline(lm(diameter~logConc,data=tetrahymena))
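The claim about the stronger linear association can also be checked numerically. A quick sketch (assuming the .rda file is still reachable at the course URL used above): compare the correlation of diameter with each candidate covariate.

```r
# Load the data (same URL as above) and compare the two linear associations.
load(url("http://pnb.mcmaster.ca/bennett/psy710/datasets/tetrahymena.rda"))
with(tetrahymena,
     c(raw = cor(conc, diameter), log = cor(logConc, diameter)))
# The correlation should be stronger (larger in magnitude) for logConc.
```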
Another reason for favoring the ANCOVA that uses logConc
is that the residuals from that analysis follow a normal distribution
more closely than the residuals from the other ANCOVA.
shapiro.test(residuals(tetra.aov.02))
##
## Shapiro-Wilk normality test
##
## data: residuals(tetra.aov.02)
## W = 0.97, p-value = 0.1
shapiro.test(residuals(tetra.aov.03))
##
## Shapiro-Wilk normality test
##
## data: residuals(tetra.aov.03)
## W = 0.97, p-value = 0.3
# qq plots:
par(cex=1.25,mfrow=c(1,2))
qqnorm(residuals(tetra.aov.02),main="raw covariate")
qqline(residuals(tetra.aov.02))
qqnorm(residuals(tetra.aov.03),main="log-transformed covariate")
qqline(residuals(tetra.aov.03))
Finally, Figure 4 shows that using logConc
results in
residuals that are (to a first approximation) independent of the value
of the covariate, which again suggests that the log-transformed
covariate has a more linear association with the dependent
variable.
par(cex=1.25,mfrow=c(1,2))
conc <- tetrahymena$conc
resids <- residuals(tetra.aov.02)
plot(x=conc,
y=resids,
type="p",
xlab="conc",
ylab="residual",
ylim=c(-2,2),
main="raw covariate")
abline(h=0,lty=2)
lm.01 <- lm(resids~1+conc+I(conc^2))
lines(sort(conc),fitted(lm.01)[order(conc)],col="red")
logConc <- tetrahymena$logConc
resids <- residuals(tetra.aov.03)
plot(x=logConc,
y=resids,
type="p",
xlab="logConc",
ylab="residual",
ylim=c(-1.5,1.5),
main="log covariate")
abline(h=0,lty=2)
lm.02 <- lm(resids~1+logConc+I(logConc^2))
lines(sort(logConc),fitted(lm.02)[order(logConc)],col="red")
par(mfrow=c(1,1))
summary(lm.01) # significant linear & non-linear trends in residuals
##
## Call:
## lm(formula = resids ~ 1 + conc + I(conc^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.5175 -0.3277 0.0574 0.3609 1.3148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.35e-01 1.38e-01 4.60 3.1e-05 ***
## conc -1.17e-05 1.63e-06 -7.21 3.5e-09 ***
## I(conc^2) 2.09e-11 2.80e-12 7.48 1.3e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.584 on 48 degrees of freedom
## Multiple R-squared: 0.539, Adjusted R-squared: 0.519
## F-statistic: 28 on 2 and 48 DF, p-value: 8.68e-09
summary(lm.02) # no significant trends in residuals
##
## Call:
## lm(formula = resids ~ 1 + logConc + I(logConc^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2369 -0.2134 -0.0265 0.2763 0.9264
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.6609 5.9451 -0.28 0.78
## logConc 0.6830 2.4362 0.28 0.78
## I(logConc^2) -0.0693 0.2470 -0.28 0.78
##
## Residual standard error: 0.459 on 48 degrees of freedom
## Multiple R-squared: 0.00164, Adjusted R-squared: -0.04
## F-statistic: 0.0394 on 2 and 48 DF, p-value: 0.961
Do the results of the ANCOVA (with logConc as the covariate) depend on
the order of terms in the model? Why or why not?
summary(tetra.aov.03)
## Df Sum Sq Mean Sq F value Pr(>F)
## logConc 1 129.5 129.5 613 < 2e-16 ***
## glucose 1 26.0 26.0 123 7.6e-15 ***
## Residuals 48 10.2 0.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tetra.aov.04 <- aov(diameter~glucose + logConc,data=tetrahymena)
summary(tetra.aov.04)
## Df Sum Sq Mean Sq F value Pr(>F)
## glucose 1 22.0 22.0 104 1.3e-13 ***
## logConc 1 133.6 133.6 632 < 2e-16 ***
## Residuals 48 10.2 0.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Answer: The results depend on the order of terms in the model because the covariate and the grouping variable are not orthogonal. (N.B. The between-group difference in the covariate is small but non-zero.)
par(cex=1.25,mfrow=c(1,1))
boxplot(logConc~glucose,data=tetrahymena)
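The order-dependence is a general property of sequential (Type I) sums of squares. A minimal simulated sketch (all variable names hypothetical): when the covariate is correlated with group membership, the SS assigned to the factor changes with its position in the formula, whereas drop1(), which tests each term after all others, is order-invariant.

```r
# Hypothetical data: covariate x is correlated with group g
set.seed(1)
g <- factor(rep(c("ctrl", "trt"), each = 25))
x <- rnorm(50) + ifelse(g == "trt", 0.8, 0)  # x shifted in the trt group
y <- 2 + 1.5 * x + 0.5 * (g == "trt") + rnorm(50)

m1 <- aov(y ~ x + g)   # covariate entered first
m2 <- aov(y ~ g + x)   # factor entered first

# Sequential (Type I) SS for g differ between the two orders:
t1 <- summary(m1)[[1]]
t2 <- summary(m2)[[1]]
ss.g.1 <- t1[trimws(rownames(t1)) == "g", "Sum Sq"]
ss.g.2 <- t2[trimws(rownames(t2)) == "g", "Sum Sq"]
c(covariateFirst = ss.g.1, factorFirst = ss.g.2)

# drop1() tests each term after all others, so it is order-invariant:
drop1(m1, ~., test = "F")
drop1(m2, ~., test = "F")
```

With orthogonal predictors (e.g., a balanced design with an uncorrelated covariate) the two orders would give identical sequential SS.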
Answer: I would put the covariate first, because I
want to evaluate the effect of glucose
after controlling
for the covariate (which I consider to be a nuisance variable).
library(effectsize)
summary(tetra.aov.01)
## Df Sum Sq Mean Sq F value Pr(>F)
## glucose 1 22 22.01 7.51 0.0086 **
## Residuals 49 144 2.93
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
cohens_f(tetra.aov.01) # ignore covariate
# ANCOVA includes covariate
summary(tetra.aov.03)
## Df Sum Sq Mean Sq F value Pr(>F)
## logConc 1 129.5 129.5 613 < 2e-16 ***
## glucose 1 26.0 26.0 123 7.6e-15 ***
## Residuals 48 10.2 0.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
cohens_f(tetra.aov.03,partial=TRUE) # Cohen's f after removing variation related to covariate
cohens_f(tetra.aov.03,partial=FALSE) # Cohen's f with variation related to covariate
Answer: The values differ because they use different error terms. When
the covariate is ignored, you are estimating effect size with an error
term that includes variation associated with the covariate (as well as
pure error variance). Note that non-partial Cohen's \(f\) is closer to
the value we get when glucose is evaluated with ANOVA (and the
covariate is ignored). The reason non-partial Cohen's \(f\) and the
value obtained from the ANOVA are not identical is that the covariate
and treatment variables are not orthogonal, so including (or not
including) the covariate alters the Sum-of-Squares assigned to
glucose.
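As a check on these numbers, Cohen's \(f\) can be reconstructed by hand from the sums of squares printed above, assuming the standard definitions \(f = \sqrt{\eta^2/(1-\eta^2)}\) with \(\eta^2 = SS_{effect}/SS_{total}\) (non-partial) and \(f_p = \sqrt{SS_{effect}/SS_{error}}\) (partial):

```r
# SS values copied from the ANCOVA table above
ss.cov <- 129.5   # logConc
ss.eff <- 26.0    # glucose
ss.err <- 10.2    # residuals

f.partial    <- sqrt(ss.eff / ss.err)             # ~ 1.60
f.nonpartial <- sqrt(ss.eff / (ss.cov + ss.err))  # ~ 0.43
f.anova      <- sqrt(22 / 144)                    # ~ 0.39, from the one-way ANOVA
round(c(partial = f.partial, nonpartial = f.nonpartial, anova = f.anova), 2)
```

The non-partial value is the one close to the ANOVA value; they differ only because non-orthogonality shifts the SS for glucose (26.0 in the ANCOVA vs 22.0 in the ANOVA).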
# check homogeneity of slopes (covariate-by-group interaction):
tetra.aov.03b <- aov(diameter~logConc*glucose,data=tetrahymena)
summary(tetra.aov.03b)
## Df Sum Sq Mean Sq F value Pr(>F)
## logConc 1 129.5 129.5 604.39 < 2e-16 ***
## glucose 1 26.0 26.0 121.47 1.3e-14 ***
## logConc:glucose 1 0.1 0.1 0.36 0.55
## Residuals 47 10.1 0.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
An experiment examined the effect of drugs on the treatment of
leprosy. A measure of the abundance of leprosy bacilli on the skin was
calculated for each participant. Participants were then randomly
assigned to one of three treatment conditions (\(n=10\) per group) that involved taking an
antibiotic or a placebo daily for several months. At the end of the
experiment, a measure of the abundance of leprosy bacilli was taken
again on each participant. The data are stored in the data frame
leprosy, which contains the numeric variables baseline and
postTreatment and the three-level factor drug. The levels in drug
represent two antibiotics (A and B) and a control group (placebo).
Evaluate the effect of drug on postTreatment. Use Tukey pairwise
tests to compare the three groups.
options(contrasts=c("contr.sum","contr.poly"))
lep.aov.01 <- aov(postTreatment~drug,leprosy)
summary(lep.aov.01)
## Df Sum Sq Mean Sq F value Pr(>F)
## drug 2 294 146.8 3.98 0.03 *
## Residuals 27 995 36.9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
require(emmeans)
emmeans(lep.aov.01,specs=pairwise~drug)
## $emmeans
## drug emmean SE df lower.CL upper.CL
## A 5.3 1.92 27 1.36 9.24
## B 6.1 1.92 27 2.16 10.04
## placebo 12.3 1.92 27 8.36 16.24
##
## Confidence level used: 0.95
##
## $contrasts
## contrast estimate SE df t.ratio p.value
## A - B -0.8 2.71 27 -0.295 0.9533
## A - placebo -7.0 2.71 27 -2.578 0.0403
## B - placebo -6.2 2.71 27 -2.284 0.0754
##
## P value adjustment: tukey method for comparing a family of 3 estimates
Evaluate the effect of drug on postTreatment after controlling for
baseline. Check the homogeneity of slopes assumption. Then use Tukey
HSD tests to evaluate pairwise differences among the adjusted means.
options(contrasts=c("contr.sum","contr.poly"))
lep.aov.02 <- aov(postTreatment~baseline+drug,leprosy)
summary(lep.aov.02)
## Df Sum Sq Mean Sq F value Pr(>F)
## baseline 1 803 803 50.04 1.6e-07 ***
## drug 2 69 34 2.14 0.14
## Residuals 26 417 16
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# check homogeneity of slopes assumption:
lep.aov.03 <- aov(postTreatment~baseline*drug,leprosy)
summary(lep.aov.03)
## Df Sum Sq Mean Sq F value Pr(>F)
## baseline 1 803 803 48.47 3.4e-07 ***
## drug 2 69 34 2.07 0.15
## baseline:drug 2 20 10 0.59 0.56
## Residuals 24 398 17
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# list adjusted means
library(effects)
effect(term="drug",lep.aov.02)
##
## drug effect
## drug
## A B placebo
## 6.715 6.824 10.161
# do Tukey tests:
emmeans(lep.aov.02,specs=pairwise~drug)
## $emmeans
## drug emmean SE df lower.CL upper.CL
## A 6.71 1.29 26 4.07 9.36
## B 6.82 1.27 26 4.21 9.44
## placebo 10.16 1.32 26 7.46 12.87
##
## Confidence level used: 0.95
##
## $contrasts
## contrast estimate SE df t.ratio p.value
## A - B -0.109 1.80 26 -0.061 0.9980
## A - placebo -3.446 1.89 26 -1.826 0.1809
## B - placebo -3.337 1.85 26 -1.800 0.1893
##
## P value adjustment: tukey method for comparing a family of 3 estimates
Answer: Note that TukeyHSD is not appropriate here because it performs
pairwise tests on non-adjusted means.
TukeyHSD(lep.aov.02,which="drug") # NOT done on adjusted means!
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = postTreatment ~ baseline + drug, data = leprosy)
##
## $drug
## diff lwr upr p adj
## B-A 0.03132 -4.420 4.483 0.9998
## placebo-A 3.04678 -1.405 7.498 0.2240
## placebo-B 3.01546 -1.436 7.467 0.2305
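The adjusted means that emmeans() and effect() report can be reproduced by hand: each group mean on the dependent variable is shifted along the pooled within-group slope to the grand mean of the covariate. A sketch (the load line repeats the URL from the setup so the snippet is self-contained):

```r
# Adjusted mean for group j:  ybar_j - b * (xbar_j - xbar_grand),
# where b is the common slope from the ANCOVA fit.
load(url("http://pnb.mcmaster.ca/bennett/psy710/datasets/leprosy.rda"))
fit <- aov(postTreatment ~ baseline + drug, data = leprosy)
b   <- coef(fit)["baseline"]                       # pooled within-group slope
adj <- with(leprosy,
            tapply(postTreatment, drug, mean) -
              b * (tapply(baseline, drug, mean) - mean(baseline)))
round(adj, 3)   # should match the adjusted means reported above
```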
Explain why the effect of drug was significant in the ANOVA but not in
the ANCOVA.
boxplot(baseline~drug,data=leprosy)
lep.aov.04 <- aov(baseline~drug,leprosy) # new name: lep.aov.03 already holds the slopes-check model
summary(lep.aov.04)
## Df Sum Sq Mean Sq F value Pr(>F)
## drug 2 73 36.4 1.66 0.21
## Residuals 27 593 22.0
lep.emm <- emmeans(lep.aov.04,specs=~drug)
contrast(lep.emm,method=list(c(-1,-1,2)))
## contrast estimate SE df t.ratio p.value
## c(-1, -1, 2) 6.5 3.63 27 1.791 0.0846
Answer: The placebo group had a larger mean value of the covariate than
the two drug groups (although the contrast above is only marginal,
p = 0.085). So the original group differences in our dependent variable
were due (in large part) to group differences in the covariate, not to
the effects of drug.