1 Initialization

Initialize R with the following commands:

options(digits=4,width=80)
options(contrasts=c("contr.sum","contr.poly") )  # set definition of contrasts
load(url("http://pnb.mcmaster.ca/bennett/psy710/datasets/tetrahymena.rda") )
load(url("http://pnb.mcmaster.ca/bennett/psy710/datasets/leprosy.rda") )

2 Tetrahymena Data

Much of the material in this section was based on Section 10.7 in Introductory Statistics with R by Peter Dalgaard.

An experiment was conducted to examine the factors that influence growth in Tetrahymena cells. What is/are Tetrahymena? From Wikipedia:

Tetrahymena is a genus of free-living ciliates that can also switch from commensalistic to pathogenic modes of survival. They are common in freshwater ponds. Tetrahymena species used as model organisms in biomedical research are T. thermophila and T. pyriformis.

Figure 1: Tetrahymena thermophila. (Taken from Wikipedia)

The experiment grew Tetrahymena cells in two groups of cultures that either did or did not include glucose. The primary research question was whether glucose affected the diameter of the cells. At the start of the study, the investigators measured the cell concentration (count per ml) of each cell culture. The data are stored in the data frame tetrahymena, which contains the two-level factor glucose and the numeric variables diameter, conc (cell concentration), and logConc (log(cell concentration)).

  1. Conduct an analysis of variance to evaluate the effect of glucose on diameter.
options(contrasts=c("contr.sum","contr.poly"))
tetra.aov.01 <- aov(diameter~glucose,data=tetrahymena)
summary(tetra.aov.01)
##             Df Sum Sq Mean Sq F value Pr(>F)   
## glucose      1     22   22.01    7.51 0.0086 **
## Residuals   49    144    2.93                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
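
As an aside, with a two-level factor this one-way ANOVA is equivalent to a pooled-variance two-sample \(t\) test (\(F = t^2\)). A quick sketch with simulated data (the variables below are made up for illustration, not taken from the dataset):

```r
# simulated two-group data (not the tetrahymena data)
set.seed(4)
g <- factor(rep(c("no", "yes"), each = 20))
y <- rnorm(40) + (g == "yes")                     # true group difference of 1
Fval <- summary(aov(y ~ g))[[1]][["F value"]][1]  # one-way ANOVA F
tval <- unname(t.test(y ~ g, var.equal = TRUE)$statistic)
c(F = Fval, t.squared = tval^2)                   # the two values are identical
```
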
  2. Conduct an analysis of covariance to evaluate the effect of glucose on diameter after controlling for the effect of cell concentration. Conduct two ANCOVAs: one that uses conc as the covariate, and another that uses logConc as the covariate.
tetra.aov.02 <- aov(diameter~conc+glucose,data=tetrahymena)
summary(tetra.aov.02)
##             Df Sum Sq Mean Sq F value  Pr(>F)    
## conc         1  100.7   100.7   136.3 1.3e-15 ***
## glucose      1   29.5    29.5    39.9 8.1e-08 ***
## Residuals   48   35.5     0.7                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tetra.aov.03 <- aov(diameter~logConc+glucose,data=tetrahymena)
summary(tetra.aov.03)
##             Df Sum Sq Mean Sq F value  Pr(>F)    
## logConc      1  129.5   129.5     613 < 2e-16 ***
## glucose      1   26.0    26.0     123 7.6e-15 ***
## Residuals   48   10.2     0.2                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  3. Regarding the previous question: Which ANCOVA is better? Why?

Answer: I think the ANCOVA that uses log-transformed concentration is better because the linear association between the dependent variable and the covariate is stronger.
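
One way to see why the transformation helps: if diameter is (approximately) linear in log concentration, then its linear association with the raw concentration must be weaker. A sketch with simulated data (not the tetrahymena data; the generating model here is an assumption chosen for illustration):

```r
# simulated data in which the response is linear in log(concentration)
set.seed(1)
logC <- runif(50, 10, 16)                    # log cell concentration
diam <- 35 - 1.5 * logC + rnorm(50, 0, 0.4)  # linear in logC by construction
rawC <- exp(logC)                            # raw concentration (heavily skewed)
# the correlation with the raw covariate is noticeably weaker in magnitude
round(c(raw = cor(diam, rawC), log = cor(diam, logC)), 2)
```
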

par(cex=1.25,mfrow=c(1,2))
plot(diameter~conc,data=tetrahymena,type="p",pch=as.numeric(tetrahymena$glucose))
abline(lm(diameter~conc,data=tetrahymena))
plot(diameter~logConc,data=tetrahymena,type="p",pch=as.numeric(tetrahymena$glucose))
abline(lm(diameter~logConc,data=tetrahymena))
Figure 2: Linear regressions on tetrahymena data.

Another reason for favoring the ANCOVA that uses logConc is that the residuals from that analysis follow a normal distribution more closely than the residuals from the other ANCOVA.

shapiro.test(residuals(tetra.aov.02))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(tetra.aov.02)
## W = 0.97, p-value = 0.1
shapiro.test(residuals(tetra.aov.03))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(tetra.aov.03)
## W = 0.97, p-value = 0.3
# qq plots:
par(cex=1.25,mfrow=c(1,2))
qqnorm(residuals(tetra.aov.02),main="raw covariate")
qqline(residuals(tetra.aov.02))
qqnorm(residuals(tetra.aov.03),main="log-transformed covariate")
qqline(residuals(tetra.aov.03))
Figure 3: qq plots of residuals from the two ANCOVAs.

Finally, Figure 4 shows that using logConc results in residuals that are (to a first approximation) independent of the value of the covariate, which again suggests that the log-transformed covariate has a more linear association with the dependent variable.

par(cex=1.25,mfrow=c(1,2))
conc <- tetrahymena$conc
resids <- residuals(tetra.aov.02)
plot(x=conc,
     y=resids,
     type="p",
     xlab="conc",
     ylab="residual",
     ylim=c(-2,2),
     main="raw covariate")
abline(h=0,lty=2)
lm.01 <- lm(resids~1+conc+I(conc^2))
lines(sort(conc),fitted(lm.01)[order(conc)],col="red")

logConc <- tetrahymena$logConc
resids <- residuals(tetra.aov.03)
plot(x=logConc,
     y=resids,
     type="p",
     xlab="logConc",
     ylab="residual",
     ylim=c(-1.5,1.5),
     main="log covariate")
abline(h=0,lty=2)
lm.02 <- lm(resids~1+logConc+I(logConc^2))
lines(sort(logConc),fitted(lm.02)[order(logConc)],col="red")
Figure 4: residuals vs. covariate for the two ANCOVAs.


par(mfrow=c(1,1))

summary(lm.01) # significant linear & non-linear trends in residuals
## 
## Call:
## lm(formula = resids ~ 1 + conc + I(conc^2))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5175 -0.3277  0.0574  0.3609  1.3148 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.35e-01   1.38e-01    4.60  3.1e-05 ***
## conc        -1.17e-05   1.63e-06   -7.21  3.5e-09 ***
## I(conc^2)    2.09e-11   2.80e-12    7.48  1.3e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.584 on 48 degrees of freedom
## Multiple R-squared:  0.539,  Adjusted R-squared:  0.519 
## F-statistic:   28 on 2 and 48 DF,  p-value: 8.68e-09
summary(lm.02) # no significant trends in residuals
## 
## Call:
## lm(formula = resids ~ 1 + logConc + I(logConc^2))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2369 -0.2134 -0.0265  0.2763  0.9264 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)   -1.6609     5.9451   -0.28     0.78
## logConc        0.6830     2.4362    0.28     0.78
## I(logConc^2)  -0.0693     0.2470   -0.28     0.78
## 
## Residual standard error: 0.459 on 48 degrees of freedom
## Multiple R-squared:  0.00164,    Adjusted R-squared:  -0.04 
## F-statistic: 0.0394 on 2 and 48 DF,  p-value: 0.961
  4. Do the results of the ANCOVA (that uses logConc as the covariate) depend on the order of terms in the model? Why or why not?
summary(tetra.aov.03)
##             Df Sum Sq Mean Sq F value  Pr(>F)    
## logConc      1  129.5   129.5     613 < 2e-16 ***
## glucose      1   26.0    26.0     123 7.6e-15 ***
## Residuals   48   10.2     0.2                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tetra.aov.04 <- aov(diameter~glucose + logConc,data=tetrahymena)
summary(tetra.aov.04)
##             Df Sum Sq Mean Sq F value  Pr(>F)    
## glucose      1   22.0    22.0     104 1.3e-13 ***
## logConc      1  133.6   133.6     632 < 2e-16 ***
## Residuals   48   10.2     0.2                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Answer: The results depend on the order of terms in the model because the covariate and grouping variable are not orthogonal. (N.B. The difference in the covariate means between the two groups is small but non-zero.)
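
If order-independent tests are wanted, each term can instead be adjusted for the other (e.g., with drop1()). A sketch with simulated data (not the course data):

```r
# simulated data: the covariate is correlated with the grouping factor
set.seed(2)
g <- factor(rep(c("a", "b"), each = 25))
x <- rnorm(50) + 0.8 * (g == "b")          # covariate shifted in group b
y <- 2 * x + 1.5 * (g == "b") + rnorm(50)
m1 <- aov(y ~ x + g)                       # covariate first
m2 <- aov(y ~ g + x)                       # group first
ss1 <- summary(m1)[[1]][["Sum Sq"]]        # rows: x, g, Residuals
ss2 <- summary(m2)[[1]][["Sum Sq"]]        # rows: g, x, Residuals
c(g.after.x = ss1[2], g.first = ss2[1])    # sequential (Type I) SS for g differ
drop1(m1, ~., test = "F")                  # each term adjusted for the other
```
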

par(cex=1.25,mfrow=c(1,1))
boxplot(logConc~glucose,data=tetrahymena)
Figure 5: Covariate in each glucose group.

  5. If the order of the terms does affect the coefficients of the ANCOVA model, which order do you think is better? Why?

Answer: I would put the covariate first, because I want to evaluate the effect of glucose after controlling for the covariate (which I consider to be a nuisance variable).

  6. Calculate Cohen’s \(f\) and partial Cohen’s \(f\) for the effect of glucose after controlling for the effect of the covariate. Why do the two values differ?
library(effectsize)
summary(tetra.aov.01)
##             Df Sum Sq Mean Sq F value Pr(>F)   
## glucose      1     22   22.01    7.51 0.0086 **
## Residuals   49    144    2.93                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
cohens_f(tetra.aov.01) # ignore covariate
# ANCOVA includes covariate
summary(tetra.aov.03)
##             Df Sum Sq Mean Sq F value  Pr(>F)    
## logConc      1  129.5   129.5     613 < 2e-16 ***
## glucose      1   26.0    26.0     123 7.6e-15 ***
## Residuals   48   10.2     0.2                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
cohens_f(tetra.aov.03,partial=TRUE) # Cohen's f after removing variation related to covariate
cohens_f(tetra.aov.03,partial=FALSE) # Cohen's f with variation related to covariate

Answer: The two values differ because they use different error terms. Partial Cohen’s \(f\) evaluates the glucose effect relative to pure error variance, whereas non-partial Cohen’s \(f\) uses an error term that also includes the variation associated with the covariate. Consequently, non-partial Cohen’s \(f\) is closer to the value we get when glucose is evaluated with ANOVA (and the covariate is ignored). The non-partial and ANOVA values are not identical because the covariate and treatment variable are not orthogonal, so including (or not including) the covariate alters the Sum-of-Squares assigned to glucose.
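
These relationships can be checked by hand from the sums of squares printed in the tables above (the printed values are rounded, so the results are approximate):

```r
# SS values read from the printed summary(tetra.aov.03) table (rounded)
ss.covariate <- 129.5   # logConc
ss.effect    <- 26.0    # glucose, adjusted for logConc
ss.error     <- 10.2    # residuals

# partial f: effect SS relative to pure error only
f.partial <- sqrt(ss.effect / ss.error)
# non-partial f: the "error" also contains covariate-related variation
f.nonpartial <- sqrt(ss.effect / (ss.error + ss.covariate))
# for comparison, Cohen's f from the ANOVA that ignored the covariate
f.anova <- sqrt(22.01 / 144)   # SS from summary(tetra.aov.01)

round(c(partial = f.partial, nonpartial = f.nonpartial, anova = f.anova), 3)
```
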

  7. Evaluate the homogeneity of slopes assumption.
tetra.aov.03b <- aov(diameter~logConc*glucose,data=tetrahymena)
summary(tetra.aov.03b)
##                 Df Sum Sq Mean Sq F value  Pr(>F)    
## logConc          1  129.5   129.5  604.39 < 2e-16 ***
## glucose          1   26.0    26.0  121.47 1.3e-14 ***
## logConc:glucose  1    0.1     0.1    0.36    0.55    
## Residuals       47   10.1     0.2                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
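
Answer: The logConc × glucose interaction is not significant (F(1, 47) = 0.36, p = .55), so the homogeneity-of-slopes assumption seems tenable. The same test can be framed as a comparison of nested models; a sketch with simulated data (not the course data):

```r
# simulated data with a common slope in both groups
set.seed(5)
g <- factor(rep(c("a", "b"), each = 25))
x <- rnorm(50)
y <- 2 + x + (g == "b") + rnorm(50)
m.common   <- aov(y ~ x + g)   # common-slope (ANCOVA) model
m.separate <- aov(y ~ x * g)   # adds the x:g interaction (separate slopes)
anova(m.common, m.separate)    # F test of the interaction term
```
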

3 Leprosy Data

An experiment examined the effect of drugs on the treatment of leprosy. A measure of the abundance of leprosy bacilli on the skin was calculated for each participant. Participants were then randomly assigned to one of three treatment conditions (\(n=10\) per group) that involved taking an antibiotic or a placebo daily for several months. At the end of the experiment, a measure of the abundance of leprosy bacilli was taken again on each participant. The data are stored in the data frame leprosy which contains the numeric variables baseline and postTreatment and the three-level factor drug. The levels in drug represent two antibiotics (A and B) and a control group (placebo).

  1. Use the analysis of variance to evaluate the effect of drug on postTreatment. Use Tukey pairwise tests to compare the three groups.
options(contrasts=c("contr.sum","contr.poly"))
lep.aov.01 <- aov(postTreatment~drug,leprosy)
summary(lep.aov.01)
##             Df Sum Sq Mean Sq F value Pr(>F)  
## drug         2    294   146.8    3.98   0.03 *
## Residuals   27    995    36.9                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
require(emmeans)
emmeans(lep.aov.01,specs=pairwise~drug)
## $emmeans
##  drug    emmean   SE df lower.CL upper.CL
##  A          5.3 1.92 27     1.36     9.24
##  B          6.1 1.92 27     2.16    10.04
##  placebo   12.3 1.92 27     8.36    16.24
## 
## Confidence level used: 0.95 
## 
## $contrasts
##  contrast    estimate   SE df t.ratio p.value
##  A - B           -0.8 2.71 27  -0.295  0.9533
##  A - placebo     -7.0 2.71 27  -2.578  0.0403
##  B - placebo     -6.2 2.71 27  -2.284  0.0754
## 
## P value adjustment: tukey method for comparing a family of 3 estimates
  2. Use the analysis of covariance to evaluate the effect of drug on postTreatment after controlling for baseline. Check the homogeneity of slopes assumption. Then use Tukey HSD tests to evaluate pairwise differences among the adjusted means.
options(contrasts=c("contr.sum","contr.poly"))
lep.aov.02 <- aov(postTreatment~baseline+drug,leprosy)
summary(lep.aov.02)
##             Df Sum Sq Mean Sq F value  Pr(>F)    
## baseline     1    803     803   50.04 1.6e-07 ***
## drug         2     69      34    2.14    0.14    
## Residuals   26    417      16                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# check homogeneity of slopes assumption:
lep.aov.03 <- aov(postTreatment~baseline*drug,leprosy)
summary(lep.aov.03)
##               Df Sum Sq Mean Sq F value  Pr(>F)    
## baseline       1    803     803   48.47 3.4e-07 ***
## drug           2     69      34    2.07    0.15    
## baseline:drug  2     20      10    0.59    0.56    
## Residuals     24    398      17                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# list adjusted means
library(effects)
effect(term="drug",lep.aov.02)
## 
##  drug effect
## drug
##       A       B placebo 
##   6.715   6.824  10.161

# do Tukey tests:
emmeans(lep.aov.02,specs=pairwise~drug)
## $emmeans
##  drug    emmean   SE df lower.CL upper.CL
##  A         6.71 1.29 26     4.07     9.36
##  B         6.82 1.27 26     4.21     9.44
##  placebo  10.16 1.32 26     7.46    12.87
## 
## Confidence level used: 0.95 
## 
## $contrasts
##  contrast    estimate   SE df t.ratio p.value
##  A - B         -0.109 1.80 26  -0.061  0.9980
##  A - placebo   -3.446 1.89 26  -1.826  0.1809
##  B - placebo   -3.337 1.85 26  -1.800  0.1893
## 
## P value adjustment: tukey method for comparing a family of 3 estimates

Answer: Note that TukeyHSD is not appropriate here because it performs pairwise tests on non-adjusted means.

TukeyHSD(lep.aov.02,which="drug") # NOT done on adjusted means!
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = postTreatment ~ baseline + drug, data = leprosy)
## 
## $drug
##              diff    lwr   upr  p adj
## B-A       0.03132 -4.420 4.483 0.9998
## placebo-A 3.04678 -1.405 7.498 0.2240
## placebo-B 3.01546 -1.436 7.467 0.2305
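
To see where adjusted means come from, the sketch below (simulated data, not the leprosy data) computes them directly: each raw group mean is slid along the common slope to the grand mean of the covariate.

```r
# simulated ANCOVA data: groups differ on the covariate x
set.seed(3)
g <- factor(rep(c("A", "B", "placebo"), each = 10))
x <- rnorm(30, mean = c(9, 10, 12)[as.numeric(g)], sd = 2)   # "baseline"
y <- 1 + 0.8 * x + c(0, 0, 3)[as.numeric(g)] + rnorm(30)
m <- lm(y ~ x + g)                        # ANCOVA as a linear model
b <- coef(m)["x"]                         # common slope
raw <- tapply(y, g, mean)                 # unadjusted group means
adj <- raw - b * (tapply(x, g, mean) - mean(x))   # adjusted means
round(rbind(raw = raw, adjusted = adj), 2)
```

The adjusted means equal the model's predictions for each group at the grand mean of the covariate, which is what emmeans reports.
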
  3. Explain why the effect of drug was significant in the ANOVA but not the ANCOVA.
boxplot(baseline~drug,data=leprosy)
Figure 6: Covariate in each drug group.

lep.aov.04 <- aov(baseline~drug,leprosy)  # new name: lep.aov.03 already holds the slopes model
summary(lep.aov.04)
##             Df Sum Sq Mean Sq F value Pr(>F)
## drug         2     73    36.4    1.66   0.21
## Residuals   27    593    22.0
lep.emm <- emmeans(lep.aov.04,specs=~drug)
contrast(lep.emm,method=list(c(-1,-1,2)))
##  contrast     estimate   SE df t.ratio p.value
##  c(-1, -1, 2)      6.5 3.63 27   1.791  0.0846

Answer: The placebo group had a higher mean baseline score than the two antibiotic groups. Although this difference did not reach conventional significance (p = 0.21 for the omnibus test; p = 0.085 for the placebo-vs-antibiotics contrast), it was large enough to matter: the original group differences in the dependent variable were due, in large part, to group differences in the covariate rather than to the effects of drug, which is why the drug effect disappeared once baseline was controlled.
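
The mechanism can be demonstrated with simulated data (not the leprosy data): below, the outcome depends only on baseline, yet because the groups differ on baseline, an ANOVA on the outcome suggests a "drug" effect that the ANCOVA removes.

```r
# simulated data: no true drug effect; placebo group has higher baselines
set.seed(6)
g <- factor(rep(c("A", "B", "placebo"), each = 10))
baseline <- rnorm(30, mean = c(8, 8, 12)[as.numeric(g)], sd = 2)
post <- 0.9 * baseline + rnorm(30)     # outcome depends only on baseline
p.anova  <- summary(aov(post ~ g))[[1]][["Pr(>F)"]][1]
p.ancova <- summary(aov(post ~ baseline + g))[[1]][["Pr(>F)"]][2]
round(c(anova = p.anova, ancova = p.ancova), 4)  # ANOVA p is far smaller
```
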