Clear R's workspace and set up your working directory. Then initialize R with the following command.
options(contrasts=c("contr.sum","contr.poly"),width=100) # IMPORTANT!
In this section we download two numeric vectors and use them to create data frames in the so-called wide and long formats. The numeric vectors are named cog.at.risk and cog.control and can be loaded into R with the following command:
load(file=url('http://pnb.mcmaster.ca/bennett/psy710/datasets/cogTest.rda'))
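If loading directly from the URL fails (e.g., your R session has no internet access), one option is to download the file once and load the local copy. A minimal sketch; the local file name cogTest.rda is just a convenient choice:
# download the data file once, then load it from disk:
download.file('http://pnb.mcmaster.ca/bennett/psy710/datasets/cogTest.rda',
              destfile='cogTest.rda', mode='wb')
load('cogTest.rda')
ls() # cog.at.risk and cog.control should now be in the workspace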
First combine the two vectors into a single, wide-format data frame consisting of two numeric variables: risk and control.
# store vectors in wide-format data frame:
cogWide.df <- data.frame(risk=cog.at.risk,control=cog.control)
summary(cogWide.df)
## risk control
## Min. : 55.6 Min. : 80.0
## 1st Qu.: 87.8 1st Qu.: 87.5
## Median : 95.9 Median : 93.8
## Mean : 94.0 Mean :100.0
## 3rd Qu.:104.2 3rd Qu.:113.2
## Max. :115.9 Max. :128.8
head(cogWide.df) # inspect first several rows
Next, we use the melt command in the reshape2 package to transform cogWide.df into a long-format data frame, cogLong.df:
# install.packages("reshape2")
library(reshape2)
cogLong.df <- melt(cogWide.df,
                   measure.vars=c("risk","control"), # dependent vars
                   value.name="score",               # new dependent var
                   variable.name="group")            # new grouping var
# head(cogLong.df) # inspect first six rows; compare to last 6 rows
# tail(cogLong.df) # last six rows
The gather command in the tidyr package does the same thing:
# install.packages("tidyr")
library(tidyr)
cogLong.df <- gather(cogWide.df,key="group",value="score",risk,control,factor_key=TRUE) # factor_key=TRUE makes group a factor
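In recent versions of tidyr (1.0 and later), gather has been superseded by pivot_longer, which does the same job. A minimal sketch; note that pivot_longer returns the grouping variable as a character column, so we convert it to a factor explicitly:
# modern tidyr equivalent of gather():
cogLong.df <- pivot_longer(cogWide.df, cols=c(risk,control),
                           names_to="group", values_to="score")
cogLong.df$group <- factor(cogLong.df$group, levels=c("risk","control"))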
Next, compare the summaries of the wide and long data frames.
The wide data frame contains two variables, risk and control, that are separated into two columns, so each row contains two dependent measures (i.e., two test scores). The long data frame also contains two variables: a numeric variable, score, which contains the test scores from the risk and control groups, and a factor, group, that consists of two levels (i.e., risk and control). Unlike the wide-format data frame, the long data frame contains a single dependent variable in each row.
summary(cogWide.df)
## risk control
## Min. : 55.6 Min. : 80.0
## 1st Qu.: 87.8 1st Qu.: 87.5
## Median : 95.9 Median : 93.8
## Mean : 94.0 Mean :100.0
## 3rd Qu.:104.2 3rd Qu.:113.2
## Max. :115.9 Max. :128.8
summary(cogLong.df)
## group score
## risk :20 Min. : 55.6
## control:20 1st Qu.: 87.5
## Median : 94.8
## Mean : 97.0
## 3rd Qu.:107.9
## Max. :128.8
The difference between the formats has important implications for how you analyze and graph data. For example, the following code illustrates how the means of risk and control are computed for the two formats.
# wide format
colMeans(cogWide.df)
## risk control
## 94 100
# apply(X = cogWide.df,MARGIN=2,FUN = "mean") # same results as colMeans
# long format
with(cogLong.df,tapply(score,group,mean))
## risk control
## 94 100
# aggregate(score~group,cogLong.df,mean) # same result
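Going in the other direction (long to wide) requires an identifier that links rows belonging to the same subject. cogLong.df has no such column, so this sketch creates one (assuming, as the summaries above show, 20 scores per group, with rows stacked by group as melt and gather produce them):
# cast long-format data back to wide format:
tmp <- cogLong.df
tmp$id <- rep(1:20, times=2) # subject ids: 20 per group
cogWide2.df <- dcast(tmp, id ~ group, value.var="score") # dcast is in reshape2
head(cogWide2.df) # columns: id, risk, control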
During the term you will find that some things are more easily done with wide-format data frames, but that many others are more easily done with long-format data frames. For example, it is arguably easier to use cor.test and t.test with wide-format data, but it is much easier to perform ANOVAs with long-format data. So you should become comfortable with both formats.
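To see why the format matters, here is the same two-sample \(t\) test written both ways; the results are identical:
# wide format: pass the two columns as separate vectors
t.test(cogWide.df$risk, cogWide.df$control)
# long format: use a formula
t.test(score ~ group, data=cogLong.df)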
You can read more about converting between wide and long formats at the following websites:
www.cookbook-r.com/Manipulating_data/Converting_data_between_wide_and_long_format/
www.r-bloggers.com/2019/07/how-to-reshape-a-dataframe-from-wide-to-long-or-long-to-wide-format/
www.r-bloggers.com/2016/09/reshape-r-package-reshape2-melt-and-cast/
www.r-bloggers.com/2016/06/how-to-reshape-data-in-r-tidyr-vs-reshape2/
An experiment is conducted that measures visual acuity in groups of infants ranging from 1 to 5 months of age. Load the data into R with the following command:
load(file=url('http://pnb.mcmaster.ca/bennett/psy710/datasets/acuityWide.rda'))
Tasks:
First, inspect acuityWide.df. The data from 10 infants in 5 age groups (i.e., 1-5 months) are stored in the wide-format data frame acuityWide.df, which contains five numeric variables named m1, m2, m3, m4, and m5.
dim(acuityWide.df) # [rows,columns]
## [1] 10 5
sapply(acuityWide.df,class) # list types of variables
## m1 m2 m3 m4 m5
## "numeric" "numeric" "numeric" "numeric" "numeric"
summary(acuityWide.df) # summary of each variable
## m1 m2 m3 m4 m5
## Min. : 7.89 Min. : 8.34 Min. : 8.18 Min. : 9.66 Min. : 8.94
## 1st Qu.: 8.44 1st Qu.: 9.38 1st Qu.: 9.48 1st Qu.: 9.70 1st Qu.: 9.42
## Median : 9.96 Median : 9.75 Median :10.10 Median : 9.80 Median :10.81
## Mean : 9.45 Mean : 9.75 Mean :10.00 Mean :10.20 Mean :10.50
## 3rd Qu.:10.22 3rd Qu.:10.30 3rd Qu.:10.74 3rd Qu.:10.37 3rd Qu.:11.31
## Max. :10.76 Max. :11.09 Max. :11.12 Max. :12.29 Max. :11.96
Next, convert acuityWide.df into a long-format data frame named acuityLong.df. Store the five ages in a single variable named age and the acuity measures in a single variable named acuity.
library(reshape2)
acuityLong.df <- melt(acuityWide.df,
                      measure.vars=c("m1","m2","m3","m4","m5"), # dependent vars
                      value.name="acuity", # new dependent var
                      variable.name="age") # new grouping var
names(acuityLong.df)
## [1] "age" "acuity"
Use summary to get a summary of the variables in the new data frame. Verify that age is a factor and acuity is numeric.
sapply(acuityLong.df,class)
## age acuity
## "factor" "numeric"
summary(acuityLong.df)
## age acuity
## m1:10 Min. : 7.89
## m2:10 1st Qu.: 9.38
## m3:10 Median : 9.94
## m4:10 Mean : 9.98
## m5:10 3rd Qu.:10.76
## Max. :12.29
Finally, make a box plot of acuity for each age.
op <- par(no.readonly = T)
par(cex=1.5,cex.lab=1.5,cex.main=2,mar=c(5,5,3,1)+0.1,mgp=c(3,1,0))
boxplot(acuity~age,
data=acuityLong.df,
xlab="Age (months)",
names=c("1","2","3","4","5"),
main="Acuity Data")
par(op)
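As a quick numeric companion to the box plot, the mean acuity at each age can be computed from the long data frame:
with(acuityLong.df, tapply(acuity, age, mean)) # mean acuity per age group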
The analysis of variance (ANOVA) is a method for evaluating differences among 3 or more group means. In the following examples, the numeric dependent variable score and an independent variable, the four-level factor condition, are contained in the data frame theAOVdata.
load(file=url('http://pnb.mcmaster.ca/bennett/psy710/datasets/theAOVdataRL3.rda'))
summary(theAOVdata)
## condition score
## c1:20 Min. : 54.4
## c2:20 1st Qu.: 87.2
## c3:20 Median : 98.8
## c4:20 Mean : 98.9
## 3rd Qu.:112.9
## Max. :147.2
Task: Plot the distributions with a box plot for each group.
par(cex=1.5,cex.lab=1.5,cex.main=2,mar=c(5,5,3,1)+0.1,mgp=c(3,1,0))
boxplot(score~condition,data=theAOVdata,xlab="Condition",ylab="Score",main="theAOVdata")
par(op)
In R, ANOVA is performed with the lm and aov commands. The formula score~1+condition defines our linear model: score is modeled as the sum of an intercept (1) and an effect of condition. Note that the intercept is added by default, and so we would obtain the same results if we used the formula score~condition. The aov and lm commands compute the best-fitting (least-squares) coefficients for our model. Use aov and lm to perform an ANOVA (i.e., compute the least-squares estimates of the coefficients of our linear model), and then display the results in a table using the anova command.
options(contrasts=c("contr.sum","contr.poly")) # important!
# contr.sum means we are using sum-to-zero definition of an "effect"
# using aov()
aov.res.01 <- aov(score~1+condition,data=theAOVdata)
class(aov.res.01)
## [1] "aov" "lm"
anova(aov.res.01)
# summary(aov.res.01) # same as anova()
# using lm()
lm.res.01 <- lm(score~1+condition,data=theAOVdata)
class(lm.res.01)
## [1] "lm"
anova(lm.res.01)
# summary(lm.res.01) # not the same as anova() !
Question: The ANOVA tables contain an \(F\) test and a \(p\) value for condition. What hypothesis is being evaluated by this \(F\) test?
Answer: The \(F\) test is evaluating the null hypothesis that all of the condition means are equal or, equivalently, that all of the effects (i.e., the alphas) of condition are zero. When the null hypothesis is true, all of the data are being selected from populations with the same mean, and therefore the variation among group means is due solely to sampling variation.
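The last point can be illustrated with a small simulation. The following sketch (the population mean, SD, and seed are arbitrary choices) draws all 80 scores from a single population, so the null hypothesis is true by construction; the \(F\) test should then reject at the 0.05 level about 5% of the time:
set.seed(710) # arbitrary seed, for reproducibility
nullF <- replicate(2000, {
  y <- rnorm(80, mean=100, sd=15)               # one common population
  g <- gl(4, 20, labels=c("c1","c2","c3","c4")) # 4 groups of 20
  anova(lm(y ~ g))[1, "F value"]                # F statistic for the group factor
})
mean(nullF > qf(0.95, df1=3, df2=76)) # proportion of rejections; should be near 0.05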
We can view the best-fitting coefficients with the coef and dummy.coef commands.
coef(aov.res.01) # coefficients for 3 groups listed here
## (Intercept) condition1 condition2 condition3
## 98.911 -2.647 -8.552 8.637
dummy.coef(aov.res.01) # coefficients for all 4 groups listed here... why?
## Full coefficients are
##
## (Intercept): 98.91
## condition: c1 c2 c3 c4
## -2.647 -8.552 8.637 2.562
coef(lm.res.01) # coefficients for 3 groups listed here
## (Intercept) condition1 condition2 condition3
## 98.911 -2.647 -8.552 8.637
dummy.coef(lm.res.01) # coefficients for all 4 groups listed here... why?
## Full coefficients are
##
## (Intercept): 98.91
## condition: c1 c2 c3 c4
## -2.647 -8.552 8.637 2.562
We are using the sum-to-zero constraint on our alphas, which means that all of the group effects (alphas) must sum to zero. Therefore, our ANOVA only estimates the values of the intercept and three group effects (because the fourth is given by the sum-to-zero constraint). The coef command lists the coefficients estimated by ANOVA, whereas dummy.coef lists the effects for all groups.
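As a quick check, the four effects returned by dummy.coef should sum to zero (up to rounding error):
sum( dummy.coef(aov.res.01)$condition ) # effectively zero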
The coefficients for an lm object, but not an aov object, can be submitted easily to a \(t\) test.
summary(lm.res.01) # t test on each coefficient
##
## Call:
## lm(formula = score ~ 1 + condition, data = theAOVdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.51 -10.52 0.66 11.49 45.76
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 98.91 1.87 52.81 <2e-16 ***
## condition1 -2.65 3.24 -0.82 0.4172
## condition2 -8.55 3.24 -2.64 0.0102 *
## condition3 8.64 3.24 2.66 0.0095 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.8 on 76 degrees of freedom
## Multiple R-squared: 0.131, Adjusted R-squared: 0.0971
## F-statistic: 3.83 on 3 and 76 DF, p-value: 0.013
Question: What null hypothesis is being evaluated by each \(t\) test?
Answer: Each \(t\) test is evaluating the null hypothesis that the coefficient was drawn from a population with \(\mu=0\). Note that these \(t\) tests differ from the \(F\) test performed in ANOVA, which evaluates the hypothesis that all of the effects/coefficients are zero.
Because we are using the sum-to-zero constraint on our alphas, the effect of being in group \(j\) is defined as the difference between the mean for group \(j\) and the unweighted mean (i.e., the average of the group means), and the intercept corresponds to the unweighted group mean. We have equal \(n\) in each group, so the unweighted mean equals the grand mean of the scores.
Task: Calculate the group means, the grand mean, the unweighted mean, and the effect for each group (i.e., the difference between each group mean and the unweighted mean). Verify that these alphas are the same as those estimated by aov and/or lm.
dummy.coef(aov.res.01)
## Full coefficients are
##
## (Intercept): 98.91
## condition: c1 c2 c3 c4
## -2.647 -8.552 8.637 2.562
( grpMeans <- with(theAOVdata,tapply(score,condition,mean)) ) # group means
## c1 c2 c3 c4
## 96.26 90.36 107.55 101.47
( grandMean <- mean(theAOVdata$score) ) # equals Intercept
## [1] 98.91
( unweightedMean <- mean(grpMeans)) # equal n so this equals grand mean & intercept
## [1] 98.91
grpMeans - grandMean # equals alphas/coefficients
## c1 c2 c3 c4
## -2.647 -8.552 8.637 2.562
It is possible to use alternative definitions of effects. For example, we could define one of our conditions (c1) as a baseline condition and the others (i.e., c2, c3, & c4) as so-called treatment conditions, and define the effect of a treatment as the difference between the treatment group mean and the baseline group mean. Using the treatment definition of effects, which is the default definition in R, will change the values of the best-fitting coefficients estimated by ANOVA: the intercept equals the mean of the baseline group, and the three coefficients correspond to the differences between each treatment group and the baseline group.
options(contrasts=c("contr.treatment","contr.poly")) # important!
aov.res.02 <- aov(score~condition,data=theAOVdata) # recalculate least-squares coefficients
coef(aov.res.02)
## (Intercept) conditionc2 conditionc3 conditionc4
## 96.264 -5.905 11.283 5.208
dummy.coef(aov.res.02)
## Full coefficients are
##
## (Intercept): 96.26
## condition: c1 c2 c3 c4
## 0.000 -5.905 11.283 5.208
grpMeans - grpMeans[1]
## c1 c2 c3 c4
## 0.000 -5.905 11.283 5.208
options(contrasts=c("contr.sum","contr.poly")) # restore sum-to-zero definition
Which definition of an effect is best? It depends on the circumstances: in some sense it is arbitrary. Fortunately, the \(F\) test in a 1-way ANOVA always evaluates the same null hypothesis (H0: \(\alpha_j = 0\) for all \(j\)) and yields the same \(p\) value regardless of how we define \(\alpha_j\). (However, this is not true in the case of an unbalanced factorial ANOVA, a point that we will return to later in the course).
anova(aov.res.02) # same F and p values as anova(aov.res.01)
The ANOVA table is listed with the anova command. The first line in the ANOVA table lists the change in goodness-of-fit (i.e., SS-residuals) when we force all of the group effects to equal zero. The \(F\) test evaluates the null hypothesis that all of the group effects are zero or, equivalently, that the four conditions are samples from populations with the same mean. Essentially, we are comparing the goodness-of-fit (i.e., SS-residuals) of our full model, which includes an intercept and an effect for each group, and the fit of a reduced model that includes only an intercept. We expect the full model to provide a better fit because it has more parameters. The question is whether the change in goodness-of-fit is unusually large assuming that the true values of the coefficients are zero.
anova(aov.res.01) # standard anova table:
# anova(lm.res.01) # gives same results
In this case, the \(F\) test is significant (\(p<0.05\)), and therefore we may reject the null hypothesis that the effects are all zero in favor of the alternative hypothesis that they are not all zero.
The data presented in an ANOVA table can be interpreted as a comparison of the goodness-of-fit obtained by a full model, which includes an intercept and effects of condition, and a reduced model that includes only an intercept (i.e., the condition effects are set to zero). To illustrate this point, we will analyze theAOVdata with a reduced linear model that contains only an intercept (i.e., no effect of condition), and compare the goodness-of-fit obtained with the full and reduced models.
# compare directly to reduced model:
aov.reduced.model <- aov(score~1,theAOVdata) # intercept only
anova(aov.reduced.model,aov.res.01) # compare 2 models (compare SS to ANOVA table)
Now calculate the difference between the sum-of-squared residuals for each model. Can you find this difference in the original ANOVA table?
# calculate sum of squared residuals from each model
SS.resid.reduced <- sum( residuals(aov.reduced.model)^2 )
SS.resid.full <- sum( residuals(aov.res.01)^2 )
SS.resid.reduced - SS.resid.full # difference
## [1] 3226
Answer: The difference between SS residuals is listed as the SS for condition in the original ANOVA table. Note that the Sum of Squares (SS) value for condition in the full ANOVA table corresponds to the difference between the SS residuals in the full and reduced models, and therefore is the amount of variation in our dependent variable that is accounted for by condition. Also note that the degrees of freedom for condition corresponds to the number of coefficients for that factor in the full model, and equals the number of levels in condition minus 1. The degrees of freedom for the residuals in the full model corresponds to the total number of observations in the data set (80) minus the number of parameters in the full model, which in this case is four (i.e., the intercept and 3 condition effects).
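These degrees-of-freedom values can be read directly from the fitted model; a quick check:
nlevels(theAOVdata$condition) - 1 # df for condition
## [1] 3
df.residual(aov.res.01) # 80 observations minus 4 parameters
## [1] 76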
Association strength and effect size are measures of the magnitude of the association between the independent and dependent variables that (in theory) do not depend on sample size. Association strength refers to the proportion of variance that is accounted for by the grouping variable. One measure, eta-squared (\(\eta^2\)), is simply SS-group divided by SS-total. Another common measure, omega-squared (\(\omega^2\)), is a less biased estimate of the proportion of variance explained in the population. Omega-squared typically is less than eta-squared, though the difference is small when sample size is large.
# install.packages("effectsize") # use this if you don't have effectsize on your computer
library(effectsize)
eta_squared(aov.res.01) # in effectsize package
omega_squared(aov.res.01) # in effectsize package
Effect size is a standardized measure of differences among means. Cohen's \(d\), for example, expresses the difference between two group means relative to the pooled standard deviation. Cohen's \(f\) is a generalization of Cohen's \(d\) to cases of 3 or more means.
# install.packages("effectsize") # use this if you don't have effectsize on your computer
# library(effectsize)
cohens_f(aov.res.01)
cohens_f_squared(aov.res.01) # f-squared also is common
One interpretation of Cohen’s \(f\) is that it represents the average standardized deviation of group means from the grand mean:
MS.residuals <- 281 # from ANOVA table
( within.grp.sd <- sqrt(MS.residuals) )
## [1] 16.76
# mean( with(theAOVdata,tapply(score,condition,sd)) ) # approximately sqrt(MS.residuals)
( Cohen_d_squared <- ((grpMeans-grandMean)/within.grp.sd)^2 ) # cohen d^2 for each group
## c1 c2 c3 c4
## 0.02493 0.26025 0.26544 0.02335
sqrt( mean( Cohen_d_squared ) ) # approximate Cohen's f
## [1] 0.3788
You can get measures of association strength and effect size directly from the values in an ANOVA table, which is useful when you are trying to estimate these quantities from previously published data.
( eta2 <- 3226/(3226+21328)) # values from anova table
## [1] 0.1314
sqrt(eta2 / (1-eta2)) # estimate of cohen's f
## [1] 0.3889
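Omega-squared can also be estimated from the table values, using the standard one-way formula \(\hat{\omega}^2 = (SS_{group} - df_{group}\,MS_{resid})/(SS_{total} + MS_{resid})\). A sketch:
SS.group <- 3226; SS.resid <- 21328 # values from anova table
MS.resid <- SS.resid/76 # 76 residual df
( omega2 <- (SS.group - 3*MS.resid) / (SS.group + SS.resid + MS.resid) )
## [1] 0.096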
The analysis of variance assumes that the residuals are distributed normally and that the variance is equal across groups. We can evaluate those assumptions with shapiro.test and bartlett.test.
shapiro.test(residuals(aov.res.01)) # H0: residuals are distributed normally
##
## Shapiro-Wilk normality test
##
## data: residuals(aov.res.01)
## W = 0.99, p-value = 0.8
bartlett.test(score~condition,theAOVdata) # H0: variance is equal across groups
##
## Bartlett test of homogeneity of variances
##
## data: score by condition
## Bartlett's K-squared = 1, df = 3, p-value = 0.8
par(cex=1.5,cex.lab=1.25,cex.main=2,mar=c(5,5,3,1)+0.1,mgp=c(3,1,0),mfrow=c(1,2))
qqnorm(residuals(aov.res.01),main=""); # graphical test of normality
qqline(residuals(aov.res.01))
boxplot(score~condition,theAOVdata) # graphical test of variance
par(op)
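Note that bartlett.test is sensitive to departures from normality. A more robust alternative is Levene's test, available in the car package (a sketch, assuming car is installed):
# install.packages("car")
library(car)
leveneTest(score~condition, data=theAOVdata) # H0: variance is equal across groups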
Use the data in acuityLong.df (created in Section 2.1) to answer the following questions. Make sure to use the sum-to-zero definition of effects (options(contrasts=c("contr.sum","contr.poly"))).
Evaluate the effect of age on acuity and list the ANOVA table.
summary(acuityLong.df) # check variable names
## age acuity
## m1:10 Min. : 7.89
## m2:10 1st Qu.: 9.38
## m3:10 Median : 9.94
## m4:10 Mean : 9.98
## m5:10 3rd Qu.:10.76
## Max. :12.29
acuity.lm0 <- lm(acuity~age,data=acuityLong.df)
anova(acuity.lm0)
# library(knitr)
# kable(anova(acuity.lm0),col.names=c("df","SS","MS","F","p")) # print ANOVA table
The ANOVA table is shown above. The null hypothesis is that the effects of age are all zero. Equivalently, we can say that the population means of acuity are equal across the age conditions.
Next, compute measures of association strength and effect size for the effect of age. What do these values mean?
omega_squared(acuity.lm0)
cohens_f(acuity.lm0)
Omega-squared is the proportion of variance in acuity that is accounted for, or explained, by age. Cohen's \(f\) can be thought of as an estimate of the average, standardized difference between the group means and the grand mean.
Task: Check that the residuals satisfy ANOVA's normality and equal-variance assumptions.
# check normality:
# shapiro.test(acuityLong.df$acuity) # not correct!
shapiro.test(residuals(acuity.lm0)) # correct (test is on residuals not raw scores)
##
## Shapiro-Wilk normality test
##
## data: residuals(acuity.lm0)
## W = 0.97, p-value = 0.3
# qqnorm(residuals(acuity.lm0));qqline(residuals(acuity.lm0))
# check variance
bartlett.test(acuity~age,data=acuityLong.df)
##
## Bartlett test of homogeneity of variances
##
## data: acuity by age
## Bartlett's K-squared = 1.2, df = 4, p-value = 0.9
# boxplot(acuity~age,data=acuityLong.df)
Task: Compare the goodness-of-fit of the full model to that of a restricted, intercept-only model, and relate the difference to the ANOVA table.
# create full and restricted models
lm.full <- acuity.lm0
lm.restricted <- lm(acuity ~ 1,data=acuityLong.df) # intercept only
sum(residuals(lm.restricted)^2) - sum(residuals(lm.full)^2)
## [1] 6.53
anova(lm.restricted,lm.full)
( error.sd <- sigma(lm.full) ) # extract estimated error standard deviation from fitted model
## [1] 1
( error.var <- error.sd^2 )
## [1] 1
( grpVar <- with(acuityLong.df,tapply(acuity,age,var))) # within-group variance
## m1 m2 m3 m4 m5
## 1.2 0.8 1.0 0.7 1.3
mean(grpVar) # average within-group variance
## [1] 1
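Note that the error variance extracted with sigma() is just the mean-squared residual from the ANOVA table, so the same value can be read from the table directly:
anova(lm.full)["Residuals","Mean Sq"] # same value as error.var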