The following code compares trend analysis and regression. The file trendsData.rda contains a data frame, data.df, with data from a hypothetical experiment that measured performance (score) in infants ranging from 1 to 5 months of age. Age is stored in the numeric variable age and, rounded to the nearest month, in age.months. Age also is represented by the ordered factor age.group and the unordered factor group.
load(file=url('http://pnb.mcmaster.ca/bennett/psy710/trends/trendsData.rda'))
summary(data.df)
## score age age.group age.months group
## Min. : 69.31 Min. :0.6885 a1:10 Min. :1 g1:10
## 1st Qu.: 94.83 1st Qu.:1.9441 a2:10 1st Qu.:2 g2:10
## Median : 99.65 Median :2.9784 a3:10 Median :3 g3:10
## Mean :100.81 Mean :3.0023 a4:10 Mean :3 g4:10
## 3rd Qu.:107.20 3rd Qu.:4.1805 a5:10 3rd Qu.:4 g5:10
## Max. :132.95 Max. :5.1608 Max. :5
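The summary suggests that the five levels of age.group line up with the five rounded ages in age.months. A quick cross-tabulation (a small check added here, not part of the original analysis) should confirm that each factor level maps onto a single month:
with(data.df,table(age.group,age.months)) # each row should contain a single non-zero count of 10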
The following code was used to create Figure 1. The filled and open symbols represent group means and individual scores, respectively. The solid line is the linear regression line, which indicates that there is a slight upward linear trend of score across age. But the group means also exhibit a clear curvilinear trend around the regression line.
gmeans <- with(data.df,tapply(score,age.group,mean)) # mean score at each age
age.levels <- sort(unique(data.df$age.months)) # ages (in months) at which to plot the means
plot(score~age,data=data.df,type="p",xlab="age (months)",ylab="scores") # individual scores (open symbols)
points(x=age.levels,y=gmeans,type="p",pch=19,cex=2) # group means (filled symbols)
# lines(x=age.levels,y=gmeans,type="l") # optional: connect the group means
abline(lm(score~age,data=data.df)) # linear regression line
These observations can be evaluated quantitatively with a trend analysis, which uses the ordered factor age.group as the grouping variable.
options(contrasts=c("contr.sum","contr.poly")) # use sum-to-zero & polynomial contrasts for anova
options(digits=3) # limit the number of printed digits
class(data.df$age.group)
## [1] "ordered" "factor"
contrasts(data.df$age.group) # age.group is coded with polynomial trends
## .L .Q .C ^4
## [1,] -6.32e-01 0.535 -3.16e-01 0.120
## [2,] -3.16e-01 -0.267 6.32e-01 -0.478
## [3,] -3.29e-17 -0.535 9.64e-17 0.717
## [4,] 3.16e-01 -0.267 -6.32e-01 -0.478
## [5,] 6.32e-01 0.535 3.16e-01 0.120
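The columns of this matrix are orthonormal polynomial contrasts: each column sums to zero, has unit length, and is orthogonal to the others. A quick check of that property (an added sketch, not part of the original analysis):
round(crossprod(contrasts(data.df$age.group)),10) # t(C) %*% C should be the 4 x 4 identity matrix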
aov01 <- aov(score~age.group,data=data.df)
summary(aov01,split=list(age.group=list(lin=1,quad=2,cubic=3, quart=4)))
## Df Sum Sq Mean Sq F value Pr(>F)
## age.group 4 1145 286 3.62 0.0122 *
## age.group: lin 1 348 348 4.41 0.0414 *
## age.group: quad 1 630 630 7.97 0.0071 **
## age.group: cubic 1 30 30 0.38 0.5420
## age.group: quart 1 136 136 1.72 0.1960
## Residuals 45 3558 79
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
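Each trend sum-of-squares can be reconstructed by hand from the group means and the contrast weights: with n = 10 scores per group and weights w, SS = n*(sum(w*gmeans))^2/sum(w^2). A sketch for the linear trend, using objects created above:
w <- contrasts(data.df$age.group)[,1] # linear (.L) weights
psi <- sum(w*gmeans) # estimated linear contrast
10*psi^2/sum(w^2) # should reproduce the linear trend SS of 348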
The trends could also be analyzed with the emmeans package:
require(emmeans)
poly.weights <- contr.poly(n=5,scores=sort(unique(data.df$age.months))) # polynomial contrast weights
contrast.list <- list(L=poly.weights[,1],Q=poly.weights[,2],
                      Cube=poly.weights[,3],Quart=poly.weights[,4]) # store weights in a named list
groupMeans <- emmeans(aov01,"age.group") # get & store group means
contrast(groupMeans,contrast.list,adjust="none") # evaluate the trend contrasts
## contrast estimate SE df t.ratio p.value
## L 5.90 2.81 45 2.099 0.0410
## Q -7.94 2.81 45 -2.823 0.0070
## Cube -1.73 2.81 45 -0.615 0.5420
## Quart 3.69 2.81 45 1.312 0.1960
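These are the same tests reported in the anova table above: squaring each t ratio reproduces the corresponding F value, and the p values match.
c(L=2.099,Q=-2.823,Cube=-0.615,Quart=1.312)^2 # t^2 reproduces the F values 4.41, 7.97, 0.38, 1.72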
Next, we compare these results to a linear regression model. The sum-of-squares associated with the linear trend is the same as the value obtained with aov and emmeans, but the F and p values differ from those obtained previously.
aovLinear <- aov(score~age.months,data=data.df)
summary(aovLinear)
## Df Sum Sq Mean Sq F value Pr(>F)
## age.months 1 348 348 3.84 0.056 .
## Residuals 48 4354 91
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The reason for this difference is that the residual sum-of-squares and degrees of freedom differ from the values obtained previously. Why? Because the previous models contained non-linear trends that captured some of the variation in the dependent variable and therefore produced a smaller residual sum-of-squares. If those non-linear terms are incorporated into a fourth-order polynomial model, the regression analysis yields the same results as before.
aovPoly4 <- aov(score~age.months+I(age.months^2)+I(age.months^3)+I(age.months^4),data=data.df)
summary(aovPoly4)
## Df Sum Sq Mean Sq F value Pr(>F)
## age.months 1 348 348 4.41 0.0414 *
## I(age.months^2) 1 630 630 7.97 0.0071 **
## I(age.months^3) 1 30 30 0.38 0.5420
## I(age.months^4) 1 136 136 1.72 0.1960
## Residuals 45 3558 79
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
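The effect of the error term is easy to see by recomputing the F value for the linear trend from the tabled values: the numerator sum-of-squares (348) is identical in both analyses, but the residual mean square differs.
348/79 # F for the linear trend with the quartic model's error term (4.41)
348/91 # F for the linear trend with the linear-only error term (3.84)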
It turns out that the cubic and quartic terms capture almost no variance, so we can drop them from the model and obtain nearly the same results.
aovPoly2 <- lm(score~age.months+I(age.months^2),data=data.df)
summary(aovPoly2)
##
## Call:
## lm(formula = score ~ age.months + I(age.months^2), data = data.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.524 -6.295 0.846 5.394 27.899
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 80.359 6.037 13.31 <2e-16 ***
## age.months 14.595 4.601 3.17 0.0027 **
## I(age.months^2) -2.121 0.752 -2.82 0.0070 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.9 on 47 degrees of freedom
## Multiple R-squared: 0.208, Adjusted R-squared: 0.174
## F-statistic: 6.17 on 2 and 47 DF, p-value: 0.00416
anova(aovPoly2,aovPoly4) # compare 2 regression models
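The anova command performs an F test on the change in residual sum-of-squares between the two models, which here is a joint test of the cubic and quartic terms. From the tables above, the residual SS for the quadratic model is approximately 4354 - 630 = 3724, so the test statistic should be about
((3724-3558)/2)/79 # joint F for the cubic & quartic terms (approx. 1.05 on 2 and 45 df)
a small value that is consistent with the negligible cubic and quartic effects found earlier.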