Earlier this month I taught a two-day workshop on growth curve analysis at Georg-Elias-Müller Institute for Psychology in Göttingen, Germany. The purpose of the workshop was to provide a hands-on introduction to using GCA to analyze longitudinal or time course data, with a particular focus on eye-tracking data. All of the materials for the workshop are now available online (http://dmirman.github.io/GCA2018.html), including slides, examples, exercises, and exercise solutions. In addition to standard packages (ggplot2, lme4, etc.), we used my psy811 package for example data sets and helper functions.
At the interface of psychology, neuroscience, and neuropsychology with a focus on computational and statistical modeling.
Showing posts with label growth curve analysis. Show all posts
Showing posts with label growth curve analysis. Show all posts
Friday, March 23, 2018
Monday, June 8, 2015
A little growth curve analysis Q&A
I had an email exchange with Jeff Malins, who asked several questions about growth curve analysis. I often get questions of this sort and Jeff agreed to let me post excerpts from our (email) conversation. The following has been lightly edited for clarity and to be more concise.
Tuesday, August 5, 2014
Visualizing Components of Growth Curve Analysis
This is a guest post by Matthew Winn:
One of the more useful skills I’ve learned in the past couple years is growth curve analysis (GCA), which helps me analyze eye-tracking data and other kinds of data that take a functional form. Like some other advanced statistical techniques, it is a procedure that can be done without complete understanding, and is likely to demand more than one explanation before you really “get it”. In this post, I will illustrate the way that I think about it, in hopes that it can “click” for some more people. The objective is to break down a complex curve into individual components.
One of the more useful skills I’ve learned in the past couple years is growth curve analysis (GCA), which helps me analyze eye-tracking data and other kinds of data that take a functional form. Like some other advanced statistical techniques, it is a procedure that can be done without complete understanding, and is likely to demand more than one explanation before you really “get it”. In this post, I will illustrate the way that I think about it, in hopes that it can “click” for some more people. The objective is to break down a complex curve into individual components.
Monday, March 3, 2014
Guidebook for growth curve analysis
I don't usually like to use complex statistical methods, but every once in a while I encounter a method that is so useful that I can't avoid using it. Around the time I started doing eye-tracking research (as a post-doc with Jim Magnuson), people were starting recognize the value of using longitudinal data analysis techniques to analyze fixation time course data. Jim was ahead of most in this regard (Magnuson et al., 2007) and a special issue of the Journal of Memory and Language on data analysis methods gave as a great opportunity to describe how to apply "Growth Curve Analysis" (GCA) - a type of multilevel regression - to fixation time course data (Mirman, Dixon, & Magnuson, 2008). Unbeknownst to us, Dale Barr was working on very similar methods, though for somewhat different reasons, and our articles ended up neighbors in the special issue (Barr, 2008).
In the several years since those papers came out, it has become clear to me that other researchers would like to use GCA, but reading our paper and downloading our code examples was often not enough for them to be able to apply GCA to their own data. There are excellent multilevel regression textbooks out there, but I think it is safe to say that it's a rare cognitive or behavioral scientist who has the time and inclination to work through a 600-page advanced regression textbook. It seemed like a more practical guidebook to implementing GCA was needed, so I wrote one and it has just been published by Chapman & Hall / CRC Press as part of their R Series.
My idea was to write a relatively easy-to-understand book that dealt with the practical issues of implementing GCA using R. I assumed basic knowledge of behavioral statistics (standard coursework in graduate behavioral science programs) and minimal familiarity with R, but no expertise in computer programming or the specific R packages required for implementation (primarily lme4 and ggplot2). In addition to the core issues of fitting growth curve models and interpreting the results, the book covers plotting time course data and model fits and analyzing individual differences. Example data sets and solutions to the exercises in the book are available on my GCA website.
Obviously, the main point of this book is to help other cognitive and behavioral scientists to use GCA, but I hope it will also encourage them to make better graphs and to analyze individual differences. I think individual differences are very important to cognitive science, but most statistical methods treat them as just noise, so maybe having better methods will lead to better science, though this might be a subject for a different post. Comments and feedback about the book are, of course, most welcome.
In the several years since those papers came out, it has become clear to me that other researchers would like to use GCA, but reading our paper and downloading our code examples was often not enough for them to be able to apply GCA to their own data. There are excellent multilevel regression textbooks out there, but I think it is safe to say that it's a rare cognitive or behavioral scientist who has the time and inclination to work through a 600-page advanced regression textbook. It seemed like a more practical guidebook to implementing GCA was needed, so I wrote one and it has just been published by Chapman & Hall / CRC Press as part of their R Series.
My idea was to write a relatively easy-to-understand book that dealt with the practical issues of implementing GCA using R. I assumed basic knowledge of behavioral statistics (standard coursework in graduate behavioral science programs) and minimal familiarity with R, but no expertise in computer programming or the specific R packages required for implementation (primarily lme4 and ggplot2). In addition to the core issues of fitting growth curve models and interpreting the results, the book covers plotting time course data and model fits and analyzing individual differences. Example data sets and solutions to the exercises in the book are available on my GCA website.
Obviously, the main point of this book is to help other cognitive and behavioral scientists to use GCA, but I hope it will also encourage them to make better graphs and to analyze individual differences. I think individual differences are very important to cognitive science, but most statistical methods treat them as just noise, so maybe having better methods will lead to better science, though this might be a subject for a different post. Comments and feedback about the book are, of course, most welcome.
Tuesday, February 11, 2014
Three ways to get parameter-specific p-values from lmer
How to get parameter-specific p-values is one of the most commonly asked questions about multilevel regression. The key issue is that the degrees of freedom are not trivial to compute for multilevel regression. Various detailed discussions can be found on the R-wiki and R-help mailing list post by Doug Bates. I have experimented with three methods that I think are reasonable.
Monday, September 30, 2013
New version of lme4
If you haven't realized it yet, a new version of lme4 (version 1.0-4) was released recently (Sept. 21). For an end-user like me, there were not many changes, but there were a few:
- No more using the @ operator. After a very helpful email exchange with Ben Bolker, I came to realize that I shouldn't have been using it in the first place, but I hadn't figured out all the "accessor" methods that are available (you can get a list using methods(class = "merMod")). I had been using it in two main contexts:
- To get the fixed effect coefficients with their standard errors, etc. from the summary. A better way to do that is to use coef(summary(m))
- To get model-predicted values. A better way to do that is to use fitted(m), with the added convenience that this returns proportions for logistic models, making the model fits easier (and, I think, more intuitive) to visualize. By the way, a predict() method has now been implemented, which provides an easy way to get model predictions for new data.
- There have been some changes to the optimization algorithms and some of my models that used to run fine are now giving me some convergence warnings. This seems to happen particularly for linear models with within-subject manipulations. Using the bobyqa optimizer instead of the default Nelder-Mead optimizer seems to fix the problem. This can be done by adding control=lmerControl(optimizer = "bobyqa") to the call to lmer. A minor related point: the release notes (https://github.com/lme4/lme4/blob/master/misc/notes/release_notes.md) state that the internal computational machinery has changed, so the results will not be numerically identical, though they should be very close for reasonably well-defined fits. I have found this to be true for a reasonably large set of models that I've re-run.
- When fitting logistic models, if you use lmer(..., family="binomial"), it will call glmer() as before, but now also warns you that you should probably be using glmer() directly.
Friday, April 5, 2013
Multiple pairwise comparisons for categorical predictors
Dale Barr (@datacmdr) recently had a nice blog post about coding categorical predictors, which reminded me to share my thoughts about multiple pairwise comparisons for categorical predictors in growth curve analysis. As Dale pointed out in his post, the R default is to treat the reference level of a factor as a baseline and to estimate parameters for each of the remaining levels. This will give pairwise comparisons for each of the other levels with the baseline, but not among those other levels. Here's a simple example using the
ChickWeight
data set (part of the datasets
package). As a reminder, this data set is from an experiment on the effect of diet on early growth of chicks. There were 50 chicks, each fed one of 4 diets, and their weights were measured up to 12 times over the first 3 weeks after they were born.Friday, September 7, 2012
More on fixed and random effects: Plotting and interpreting
In a recent post I showed how plotting model fits can help to interpret higher-order polynomial terms. The key comparison there was between a model that did and did not have the higher order fixed effect terms. If you're going to use this strategy, you need to remember that fixed and random effects capture some of the same variance, so if you're going to remove some fixed effects to visualize their effect, you also need to remove the corresponding random effects. In that previous post, those higher-order random effects were not included in the model (more on this in a minute), so I could just talk about the fixed effects. Here's how it would look if I started with a full model that included all fixed and random effects and compared it to just removing the higher-order fixed effects…
Here are the models – they're the same as in the previous post, except that they include the higher-order random effects:
And here are the model fits (thinner lines represent the model without the higher-order fixed effects):m.full <- lmer(fixS ~ (ot1 + ot2 + ot3 + ot4) * obj * cond + (1 + ot1 + ot2 + ot3 + ot4 | subj) + (1 + ot1 + ot2 + ot3 + ot4 | subj:obj:cond), data = subset(data.ex, obj != "T"), REML = F)
m.exSub <- lmer(fixS ~ (ot1 + ot2 + ot3 + ot4) * obj + (ot1 + ot2 + ot3 + ot4) * cond + (ot1 + ot2) * obj * cond + (1 + ot1 + ot2 + ot3 + ot4 | subj) + (1 + ot1 + ot2 + ot3 + ot4 | subj:obj:cond), data = subset(data.ex, obj != "T"), REML = F)
Not much of a difference, is there? If we also drop the higher-order random effects (thicker dashed lines in the graph below), then we can again see that the higher-order terms were capturing the Early-vs-Late difference:
m.exSub2 <- lmer(fixS ~ (ot1 + ot2 + ot3 + ot4) * obj + (ot1 + ot2 + ot3 + ot4) * cond + (ot1 + ot2) * obj * cond + (1 + ot1 + ot2 + ot3 + ot4 | subj) + (1 + ot1 + ot2 | subj:obj:cond), data = subset(data.ex, obj != "T"), REML = F)
This graphical example shows that random and fixed effects capture some of the same variance and this point is also important for deciding which random effects to include in the model in the first place. This decision is somewhat complicated and I don't think the field has completely settled on an answer (I'll be revisiting the issue in future posts). I generally try to include as many time terms in the random effects as possible before running into convergence errors. An important rule-of-thumb is to always include random effects that correspond to the fixed effects that you plan to interpret. Since random and fixed effects capture some of the same variance, you can get spuriously significant fixed effects if you omit the corresponding random effects. It won't affect the actual fixed effect parameter estimates (since the random effects are constrained to have a mean of 0), but omitting random effects tends to reduce the standard error of the corresponding fixed effect parameter estimates, which makes them look more statistically significant than they should be.
Here's how that looks in the context of the Early-vs-Late example:
coefs.full <- as.data.frame(summary(m.full)@coefs)
coefs.full$p <- format.pval(2 * (1 - pnorm(abs(coefs.full[, "t value"]))))
coefs.full[grep("*objU:cond*", rownames(coefs.full), value = T), ]
## Estimate Std. Error t value p
## objU:condEarly -0.004164 0.01709 -0.2436 0.80751
## ot1:objU:condEarly 0.065878 0.07745 0.8506 0.39497
## ot2:objU:condEarly -0.047568 0.04362 -1.0906 0.27545
## ot3:objU:condEarly -0.156184 0.05181 -3.0145 0.00257
## ot4:objU:condEarly 0.075709 0.03308 2.2888 0.02209
m.ex <- lmer(fixS ~ (ot1 + ot2 + ot3 + ot4) * obj * cond + (1 + ot1 + ot2 + ot3 + ot4 | subj) + (1 + ot1 + ot2 | subj:obj:cond), data = subset(data.ex, obj != "T"), REML = F)
coefs.ex <- as.data.frame(summary(m.ex)@coefs)
coefs.ex$p <- format.pval(2 * (1 - pnorm(abs(coefs.ex[, "t value"]))))
coefs.ex[grep("*objU:cond*", rownames(coefs.ex), value = T), ]
## Estimate Std. Error t value p
## objU:condEarly -0.004164 0.01701 -0.2448 0.80664
## ot1:objU:condEarly 0.065878 0.07586 0.8685 0.38514
## ot2:objU:condEarly -0.047568 0.04184 -1.1370 0.25554
## ot3:objU:condEarly -0.156184 0.02327 -6.7119 1.9e-11
## ot4:objU:condEarly 0.075709 0.02327 3.2535 0.00114
As I said, the presence (m.full) vs. absence (m.ex) of the higher-order random effects does not affect the fixed effect parameter estimates, but it does affect their standard errors. Without the cubic and quartic random effects, those fixed effect standard errors are much smaller, which increases their apparent statistical significance. In this example, the cubic and quartic terms' p-values are < 0.05 either way, but you can see that the differences in the t-value were quite large, so it's not hard to imagine an effect that would look significant only without the corresponding random effect.
Wednesday, August 29, 2012
Plotting model fits
Second, and this is particularly important when using orthogonal polynomials, you want to make sure that the statistically significant effects in the model truly correspond to the “interesting” (i.e., meaningful) effects in your data. For example, if your model had a significant effects on higher-order terms like the cubic and quartic, you might want to conclude that this corresponds to a difference between early and late competition. Plotting the model fits with and without that term can help confirm that interpretation.
The first step to plotting model fits is getting those model-predicted values. If you use
lmer
, these values are stored in the eta
slot of the model object. It can be extracted using m@eta
, where m
is the model object. Let's look at an example based on eye-tracking data from Kalenine, Mirman, Middleton, & Buxbaum (2012).summary(data.ex)
## Time fixS cond obj subj
## Min. : 500 Min. :0.0000 Late :765 T:510 21 :102
## 1st Qu.: 700 1st Qu.:0.0625 Early:765 C:510 24 :102
## Median : 900 Median :0.1333 U:510 25 :102
## Mean : 900 Mean :0.2278 27 :102
## 3rd Qu.:1100 3rd Qu.:0.3113 28 :102
## Max. :1300 Max. :1.0000 40 :102
## (Other):918
## timeBin ot1 ot2 ot3
## Min. : 1 Min. :-0.396 Min. :-0.2726 Min. :-0.450
## 1st Qu.: 5 1st Qu.:-0.198 1st Qu.:-0.2272 1st Qu.:-0.209
## Median : 9 Median : 0.000 Median :-0.0909 Median : 0.000
## Mean : 9 Mean : 0.000 Mean : 0.0000 Mean : 0.000
## 3rd Qu.:13 3rd Qu.: 0.198 3rd Qu.: 0.1363 3rd Qu.: 0.209
## Max. :17 Max. : 0.396 Max. : 0.4543 Max. : 0.450
##
## ot4
## Min. :-0.3009
## 1st Qu.:-0.1852
## Median :-0.0231
## Mean : 0.0000
## 3rd Qu.: 0.2392
## Max. : 0.4012
##
ggplot(data.ex, aes(Time, fixS, color = obj)) + facet_wrap(~cond) +
stat_summary(fun.y = mean, geom = "line", size = 2)
I've renamed the conditions "Late" and "Early" based on the timing of their competition effect: looking at fixation proportions for the related Competitor (green lines) relative to the Unrelated distractor, it looks like the “Late” condition had a later competition effect than the “Early” condition. We start by fitting the full model and plotting the model fit. For convenience, we'll make a new data frame that has the modeled observed data and the model fit:
library(lme4)
m.ex <- lmer(fixS ~ (ot1 + ot2 + ot3 + ot4) * obj * cond + (1 + ot1 + ot2 + ot3 + ot4 | subj) + (1 + ot1 + ot2 | subj:obj:cond), data = subset(data.ex, obj != "T"), REML = F)
data.ex.fits <- data.frame(subset(data.ex, obj != "T"), GCA_Full = m.ex@eta)
ggplot(data.ex.fits, aes(Time, fixS, color = obj)) + facet_wrap(~cond) + stat_summary(fun.data = mean_se, geom = "pointrange", size = 1) + stat_summary(aes(y = GCA_Full), fun.y = mean, geom = "line", size = 2) + labs(x = "Time Since Word Onset (ms)", y = "Fixation Proportion")
The fit looks pretty good and the model seems to capture the early-vs.-late competition difference, so now we can use the normal approximation to get p-values for the object-by-condition interaction:
coefs.ex <- as.data.frame(summary(m.ex)@coefs)
coefs.ex$p <- format.pval(2 * (1 - pnorm(abs(coefs.ex[, "t value"]))))
coefs.ex[grep("*objU:cond*", rownames(coefs.ex), value = T), ]
## Estimate Std. Error t value p
## objU:condEarly -0.004164 0.01701 -0.2448 0.80664
## ot1:objU:condEarly 0.065878 0.07586 0.8685 0.38514
## ot2:objU:condEarly -0.047568 0.04184 -1.1370 0.25554
## ot3:objU:condEarly -0.156184 0.02327 -6.7119 1.9e-11
## ot4:objU:condEarly 0.075709 0.02327 3.2535 0.00114
There are significant object-by-condition interaction effects on the cubic and quartic terms, so that's where competition in the two conditions differed, but does that correspond to the early-vs.-late difference? To answer this question we can fit a model that does not have those cubic and quartic terms and visually compare it to the full model. We'll plot the data with pointrange
, the full model with thick lines, and the smaller model with thinner lines.m.exSub <- lmer(fixS ~ (ot1 + ot2 + ot3 + ot4) * obj + (ot1 + ot2 + ot3 + ot4) * cond + (ot1 + ot2) * obj * cond + (1 + ot1 + ot2 + ot3 + ot4 | subj) + (1 + ot1 + ot2 | subj:obj:cond), data = subset(data.ex, obj != "T"), REML = F)
data.ex.fits$GCA_Sub <- m.exSub@eta
ggplot(data.ex.fits, aes(Time, fixS, color = obj)) + facet_wrap(~cond) + stat_summary(fun.data = mean_se, geom = "pointrange", size = 1) + stat_summary(aes(y = GCA_Full), fun.y = mean, geom = "line", size = 2) + stat_summary(aes(y = GCA_Sub), fun.y = mean, geom = "line", size = 1) + labs(x = "Time Since Word Onset (ms)", y = "Fixation Proportion")
Well, it sort of looks like the thinner lines have less early-late difference, but it is hard to see. It will be easier if we look directly at the competition effect size (that is, the difference between the competitor and unrelated fixation curves):
ES <- ddply(data.ex.fits, .(subj, Time, cond), summarize, Competition = fixS[obj == "C"] - fixS[obj == "U"], GCA_Full = GCA_Full[obj == "C"] - GCA_Full[obj == "U"], GCA_Sub = GCA_Sub[obj == "C"] - GCA_Sub[obj == "U"])
ES <- rename(ES, c(cond = "Condition"))
ggplot(ES, aes(Time, Competition, color = Condition)) + stat_summary(fun.y = mean, geom = "point", size = 4) + stat_summary(aes(y = GCA_Full), fun.y = mean, geom = "line", size = 2) + labs(x = "Time Since Word Onset (ms)", y = "Competition") + stat_summary(aes(y = GCA_Sub), fun.y = mean, geom = "line", size = 1)
Now we can clearly see that the full model (thick lines) captures the early-vs.-late difference, but when we remove the cubic and quartic terms (thinner lines), that difference almost completely disappears. So that shows that those higher-order terms really were capturing the timing of the competition effect.
P.S.: For those that care about behind-the-scenes/under-the-hood things, this post was created (mostly) using
knitr
in RStudio.
Subscribe to:
Posts (Atom)