How to get parameter-specific p-values is one of the most commonly asked questions about multilevel regression. The key issue is that the degrees of freedom are not trivial to compute for a multilevel model. Detailed discussions can be found on the R-wiki and in an R-help mailing list post by Doug Bates. I have experimented with three methods that I think are reasonable.
1. Use the normal approximation. Since the t distribution converges to the z distribution as degrees of freedom increase, this is like assuming infinite degrees of freedom. This is unambiguously anti-conservative, but for reasonable sample sizes it appears not to be very anti-conservative (Barr et al., 2013). That is, if we take the p-value to measure the probability of a false positive, this approximation produces a somewhat (but perhaps not alarmingly) higher false positive rate than the nominal 5% at p = 0.05 (a small numeric illustration follows the example below).
Here is an example using proportions of semantic errors in picture naming by different aphasia subtypes (the data file can be found here):
load("Examples.RData") require(lme4) # fit the model m.sem <- lmer(Semantic.error ~ TestTime * Diagnosis + (TestTime | SubjectID), data = NamingRecovery, REML = FALSE) # extract coefficients coefs <- data.frame(coef(summary(m.sem))) # use normal distribution to approximate p-value coefs$p.z <- 2 * (1 - pnorm(abs(coefs$t.value))) coefs
##                               Estimate Std..Error t.value       p.z
## (Intercept)                   0.045767   0.007776  5.8855 3.970e-09
## TestTime                     -0.008685   0.003524 -2.4644 1.372e-02
## DiagnosisConduction          -0.015149   0.010039 -1.5090 1.313e-01
## DiagnosisWernicke            -0.004899   0.010287 -0.4762 6.339e-01
## TestTime:DiagnosisConduction  0.007308   0.004550  1.6063 1.082e-01
## TestTime:DiagnosisWernicke    0.012854   0.004662  2.7571 5.832e-03
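As a quick illustration of that anti-conservatism (not part of the example above; the t-value and degrees of freedom here are arbitrary, chosen only to show the direction of the difference), compare a two-tailed p-value from the normal distribution with one from a t-distribution with a modest number of degrees of freedom:

# same test statistic, two reference distributions
t.val <- 2.0
2 * (1 - pnorm(t.val))        # normal approximation: roughly 0.046
2 * (1 - pt(t.val, df = 20))  # t-distribution with 20 df: roughly 0.059

For any nonzero t-value, the normal approximation gives the smaller p-value; how much smaller depends on the degrees of freedom.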
2. Use the Satterthwaite approximation, which is implemented in the lmerTest package. According to the documentation, this is based on SAS proc mixed theory. The lmerTest package overloads the lmer function, so you can just re-fit the model using exactly the same code, but the summary() will now include approximate degrees of freedom and p-values. This implementation is extremely easy to use, but it can be a little maddening if you forget whether your model is an object of type lmerMod or merModLmerTest (a quick way to check is shown after the example below).
require(lmerTest)
# re-fit model
m.semTest <- lmer(Semantic.error ~ TestTime * Diagnosis + (TestTime | SubjectID),
                  data = NamingRecovery, REML = FALSE)
# get Satterthwaite-approximated degrees of freedom
coefs$df.Satt <- coef(summary(m.semTest))[, 3]
# get approximate p-values
coefs$p.Satt <- coef(summary(m.semTest))[, 5]
coefs
##                               Estimate Std..Error t.value       p.z
## (Intercept)                   0.045767   0.007776  5.8855 3.970e-09
## TestTime                     -0.008685   0.003524 -2.4644 1.372e-02
## DiagnosisConduction          -0.015149   0.010039 -1.5090 1.313e-01
## DiagnosisWernicke            -0.004899   0.010287 -0.4762 6.339e-01
## TestTime:DiagnosisConduction  0.007308   0.004550  1.6063 1.082e-01
## TestTime:DiagnosisWernicke    0.012854   0.004662  2.7571 5.832e-03
##                              df.Satt    p.Satt
## (Intercept)                    23.00 5.344e-06
## TestTime                       22.99 2.163e-02
## DiagnosisConduction            23.00 1.449e-01
## DiagnosisWernicke              23.00 6.384e-01
## TestTime:DiagnosisConduction   22.99 1.219e-01
## TestTime:DiagnosisWernicke     22.99 1.122e-02
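If you lose track of which version of lmer() produced a given fit, checking the class of the object sorts it out (a small convenience check, not part of the example above):

class(m.sem)      # "lmerMod": fit with lme4's lmer, so summary() has no df or p-values
class(m.semTest)  # "merModLmerTest": fit after loading lmerTest, so summary() adds them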
3. Use the Kenward-Roger approximation to get approximate degrees of freedom and the t-distribution to get p-values, which is implemented in the pbkrtest package (see also the model-comparison note after the example below).
require(pbkrtest)
# get the KR-approximated degrees of freedom
df.KR <- get_ddf_Lb(m.sem, fixef(m.sem))
# get p-values from the t-distribution using the t-values and approximated
# degrees of freedom
coefs$p.KR <- 2 * (1 - pt(abs(coefs$t.value), df.KR))
coefs
##                               Estimate Std..Error t.value       p.z
## (Intercept)                   0.045767   0.007776  5.8855 3.970e-09
## TestTime                     -0.008685   0.003524 -2.4644 1.372e-02
## DiagnosisConduction          -0.015149   0.010039 -1.5090 1.313e-01
## DiagnosisWernicke            -0.004899   0.010287 -0.4762 6.339e-01
## TestTime:DiagnosisConduction  0.007308   0.004550  1.6063 1.082e-01
## TestTime:DiagnosisWernicke    0.012854   0.004662  2.7571 5.832e-03
##                              df.Satt    p.Satt      p.KR
## (Intercept)                    23.00 5.344e-06 9.049e-06
## TestTime                       22.99 2.163e-02 2.283e-02
## DiagnosisConduction            23.00 1.449e-01 1.468e-01
## DiagnosisWernicke              23.00 6.384e-01 6.390e-01
## TestTime:DiagnosisConduction   22.99 1.219e-01 1.238e-01
## TestTime:DiagnosisWernicke     22.99 1.122e-02 1.210e-02
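As an aside, pbkrtest also provides KRmodcomp() for Kenward-Roger scaled F-tests of nested model comparisons. If you want a single test of the whole TestTime-by-Diagnosis interaction rather than parameter-specific p-values, a sketch along these lines should work (the reduced model here is my own construction, not part of the example above):

# fit a reduced model without the interaction (lme4::lmer is called explicitly in
# case lmerTest is still loaded) and compare it to the full model
m.sem.red <- lme4::lmer(Semantic.error ~ TestTime + Diagnosis + (TestTime | SubjectID),
                        data = NamingRecovery, REML = FALSE)
KRmodcomp(m.sem, m.sem.red)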
Not surprisingly, the Satterthwaite and Kenward-Roger approximations produce slightly more conservative p-values than the normal approximation does. However, even in this not-very-large data set (24 participants, 6-9 in each of 3 groups, 5 observations for each participant), the normal approximation is only slightly less conservative than these two options. The approximate degrees of freedom are slightly higher with the Satterthwaite approximation (23) than with the Kenward-Roger approximation (20.1), which makes the latter a bit more conservative. Of course, these are not systematic tests, but this pattern is fairly representative of the models I've compared.
Thank you very much, Dan! This was incredibly useful.
Yes, thanks!
You are a god among men, thank you so much!
Comes in very handy! Thanks.
Thanks, it is great!
Thank you. I need it very much.
Hey, thank you for these! I also have used mixed from the afex package, and it sometimes gives rather different results. Do you have any explanation why that may be or which to trust?
Thanks!
I've been meaning to try afex but haven't gotten around to it. Are you getting different parameter estimates or just different p-values?
omg thank you so much for posting this...
Thank you! I tried the first one and it worked perfectly. I double-checked it with a different model where I knew the p-values and the output was identical!
If you want to know why the author didn't include the p-values, see here: https://stat.ethz.ch/pipermail/r-help/2006-May/094765.html
Hi, thanks for your answer. Can you please tell me how I can get p-values for the random effects in the lmer function? I have a model like this:
mod1 = lmer(dep ~ ind1 + (1 + ind2 | ind3), data = dataset)
The variances and covariances of those random effects should be in the output of summary(mod1)
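You can also pull them out directly with VarCorr(), which contains the same information as the summary:

VarCorr(mod1)  # random-effect variance-covariance estimates (printed as std. devs. and correlations)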
I'm not aware of a generally accepted method for testing whether a random effect is significant. You can fit a model without that random effect, then do a model comparison to evaluate whether the model with the random effect is significantly better than the one without it. However, the likelihood ratio test that you'll get from that model comparison is not really designed for evaluating random effects, so there is disagreement about whether that would be a valid test or not.
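Concretely, the comparison would look something like this sketch (using the variable names from the model above; the validity caveat applies to the likelihood ratio test it produces):

# full model with the random slope for ind2 vs. a reduced model without it
m.full <- lmer(dep ~ ind1 + (1 + ind2 | ind3), data = dataset, REML = FALSE)
m.red  <- lmer(dep ~ ind1 + (1 | ind3), data = dataset, REML = FALSE)
# likelihood ratio test for the random slope (and its covariance with the intercept)
anova(m.red, m.full)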
Thanks a lot! What is the full name of the Barr et al paper?
Barr, D. J., Levy, R., Scheepers, C., & Tily, H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68(3), 255-278.
Thanks heaps Dan :)
Your service is appreciated.