Saturday, September 22, 2012

The power to see the future is exciting and terrifying

In a recent comment in Nature, Daniel Acuna, Stefano Allesina, and Konrad Kording describe a statistical model for predicting h-index. In case you are not familiar with it, h-index is a citation-based measure of scientific impact. An h-index of n means that you have n publications with at least n citations. I only learned about h-index relatively recently and I think it is a quite elegant measure -- simple to compute, not too biased by a single highly-cited paper or by many low-impact (uncited) papers. Acuna, Allesina, and Kording took publicly available data and developed a model for predicting future h-index based on number of articles, current h-index, years since first publication, number of distinct journals published in, and number of articles in the very top journals in the field (Nature, Science, PNAS, and Neuron). Their model accounted for about 66% of the variance in future h-index among neuroscientists, which I think is pretty impressive. Perhaps the coolest thing about this project is the accompanying website that allows users to predict their own h-index.

Since hiring and tenure decisions are intended to reflect both past accomplishments and expectations of future success, this prediction model is potentially quite useful. Acuna et al. are appropriately circumspect about relying on a single measure for making such important decisions and they are aware that over-reliance on a single metric to produce "gaming" behavior. So the following is not meant as a criticism of their work, but two examples jumped to my mind: (1) Because number of distinct journals is positively associated with future h-index (presumably it is an indicator of breadth of impact), researchers may choose to send their manuscripts to less appropriate journals in order to increase the number of journals in which their work has appeared. Those journals, in turn, would be less able to provide appropriate peer review and the articles would be less visible to the relevant audience, so their impact would actually be lower. (2) The prestige of those top journals already leads them to be targets for falsified data -- Nature, Science, and PNAS are among the leading publishers of retractions (e.g., Liu, 2006). Formalizing and quantifying that prestige factor can only serve to increase the motivation for unethical scientific behavior.

That said, I enjoyed playing around with the simple prediction calculator on their website. I'd be wary if my employer wanted to use this model to evaluate me, but I think it's kind of a fun way to set goals for myself: the website gave me a statistical prediction for how my h-index will increase over the next 10 years, now I'm going to try to beat that prediction. Since h-index is (I think) relatively hard to "game", this seems like a reasonably challenging goal. Acuna, D. E., Allesina, S., & Kording, K. P. (2012). Predicting scientific success. Nature, 489 (7415), 201-202. DOI: 10.1038/489201a
Liu, S. V. (2006). Top Journals’ Top Retraction Rates. Scientific Ethics, 1 (2), 91-93.

Tuesday, September 18, 2012

Aggregating data across trials of different durations

Note: This post is a summary of a more detailed technical report.

In a typical “visual world paradigm” (VWP) eye tracking study a trial ends when the participant responds, which naturally leads to some trials that are shorter than others. So we need to decide when computing fixation proportions at later time points, should terminated trials be included or not? Based on informal discussions with other VWP researchers, I think three approaches are currently in use: (1) for each time bin, include all trials and count post-response frames as non-object fixations (i.e., the participant is done fixating all objects from this trial), (2) include all trials and count post-response frames as target fixations (i.e., if the participant selected the correct object, then consider all subsequent fixations to be on that object; note that, typically, any trials on which the participant made an incorrect response are excluded from analysis), (3) include only trials that are currently on-going and ignore any terminated trials since there is no data for those trials.

The problem with the third approach is that it is a form of selection bias because trials do not terminate at random, so as the time series progresses through the time window, the data move further and further from the complete, unbiased set of trials to a biased subset of only trials that required additional processing time. This bias will operate both between conditions (i.e., more trials from a condition with difficult stimuli than from a condition with easy stimuli) and within conditions (i.e., more of the trials that were difficult than that were easy within a condition). 

Here's an analogy to clarify this selection bias: imagine that we want to evaluate the response rate over time to a drug for a deadly disease. We enroll 100 participants in the trial and administer the drug. At first, only 50% of the participants respond to the drug. As the trial progresses, the non-responders begin to, unfortunately, die. After 6 months, only 75 participants are alive and participating in the trial and the same 50 are responding to the treatment. At this point, is the response rate the same 50% or has it risen to 67%? Would it be accurate to conclude that responsiveness to the treatment increases after 6 months?

Returning to eye-tracking data, the effect of this selection bias is to make differences appear more static. So, for target fixation data, you get the pattern below: considering only on-going makes it look like there is an asymptote difference between conditions, but "padding" the post-response frames with Target fixations correctly captures the processing speed difference. (These data are from a Monte Carlo simulation, so we know that the Target method is correct). 

For competitor fixations, ignoring terminated trials makes the competition effects look longer-lasting, as in the figure on the left. These data come from our recent study of taxonomic and thematic semantic competition, so you can see the selection bias play out in real VWP data. We also randomly dropped 10% and 20% of the data points to show that the effect of ignoring terminated trials is not just a matter of having fewer data points.

Whether post-response data are considered "Target" or "Non-object" fixations does not seem to have biasing effects, though it does affect how the data do look in the same way that probability distribution curves and cumulative distribution curves show the same underlying data but in different ways. More details on all of this are available in our technical report.

Friday, September 7, 2012

More on fixed and random effects: Plotting and interpreting

In a recent post I showed how plotting model fits can help to interpret higher-order polynomial terms. The key comparison there was between a model that did and did not have the higher order fixed effect terms. If you're going to use this strategy, you need to remember that fixed and random effects capture some of the same variance, so if you're going to remove some fixed effects to visualize their effect, you also need to remove the corresponding random effects. In that previous post, those higher-order random effects were not included in the model (more on this in a minute), so I could just talk about the fixed effects. Here's how it would look if I started with a full model that included all fixed and random effects and compared it to just removing the higher-order fixed effects…

Here are the models – they're the same as in the previous post, except that they include the higher-order random effects:
m.full <- lmer(fixS ~ (ot1 + ot2 + ot3 + ot4) * obj * cond + (1 + ot1 + ot2 + ot3 + ot4 | subj) + (1 + ot1 + ot2 + ot3 + ot4 | subj:obj:cond), data = subset(data.ex, obj != "T"), REML = F)
m.exSub <- lmer(fixS ~ (ot1 + ot2 + ot3 + ot4) * obj + (ot1 + ot2 + ot3 + ot4) * cond + (ot1 + ot2) * obj * cond + (1 + ot1 + ot2 + ot3 + ot4 | subj) + (1 + ot1 + ot2 + ot3 + ot4 | subj:obj:cond), data = subset(data.ex, obj != "T"), REML = F)
And here are the model fits (thinner lines represent the model without the higher-order fixed effects):
plot of chunk effect-sizes

Not much of a difference, is there? If we also drop the higher-order random effects (thicker dashed lines in the graph below), then we can again see that the higher-order terms were capturing the Early-vs-Late difference:
m.exSub2 <- lmer(fixS ~ (ot1 + ot2 + ot3 + ot4) * obj + (ot1 + ot2 + ot3 + ot4) * cond + (ot1 + ot2) * obj * cond + (1 + ot1 + ot2 + ot3 + ot4 | subj) + (1 + ot1 + ot2 | subj:obj:cond), data = subset(data.ex, obj != "T"), REML = F)
plot of chunk unnamed-chunk-1

This graphical example shows that random and fixed effects capture some of the same variance and this point is also important for deciding which random effects to include in the model in the first place. This decision is somewhat complicated and I don't think the field has completely settled on an answer (I'll be revisiting the issue in future posts). I generally try to include as many time terms in the random effects as possible before running into convergence errors. An important rule-of-thumb is to always include random effects that correspond to the fixed effects that you plan to interpret. Since random and fixed effects capture some of the same variance, you can get spuriously significant fixed effects if you omit the corresponding random effects. It won't affect the actual fixed effect parameter estimates (since the random effects are constrained to have a mean of 0), but omitting random effects tends to reduce the standard error of the corresponding fixed effect parameter estimates, which makes them look more statistically significant than they should be.

Here's how that looks in the context of the Early-vs-Late example:
coefs.full <-
coefs.full$p <- format.pval(2 * (1 - pnorm(abs(coefs.full[, "t value"]))))
coefs.full[grep("*objU:cond*", rownames(coefs.full), value = T), ]
##                     Estimate Std. Error t value       p
## objU:condEarly     -0.004164    0.01709 -0.2436 0.80751
## ot1:objU:condEarly  0.065878    0.07745  0.8506 0.39497
## ot2:objU:condEarly -0.047568    0.04362 -1.0906 0.27545
## ot3:objU:condEarly -0.156184    0.05181 -3.0145 0.00257
## ot4:objU:condEarly  0.075709    0.03308  2.2888 0.02209
m.ex <- lmer(fixS ~ (ot1 + ot2 + ot3 + ot4) * obj * cond + (1 + ot1 + ot2 + ot3 + ot4 | subj) + (1 + ot1 + ot2 | subj:obj:cond), data = subset(data.ex, obj != "T"), REML = F)
coefs.ex <-
coefs.ex$p <- format.pval(2 * (1 - pnorm(abs(coefs.ex[, "t value"]))))
coefs.ex[grep("*objU:cond*", rownames(coefs.ex), value = T), ]
##                     Estimate Std. Error t value       p
## objU:condEarly     -0.004164    0.01701 -0.2448 0.80664
## ot1:objU:condEarly  0.065878    0.07586  0.8685 0.38514
## ot2:objU:condEarly -0.047568    0.04184 -1.1370 0.25554
## ot3:objU:condEarly -0.156184    0.02327 -6.7119 1.9e-11
## ot4:objU:condEarly  0.075709    0.02327  3.2535 0.00114

As I said, the presence (m.full) vs. absence (m.ex) of the higher-order random effects does not affect the fixed effect parameter estimates, but it does affect their standard errors. Without the cubic and quartic random effects, those fixed effect standard errors are much smaller, which increases their apparent statistical significance. In this example, the cubic and quartic terms' p-values are < 0.05 either way, but you can see that the differences in the t-value were quite large, so it's not hard to imagine an effect that would look significant only without the corresponding random effect.