Thursday, May 16, 2013

A function for comparing groups on a set of variables

I'm often in the position of needing to compare groups of either items or participants on some set of variables. For example, I might want to compare recognition of words that differ on some measure of lexical neighborhood density but are matched on word length, frequency, etc. Similarly, I might want to compare individuals with aphasia who have anterior vs. posterior lesions but are matched on lesion size, aphasia severity, age, etc. I'll also need to report these comparisons in a neat table if/when I write up the results of the study. This means computing and collating a bunch of means, standard deviations, and t-tests. None of this is difficult, but it is laborious (and boring), so I decided to write a function that would do it for me. Details after the jump.

The function (compareGroups) takes a data frame and the name of the grouping variable and returns a data frame with rows corresponding to each of the numeric variables in the original data frame and columns corresponding to the means, standard deviations, and t- and p-values for the t-test comparing the groups. There is also a row for the number of observations in each group. It should be easy to tweak the function to handle more than 2 groups, but then it would need a different statistical test and the 2-group case is the most common one for me.
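To make that description concrete, here is a minimal sketch of the logic. It is not the code from the gist: the name compareGroupsSketch is made up for illustration, and I am assuming R's default Welch t-test, which may not match what compareGroups does internally.

```r
# Illustrative sketch only (hypothetical name, not the posted gist):
# compare two groups on every numeric column of a data frame.
compareGroupsSketch <- function(data, groupVar) {
  grp <- factor(data[[groupVar]])
  stopifnot(nlevels(grp) == 2)  # this sketch handles exactly two groups
  lev <- levels(grp)

  # Keep the numeric columns, excluding the grouping variable itself
  isNum <- vapply(data, is.numeric, logical(1))
  isNum[groupVar] <- FALSE
  vars <- names(data)[isNum]

  # First row: number of observations per group
  n <- table(grp)
  nRow <- data.frame(variable = "N",
                     M1 = as.numeric(n[1]), M2 = as.numeric(n[2]),
                     SD1 = NA_real_, SD2 = NA_real_,
                     t = NA_real_, p = NA_real_,
                     stringsAsFactors = FALSE)

  # One row per numeric variable: means, SDs, and the t-test
  rows <- lapply(vars, function(v) {
    parts <- split(data[[v]], grp)
    x <- parts[[1]]
    y <- parts[[2]]
    # Assumption: Welch t-test (the t.test() default). t.test() stops with
    # an error when the data are essentially constant, so report NA then.
    tt <- tryCatch(t.test(x, y), error = function(e) NULL)
    data.frame(variable = v,
               M1 = mean(x, na.rm = TRUE), M2 = mean(y, na.rm = TRUE),
               SD1 = sd(x, na.rm = TRUE), SD2 = sd(y, na.rm = TRUE),
               t = if (is.null(tt)) NA_real_ else unname(tt$statistic),
               p = if (is.null(tt)) NA_real_ else tt$p.value,
               stringsAsFactors = FALSE)
  })

  out <- do.call(rbind, c(list(nRow), rows))
  # Label the mean and SD columns with the group names, e.g. few.M, many.SD
  names(out)[2:5] <- c(paste0(lev, ".M"), paste0(lev, ".SD"))
  out
}

# Try it on a built-in data set: supplement type (OJ vs. VC) in ToothGrowth
compareGroupsSketch(ToothGrowth, "supp")
```

The sketch leaves out the rounding and p-value formatting; the real function may well differ in those details and in the choice of test.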

Here's an example of the function in action, generating the results for Table 1 from our recent paper investigating the neural basis of semantic and phonological neighborhood effects in picture naming (Mirman & Graziano, in press):

> summary(SND)
##       word    SemNear_Cond    numNear           NOF         lnFreqHAL    
##  anchor : 1   few :36      Min.   : 0.00   Min.   : 9.0   Min.   : 6.16  
##  apple  : 1   many:36      1st Qu.: 0.00   1st Qu.:13.0   1st Qu.: 7.90  
##  ball   : 1                Median : 0.50   Median :14.0   Median : 8.75  
##  balloon: 1                Mean   : 1.65   Mean   :14.7   Mean   : 8.74  
##  banana : 1                3rd Qu.: 2.00   3rd Qu.:16.0   3rd Qu.: 9.60  
##  bed    : 1                Max.   :19.00   Max.   :22.0   Max.   :12.16  
##  (Other):66                                                              
##     logfreq          NPhon            nd           cohdens      
##  Min.   :0.363   Min.   :2.00   Min.   : 0.51   Min.   :  0.59  
##  1st Qu.:0.586   1st Qu.:3.00   1st Qu.: 2.45   1st Qu.: 13.58  
##  Median :0.952   Median :4.00   Median : 7.34   Median : 34.61  
##  Mean   :1.057   Mean   :4.17   Mean   :13.00   Mean   : 46.73  
##  3rd Qu.:1.389   3rd Qu.:5.00   3rd Qu.:22.90   3rd Qu.: 64.52  
##  Max.   :2.347   Max.   :7.00   Max.   :49.29   Max.   :157.30

> source("compareGroups.R")
> compareGroups(SND, "SemNear_Cond")
##    variable  few.M many.M  few.SD many.SD       t      p
## 1         N 36.000 36.000      NA      NA      NA     NA
## 2   numNear  0.000  3.306  0.0000  3.8606 -5.1374 <1e-04 
## 3       NOF 14.389 14.917  2.2963  2.3949 -0.9544  0.343
## 4 lnFreqHAL  8.708  8.779  1.2564  1.4250 -0.2255  0.822
## 5   logfreq  1.005  1.108  0.4982  0.5217 -0.8602  0.393
## 6     NPhon  4.167  4.167  1.1832  1.4442  0.0000      1
## 7        nd 13.297 12.711 13.1659 13.7660  0.1845  0.854
## 8   cohdens 50.346 43.106 42.6230 40.7654  0.7365  0.464

As reported in the paper, we had two groups of 36 words that differed in terms of number of near semantic neighbors (numNear) and were matched on number of features (NOF), HAL word frequency (lnFreqHAL), ANC word frequency (logfreq), number of phonemes (NPhon), phonological neighborhood density (nd), and cohort density (cohdens).

You (and I) will still have to pull together the data for these comparisons, but at least the comparison step will be easy. In my first foray into GitHub, I've posted the code for compareGroups as a gist and here it is embedded:
