Saturday, March 2, 2013

Why I use ggplot


For the last few years I have been using the ggplot2 package to make all of my figures. I had used Matlab previously and ggplot takes some getting used to, so this was not an easy switch for me. Joe Fruehwald's Penn R work group was a huge help (and more recently, he posted this excellent tutorial). Now that I've got the hang of it, there are two features of ggplot that I absolutely can't live without.

First, it can compute summary statistics "on the fly". This means that you can plot individual participant data, or overall average data, or group averages, or group averages excluding an outlier participant, etc., etc. all from a single master data frame. In most other graphing programs you'd have to create separate data frames or write extremely complex plot commands. Such complex plot commands are just begging for minor errors that would result in your graphs showing something other than what you think, or want, them to be showing. Which brings me to the second great feature.
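To make the "on the fly" point concrete, here is a minimal sketch. The data frame `dat` and its columns (`Subject`, `Condition`, `Time`, `Accuracy`) are invented for illustration, and the `fun = mean` argument assumes ggplot2 3.3.0 or later (older versions used `fun.y`). All three plots come from the same master data frame, and the averaging happens inside the plot call.

```r
library(ggplot2)

# Hypothetical master data frame: 6 participants x 2 conditions x 4 time points.
set.seed(1)
dat <- expand.grid(Subject = factor(1:6),
                   Condition = c("A", "B"),
                   Time = 1:4)
dat$Accuracy <- 0.5 + 0.05 * dat$Time + 0.1 * (dat$Condition == "B") +
  rnorm(nrow(dat), sd = 0.05)

# Individual participant data: one line per participant per condition.
p_individual <- ggplot(dat, aes(x = Time, y = Accuracy, color = Condition)) +
  geom_line(aes(group = interaction(Subject, Condition)))

# Condition averages, computed at plot time by stat_summary().
p_means <- ggplot(dat, aes(x = Time, y = Accuracy, color = Condition)) +
  stat_summary(fun = mean, geom = "line") +
  stat_summary(fun = mean, geom = "point")

# The same averages excluding one participant: subset inside the call,
# without creating a second summary data frame.
p_no_outlier <- ggplot(subset(dat, Subject != "3"),
                       aes(x = Time, y = Accuracy, color = Condition)) +
  stat_summary(fun = mean, geom = "line")
```

Printing any of these objects (for example `print(p_means)`) draws the figure; switching between views means changing one layer, not rebuilding the data.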

The "gg" in ggplot stands for "grammar of graphics". Without getting into the history or technicalities of that term, the idea is that you define a set of mappings between variables in your data and visual properties of your graph. This is actually how we usually talk about graphs: "I want time on the x-axis and accuracy on the y-axis and the conditions to be different colors". In ggplot, the syntax is very much like that (...x=Time, y=Accuracy, color=Condition...), and the resulting graph systematically follows that description. As a result, you don't end up with graphs like this one, which is a real figure from a manuscript that I read:
I've cropped out the axis labels and legend to make it anonymous (the scientists who produced this graph are excellent researchers and this graphical typo should not be held against them; my point is just that ggplot helps you avoid such typos). The study was a 2x2 design, so you should be able to show the four conditions with two "aesthetics" (in the ggplot vernacular): for example, color (black vs. gray) and line type (solid vs. dashed). Instead, the graph has three aesthetics (color, line type, and point shape: square vs. triangle), but their assignment does not follow the 2x2 design -- the top and bottom lines are the same, so we can't tell which corresponds to which cell in the design; the two middle lines differ from the two outer lines in both color and shape, which is redundant (though redundancy is not necessarily bad); and line type only distinguishes one pair of lines and not the other. This is a really easy mistake to make when making graphs in Excel or Matlab, but the ggplot syntax forces a systematic mapping (...color=Variable1, linetype=Variable2...) so this sort of problem is virtually eliminated.
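As a rough sketch of what that systematic mapping looks like for a 2x2 design, here is a small example. The data frame `dat2`, the level names, and the accuracy values are all made up for illustration; only the `color=Variable1, linetype=Variable2` mapping comes from the description above.

```r
library(ggplot2)

# Hypothetical 2x2 design: Variable1 (2 levels) x Variable2 (2 levels),
# measured at 5 time points. Values are invented for illustration.
dat2 <- expand.grid(Variable1 = c("low", "high"),
                    Variable2 = c("control", "treatment"),
                    Time = 1:5)
dat2$Accuracy <- 0.4 + 0.05 * dat2$Time +
  0.10 * (dat2$Variable1 == "high") +
  0.15 * (dat2$Variable2 == "treatment")

# One aesthetic per factor: color follows Variable1, line type follows Variable2.
# Each of the four cells gets a unique, predictable color/linetype combination.
p_2x2 <- ggplot(dat2, aes(x = Time, y = Accuracy,
                          color = Variable1, linetype = Variable2)) +
  geom_line() +
  geom_point() +
  scale_color_manual(values = c(low = "gray60", high = "black"))
```

Because each factor is tied to exactly one aesthetic, the legend and the lines cannot drift out of sync the way they can when every line is styled by hand.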

If you're already using R, then you probably know about ggplot -- it's quickly becoming the dominant graphing tool in R. If you don't use R, then consider this yet another reason to start using it. Once you learn ggplot, I guarantee that your data exploration will be faster and the graphs in your presentations and papers will be a lot more beautiful.
