Statistical Intervals and Visualizations: Difference Between Means
Geoff Cumming’s 2014 article, “The New Statistics: Why and How” (Psychological Science, Vol. 25(1), 7–29), outlines an eight-step statistical strategy for improving the reproducibility of research studies. Chief among Cumming’s points are that point estimates alone are not sufficiently rigorous or informative, and that null-hypothesis significance testing is prone to producing findings that fail to replicate. He includes several data visualizations that feature intervals such as error bars, and in a series of articles, I show you how to reproduce these visualizations. This article takes on the Difference Between Means visualization, Cumming’s “Figure 1”.
This graphing code is fairly complex. If you want to see it built up bit by bit, starting with point estimates and gradually adding more elements, that walkthrough will definitely help you with ggplot2. Go check it out at my RPubs page for this article.
If you’d rather just download a package that handles this, I wrote one. Skip ahead and find out more about that!
Feeling gutsy? Run the whole thing below!
Let’s take a look at the first of Cumming’s visualizations:
Cumming’s first figure demonstrates the statistical principle underlying confidence intervals: most of the intervals shown contain the actual population mean difference, but a few do not. While this particular plot does not apply to research data (where the actual population difference is unknown and inferring it is the whole point), the visualization has several elements that are useful to know how to build. Here’s how to create a plot like this in R, using ggplot2. We’ll start by creating two sets of fake data – one that imitates the mean differences between samples from two populations (the top graph) and one that mimics a perfect sampling distribution (the bottom graph). Once we have our fake data generated, we can plot it!
Generate Fake Data (Top Plot)
Let’s generate some fake data and draw some samples, shall we?
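Something like the following will do it. This is a sketch: the post doesn’t show the original seed or variable names, so `popSD`, `caseMean`, and friends are my own; only the mean difference of 10, the sd of 20, and the 25 samples of n = 32 come from the text.

```r
# Sketch of the data generation; variable names and seed are my assumptions
set.seed(1234)            # any seed; just makes the example reproducible

numSamples <- 25          # 25 samples each of cases and controls
n          <- 32          # each sample has n = 32 observations
popSD      <- 20          # population standard deviation
caseMean    <- 110        # assumed absolute means; only the difference of 10 matters
controlMean <- 100

# One sample per row: 25 rows x 32 columns
cases    <- matrix(rnorm(numSamples * n, caseMean,    popSD), nrow = numSamples)
controls <- matrix(rnorm(numSamples * n, controlMean, popSD), nrow = numSamples)
```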
Now we have 25 samples of n=32 from cases and 25 samples of n=32 from controls, each drawn from normally distributed populations with a mean difference of 10 between them and a standard deviation of 20.
Point and Interval Estimates (Top Plot)
How can we plot our sample point estimates (i.e. mean) and 95% CI? Well, first, let’s calculate them!
I’m assuming a t distribution for calculating our confidence intervals, even though we happen to know the population standard deviation. Knowing the population sd is so unlike typical research that I’d rather just do a t test, since that’s what we do in real life.
Still, you could do the Z if you wanted to.
I’ll start by making an estimates data frame. We have 25 samples of cases and 25 samples of controls, so I’ll do pairwise comparisons – we’ll have 25 rows in estimates.
- I’ll set “upper” and “lower” – which will be the edges of my 95% confidence interval – to just have NA values (for now)
- “meanDiff” will be the sample mean differences from corresponding rows of cases and controls (so mean of cases minus mean of controls for each row)
- “caseSSE” will be the sum of squared errors from the cases in each sample
- “controlSSE” will be the sum of squared errors from the controls
- “sd”, blank for now, will be the pooled sample standard deviation
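A sketch of that data frame, assuming `cases` and `controls` are matrices with one sample of 32 per row. The column names follow the bullets above; the rest is my guess at the implementation.

```r
# Build the estimates data frame described above (one row per sample pair)
estimates <- data.frame(
  upper      = NA_real_,   # CI upper limit, filled in later
  lower      = NA_real_,   # CI lower limit, filled in later
  meanDiff   = rowMeans(cases) - rowMeans(controls),
  caseSSE    = apply(cases,    1, function(x) sum((x - mean(x))^2)),
  controlSSE = apply(controls, 1, function(x) sum((x - mean(x))^2)),
  sd         = NA_real_    # pooled sd, filled in later
)
```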
Now that I’ve got my SSE in place, I can come up with the pooled sd quite easily:
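One way to fill it in, using the standard two-sample pooled formula with n = 32 per group (so df = n + n − 2 = 62); this assumes the `estimates` data frame sketched above:

```r
# Pooled standard deviation: sqrt of (summed SSEs) over (pooled df)
estimates$sd <- sqrt((estimates$caseSSE + estimates$controlSSE) / (2 * n - 2))
```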
Now I have to figure out the standard error for each sample.
Here I choose my significance level to determine how many standard errors to go out in either direction. I’ll also set a “problem” variable in my dataset, which will be TRUE if the upper CI limit is below the true mean difference or the lower CI limit is above it, and FALSE otherwise.
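Putting those steps together, a sketch (the names `alpha` and `trueDiff` and the exact `qt()` call are my choices, with df = 2n − 2 = 62):

```r
# Standard error of the difference between two means (pooled sd)
alpha <- 0.05
se    <- estimates$sd * sqrt(1/n + 1/n)
tCrit <- qt(1 - alpha/2, df = 2 * n - 2)   # critical t for a 95% CI

estimates$upper <- estimates$meanDiff + tCrit * se
estimates$lower <- estimates$meanDiff - tCrit * se

# TRUE when the interval misses the true mean difference of 10
trueDiff <- 10
estimates$problem <- estimates$upper < trueDiff | estimates$lower > trueDiff
```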
Significance (Top Plot)
Let’s also consider what our p-values are. There are a couple of ways to do this. Let’s use t.test() and extract the p-values from the resulting list (a bit messy, but worth it, because it handles the two-tailed test easily, unlike some other R commands!). Keep in mind that our null hypothesis is that the mean difference is 0. Since the actual population mean difference is 10 (not that we’d know this in real life), we expect to find evidence to reject our null hypothesis on at least some samples. Let’s see if that’s the case.
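One hedged way to do the extraction, pairing row i of cases with row i of controls (`var.equal = TRUE` matches the pooled-sd approach above; the original code may differ):

```r
# Two-sided p-value for each pairwise comparison; t.test() tests a null
# mean difference of 0 by default
estimates$pValue <- sapply(seq_len(numSamples), function(i) {
  t.test(cases[i, ], controls[i, ], var.equal = TRUE)$p.value
})
```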
Let’s also add significance indicators:
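For example, conventional star labels (the exact cutoffs are my assumption; this presumes the pValue column computed above):

```r
# Star labels at the usual 0.001 / 0.01 / 0.05 cutoffs
estimates$sig <- cut(estimates$pValue,
                     breaks = c(0, 0.001, 0.01, 0.05, 1),
                     labels = c("***", "**", "*", ""),
                     include.lowest = TRUE)
```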
To plot each sample on its own line, we’ll note which sample is which by adding a sampleNum column in our data frame.
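A minimal sketch, assuming estimates has one row per sample:

```r
# One row per sample, so the row index doubles as the sample number
estimates$sampleNum <- seq_len(nrow(estimates))
```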
Generate Fake Data (Bottom Plot)
Let’s plot a “perfect” sampling distribution – what we’d expect if we drew not 25, but infinitely many samples of size 32 from an infinite pair of populations, each with an sd of 20 and a difference of population means of 10.
In practice, we’ll settle for a random collection of one million. That’s pretty darn close!
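A sketch of that simulation. Rather than storing a million full samples, I draw the sample means directly, using the fact that the mean of n = 32 normal draws is itself normal with standard deviation sd/√n; this shortcut is my choice, and the original code may simulate full samples instead.

```r
# One million simulated mean differences, each corresponding to a pair of
# samples of size n = 32. The result is centered near 10 with
# sd near 20 * sqrt(2/32) = 5.
bigN <- 1e6
simDiffs <- rnorm(bigN, caseMean,    popSD / sqrt(n)) -
            rnorm(bigN, controlMean, popSD / sqrt(n))
```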
Graph it!
This is what the final visualization should look like:
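A simplified sketch of the top panel (not Cumming’s exact styling – the aesthetics and theme here are my choices), assuming the estimates data frame built above:

```r
library(ggplot2)

# Top panel: one point and 95% CI per sample, intervals that miss the
# true difference in red, dashed reference line at 10
p <- ggplot(estimates, aes(x = meanDiff, y = sampleNum, colour = problem)) +
  geom_vline(xintercept = 10, linetype = "dashed") +
  geom_errorbarh(aes(xmin = lower, xmax = upper), height = 0.4) +
  geom_point(size = 2) +
  scale_colour_manual(values = c(`FALSE` = "black", `TRUE` = "red")) +
  scale_y_reverse() +                     # sample 1 at the top
  labs(x = "Difference between means", y = "Sample") +
  theme_minimal() +
  theme(legend.position = "none")
p
```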
Using Joy’s Package to do Point + Interval Means Estimate Plotting
I know that the above code is hard to implement and remember, so I wrote a very simple R package that will minimally plot your point and interval estimates for one or more samples at a given confidence level. You can optionally color the points and/or error bars, as well as flip the y axis. Let’s try it!
First, we’ll load my package, called ciPlotter. You’ll need to have installed the devtools package.
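A hypothetical install-and-load sketch – this post doesn’t give the repository location, so the GitHub account below is a placeholder, not a real path:

```r
# install.packages("devtools")                      # if not already installed
devtools::install_github("<github-user>/ciPlotter") # <github-user> is a placeholder
library(ciPlotter)
```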
There’s only one (public) function there, plotEstimates. Let’s see what it does. First, let’s plot the sample estimates for “cases” with no other parameters specified, then use the additional parameters to tailor the plot’s appearance.
Note that in this case, we are passing plotEstimates a list that consists of numerical vectors.
If you don’t have a list of numerical vectors, but just have one sample, you can just pass in a single numerical vector, as long as it has more than one number in it:
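A hedged sketch of both calls – only the function name and the input shapes come from this post; the styling parameters are omitted because their names aren’t documented here:

```r
# Hypothetical usage; cases is assumed to be a matrix with one sample per row
caseSamples <- lapply(seq_len(nrow(cases)), function(i) cases[i, ])

plotEstimates(caseSamples)   # a list of numeric vectors, one per sample
plotEstimates(cases[1, ])    # a single numeric vector also works
```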
It’s not a full or perfect solution, but it can help you visualize your point and confidence interval estimates in a jiffy. If you like and use this package, let me know!