As part of a research project I was writing R code that samples and resamples data from a given population. I was surprised by how badly sampling error affects small samples and how easy it was to visualise that. Here, I’ve posted some pictures and R code so you can see for yourself.
Ok, sampling error means that the distribution of a sample of observations drawn from a population looks different from the population distribution (see Wikipedia). As a rule, the more independent observations you draw, the more the sample data will resemble the population data.
In psychology and decision-making research, participants often answer questions like ‘how happy are you?’ on a 1-7 scale, where 1 = very unhappy and 7 = very happy. Answers to questions like these are usually assumed to be normally distributed.
Using R, I drew samples from a distribution that looks like this:
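For reference, the population is just the values 1-7 weighted to look roughly normal; the weights below are the same ones used in the script at the end, and this short sketch plots them:

```r
# Population: answers 1-7 with roughly normal weights
answers <- 1:7
weights <- c(0.05, 0.1, 0.2, 0.3, 0.2, 0.1, 0.05)

# sanity checks: weights sum to 1, population mean is 4
pop_mean <- sum(answers * weights)

barplot(weights, names.arg = answers,
        xlab = "Values 1-7", ylab = "Probability",
        main = "Population distribution")
```

Note that the population mean works out to exactly 4, which is the benchmark used for the sample means further down.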
So, using this code (available for copy-paste at the end and for download here) you can play around with how many samples you draw from the population above and how large those samples are. Or change the distribution. Or adjust the color of the bars. Or whatever. Anyway, if you’re reading this on a phone, don’t have R installed, or are just plain lazy, here are some pictures.
I drew 10 samples of 20 observations from the population and plotted them in the pictures below. I was surprised by how much variation there is between the samples and how different they look from the original distribution. I mean, look at that fourth one.
(Note: I plotted the raw observed frequencies, not percentages)
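If you’d rather see all 10 samples side by side in a single figure, a small variation on the script does the trick (the par(mfrow) call is my addition, the original plots them one at a time):

```r
# Draw 10 samples of N = 20 and plot them in a 2x5 grid
answers <- 1:7
weights <- c(0.05, 0.1, 0.2, 0.3, 0.2, 0.1, 0.05)
N <- 20

op <- par(mfrow = c(2, 5))  # 2 rows x 5 columns of plots
for (k in 1:10) {
  sim <- sample(answers, N, replace = TRUE, prob = weights)
  barplot(table(factor(sim, levels = 1:7)),
          main = paste("Sample", k),
          ylim = c(0, N / 2))
}
par(op)  # restore the previous plotting layout
```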
Quite a bit of variation, right?
I also drew 1000 samples of N = 20 and saved the means. The population mean is 4, and this is the distribution of means across those 1000 samples. Look at how often a mean more than .5 away from 4 shows up!
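To put a number on “how often”, here’s a quick sketch that re-runs the simulation and counts the sample means landing more than .5 away from 4 (the set.seed and replicate calls are my additions, not from the original script):

```r
set.seed(1)  # for reproducibility
answers <- 1:7
weights <- c(0.05, 0.1, 0.2, 0.3, 0.2, 0.1, 0.05)

# 1000 sample means, each from a sample of N = 20
means <- replicate(1000, mean(sample(answers, 20, replace = TRUE, prob = weights)))

# proportion of sample means more than .5 away from the population mean of 4
prop <- mean(abs(means - 4) > 0.5)
prop
```

In my runs this comes out to somewhere around one sample in ten, which is a lot if your conclusions ride on a single sample.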
You may ask: why choose N = 20? Good question. N = 20 per condition is pretty common in the papers I read. However, if you’ve ever run a power calculation, you know that N = 20 per condition leaves typical decision-making / psychology survey studies underpowered.
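You can check this yourself with base R’s power.t.test. Assuming a two-sample t-test and a ‘medium’ effect of d = 0.5 (my assumption, Cohen’s convention, not a number from the post), N = 20 per condition gives power of only about a third, far below the conventional .80:

```r
# Power of a two-sample t-test with n = 20 per group,
# assuming a medium standardized effect (d = 0.5, Cohen's convention)
pw <- power.t.test(n = 20, delta = 0.5, sd = 1, sig.level = 0.05)$power
round(pw, 2)  # roughly 0.34
```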
So, to close, the following pictures show what happens when you draw samples of, say, N = 100. The sample data still do not look exactly like the population data, but they come much, much closer.
Much less variation, right?
The distribution of means in 1000 samples of N = 100 looks like this. It’s clear (look at the x-axis) that the means deviate much less from the true mean (4) than in the N = 20 samples.
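That shrinking x-axis is exactly what standard-error theory predicts: going from N = 20 to N = 100, the spread of the sample means should drop by a factor of sqrt(100/20) ≈ 2.24. A quick simulation (again with my own set.seed / replicate, not the original loop) agrees:

```r
set.seed(1)  # for reproducibility
answers <- 1:7
weights <- c(0.05, 0.1, 0.2, 0.3, 0.2, 0.1, 0.05)

means20  <- replicate(1000, mean(sample(answers, 20,  replace = TRUE, prob = weights)))
means100 <- replicate(1000, mean(sample(answers, 100, replace = TRUE, prob = weights)))

# ratio of spreads: should be close to sqrt(100/20) ~ 2.24
ratio <- sd(means20) / sd(means100)
ratio
```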
Finally, here’s the code (or dropbox link to code), have fun and let me know if you have cool adjustments / insights!
# set possible answers, here 1-7
answers <- 1:7
# make a 'normal-ish' distribution by setting weights
weights <- c(0.05, 0.1, 0.2, 0.3, 0.2, 0.1, 0.05)
# set sample size to be drawn and number of repetitions
N <- 20
repetitions <- 1000
# vector to collect the mean of each sample
means <- numeric(repetitions)
for (k in 1:repetitions) {
  # draw N observations from the distribution specified above
  sim <- sample(answers, N, replace = TRUE, prob = weights)
  # plot observed frequencies (note y-axis starts at 0 and limit is N / 2)
  barplot(table(factor(sim, levels = 1:7)), main = "N = 20",
          xlab = "Values 1-7", ylab = "Observed frequencies",
          col = "navyblue", ylim = c(0, N / 2))
  # save the mean of sample k
  means[k] <- round(mean(sim), 2)
}
# plot the distribution of the observed sample means
barplot(table(means), col = "navyblue",
        main = "Observed means in 1000 samples of N = 20")