Demo 3: Distributions, sampling and normality

Experimental and Statistical Methods in Biological Sciences


November 3, 2016

1 Data preparations and descriptive statistics


1.1 Description of the data set
Today we will use data from a sleep deprivation study. The following variables
are included:

• ID = participant ID
• age = age group (1 = young, 2 = old)
• education = education group (1 = comprehensive school, 2 = secondary, 3 = higher)
• ht_group = hormone treatment (user, control)
• digitsymbol = digit symbol test score at times 1, 2 and 3
• bentonerror = Benton test error score at times 1, 2 and 3
This study investigated the effects of sleep deprivation on two different tests: the digit symbol test (from the WAIS-III) and the Benton Visual Retention Test. Test scores were collected on three consecutive days, always in the morning: after normal sleep, after sleep deprivation, and after normal sleep again. Some participants received hormone treatment (HT, coded as ’users’) and some did not (coded as ’controls’).

1.2 Load and prepare your data


Load the data from “http://becs.aalto.fi/~heikkih3/deprivation.csv”
into a data frame deprivation.
Prepare the data as usual: check that factors are factors, look for missing values, and check how they are coded. Useful functions: summary, head, boxplot.
(Refer to Demo 1 for help.)
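
For example, the loading and checking steps could look like this (a sketch: the factor labels follow the coding in Section 1.1, but verify them, and the NA coding, against your own file):

> deprivation <- read.csv("http://becs.aalto.fi/~heikkih3/deprivation.csv")
> head(deprivation)    # do the first rows look sensible?
> summary(deprivation) # value ranges, counts of NAs
> deprivation$age <- factor(deprivation$age, labels=c("young", "old"))
> deprivation$education <- factor(deprivation$education,
    labels=c("comprehensive", "secondary", "higher"))
> deprivation$ht_group <- factor(deprivation$ht_group)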

1.3 Describe your data
Show descriptive statistics for your numeric variables in a table. Useful func-
tions: summary, describe.
Plot the test scores for the Digit symbol and Benton tasks at all time points. At this point, you can simply draw box plots separately for each time point. Remember to adjust the scales so that the scores are comparable (tip: use the named argument ylim= in your boxplot call).
Finally, attach the deprivation data using attach.
(Refer to Demo 2 for help.)
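
A possible set of commands (a sketch: the column names follow Section 1.1, and the ylim ranges are guesses you should adapt to your data):

> layout(matrix(1:2, 1, 2))
> boxplot(deprivation[, c("digitsymbol_1", "digitsymbol_2", "digitsymbol_3")],
    ylim=c(0, 100), main="Digit symbol")
> boxplot(deprivation[, c("bentonerror_1", "bentonerror_2", "bentonerror_3")],
    ylim=c(0, 15), main="Benton errors")
> attach(deprivation)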

2 Primer: distributions
Next, we will run a few simulations to demonstrate what different distributions look like. Do not worry too much about the commands we use to run the simulations; the important thing is to focus on the behaviour and characteristics of the different distributions.

2.1 Normal distribution


The great thing about the normal distribution is that we know so much about it. We can make inferences regarding the position of different values along the distribution. For instance, we know how far from the mean 95% of the observations lie. This helps us to infer which phenomena are rare and which are not. This is essential in statistics: usually our goal is to decide whether we are confident enough to accept or reject our hypothesis, and this decision is based on distributions.
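
For example, pnorm() and qnorm() confirm the familiar 95% rule for the standard normal distribution:

> pnorm(1.96) - pnorm(-1.96) # probability of lying within 1.96 SDs of the mean
[1] 0.9500042
> qnorm(c(0.025, 0.975))     # quantiles cutting off 2.5% in each tail
[1] -1.959964  1.959964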
Many things in nature are normally distributed, including for instance height and psychological constructs such as IQ (which by definition is normally distributed). However, the normal distribution is also useful in cases where the actual values do not follow a normal distribution. This is due to the central limit theorem, which states that if you take repeated samples from a population and calculate their averages, these averages will be normally distributed.
To demonstrate this, simulate five uniformly distributed random numbers between 0 and 10 and work out their average. Let’s do this 10 000 times and look at the distribution of the 10 000 means. The distribution of the raw data is flat-topped:

> hist(runif(10000)*10, main="")

What about the distribution of sample means, based on taking just five
uniformly distributed random numbers?

> means <- numeric(10000)
> for (i in 1:10000) {
    means[i] <- mean(runif(5)*10)
  }
> hist(means, ylim=c(0,1600))

Let’s add a normal curve:

> h <- hist(means, freq=F, ylim=c(0,0.35))
> x <- seq(min(h$breaks), max(h$breaks), by=0.1)
> y <- dnorm(x, mean=mean(means), sd=sd(means))
> lines(x, y, col="red", lty="dashed", lwd=3)

The fit is excellent - the central limit theorem works.

2.2 Student’s t-distribution


Student’s t-distribution is used instead of the normal distribution when sample sizes are small (n < 30). Compared to the normal distribution, Student’s t-distribution has fatter tails. The t-distribution depends on one parameter, the degrees of freedom, which is determined by the sample size.
For sample sizes close to 30 and above, Student’s t-distribution is close to the normal distribution. Let’s demonstrate this.

# create a vector
> simu <- seq(-4,4,0.01)
# draw a normal distribution:
> plot(simu, dnorm(simu), type="l", lty=2,
    ylab="Probability density", xlab="Deviates")
# add different t-distributions:
> lines(simu, dt(simu, df=5), col="red")    # df = 5 (small sample)
> lines(simu, dt(simu, df=10), col="green") # df = 10
# ETC. Play around by increasing the degrees of freedom.
# What happens when you get close to or beyond 30?
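
You can also compare quantiles directly: the 97.5% cutoff of the t-distribution approaches the normal value of about 1.96 as the degrees of freedom grow:

> qt(0.975, df=c(5, 10, 30, 100))
[1] 2.570582 2.228139 2.042272 1.983972
> qnorm(0.975)
[1] 1.959964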

In Section 4, we will see how the t-distribution relates to statistical testing of hypotheses.

2.3 F-distribution
Another important distribution is the F-distribution. It is especially relevant for ANOVA, since the ratio of two variance estimates follows an F-distribution. The F-distribution depends on two parameters, namely the degrees of freedom (DF) of the numerator and the denominator of the ratio. We can demonstrate the behaviour of the F-distribution:

# create a vector
> simu2 <- seq(0,4,0.01)
# draw an F-distribution
> plot(simu2, df(simu2,1,1), type="l", lty=2,
ylab="Probability density", xlab="Deviates")
# try changing degrees of freedom:
> lines(simu2, df(simu2, 1,10), col="red")    # DF1 = 1, DF2 = 10
> lines(simu2, df(simu2, 10,10), col="green") # DF1 = 10, DF2 = 10
# ETC. Play around by increasing both degrees of freedom separately.
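
The corresponding quantile function exists for the F-distribution as well; for example, qf() gives the critical value that only 5% of F-ratios exceed under the null hypothesis:

> qf(0.95, df1=2, df2=27) # about 3.35 for DF1 = 2, DF2 = 27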
In Section 5, we will see how the F-distribution relates to statistical testing of hypotheses.

Question 1 Generate 100 samples that follow the normal distribution. Use
rnorm() for this, which by default generates numbers with zero mean and unit
standard deviation. Estimate the mean and the standard deviation. What do
you observe? Why?
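
A minimal sketch (x is just a throwaway name):

x <- rnorm(100)
mean(x) # close to, but not exactly, 0
sd(x)   # close to, but not exactly, 1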

Question 2 Repeat the same but for a normal distribution with mean equal to 3 and standard deviation equal to 2. Check ?rnorm if you are unsure how to do this. Plot a histogram of the data and put the mean and the standard deviation in the title of the figure. You can do it using:

hist(data,main=paste(mean(data)," ","( sd:",sd(data),")"))

The command paste() concatenates strings so that they appear on the same line. What do you see? Do we really need all these decimal numbers?

Question 3 The function round() can help us solve the problem that we
faced in Question 2. For example, running round(data,2) will keep only two
decimals. Repeat the histogram by keeping only two decimals for the mean and
the standard deviation.
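
Putting Questions 2 and 3 together, one possible solution (data is a placeholder name) is:

data <- rnorm(100, mean=3, sd=2)
hist(data, main=paste("mean:", round(mean(data),2),
  "( sd:", round(sd(data),2), ")"))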

Question 4 We saw that the estimated mean differs from the theoretical mean when the sample size is small. What happens when different sample sizes are tested? Estimate the mean for samples of 1, 10, 100, and 1000 normally distributed numbers. Repeat this 1000 times and plot the histograms in a 2x2 plot.
Remember: to make a 2x2 plot layout use:
layout(matrix(c(1:4),2,2)) # Make a 2x2 plot matrix
Then to make one histogram try this:
nsamples=1 # 1 sample
ntimes=1000 # Repeated 1000 times
# Make the matrix that contains the samples
datamat<-matrix(rnorm(nsamples*ntimes),nsamples,ntimes)
# Estimate the mean over the columns
means<-apply(datamat,2,mean)
# Make the histogram with xlim so that all panels share the same limits
hist(means,xlim=c(-3,3))
# Repeat that for more samples
Repeat the process to make all four histograms. Add an informative title.
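
One way to produce all four panels without copy-pasting is to loop over the sample sizes (a sketch; the shared xlim keeps the panels comparable):

layout(matrix(c(1:4),2,2)) # 2x2 plot matrix
for (nsamples in c(1, 10, 100, 1000)) {
  datamat <- matrix(rnorm(nsamples*1000), nsamples, 1000)
  means <- apply(datamat, 2, mean)
  hist(means, xlim=c(-3,3), main=paste("nsamples =", nsamples))
}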

Question 5 What happens when the sample size increases?

Question 6 If you throw a six-sided die, you get a random number from 1 to 6 with equal probability (i.e. uniformly distributed). What does the distribution of the sum of 2, 10 and 50 dice look like? Test this by repeating the experiment 10 000 times, similarly to Question 4. You can use the following example:
nsamples=5 # Number of dice (samples)
ntimes=1000 # Number of repetitions
# Generate continuous numbers from a uniform distribution between 1 and 7 ...
samples<-runif(nsamples*ntimes,1,7);
# ... and round them down to the nearest integer (1-6 with equal probability)
samples2<-floor(samples)
# Make the data matrix
datamat<-matrix(samples2,nsamples,ntimes)
# Sum over the columns (one throw of all the dice per column)
sums<-apply(datamat,2,sum)
# Plot the histogram with bin limits adapted to integers
hist(sums,breaks=seq(min(sums)-0.5,max(sums)+0.5))
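
The same pattern covers all three dice counts in one go (a sketch, using the 10 000 repetitions the question asks for):

layout(matrix(c(1:3),1,3)) # one panel per dice count
for (ndice in c(2, 10, 50)) {
  datamat <- matrix(floor(runif(ndice*10000, 1, 7)), ndice, 10000)
  sums <- apply(datamat, 2, sum)
  hist(sums, breaks=seq(min(sums)-0.5, max(sums)+0.5),
    main=paste("Sum of", ndice, "dice"))
}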

Question 7 Fit a normal distribution to the histogram of Question 6 with sample size equal to 50. If it looks like a normal distribution, then the central limit theorem works: sums of numbers from a uniform distribution end up forming a normal distribution! Remember how to fit a normal distribution:
h <- hist(sums,breaks=seq(min(sums)-0.5,max(sums)+0.5),freq=F)
x <- seq(min(h$breaks),max(h$breaks), by=0.1)
y <- dnorm(x, mean=mean(sums), sd=sd(sums))
lines(x,y,col="red", lty="dashed", lwd=3)

3 Tests for normality


There are several ways to test whether your variables are normally distributed.
First, you can simply use descriptive statistics to get a rough idea of normality. Let’s look at our numeric variables (the Digit symbol and Benton task scores) with describe:

> describe(deprivation)
# mean, st dev, median, skewness
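
Note that describe() is not part of base R; it comes from an add-on package (presumably psych in these demos), so load it first:

> library(psych) # install.packages("psych") first if it is missing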
Pay attention to the mean and median: are they close to each other? What about the standard deviation: are the values widely scattered around the mean? How high is the skewness? Values close to 1 hint at non-normality (exact limits depend on sample size).
Second, while descriptive statistics can give you hints, examining your variables visually is a much more powerful approach. Remember histograms:

# Plot all numeric variables in same layout matrix:
> layout(matrix(c(1,2,3,4,5,6), 3, 2))
> hist(digitsymbol_1, main="Digit symbol 1")
> hist(digitsymbol_2, main="Digit symbol 2")
> hist(digitsymbol_3, main="Digit symbol 3")
> hist(bentonerror_1, main="Benton error 1")
> hist(bentonerror_2, main="Benton error 2")
> hist(bentonerror_3, main="Benton error 3")

Also, you can take “quantile-quantile plots” of your variables. If the points in the qq-plot follow the line, the distribution is approximately normal.

> qqnorm(digitsymbol_1)
> qqline(digitsymbol_1)

Third, you can also test for normality explicitly:

# Either with the Shapiro-Wilk test:
> shapiro.test(digitsymbol_1)

# Or with the Kolmogorov-Smirnov test:
> ks.test(digitsymbol_1, "pnorm", mean=mean(digitsymbol_1), sd=sd(digitsymbol_1))
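
In both tests the null hypothesis is that the data come from a normal distribution, so a small p-value (for example, below 0.05) is evidence against normality.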

3.1 Exercises
Test for normality in all our numeric variables (Digit symbol task at time points
1, 2, and 3; Benton task at time points 1, 2, and 3):

Question 8 Take quantile-quantile plots of each numeric variable. Present them in a 2x3 matrix layout.
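
A sketch of one possible solution (assuming the deprivation data are still attached):

layout(matrix(c(1:6),2,3))
for (v in list(digitsymbol_1, digitsymbol_2, digitsymbol_3,
    bentonerror_1, bentonerror_2, bentonerror_3)) {
  qqnorm(v)
  qqline(v)
}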

Question 9 Perform a normality test using the Shapiro-Wilk test for each numeric variable. Report the W values returned by the test for each case.

Question 10 Perform a normality test using the Kolmogorov-Smirnov test for each numeric variable. Report the D values returned by the test for each case.

Question 11 Compare the results between the two different normality tests.

Question 12 A rough idea of differences between the distributions can also be obtained by overlaying histograms. R allows overlaying histograms with transparency. Show the results for the Benton test on days 2 and 3 using the following example:

# Data1 and Data2 stand for the two variables to compare (here, Benton days 2 and 3)
bins=seq(0,15,1)
p1 <- hist(Data1,breaks=bins,plot=FALSE)
p2 <- hist(Data2,breaks=bins,plot=FALSE)
plot( p1, col=rgb(0,0,1,1/4)) # first histogram
plot( p2, col=rgb(1,0,0,1/4), add=T) # second histogram on top

Question 13 Smoother histograms are also available using the density() function. Make the smoothed histograms of the Benton test for days 2 and 3 using the following example:

densData1 <- density(Data1)
densData2 <- density(Data2)
## calculate the range of the graph
xlim <- range(densData1$x,densData2$x)
ylim <- range(0,densData1$y, densData2$y)
# pick the colours
Col1 <- rgb(1,0,0,0.2)
Col2 <- rgb(0,0,1,0.2)
## make the plot
plot(densData1, xlim = xlim, ylim = ylim, main="")
## draw the densities as filled polygons
polygon(densData1, density = -1, col = Col1)
polygon(densData2, density = -1, col = Col2)
## add a legend in the corner
legend('topright', c('Benton Day 2','Benton Day 3'),
  fill = c(Col1, Col2), bty = 'n', border = NA)
