
Introduction to R:
Data Manipulation and Statistical Analysis

Descriptive Statistics

Leilani A. Nora
Assistant Scientist

DATA FRAME : data.serial

• Consider a serialized data set with 3 Sites, 3 Treatments, 4
reps and variable Y

Site Trt Rep Y     Site Trt Rep Y     Site Trt Rep Y
A    1   1   3     B    1   1   3     C    1   1   8
A    1   2   6     B    1   2   6     C    1   2   NA
A    1   3   8     B    1   3   5     C    1   3   8
A    1   4   5     B    1   4   NA    C    1   4   6
A    2   1   4     B    2   1   7     C    2   1   5
A    2   2   4     B    2   2   0     C    2   2   4
A    2   3   6     B    2   3   8     C    2   3   4
A    2   4   9     B    2   4   2     C    2   4   7
A    3   1   7     B    3   1   5
A    3   2   4     B    3   2   7
A    3   3   2     B    3   3   4
A    3   4   4     B    3   4   4
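The slides do not show how data.serial was created, so the following is only a sketch of one way to enter the table above by hand. Storing Site as a factor and Trt and Rep as numbers is an assumption, chosen so that summary() reports them the same way as in the later slides.

```r
# Rebuild the 32-row table: Sites A and B have 3 treatments, Site C has 2
data.serial <- data.frame(
  Site = factor(rep(c("A", "B", "C"), times = c(12, 12, 8))),
  Trt  = c(rep(1:3, each = 4), rep(1:3, each = 4), rep(1:2, each = 4)),
  Rep  = rep(1:4, times = 8),
  Y    = c(3, 6, 8, 5,   4, 4, 6, 9,   7, 4, 2, 4,   # Site A
           3, 6, 5, NA,  7, 0, 8, 2,   5, 7, 4, 4,   # Site B
           8, NA, 8, 6,  5, 4, 4, 7)                 # Site C
)

str(data.serial)                            # column types and first values
table(data.serial$Site, data.serial$Trt)    # reps per Site x Trt cell
```

The table() call is a quick check that every filled cell has 4 reps and that Site C has no Treatment 3.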

SUMMARY STATISTICS

• R contains all the basic tools for calculating summary
statistics.
• mean(), median(), sum(), var(), min(), max(), range() are all
self-explanatory
• cor(), cov() calculate correlations and covariances
• mad() calculates the median absolute deviation
• quantile() computes various quantiles of data
• summary() is discussed on the next slide

SUMMARY STATISTICS : summary()

• Used to obtain descriptive statistics of a data frame
or a specific variable.
• Ex1. To obtain summary statistics for the variable Y
> summary(data.serial$Y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  0.000   4.000   5.000   5.167   7.000   9.000   2.000
• The output gives the quartiles, min, max, median, mean and
the count of NA's.
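As a small illustration of the basic functions listed above (mean(), median(), range(), mad(), quantile()), the sketch below applies them to Y. The vector is retyped here so the block runs on its own, and the names y.mean, y.median and so on are only illustrative. Because Y has missing values, most of these functions return NA unless na.rm=TRUE is given.

```r
# Y as in data.serial (Site A, then B, then C)
Y <- c(3, 6, 8, 5, 4, 4, 6, 9, 7, 4, 2, 4,
       3, 6, 5, NA, 7, 0, 8, 2, 5, 7, 4, 4,
       8, NA, 8, 6, 5, 4, 4, 7)

y.raw    <- mean(Y)                  # NA, because Y contains missing values
y.mean   <- mean(Y, na.rm = TRUE)    # mean of the 30 observed values
y.median <- median(Y, na.rm = TRUE)
y.range  <- range(Y, na.rm = TRUE)   # smallest and largest value
y.mad    <- mad(Y, na.rm = TRUE)     # median absolute deviation (scaled by 1.4826 by default)
y.quart  <- quantile(Y, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)

print(list(mean = y.mean, median = y.median, range = y.range,
           mad = y.mad, quartiles = y.quart))
```

The mean, median and quartiles printed here should agree with the summary() output for Y.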
SUMMARY STATISTICS : summary()

• Ex2. To obtain summary statistics for all the columns of a
data frame
> summary(data.serial)
 Site        Trt             Rep              Y
 A:12   Min.   :1.000   Min.   :1.00   Min.   :0.000
 B:12   1st Qu.:1.000   1st Qu.:1.75   1st Qu.:4.000
 C: 8   Median :2.000   Median :2.50   Median :5.000
        Mean   :1.875   Mean   :2.50   Mean   :5.167
        3rd Qu.:2.250   3rd Qu.:3.25   3rd Qu.:7.000
        Max.   :3.000   Max.   :4.00   Max.   :9.000
                                       NA's   :2.000

SUMMARY STATISTICS : length()

• Used to obtain the number of data points of a variable,
say Y
> length(data.serial$Y)
[1] 32
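Note that length() counts every element, missing values included. A minimal sketch of how the observed and missing values could be counted separately (the names n.total, n.obs and n.miss are illustrative, and Y is retyped so the block runs on its own):

```r
# Y as in data.serial (Site A, then B, then C)
Y <- c(3, 6, 8, 5, 4, 4, 6, 9, 7, 4, 2, 4,
       3, 6, 5, NA, 7, 0, 8, 2, 5, 7, 4, 4,
       8, NA, 8, 6, 5, 4, 4, 7)

n.total <- length(Y)         # all elements, NA included
n.obs   <- sum(!is.na(Y))    # observed (non-missing) values only
n.miss  <- sum(is.na(Y))     # number of missing values

c(total = n.total, observed = n.obs, missing = n.miss)
```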

SUMMARY STATISTICS : var() and sd()

• var() is used to obtain the variance of Y
> Y.VAR <- var(data.serial$Y, na.rm=TRUE)
> Y.VAR
[1] 4.488506
• sd() is used to obtain the standard deviation of Y
> Y.STD <- sd(data.serial$Y, na.rm=TRUE)
> Y.STD
[1] 2.118609

SUMMARY STATISTICS : tapply()

• tapply() applies a function to a variable separately for each
(non-empty) group
> tapply(X, INDEX, FUN)
X – an object, typically a vector
INDEX – list of factors, each of the same length as X
FUN – function to be applied
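For var() and sd(), it may help to see what the two functions compute. The sketch below reproduces the sample variance by hand, as the sum of squared deviations divided by n - 1, and takes its square root for the standard deviation. The names var.by.hand and sd.by.hand are illustrative, and Y is retyped so the block runs on its own.

```r
# Y as in data.serial (Site A, then B, then C)
Y <- c(3, 6, 8, 5, 4, 4, 6, 9, 7, 4, 2, 4,
       3, 6, 5, NA, 7, 0, 8, 2, 5, 7, 4, 4,
       8, NA, 8, 6, 5, 4, 4, 7)

y <- Y[!is.na(Y)]                 # what na.rm=TRUE does: drop the NAs first
n <- length(y)

var.by.hand <- sum((y - mean(y))^2) / (n - 1)   # sample variance
sd.by.hand  <- sqrt(var.by.hand)                # standard deviation

Y.VAR <- var(Y, na.rm = TRUE)
Y.STD <- sd(Y, na.rm = TRUE)

all.equal(var.by.hand, Y.VAR)     # TRUE
all.equal(sd.by.hand, Y.STD)      # TRUE
```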
SUMMARY STATISTICS : tapply()

• Ex1. To obtain separate summary statistics of Y for each Site
> tapply(data.serial$Y, data.serial$Site,
summary)
$A
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  2.000   4.000   4.500   5.167   6.250   9.000

$B
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  0.000   3.500   5.000   4.636   6.500   8.000   1.000

$C
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
    4.0     4.5     6.0     6.0     7.5     8.0     1.0

SUMMARY STATISTICS : tapply()

• Ex2. To obtain a separate standard deviation of Y for
each Site
> tapply(data.serial$Y, data.serial$Site,
sd, na.rm=TRUE)
       A        B        C
2.081666 2.377929 1.732051
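Extra arguments given to tapply() after FUN are passed on to FUN, which is how na.rm=TRUE reaches sd(). The sketch below compares the per-Site standard deviation with and without it. The vectors are retyped so the block runs on its own, and the names sd.raw and sd.clean are illustrative.

```r
# Y and Site as in data.serial
Y <- c(3, 6, 8, 5, 4, 4, 6, 9, 7, 4, 2, 4,
       3, 6, 5, NA, 7, 0, 8, 2, 5, 7, 4, 4,
       8, NA, 8, 6, 5, 4, 4, 7)
Site <- factor(rep(c("A", "B", "C"), times = c(12, 12, 8)))

sd.raw   <- tapply(Y, Site, sd)                 # B and C are NA: each has a missing Y
sd.clean <- tapply(Y, Site, sd, na.rm = TRUE)   # na.rm is handed on to sd()

sd.raw
sd.clean
```

Only Site A has a value in sd.raw, because it is the only Site without a missing Y.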

SUMMARY STATISTICS : tapply()

• Ex3. To obtain a separate mean of Y for each Site x Trt
> tapply(data.serial$Y,
list(data.serial$Site,
data.serial$Trt), mean, na.rm=TRUE)
         1    2    3
A 5.500000 5.75 4.25
B 4.666667 4.25 5.00
C 7.333333 5.00   NA

SUMMARY STATISTICS : doBy Package

• The doBy package is used to calculate groupwise
summary statistics in a simple way, much in the
spirit of PROC SUMMARY of the SAS system.

summaryBy()
• Used for calculating quantities like the “mean and
variance” of a variable, for each combination of two or
more factors.
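In the Site x Trt table of means, an NA can come from an empty cell as well as from missing data. A sketch of how the number of observed values per cell could be tabulated next to the means (the data frame is rebuilt so the block runs on its own; cell.mean and cell.n are illustrative names):

```r
data.serial <- data.frame(
  Site = factor(rep(c("A", "B", "C"), times = c(12, 12, 8))),
  Trt  = c(rep(1:3, each = 4), rep(1:3, each = 4), rep(1:2, each = 4)),
  Rep  = rep(1:4, times = 8),
  Y    = c(3, 6, 8, 5, 4, 4, 6, 9, 7, 4, 2, 4,
           3, 6, 5, NA, 7, 0, 8, 2, 5, 7, 4, 4,
           8, NA, 8, 6, 5, 4, 4, 7)
)

groups <- list(data.serial$Site, data.serial$Trt)

# Mean of Y per Site x Trt cell; missing Y values are dropped within each cell
cell.mean <- tapply(data.serial$Y, groups, mean, na.rm = TRUE)

# Number of observed (non-missing) Y values per cell
cell.n <- tapply(!is.na(data.serial$Y), groups, sum)

cell.mean
cell.n
```

The C x 3 entry is NA in both tables because Site C has no Treatment 3, while the B x 1 and C x 1 cells are based on 3 values each.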
SUMMARY STATISTICS : summaryBy()

• Usage
> summaryBy(formula, data, FUN=mean,
keep.names=FALSE, order=TRUE, na.rm=TRUE, ...)
# formula – a formula object, say Y~Site
# data – a data frame
# FUN – a list of functions to be applied.
# keep.names – logical, if TRUE and if there is only ONE
function in FUN, then the variables in the output will have
the same name as the variables in the input.
# order – logical, if TRUE the resulting data frame is
ordered according to the variables on the right hand side
of the formula.

SUMMARY STATISTICS : summaryBy()

• Ex1. To obtain Site x Trt summary of means for Y
> library(doBy)
> summaryBy(Y~Site+Trt, data=data.serial,
na.rm=TRUE)
  Site Trt   Y.mean
1    A   1 5.500000
2    A   2 5.750000
3    A   3 4.250000
4    B   1 4.666667
5    B   2 4.250000
6    B   3 5.000000
7    C   1 7.333333
8    C   2 5.000000
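If the doBy package is not installed, a comparable Site x Trt table of means can be sketched with aggregate() from base R. This is a substitute technique and not part of doBy: the result column is named Y rather than Y.mean, and the formula interface drops rows with a missing Y by default. The data frame is rebuilt so the block runs on its own, and cell.means is an illustrative name.

```r
data.serial <- data.frame(
  Site = factor(rep(c("A", "B", "C"), times = c(12, 12, 8))),
  Trt  = c(rep(1:3, each = 4), rep(1:3, each = 4), rep(1:2, each = 4)),
  Rep  = rep(1:4, times = 8),
  Y    = c(3, 6, 8, 5, 4, 4, 6, 9, 7, 4, 2, 4,
           3, 6, 5, NA, 7, 0, 8, 2, 5, 7, 4, 4,
           8, NA, 8, 6, 5, 4, 4, 7)
)

# Mean of Y for each Site x Trt combination that occurs in the data
cell.means <- aggregate(Y ~ Site + Trt, data = data.serial, FUN = mean)

# Sort by Site, then Trt, to follow the order of the summaryBy() output
cell.means <- cell.means[order(cell.means$Site, cell.means$Trt), ]
print(cell.means)
```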

SUMMARY STATISTICS : summaryBy()

• Ex2. To obtain Site x Trt summary of minimum, mean,
maximum, variance and standard deviation of Y using
predefined functions.
> summaryBy(Y~Site+Trt, data=data.serial,
FUN=c(min, mean, max, var, sd), na.rm=TRUE)
  Site Trt Y.min   Y.mean Y.max     Y.var     Y.sd
1    A   1     3 5.500000     8  4.333333 2.081666
2    A   2     4 5.750000     9  5.583333 2.362908
3    A   3     2 4.250000     7  4.250000 2.061553
4    B   1     3 4.666667     6  2.333333 1.527525
5    B   2     0 4.250000     8 14.916667 3.862210
6    B   3     4 5.000000     7  2.000000 1.414214
7    C   1     6 7.333333     8  1.333333 1.154701
8    C   2     4 5.000000     7  2.000000 1.414214

HISTOGRAM
DENSITY PLOT

> hist(data.serial$Y, main='Histogram of Y',
col='yellow2',
border='tomato1',
freq=FALSE, xlab="Y Class",
ylab="Probability", xlim=c(0, 20))
# freq – logical, if FALSE probability densities are plotted
so that the histogram has a total area of one.

DENSITY PLOT: seq()

• seq(from, to, length) generates a regular sequence, here from
0 to 20 with a length of 100.
> x <- seq(from=0, to=20, length=100)
> x
 [1]  0.0000000  0.2020202  0.4040404  0.6060606  0.8080808  1.0101010
 [7]  1.2121212  1.4141414  1.6161616  1.8181818  2.0202020  2.2222222
. . .
[97] 19.3939394 19.5959596 19.7979798 20.0000000
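The claim that freq=FALSE gives a histogram with a total area of one can be checked without drawing anything. The sketch below asks hist() for its bins only (plot=FALSE), multiplies each bar height by its width, and also confirms the spacing of the seq() grid. Y is retyped so the block runs on its own, and total.area and step are illustrative names.

```r
# Y as in data.serial (Site A, then B, then C)
Y <- c(3, 6, 8, 5, 4, 4, 6, 9, 7, 4, 2, 4,
       3, 6, 5, NA, 7, 0, 8, 2, 5, 7, 4, 4,
       8, NA, 8, 6, 5, 4, 4, 7)

h <- hist(Y, plot = FALSE)        # compute the bins without drawing; NAs are ignored

bin.width  <- diff(h$breaks)                  # width of each bar
total.area <- sum(h$density * bin.width)      # bar height x width, summed over bars
total.area                                    # 1

# Regular grid of 100 points from 0 to 20; 'length' is matched to 'length.out'
x <- seq(from = 0, to = 20, length.out = 100)
step <- diff(x)[1]                            # spacing between points, 20/99
step
```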

dnorm(x, mean, sd)

• dnorm() is used to obtain the probability density at x, given the
values of mean and sd.
> y <- dnorm(x,
mean(data.serial$Y, na.rm=TRUE),
sd(data.serial$Y, na.rm=TRUE))
> y

DENSITY PLOT : lines()

> lines(x, y)
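dnorm() returns the height of the normal curve at x, which is a density and not a probability. A minimal sketch of the difference, using the mean and standard deviation of Y (Y is retyped so the block runs on its own; peak, area and p.4.to.6 are illustrative names, and the interval from 4 to 6 is chosen only as an example):

```r
# Y as in data.serial (Site A, then B, then C)
Y <- c(3, 6, 8, 5, 4, 4, 6, 9, 7, 4, 2, 4,
       3, 6, 5, NA, 7, 0, 8, 2, 5, 7, 4, 4,
       8, NA, 8, 6, 5, 4, 4, 7)

m <- mean(Y, na.rm = TRUE)
s <- sd(Y, na.rm = TRUE)

# Height of the fitted normal curve at its centre: 1 / (s * sqrt(2 * pi))
peak <- dnorm(m, mean = m, sd = s)

# The area under the whole curve is 1
area <- integrate(dnorm, lower = -Inf, upper = Inf, mean = m, sd = s)$value

# A probability needs an interval; pnorm() gives the area to the left of a point
p.4.to.6 <- pnorm(6, mean = m, sd = s) - pnorm(4, mean = m, sd = s)

c(peak = peak, area = area, p.4.to.6 = p.4.to.6)
```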
HISTOGRAM WITH DENSITY PLOT: mtext()

• mtext(text, side=3, ...) displays text on top of the plot
# text – a character expression specifying the text to be
written
# side – on which side of the plot you want to display the
text
1 – bottom 2 – left
3 – top 4 – right
> mtext("Fitting to a normal distribution")

CASE1. HISTOGRAM WITH DENSITY PLOT

> hist(RF$RLD0, main='Histogram of RLD0',
col='plum4', border='black', br=5,
xlab="RLD0 Class",
ylab="Probability",
freq=FALSE,
xlim=c(0, 20))
> x <- seq(from=0, to=20, length=100)
> x
> y <- dnorm(x,
mean(data.serial$Y, na.rm=TRUE),
sd(data.serial$Y, na.rm=TRUE))
> lines(x, y)
> mtext("Fitting to a normal distribution")
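The steps hist(), seq(), dnorm(), lines() and mtext() can be put together in one script. The sketch below does this for Y only and writes the plot to a temporary PDF file so that it also runs without a screen; the file name plot.file and the text placed on side 4 are assumptions made for illustration. Drop the pdf() and dev.off() lines to draw on screen instead.

```r
# Y as in data.serial (Site A, then B, then C)
Y <- c(3, 6, 8, 5, 4, 4, 6, 9, 7, 4, 2, 4,
       3, 6, 5, NA, 7, 0, 8, 2, 5, 7, 4, 4,
       8, NA, 8, 6, 5, 4, 4, 7)

plot.file <- tempfile(fileext = ".pdf")   # hypothetical output file
pdf(plot.file)

# Histogram on the density scale (total area of one)
hist(Y, main = "Histogram of Y", col = "yellow2", border = "tomato1",
     freq = FALSE, xlab = "Y Class", ylab = "Probability", xlim = c(0, 20))

# Normal curve with the mean and sd of Y, drawn over the histogram
x <- seq(from = 0, to = 20, length.out = 100)
y <- dnorm(x, mean(Y, na.rm = TRUE), sd(Y, na.rm = TRUE))
lines(x, y)

mtext("Fitting to a normal distribution", side = 3)   # top margin
mtext("30 observed values", side = 4)                 # right margin

dev.off()
```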

HISTOGRAM WITH DENSITY PLOT:
lines(), dnorm(), and mtext()

[Figure: "Histogram of Y with Density plot", with the margin text "Fitting to a normal distribution"; x-axis: Y class, y-axis: Probability]

BOXPLOT
BOXPLOT : boxplot()

• Ex1. To obtain a boxplot of Y with other graphics parameters
> boxplot(data.serial$Y,
boxwex=0.35,
main="Boxplot of Y",
xlab="Y",
horizontal=TRUE)
# boxwex = controls the width
of the boxplot
# horizontal = logical, if
TRUE, the boxplot is plotted
horizontally

[Figure: horizontal boxplot titled "Boxplot of Y"]

> boxplot(split(data.serial$Y,
data.serial$Site))
> boxplot(Y~Site,
data=data.serial)

[Figure: boxplots of Y for Sites A, B and C]
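boxplot() also returns the numbers it draws. As a sketch, calling it with plot=FALSE gives the five values behind each box (lower whisker, lower hinge, median, upper hinge, upper whisker) and the number of observed values per Site. The vectors are retyped so the block runs on its own, and b is an illustrative name.

```r
# Y and Site as in data.serial
Y <- c(3, 6, 8, 5, 4, 4, 6, 9, 7, 4, 2, 4,
       3, 6, 5, NA, 7, 0, 8, 2, 5, 7, 4, 4,
       8, NA, 8, 6, 5, 4, 4, 7)
Site <- factor(rep(c("A", "B", "C"), times = c(12, 12, 8)))

# Compute the box statistics per Site without drawing the plot
b <- boxplot(Y ~ Site, plot = FALSE)

b$names    # group labels, one per box
b$n        # observed (non-missing) values per group
b$stats    # one column per Site; rows are whisker, hinge, median, hinge, whisker
```

The hinges are computed as in fivenum(), so they can differ slightly from the quartiles reported by summary() and quantile().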

THANK YOU! ☺
Please do Exercise C
