Measures of Centre
The most common measures of centre in statistics are the mean, the median and the mode.
The (sample) mean or (sample) average (denoted as $\bar{X}$, read "X-bar") can be calculated by dividing the sum
of all observed values (or responses, in a more general term) by the sample size $n$:
$$ \bar{X} = \frac{1}{n} \sum X $$
The (sample) median is defined as the middle-most value of an ordered list (of the observed values).
An ordered list is simply a list of all observed values arranged in ascending order (from smallest to
largest).
When n is large, we can first find the location of the median ( i ) with
$$ i = \frac{n + 1}{2} $$
and then read off the value at the i-th location of the ordered list.
The (sample) mode is defined as the observed value(s) (or responses) that occur most often. Technically,
it is not discussed in the same context as the mean and median.
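The three measures of centre above can be sketched in a few lines of Python. This is only an illustration: the sample values are invented, and the location rule i = (n + 1) / 2 is applied here to an odd sample size, where i is a whole number.

```python
from statistics import mean, median, multimode

# Hypothetical sample, invented purely for illustration.
values = [3, 7, 7, 2, 9, 7, 4]

n = len(values)
sample_mean = sum(values) / n           # sum of observed values divided by n

ordered = sorted(values)                # ordered list, smallest to largest
location = (n + 1) // 2                 # median location i = (n + 1) / 2 (n is odd here)
sample_median = ordered[location - 1]   # Python lists are 0-indexed, hence the -1

modes = multimode(values)               # all values tied for "most often"

print(sample_mean, sample_median, modes)
```

The standard library functions `mean` and `median` give the same answers as the hand calculation and can be used as a cross-check.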
Measures of Location
The simplest measures of location are the minimum and the maximum.
To find them, we just need to look for the smallest value and the largest value of the observed
values.
The percentiles are used to divide all observed values into 100 equal parts.
There are only 99 percentiles (1st percentile, 2nd percentile, ..., 99th percentile) because the first and
the last ones are called the minimum and maximum respectively.
The quartiles are used to divide all observed values into 4 equal parts.
How many quartiles should we have? If your answer is three, bingo!
However, for practicality, the second quartile is equivalent to the median (as is the 50th percentile).
Therefore, we only have two quartiles: the first quartile ( Q1 ) and the third quartile ( Q3 ).
To find the quartiles, all observed values are arranged in ascending order (i.e. ordered list).
The first quartile is the median of all the observed values strictly to the left of the overall median.
The third quartile is the median of all the observed values strictly to the right of the overall median.
Graphically, their use can be illustrated here.
[Figure: the ordered list, from smallest to largest, divided by the first quartile ( Q1 ), the median and
the third quartile ( Q3 ) into four parts, each containing 25% of the data.]
Note: Different textbooks and statistical software applications use different formulas, so their values
may differ slightly. The point here is not to be too concerned about the exact value (because it depends
on how things are defined), but to focus on the bigger picture: understanding the concept and how to
apply it when needed.
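The quartile rule described above (median of the values strictly to the left and strictly to the right of the overall median) can be sketched as follows. The function name `quartiles` is made up for this sketch, and for an even sample size it assumes the lower and upper halves are simply the two halves of the ordered list.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves rule."""
    ordered = sorted(values)            # ordered list, smallest to largest
    n = len(ordered)
    half = n // 2                       # number of values on each side of the median
    lower = ordered[:half]              # values to the left of the overall median
    upper = ordered[n - half:]          # values to the right of the overall median
    return median(lower), median(ordered), median(upper)

# Hypothetical sample, invented purely for illustration.
print(quartiles([1, 3, 5, 7, 9, 11, 13]))
```

Other tools, for example `statistics.quantiles` in the Python standard library, use interpolation formulas and may return slightly different quartiles for the same data.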
Measures of Spread
The range, in statistics, is defined as the difference between the maximum and the minimum.
The interquartile range (or IQR) is defined as the difference between the third quartile and the first
quartile, i.e. IQR = Q3 - Q1.
Loosely speaking, the standard deviation (denoted by a lowercase s ) is defined as the average of
the n deviations, where a deviation is the difference between an observed value and the sample mean.
It tells you the typical difference between an observed value and the sample mean.
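The sample variance ( $s^2$ ) and the standard deviation ( $s$ ) can be calculated by the following formulas:
$$ s^2 = \frac{1}{n - 1} \sum (X - \bar{X})^2 \quad \text{and} \quad s = \sqrt{s^2} $$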
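As a minimal sketch of the standard deviation calculation, the function below squares each deviation from the sample mean, sums them, divides by n - 1 and takes the square root. The name `sample_std` and the sample values are invented for illustration.

```python
import math
from statistics import stdev

def sample_std(values):
    """Sample standard deviation with the n - 1 divisor."""
    n = len(values)
    x_bar = sum(values) / n                          # sample mean
    squared_devs = [(x - x_bar) ** 2 for x in values]  # squared deviations
    variance = sum(squared_devs) / (n - 1)           # divide by n - 1, not n
    return math.sqrt(variance)

# Hypothetical sample, invented purely for illustration.
values = [2, 4, 4, 4, 5, 5, 7, 9]
print(round(sample_std(values), 3))
```

The standard library function `statistics.stdev` also uses the n - 1 divisor, so it should agree with this hand-rolled version.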
Five-Number Summary
The five-number summary refers to a collection of five important statistics.
Minimum
Q1
Median
Q3
Maximum
Once we have drawn a boxplot, make sure we are able to provide a description of it (i.e. centre,
variability, shape and outliers).
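The five-number summary can be collected in one step. The sketch below assumes the median-of-halves rule for the quartiles; the function name `five_number_summary` and the sample values are invented for illustration.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])         # median of the lower half
    q3 = median(ordered[n - half:])     # median of the upper half
    return ordered[0], q1, median(ordered), q3, ordered[-1]

# Hypothetical sample, invented purely for illustration.
print(five_number_summary([12, 5, 9, 21, 7, 15, 10]))
```

These five values are exactly what a boxplot draws: the box runs from Q1 to Q3 with a line at the median.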
Detection of Outliers
There are two rules to determine outliers: One makes use of IQR and the other makes use of standard
deviation (later module). Note that both are measures of spread.
An observed value is defined as an outlier when it is either strictly bigger than the upper limit (UL) or
strictly smaller than the lower limit (LL), where LL and UL are defined as:
LL = Q1 - 1.5*IQR and UL = Q3 + 1.5*IQR
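The IQR rule can be sketched as a pair of small functions. The names `outlier_limits` and `find_outliers`, and the sample values, are invented for illustration; the quartiles are passed in so that any quartile definition can be used.

```python
def outlier_limits(q1, q3):
    """Return (LL, UL) from the first and third quartiles."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(values, q1, q3):
    """Return the values strictly below LL or strictly above UL."""
    lower_limit, upper_limit = outlier_limits(q1, q3)
    return [x for x in values if x < lower_limit or x > upper_limit]

# Hypothetical sample with Q1 = 10 and Q3 = 14, invented purely for illustration.
sample = [1, 10, 11, 12, 13, 14, 30]
print(outlier_limits(10, 14))
print(find_outliers(sample, 10, 14))
```

Note the strict inequalities: a value exactly equal to LL or UL is not flagged as an outlier.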
Practice Questions
Suppose you want to know about the selling price of detached houses in the city of Vancouver in September
2015. The selling price of fifteen randomly selected detached houses was recorded.
7.8  8.2  8.3  8.4  8.8  8.9  9.2  9.6  9.8  9.9  10.2  10.4  10.5  10.7  11.8
Note: The numbers are in CD$100,000. In other words, a number 7.8 means CD$780,000.
a) Calculate the mean and find the median. [2+2 marks]
b) Find the first quartile and third quartile. [2+2 marks]
c) Find the range, inter-quartile range and calculate standard deviation. [1+2+3 marks]
d) Report the five-number summary. [1 mark]
e) Draw a boxplot. [3 marks]
f) Is/Are there any outlier(s)? Justify your answer by providing some calculation. [2 marks]
StatGraphics Page
SAME AS UNIT #2
Answer to Practice Questions
Suppose you want to know about the selling price of detached houses in the city of Vancouver in September
2015. The selling price of fifteen randomly selected detached houses was recorded.
7.8  8.2  8.3  8.4  8.8  8.9  9.2  9.6  9.8  9.9  10.2  10.4  10.5  10.7  11.8
Note: The numbers are in CD$100,000. In other words, a number 7.8 means CD$780,000.
a) Calculate the mean and find the median. [2+2 marks]
Mean:
$$ \bar{X} = \frac{\sum X}{n} = \frac{142.5}{15} = 9.5 $$
Median:
With the observed values already in an ordered list, the location index is
$$ i = \frac{n + 1}{2} = \frac{15 + 1}{2} = 8 . $$
The 8th location of the ordered list has 9.6. Therefore, the median is 9.6, i.e. the median price of
the 15 randomly selected detached houses in the city of Vancouver in Sept 2015 is CD$960,000.
b) Find the first quartile and third quartile. [2+2 marks]
The first quartile is the median of all numbers strictly to the left of the overall median (i.e.
9.6). Therefore, the first quartile is 8.4 (by eyeballing).
The third quartile is the median of all numbers strictly to the right of the overall median (i.e.
9.6). The location index is $i = \frac{7 + 1}{2} = 4$. In other words, the third quartile is 10.4.
c) Find the range, inter-quartile range and calculate standard deviation. [1+2+3 marks]
Range:
Range = maximum - minimum = 11.8 - 7.8 = 4.0
Meaning / interpretation: The difference between the highest and the lowest selling price among the 15
randomly selected detached houses in the city of Vancouver in Sept 2015 is CD$400,000.
Interquartile range:
Since Q1 = 8.4 and Q3 = 10.4, IQR = Q3-Q1 = 10.4-8.4 = 2.0.
Meaning / interpretation: The middle 50% of the selling prices of the 15 randomly selected
detached houses in the city of Vancouver in Sept 2015 span a range of CD$200,000.
Standard deviation:
Since $\bar{X} = 9.5$, we have

Obs. Values    $X - \bar{X}$    $(X - \bar{X})^2$
7.8            -1.7             2.89
8.2            -1.3             1.69
8.3            -1.2             1.44
8.4            -1.1             1.21
8.8            -0.7             0.49
8.9            -0.6             0.36
9.2            -0.3             0.09
9.6             0.1             0.01
9.8             0.3             0.09
9.9             0.4             0.16
10.2            0.7             0.49
10.4            0.9             0.81
10.5            1.0             1.00
10.7            1.2             1.44
11.8            2.3             5.29
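$\sum (X - \bar{X})^2 = 17.46$,
$$ s^2 = \frac{\sum (X - \bar{X})^2}{n - 1} = \frac{17.46}{15 - 1} = \frac{17.46}{14} = 1.247 , \quad s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}} = \sqrt{1.247} = 1.117 $$
d) Report the five-number summary. [1 mark]
Minimum = 7.8, Q1 = 8.4, Median = 9.6, Q3 = 10.4, Maximum = 11.8
e) Draw a boxplot. [3 marks]
[Figure: boxplot of House Price (in CD$100,000), annotated with Median = 9.6, Q3 = 10.4 and Max = 11.8.]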
f) Is/Are there any outlier(s)? Justify your answer by providing some calculation. [2 marks]
Lower limit and upper limit:
LL = Q1-1.5*IQR = 8.4 - 1.5*2.0 = 5.4
UL = Q3+1.5*IQR = 10.4 + 1.5*2.0 = 13.4
To determine if there are outliers, it's typically sufficient to check the minimum and maximum.
Since the minimum (7.8) is not less than the lower limit (5.4) and the maximum (11.8) is not larger than
the upper limit (13.4), there are no outliers in the data set.
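As a cross-check on the hand calculations, the answers can be reproduced with a short Python script. This is only a sketch: it uses the fifteen selling prices from the question and assumes the median-of-halves rule for the quartiles; the variable names are made up.

```python
from statistics import mean, median, stdev

# Selling prices from the practice question, in CD$100,000.
prices = [7.8, 8.2, 8.3, 8.4, 8.8, 8.9, 9.2, 9.6,
          9.8, 9.9, 10.2, 10.4, 10.5, 10.7, 11.8]

ordered = sorted(prices)
n = len(ordered)
half = n // 2

x_bar = mean(ordered)
med = median(ordered)
q1 = median(ordered[:half])          # values strictly left of the median
q3 = median(ordered[n - half:])      # values strictly right of the median
iqr = q3 - q1
s = stdev(ordered)                   # uses the n - 1 divisor

lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = [x for x in ordered if x < lower_limit or x > upper_limit]

print(f"mean={x_bar:.1f} median={med} Q1={q1} Q3={q3} IQR={iqr:.1f}")
print(f"s={s:.3f} LL={lower_limit:.1f} UL={upper_limit:.1f} outliers={outliers}")
```

The printed values should agree with the hand-calculated mean, median, quartiles, standard deviation and limits, and the list of outliers should be empty.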