
Stat 1181 Notes: Unit #3 Summarizing Univariate Data using Statistics

Summarizing a Single Numerical Variable Using Statistics


With numerical data, many more statistics are available because the responses are numbers.
There are three sets of statistics: (1) measures of centre, (2) measures of location, and (3)
measures of spread (also called variability, dispersion, or variation).

Measures of Centre
The most common measures of centre are the mean, the median, and the mode.

The (sample) mean or (sample) average (denoted as $\bar{X}$ or X-bar) can be calculated by dividing the sum
of all observed values (or responses, in a more general term) by the sample size $n$:

$$\bar{X} = \frac{1}{n} \sum X$$
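As a minimal sketch of the definition of the sample mean (sum of the observed values divided by the sample size), the calculation can be written in Python as follows. The function name `sample_mean` and the five-value sample are hypothetical and are not taken from these notes.

```python
def sample_mean(values):
    """Return the sum of the observed values divided by the sample size n."""
    n = len(values)
    if n == 0:
        raise ValueError("the sample must contain at least one value")
    return sum(values) / n


# Hypothetical sample of five observed values (not from the notes).
sample = [2, 4, 4, 5, 10]
print(sample_mean(sample))  # 25 / 5 = 5.0
```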

The (sample) median is defined as the middle-most value of an ordered list (of the observed values).
An ordered list is simply a list of all observed values arranged in ascending order (from smallest to
largest).

When $n$ is large, we can first find the location $i$ of the median, with $i = \frac{n+1}{2}$; the
observed value at location $i$ of the ordered list is the median.

In a case where the location index is not a whole number, we would use the midpoint (or average) of the
two adjacent values in the ordered list.
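The location-index rule for the median can be sketched in Python as below. This is only an illustration under stated assumptions: the function name `sample_median` and the example values are hypothetical, and the location index is counted from 1, as in the notes.

```python
def sample_median(values):
    """Median via the location index i = (n + 1) / 2 of the ordered list."""
    ordered = sorted(values)          # ordered list, smallest to largest
    n = len(ordered)
    if n == 0:
        raise ValueError("the sample must contain at least one value")
    location = (n + 1) / 2            # 1-based location index
    if location.is_integer():
        # Whole-number location: take the value at that position.
        return ordered[int(location) - 1]
    # Otherwise average the two adjacent values around the location.
    lower = ordered[int(location) - 1]
    upper = ordered[int(location)]
    return (lower + upper) / 2


print(sample_median([3, 1, 7]))     # location 2 in [1, 3, 7] -> 3
print(sample_median([3, 1, 7, 5]))  # location 2.5 in [1, 3, 5, 7] -> 4.0
```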

The (sample) mode is defined as the observed value (or response) that occurs most often; there can be
more than one mode. Technically, it is not discussed in the same context as the mean and median.
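For illustration, the mode can be found with `multimode` from Python's standard `statistics` module, which returns every value tied for the highest count. The list of responses below is hypothetical.

```python
from statistics import multimode

# Hypothetical responses (not from the notes).
responses = ["red", "blue", "red", "green", "blue", "red"]

# multimode returns a list of the most frequent value(s).
modes = multimode(responses)
print(modes)  # ['red'], since "red" occurs three times
```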

Measures of Location
The simplest measures of location are the minimum and the maximum.
To find them, we just need to look for the smallest and the largest of the observed values.

The percentiles are used to divide all observed values into 100 equal parts.
There are only 99 percentiles (1st percentile, 2nd percentile, ..., 99th percentile) because the two
ends of the data are called the minimum and the maximum respectively.

The quartiles are used to divide all observed values into 4 equal parts.
How many quartiles should we have? If your answer is three, bingo!
However, in practice, the second quartile is equivalent to the median (and so is the 50th percentile).
Therefore, we only work with two quartiles: the first quartile ($Q_1$) and the third quartile ($Q_3$).


To find the quartiles, all observed values are arranged in ascending order (i.e. ordered list).
The first quartile is the median of all the observed values strictly to the left of the overall median.
The third quartile is the median of all the observed values strictly to the right of the overall median.
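Graphically, their use can be illustrated as follows.
[Figure: the ordered list, from the smallest value (then the second smallest, third smallest, and so on) to the largest, is split into four groups that each contain 25% of the data. The three cut points are the first quartile ($Q_1$), the median, and the third quartile ($Q_3$).]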

Note: Different textbooks and statistical software applications use different formulas, so their
quartile values may differ slightly. The point here is not to be too concerned about the exact value
(because it depends on how things are defined), but to focus on the bigger picture: understanding
the concept and how to apply it when needed.
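As a minimal sketch of the quartile rule used in these notes (the median of the values strictly to the left, or strictly to the right, of the overall median), the following Python function may help. The function name `quartiles` and the sample data are hypothetical; `statistics.median` is from the Python standard library.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, median, Q3) using the 'strictly left/right of the median' rule."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observed values")
    half = n // 2                 # number of values on each side of the median
    left = ordered[:half]         # values strictly to the left of the median
    right = ordered[n - half:]    # values strictly to the right of the median
    return median(left), median(ordered), median(right)


# Hypothetical sample of seven values (not from the notes).
print(quartiles([1, 2, 3, 4, 5, 6, 7]))  # (2, 4, 6)
```

Other definitions exist. For example, `statistics.quantiles` in the standard library offers an "exclusive" and an "inclusive" method, and their results need not match the rule sketched here.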

Measures of Spread
The range, in statistics, is defined as the difference between the maximum and the minimum.

The interquartile range (or IQR) is defined as the difference between the third quartile and the first
quartile, i.e. $\text{IQR} = Q_3 - Q_1$.

Loosely speaking, the standard deviation (denoted by a lowercase $s$) is the average of
the $n$ deviations, where a deviation is the difference between an observed value and the sample mean.
It tells you the typical distance between an observed value and the sample mean.
It can be calculated by the following formula:

$$s = \sqrt{\frac{1}{n-1} \sum (X - \bar{X})^2}$$

To calculate the standard deviation, we work from the inside of the formula to the outside:

0) It is preferred, but not necessary, to put all observed values in an ordered list.
1) Find the sample mean ($\bar{X}$).
2) Take the difference between each value ($X$) and the sample mean ($\bar{X}$).
3) Square the differences from (2).
4) Sum all squared values from (3).
5) Divide the sum from (4) by $n - 1$.
6) Take the square root ($\sqrt{\ }$) of (5).
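The steps above can be sketched in Python, one line per step. This is an illustration only: the function name `sample_std_dev` and the five-value sample are hypothetical and are not taken from these notes.

```python
import math


def sample_std_dev(values):
    """Sample standard deviation, following steps (1) to (6)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observed values")
    mean = sum(values) / n                        # step 1: sample mean
    deviations = [x - mean for x in values]       # step 2: X minus the mean
    squared = [d ** 2 for d in deviations]        # step 3: square each difference
    total = sum(squared)                          # step 4: sum of squares
    variance = total / (n - 1)                    # step 5: divide by n - 1
    return math.sqrt(variance)                    # step 6: square root


# Hypothetical sample: mean 5, squared deviations 9 + 1 + 1 + 0 + 25 = 36.
print(sample_std_dev([2, 4, 4, 5, 10]))  # sqrt(36 / 4) = 3.0
```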

Five-Number Summary
The five-number summary refers to a collection of five important statistics.
Minimum, $Q_1$, Median, $Q_3$, Maximum


Box-and-whisker Plot (or Boxplot)


The box-and-whisker plot, or boxplot, is the third of the three graphs for numerical data.
Unlike the previous two graphs (which make use of all observed values for the plot), a boxplot only needs
the five numbers from the five-number summary.
To draw a horizontal boxplot:
1) Draw the x-axis, label it with variable of interest, and put down a proper scale of the axis.
2) Draw two short vertical lines for Q1 and Q3 at their corresponding points, above the x-axis.
3) Connect the two lines to form a box.
4) Draw another vertical line for median inside the box in (3).
5) Draw the right hand whisker from Q3 to the maximum.
6) Draw the left hand whisker from Q1 to the minimum.
Note that a vertical boxplot is fine too.

Once we have drawn a boxplot, make sure we are able to provide a description of it (i.e. centre,
variability, shape and outliers).

Detection of Outliers
There are two rules to determine outliers: One makes use of IQR and the other makes use of standard
deviation (later module). Note that both are measures of spread.
An observed value is defined as an outlier when it is either strictly bigger than the upper limit (UL) or
strictly smaller than the lower limit (LL), where LL and UL are defined as:
$\text{LL} = Q_1 - 1.5 \times \text{IQR}$ and $\text{UL} = Q_3 + 1.5 \times \text{IQR}$
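The IQR rule can be sketched in Python as below. It assumes the quartiles have already been found; the function names `outlier_limits` and `find_outliers` and the sample data are hypothetical.

```python
def outlier_limits(q1, q3):
    """Return (LL, UL) = (Q1 - 1.5 * IQR, Q3 + 1.5 * IQR)."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(values, q1, q3):
    """Return the values strictly below LL or strictly above UL."""
    lower_limit, upper_limit = outlier_limits(q1, q3)
    return [x for x in values if x < lower_limit or x > upper_limit]


# Hypothetical sample with Q1 = 10 and Q3 = 14, so IQR = 4.
data = [1, 10, 11, 12, 13, 14, 30]
print(outlier_limits(10, 14))       # (4.0, 20.0)
print(find_outliers(data, 10, 14))  # [1, 30]
```

Because the comparisons are strict, a value exactly equal to LL or UL is not flagged as an outlier.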

Unit #3 Learning Outcomes


Calculate measures of centre (mean and median).
Calculate measures of location (quartiles and others).
Calculate measures of variability or spread (interquartile range and standard deviation).
Find the five-number summary.
Draw a boxplot.
Determine if an observed value is an outlier, using IQR.

Practice Questions
Suppose you want to know about the selling price of detached houses in the city of Vancouver in September
2015. The selling price of fifteen randomly selected detached houses was recorded.
7.8   8.2   8.3   8.4   8.8   8.9   9.2   9.6   9.8   9.9   10.2   10.4   10.5   10.7   11.8
Note: The numbers are in CD$100,000. In other words, a number 7.8 means CD$780,000.
a) Calculate the mean and find the median. [2+2 marks]
b) Find the first quartile and third quartile. [2+2 marks]
c) Find the range, inter-quartile range and calculate standard deviation. [1+2+3 marks]
d) Report the five-number summary. [1 mark]
e) Draw a boxplot. [3 marks]
f) Is/Are there any outlier(s)? Justify your answer by providing some calculation. [2 marks]

StatGraphics Page

SAME AS UNIT #2
Answer to Practice Questions
Suppose you want to know about the selling price of detached houses in the city of Vancouver in September
2015. The selling price of fifteen randomly selected detached houses was recorded.
7.8   8.2   8.3   8.4   8.8   8.9   9.2   9.6   9.8   9.9   10.2   10.4   10.5   10.7   11.8
Note: The numbers are in CD$100,000. In other words, a number 7.8 means CD$780,000.
a) Calculate the mean and find the median. [2+2 marks]
Mean:

$$\bar{X} = \frac{\sum X}{n} = \frac{7.8 + 8.2 + 8.3 + \cdots + 10.7 + 11.8}{15} = \frac{142.5}{15} = 9.5$$

Median:
With the observed values already in an ordered list, the location index is
$i = \frac{n+1}{2} = \frac{15+1}{2} = 8$.

The 8th location of the ordered list has 9.6. Therefore, the median is 9.6; that is, the median price of
the 15 randomly selected detached houses in the city of Vancouver in Sept 2015 is CD$960,000.
b) Find the first quartile and third quartile. [2+2 marks]
The first quartile is the median of all numbers strictly to the left of the overall median (i.e.
9.6). Therefore, the first quartile is 8.4 (by eyeballing).
The third quartile is the median of all numbers strictly to the right of the overall median (i.e.
9.6). There are 7 such numbers, so the location index is $i = \frac{7+1}{2} = 4$. In other words, the
third quartile is 10.4.

c) Find the range, inter-quartile range and calculate standard deviation. [1+2+3 marks]
Range:
Range = maximum - minimum = 11.8 - 7.8 = 4.0
Meaning / interpretation: The difference between the highest and the lowest selling price among the 15
randomly selected detached houses in the city of Vancouver in Sept 2015 is CD$400,000.
Interquartile range:
Since Q1 = 8.4 and Q3 = 10.4, IQR = Q3 - Q1 = 10.4 - 8.4 = 2.0.
Meaning / interpretation: The middle 50% of the selling prices of the 15 randomly selected
detached houses in the city of Vancouver in Sept 2015 span a range of CD$200,000.


Standard deviation:
Since $\bar{X} = 9.5$, we have

Obs. value ($X$)    $X - \bar{X}$    $(X - \bar{X})^2$
 7.8                -1.7             2.89
 8.2                -1.3             1.69
 8.3                -1.2             1.44
 8.4                -1.1             1.21
 8.8                -0.7             0.49
 8.9                -0.6             0.36
 9.2                -0.3             0.09
 9.6                 0.1             0.01
 9.8                 0.3             0.09
 9.9                 0.4             0.16
10.2                 0.7             0.49
10.4                 0.9             0.81
10.5                 1.0             1.00
10.7                 1.2             1.44
11.8                 2.3             5.29

$\sum (X - \bar{X})^2 = 17.46$,

$$s^2 = \frac{\sum (X - \bar{X})^2}{n-1} = \frac{17.46}{15-1} = \frac{17.46}{14} = 1.247, \qquad s = \sqrt{\frac{\sum (X - \bar{X})^2}{n-1}} = \sqrt{1.247} = 1.117$$

d) Report the five-number summary. [1 mark]

Min    Q1     Median    Q3      Max
7.8    8.4    9.6       10.4    11.8

e) Draw a boxplot. [3 marks]


[Figure: Box-and-Whisker Plot. Horizontal axis: House Price (in CD$100,000).]

f) Is/Are there any outlier(s)? Justify your answer by providing some calculation. [2 marks]
Lower limit and upper limit:
LL = Q1 - 1.5*IQR = 8.4 - 1.5*2.0 = 5.4
UL = Q3 + 1.5*IQR = 10.4 + 1.5*2.0 = 13.4
To determine if there are outliers, it's typically sufficient to check the minimum and maximum.
Since the minimum (7.8) is not less than the lower limit (5.4) and the maximum (11.8) is not larger
than the upper limit (13.4), there are no outliers in the data set.
