
Chapter 3, Numerical Descriptive Measures
Data analysis is objective
Should report the summary measures that best
meet the assumptions about the data set

Data interpretation is subjective
Should be done in a fair, neutral, and clear
manner

Summary Measures
Describing Data Numerically
Central Tendency: Arithmetic Mean, Median, Mode, Geometric Mean
Quartiles
Variation: Range, Interquartile Range, Variance, Standard Deviation, Coefficient of Variation
Shape: Skewness
Arithmetic Mean
The arithmetic mean (mean) is the most common measure of
central tendency
Mean = sum of values divided by the number of values
Affected by extreme values (outliers)

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}$$

where $n$ is the sample size and $X_1, X_2, \ldots, X_n$ are the observed values.
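A minimal sketch of the mean calculation, showing how a single extreme value pulls the mean upward. The sample values and variable names are made up for illustration and do not come from the chapter.

```python
# Hypothetical sample values, chosen only for illustration
values = [1, 2, 3, 4, 5]
sample_mean = sum(values) / len(values)        # sum of values / number of values

# Same sample with the largest value replaced by an extreme value
with_outlier = [1, 2, 3, 4, 10]
outlier_mean = sum(with_outlier) / len(with_outlier)

print(sample_mean)    # 3.0
print(outlier_mean)   # 4.0
```

Only one observation changed, yet the mean moved from 3 to 4.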
Geometric Mean
Geometric mean
Used to measure the rate of change of a variable over time

$$\bar{X}_G = (X_1 \times X_2 \times \cdots \times X_n)^{1/n}$$

Geometric mean rate of return
Measures the status of an investment over time

$$\bar{R}_G = [(1 + R_1) \times (1 + R_2) \times \cdots \times (1 + R_n)]^{1/n} - 1$$

where $R_i$ is the rate of return in time period $i$
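The geometric mean rate of return can be sketched as follows. The two-period returns (a 50% loss followed by a 100% gain) are hypothetical values chosen so the result is easy to check by hand.

```python
import math

# Hypothetical rates of return for two periods: -50%, then +100%
rates = [-0.50, 1.00]

n = len(rates)
growth = math.prod(1 + r for r in rates)     # (1 + R_1)(1 + R_2)...(1 + R_n)
geometric_return = growth ** (1 / n) - 1     # n-th root of the product, minus 1
arithmetic_return = sum(rates) / n           # ordinary mean, for comparison

print(geometric_return)    # 0.0
print(arithmetic_return)   # 0.25
```

The investment ends exactly where it started, which the geometric mean (0%) reflects and the arithmetic mean (25%) does not.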
Median: Position and Value
In an ordered array, the median is the middle
number (50% above, 50% below)
The location (position) of the median:

$$\text{Median position} = \frac{n + 1}{2} \text{ position in the ordered data}$$

The value of the median is NOT affected by
extreme values
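The position rule might be applied like this. The function name and data are hypothetical. Averaging the two neighbouring values when the position ends in .5 (even n) is the usual convention and is an assumption here, since the slide gives only the position formula.

```python
def median_by_position(values):
    """Median via the (n + 1) / 2 position rule on the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2             # 1-based position in the ordered array
    lower = ordered[int(position) - 1]
    if position == int(position):      # whole-number position: take that value
        return lower
    upper = ordered[int(position)]     # position ends in .5: average the neighbours
    return (lower + upper) / 2

print(median_by_position([3, 1, 4, 10, 2]))      # 3
print(median_by_position([3, 1, 4, 2]))          # 2.5
print(median_by_position([3, 1, 4, 1000, 2]))    # 3
```

Replacing 10 with 1000 leaves the median at 3, which illustrates its insensitivity to extreme values.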
Mode
A measure of central tendency
Value that occurs most often
Not affected by extreme values
Used for either numerical or categorical data
There may be no mode
There may be several modes
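A small sketch of finding the mode or modes. The function name and data are hypothetical, and treating a data set in which every value occurs exactly once as having no mode is an assumed convention.

```python
from collections import Counter

def modes(values):
    """Return all values that occur most often; empty list if nothing repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:                   # every value occurs once: no mode
        return []
    return [v for v, c in counts.items() if c == highest]

print(modes([1, 2, 2, 3]))                 # [2]
print(modes([1, 1, 2, 2, 3]))              # [1, 2]
print(modes([1, 2, 3]))                    # []
print(modes(["red", "blue", "red"]))       # ['red']
```

The last call shows that the mode also works for categorical data.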

Quartiles
Quartiles split the ranked data into 4 segments
with an equal number of values per segment
Find a quartile by determining the value in the
appropriate position in the ranked data, where

First quartile position: $Q_1 = (n+1)/4$

Second quartile position: $Q_2 = 2(n+1)/4$ (the median position)

Third quartile position: $Q_3 = 3(n+1)/4$

where $n$ is the number of observed values
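The quartile positions can be computed directly. One way to get the values is Python's `statistics.quantiles` with `method="exclusive"`, which places the cut points at k(n+1)/4 and interpolates linearly between neighbours when a position is fractional. That interpolation rule is an assumption, since the slide does not say how to handle fractional positions. The data are hypothetical.

```python
from statistics import quantiles

data = [11, 12, 13, 16, 16, 17, 18, 21, 22]    # hypothetical ranked data
n = len(data)

# Positions of Q1, Q2, Q3 in the ranked data: k(n + 1) / 4
positions = [k * (n + 1) / 4 for k in (1, 2, 3)]

# Values at those positions (linear interpolation for fractional positions)
q1, q2, q3 = quantiles(data, n=4, method="exclusive")

print(positions)      # [2.5, 5.0, 7.5]
print(q1, q2, q3)     # 12.5 16.0 19.5
```

Here Q1 sits at position 2.5, halfway between the 2nd and 3rd ranked values (12 and 13), giving 12.5.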

Measures of Variation
Measures of variation give information on the spread or variability of the data values.
Variation: Range, Interquartile Range, Variance, Standard Deviation, Coefficient of Variation
[Figure: distributions with the same center, different variation]

Range and Interquartile Range
Range
Simplest measure of variation
Difference between the largest and the smallest observations:
Range = $X_{\text{largest}} - X_{\text{smallest}}$
Ignores the way in which data are distributed
Sensitive to outliers
Interquartile Range
Eliminate some high- and low-valued observations and calculate
the range from the remaining values
Interquartile range = 3rd quartile - 1st quartile = $Q_3 - Q_1$
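A sketch comparing the two measures on hypothetical data, with and without an extreme value. The function names are illustrative. The quartiles use `statistics.quantiles` with `method="exclusive"`, an assumed convention for fractional quartile positions.

```python
from statistics import quantiles

def value_range(values):
    """Range = largest observation - smallest observation."""
    return max(values) - min(values)

def interquartile_range(values):
    """IQR = Q3 - Q1, with quartiles at positions k(n + 1) / 4."""
    q1, _, q3 = quantiles(values, n=4, method="exclusive")
    return q3 - q1

data = [11, 12, 13, 16, 16, 17, 18, 21, 22]           # hypothetical data
data_with_outlier = [11, 12, 13, 16, 16, 17, 18, 21, 100]

print(value_range(data), interquartile_range(data))    # 11 7.0
print(value_range(data_with_outlier),
      interquartile_range(data_with_outlier))          # 89 7.0
```

The outlier inflates the range from 11 to 89, while the interquartile range stays at 7.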



Variance
Average (approximately) of squared
deviations of values from the mean

Sample variance:

$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$

Where
$\bar{X}$ = arithmetic mean
$n$ = sample size
$X_i$ = $i$th value of the variable $X$
Standard Deviation
Most commonly used measure of variation
Shows variation about the mean
Has the same units as the original data
It is a measure of the average spread around the mean
Sample standard deviation:

$$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$
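The sample variance and sample standard deviation can be computed step by step. The data are hypothetical, chosen so the arithmetic is easy to follow by hand.

```python
import math

values = [1, 1, 3, 5, 5]                       # hypothetical sample
n = len(values)
x_bar = sum(values) / n                        # arithmetic mean: 3.0

# Squared deviations from the mean: 4, 4, 0, 4, 4
squared_deviations = [(x - x_bar) ** 2 for x in values]

sample_variance = sum(squared_deviations) / (n - 1)    # 16 / 4
sample_std = math.sqrt(sample_variance)                # same units as the data

print(sample_variance)   # 4.0
print(sample_std)        # 2.0
```

Dividing by n - 1 rather than n is why the variance is only approximately the average squared deviation.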
Coefficient of Variation
Measures relative variation
Always in percentage (%)
Shows variation relative to mean
Can be used to compare two or more sets of data
measured in different units
$$CV = \left( \frac{S}{\bar{X}} \right) \cdot 100\%$$
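A sketch of using the coefficient of variation to compare data measured in different units. The height and weight figures and the function name are hypothetical.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """CV = (S / X-bar) * 100%, using the sample standard deviation."""
    return stdev(values) / mean(values) * 100

# Hypothetical measurements in different units
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [50, 60, 70, 80, 90]

cv_height = coefficient_of_variation(heights_cm)
cv_weight = coefficient_of_variation(weights_kg)

print(f"heights: {cv_height:.2f}%")
print(f"weights: {cv_weight:.2f}%")
```

Both samples have the same standard deviation in their own units, but the weights have the smaller mean, so their variation relative to the mean is larger.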
Shape of a Distribution
Describes how data are distributed
Measures of shape
Symmetric or skewed

Left-Skewed: Mean < Median
Symmetric: Mean = Median
Right-Skewed: Median < Mean
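The mean-versus-median comparison can be turned into a rough check. This is only a heuristic sketch with a hypothetical function name and data. With real measurements an exact equality test is rarely met, so a tolerance would be needed, and equal mean and median do not by themselves guarantee a symmetric distribution.

```python
from statistics import mean, median

def describe_shape(values):
    """Rough shape label from comparing the mean with the median."""
    m, med = mean(values), median(values)
    if m > med:
        return "right-skewed"
    if m < med:
        return "left-skewed"
    return "symmetric"

print(describe_shape([1, 2, 3, 4, 10]))    # right-skewed
print(describe_shape([1, 7, 8, 9, 10]))    # left-skewed
print(describe_shape([1, 2, 3, 4, 5]))     # symmetric
```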
Using the Five-Number Summary to Explore the Shape
Box-and-Whisker Plot: A graphical display of data using the
5-number summary:

Minimum, Q1, Median, Q3, Maximum

The box and central line are centered between the
endpoints if the data are symmetric around the median

[Figure: box-and-whisker plot marking Min, Q1, Median, Q3, Max]
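The five numbers behind a box-and-whisker plot could be gathered as below. The function name and data are hypothetical, and the quartiles rely on `statistics.quantiles` with `method="exclusive"`, an assumed convention for fractional quartile positions.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, q2, q3 = quantiles(values, n=4, method="exclusive")
    return min(values), q1, q2, q3, max(values)

data = [11, 12, 13, 16, 16, 17, 18, 21, 22]    # hypothetical data
summary = five_number_summary(data)

print(summary)    # (11, 12.5, 16.0, 19.5, 22)
```

In this sample the median lies exactly midway between Q1 and Q3 (3.5 on each side), so the central line would sit in the middle of the box.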
Distribution Shape and Box-and-Whisker Plot
[Figure: box-and-whisker plots with Q1, Q2, Q3 marked for left-skewed, symmetric, and right-skewed distributions]
Relationship between Std. Dev. and Shape: The Empirical Rule
If the data distribution is bell-shaped, then the interval:
$\mu \pm 1\sigma$ contains about 68% of the values in the population or
the sample

$\mu \pm 2\sigma$ contains about 95% of the values in the population or
the sample

$\mu \pm 3\sigma$ contains about 99.7% of the values in the population
or the sample
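The three percentages can be checked against a normal distribution, used here as an assumed model of a bell-shaped distribution. The sketch uses `statistics.NormalDist` from the Python standard library, and the variable names are illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Share of values within k standard deviations of the mean, for k = 1, 2, 3
shares = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

for k, share in shares.items():
    print(k, round(share, 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

These match the rounded figures of about 68%, 95%, and 99.7%.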
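Population Mean and Variance
Population mean:

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{X_1 + X_2 + \cdots + X_N}{N}$$

Population variance:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$

where $N$ is the population size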
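To make the difference in denominators concrete, the sketch below treats the same hypothetical values first as an entire population (divide by N) and then as a sample (divide by n - 1).

```python
values = [1, 1, 3, 5, 5]                # hypothetical values
N = len(values)
mu = sum(values) / N                    # mean: 3.0

sum_of_squares = sum((x - mu) ** 2 for x in values)    # 16.0

population_variance = sum_of_squares / N          # values are the whole population
sample_variance = sum_of_squares / (N - 1)        # values are a sample

print(population_variance)   # 3.2
print(sample_variance)       # 4.0
```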
Covariance and Coefficient of Correlation
The sample covariance measures the strength of the
linear relationship between two variables (called
bivariate data)
The sample covariance:

$$\text{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$

Only concerned with the strength of the relationship
No causal effect is implied
Covariance between two random variables:
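cov(X,Y) > 0: X and Y tend to move in the same direction
cov(X,Y) < 0: X and Y tend to move in opposite directions
cov(X,Y) = 0: X and Y have no linear relationship (independent variables have zero covariance, but zero covariance alone does not establish independence)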
Covariance does not say anything about the relative strength of
the relationship.
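A sketch of the sample covariance on hypothetical data, with an illustrative function name. It shows the sign behaviour, the dependence on units, and a case where the covariance is zero even though Y is completely determined by X.

```python
def sample_covariance(xs, ys):
    """cov(X, Y) = sum((X_i - X-bar)(Y_i - Y-bar)) / (n - 1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]                                   # rises with x

print(sample_covariance(x, y))                         # 5.0
print(sample_covariance(x, [10, 8, 6, 4, 2]))          # -5.0
print(sample_covariance(x, [v * 100 for v in y]))      # 500.0

# Zero covariance without independence: here Y = X squared
u = [-2, -1, 0, 1, 2]
print(sample_covariance(u, [v ** 2 for v in u]))       # 0.0
```

Multiplying Y by 100 (a change of units) multiplies the covariance by 100, although the relationship is unchanged. This is why the magnitude of the covariance says nothing about relative strength.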
Coefficient of Correlation measures the relative strength of the
linear relationship between two variables

$$r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \sum_{i=1}^{n} (Y_i - \bar{Y})^2}} = \frac{\text{cov}(X, Y)}{S_X S_Y}$$
Coefficient of Correlation:
Is unit free
Ranges between -1 (perfect negative) and +1 (perfect
positive)
The closer to -1, the stronger the negative linear
relationship
The closer to +1, the stronger the positive linear
relationship
The closer to 0, the weaker the linear relationship
At 0 there is no linear relationship (a nonlinear relationship may still exist)
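A sketch of the coefficient of correlation on hypothetical data, with an illustrative function name. The function would divide by zero if either variable were constant, a case this sketch does not handle.

```python
import math

def correlation(xs, ys):
    """r = sum of cross-deviations / sqrt(product of the sums of squared deviations)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

print(correlation(x, y))                         # 1.0
print(correlation(x, [v * 100 for v in y]))      # 1.0
print(correlation(x, [10, 8, 6, 4, 2]))          # -1.0
print(correlation(x, [2, 1, 4, 3, 5]))           # 0.8
```

Rescaling Y by 100 leaves r at 1.0, which illustrates that the coefficient is unit free.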

Correlation vs. Regression
A scatter plot (or scatter diagram) can be used
to show the relationship between two
variables
Correlation analysis is used to measure
strength of the association (linear
relationship) between two variables
Correlation is only concerned with strength of the
relationship
No causal effect is implied with correlation
