
Statistics: Summarizing Data
MATH283/STAT291 2013
Pam Davy
Week 1, Friday Lecture (Statistics)

Statistics involves . . .
collecting data about real life processes;
presenting and describing data;
formulating models which allow for chance variation;
fitting models to data, checking assumptions, and making predictions;
making decisions in the presence of uncertainty.

Categorical Measurements
Each observation belongs to one of a set of categories. Categories may be labelled by text (e.g. male, female) or by numbers (e.g. 0, 1). Nominal measurements involve unordered categories, e.g. gender. Ordinal measurements involve ordered categories, e.g. low, medium, high.

Quantitative Measurements
Observations take numerical values which measure a physical quantity. For interval measurements, differences have meaning but ratios don't; e.g. temperature in degrees Celsius (20 °C is not twice as hot as 10 °C). For ratio measurements, both differences and ratios have meaning; e.g. weight. Don't do inappropriate things to data!

Discrete or Continuous
If the possible values are separate points on the number line, a measurement is said to be discrete, e.g. number of emails.
The possible values of a continuous measurement form one or more intervals on the number line, e.g. length.

Frequency Table
A method of organizing categorical or discrete data.
Lists all possible values along with the number of observations (frequency or count) for each value.
Relative frequency = frequency / total is often included (possibly as a %).

Example
Number of accidents reported per week: 0, 2, 1, 0, 0, 1, 1, 0, 0, 0

Accidents   Frequency   Relative frequency
0           6           0.6
1           3           0.3
2           1           0.1
total       10          1
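The frequency table for the accident data can be reproduced with a few lines of code. The following is a minimal Python sketch (the lecture itself uses MATLAB); the function name frequency_table is chosen here for illustration only.

```python
from collections import Counter

# Weekly accident counts from the example.
accidents = [0, 2, 1, 0, 0, 1, 1, 0, 0, 0]


def frequency_table(data):
    """Return (value, frequency, relative frequency) rows, sorted by value."""
    counts = Counter(data)
    total = len(data)
    return [(value, counts[value], counts[value] / total)
            for value in sorted(counts)]


table = frequency_table(accidents)
for value, freq, rel in table:
    print(f"{value:>9} {freq:>9} {rel:>18.1f}")
```

The relative frequencies always sum to 1, which is a quick check that no observation was missed.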

Bar Chart
Frequency (or relative frequency) is shown on vertical axis.
[Figure: bar chart of the accident data; vertical axis: frequency; horizontal axis: accidents (0, 1, 2)]

Histogram
Similar to a bar chart, but used for continuous interval or ratio data. Real number scale on horizontal axis, no gaps between bars. Observations are grouped into classes, not necessarily of constant width. Frequency or relative frequency is represented by area of bar.

Constant class width


MATLAB's hist( ) function produces histograms with constant class (bin) width. The appearance of a histogram varies according to the choice of classes; avoid too many classes (bumpy plot) or too few (uninformative). For non-constant class width, use MATLAB histc( ) and bar( ).

MATLAB Histogram
Sulphur emission data (from eLearning).
Reasonably symmetric, single hump.
[Figure: "Histogram of Sulphur Oxide Emissions"; vertical axis: frequency; horizontal axis: tons of sulphur oxides]

Area of histogram bars
Area = height × width, so the vertical axis of a histogram should ideally display density (relative frequency ÷ width). For constant class width, area is proportional to height, so the vertical axis can display frequency if preferred. For non-constant class width, the vertical axis must display density.

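Non-constant class width
For the first bar, 6 out of 80 observations fall in the class 6 ≤ x < 10 (width 4), so density = (6/80)/4 = 0.01875. Total area of all bars = 1.
[Figure: "Histogram of sulphur emissions"; vertical axis: density; horizontal axis: tons of sulphur oxide]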
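The density calculation for unequal class widths can be sketched in Python as follows. The first-bar figures (6 of 80 observations in 6 ≤ x < 10) come from the slide; the other counts and class edges are hypothetical values made up for illustration, not the sulphur data, and class_densities is an illustrative name.

```python
def class_densities(counts, edges):
    """Density of each class = relative frequency / class width."""
    if len(edges) != len(counts) + 1:
        raise ValueError("need one more edge than there are classes")
    total = sum(counts)
    return [count / total / (hi - lo)
            for count, lo, hi in zip(counts, edges, edges[1:])]


# First class from the slide: 6 of 80 observations in 6 <= x < 10.
first_density = (6 / 80) / (10 - 6)

# Hypothetical counts and edges (NOT the sulphur data), unequal widths.
counts = [6, 10, 4]
edges = [0, 4, 6, 10]
densities = class_densities(counts, edges)

# Area of each bar = density * width; the areas should sum to 1.
areas = [d * (hi - lo) for d, lo, hi in zip(densities, edges, edges[1:])]
print(first_density, densities, sum(areas))
```

Because each bar's area equals its relative frequency, the total area is 1 whatever class widths are chosen.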

Stem-and-Leaf Plot
Graphical display of quantitative data which retains numerical values.
Easy to construct with pencil and paper.
Example (leaf unit = 0.1):
0 | 46
1 | 1
represents data values 0.4, 0.6, 1.1

Construction
Left-hand digit(s) used as stem (stem may be negative).
Next single digit used as leaf; truncate right-hand digits if necessary.
Vertical line separates stems from leaves.
Observations with same stem are sorted according to leaf value.

Stem Unit, Leaf Unit


Problem: does 2|4 mean 24, 2.4 or . . . ?
Solution: specify stem and/or leaf unit to indicate position of decimal point.
Units are of the form 10^k, e.g. 1, 100, 0.1.
Leaf unit = (stem unit)/10.
e.g. 24.7 = (24 × 1) + (7 × 0.1), so 24.7 → 24|7 (stem unit 1, leaf unit 0.1).
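The stem and leaf of a value follow mechanically once the leaf unit is fixed. The following Python sketch is one possible way to do it; the function names are illustrative, it assumes non-negative data, and it truncates (rather than rounds) any digits to the right of the leaf.

```python
from decimal import Decimal


def split_stem_leaf(value, leaf_unit):
    """Split a non-negative value into (stem, leaf), truncating extra digits."""
    if value < 0:
        raise ValueError("this sketch handles non-negative values only")
    # Decimal avoids binary floating-point surprises such as 24.7 / 0.1.
    scaled = int(Decimal(str(value)) / Decimal(str(leaf_unit)))  # truncates
    return divmod(scaled, 10)


def stem_and_leaf(data, leaf_unit):
    """Return the plot as a list of text rows, one per stem."""
    pairs = sorted(split_stem_leaf(x, leaf_unit) for x in data)
    stems = range(pairs[0][0], pairs[-1][0] + 1)  # keep empty stems too
    rows = {stem: "" for stem in stems}
    for stem, leaf in pairs:
        rows[stem] += str(leaf)
    return [f"{stem} | {leaves}".rstrip() for stem, leaves in rows.items()]


print(split_stem_leaf(24.7, 0.1))   # stem unit 1, leaf unit 0.1
print(split_stem_leaf(246.8, 10))   # stem unit 100, leaf unit 10
print("\n".join(stem_and_leaf([1.1, 0.6, 0.4], 0.1)))
```

Changing only leaf_unit moves the decimal point, which is why a plot without a stated unit is ambiguous.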

Interpretation
A stem-and-leaf plot is like a histogram rotated by 90°. Stem corresponds to horizontal histogram axis. Rows of leaves correspond to bars. Left or right tails of histogram correspond to top or bottom regions of stem-and-leaf plot.

Bad Stem-and-Leaf Plots


Too long! (leaf unit = 0.1)
24 | 4
25 |
26 | 559
27 | 3
28 | 07
29 | 2
30 |
31 | 8

Too short! (leaf unit = 1)
2 | 46667889
3 | 1

Truncation
Problem: Stem-and-leaf plots with too many rows don't fit on paper! Solution: truncate original data values, e.g. 246.8 → 2|4 (leaf unit 10). Some computer packages round rather than truncate.

Splitting Rows
Problem: Stem-and-leaf plots with too few rows do not reveal shapes/patterns. Solution: try splitting each row into 2. Put low leaves (0 to 4) in one row, high leaves (5 to 9) in other row. Still not enough? Split each original row into 5; group leaves 0 to 1, 2 to 3, 4 to 5, 6 to 7, 8 to 9.
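Row splitting can be sketched in Python as below, applied to the "too short" plot from the Bad Stem-and-Leaf Plots slide. The representation of a plot as (stem, leaves) pairs and the name split_rows are assumptions made for this illustration.

```python
def split_rows(rows, parts=2):
    """Split each (stem, leaves) row into 2 or 5 rows by leaf value."""
    if parts not in (2, 5):
        raise ValueError("rows are split into 2 or 5")
    width = 10 // parts  # 5 leaf values per row, or 2
    result = []
    for stem, leaves in rows:
        for k in range(parts):
            low, high = k * width, (k + 1) * width - 1
            group = "".join(d for d in leaves if low <= int(d) <= high)
            result.append((stem, group))
    return result


# The "too short" plot, leaf unit = 1.
short_plot = [(2, "46667889"), (3, "1")]
halves = split_rows(short_plot, parts=2)
for stem, leaves in halves:
    print(f"{stem} | {leaves}")
```

Empty rows at the very top or bottom of the result would normally be dropped when drawing the plot by hand.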

Sample Mean
Consider n data values $x_1, x_2, \ldots, x_n$. The mean is the average value.
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
MATLAB code: mean(x)

Rescaling Data
If each data value $x_i$ is rescaled by a linear transformation $a + b x_i$, the mean is rescaled in the same way. This is not true for a non-linear transformation such as $x_i^2$.
For {1, 2, 3}, $\bar{x} = 2$
For {13, 16, 19}, $\bar{x} = 10 + 3 \times 2 = 16$
For $\{1^2, 2^2, 3^2\}$, $\bar{x} = 14/3 \neq 2^2$
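The rescaling examples for the mean can be checked numerically. This is a small Python sketch using the standard statistics module; the constants a = 10 and b = 3 are the ones implied by the {13, 16, 19} example.

```python
from statistics import mean

x = [1, 2, 3]
a, b = 10, 3  # linear transformation a + b*x

linear = [a + b * xi for xi in x]   # the values 13, 16, 19
squared = [xi ** 2 for xi in x]     # the values 1, 4, 9

# Linear transformation: the mean transforms the same way.
print(mean(linear), a + b * mean(x))

# Non-linear transformation: mean of squares differs from squared mean.
print(mean(squared), mean(x) ** 2)
```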

Sample Variance
Variance is a measure of spread, based on squared distances of individual data points from the mean.
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
Variance is never negative, and is only zero when all data values are identical.
MATLAB code: var(x)

Sample Standard Deviation
Standard deviation is the square root of variance (same units of measurement as data).
$$s = \sqrt{s^2}$$
On calculator, enter data then use the $\sigma_{n-1}$ or $s_{n-1}$ or $s$ key.
MATLAB code: std(x)
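The sample variance with its n − 1 divisor, and the standard deviation as its square root, can be computed straight from the definitions. The Python sketch below does this and compares the result with the standard library; the data values and the name sample_variance are chosen purely for illustration.

```python
import math
import statistics


def sample_variance(x):
    """Sample variance with the n - 1 divisor, straight from the definition."""
    n = len(x)
    if n < 2:
        raise ValueError("need at least two data values")
    xbar = sum(x) / n
    return sum((xi - xbar) ** 2 for xi in x) / (n - 1)


x = [2, 5, 6, 6, 9]  # illustrative data
s2 = sample_variance(x)
s = math.sqrt(s2)    # standard deviation, same units as the data

print(s2, s)
print(statistics.variance(x), statistics.stdev(x))  # library equivalents
```

Identical data values give a variance of exactly zero, since every deviation from the mean is zero.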

Rescaling Data
Adding a constant to all observations has no effect upon standard deviation. When all observations are multiplied by a positive constant c, new s = c × old s.
For {1, 2, 3}, s = 1
For {4, 5, 6}, s = 1
For {102, 104, 106}, s = 2 × 1 = 2
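These three standard deviations can be verified with a short Python sketch. It assumes the third data set was obtained from {1, 2, 3} by doubling and then adding 100, which is consistent with the values shown.

```python
from statistics import stdev

base = [1, 2, 3]
shifted = [xi + 3 for xi in base]        # the values 4, 5, 6
scaled = [100 + 2 * xi for xi in base]   # the values 102, 104, 106

# Shifting leaves s unchanged; multiplying by 2 doubles it.
print(stdev(base), stdev(shifted), stdev(scaled))
```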

Sample Median
To find the median, first sort the n data values in ascending order. For odd n, the median is the middle sorted data value. For even n, the median is the average of the middle 2 data values. MATLAB code: median(x)

Sample Quartiles
Idea: 25% of data lie below the first (lower) quartile Q1, and 25% of data lie above the third (upper) quartile Q3. In practice, different books/packages compute quartiles in different ways. MATLAB uses linear interpolation: quantile(x,[0.25 0.75]). The repeated median method is simple for hand calculations.
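To see how much the convention matters, the Python sketch below applies two of the quartile rules offered by Python's statistics module (version 3.8 or later) to the same small data set. These are Python's conventions, shown only as an illustration; MATLAB's quantile follows its own interpolation rule and need not agree with either.

```python
from statistics import quantiles

data = [2, 5, 6, 6, 9]

# Two interpolation conventions from Python's statistics module.
exclusive = quantiles(data, n=4, method="exclusive")
inclusive = quantiles(data, n=4, method="inclusive")

print("exclusive:", exclusive)
print("inclusive:", inclusive)
```

With only five observations the two rules disagree on both Q1 and Q3; the differences usually shrink as n grows.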

Repeated Median Method


The second quartile Q2 is the median.
First (lower) quartile Q1 is the median of the lower half of observations.
Third (upper) quartile Q3 is the median of the upper half of observations.
For odd n, leave Q2 out of each half.
e.g. For sorted data 2, 5, 6, 6, 9: Q2 = 6, so Q1 = (2 + 5)/2 = 3.5, Q3 = (6 + 9)/2 = 7.5
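The repeated median method translates directly into code. The following is a minimal Python sketch; the function name repeated_median_quartiles is illustrative.

```python
from statistics import median


def repeated_median_quartiles(data):
    """Return (Q1, Q2, Q3) using the repeated median method."""
    x = sorted(data)
    n = len(x)
    if n < 2:
        raise ValueError("need at least two data values")
    half = n // 2
    lower = x[:half]        # for odd n this leaves the median out
    upper = x[n - half:]    # same number of values from the top
    return median(lower), median(x), median(upper)


q1, q2, q3 = repeated_median_quartiles([2, 5, 6, 6, 9])
print(q1, q2, q3)
```

Together with the minimum and maximum, these three values are all that is needed for a box plot.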

Alternative Measures of Spread


Spread (variability) can be measured by:
Range = maximum − minimum (unreliable measure, depends on extreme values)
Standard deviation s (uses all data values but is inflated by outliers, i.e. unusually small or large values)
Interquartile range IQR = Q3 − Q1 (spans middle 50% of data, unaffected by outliers, ignores variation in tails)
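The different sensitivity of these three measures to an outlier can be illustrated numerically. The Python sketch below uses hypothetical data and Python's default quartile rule (statistics.quantiles), which is an assumption of this example and not necessarily the rule used elsewhere in the course.

```python
from statistics import quantiles, stdev


def spread_measures(data):
    """Return (range, standard deviation, IQR) for a list of numbers."""
    q1, _, q3 = quantiles(data, n=4)  # Python's default quartile rule
    return max(data) - min(data), stdev(data), q3 - q1


clean = [10, 12, 13, 15, 16, 18, 19, 21, 22]   # hypothetical data
with_outlier = clean[:-1] + [220]              # largest value replaced

print(spread_measures(clean))
print(spread_measures(with_outlier))
```

Here the range and standard deviation both jump when the single extreme value is introduced, while the IQR stays the same because the middle 50% of the data is untouched.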

Five Number Summary


The five number summary of a dataset provides a concise description of centre and spread, used to construct a box plot.
Minimum value
Q1
Median
Q3
Maximum value

Box Plots
Using an axis with appropriate scale, draw a box from Q1 to Q3, and mark position of median. Draw whiskers from Q1 to the minimum and from Q3 to the maximum. Outliers are sometimes shown separately. Gives a quick, easy comparison of 2 or more samples.

In MATLAB, use boxplot(x,g) to draw parallel box plots, where x contains data, g is a grouping variable
