Oct. 2017
Course Contents
Chapter I: Basic concepts in statistics (overview)
– What is Statistics?
– Methods of data collection
– Some Basic Terms in Statistics
– Sampling Techniques
– Criteria for the acceptability of a sampling method
Chapter II: Classification and presentation of statistical data
(overview)
– Scales of measurement and types of classification of variables
– Grouped frequency distribution
– Graphical presentation of data
Chapter III: Measures of central tendency and dispersion
– Measures of central tendency
– Measures of dispersion
– General shape of distributions
Chapter IV: Estimation of sample size
– Sample size determination with continuous data
– Sample size determination for proportions
Chapter V: Introduction to tests of hypotheses
– Review of hypothesis testing
– Data analysis
– Parametric and non-parametric statistics (tests)
Introduction
Definition and classifications of statistics:
Applications, Uses & Limitations of statistics
Applications of statistics:
Misuses of statistics
Many people, knowingly or unknowingly, use data in the wrong way:
• Unrepresentative or inadequate sample
• Unfair comparison
• Unwarranted conclusion:
– may result from making false assumptions.
– may result from using the wrong average. E.g., for two monthly incomes of 1,000,000 and 1,000, the arithmetic mean (500,500) describes neither income well.
• Suppression of unfavorable results: hiding unfavorable, though true, facts emerging from a statistical study
• Use of inefficient statistical models, or mistakes in arithmetic
Classification of statistics
Based on how statistical data are used, statistics is broadly divided into two groups: descriptive and inferential.
Descriptive statistics:
• Used to describe the basic features of the data in a study
• Provide simple summaries about the sample and the measures
• Ways of organizing and summarizing data
• Help to identify the general features and trends in a set of data and to extract useful information
• Also very important in conveying the final results of a study
Descriptive Statistics
• Collect data, e.g., survey
• Present data, e.g., tables and graphs
• Summarize data, e.g., sample mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Inferential statistics
• A method used to generalize from a sample to a population
• E.g., the average income of all families in Ethiopia (the population) can be estimated from figures obtained from a few thousand families (the sample)
• It is important because statistical data usually arise from a sample.
• Statistical techniques based on probability theory are required
Inferential Statistics
• Estimation, e.g., estimate the population mean using the sample mean (confidence interval)
• Hypothesis testing, e.g., test the claim that the population mean weight is 56 kg; comparison of two or more means or proportions
Role of statistics: using information from a sample to make inferences about the population.
Generalizability
If the sample is not representative of the population, the conclusions are restricted to the sample and cannot be generalized to the target population.
Key Definitions
Target population: A collection of items that have something in common and about which we wish to draw conclusions at a particular time. E.g., all financial offices in Ethiopia
• Defining the target population is an important and often difficult part of the study. For example, in a political poll, should the target population be all adults eligible to vote? All registered voters? All persons who voted in the last election?
• The choice of target population will profoundly affect the resulting statistics.
Study (sampled) population: The subset of the target population that has at least some chance of being sampled
Parameter and Statistic
21
Stages in statistical investigation
Interpretation
Inferential Statistics
Analysis of Data
Presentation
Descriptive Statistics
Organization
22
Formulating the problem
• Research begins with a problem or problems
• The problem need not be Earth-shaking
• We cannot study all subjects (e.g., all pregnant women) living in a given geographical area, so we must decide on:
– Sampling techniques
– Sample size calculation, study design
– Method of data collection
– Etc.
Methods of Data Collection
Data are facts or figures from which conclusions can be drawn.
To draw valid conclusions, it is important to have ‘good’ data.
Data are gathered to meet predetermined objectives.
The data form the foundation of statistical analyses, and hence must be collected carefully and accurately.
Data can be obtained from:
Routinely kept records, literature, surveys, experiments, reports, observation, etc.
Who needs information?
Governments, businesses, organizations, and individuals all need information in their day-to-day activities
Types of Data
Primary data: collected directly from the items or individual respondents by the researcher for the purpose of a study.
You collect the data yourself
The data you collect are unique to you and your research and, until you publish, no one else has access to them
Methods of collecting primary data: interviews, questionnaires, observation (measurement) and diaries
Secondary data: data that have already been collected by someone else or by an organization (e.g., researchers, institutions, other NGOs,…)
Some sources: official statistics, scholarly journals, reference books, research institutes, universities, libraries, library search engines, computerized databases and the World Wide Web.
Method of primary data collection
Questionnaire: a popular means of data collection
Written questions are mailed or hand-delivered to respondents
Difficult to design, and often requires many rewrites before an acceptable questionnaire is produced.
Advantages:
Can be used as a method in its own right or as a basis for
interviewing or a telephone survey.
Relatively cheap
Can be posted, e-mailed or faxed
Can cover wide geographic area, a large number of people or
organizations
Avoids embarrassment on the part of the respondent.
Possible anonymity of respondent.
No interviewer bias
Method of primary data collection:
Questionnaire
Disadvantages:
29
Primary data collection: Interviewing
is primarily used to gain an understanding of the underlying
reasons & motivations for people’s attitudes…
Interviews can be undertaken on a personal one-to-one basis
or in a group.
can be conducted at work, at home, on the street, in a
shopping center, or some other agreed location.
Advantages:
Serious approach by respondent resulting in accurate info.
Good response rate, completed and immediate.
Possible in-depth questions.
Interviewer in control and can give help if there is a problem.
Can investigate motives and feelings.
Can use recording equipment.
Uniformity of approach if a single interviewer is used.
Primary data collection: Interviewing
Disadvantages:
• Time consuming.
• Geographic limitations.
• Can be expensive.
• Need to set up interviews.
• Normally need a set of questions.
• Respondent bias– tendency to please or impress, create false
personal image, or end interview quickly.
• Embarrassment possible if personal questions.
• Transcription and analysis can present problems– subjectivity.
• If many interviewers, training is required!
31
secondary data collection
Disadvantages
• Quality of documentation
• Data quality control
• Level of observation
• Data availability
• Outdated data
32
Sampling Techniques
33
Why Sample?
Researchers often use sample survey methodology to
obtain information about a larger population by selecting
and measuring a sample from that population.
Universality
Detailedness
Representativeness
• Due to the variability in the characteristics of the population,
scientific sample designs should be applied to select a
representative sample.
36
[Figure: Sample, Information, Population]
It is essential that a sample be correctly defined and organized.
If the wrong questions are posed to the wrong people, reliable information will not be obtained, leading to wrong conclusions when the results are applied to the entire population.
Steps needed to select a sample and ensure
that this sample will fulfill its goals
39
Researchers are interested to know about factors associated
with ART use among HIV/AIDS patients attending certain
hospitals in a given Region
Sample
40
2. Define the target population
42
4. Set the level of precision
There is a level of uncertainty associated with estimates coming from a sample.
6. Prepare the sampling frame
A list of all members of the population from which the sample will be taken
The sample design
Sample design: how the sample will be collected.
Estimation techniques: how the results from the sample will
be extended to the whole population.
Measures of precision: how the sampling error will be
measured.
Other Considerations
• Sample size determination
• Questionnaire development
• Pretest
• Organization of the field work
• Data collection, Data entry
• Summary and analysis of the data (edit the completed questionnaires, decide on computation procedures)
Sampling
• Sampling: The process of selecting a portion of the
population to represent the entire population.
48
Errors in sampling
1) Sampling error: errors caused by the act of taking a sample. They cause sample results to differ from the results of a census.
– They cannot be avoided or totally eliminated.
– They can be reduced by selecting a larger sample
• Random sampling error – the deviation between the sample statistic and the population parameter caused by chance in selecting a random sample. The margin of error in a confidence statement includes only random sampling error.
2) Non-sampling error: errors not related to the act of selecting a
sample from the population. They can be present in a census.
- Observational error
- Respondent error
- Lack of preciseness of definition
- Errors in editing and tabulation of data
Errors in Sampling
• Most sample surveys are afflicted by errors other than random sampling error.
• These errors introduce bias, which can make a confidence interval basically meaningless.
• Good sampling technique includes reducing all sources of error.
• Part of this includes random sampling and confidence statements.
Sampling Errors
• Random sampling error
– Margin of error & confidence statement
• Bad sampling methods
– Voluntary response & convenience samples
• Under-coverage bias
– Occurs when some groups in the population are left out
of the process of choosing a sample.
– Limited sampling frame
– Homeless
– Subjects excluded who are in hospitals, motels, etc.
• Do a pilot survey
54
How to live with non-sampling errors
• Non-sampling errors, such as non-response, are always
there.
55
Sampling Methods
Two broad divisions:
56
A. Probability Sampling
57
• Probability sampling is:
– more complex,
– more time-consuming and
– usually more costly than non-probability sampling.
58
• There are several different ways in which a probability
sample can be selected.
59
Most common probability
sampling methods
60
1. Simple random sampling
Every member of the population has an equal chance of being
selected
61
To use a SRS method:
Computer programs
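As one illustration of drawing a simple random sample by computer, here is a minimal Python sketch. The frame of 400 numbered units, the sample size of 100, and the function name `simple_random_sample` are assumptions made for the example, not part of the course material.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n units without replacement so that every unit
    in the frame has the same chance of being selected."""
    rng = random.Random(seed)      # seed only makes the draw repeatable
    return rng.sample(frame, n)

# Hypothetical sampling frame: units numbered 1..400
frame = list(range(1, 401))
sample = simple_random_sample(frame, 100, seed=2017)
print(sorted(sample)[:10])         # first few selected unit numbers
```

A complete, numbered sampling frame is required before a draw like this is possible.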
62
SRS has certain limitations:
63
2. Systematic random sampling
• Often used instead of simple random sampling
• Individuals are taken at fixed intervals (every kth unit), based on the sampling fraction
Steps in systematic random sampling
1. Number the units on your frame from 1 to N (where N is the
total population size).
66
Example
• To select a sample of 100 from a population of 400, you
would need a sampling interval of 400 ÷ 100 = 4.
• Therefore, K = 4.
• You will need to select one unit out of every four units to end up with a total of 100 units in your sample.
• With a random start between 1 and 4, the four possible samples are:
A. 1, 5, 9, 13, ..., 393, 397
B. 2, 6, 10, 14, ..., 394, 398
C. 3, 7, 11, 15, ..., 395, 399
D. 4, 8, 12, 16, ..., 396, 400
• Each member of the population belongs to only one of the four
samples and each sample has the same chance of being
selected.
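The example above can be sketched in Python as follows. The function name `systematic_sample` and its parameters are hypothetical, and the sketch assumes that N is an exact multiple of n.

```python
import random

def systematic_sample(N, n, start=None, seed=None):
    """Select every kth unit from a frame numbered 1..N.
    Assumes N is an exact multiple of n, so k = N / n."""
    k = N // n                                      # sampling interval
    if start is None:
        start = random.Random(seed).randint(1, k)   # random start in 1..k
    return list(range(start, N + 1, k))[:n]

sample_a = systematic_sample(400, 100, start=1)   # sample A: 1, 5, 9, ...
sample_d = systematic_sample(400, 100, start=4)   # sample D: 4, 8, 12, ...
print(sample_a[:4], sample_a[-1])
print(sample_d[:4], sample_d[-1])
```

Only the starting point is random. Once it is chosen, the rest of the sample is fixed.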
69
Disadvantages of Systematic random
sampling
The main limitation of the method is that it becomes less
representative if we are dealing with populations having
“hidden periodicities”.
70
3. Stratified random sampling
• It is used when the population is known to be heterogeneous with regard to some factors, and those factors are used for stratification
• Using stratified sampling, the population is divided into homogeneous, mutually exclusive groups called strata, and independent samples are then selected from each stratum
• A population can be stratified by any variable that is available for all units prior to sampling (e.g., income (low, medium & high), age, sex, province of residence, etc.)
• Equal allocation:
– Allocate equal sample size to each stratum
• Proportionate allocation:
$n_j = \frac{n}{N} N_j$
– $n_j$ is the sample size of the jth stratum
– $N_j$ is the population size of the jth stratum
– $n = n_1 + n_2 + \cdots + n_k$ is the total sample size
– $N = N_1 + N_2 + \cdots + N_k$ is the total population size
Example: Proportionate Allocation
Village   A     B     C     D     Total
HHs       100   150   120   130   500
S. size   ?     ?     ?     ?     60
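One way to work this example is to apply the proportionate allocation formula (sample size for a stratum = n × stratum size / N) to each village. The sketch below is only an illustration: the function name is hypothetical, and rounding to whole households is an assumption, since rounded values do not always add up exactly to n.

```python
def proportionate_allocation(stratum_sizes, n):
    """Allocate a total sample of n to strata in proportion to their size."""
    N = sum(stratum_sizes.values())
    return {name: n * N_j / N for name, N_j in stratum_sizes.items()}

households = {"A": 100, "B": 150, "C": 120, "D": 130}   # N = 500
exact = proportionate_allocation(households, 60)
rounded = {name: round(value) for name, value in exact.items()}

print(exact)     # unrounded allocations
print(rounded)   # whole households per village
```

Here the rounded allocations add up to 60. In general the total should be checked after rounding.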
75
4. Cluster sampling
• Is preferable when the population is subdivided into groups or clusters that are internally heterogeneous and similar to one another
• Sometimes it is too expensive to carry out SRS
– Population may be large and scattered.
– A complete list of the study population may be unavailable
– Travel costs can become high if interviewers have to survey people from one end of the country to the other.
• Cluster sampling is the method most widely used to reduce cost
• Ideally, each cluster is internally heterogeneous (a small-scale picture of the population), unlike stratified sampling, where each stratum is internally homogeneous
Steps in cluster sampling
• Cluster sampling divides the population into groups
or clusters.
Advantages
• Cost reduction
• It creates 'pockets' of sampled units instead of spreading the
sample over the whole territory.
• Sometimes a list of all units in the population is not available,
while a list of all clusters is either available or easy to create.
78
Disadvantages
• Creates a loss of efficiency when compared with SRS.
79
5. Multi-stage Sampling
• Similar to the cluster sampling, except that it involves
picking a sample from within each chosen cluster, rather
than including all units in the cluster.
80
Woreda (primary sampling unit, PSU) → Kebele (secondary sampling unit, SSU) → Sub-Kebele (tertiary sampling unit, TSU) → Household (HH)
• In the first stage, large groups or clusters are identified
and selected. These clusters contain more population
units than are needed for the final sample.
• In the second stage, population units are picked from
within the selected clusters (using any of the possible
probability sampling methods) for a final sample.
• cost reduction.
• saves a great amount of time and effort.
85
Most common types of non-
probability sampling
1. Convenience or haphazard sampling
2. Volunteer sampling
3. Judgment sampling
4. Quota sampling
5. Snowball sampling technique
86
1. Convenience or haphazard sampling
88
2. Volunteer sampling
• As the term implies, this type of sampling occurs when
people volunteer to be involved in the study.
• In psychological experiments or pharmaceutical trials
(drug testing), for example, it would be difficult and
unethical to enlist random participants from the general
public.
• In these instances, the sample is taken from a group of
volunteers.
• Sometimes, the researcher offers payment to attract
respondents.
90
3. Judgment sampling
• This approach is used when a sample is taken
based on certain judgments about the overall
population.
93
Quota sampling is:
generally less expensive than random sampling.
96
Non-Probability Sampling: Inherent concerns related to
generalizability and representation
97
Variable
Variable: an attribute or characteristic that may take on different values for different persons, places,…
Types of Variable/Data
Variable/Data
• Categorical/Qualitative (defined categories or groups), e.g., marital status, registered to vote?, region
– Nominal data: categories with no ordering or direction, e.g., ethnic group
– Ordinal data: ordered categories (rankings, order, or scaling), e.g., response to treatment
• Numerical/Quantitative
– Discrete (counted items), e.g., number of children, defects per hour
– Continuous (measured characteristics), e.g., weight, height
Exercise 2
What type of variable is each of the following?
a) Region
b) Blood group
c) Health status: very sick, sick and cured
d) Age of an employee in a company
e) Student mark
f) No. of movies seen this summer
g) Income
h) Income class (poor, medium, rich)
i) Test result (negative, positive)
• Quantitative or categorical?
• Continuous or discrete?
• Nominal, ordinal, interval or ratio scale?
Assignment
2. Match the scales of measurement with their permissible arithmetic operations
Recap: Why is the level of measurement important?
Methods of Data Organization
and Presentation
105
Data Organization
Data in raw form are usually not easy to use for decision
making
Graph
106
Frequency Distributions
(Tables)
The actual summarization and organization of data starts
from frequency distribution.
• For nominal and ordinal data, frequency distributions are
often used as a summary.
• Example:
109
Select a set of contiguous, non-overlapping intervals such that each value can be placed in one, and only one, of the intervals.
The first consideration is how many intervals to include.
A common rule of thumb states that there should be no fewer than six intervals and no more than 15.
To determine the number of class intervals and the corresponding width, we may use Sturges' rule:
$K = 1 + 3.322 \log_{10} n$ and $W = \frac{L - S}{K}$
where
K = number of class intervals, n = number of observations
W = width of the class interval, L = the largest value
S = the smallest value
Example: A manufacturer of insulation randomly selects 20 winter
days and records the daily high temperature:
24, 35, 17, 21, 24, 37, 26, 46, 58, 30, 32, 13, 12, 38, 41, 43, 44, 27, 53, 27
Solution:
1. Sort raw data in ascending order:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
2. Find range: 58-12 = 46
3. Select number of classes (K): K = 1 + 3.322 log10(20) = 5.32 ≈ 5
4. Compute interval width: 46/5 = 9.2, rounded up to 10
5. Determine interval boundaries: 10 but less than 20, 20 but less than 30, . . . , 50 but less than 60
6. Count observations & assign to classes
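The steps of this example can be sketched in Python as below. The sketch assumes five classes of width 10 starting at 10 (10 to under 20, ..., 50 to under 60). The width and lower limit are entered by hand because choosing a convenient rounded width is a judgment call, and all variable names are illustrative.

```python
import math

temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]

# Step 3: number of classes from Sturges' rule, rounded to a whole number
k = round(1 + 3.322 * math.log10(len(temps)))
# Step 2: range of the data
data_range = max(temps) - min(temps)
# Steps 4-5: width (range / k = 9.2, rounded up to 10) and first lower limit
width = 10
lower = 10

# Step 6: tally each observation into its class
counts = [0] * k
for t in temps:
    counts[(t - lower) // width] += 1

for i, c in enumerate(counts):
    lo = lower + i * width
    print(f"{lo} but less than {lo + width}: {c}")
```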
112
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
113
Exercise 3
Construct a grouped data frequency distribution for Leisure
time (hours) per week for 40 college students:
23 24 18 14 20 36 24 26 23 21 16 15 19 20 22 14 13 10 19 27
29 22 38 28 34 32 23 19 21 31 16 28 19 18 12 27 15 21 25 16
114
Cumulative frequency: the sum of the frequencies of a class and all classes below it.
Cumulative relative frequency: the percentage of the total number of observations that have a value either in that interval or below it.
For the leisure-time data:
K = 1 + 3.322 log10(40) = 6.32 ≈ 6
Maximum = 38, Minimum = 10
Width = (38 − 10)/6 = 4.67 ≈ 5
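As a small illustration of cumulative and cumulative relative frequencies, the sketch below uses the class counts 3, 6, 5, 4, 2. These come from tallying the 20 daily high temperatures of the earlier example into classes of width 10 starting at 10, which is an assumption about how the classes are drawn.

```python
from itertools import accumulate

counts = [3, 6, 5, 4, 2]            # class frequencies, lowest class first

# Running total of the frequencies
cumulative = list(accumulate(counts))
total = cumulative[-1]
# Share of all observations at or below each class, in percent
cumulative_pct = [100 * c / total for c in cumulative]

print(cumulative)
print(cumulative_pct)
```

The last cumulative frequency always equals the number of observations, and the last cumulative relative frequency is always 100%.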
116
Importance of diagrammatic
representation
Diagrams have greater attraction than mere figures
They give quick overall impression of the data
They are easier to remember than mere figures
They facilitate comparison
Used to understand patterns and trends
Well designed graphs can be powerful means of
communicating a great deal of information
When graphs are poorly designed, they not only convey the message ineffectively but are often misleading.
Specific types of graphs are used for categorical variables and for numerical variables:
• Bar charts and Pie charts are often used for qualitative
(category) data
[Figure: bar chart “Hospital Patients by Unit”, showing the number of patients by hospital unit (Cardiac Care, Emergency, Intensive Care, Maternity, Surgery); e.g., Cardiac Care: 1,052 patients]
Data presentation using
Histogram
120
Scatter Plot
Scatterplots (for quantitative variables)
plot response variable on vertical axis,
explanatory variable on horizontal axis
Descriptive Statistics: Numerical
Summary Measures
Single numbers that quantify the characteristics of a
distribution of values
Measures of central tendency (location)
Measures of dispersion
122
Describing Data Numerically
[Figure: overview of numerical summary measures; surviving labels: Mode, Variance, Standard Deviation, Coefficient of Variation]
Measures of Central Tendency
Overview
• Mean: arithmetic average, $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
• Median: midpoint of ranked values
• Mode: most frequently observed value
Arithmetic Mean
The arithmetic mean (mean) is the most common
measure of central tendency
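For a population of size N:
$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N}$
where $x_1, \dots, x_N$ are the population values and N is the population size.
For a sample of size n:
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$
where $x_1, \dots, x_n$ are the observed values.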
Properties of Arithmetic Mean
The most common measure of central tendency
Easy to calculate and understand (simple).
For a given set of data there is one and only one
arithmetic mean (uniqueness).
Influenced by each and every value in a data set
Greatly affected by the extreme values (outliers).
126
Weighted Mean
The weighted mean is a special type of arithmetic mean, used when each value has its own weight.
Some of the observations in a data set may have greater importance than others.
$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{w_1 + w_2 + \cdots + w_n}$
Weighted Mean
Example: An entrance exam for a job consists of 25% English,
50% Mathematics, 5% Typing and 20% Accounting. If an
applicant who took the entrance exam scored 48% in English,
35% in Mathematics, 80% in Typing and 50% in Accounting,
his average score is:
$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} = \frac{0.25(48) + 0.50(35) + 0.05(80) + 0.20(50)}{0.25 + 0.50 + 0.05 + 0.20} = 43.5$
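The entrance-exam calculation can be checked with a short Python sketch. The function name `weighted_mean` is illustrative, and the weights are the subject shares given in the example.

```python
def weighted_mean(values, weights):
    """Sum of weight * value, divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

scores = [48, 35, 80, 50]             # English, Mathematics, Typing, Accounting
weights = [0.25, 0.50, 0.05, 0.20]    # share of each subject in the exam

average = weighted_mean(scores, weights)
print(round(average, 1))
```

Because Mathematics carries half the weight, the low Mathematics score pulls the weighted average below the unweighted mean of the four scores (53.25).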
[Figure: two dot plots on an axis from 0 to 10, each with Median = 3]
Note that $\frac{n+1}{2}$ is not the value of the median, only the position of the median in the ranked data
Properties of median
There is only one median for a given set of data
(uniqueness)
The median is easy to calculate
Median is a positional average and hence it is insensitive
to very large or very small values
Median can be calculated even in the case of open end
intervals
It is determined mainly by the middle values and makes little use of the remaining data points (weakness).
Mode
[Figure: two dot plots on an axis from 0 to 14, one with no mode and one with Mode = 9]
Examples: Compute mode for the following data
sets:
1. 1, 2, 3, 4, 4, 4, 4, 5, 5, 6
• Mode is 4 “Unimodal”
2. 1, 2, 2, 2, 3, 4, 5, 5, 5, 6, 6, 8
• There are two modes – 2 & 5
• This distribution is said to be “bi-modal”
3. 2.62, 2.75, 2.76, 2.86, 3.05, 3.12
• No mode, since all the values are different
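The three cases above can be reproduced with a short Python sketch. It follows the convention used here that a data set in which every value occurs only once has no mode; the helper name `modes` is hypothetical.

```python
from collections import Counter

def modes(data):
    """Return the most frequent value(s); an empty list means no mode."""
    counts = Counter(data)
    top = max(counts.values())
    if top == 1:                      # every value occurs once: no mode
        return []
    return sorted(v for v, c in counts.items() if c == top)

print(modes([1, 2, 3, 4, 4, 4, 4, 5, 5, 6]))           # unimodal
print(modes([1, 2, 2, 2, 3, 4, 5, 5, 5, 6, 6, 8]))     # bi-modal
print(modes([2.62, 2.75, 2.76, 2.86, 3.05, 3.12]))     # no mode
```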
135
Class Exercise
Annual per capita carbon dioxide emissions (metric tons) for the n = 8 nations with the largest populations
Bangladesh 0.3, Brazil 1.8, China 2.3, India 1.2, Indonesia 1.4,
Pakistan 0.7, Russia 9.9, U.S. 20.1
Compute the mean and median of the carbon dioxide emissions data.
Which one is the best measure of central tendency? Why?
Ordered sample: 0.3, 0.7, 1.2, 1.4, 1.8, 2.3, 9.9, 20.1
Median = (1.4 + 1.8)/2 = 1.6
Mean = (0.3 + 0.7 + 1.2 + … + 20.1)/8 = 4.7
The mean is sensitive to outliers, so the median is often preferred for highly skewed distributions such as this one.
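The same calculation can be checked with Python's standard `statistics` module. This is a minimal sketch; the dictionary is only a convenient way of holding the eight values from the exercise.

```python
import statistics

emissions = {
    "Bangladesh": 0.3, "Brazil": 1.8, "China": 2.3, "India": 1.2,
    "Indonesia": 1.4, "Pakistan": 0.7, "Russia": 9.9, "U.S.": 20.1,
}
values = list(emissions.values())

mean = statistics.mean(values)        # pulled upward by the two large values
median = statistics.median(values)    # average of the 4th and 5th ranked values

print(round(mean, 1), round(median, 1))
```

Six of the eight countries lie below the mean, which is why the median describes a typical country better here.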
137
Exercise
Measures of Dispersion
Measures that quantify the variation or dispersion of a set of
data from its central location
Dispersion refers to the variety exhibited by the values of the data.
Measures of variation give information on the spread or variability of the data values.
The dispersion is small when the values are close together.
If all the values are the same, there is no dispersion.
Why measures of Dispersion
The measures of dispersion are helpful in statistical
investigation
141
[Figure: dot plots of three data sets (A, B, C) on a common axis from 11 to 21]
Comparing standard deviation
Data A: Mean = 15.5, s = 3.338
Data B: Mean = 15.5, s = 0.926
Data C: Mean = 15.5, s = 4.570
Measures of Dispersion
Variation
Same center,
different variation
144
Range (R)
• Simplest measure of variation
• Difference between the largest and the smallest observations
in a sample
• Range = Maximum value – Minimum value
• Example –Compute range of the following dataset:
Data values: 5, 9, 12, 16, 23, 34, 37, 42
– Range = 42-5 = 37
145
Properties of range
It is the simplest, crudest measure and is easily understood
It is sensitive to outliers, because it uses only the two extreme values:
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
Variance ($\sigma^2$, $s^2$)
• first used by Karl Pearson in 1893
Sample variance:
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
where $\bar{x}$ = arithmetic mean, n = sample size, and $x_i$ = ith value of the variable X
Population variance:
$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
where $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$ is the population mean.
Standard deviation ($\sigma$, s)
• It is the square root of the variance
$\sigma = \sqrt{\sigma^2}$ and $s = \sqrt{s^2}$
Sample Standard Deviation computation
Following are the survival times of 11 patients after heart
transplant surgery. Calculate their sample variance and SD.
150
Exercise 3
151
Sample Standard Deviation computation
Sample data ($x_i$): 10 12 14 15 17 18 18 24
n = 8, mean $\bar{x}$ = 16
$s = \sqrt{\frac{(10-16)^2 + (12-16)^2 + \cdots + (24-16)^2}{8-1}} = \sqrt{\frac{130}{7}} = 4.3095$
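The computation for these eight values can be laid out step by step in Python. This is a minimal sketch with illustrative variable names; it divides by n − 1, as in the sample variance formula.

```python
import math

data = [10, 12, 14, 15, 17, 18, 18, 24]
n = len(data)

mean = sum(data) / n                              # arithmetic mean
squared_devs = [(x - mean) ** 2 for x in data]    # squared deviations from the mean
sum_sq = sum(squared_devs)
variance = sum_sq / (n - 1)                       # sample variance
sd = math.sqrt(variance)                          # sample standard deviation

print(mean, sum_sq, round(variance, 4), round(sd, 4))
```

Adding up the squared deviations directly (36 + 16 + 4 + 1 + 1 + 4 + 4 + 64) is a useful hand check on the numerator.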
152
Properties of SD
SD is considered to be the best measure of dispersion and is
used widely because of the properties of the theoretical
normal curve
Each value in the data set is used in the calculation
Values far from the mean are given extra weight
(because deviations from the mean are squared)
The SD has the advantage of being expressed in the same
units of measurement as the mean
However, if the units of measurement of two data sets are not the same, their variability cannot be compared by comparing the values of the SD.
Coefficient of variation (CV)
When two data sets have different units of measurement, or their means differ substantially in size, the CV should be used as a measure of dispersion
It is the best measure for comparing the variability of two series or sets of observations
Can be used to compare two or more sets of data measured in different units
Measures relative variation: shows variation relative to the mean
$CV = \frac{s}{\bar{x}} \times 100\%$
Data with a smaller coefficient of variation are considered more consistent (less dispersed)
Comparing CV
Stock A:
Average price last year = $50
Standard deviation = $5
$CV_A = \frac{s}{\bar{x}} \times 100\% = \frac{\$5}{\$50} \times 100\% = 10\%$
Stock B:
Average price last year = $100
Standard deviation = $5
$CV_B = \frac{s}{\bar{x}} \times 100\% = \frac{\$5}{\$100} \times 100\% = 5\%$
Both stocks have the same standard deviation, but stock B is less variable relative to its price
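The stock comparison can be written as a short Python sketch. The function name is illustrative, and the inputs are the standard deviation and mean price given for each stock.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation as a percentage of the mean."""
    return 100 * sd / mean

cv_a = coefficient_of_variation(5, 50)     # Stock A
cv_b = coefficient_of_variation(5, 100)    # Stock B

print(cv_a, cv_b)
print("More consistent:", "Stock A" if cv_a < cv_b else "Stock B")
```

Because the CV is a ratio, the units cancel, which is what allows data measured in different units to be compared.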
Comparing CV
              SD         Mean        CV (%)
SBP           15 mmHg    130 mmHg    11.5
Cholesterol   40 mg/dl   200 mg/dl   20.0
Standard Score
A standard score for a sample value in a data set is obtained by subtracting the mean of the data set from the value and dividing the result by the standard deviation of the data set.
$Z = \frac{X - \bar{X}}{S}$
Basically, the standard score (z-score) tells us how many
standard deviations a specific value is above or below the
mean value of the data set.
i.e. the z-score is the number of standard deviations the data
value falls above (positive z-score) or below (negative z-
score) the mean for the data set.
158
Standard Score
Ex. Suppose a student scored 65% in a statistics test and 70%
in mathematics test. In which subject did he perform better?
Exercise: what is the Z-score for the value of 14 in the following
sample data set?
3 8 6 14 4 12 7 10
$Z = \frac{X - \bar{X}}{S} = \frac{14 - 8}{3.8173} = 1.57$
The data value of 14 is located 1.57 standard deviations above
the mean 8 because the z-score is positive.
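The z-score in this exercise can be verified with Python's `statistics` module, whose `stdev` function uses the sample (n − 1) formula. The variable names are illustrative.

```python
import statistics

data = [3, 8, 6, 14, 4, 12, 7, 10]

mean = statistics.mean(data)      # sample mean
s = statistics.stdev(data)        # sample standard deviation (divides by n - 1)
z = (14 - mean) / s               # standard score of the value 14

print(mean, round(s, 4), round(z, 2))
```

A value below the mean, such as 4, would give a negative z-score by the same formula.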
160
General Shape of Distributions
Histograms and box plots can be quite useful in suggesting
the shape of a probability distribution.
161
Eg: Weight, Height, IQ, etc.
General Shape of Distributions
For a distribution that is skewed right (positively Skewed), the bulk
of the data values (including the median) lie to the left of the
mean, and there is a long tail on the right side.
Bangladesh 0.3, Brazil 1.8, China 2.3, India 1.2, Indonesia 1.4,
Pakistan 0.7, Russia 9.9, U.S. 20.1. Compute a measure of
central value.
Ordered sample: 0.3, 0.7, 1.2, 1.4, 1.8, 2.3, 9.9, 20.1
1 outlier
165
Identifying Outliers
166
Remedial Measures for Outliers
If outliers exist, their potentially large squared errors may have a
strong influence on the fitted model (regression line)
Be sure to examine your data graphically for outliers and
extreme points
Decide, based on your model and logic, whether the extreme points should remain or be treated differently.
Outliers
1. Check whether these are simply incorrectly recorded data.
2. Fit the model with and without the outlier.
– Do the results change much?
– If not, report the results including the outlier, but note that it is present.
– If the results do change substantially, report both.
3. Use a robust estimation procedure.