3/1/2010

1
Biostatistics
Samson G/Medhin, MPH
Course Objective
General Objectives:
To acquaint students with the basic and intermediate
statistical concepts and tools for collecting, analyzing,
presenting and drawing conclusions from data.
2
Course Objective
Specific objectives:
At the end of the course students will be able to:
Describe the scope and application of statistics;
Be acquainted with the types of variables and scales of
measurement;
Describe data with appropriate diagrammatic and
numeric summary techniques;
Understand the basic rules of probability and their
statistical application in health sciences;
Comprehend different sources of health and
demographic data and appreciate their respective
advantages and disadvantages;
3
Course Objective Cont
Understand the basic sampling techniques;
Calculate optimal sample size for different types of
studies;
Calculate and interpret confidence intervals;
Carry out hypothesis testing about different statistical
parameters;
Understand and apply intermediate statistical methods
including correlation, linear regression, logistic
regression and ANOVA;
Carry out exploratory data analysis using SPSS;
Understand and interpret statements in published
articles pertaining to statistics.
4
Time Schedule
Time Schedule.doc
5
Mode of Evaluation
Mid 35%
Final 40%
Assignments/Quiz 10%
Term paper 15%
6
References
1. M. Pagano and K. Gauvreau. Principles of Biostatistics, 2nd ed.,
Duxbury Thomson Learning, 2000.
2. T. Colton. Statistics in Medicine, Lippincott Williams & Wilkins
Publisher, 1974.
3. B. Rosner. Fundamentals of Biostatistics, 6th ed., Thomson
Books, 2006.
4. M. Bland. An Introduction to Medical Statistics, 5th ed., Oxford
Medical Publications, 1993.
5. W. Daniel. Biostatistics: A Foundation for Analysis in Health
Sciences, 8th ed., John Wiley and Sons Inc, 2005.
6. S. Landau and B. S. Everitt. Handbook of Statistical Analyses
using SPSS, Chapman & Hall/CRC, 2004.
7
Introduction
What is Statistics?
Statistics is a field of study concerned with the collection,
organization and summarization of data, and drawing of
inferences about a body of data when only part of the data
is observed.
It is concerned with:
Designing experiments and data collection,
Summarizing information to aid understanding,
Drawing conclusions from data,
Estimating the present and predicting the future based
on Statistical evidence.
9
What is Statistics?
Mathematical statistics: Concerned with the development
of new methods of statistical inference; requires
detailed knowledge of abstract mathematics.
Applied statistics: Involves applying the methods of
mathematical statistics to specific subject areas.
Biostatistics is the application of statistical methods to
biological phenomena.
10
What is Statistics cont
In clinical medicine and public health, statistics can be applied to:
Determine the accuracy of measurements,
Compare measurement techniques,
Assess diagnostic tests,
Determine normal values,
Estimate prognosis,
Compare the efficacy of treatment techniques,
Determine the prevalence of an event,
Identify determinants of health problems,
Compute an adequate sample size for studies,
etc.
11
Statistical Data
Refers to the numerical description of things in the
form of counts or measurements.
Though statistical data always involve numeric
description, not all numeric descriptions are statistical
data.
Statistical data should have the following characteristics:
They must be in aggregate,
They must be affected to a marked extent by multiple causes,
They must be collected in a systematic manner,
They must be estimated with reasonable accuracy,
They must be placed in relation to each other.
12
Classification of Statistics
Descriptive Statistics: The methodology of effectively
collecting, organizing and describing data.
Inferential Statistics: Includes:
Inductive Statistics: The process of drawing
conclusions about unknown characteristics of a
population based on a sample.
Predictive Statistics: The process of predicting the future
based on historical data.
13
Classification Cont..
Based on the assumptions underlying the analysis,
statistics (statistical methods) can be classified as:
Parametric statistics: A branch of statistics that
assumes the data come from a particular type of probability
distribution and makes inferences about the data based
on that distribution.
Non-parametric statistics: Interpretation does not depend
on the population fitting any particular distribution.
14
Rationale of Studying
Statistics
Enables us to organize information in a formal manner.
Issues in science are becoming more and more
quantitative.
Statistics is extensively used in the medical literature.
The planning, conduct and implementation of medical
and public health research are highly reliant on statistical
methods.
There is a great deal of intrinsic variation in most
biological processes.
15
Possible Limitations of
Statistics
It mainly deals with variables which can be quantified.
It deals with aggregates of facts; it may not give information
about individuals.
Highly reliant on cutoff points.
Analysis is done based on multiple assumptions.
Errors are possible in statistical decisions.
16
Types of Variables
A variable is any characteristic of a study unit (for example,
an individual) that is measurable and/or classifiable
and can take different values for different units.
Depending on whether they can be quantified, variables are classified as
qualitative or quantitative.
Qualitative (Categorical) Variable: a characteristic
which cannot be measured in quantitative form but can
be identified by names or categories. Examples:
religion, ethnicity, illness status (well or ill), treatment
outcome (improved or not improved), stage of breast
cancer (I, II, III, IV), etc.
17
Types of Variables Cont
Quantitative Variable: a characteristic that can be
measured and expressed numerically.
It can be of two types:
Discrete Quantitative Variable:
Can take only a countable set of values (usually whole
numbers).
Example: number of children, number of episodes of illness.
Continuous Quantitative Variable:
Measured on a continuous scale.
It can assume an infinite number of values between any two given
values.
Example: height, weight, age, blood sugar level.
18
Scale of Measurement
In clinical medicine and public health as in many other
areas of science, we typically assign numbers to various
attributes of people, objects, or concepts.
This process is known as measurement.
The process of measurement involves assigning
numbers to observations according to rules.
The way that the numbers are assigned determines the
scale of measurement.
Four scales of measurement are typically discussed
here.
19
Scale of Measurement Cont
Nominal Scale:
Is the lowest scale of measurement.
Numbers are assigned to categories as "names"
arbitrarily.
Therefore, the only number property of the nominal scale
of measurement is identity.
For example classifying people according to gender is a
common application of a nominal scale. We may assign
number "1" to "male" and number "2" to "female" or the
opposite. The only mathematical operation we can
perform with nominal data is to count.
20
Scale of Measurement Cont
Ordinal Scale:
Ordinal scale has the property of magnitude.
It assigns each measurement to one of a limited number
of categories that are ranked in terms of graded order.
However the interval between the categories is not
necessarily equal.
Example: Cancer stage, rank in a race.
21
Scale of Measurement Cont
Interval Scale:
Interval scale has the property of equal intervals between values.
It does not have a true zero point; the number "0" is
arbitrary.
Consequently, the ratio between two values on an interval scale
does not have a meaningful interpretation.
E.g., in measuring temperature on the °C scale, we can
always be confident that the distance between 25 °C and
35 °C is the same as the distance between 65 °C and 75 °C.
However, 0 °C does not mean there is no temperature.
Similarly, it would be inappropriate to say that 60 °C
is twice as hot as 30 °C.
22
Scale of Measurement Cont
Ratio Scale:
Ratio scale of measurement has the property of equal
interval between values and absolute/true zero.
These properties allow us to apply all mathematical
operations (addition, subtraction, multiplication, and
division) in data analysis.
The absolute/true zero allows us to know how many
times greater one case is than another.
23
Data Collection Method
In order to draw valid conclusions from data,
information has to be collected in a systematic manner.
A haphazardly collected dataset is less likely to produce
valuable and generalizable information.
Data may be derived from several sources.
Depending on the source, it can be classified as Primary
or Secondary data.
Primary data is gathered for the first time by the
researcher for a given purpose; while,
Secondary data is data already collected by others, for
purposes other than the question of the research at hand.
24
Data Collection Method
Cont
Survey through interview:
A quantitative approach in which a standardized
questionnaire, to be administered through interview, is
used to collect information.
Advantage:
Quick and inexpensive,
Responses from different respondents are comparable,
Easy to quantify and analyze,
Useful in describing quantifiable characteristics of a
large population,
25
Data Collection Method
Cont
Very large and representative samples are feasible,
Standardized questions make measurement more
precise,
Participants do not need to be able to read and write
to respond,
Disadvantage:
Does not give qualitative information,
Does not give an opportunity to probe and explore,
Relatively inflexible,
Less reliable for assessing the behavior and attitudes of
respondents.
26
Data Collection Method
Cont
Survey through self administered questionnaire:
A quantitative method in which a standardized
questionnaire, to be filled by the respondents
themselves, is used.
Advantage:
Quick and inexpensive,
Responses from different respondents are comparable,
Useful in describing quantifiable characteristics of a large
population,
Very large and representative samples are feasible,
Standardized questions make measurement more
precise.
27
Data Collection Method
Cont
Disadvantage:
Participants need to be able to read and write to
respond,
High non-response rate,
Does not give qualitative information,
Does not give an opportunity to probe and explore,
Less reliable for assessing the behavior and attitudes of
respondents,
Relatively inflexible.
28
Data Collection Method
Cont
Secondary data:
A quantitative approach which utilizes data already
collected by others.
Advantage:
Less resource- and time-consuming,
Disadvantage:
May not give in-depth information,
No knowledge of the accuracy of data collection,
Can be outdated,
Limited control on the sampling method and size,
Less likely to give qualitative information.
29
Data Collection Method
Cont
Focus Group Discussion (FGD):
A qualitative method to obtain in-depth information on
concepts and perceptions about a certain topic through
spontaneous group discussion of approximately 6-12
persons, guided by a facilitator.
Advantage:
Excellent approach to gather information on in-depth
attitudes, and beliefs of a group,
Group dynamics might generate more ideas than
individual interviews,
Provides an excellent opportunity to probe & explore,
Participants are not required to read or write,
30
Data Collection Method
Cont
Unearths sensitive issues which are not commonly raised
by individuals,
Facilitates the exploration of collective memories.
Disadvantage:
Requires a strong facilitator to guide the discussion and
ensure participation by all members,
Does not give quantitative information,
It is difficult to organize the discussion,
Analysis is relatively difficult.
31
Data Collection Method
Cont
In-depth interview:
A qualitative method that relies on person to person
discussion.
Advantage:
Good approach to gather in-depth attitudes and
beliefs from individual respondents,
Provides an excellent opportunity to probe and
explore,
Participants do not need to be able to read and write to
respond,
Assures privacy,
32
Data Collection Method
Cont
Disadvantage:
Does not give quantitative information,
It is time-consuming,
The respondent may feel like a bug under a
microscope,
The analysis is relatively difficult.
33
Data Collection Method
Cont
Observation:
A qualitative method that involves critical observation
and recording the practice (behavior, culture) of
individuals or a group.
Advantage:
Excellent approach to discover behaviors,
Disadvantage:
Usually takes a longer time,
Liable to observer bias.
34
Designing Questionnaire
Most of the data collection techniques utilize
questionnaires.
Hence, the quality of the data depends on how well
the questionnaire is designed.
There are two main objectives in designing a
questionnaire:
To obtain accurate and relevant information for the study,
To maximize the response rate.
35
Designing Questionnaire
Cont
A questionnaire can be classified based on different issues:
Structured Vs Non-structured Questionnaire:
The structured one is mainly designed for surveys.
A series of questions is arranged in a logical order and
sequence and divided into subtopics.
Skip patterns are important in a structured questionnaire.
The data collector is expected to go smoothly through the
sequence.
The non-structured one is commonly used for qualitative
studies.
It does not have a strict sequence of questions.
The data collector may rearrange the questions depending
on the response of the subject.
36
Designing Questionnaire
Cont
Open ended Vs Close ended Questionnaire
(Question):
Open ended questions permit free responses that should
be recorded in the respondent's own words.
They allow exploration of the range of possible themes.
Close ended questions offer a list of possible options or
answers from which the respondents must choose.
They are relatively easy and quick to fill, code, analyze and
report.
37
Designing Questionnaire
Cont
Standardized Vs Non-standardized Questionnaire:
A standardized questionnaire is developed by a well known
body and is considered the standard for assessing a given
research question.
A non-standardized one is developed by the researcher to
address the research question.
What are the advantages and disadvantages of using
a standardized questionnaire?
38
Steps in Designing a
Questionnaire
1. Developing Individual Questions:
Use short and simple sentences.
Ask for only one piece of information at a time.
Ask precise questions to address the objective of the
study.
Give extra attention to sensitive questions.
Avoid leading questions.
2. Format of responses: Questions should be formatted
into open or closed formats depending on the need.
39
Steps Cont
3. Arranging the Questions:
Go from general to particular.
Go from easy to difficult.
Go from factual to abstract.
Start with closed questions.
Start with demographic and personal questions.
4. Piloting and Evaluation of Questionnaire.
Given the complexity of designing a questionnaire, it is
impossible, even for experts, to get it right the first
time round.
Questionnaires must be pretested (piloted) on a small
sample of people similar to those in the survey.
40
Diagrammatic
Summarization
Introduction
Data collection yields a set of data called raw data.
The size of the data can range from a few hundred to
many thousands of observations.
Raw data, however, do not necessarily provide
information that can easily be interpreted.
Data presentation is a mechanism which enables easier
understanding of a given set of data through the use of
tables and graphs.
In data summarization some of the detail in the data is
lost, but this is compensated for by a gain in
understanding of the data.
42
Tables
The simplest means of data presentation, which can be used
for all types of data.
Frequency Distribution
One type of information that is commonly used to organize
data in tables is the frequency distribution.
For nominal or ordinal data, the frequency distribution
consists of a set of categories along with the numeric
counts that correspond to each one.
Example:
43
Tables Cont
Table 2.1: Ethnicity Composition of Women of Reproductive age in
Awassa Town, Jan 2006.
Ethnic Group    Frequency
Wolita          377
Amhara          355
Sidama          163
Oromo           144
Guragae         138
Kenbata         82
Tigray          47
Hadya           20
Others          50
Total           1376
44
Tables Cont
In displaying numeric data using a frequency distribution
we should note the following:
The range of values must be broken down into a series
of distinct and non-overlapping intervals.
The intervals should cover all data points.
Intervals are often constructed, though not necessarily,
so that all have equal width. This facilitates comparison
among classes.
Open ended intervals should be avoided.
The limits for each class must agree with the accuracy of
the raw data.
45
Tables Cont
An appropriate number of intervals should be chosen:
too many intervals do not summarize much, and too
few intervals lose a great deal of information.
As a rule of thumb, the number of classes should
be between 10 and 20.
When we do not have any evidence to decide the number of
classes, we can use Sturges' Formula:
$$\text{No of classes} = 1 + 3.322 \log_{10}(\text{no of observations})$$
The width of each class can also be calculated as:
$$\text{Width of the class} = \frac{\text{Max value} - \text{Min value}}{\text{No of classes}}$$
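The two formulas above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and rounding the number of classes up to the next whole number is an assumption (the slide does not say how to round).

```python
import math


def sturges_classes(n_observations: int) -> int:
    """Number of classes suggested by Sturges' formula.

    Assumption: the result is rounded UP to a whole number of classes.
    """
    if n_observations < 1:
        raise ValueError("need at least one observation")
    return math.ceil(1 + 3.322 * math.log10(n_observations))


def class_width(min_value: float, max_value: float, n_classes: int) -> float:
    """Width of each class: (max value - min value) / number of classes."""
    return (max_value - min_value) / n_classes


# Hypothetical example: 100 observations ranging from 10 to 50.
k = sturges_classes(100)        # 1 + 3.322 * 2 = 7.644, rounded up to 8
width = class_width(10, 50, k)  # (50 - 10) / 8 = 5.0
print(k, width)
```

In practice the computed width is usually rounded to a convenient value that agrees with the accuracy of the raw data.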
46
Tables Cont
Relative and Cumulative Frequency
In addition to counts, it is useful to know the proportion of
values that fall into a given class.
Relative frequency of a class is the proportion or
percentage of total number of observations that fall in a
given class.
Cumulative relative frequency of a class is the proportion
(percentage) of total number of observations that have a
value less than or equal to the upper limit of a given
interval.
If such information is given in the form of counts it is
simply called Cumulative frequency.
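As a minimal sketch of these definitions, the following Python snippet computes relative, cumulative and cumulative relative frequencies from a list of class counts. The counts are hypothetical and the function name is chosen only for this illustration.

```python
from itertools import accumulate


def frequency_table(counts):
    """Return (relative %, cumulative, cumulative relative %) for class counts.

    The classes are assumed to be listed in ascending order.
    """
    total = sum(counts)
    relative = [100 * c / total for c in counts]
    cumulative = list(accumulate(counts))
    cumulative_relative = [100 * c / total for c in cumulative]
    return relative, cumulative, cumulative_relative


# Hypothetical counts for five ordered classes (100 observations in total).
counts = [10, 20, 40, 20, 10]
relative, cumulative, cumulative_relative = frequency_table(counts)
print(relative)             # [10.0, 20.0, 40.0, 20.0, 10.0]
print(cumulative)           # [10, 30, 70, 90, 100]
print(cumulative_relative)  # [10.0, 30.0, 70.0, 90.0, 100.0]
```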
47
Tables Cont
Table 2.2: Cumulative and Relative Frequency of Age Structure of Women of
Reproductive age in Awassa Town, Jan 2006.
Age Group   Number of women   Relative Frequency (%)   Cumulative Relative Frequency (%)
15-19       399               28.9                     28.9
20-24       341               24.7                     53.6
25-29       281               20.4                     74.0
30-34       143               10.4                     84.3
35-39       116               8.4                      92.8
40-44       54                3.9                      96.7
45-49       42                3.0                      100.0
Total       1380              100.0
48
Tables Cont
Depending on the number of variables represented,
tables can be classified as one-way, two-way and higher
order tables.
One-way Table: Only one variable is summarized in the
table.
Two-way Table (Cross tabulation): Two variables are
organized simultaneously in combined manner in a table.
Higher Order Table: Three or more variables are
presented simultaneously in a table. The higher the order
of the table, the more complicated its interpretation.
49
Tables Cont
Educational status of women   Child Ever Born
                              >=5    < 5
Illiterates                   42     68
Read and Write                9      19
1st-4th grade                 32     60
5th-8th grade                 46     211
9th-12th grade                42     239
> 12th grade                  7      68
Total                         175    665
What type of table is this?
50
Tables Cont
Child's Age   Child's Sex   History of illness in the preceding 2 weeks
                            Yes    No    Total
0-11 mo       Male          15     86    101
0-11 mo       Female        18     84    102
12-23 mo      Male          13     80    93
12-23 mo      Female        12     78    90
24-35 mo      Male          10     76    86
24-35 mo      Female        11     77    88
36-47 mo      Male          9      74    83
36-47 mo      Female        9      73    82
48-59 mo      Male          6      69    75
48-59 mo      Female        7      70    77
51
Tables Cont
In constructing tables, the following standards should be
followed:
Tables should be simple and self explanatory,
Every table should have a title (usually at the top of the table)
which indicates the who, what, when and where of the data presented,
Rows and columns should be labeled,
Totals should be indicated,
Numeric entries of zero should be written as 0, while missing or
unobserved data should be represented by -,
If the data are not original, their source should be given as a
footnote,
Complicated tables should be avoided.
52
Diagrammatic Representation
A second way to present data is through the use of graphs
or pictures (diagrammatic representations).
Though diagrammatic representations are easier to read than
tables, they supply a lesser degree of detail.
However, the loss of detail can be compensated by a gain
in understanding of the data.
Diagrammatic representation has the following advantages:
They are easier to understand and memorize,
They are more attractive,
They facilitate comparison among groups,
They may show pattern within the data set.
53
Bar Charts (Bar Graphs)
Bar graphs are a popular type of graph used to display a
frequency distribution for nominal or ordinal data.
In the most common type, the vertical bar graph
(column graph), the categories into which the
observations fall are presented along the horizontal axis.
A vertical bar is drawn above each category so that
the height of the bar represents either the frequency or
relative frequency of observations within that class.
The bars should have equal width and be separated from
one another so as not to imply continuity.
In a horizontal bar graph, the roles of the axes are
reversed.
54
Bar Charts Cont
Bar graph has different types:
Simple Bar Graph:
Depicts the frequency/relative frequency of the classes of a variable.
The intention is to compare the frequency of different classes of a
variable.
[Figure: simple bar graph. X axis: the time breastfeeding was initiated (within an hr, 1-24 hr, after the first day). Y axis: percentage of children aged 0-11 months, scale 0-70.]
55
Bar Charts Cont
Multiple Bar Graph:
Depicts the frequency or relative frequency of classes of a
variable at two or more situations.
This type enables comparison between the levels of classes of
the variable at different situations.
[Figure: multiple bar graph comparing Baseline and End line. X axis: the time breastfeeding was initiated (within an hr, within a day, after the first day). Y axis: %, scale 0-70. Data labels shown: 28, 60, 26, 63.3, 33.5, 2.8.]
56
Bar Charts Cont
Component Bar Graph:
Similar to the simple bar graph, except that the bars are divided into
components.
The graph shows the relative contribution of the components to
each bar (category).
[Figure: component bar graph with Female and Male components. X axis: the time breastfeeding was initiated (within an hr, 1-24 hr, after the first day). Y axis: percentage of children aged 0-11 months, scale 0-70.]
57
Bar Charts Cont
100% Component Bar Graph:
Similar to the component bar graph,
but the height of all the bars is set at 100% so that the
relative contributions of the components can easily be
compared.
[Figure: 100% component bar graph with Females and Males components. X axis: within an hr, within a day, after the first day. Y axis: 0%-100%.]
58
Pie Chart
A Pie Chart is a circular chart divided into sectors,
illustrating relative magnitudes or frequencies of classes
of a given variable.
A pie chart usually represents categorical data, but it is also
possible to use it for discrete quantitative data.
The angle of each sector has to be proportional to the
relative frequency of a given class (angle = relative frequency x 360 degrees).
59
Pie Chart Cont.
60
Histogram
Whereas a bar chart represents a frequency
distribution for nominal or ordinal data, a histogram
depicts a frequency distribution for continuous data.
The horizontal axis displays the true limits of the intervals;
the vertical axis represents the frequency or relative
frequency of each interval.
If the intervals have equal width, the frequency
associated with each interval can be represented by the
height of the respective bar.
However, if the bars have different widths, the histogram
should be drawn in such a way that the Y axis represents
the frequency density and the X axis the interval.
61
Histogram Cont
With unequal widths, the frequency of each interval is
represented by the area of its bar.
Frequency density of an interval = frequency of the
interval / true class width.
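A short Python sketch of the frequency density calculation is given below. The class boundaries and frequencies are illustrative values chosen for this example, and the function name is hypothetical.

```python
def frequency_densities(true_boundaries, frequencies):
    """Frequency density = frequency / true class width, per interval."""
    return [
        f / (upper - lower)
        for (lower, upper), f in zip(true_boundaries, frequencies)
    ]


# Illustrative intervals with unequal widths (5, 5 and 10 units).
true_boundaries = [(0.5, 5.5), (5.5, 10.5), (10.5, 20.5)]
frequencies = [2, 12, 16]

densities = frequency_densities(true_boundaries, frequencies)
print(densities)  # [0.4, 2.4, 1.6]
```

Multiplying each density by its class width gives back the frequency, which is why the area of each bar, not its height, represents the frequency.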
Unlike in a bar graph, in a histogram the
bars must be adjacent. Hence, in order to
construct a histogram, true class boundaries
should be used rather than class limits.
For example, the following table summarizes the
Biostatistics mid exam scores (out of 35 marks) of
38 students.
62
63
Frequency Polygon
A frequency polygon depicts a frequency distribution for
continuous numeric data.
Frequency polygons are a graphical device for
understanding the shapes of distributions.
A histogram can easily be changed into a frequency
polygon by joining the midpoints of the tops of the
adjacent rectangles of the histogram with straight lines.
It is also possible to draw Frequency Polygon without
drawing Histogram. The procedure is as follows:
64
Frequency Polygon Cont
1. Identify the midpoints of all the class intervals
of the given data,
2. Plot the midpoints (on the X axis) against the respective
frequency or relative frequency of each class
(on the Y axis),
3. Connect adjacent points with straight lines.
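Steps 1 and 2 of the procedure above can be sketched in Python as follows. The class intervals and frequencies are hypothetical, and the function name is used only for this illustration; the resulting (midpoint, frequency) pairs are the points to be joined with straight lines in step 3.

```python
def polygon_points(class_limits, frequencies):
    """Pair the midpoint of each class interval with its frequency."""
    return [
        ((lower + upper) / 2, f)
        for (lower, upper), f in zip(class_limits, frequencies)
    ]


# Hypothetical grouped data.
class_limits = [(10, 19), (20, 29), (30, 39), (40, 49)]
frequencies = [4, 12, 9, 5]

points = polygon_points(class_limits, frequencies)
print(points)  # [(14.5, 4), (24.5, 12), (34.5, 9), (44.5, 5)]
```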
65
Frequency Polygon Cont
For example the following Frequency Distribution
represents the ages (in years) of 60 patients at a
psychiatric counseling centre.
66
Frequency Polygon Cont
First we have to identify the mid points of each interval.
67
Frequency Polygon Cont
Finally we have to plot the midpoints (as X axis) with respective
frequency of each class (as Y axis) and connect adjacent plots with
a straight line.
68
Scattered Plot (Scattered Graph)
Scattered plot is used to show the relation between two
different continuous measurements.
The scale for one quantity is marked on the X axis and
the scale for the other on the Y axis.
Each point on the graph represents a pair of values for
the two measurements.
For each value on the X axis, it is possible to have
multiple Y values.
The following scattered plot shows the relation between
age and blood glucose level among diabetic patients
aged 50-70 years.
69
Scattered Plot Cont..
[Figure: scattered plot. X axis: age in years (50-70). Y axis: blood glucose level in mg/dl (120-200).]
70
Line Graph
A line graph is similar to a scattered plot in that it shows the
relation between two different continuous measurements.
Once again each point on the graph represents a pair of
values.
However, unlike in a scattered plot, each value on the X axis
has a single corresponding measurement on the Y axis.
As the name indicates, points on the graph are connected
to the adjacent points with straight lines.
Most commonly the scale along the X axis represents time.
Consequently we are able to trace the chronological
changes.
71
Line Graph Cont
Figure 2.8: Mean Number of Child Ever Born to Women at the Age of
25 years, Awassa Town (1980-2005)
[Figure: line graph. X axis: year (GC), 1980-2005. Y axis: mean child ever born among women at the age of 25, scale 1-3.75.]
72
Cumulative Line Graph
Also known as an ogive.
It is best used when you want to display the running total at any
given time.
The relative slopes from point to point indicate
greater or lesser increases.
For example, a steeper slope means a greater increase
than a more gradual slope.
For example, if you saved $300 in both January and
April and $100 in each of February, March, May, and
June, the ogive would look as follows.
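The running totals that such an ogive plots can be computed with a cumulative sum. The following Python sketch uses the savings figures from the example; the variable names are chosen only for this illustration.

```python
from itertools import accumulate

# Monthly savings from the example above.
monthly_savings = {
    "January": 300,
    "February": 100,
    "March": 100,
    "April": 300,
    "May": 100,
    "June": 100,
}

# Running total at the end of each month: the Y values of the ogive.
cumulative_savings = dict(
    zip(monthly_savings, accumulate(monthly_savings.values()))
)

for month, total in cumulative_savings.items():
    print(f"{month}: {total}")
# January: 300, February: 400, March: 500,
# April: 800, May: 900, June: 1000
```

The steepest segments of the line fall in January and April, the two months with the largest savings.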
73
Cumulative Line Graph
Cont
74
Box and Whisker Plot
In descriptive statistics, a box-and-whisker plot is a
convenient way of pictorially depicting groups of
numerical data through their five-number summaries:
the smallest observation, 1st quartile, median, 3rd
quartile, and largest observation.
75
Box and Whisker Plot Cont
However, in some cases the ends of the whiskers can
represent several possible alternative values.
For example, in SPSS:
The ends of the whiskers represent the lowest datum
still within 1.5 IQR of the lower quartile
and the highest datum still within 1.5 IQR of the upper
quartile.
Values more than 3 IQRs from the end of the box
are labeled as extreme, denoted with an asterisk (*).
Values more than 1.5 IQRs but less than 3 IQRs
from the end of the box are labeled as outliers (o).
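The whisker and outlier rules described above can be sketched in Python as follows. This is an illustration under stated assumptions: the data are hypothetical, the function name is made up, and the quartiles are computed with the "inclusive" method of Python's statistics module, which may differ slightly from the quartile (hinge) definition SPSS uses.

```python
import statistics


def box_plot_summary(values):
    """Whisker ends, outliers and extremes under the 1.5 IQR / 3 IQR rule."""
    data = sorted(values)
    # Assumption: "inclusive" quartiles; other software may differ slightly.
    q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_low, outer_high = q1 - 3 * iqr, q3 + 3 * iqr

    inside = [v for v in data if inner_low <= v <= inner_high]
    outliers = [
        v for v in data
        if outer_low <= v < inner_low or inner_high < v <= outer_high
    ]
    extremes = [v for v in data if v < outer_low or v > outer_high]
    return {
        "q1": q1,
        "q3": q3,
        "whisker_low": inside[0],    # lowest value within 1.5 IQR of Q1
        "whisker_high": inside[-1],  # highest value within 1.5 IQR of Q3
        "outliers": outliers,        # would be drawn as "o"
        "extremes": extremes,        # would be drawn as "*"
    }


# Hypothetical data with one outlier (30) and one extreme value (45).
summary = box_plot_summary([10, 12, 13, 14, 15, 16, 17, 18, 20, 30, 45])
print(summary)
```

For these data Q1 = 13.5 and Q3 = 19, so IQR = 5.5; the whiskers end at 10 and 20, 30 is flagged as an outlier and 45 as an extreme value.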
76
Stem and Leaf Plot
A display that organizes data to show its shape and
distribution.
Each data value is split into a "stem" and a "leaf" portion.
The "leaf" is the last digit of the number and the other
digits to the left of the "leaf" form the "stem".
For example, the number 42 would be split apart, with
the stem becoming the 4 and the leaf becoming the 2.
Consider the following dataset, sorted in ascending
order: 8, 13, 16, 25, 26, 29, 30, 32, 37, 38, 40, 41, 44,
47, 49, 51, 54, 55, 58, 61, 63, 67, 75, 78, 82, 86, 95.
77
Stem and Leaf Plot Cont
0|8
1|3 6
2|5 6 9
3|0 2 7 8
4|0 1 4 7 9
5|1 4 5 8
6|1 3 7
7|5 8
8|2 6
9|5
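The plot above can be reproduced with a few lines of Python. This is a minimal sketch that assumes non-negative whole numbers and takes the last digit as the leaf; the function name is chosen only for this illustration.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return the lines of a stem-and-leaf plot (last digit = leaf).

    Assumption: values are non-negative whole numbers.
    """
    leaves_by_stem = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        leaves_by_stem[stem].append(leaf)
    stems = range(min(leaves_by_stem), max(leaves_by_stem) + 1)
    return [
        f"{stem}|{' '.join(str(leaf) for leaf in leaves_by_stem[stem])}"
        for stem in stems
    ]


data = [8, 13, 16, 25, 26, 29, 30, 32, 37, 38, 40, 41, 44,
        47, 49, 51, 54, 55, 58, 61, 63, 67, 75, 78, 82, 86, 95]

for line in stem_and_leaf(data):
    print(line)
```

Stems with no observations are still listed (with no leaves), so that gaps in the distribution remain visible.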
78
Pictogram
A pictogram is a graph which uses pictures or symbols to
present data.
It usually presents the frequency of one or more
categorical or discrete numeric variables in the form of
symbols.
The magnitude of the frequency can be shown either by the size of
the picture or by the number of pictures.
For example, the following pictogram represents the
number of passengers per year across four airports of
the UK.
79
Pictogram Cont
80
Issues to be considered in
diagrammatic representation
Depending on the type of the data, the right type of
diagrammatic representation should be selected.
It is not common to use two or more types of
diagrammatic representation simultaneously for a
specific dataset. The best one should be selected and used.
Each graph and diagram should be labeled (usually the
title is given below the figure).
The title should indicate Who, What, When and
Where of the data presented.
If the representation is taken from another source the
primary source should be indicated.
81
Issues to be considered
Cont
In graphs, the X and Y axes should be labeled clearly
with their units of measurement.
In graphs, the scales of the X and Y axes should be drawn
proportionally.
Pictorial representations usually require a key to facilitate
easier interpretation.
When colors are employed, contrasting colors should be
selected.
82
Diagrammatic Representation
Using SPSS
In order to develop graphs using SPSS, the following
steps should be followed:
Graphs > Legacy Dialogs > select the appropriate graph
Available types are Bar graph, Pie chart, Histogram, Line
graph, Scattered plot and Box plot.
Other rarely used types are also available.
Most of the graphs can also be found under the Analyze >
Descriptive Statistics menu.
83
Numeric Summarization
Introduction
Even though diagrammatic representation greatly
enhances understanding of the data, it does not give
mathematically amenable outputs.
This gap is addressed by numeric summarization.
In summarizing a dataset using numeric indicators, we
often focus on describing the data with two summary
figures. These are:
Central Tendency (Location)
Variation (Spread)
85
Measures of Central Tendency
One of the most commonly used measures to
summarize a set of data is its center.
The center is a value (usually a single value) chosen in
such a way that it gives a reasonable approximation of
the whole dataset.
In statistics the number which tends to approximate the
center of a set of data is called Measure of Central
Tendency or Average.
The Arithmetic Mean, Median and Mode are the most
commonly used measures of central tendency.
86
Measures of Central
Tendency Cont
Attributes of a good measure of central tendency are:
It should be based on all observations.
It should not be affected by extreme values.
It should have a definite value.
It should not require complicated computation.
It should be capable of further algebraic treatment.
It should be close to the location where the majority of the
observations are located.
87
Arithmetic Mean
The Arithmetic Mean is usually called the Mean.
It is the most familiar measure of central tendency.
It is calculated by adding all of the individual values and
dividing the sum by the number of individual values.
In statistics, two separate symbols are used for the mean.
The Greek letter μ (mu) is used to denote the population
mean.
The symbol x̄ (read as "x bar") is used to denote the
sample mean.
88
Arithmetic Mean Cont
When n is the total number of observations and $x_i$ is the value
of X for the i-th observation, the formula of the arithmetic mean is given
as:
$$\text{Mean} = \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
In calculating the mean from grouped data we assume all
values falling into a particular class interval are located at the
mid point of the interval.
The formula is given as:
$$\text{Mean} = \bar{x} = \frac{\sum_{i=1}^{k} f_i m_i}{n}$$
89
Arithmetic Mean Cont
Where k is the number of class intervals,
$m_i$ is the mid point of the i-th class interval,
$f_i$ is the frequency of the i-th class interval,
n is the total number of observations.
The formula simply means each value within the interval
is represented by the midpoint of the true class interval.
Then we can calculate the mean as usual.
90
Arithmetic Mean Cont
Example 3.1: Consider the time taken by 30 students to do
a Biostatistics quiz.
Minutes spent   Number of      True class   Mid point   m_i f_i
on Quiz         students (f)   interval     (m)
1-5             2              0.5-5.5      3           6
6-10            12             5.5-10.5     8           96
11-20           16             10.5-20.5    15.5        248
Total           30                                      350
Thus the mean of the data is 350/30 = 11.7 minutes.
91
Arithmetic Mean Cont
The major advantages of mean are:
It is calculated based on all observations.
Its mathematical computation is not complicated.
It accommodates further mathematical applications.
It can only have one value.
The major disadvantages of mean are:
It is affected by extreme values.
It shouldn't be used when the dataset is not normally
distributed.
92
Median
The Median is the value which divides the data into two equal
halves, with half of the values being lower than the Median
and half higher than the median.
When n is the number of observations in a dataset, the median
is calculated as follows:
Sort the values into ascending order.
If you have an odd number of observations, the median is
the middle observation, i.e. the value at position (n+1)/2.
If you have an even number of observations, the median is
the arithmetic mean of the two middle observations, i.e.
pick the values at positions n/2 and (n/2) + 1 and find the
mean of those two observations.
93
Median Cont
Example 3.2: Compute the median for {1, 2, 3, 4, 5}
The numbers are already sorted, so that it is easy to see
that the median is 3 (two numbers are less than 3 and
two are bigger).
Example 3.3: Compute the median for {1, 2, 3, 4, 5, 6}
The median would be 3.5 since that is the middle
between 3 and 4, computed as (3 + 4)/ 2.
Note that three numbers are less than 3.5, and three are
bigger, as the definition of the median requires.
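The odd/even rule used in Examples 3.2 and 3.3 can be sketched in a few lines of Python. This is only an illustration; the function name median_of is an assumption of this sketch and does not come from the course material.

```python
def median_of(values):
    """Median by the sort-then-pick-the-middle rule described above."""
    ordered = sorted(values)          # step 1: ascending order
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty dataset is undefined")
    middle = n // 2
    if n % 2 == 1:
        # odd n: the observation at position (n+1)/2 (1-based)
        return ordered[middle]
    # even n: mean of the observations at positions n/2 and n/2 + 1
    return (ordered[middle - 1] + ordered[middle]) / 2


odd_example = median_of([1, 2, 3, 4, 5])
even_example = median_of([1, 2, 3, 4, 5, 6])
print(odd_example, even_example)
```

The two calls reproduce the datasets of Examples 3.2 and 3.3.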
94
Median Cont
When we are dealing with grouped data, the median can be
calculated as:
$$\tilde{X} = L_m + \left(\frac{\frac{n}{2} - F_c}{F_m}\right) w$$
Where:
$L_m$ is the lower true class boundary of the interval
containing the median,
$F_c$ is the cumulative frequency of the interval just before the
median class interval,
$F_m$ is the frequency of the interval containing the median,
w is the class interval width,
n is the total number of observations.
95
Median Cont
The major advantages of the median are:
It is not affected by extreme values,
It can be used in skewed distributions,
It is easy to calculate,
It can have only one value,
It can be calculated when there is an open-ended interval.
The major limitations of the median are:
It may not be a good representative if the number of
observations is too small,
It does not accommodate further mathematical
applications (in parametric statistics),
It is calculated based on one or two observations.
96
Mode
Mode is by far the simplest, but the least widely used
measure of central tendency.
It is simply the score that occurs most frequently.
When the distribution has only one value with the highest
frequency it is called Unimodal. If it has two values with
equal and highest frequency it is called Bimodal.
Similarly, it is possible to have a multimodal distribution.
Example: {1, 2, 2, 3, 3, 4, 4, 4, 5}
The mode is 4.
In grouped data the mid point of the interval with highest
frequency is considered as the mode of the distribution.
97
Mode Cont
98
Mode Cont
For example, the following table displays the salary of 20
factory workers in factory X.
Salary in Br    Number of Factory Workers
500-600         3
600-700         6
700-800         5
800-900         5
900-1000        0
1000-1100       1
The interval 600-700 has the highest frequency, so the
mid point of this interval, i.e. 650, is taken as the mode of the
distribution.
99
Mode Cont
The major advantages of the mode are:
It can be used when the variable is ordinal or nominal,
It is very easy to compute,
It is less likely to be affected by extreme values,
It can be calculated for distributions with an open-ended class
interval.
The major disadvantages of the mode are:
It may not perfectly denote what central tendency implies,
It does not accommodate further mathematical application,
It is calculated based on few observations,
It may have more than one value for a dataset,
At times a mode value may not exist in a dataset.
100
Skewness and the Measures
of Central Tendency
The normal distribution is one that is bell shaped, unimodal
and symmetric.
Skewness measures the lack of symmetry of a distribution.
If the distribution is not symmetric (one side does not mirror
the other), then it is skewed.
Skewness is indicated by the tail or trailing frequencies of
the distribution.
If the tail is to the right it is a positive skew. If the tail is to the
left then it is a negatively skewed distribution.
In a normal distribution, the mean, median and mode are equal.
Skewness affects the arrangement of the three measures of
central tendency in the following way: in a positively skewed
distribution the mean is typically larger than the median, which is
larger than the mode; in a negatively skewed distribution the
order is reversed.
101
Skewness and the Measures
of Central Tendency Cont
102
Weighted Mean
The weighted mean is similar to an arithmetic mean, except that
the individual data values do not contribute equally to
the mean.
Each data value ($X_i$) has a weight assigned to it ($W_i$).
Data values with larger weights contribute more to the
weighted mean and data values with smaller weights
contribute less to the weighted mean.
The formula is
$$\bar{X}_w = \frac{\sum_{i=1}^{n} W_i X_i}{\sum_{i=1}^{n} W_i}$$
103
Weighted Mean Cont
If all the weights are equal, then the weighted mean is
the same as the arithmetic mean.
A familiar example of the application of the weighted mean
is the calculation of GPA.
The grade points of each course are weighted by its credit hours, so a
course with more credit hours contributes more to the GPA.
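A minimal sketch of the weighted mean, applied to a GPA-style calculation. The grade points and credit hours below are hypothetical, and treating credit hours as the weights is an assumption of this sketch about how the GPA is computed.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Hypothetical student record: grade points earned and credit hours per course.
grade_points = [4.0, 3.0, 2.0]
credit_hours = [3, 4, 2]

gpa = weighted_mean(grade_points, credit_hours)
equal_weights = weighted_mean(grade_points, [1, 1, 1])  # same as the arithmetic mean
print(round(gpa, 2), equal_weights)
```

With equal weights the function returns the ordinary arithmetic mean, which matches the statement on this slide.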
104
Geometric Mean
The geometric mean is an average calculated by
multiplying a set of n numbers and taking the n-th root
of the product:
$$GM = \sqrt[n]{x_1 x_2 \cdots x_n}$$
Geometric mean is related to the log-normal distribution.
The log-normal distribution is a distribution which is
normal for the logarithm transformed values.
105
Harmonic Mean
The harmonic mean (H) of n positive values is defined by the
formula:
$$H = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$$
It is the reciprocal of the arithmetic mean of the reciprocals.
It applies more accurately to situations involving rates.
For example: A blood donor fills a 250mL blood bag at
70mL/min on the first visit, and 90mL/min the second visit.
What is the average rate at which the donor fills a bag?
Given:
250mL at 70mL/min = 3.571 mins total
250mL at 90mL/min = 2.778 mins total
106
Harmonic Mean Cont
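So 500 mL in total is collected in (3.571 + 2.778) min = 6.349 min, i.e. 500/6.349 =
78.753 mL/min.
The harmonic mean, 2/[1/70 + 1/90] = 78.750 mL/min, gives a more
accurate description of the average rate than the arithmetic mean
(80 mL/min). The small difference from 78.753 is due to rounding of the times.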
Source: http://wiki.answers.com/Q/What_is_the_application_of_harmonic_mean_in_medicine
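The blood donor example can be checked with a short Python sketch. It uses harmonic_mean and mean from the standard statistics module; the variable names are chosen here for illustration only.

```python
from statistics import harmonic_mean, mean

rates_ml_per_min = [70, 90]   # filling rate on the first and second visit
bag_volume_ml = 250           # same bag volume on both visits

# Direct calculation: total volume divided by total time.
total_time_min = sum(bag_volume_ml / rate for rate in rates_ml_per_min)
overall_rate = len(rates_ml_per_min) * bag_volume_ml / total_time_min

h_mean = harmonic_mean(rates_ml_per_min)
a_mean = mean(rates_ml_per_min)
print(round(overall_rate, 3), round(h_mean, 3), a_mean)
```

Because the same volume is collected at each rate, the overall rate equals the harmonic mean of the two rates, while the arithmetic mean overstates it.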
107
Measures of Dispersion
While measures of central tendency are used to estimate the
"center" value of a dataset, measures of dispersion are
important for describing the spread of the data, or its
variation around a central value.
Two distinct samples may have the same mean or
median, but completely different levels of variability, or
vice versa.
Set 1: 30, 40, 40, 50, 60, 60, 70 (Mean = 50)
Set 2: 48, 49, 49, 50, 50, 51, 53 (Mean = 50)
108
Range
Defined as the difference between the largest and
smallest sample values ($x_{max} - x_{min}$).
Major advantage: It is simple to calculate.
Major disadvantages:
It depends only on extreme values and provides no
information about how the remaining data are distributed.
The range cannot be used to compare datasets whose units of
measurement are different.
The extreme values are the most unreliable parts of the
data.
It doesn't accommodate further mathematical
application.
109
Standard Deviation and
Variance
Standard deviation is the most common and useful
measure of dispersion.
It is, roughly, the average distance of each score from the mean.
The formula for sample standard deviation is given as:
$$S = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
The formula for population standard deviation is given as:
$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}}$$
What might be the reason for the difference?
110
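As an illustration of the two standard deviation formulas, the sketch below applies them to Set 1 and Set 2 from the Measures of Dispersion slide. The function name standard_deviation is an assumption of this sketch; stdev and pstdev from the standard statistics module are used only as a cross-check.

```python
from math import sqrt
from statistics import pstdev, stdev


def standard_deviation(data, sample=True):
    """Square root of the summed squared deviations over n-1 (sample) or n (population)."""
    n = len(data)
    centre = sum(data) / n
    squared_deviations = sum((x - centre) ** 2 for x in data)
    return sqrt(squared_deviations / (n - 1 if sample else n))


set_1 = [30, 40, 40, 50, 60, 60, 70]   # mean = 50
set_2 = [48, 49, 49, 50, 50, 51, 53]   # mean = 50

for name, data in (("Set 1", set_1), ("Set 2", set_2)):
    print(name,
          round(standard_deviation(data), 2),
          round(standard_deviation(data, sample=False), 2))
```

Both sets have the same mean, but Set 1 has a much larger standard deviation; within each set the sample value is slightly larger than the population value because it divides by n - 1.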
Standard Deviation and
Variance Cont
Variance is just the square of the standard deviation.
The formulas for sample and population variance are
given as follows:
$$S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
$$\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}$$
NB: Occasionally, the abbreviations SD for standard
deviation and Var for variance are used.
Standard deviation for grouped data is calculated as:
$$S = \sqrt{\frac{\sum_{i=1}^{k} f_i m_i^2 - n\bar{x}^2}{n - 1}}$$
111
Standard Deviation and
Variance Cont
Advantages:
They accommodate further mathematical
applications.
They are calculated from all of the observations.
Disadvantages:
They must always be understood in the context of the
mean of the data.
They depend on the unit of measurement of the
observed data. Thus it is difficult to compare the
standard deviation/variance of two datasets
measured in two different units.
112
Coefficient of Variation (CV)
The standard formulation of the CV is the ratio of the
standard deviation to the mean of a given dataset:
$$CV = \frac{S}{\bar{x}} \times 100\%$$
The coefficient of variation is a dimensionless number.
So when comparing between datasets with different
units one should use CV instead of SD.
The CV is useful in comparing the variability of several
different samples, each with a different arithmetic mean, as
higher variability is expected when the mean increases.
CV is also important to compare the reproducibility of
variables.
113
Example on Grouped Data
Example 3.4:
Consider the time taken by 30 students to do a
Biostatistics quiz. Their time is summarized in the
following table.
Minutes spent on Quiz Number of students (f)
1-5 2
6-10 12
11-20 16
Total 30
114
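Example Cont
Minutes spent   Number of      True class   Mid point   f_i m_i   f_i m_i^2
on Quiz         students (f)   interval     (m)
1-5             2              0.5-5.5      3           6         18
6-10            12             5.5-10.5     8           96        768
11-20           16             10.5-20.5    15.5        248       3844
Total           30                                      350       4630
$$\text{Mean} = \frac{\sum_{i=1}^{k} f_i m_i}{n} = \frac{350}{30} = 11.7 \text{ minutes}$$
$$\tilde{X} = L_m + \left(\frac{\frac{n}{2} - F_c}{F_m}\right) w = 10.5 + \left(\frac{15 - 14}{16}\right)(10) = 11.1 \text{ minutes}$$
$$S = \sqrt{\frac{\sum_{i=1}^{k} f_i m_i^2 - n\bar{x}^2}{n - 1}} = \sqrt{\frac{4630 - 350^2/30}{29}} = \sqrt{18.85} = 4.34 \text{ minutes}$$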
115
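The grouped-data calculations for the quiz example can be reproduced with the sketch below. It assumes each class is described by its true class boundaries and frequency, and it uses the usual grouped formulas (class mid points for the mean and standard deviation, linear interpolation inside the median class). The function name grouped_summary is hypothetical.

```python
from math import sqrt

# (lower true boundary, upper true boundary, frequency) for each class
quiz_classes = [(0.5, 5.5, 2), (5.5, 10.5, 12), (10.5, 20.5, 16)]


def grouped_summary(classes):
    """Return (mean, median, sample SD) computed from grouped data."""
    n = sum(f for _, _, f in classes)
    mid_points = [(low + high) / 2 for low, high, _ in classes]
    freqs = [f for _, _, f in classes]

    mean = sum(f * m for f, m in zip(freqs, mid_points)) / n

    # Median: locate the class where the cumulative frequency reaches n/2.
    cumulative = 0
    median = None
    for low, high, f in classes:
        if cumulative + f >= n / 2:
            median = low + (n / 2 - cumulative) / f * (high - low)
            break
        cumulative += f

    sum_of_squares = sum(f * m * m for f, m in zip(freqs, mid_points)) - n * mean ** 2
    sd = sqrt(sum_of_squares / (n - 1))
    return mean, median, sd


quiz_mean, quiz_median, quiz_sd = grouped_summary(quiz_classes)
print(round(quiz_mean, 1), round(quiz_median, 1), round(quiz_sd, 2))
```

With a class width of 10 for the 11-20 interval, this sketch gives a mean of about 11.7, a median of about 11.1 and a standard deviation of about 4.3 minutes.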
Measures of Position (Fractiles)
In addition to measures of central tendency and
dispersion, measures of position give additional
information about a given dataset.
Fractiles (Quantiles) are numbers that partition, or divide,
an ordered dataset into equal parts.
For instance, the median is a fractile because it divides
an ordered data set into two equal parts.
The commonly used measures of position are Quartiles
(that divide the data into 4 parts), Deciles (that divide the
data into 10 parts), and Percentiles (that divide the data
into 100 parts).
116
Quartiles
Quartiles divide a data set into four equal parts.
The three quartiles $Q_1$, $Q_2$, and $Q_3$ divide an ordered data
set into four equal parts.
About ¼ of the data falls on or below the first quartile $Q_1$.
About ½ of the data falls on or below the second quartile
$Q_2$ (equivalent to the median).
About ¾ of the data falls on or below the third quartile
$Q_3$.
About ¼ of the data falls above the third quartile $Q_3$.
117
Quartiles Cont
In order to identify the Quartiles of a given dataset:
Sort the values in increasing order.
Identify the Quartiles accordingly:
$Q_1$ is the {0.25(n+1)}th observation
$Q_2$ is the median observation or the {0.5(n+1)}th observation
$Q_3$ is the {0.75(n+1)}th observation
NB: if the identified position is not a whole number
then the quartile should be determined by interpolation between the
observations on either side.
118
Quartiles Cont
Example: Let's assume the following dataset presents the
age of 8 factory workers. Identify the first and the third
quartiles.
{18, 21, 23, 24, 24, 32, 42, 59}
First make sure that the data is sorted in increasing order.
$Q_1$ is the {0.25(n+1)}th observation
= {0.25(8+1)}th observation
= {0.25(9)}th observation
= {2.25}th observation
119
Quartiles Cont
i.e. $Q_1$ lies a quarter of the distance between 21 and 23; this
can be interpolated as:
21 + (23 - 21)(0.25) = 21.5
The interpretation is that one fourth of the observations are
below or equal to the value 21.5.
$Q_3$ is the {0.75(n+1)}th observation
= {6.75}th observation
32 + (42 - 32)(0.75) = 39.5
The interpretation is that three fourths of the observations are
below or equal to the value 39.5.
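The position-and-interpolate rule used in this example can be sketched as a small Python function. The name fractile and its arguments are assumptions of this sketch; the rule itself is the (n+1)-based position described on the preceding slides.

```python
def fractile(values, fraction):
    """Value at position fraction * (n + 1), interpolating between neighbours."""
    ordered = sorted(values)
    n = len(ordered)
    position = fraction * (n + 1)      # 1-based, may be fractional
    whole = int(position)              # observation just below the position
    part = position - whole            # fractional distance to the next one
    if whole < 1:
        return ordered[0]
    if whole >= n:
        return ordered[-1]
    lower, upper = ordered[whole - 1], ordered[whole]
    return lower + part * (upper - lower)


ages = [18, 21, 23, 24, 24, 32, 42, 59]
q1 = fractile(ages, 0.25)
q3 = fractile(ages, 0.75)
print(q1, q3, q3 - q1)   # first quartile, third quartile, interquartile range
```

Positions that fall before the first or after the last observation are clamped to the extreme values, which is a design choice of this sketch rather than part of the slide's rule.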
120
Quartiles Cont
Additional uses of the quartiles:
The interquartile range ($Q_3 - Q_1$) can be used as a
measure of dispersion (like the range). The interquartile
range overcomes one of the limitations of the range (i.e.
being affected by extreme values).
Quartile deviation [$(Q_3 - Q_1)/2$] and coefficient of quartile
deviation [$(Q_3 - Q_1)/(Q_3 + Q_1)$] are also used, though rarely, as
measures of dispersion.
A dataset can be summarized using the so-called
five-number summary (this is sometimes represented
graphically as a box-and-whisker plot). The five numbers
are: the minimum, the first quartile, the median, the third
quartile, and the maximum.
121
Deciles
Deciles serve to partition data into 10 equal parts.
They are not as commonly used as percentiles and
quartiles.
There are 9 deciles dividing the data into 10 parts.
The deciles are termed $D_1$ through $D_9$.
The interpretation of deciles is as follows:
About one tenth of the data falls on or below $D_1$.
About two tenths of the data falls on or below $D_2$.
The same meaning applies to the other deciles.
Note that $D_5$ has the same meaning as the median or
the second quartile.
122
Deciles Cont
A given decile is determined in the following manner:
1. Arrange the data in ascending order.
2. Compute the decile using the formula:
$$k^{th}\ \text{decile} = \left[\left(\frac{k}{10}\right)(n + 1)\right]^{th}\ \text{observation}$$
3. NB: if the identified position is not a whole number
then the decile should be determined by interpolation between the
observations on either side.
123
Percentiles
Percentiles are also like quartiles, but divide the data set
into 100 equal parts.
Each group represents 1% of the data set.
There are 99 percentiles termed $P_1$ through $P_{99}$.
$P_{50}$ is yet another term for the median.
Other equivalents, such as $P_{25} = Q_1$, $P_{75} = Q_3$, $P_{10} = D_1$, etc.,
should also be obvious.
The interpretation of percentiles is as follows:
1% of the data falls on or below $P_1$.
2% of the data falls on or below $P_2$.
The same applies to other values.
124
Percentiles Cont
A given percentile is determined in the following manner:
1. Arrange the data in ascending order.
2. Compute the percentile using the formula:
$$k^{th}\ \text{percentile} = \left[\left(\frac{k}{100}\right)(n + 1)\right]^{th}\ \text{observation}$$
3. If the identified position is not a whole number then
the percentile should be determined by interpolation between the
observations on either side.
125
Example
The following data represent the Biostatistics results of
18 students out of 100 marks. Calculate the 4th decile
and the 70th percentile.
{72, 51, 59, 80, 84, 71, 82, 71, 51, 48, 66, 81, 78, 69, 75,
67, 76, 75}
Computing the 4th decile
Before starting the computation arrange the observations
in increasing order, i.e.
{48, 51, 51, 59, 66, 67, 69, 71, 71, 72, 75, 75, 76, 78, 80,
81, 82, 84}
Compute the 4th decile using the formula:
126
Example Cont
Compute the 4th decile using the formula:
$$4^{th}\ \text{decile} = \left[\left(\frac{4}{10}\right)(n + 1)\right]^{th}\ \text{observation}$$
$$4^{th}\ \text{decile} = [(0.4)(19)]^{th}\ \text{observation}$$
$$4^{th}\ \text{decile} = (7.6)^{th}\ \text{observation}$$
The 4th decile is between the 7th and 8th observations (i.e. between 69 and 71).
In order to get the exact value we have to interpolate:
69 + (71 - 69)(0.6) = 70.2
About four tenths of the data falls on or below 70.2.
127
Example Cont
Compute the 70th percentile
The data is already sorted.
Compute the 70th percentile using the formula:
$$70^{th}\ \text{percentile} = \left[\left(\frac{70}{100}\right)(n + 1)\right]^{th}\ \text{observation}$$
$$70^{th}\ \text{percentile} = (13.3)^{th}\ \text{observation}$$
The 70th percentile is between the 13th and 14th observations (i.e. between
76 and 78).
In order to get the exact value we have to interpolate:
76 + (78 - 76)(0.3) = 76.6
About 70% of the data falls on or below the value 76.6.
128
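The decile and percentile of the 18 exam results can be cross-checked in Python. The sketch relies on statistics.quantiles with method="exclusive", which places the k-th of m cut points at position (k/m)(n+1) and interpolates linearly, i.e. the same rule as on the slides; the variable names are illustrative.

```python
from statistics import quantiles

scores = [72, 51, 59, 80, 84, 71, 82, 71, 51, 48, 66, 81, 78, 69, 75,
          67, 76, 75]

# quantiles() sorts the data itself and returns the m - 1 cut points.
deciles = quantiles(scores, n=10, method="exclusive")       # D1 .. D9
percentiles = quantiles(scores, n=100, method="exclusive")  # P1 .. P99

fourth_decile = deciles[3]           # index 3 holds D4
seventieth_percentile = percentiles[69]  # index 69 holds P70
print(fourth_decile, seventieth_percentile)
```

Other software, including SPSS, may use slightly different interpolation rules by default, so small differences between packages are to be expected.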
Rate, Ratio and Proportion
In addition to measures of central tendency, measures of
dispersion, and measures of position, a dataset can be
mathematically summarized by the use of Rate, Ratio
and Proportion.
129
Rate
In mathematics, a rate is a fraction in which the numerator
measures one variable and the denominator another.
Usually the denominator of a rate is a measure of time.
In epidemiology we use rates to measure the occurrence
of events over time.
If the time element is directly reflected in the denominator
it is called a real rate (Example: Incidence density).
If the fraction measures the number of events per population
at risk in a given period of time it is called an operational
rate (Example: Incidence proportion).
130
Ratio
Mathematically, a ratio is the comparison of two
quantities that have the same units (usually classes of a
variable).
A ratio can be written in three different ways:
As two numbers separated by a colon (a:b)
As a fraction (a/b)
As two numbers separated by the word "to" (a to b)
In epidemiology a ratio presents two quantities (as
numerator and denominator) where one is not included
in the other.
131
Proportion
A proportion is usually presented as a fraction, decimal or
percentage.
Unlike a ratio, the numerator is a subset of the denominator,
hence the value indicates the overall contribution of the
numerator to the denominator.
132
Numeric Summarization
Using SPSS
In SPSS numeric summaries are available under several
menus. Commonly used are:
Analyze > Descriptive Statistics > Frequencies >
Statistics.
Analyze > Descriptive Statistics > Descriptives >
Statistics.
Analyze > Descriptive Statistics > Crosstabs >
Statistics.
Analyze > Descriptive Statistics > Explore > Statistics.
Analyze > Reports > OLAP Cubes > Statistics.
133
Basic Probability
What is Probability
Probability is the chance that an event will occur when
the trial is repeated a very large number of times under the
same conditions. OR
The probability of an event is the relative frequency of a
set of outcomes over an indefinitely large (or infinite)
number of trials.
A sample space is the set of all possible outcomes of a
trial or experiment.
An event is a subset of the sample space.
An event can be simple or composite. A composite event
contains more than one simple event.
135
Concept of Union, Intersection and
Complement
136
Mutually Exclusive Events and The
Additive Law
Events are said to be mutually exclusive if they have no
outcome in common.
Examples:
The Additive Law when applied to two mutually exclusive
events states that the probability of either of the two
events occurring is obtained by adding the probability of
each event.
p(A or B) = p(A) + p(B)
137
Mutually Exclusive Cont..
Example 4.1:
Roll a six-sided die. The possible outcomes (sample
space) are six (1,2,3,4,5,6). Each outcome has equal
probability of occurrence (i.e. 1/6). The probability of rolling
an even number would be:
p(even) = p(2) + p(4) + p(6)
= (1/6) + (1/6) + (1/6) = 1/2
138
Mutually Exclusive Cont..
Example 4.2:
The natural history of Tuberculosis indicates for TB
patients without any treatment, at the end of the 5th year
of illness of them would die, would develop
permanent disability and would recover. What is the
probability of an untreated TB patient either to recover or
to develop permanent disability (in other words to avoid
death) after 5 years of illness?
139
Conditional Probability and the
Multiplicative Law
Conditional probability is defined as the probability that a
certain event will occur given that another event has
also occurred.
p(A|B) or "probability of A given B"
$$p(A|B) = \frac{p(A \cap B)}{p(B)}$$
This formula is conveniently rewritten as the following,
which is commonly referred to as the Multiplicative Rule.
$$p(A \cap B) = p(A|B) \times p(B)$$
140
Conditional Probability Cont..
Example 4.3:
What is the probability that the outcome of a roll of a die
is 2 (A2) given that the outcome is even?
Example 4.4:
A medical practitioner measured the CD4 count of AIDS
patients on ART twice within a month. About 25% of
the patients had a normal value in both tests and 42% of
them had a normal result in the first test. What percent of
those who had a normal value in the first test also had a
normal value in the second test?
141
Independent Events and the
Multiplicative Law
For two given events, if the occurrence or nonoccurrence
of one does not affect in any way the occurrence or
nonoccurrence of the other, the events are called
independent events.
With independent events the multiplicative law becomes:
p(A and B) = p(A)p(B)
142
Independent Events Cont..
Example 4.5:
Assume we roll a die twice. What is the
probability of getting 6 in both rolls?
Example 4.6:
The probability of getting a normal birth weight baby at the 33rd
week of gestational age is 1/5. If two pregnant women at
the aforementioned gestational age gave birth in Bethel
Hospital yesterday, what is the probability that both
babies have normal birth weight?
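One possible way to work Examples 4.3 to 4.6 is sketched below with exact fractions where the inputs are fractions. It assumes a fair die and, for Example 4.6, that the two births are independent; the variable names are illustrative only.

```python
from fractions import Fraction

# Example 4.3: p(2 | even) = p(2 and even) / p(even)
p_even = Fraction(3, 6)
p_two_and_even = Fraction(1, 6)        # rolling a 2 is already an even outcome
p_two_given_even = p_two_and_even / p_even

# Example 4.4: p(normal in 2nd | normal in 1st) = p(both normal) / p(1st normal)
p_both_normal = 0.25
p_first_normal = 0.42
p_second_given_first = p_both_normal / p_first_normal

# Example 4.5: two independent rolls, p(6 and 6) = p(6) * p(6)
p_two_sixes = Fraction(1, 6) * Fraction(1, 6)

# Example 4.6: two independent births, each with probability 1/5
p_both_normal_weight = Fraction(1, 5) * Fraction(1, 5)

print(p_two_given_even, round(p_second_given_first, 3),
      p_two_sixes, p_both_normal_weight)
```

The first two results use the conditional probability formula and the last two use the multiplicative law for independent events.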
143
Bayes' Theorem
Bayes' theorem was published in the eighteenth century
by Thomas Bayes.
It says that you can use conditional probability to make
predictions in reverse.
Sometimes called the inverse probability law:
P(B|A) = P(A and B)/P(A)   [1]
P(A|B) = P(A and B)/P(B)   [2]
Solving [1] for P(A and B) and substituting into [2] gives
Bayes' Theorem:
P(A|B) = [P(B|A)][P(A)]/P(B)
The general formula for Bayes' Theorem is:
$$P(A_i|B) = \frac{P(B|A_i)\,P(A_i)}{\sum_{j} P(B|A_j)\,P(A_j)}$$
where the events $A_j$ are mutually exclusive and exhaustive.
144
Bayes' Theorem Cont
Example 4.7:
Suppose there is a certain disease randomly found in
0.5% (.005) of the general population. A certain clinical blood
test is 99% effective in detecting the presence of the
disease among persons with the disease. But it also
yields false-positive results in 5% of individuals without
the disease. The following tables show the probabilities
that are stipulated in the example and the probabilities
that can be inferred from the stipulated information:
(Source: http://faculty.vassar.edu/lowry/bayes.html)
145
Bayes' Theorem Cont
Given:
P(A) = .005
The probability that the disease will be present in any
particular person
P(~A) = 1 - .005 = .995
The probability that the disease will not be present in
any particular person
P(B|A) = .99
The probability that the test will yield a positive result
[B] if the disease is present [A]
P(~B|A) = 1 - .99 = .01
The probability that the test will yield a negative result
[~B] if the disease is present [A]
P(B|~A) = .05
The probability that the test will yield a positive result
[B] if the disease is not present [~A]
P(~B|~A) = 1 - .05 = .95
The probability that the test will yield a negative result
[~B] if the disease is not present [~A]
146
Bayes' Theorem Cont
Given this information, the derivation of two simple
probabilities is possible using the conditional probability
formula.
P(B) = [P(B|A) x P(A)] + [P(B|~A) x P(~A)]
= [.99 x .005] + [.05 x .995] = .0547
The probability of a positive test result
[B], irrespective of whether the disease
is present [A] or not present [~A]
P(~B) = [P(~B|A) x P(A)] + [P(~B|~A) x P(~A)]
= [.01 x .005] + [.95 x .995] = .9453
The probability of a negative test result
[~B], irrespective of whether the
disease is present [A] or not present
[~A]
147
Bayes' Theorem Cont
Then it is possible to calculate the remaining
probabilities.
P(A|B) = [P(B|A) x P(A)] / P(B)
= [.99 x .005] / .0547 = .0905
The probability that the disease is present [A] if
the test result is positive [B]
P(~A|B) = [P(B|~A) x P(~A)] / P(B)
= [.05 x .995] / .0547 = .9095
The probability that the disease is not present
[~A] if the test result is positive [B]
P(~A|~B) = [P(~B|~A) x P(~A)] / P(~B)
= [.95 x .995] / .9453 = .99995
The probability that the disease is absent [~A] if
the test result is negative [~B]
P(A|~B) = [P(~B|A) x P(A)] / P(~B)
= [.01 x .005] / .9453 = .00005
The probability that the disease is present [A] if
the test result is negative [~B]
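The whole chain of calculations in Example 4.7 can be sketched as one small Python function. The function and argument names (bayes_screening, prevalence, sensitivity, false_positive_rate) are assumptions of this sketch; the numbers are the ones stipulated in the example.

```python
def bayes_screening(prevalence, sensitivity, false_positive_rate):
    """Derive P(B), P(~B) and the four posterior probabilities for a test."""
    p_a = prevalence
    p_not_a = 1 - p_a
    # Total probability of a positive [B] and of a negative [~B] result
    p_b = sensitivity * p_a + false_positive_rate * p_not_a
    p_not_b = 1 - p_b
    return {
        "P(B)": p_b,
        "P(~B)": p_not_b,
        "P(A|B)": sensitivity * p_a / p_b,
        "P(~A|B)": false_positive_rate * p_not_a / p_b,
        "P(A|~B)": (1 - sensitivity) * p_a / p_not_b,
        "P(~A|~B)": (1 - false_positive_rate) * p_not_a / p_not_b,
    }


result = bayes_screening(prevalence=0.005, sensitivity=0.99,
                         false_positive_rate=0.05)
for label, value in result.items():
    print(label, round(value, 5))
```

Even with a test that detects 99% of cases, only about 9% of positive results come from people who have the disease, because the disease is rare.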
148
Summary of the Basic Properties of
Probability
1. The value of a probability can only be in the range 0 ≤ p ≤ 1.
2. If an event is certain to occur, its probability is 1 and if an
event is certain not to occur, its probability is 0.
3. If two events are mutually exclusive (disjoint), the
probability that one or the other will occur equals the
sum of the probabilities: p(A or B) = p(A) + p(B)
4. If A and B are two events, not necessarily disjoint, then
p(A or B) = p(A) + p(B)-p(A and B)
5. The sum of the probabilities that an event will occur and
that it will not occur is equal to 1.
6. If A and B are two independent events then p(A and B) =
p(A)p(B)
7. p(A|B) = p(A ∩ B)/p(B)
149
Random Variable and Probability
Distribution
Random Variable
Any characteristic that can be measured or categorized
is called a Variable.
If a variable can assume a number of different values so
that any particular outcome is determined by chance, it is
called a Random Variable.
A Random Variable is a function, which assigns unique
numerical values to all possible outcomes of a random
experiment under fixed conditions.
151
Random Variable Cont
Example 4.8
Three students are taken at random from this classroom. Suppose our
interest is the number of female students that we will get among the
three sampled students. The possible outcomes, with the number of
females in each, are:
Outcome   No of Females
MMM       0
MMF       1
MFM       1
FMM       1
MFF       2
FMF       2
FFM       2
FFF       3
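The table of outcomes can be generated mechanically, which also shows how the random variable maps each outcome to a number. This is a small illustrative sketch; the function name number_of_females is an assumption, and no probabilities are attached to the outcomes here.

```python
from collections import Counter
from itertools import product

# All ordered outcomes for three sampled students (M = male, F = female).
outcomes = ["".join(combo) for combo in product("MF", repeat=3)]


def number_of_females(outcome):
    """The random variable: assigns a number to each outcome."""
    return outcome.count("F")


for outcome in outcomes:
    print(outcome, number_of_females(outcome))

# How many outcomes lead to each value of the random variable.
outcomes_per_value = Counter(number_of_females(o) for o in outcomes)
print(dict(outcomes_per_value))
```

The counts show that several different outcomes can map to the same value of the random variable.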
152
Random Variable Cont
There are two types of random variables.
A Continuous Random Variable is one that can take any value
within an interval, so it has infinitely many possible values; and,
A Discrete Random Variable is one that takes a finite (or countable)
number of distinct values.
Example 4.9:
A coin is tossed 10 times. The random variable X is the
number of tails that are noted. X can only take the values
0, 1, ..., 10, so X is a Discrete Random Variable.
A light bulb is burned until it burns out. The random
variable Y is its lifetime in hours. Y can take any positive
real value, so Y is a Continuous Random Variable.
153
Probability Distributions
Every Random Variable has a corresponding Probability
Distribution.
A Probability Distribution applies the theory of probability
to describe the behavior of the random variable.
In the discrete case, it specifies all possible outcomes of
the random variable along with the probability that each
will occur.
In the continuous case, it allows us to determine the
probabilities associated with specified ranges of values.
154
Discrete Probability Distribution
Usually represented by a table.
Example 4.10:
Table 4.1: Probability Distribution of a random variable X
representing the birth order of children born in the US.
x       P(X=x)
1       0.416
2       0.330
3       0.158
4       0.058
5       0.021
6       0.009
7       0.004
8+      0.004
Total   1.000
155
Continuous Probability
Distributions
Since a continuous random variable can assume an infinite
number of values, its distribution cannot be expressed in tabular
form. Instead, an equation or graph describes it.
The equation used to describe a continuous probability
distribution is called a Probability Density Function
(PDF).
PDF has the following properties:
156
Continuous Probability
Distributions Cont..
The area bounded by the curve of the density function
and the x-axis is equal to 1, when computed over the
domain of the variable.
The probability that a random variable assumes a value
between a and b is equal to the area under the density
function bounded by a and b.
The probability that a continuous random variable will
equal a specific value is always zero.
157
Binomial Distribution
A discrete probability distribution.
It handles a dichotomous (binary, Bernoulli) random
variable,
i.e. a variable which has only two outcomes (success and
failure).
Each trial is called a Bernoulli trial.
The experiment consists of n repeated trials.
Each trial can result in just two possible outcomes.
The probability of success, denoted by P, is the
same on every trial.
The trials are independent.
158
Binomial Distribution Cont..
b(x; n, P): The probability that an n-trial binomial
experiment results in exactly x successes, when the
probability of success on an individual trial is P.
$$b(x; n, P) = {}^{n}C_{x} \cdot P^{x} \cdot (1 - P)^{n - x}$$
159
Binomial Distribution Cont..
Example 4.11:
Suppose a die is tossed 5 times. What is the probability
of getting exactly 2 fours?
Suppose in Addis Ababa the probability of a commercial
sex worker being HIV positive is 0.15. If we consider 5
randomly selected commercial sex workers in the city,
what is the probability that exactly 2 of them will be
HIV positive?
160
Binomial Distribution Cont..
Cumulative Binomial Probability:
Refers to the probability that the binomial random
variable falls within a specified range (e.g., is greater
than or equal to a stated lower limit and less than or
equal to a stated upper limit).
161
Binomial Distribution Cont
Example 4.12:
The probability that a student is accepted to a
prestigious college is 0.3. If 5 students from the same
school apply, what is the probability that at most 2 are
accepted?
What is the probability of getting 4 or more HIV positives
among 5 randomly selected sex workers given that the
probability of a commercial sex worker to be HIV positive
is 0.15?
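The binomial questions in Examples 4.11 and 4.12 can be worked with a short Python sketch of the formula b(x; n, P). The function names binomial_pmf and binomial_between are assumptions of this sketch; math.comb supplies the number of combinations.

```python
from math import comb


def binomial_pmf(x, n, p):
    """b(x; n, p): probability of exactly x successes in n independent trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


def binomial_between(low, high, n, p):
    """Cumulative probability that low <= X <= high."""
    return sum(binomial_pmf(x, n, p) for x in range(low, high + 1))


# Example 4.11
two_fours = binomial_pmf(2, 5, 1 / 6)              # exactly 2 fours in 5 tosses
two_hiv_positive = binomial_pmf(2, 5, 0.15)        # exactly 2 positives among 5

# Example 4.12
at_most_two_accepted = binomial_between(0, 2, 5, 0.3)
four_or_more_positive = binomial_between(4, 5, 5, 0.15)

print(round(two_fours, 4), round(two_hiv_positive, 4),
      round(at_most_two_accepted, 4), round(four_or_more_positive, 4))
```

Cumulative probabilities are obtained simply by adding the individual binomial probabilities over the required range.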
162
Poisson Distribution
A discrete probability distribution.
First introduced by Siméon-Denis Poisson (1781-1840).
It expresses the probability of a number of random
events occurring in a fixed period of time if these events
occur with a known average rate.
A Poisson experiment is a statistical experiment that has
the following properties:
163
Poisson Distribution Cont
The experiment results in outcomes that can be
classified as successes or failures.
The average number of successes (μ) that occurs in a
specified period is known.
The probability that a success will occur is
proportional to the duration of the time.
The probability that a success will occur in an
extremely small time interval is virtually zero.
Note that the distribution can also be used to quantify the
probability of occurrence of an event in a length, an area,
a volume, etc.
164
Poisson Distribution Cont
The following notations are important:
e: A constant equal to approximately 2.71828.
μ: The mean number of successes (occurrences
of an event) that occur in a specified period of
time.
x: The actual number of successes that occur in
a specified period of time.
P(x; μ): The Poisson probability that exactly x
successes occur in a Poisson experiment,
when the mean number of successes is μ.
165
Poisson Distribution Cont
Given the mean number of successes (μ) that occur in a
specified period of time, we can compute the Poisson
probability based on the following formula:
$$P(x; \mu) = \frac{e^{-\mu}\,\mu^{x}}{x!}$$
Example 4.13:
Let's assume the average number of deaths from breast cancer
is 2 per day. What is the probability that
exactly 3 will die tomorrow?
μ = 2; since 2 patients die per day, on average.
x = 3; i.e. we want the probability that 3 will die tomorrow.
e = 2.71828
166
Poisson Distribution Cont
We put these values into the formula as follows:
$$P(x; \mu) = \frac{e^{-\mu}\,\mu^{x}}{x!}$$
$$P(3; 2) = \frac{(2.71828^{-2})(2^{3})}{3!} = \frac{(0.13534)(8)}{6} = 0.180$$
Thus, the probability of exactly 3 deaths tomorrow is
0.180.
167
Poisson Distribution Cont
Example 4.14:
In a study of suicides, a researcher found that the
monthly distribution of adolescent suicides in the US follows
a Poisson distribution with parameter μ = 2.75. Find the
probability that a randomly selected month will be one in
which three adolescent suicides occur.
$$P(x; \mu) = \frac{e^{-\mu}\,\mu^{x}}{x!}$$
$$P(3; 2.75) = \frac{(e^{-2.75})(2.75^{3})}{3!} = 0.222$$
168
Poisson Distribution Cont
If the number of admissions in a hospital is 10 per hour
on average, determine the probability that, in any hour
there will be:
0 admissions;
6 admissions;
Less than 2 admissions.
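The Poisson examples above, including the hospital admissions exercise, can be worked with the sketch below. The function name poisson_pmf is an assumption of this sketch, and the symbol for the mean is written as mu.

```python
from math import exp, factorial


def poisson_pmf(x, mu):
    """P(x; mu): probability of exactly x events when the mean count is mu."""
    return exp(-mu) * mu ** x / factorial(x)


# Example 4.13: mean of 2 deaths per day, exactly 3 tomorrow
three_deaths = poisson_pmf(3, 2)

# Example 4.14: mean of 2.75 suicides per month, exactly 3 in a month
three_suicides = poisson_pmf(3, 2.75)

# Admissions exercise: mean of 10 admissions per hour
no_admissions = poisson_pmf(0, 10)
six_admissions = poisson_pmf(6, 10)
fewer_than_two = poisson_pmf(0, 10) + poisson_pmf(1, 10)

print(round(three_deaths, 3), round(three_suicides, 3))
print(no_admissions, round(six_admissions, 4), fewer_than_two)
```

"Less than 2 admissions" means 0 or 1 admissions, so the two probabilities are added.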
169
Normal Distribution
Is the most important probability distribution function.
It is also known as the Gaussian Distribution.
Named after Carl Friedrich Gauss (1777-1855).
Given by the formula:
$$Y = \frac{1}{\sigma\sqrt{2\pi}} \; e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}$$
The formula is affected by two main factors: the mean (μ) and
the SD (σ).
Normal Distribution Cont
Normal distribution has the following characteristics:
1. Bell shaped
2. Symmetrical about the mean
3. Unimodal
4. Mean, median and mode are equal
5. Area under the curve is 1
6. Extends from negative infinity to positive infinity
The normal distribution can be used to describe, at
least approximately, any variable that tends to cluster
around the mean (mainly as a result of the central limit
theorem).
Skewness, Kurtosis, and
Normal Curve
Skewness and kurtosis are used to measure normality.
Significant skewness and kurtosis indicate that data are
not normal.
Skewness is a measure of asymmetry.
For univariate data Y_1, Y_2, ..., Y_N, the formula for
skewness is:
$$\text{skewness} = \frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^{3}}{(N - 1)\, S^{3}}$$
where Ȳ is the mean, S is the standard deviation,
and N is the number of data points.
The skewness for a normal distribution is zero, and any
symmetric data should have a skewness near zero.
Skewness, Kurtosis Cont
Kurtosis is a measure of whether the data are peaked or
flat relative to a normal distribution.
For univariate data Y_1, Y_2, ..., Y_N, the formula for kurtosis
is:
$$\text{kurtosis} = \frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^{4}}{(N - 1)\, S^{4}}$$
The kurtosis for a normal distribution is three.
For this reason, some use the following definition of
kurtosis (often referred to as "excess kurtosis"):
$$\text{excess kurtosis} = \frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^{4}}{(N - 1)\, S^{4}} - 3$$
With this definition, positive kurtosis indicates a "peaked" distribution and
negative kurtosis indicates a "flat" distribution.
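As a rough illustration, both measures can be computed directly from their moment definitions. This sketch assumes the form in which the sum of cubed (or fourth-power) deviations is divided by (N - 1) times the matching power of the standard deviation; other texts and packages divide by N instead, so values may differ slightly. The function name and the data are hypothetical.

```python
import math

def skewness_and_kurtosis(values):
    """Return (skewness, kurtosis, excess kurtosis) for a list of numbers.

    Assumes the (N - 1) divisor for both the SD and the moment sums.
    """
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((y - mean) ** 2 for y in values) / (n - 1))
    skew = sum((y - mean) ** 3 for y in values) / ((n - 1) * sd ** 3)
    kurt = sum((y - mean) ** 4 for y in values) / ((n - 1) * sd ** 4)
    return skew, kurt, kurt - 3

symmetric = [1, 2, 3, 4, 5]          # hypothetical symmetric data
right_skewed = [1, 1, 2, 2, 3, 10]   # hypothetical data with a long right tail

print(skewness_and_kurtosis(symmetric))     # skewness is zero for symmetric data
print(skewness_and_kurtosis(right_skewed))  # skewness is positive
```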
Normality Test
Normality tests assess the likelihood that the given data
set comes from a normal distribution.
This is an important aspect of statistics, as many procedures
assume normality.
Typically the null hypothesis H_0 is that the observations
are distributed normally with unspecified mean μ and
variance σ².
The alternative H_a is that the distribution is arbitrary.
A great number of tests (over 40) have been devised for
this problem; the more prominent of them are outlined
below:
Normality Test Cont
The simplest method of assessing normality is to look at
the frequency distribution histogram (symmetry,
peakedness of the curve, modality of the distribution).
The other option is the use of probability plots.
A probability plot is a graphical technique for comparing
two datasets: either two sets of empirical observations,
one empirical set against a theoretical set, or two
theoretical sets against each other.
It is a common way of assessing normality, i.e. by
comparing the given data against the normal distribution.
It has two variants: the Q-Q plot and the P-P plot.
Normality Test Cont
Quantile-Quantile Plot (Q-Q plot):
Compares two probability distributions by plotting
their quantiles against each other.
If the two distributions being compared are similar,
the points in the Q-Q plot will approximately lie on the
line y = x.
Probability-Probability plot (P-P plot):
Compares two probability distributions by plotting
their cumulative distribution functions against each
other.
Normality Test Cont
It is possible to assess the normality of data objectively
using statistical tests (for example, the Kolmogorov-Smirnov
test and the Shapiro-Wilk test).
In SPSS:
Analyze > Descriptive Statistics > Explore > enter the
variable under Dependent List > open Plots and check
"Normality plots with tests" > Continue > OK.
But such tests have serious limitations:
Small samples almost always pass a normality test,
With large samples minor deviations from normality
may be flagged as statistically significant.
Normal Distribution Cont
Application of the normal distribution to calculate probability:
1. Area under the curve is 1,
2. Probability of x > a is the area between a and positive
infinity,
3. Probability of x < a is the area between negative
infinity and a,
4. Probability of b < x < a is the area between b and a,
5. Probability of x = a is zero,
6. The empirical rule: about 68%, 95% and 99.7% of values lie
within 1, 2 and 3 SDs of the mean, respectively.
But how can we compute the area?
Standard Normal Distribution
Is a normal distribution with a mean of 0 and a standard
deviation of 1.
Any point (x) from a normal distribution can be converted
to the standard normal distribution (Z) with the formula:
Z = (x-mean)/standard deviation.
The corresponding area can be read from a standard
normal table.
Standard Normal Distribution
Cont..
Example 4.15:
A student is 1.4m tall. The mean height for
students of his age and sex is 1.2m with a standard
deviation of 0.4m.
What is the corresponding Z value for the student?
What is the probability that a student is taller than
1.4m?
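The Z transformation and the table lookup can be sketched in Python, assuming the standard library's `statistics.NormalDist` stands in for the standard normal table. The variable names are illustrative.

```python
from statistics import NormalDist

mean_height = 1.2   # metres, from Example 4.15
sd_height = 0.4     # metres
x = 1.4             # the student's height

# Z = (x - mean) / standard deviation
z = (x - mean_height) / sd_height

# P(Z > z) is the area to the right of z under the standard normal curve
p_taller = 1 - NormalDist().cdf(z)

print(round(z, 2))         # 0.5
print(round(p_taller, 4))  # 0.3085
```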
Standard Normal Distribution
Cont..
Example 4.16:
Assume a distribution of blood glucose level among
medical students is normally distributed with mean of
90mg/dl and SD of 6mg/dl. Student X has mean glucose
level of 100mg/dl. Another student Y has mean glucose
level of 80mg/dl.
What is the Z score for student X?
What is the Z score for student Y?
What is the probability of getting mean glucose level
less than 100mg/dl ?
What is the probability of getting mean glucose level
less than 80mg/dl ?
Standard Normal Distribution
Cont..
What range around the mean encompasses
68% of the observations?
What is the probability for a student to have a blood
glucose level between 100 and 105 mg/dl?
Standard Normal Distribution
Cont..
Example 4.17:
Among pregnant women having ANC follow-up in a
hospital, WBC count follows normal distribution with
mean of 8,000 and standard deviation of 800.
What is the probability to get WBC more than 10,000
in those pregnant women?
What is the probability to get WBC count between
7,500 and 10,000?
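Probabilities for an interval are differences of cumulative areas. A minimal sketch for Example 4.17, again assuming `statistics.NormalDist` in place of the printed table:

```python
from statistics import NormalDist

# WBC count among the pregnant women: mean 8,000 and SD 800 (Example 4.17)
wbc = NormalDist(mu=8000, sigma=800)

# P(WBC > 10,000): area to the right of 10,000 (Z = 2.5)
p_above_10000 = 1 - wbc.cdf(10000)

# P(7,500 < WBC < 10,000): area up to 10,000 minus area up to 7,500
p_between = wbc.cdf(10000) - wbc.cdf(7500)

print(round(p_above_10000, 4))  # 0.0062
print(round(p_between, 4))
```

The second probability comes out at roughly 0.73, which matches what the table gives for Z between -0.625 and 2.5.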
Standard Normal Distribution
Cont..
1. Suppose in BL Hospital the probability of a donated blood to be
positive to Hepatitis B is 0.2. If we consider 4 randomly selected
donated bloods, what is the probability that exactly 2 of the
samples will be positive for Hepatitis B?
2. Suppose that systolic blood pressures follow a normal distribution
with a mean of 108 and a SD of 14. According to this information
attempt the following questions.
About 95% of the blood pressures are between ____ & ____.
About ______% of the blood pressures are between 66 & 150
What is the probability that a patient's BP is > 120?
What is the probability that the patient's BP is between 110 & 130?
What is the probability that a patient's BP is < 108?
Introduction to Demographic
Methods and Health Service
Statistics
What is Demography?
Demos + graphy
Is a discipline that studies human population with respect to
size, composition, distribution, mobility and its variation with
respect to all the above features and the causes of such
variations and the effect of all these on health,
environmental, social, ethical and economic conditions.
Demography as a method and data.
Demography studies a population in static and dynamic
aspects.
Static aspects include characteristics at a point in time such
as composition by Age, Sex, Race, Marital status etc.
Dynamic aspects are Fertility, Mortality, Nuptiality, Migration
and Growth.
Source of Demographic Data
Demographic data can be acquired through three
methods:
Census
Survey
Vital Registration
Census
Refers to the total process of collecting, compiling,
analyzing, and publishing or otherwise disseminating
demographic, economic, and social data pertaining to all
persons in a country or in a well-delineated part of a
country at a specified time.
Census has the following characteristics:
Universality
Simultaneity
Individual enumeration
Regular interval
Census Cont..
The first real census was conducted in the UK in 1841.
However, there is evidence of large-scale counting of
populations starting from the prehistoric period.
Content of Census
Demographic data
Economic data
Social data
Mortality and Birth
Approaches to Census
De jure:
The enumeration is according to the legal or customary
place of residence.
i.e. people are registered where they usually reside.
Such type of counting gives information relatively
unaffected by seasonal and temporary movements.
However, this might not be accurate when a person's
legal or customary residence is not known.
It also creates risk of omission and double counting.
Information collected from a person away from his/her
usual residence can also be incomplete.
Approaches Cont
De facto:
The enumeration is according to physical residence at
the time of the census.
i.e. people are registered where they are currently
staying/residing at the time of the census.
This method is advantageous in the sense that it has a
lower chance of double counting or omission.
However, if it is applied in areas where there is high level
of migration and mobility, the result can be distorted.
Advantage and Disadvantage of
Census
Advantage
It represents the whole population,
Serves as sampling frame for further studies,
Provides population denominators,
Provides small area data.
Disadvantage
Size limits content and quality control efforts,
Cost limits frequency,
Delay between field work and results,
Sometimes politicized.
Vital Registration (Civil Registration)
Vital Registration is continuous registration of vital
events as they happen.
What are the vital events?
Vital Registration is relatively modern concept in its
present format.
The major purpose of vital registration is primarily
administrative.
Vital Registration has got the following features:
Continuity
Universality
Advantages of Vital Registration
Continuously monitors vital rates,
May provide both numerator and denominator for
some rates,
Small area data available,
Can be used as base for testing the accuracy of
censuses and surveys,
Once a system is established, it would be cost
effective.
Disadvantages of Vital Registration
Uncertain coverage,
It is difficult to establish the system,
Information may come from third party,
It can easily be disrupted by political/economic events.
Survey
Refers to the process of obtaining information from a
sample representative of some population at a given
point in time.
How can we make it representative?
Survey can be of two types:
Single-round retrospective survey
Multi-round follow-up survey
The content of survey widely varies.
Features of Survey:
Representativeness,
Smaller size
More in-depth information.
Advantage and Disadvantage of
Survey
Advantages:
Quick and inexpensive,
Gives detailed data,
Follow up can be achieved
Limitations:
Small area data might not be available,
Perfect representativeness is difficult to achieve,
A survey can only be focused on few thematic areas.
Demographic Transition
Conceptual framework to explain population change over
time.
Developed by American demographer Warren
Thompson, 1929.
Observed changes in birth and death rates in
industrialized societies over the past two hundred years.
Demographic change has got three stages.
Developed countries started the second stage in the
beginning of eighteenth century. Less developed
countries began the transition later.
Demographic Transition
Cont
Stage I: Characterized by high and fluctuating mortality,
high fertility and low population growth.
Stage II: Characterized by beginning of mortality decline
followed by fertility decline. This is the period of rapid
population growth.
Stage III: Characterized by low mortality, low and
fluctuating fertility, growth slows down and eventually
reaches a no-growth stage.
Important Indicators of Composition
of a Population
1. Sex Ratio: The number of males per
1000 females in the population. It can be expressed as Y to
1000, Y:1 or Y/X, where Y is the number of males and X is
the number of females.
2. Child to Women Ratio: The ratio of the number of
children under five to the number of women of reproductive
age in a given place and time. It can also be used as
a measure of fertility.
3. Dependency Ratio: Describes the ratio between the nonproductive
(age 0-14 and 65+) and productive (15-64)
age groups in a given place and time.
4. Population Pyramid:
Population Pyramid
A graphical illustration that shows the distribution of
various age groups in a population.
Normally forms the shape of a pyramid.
Consists of two back-to-back bar graphs, with the
population plotted on the X-axis and age on the Y-axis,
One showing the number of males and one showing
females in a particular population in five-year age
groups.
Males are shown on the left and females on the right.
Vital Statistics
Among the focus of demography, some of the issues are
more important and applicable in public health.
Especially the measures of mortality and fertility are vital
inputs to the health system so they are called Vital
Statistics.
Measures of Fertility
Crude Birth Rate (CBR): The number of live births in a
year per 1000 mid-year population in the same year.
$$\text{CBR} = \frac{\text{Total number of live births in a year}}{\text{Mid-year population in the same year}} \times 1000$$
Measures of Fertility Cont..
General Fertility Rate (GFR): The number of live births
in a year per 1000 mid-year women of reproductive age.
$$\text{GFR} = \frac{\text{Total number of live births in a year}}{\text{Mid-year female population aged 15-49 yrs in the same year}} \times 1000$$
Measures of Fertility Cont..
Age Specific Fertility Rate (ASFR): Refers to the
number of live births in a year per 1000 women of
reproductive age in a given age or age group.
Usually ASFR is calculated for the following 7 five-year age
groups: 15-19, 20-24, 25-29, 30-34, 35-39, 40-44 and 45-49 yrs.
$$\text{ASFR} = \frac{\text{Total no. of live births to women of a given age group during a year}}{\text{Mid-year female population for the same age group in the same year}} \times 1000$$
Measures of Fertility Cont..
Age category ASFR
15-19 104
20-24 228
25-29 241
30-34 231
35-39 160
40-44 84
45-49 34
Measures of Fertility Cont..
Total Fertility Rate (TFR): The number of children a
woman is expected to have by the end of her reproductive
age if the current ASFRs are maintained.
Mathematically, it is the sum of all ASFRs from 15-49
yrs.
TFR for data given in the usual 5-year age categories is
provided as:
$$\text{TFR} = 5 \times \sum_{i=1}^{7} \text{ASFR}_i$$
Measures of Fertility Cont..
Gross Reproduction Rate (GRR): Is the total fertility
rate restricted to female births only.
$$\text{GRR} = \text{TFR} \times \text{Proportion of female births} \times 1000$$
Measures of Fertility Cont..
Child Ever Born (CEB):
Total number of children a woman has ever given birth
to.
It is the average number of children a woman has in a
given study area.
Measures of Fertility Cont..
Example 5.1:
Calculate ASFR, TFR, GFR, CBR from the following
data.
Measures of Fertility Cont..
Age category   Women of reproductive age   Live births   ASFR
15-19          15,600                      1,596
20-24          14,400                      3,300
25-29          13,300                      3,210
30-34          12,200                      2,830
35-39          11,600                      1,860
40-44          10,100                      850
45-49          9,200                       320
Total          86,400                      13,966
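One way to work through Example 5.1 is sketched below in Python. Two assumptions are made loudly here: the TFR is divided by 1000 so that it reads as children per woman (the ASFRs are per 1000 women), and the CBR is left out because the table gives no total mid-year population. Variable names are illustrative.

```python
age_groups = ["15-19", "20-24", "25-29", "30-34", "35-39", "40-44", "45-49"]
women = [15600, 14400, 13300, 12200, 11600, 10100, 9200]   # mid-year women
births = [1596, 3300, 3210, 2830, 1860, 850, 320]          # live births

# ASFR: live births per 1000 women in each age group
asfr = [1000 * b / w for b, w in zip(births, women)]

# GFR: all live births per 1000 women aged 15-49
gfr = 1000 * sum(births) / sum(women)

# TFR: 5 x sum of ASFRs (5-year groups); divided by 1000 to express it
# per woman rather than per 1000 women (an assumption of this sketch)
tfr = 5 * sum(asfr) / 1000

for group, rate in zip(age_groups, asfr):
    print(group, round(rate, 1))
print("GFR:", round(gfr, 1))   # GFR: 161.6
print("TFR:", round(tfr, 2))   # TFR: 5.42
```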
Measures of Mortality
Crude Death Rate (CDR): Refers to the total number of
deaths in a given area, usually in a year, per 1000 mid-year
population.
$$\text{CDR} = \frac{\text{Total number of deaths per year}}{\text{Mid-year population}} \times 1000$$
Measures of Mortality
Age Specific Death Rate (ASDR): Quantifies deaths
occurring in a defined age category in a given area per
1000 mid-year population of the same age category.
$$\text{ASDR} = \frac{\text{No. of deaths in a given age category in a year}}{\text{Mid-year population of that age category in the same year}} \times 1000$$
Measures of Mortality
Neonatal Mortality Rate (NMR): The number of
deaths before the age of 28 days (neonatal period) in a
year per 1000 live births in the same year.
Infant Mortality Rate (IMR): The number of deaths
before the age of 1 year (infancy period) in a year per
1000 live births in the same year.
Under Five Mortality Rate (U5MR): Quantifies the
probability of dying between birth and age five per 1000
live births in a given year.
Child Death Rate (ChDR): Quantifies the probability of
dying between the ages of one and five years per 1000 live
births in a given year.
Measures of Mortality
Cause Specific Mortality Rate (CSMR):
$$\text{CSMR} = \frac{\text{No. of deaths secondary to a given cause in a year}}{\text{Population at risk}} \times 1000$$
Cause Specific Death Ratio (Proportionate
Mortality Ratio):
$$\text{Proportionate Mortality Ratio} = \frac{\text{No. of deaths secondary to a cause in a year}}{\text{Total no. of deaths in the same year}} \times 1000$$
Measures of Mortality
Maternal Mortality Ratio:
$$\text{MMR}_{\text{ratio}} = \frac{\text{Number of maternal deaths in a given year}}{\text{Total number of live births in the same year}} \times 100000$$
Maternal Mortality Rate:
$$\text{MMR}_{\text{rate}} = \frac{\text{Number of maternal deaths in a given year}}{\text{Total number of women of reproductive age in the same year}} \times 100000$$
Measures of Migration
Crude In-Migration Rate: Number of in-migrants (I)
per 1,000 population in a given year.
Crude Out-Migration Rate: Number of out-migrants
(O) per 1,000 population in a given year.
Crude Net Migration Rate: Difference between the
number of in-migrants (I) and number of out-migrants
(O) per 1000 population in a given year.
Measures of Marriage
Crude Marriage Rate: Number of marriages (M) per
1000 population in a given year.
General Marriage Rate: Number of marriages (M) per
1000 population aged 15 and older in a given year.
Measure of Population Growth
and Projection
Crude Rate of Natural Increase (r):
$$r = \text{CBR} - \text{CDR}$$
Population Projection:
$$P_t = P_0 (1 + r)^{t}$$
Population Doubling Time:
$$t = \frac{\log 2}{\log (1 + r)}$$
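The three growth formulas can be chained together as in the following sketch. All input figures are hypothetical, and r is converted from "per 1000" to a proportion before it is used in the projection, which is an assumption about how the slide intends r to be applied.

```python
import math

cbr = 35.0   # hypothetical crude birth rate per 1000 population
cdr = 10.0   # hypothetical crude death rate per 1000 population

# Crude rate of natural increase, expressed as a proportion per year
r = (cbr - cdr) / 1000

p0 = 1_000_000   # hypothetical base-year population
t = 10           # years ahead

# Population projection: P_t = P_0 (1 + r)^t
p_t = p0 * (1 + r) ** t

# Doubling time: t = log 2 / log(1 + r)
doubling_time = math.log(2) / math.log(1 + r)

print(round(r, 3))              # 0.025
print(round(p_t))               # 1280085
print(round(doubling_time, 1))  # 28.1
```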
Health Service Statistics
Data generated from the health system itself.
Advantages:
Gives morbidity information
Identify priority health problem in the area.
Determine met and unmet health need.
Determine success or failure of specific
health care program.
Assess utilization of health service.
Health Service Statistics Cont..
Limitations
Lack of completeness
Lack of representativeness to the general
community
Lack of denominators
Lack of uniformity
Lack of quality
Lack of compliance with reporting
Health Service Statistics Cont..
1. Relative Frequency of a Disease:
$$\text{Relative Frequency of a given disease} = \frac{\text{No. of patients diagnosed with a specific disease}}{\text{Total number of health institution visits}} \times 100\%$$
2. Cure Rate:
Quantifies the proportion of patients who have been cured
of a disease condition using a treatment modality out of
100 patients who received a similar type of treatment.
The term Success Rate can be used if the measured
parameter is a procedure.
$$\text{Cure Rate} = \frac{\text{No. of cured patients of a given disease using a treatment modality}}{\text{Number of patients who received the treatment}} \times 100\%$$
Health Service Statistics
Cont..
3. Admission Rate:
Quantifies the proportion of admissions among
patients who visited the health institution in a given
period of time.
$$\text{Admission Rate} = \frac{\text{No. of patients admitted to a health institution}}{\text{Total number of patients who visited the institution}} \times 100\%$$
4. Hospital Death Rate:
Quantifies the proportion of deaths among hospitalized
patients in a given period of time.
$$\text{Hospital Death Rate} = \frac{\text{No. of deaths among hospitalized patients}}{\text{Total no. of admissions}} \times 100\%$$
Health Service Statistics
Cont..
5. Bed Occupancy Rate:
Quantifies percentage occupancy of hospital beds in a
year.
$$\text{BOR} = \frac{\text{Annual number of hospitalized patient days}}{\text{Total number of beds} \times 365} \times 100\%$$
6. Average Length of Stay:
Quantifies the average duration (in days) of stay of hospitalized
patients.
$$\text{ALS} = \frac{\text{Annual number of hospitalized patient days}}{\text{Number of discharges or deaths}}$$
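A small numerical sketch of the bed occupancy rate and average length of stay follows. The hospital figures are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
patient_days = 58400          # hypothetical annual hospitalized patient days
beds = 200                    # hypothetical total number of beds
discharges_or_deaths = 7300   # hypothetical annual discharges plus deaths

# BOR: patient days as a percentage of the available bed days in a year
bor = patient_days / (beds * 365) * 100

# ALS: patient days per discharged (or deceased) patient
als = patient_days / discharges_or_deaths

print(round(bor, 1))  # 80.0
print(round(als, 1))  # 8.0
```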
Sampling Method
Why Sampling?
Sampling is that part of statistical practice concerned with
the selection of individual observations intended to yield
reasonable knowledge about a population of concern,
especially for the purposes of statistical inference.
Study population Vs Target (Source) (Reference)
Population.
Parameter: A descriptive measure computed from the data
of the source population,
Statistic: A descriptive measure computed from the data of
a sample.
The issues of adequate sample size and representative
sampling technique are important for correct estimation of
the parameter using a statistic.
Why Sampling?
Researchers rarely survey the entire population for two
reasons
(1) The cost is too high and
(2) The population is dynamic.
Main advantages of sampling:
(1) The cost is lower,
(2) Data collection is faster, and
(3) It is possible to ensure accuracy and quality of
the data because the dataset is smaller.
Main disadvantage of sampling
Non representativeness (sampling error)
Sampling
Important terms:
Sampling Unit: Is the unit of selection in the sampling
process.
Study Unit: The unit on which information is collected.
Sampling Frame: The list of all the units in the source
population from which a sample is to be taken.
Sampling Fraction: The ratio of the number of units in
the sample to the number of units in the source
population (n/N). Its reciprocal (N/n) is the Sampling Interval.
Types of Sampling
Probability Sampling: Every unit in the population has
a known, non-zero probability, of being sampled and the
process involves random selection.
Nonprobability Sampling: Any sampling method where
some elements of the
population have no chance of selection or where the
probability of selection can't be accurately determined.
Probability Sampling
Simple Random Sampling (SRS)
Systematic Random Sampling
Stratified Sampling
Cluster Sampling
Multistage Sampling
A. Simple Random Sampling (SRS)
Is the purest (the most representative) form.
Each member of the population has an equal, nonzero
and known chance of being selected.
This could be accomplished by writing each study unit's
name on a slip of paper and selecting an adequate
number of them using the Lottery Method.
It can also be done by assigning a number to each
sampling unit and then selecting the sample using a Table
of Random Numbers or computer packages.
How to use table of random
numbers
1. Number each member of the population.
2. Determine population size (N).
3. Determine sample size (n).
4. Determine starting point in table by randomly picking a
page and dropping your finger on the page with your
eyes closed.
5. Choose a direction to read. (to the left, right, down or up)
6. Select the first n numbers read from the table whose last
digits are between 0 and N.
7. Once a number is chosen, do not use it again.
8. If you reach the end of the table before obtaining your n
numbers, pick another starting point, read in a different
direction, and continue until done.
Simple Random Sampling
Cont
When a large dataset is available in a database, statistical
packages can select a random sample of a given size.
In SPSS:
Data > Select Cases > Random > complete the
dialogue box accordingly.
In Excel:
Tools > Data Analysis > Sampling > Complete the
dialogue box accordingly.
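Outside SPSS and Excel, the same kind of draw can be sketched with Python's standard library. The population size, sample size and seed below are hypothetical; `random.Random.sample` selects without replacement, so no unit can be chosen twice.

```python
import random

population_size = 500   # N, hypothetical
sample_size = 20        # n, hypothetical

# Sampling frame: every unit numbered from 1 to N
sampling_frame = list(range(1, population_size + 1))

# A fixed seed makes the draw reproducible; drop it for a fresh draw
rng = random.Random(2010)

# Simple random sample of n distinct units
sample = sorted(rng.sample(sampling_frame, sample_size))
print(sample)
```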
Simple Random Sampling Cont
Limitation of SRS
Requires sampling frame,
Takes longer time.
B. Systematic Random Sampling
Selects units at a fixed interval throughout the sampling
frame after a random start.
The steps are:
Number the units in the population from 1 to N,
Decide on the n (sample size) that you need,
Calculate the sampling interval k (k = N/n),
Randomly select an integer between 1 and k,
Then take every k-th unit.
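The steps can be sketched in Python as follows. The function name is hypothetical, and the sketch assumes integer division for k when N is not an exact multiple of n.

```python
import random

def systematic_sample(population_size, sample_size, rng):
    """Return the unit numbers chosen by systematic random sampling."""
    k = population_size // sample_size   # sampling interval (integer part of N/n)
    start = rng.randint(1, k)            # random start between 1 and k
    # Take the start unit and then every k-th unit after it
    return [start + i * k for i in range(sample_size)]

rng = random.Random(2010)                # fixed seed, for reproducibility only
selected = systematic_sample(100, 10, rng)
print(selected)
```

With N = 100 and n = 10 the interval is 10, so the selected units are always 10 apart and the last one never exceeds 100.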
Systematic Random Sampling
Cont...
Advantage:
It is easier and less time consuming to perform.
Occasionally it can be conducted without a sampling frame.
Disadvantage:
Can be biased when there is a cyclic pattern in the order
of the subjects.
C. Stratified Sampling
Applied when the source population is heterogeneous
on a variable of interest.
The population is first divided into classes (strata).
Then a separate sample is taken from each stratum
using the Simple or Systematic Random Sampling technique.
The number taken from each stratum might be equal
(Non Proportional Stratified Sampling) or the number is
determined based on the proportion of each class in
the source population (Proportional Stratified
Sampling).
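Proportional allocation can be illustrated with a short sketch. The strata, their sizes and the total sample size are hypothetical; the allocation rule assumed here is n_h = n x N_h / N, rounded to the nearest whole unit, so in general the rounded numbers may not add up exactly to n.

```python
import random

strata_sizes = {"urban": 6000, "rural": 3000, "pastoral": 1000}  # hypothetical
total_sample = 200                                               # hypothetical n

def proportional_allocation(sizes, n):
    """Sample size per stratum, proportional to the stratum's population."""
    total = sum(sizes.values())
    return {name: round(n * size / total) for name, size in sizes.items()}

allocation = proportional_allocation(strata_sizes, total_sample)
print(allocation)  # {'urban': 120, 'rural': 60, 'pastoral': 20}

# Within each stratum, draw a simple random sample of the allocated size
rng = random.Random(7)
samples = {
    name: rng.sample(range(1, strata_sizes[name] + 1), k)
    for name, k in allocation.items()
}
```

For non-proportional stratified sampling, the same number of units would simply be drawn from every stratum.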
Stratified Sampling Cont
Advantage: improves representativeness of the sample
(Proportional Stratified Sampling) or it creates
reasonable comparison among strata (Non Proportional
Stratified Sampling).
Limitation: Requires separate sampling frame for each
stratum.
D. Cluster Sampling
Is a sampling method applied when the source
population is composed of natural groups.
Assuming the groups are homogenous among each
other, Cluster sampling selects few groups (clusters)
from the population as Primary Sampling Unit (PSU).
Then the required information is collected from all
elements, Secondary Sampling Units (SSU), within
each selected group.
Cluster Sampling Cont..
Advantage:
It doesn't require a sampling frame of the SSUs.
Requires less time and resource.
Disadvantage:
Relies on the assumption of homogeneity among
clusters.
Less control on sample size.
E. Multistage Sampling
Is like cluster sampling, but involves selecting a sample
within each chosen cluster, rather than including all units
in the cluster.
Thus, multi-stage sampling involves selecting a sample
in at least two stages.
The advantage is that it is simpler than SRS.
The disadvantage is that sampling error inflates as the
number of stages increases.
Probability Proportional to Size
Sampling Technique
PPS is a variant of cluster sampling technique.
Useful when the sampling units vary considerably in
size.
Probability of selecting a sampling unit (e.g., village,
zone, district, health center) is proportional to the size of
its population.
Involves the following procedures
List all clusters with their respective source population
size and cumulative frequency.
Decide the number of clusters (a) which will be included
in the study.
PPS Cont
Decide the number of individuals which will be studied
per one selection of a cluster (b).
Divide the total population by the number of clusters to be
studied. This will give you the sampling interval (SI).
Choose a number between 1 and the SI at random. This
is the Random Start (RS) point.
Calculate the following series: RS; RS + SI; RS + 2SI;
.....RS + (a-1)SI.
Based on the cumulative frequency identify at which
clusters the selected numbers fall.
For every selection of a cluster select b individuals at
random from it. Note that if a cluster is selected twice 2b
individuals should be selected at random.
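The PPS procedure can be sketched as follows. The cluster names, sizes, a = 3, b = 30 and the seed are all hypothetical, and the sampling interval is truncated to an integer, which is a simplification made for this sketch.

```python
import random
from bisect import bisect_left
from itertools import accumulate

# Hypothetical clusters with their source population sizes
cluster_sizes = {"A": 1200, "B": 800, "C": 3000, "D": 500, "E": 1500}
names = list(cluster_sizes)
# Cumulative frequency: 1200, 2000, 5000, 5500, 7000
cumulative = list(accumulate(cluster_sizes.values()))

def cluster_for(point):
    """Cluster whose cumulative range contains the given number."""
    return names[bisect_left(cumulative, point)]

def pps_select(a, rng):
    """Pick a clusters with probability proportional to size."""
    si = cumulative[-1] // a                    # sampling interval (SI)
    rs = rng.randint(1, si)                     # random start (RS)
    points = [rs + i * si for i in range(a)]    # RS, RS + SI, ..., RS + (a-1)SI
    return si, rs, [cluster_for(p) for p in points]

si, rs, chosen = pps_select(3, random.Random(2010))

b = 30  # individuals per selection; a cluster hit twice contributes 2b
individuals_per_cluster = {name: b * chosen.count(name) for name in set(chosen)}
print(si, rs, chosen, individuals_per_cluster)
```

Large clusters such as C span a wider stretch of the cumulative frequency, so a selected number is more likely to fall inside them.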
2. Nonprobability Sampling
Here, the sample is less likely to be representative of
the population, thus it is difficult to extrapolate from the
sample to the population.
Is used when there is no sampling frame or when it is
impossible to conduct probability sampling due to
economic and feasibility factors.
Nonprobability Sampling Cont..
Judgmental or Purposive Sampling: The researcher
chooses the sample based on who he/she thinks would be
appropriate for the study.
Convenience Sampling: The selection of units from the
population is based on availability and/or accessibility.
Quota Sampling: It starts with systematically setting
Quota to represent subgroups of a population. Then
data is collected to meet the predefined Quota.
Snowball Sampling: The researcher begins by identifying
someone who meets the inclusion criteria of the study.
Then the study subject would be asked to recommend
others who s/he may know who also meet the criteria.
Sampling Error
Sampling error or estimation error is part of the total
error or uncertainty caused by observing a sample
instead of the whole population.
Non-sampling errors such as non-response and
reporting errors may also affect the outcome of a sample
based study.
In theory, it equals the estimate from a sample minus the
population value.
Unlike bias, sampling error can be predicted, calculated,
and accounted for.
There are several measures of sampling error.
Sampling Error Cont
1. Standard error
Is a measure of the variability of an estimate due to
sampling.
It indicates the extent to which an estimate derived from
a sample survey can be expected to deviate from the
population value.
Depends upon the underlying variability in the population
for the characteristic as well as the sample size used for
the survey.
The standard error is a foundational measure from which
other sampling error measures are derived.
Sampling Error Cont
2. Confidence intervals:
A range that is expected to contain the population value
of the characteristic with a known probability.
3. Margin of error:
Is a measure of the precision of an estimate at a given
level of confidence.
4. Coefficient of variation (CV):
The relative amount of sampling error in comparison
with a sample estimate.
CV = (SE / Estimate) x 100%
There are no hard and fast rules to define an acceptable level.
The smaller the CV, the more reliable the estimate.
Sampling Error Cont
5. P values:
is the probability of obtaining a test statistic at least as
extreme as the one that was actually observed,
assuming that the null hypothesis is true.
Importance of such measures:
To indicate the statistical reliability and usability of
estimates.
To make comparisons between estimates.
To conduct tests of statistical significance.
To help users draw appropriate conclusions about data.
Exercise 1
A medical practitioner wanted to assess the quality of
family planning service offered in a hospital. Accordingly
he made an exit interview to those women who have ID
number of multiple of five. What sampling method is
employed?
Exercise 2
A medical practitioner wanted to assess the prevalence
of malnutrition among under five children in a woreda.
Assuming all kebeles in the woreda are similar, he
included all under five children in two randomly selected
kebeles.
What sampling method is employed?
What possible limitation do you expect?
Exercise 3
A medical practitioner wanted to assess the prevalence
of malnutrition among under five children in a woreda.
Assuming the problem is different across the three agro-
ecological zones in the woreda he included children from
2 kebeles each from Kolla, Dega and Woynadega.
What sampling method is employed?
What possible limitation do you expect?
Exercise 4
A researcher wanted to study the prevalence of drug
addiction among adolescents in Addis Ababa. First he
randomly selected Bole sub city. Then he selected woreda
17 at random from all woredas in Bole sub city. Finally
he conducted his study in Kebele 19 (after random
selection).
What sampling method is employed?
What possible limitation do you expect?
If woreda 17 was selected because of its proximity to
the organization of the researcher what would have
been the sampling method?
Sampling Distribution and
Estimation
Estimation
Estimation refers to the process by which one makes
inferences about a population, based on information
obtained from a sample.
Can be of two types:
Point Estimation
Interval Estimation
Point Estimate
Point Estimate: A point estimate of a population
parameter is a single value of a statistic.
The following table gives commonly used point
estimators.
Interval Estimate
An interval estimate is defined by two numbers, between
which a population parameter is said to lie.
For example, a < μ < b is an interval estimate of the
population mean μ,
i.e. the population mean is greater than a but less than b.
Interval Estimate Cont.
An interval estimate has got three components (concepts):
A statistic: (the point estimator)
A margin of error: (the measure of precision)
A confidence level: (the measure of uncertainty)
The interval estimate for a given confidence level is
defined by the sample statistic ± margin of error.
An interval estimate is preferred to a point estimate as it
considers the precision and uncertainty of estimation.
Interval Estimate Cont.
Margin of Error
In a confidence interval, the range of values above and
below the sample statistic is called the margin of error.
It measures the precision of a sampling method.
It is a function of the confidence level and another
quantity called the standard error.
Interval Estimate Cont.
Confidence Level
The probability part of the interval.
It describes how strongly we believe that a particular
sampling method will produce an interval that
includes the true population parameter.
Commonly used confidence levels are 90%, 95% and 99%.
For example, 95% CI means: If we used the same
sampling method to select different samples and
compute different interval estimates, the true
population mean would fall within the range defined by
the sample statistic ± margin of error 95% of the
time.
264
Interval Estimate Cont.
Example 6.1:
A local newspaper conducts an election survey and
reports that the independent candidate will receive
30% of the vote. The newspaper states that the
survey had a 5% margin of error and a confidence
level of 95%.
Meaning: We are 95% confident that the independent
candidate will receive between 25% and 35% of the
vote.
265
CI for a single mean
Background Concept: Sampling Distribution of Means.
One can generate sampling distribution of means in the
following manner:
Obtain a sample of n observations selected completely
at random from a large population. Determine their
mean and then replace the observations in the
population.
Repeat the sampling procedure indefinitely.
The result is a series of means of samples of size n.
If each mean in the series is now treated as an individual
observation and arranged in a frequency distribution,
one comes up with the sampling distribution of means of
samples of size n.
266
CI for a single mean cont..
The sampling distribution of means has the following
properties:
1. The mean of the sampling distribution of means is the
same as the population mean.
2. The SD of the sampling distribution of means (which is
called the standard error of the mean) is:
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$
3. The sampling distribution of means is approximately a
normal distribution, regardless of the original distribution,
provided n is large (Central Limit Theorem).
267
CI for a single mean cont..
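The general formula is
CI = Sample statistic ± Z value x SE
$$\Pr\left(-1.96 \le \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \le 1.96\right) = 0.95$$
$$\Pr\left[\bar{X} - 1.96(\sigma/\sqrt{n}) \le \mu \le \bar{X} + 1.96(\sigma/\sqrt{n})\right] = 0.95$$
$$95\%\ \text{CI for } \mu = \bar{X} \pm 1.96(\sigma/\sqrt{n})$$
$$\text{CI for } \mu = \bar{X} \pm Z_{\alpha/2}(\sigma/\sqrt{n})$$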
268
CI for a single mean cont..
However when the population variance is unknown and
the sample size is less than 30:
Sample variance should replace population variance
Student's t distribution should be used in place of the
standard normal distribution.
Hence the formula would be:
$$\text{CI for } \mu = \bar{X} \pm t_{\alpha/2,\,(n-1)}(s/\sqrt{n})$$
269
CI for a single mean cont..
Example 6.2:
The mean blood glucose level of 100 randomly selected
healthy adults is 85mg/dl. Find the 95% CI for the mean
blood glucose level of all healthy adults (μ) given the
standard deviation for the population is 15mg/dl.
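A minimal Python sketch of Example 6.2, assuming the known-sigma formula X̄ ± Z(σ/√n). The function name `z_interval_mean` is illustrative and is not part of the course material.

```python
import math
from statistics import NormalDist

def z_interval_mean(xbar, sigma, n, conf=0.95):
    """CI for a population mean when the population SD is known."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # about 1.96 for 95%
    margin = z * sigma / math.sqrt(n)             # margin of error = Z x SE
    return xbar - margin, xbar + margin

# Example 6.2: n = 100, sample mean 85 mg/dl, population SD 15 mg/dl
low, high = z_interval_mean(xbar=85, sigma=15, n=100)
print(f"95% CI: {low:.2f} to {high:.2f} mg/dl")
```

Under these assumptions the interval is roughly 82.1 to 87.9 mg/dl.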
270
CI for difference between two
means
Background Concept: The Sampling Distribution of
Difference of Means.
Consider two different populations X and Y.
The first population has mean $\mu_x$ and standard
deviation $\sigma_x$.
The second population has mean $\mu_y$ and standard
deviation $\sigma_y$.
From the first population take a sample of size $n_x$ and
compute its mean $\bar{X}$.
From the second population take a sample of size $n_y$
and compute its mean $\bar{Y}$.
Then determine $\bar{X} - \bar{Y}$.
271
CI for difference between two
means cont
Do the same for all pairs of samples that can be chosen
independently from the two populations.
The differences $\bar{X} - \bar{Y}$ are a new set of scores which form
the sampling distribution of differences of means.
272
CI for difference between two
means cont
Properties of the sampling distribution of differences of
means.
1. The mean of the sampling distribution of differences of
means equals the difference of the population means
($\mu_1 - \mu_2$).
2. The SD of the sampling distribution of differences of
means (SE) is equal to:
$$\sigma_{(\bar{X}-\bar{Y})} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
3. The distribution is approximately normal.
273
CI for difference between two
means cont
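$$\Pr\left(-1.96 < \frac{(\bar{X}-\bar{Y}) - (\mu_1-\mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}} < 1.96\right) = 0.95$$
$$\Pr\left[(\bar{X}-\bar{Y}) - 1.96\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}} \le \mu_1-\mu_2 \le (\bar{X}-\bar{Y}) + 1.96\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}\right] = 0.95$$
$$95\%\ \text{CI of } (\mu_1-\mu_2) = (\bar{X}-\bar{Y}) \pm 1.96\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}$$
$$\text{CI of } (\mu_1-\mu_2) = (\bar{X}-\bar{Y}) \pm Z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}$$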
274
CI for difference between two
means cont.
Example 6.3:
A random sample of 120 HIV patients who were on ART
had lived for an average of 25 years, with SD of 5 years, since
their diagnosis was made. Similarly, a
random sample of 140 HIV patients who were not on
ART had lived for an average of 14 years, with SD of 4 years.
Calculate the point estimate for the difference between
the population means.
Find the 95% CI for the difference between the means.
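A short Python sketch of Example 6.3. It assumes the samples are large enough for the sample SDs to stand in for the population SDs in the Z-based formula. The function name is illustrative.

```python
import math
from statistics import NormalDist

def z_interval_diff_means(m1, s1, n1, m2, s2, n2, conf=0.95):
    """Point estimate and CI for the difference between two means."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    diff = m1 - m2                              # point estimate
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)     # SE of the difference
    return diff, diff - z * se, diff + z * se

# Example 6.3: on ART (n=120, mean 25, SD 5) vs not on ART (n=140, mean 14, SD 4)
diff, low, high = z_interval_diff_means(25, 5, 120, 14, 4, 140)
print(f"Point estimate: {diff} years; 95% CI: {low:.2f} to {high:.2f}")
```

With these inputs the point estimate is 11 years and the interval is roughly 9.9 to 12.1 years.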
275
CI for single proportion
Background Concept: The Sampling distribution of
Proportions
Here we are interested in the proportion of the
population that has a certain characteristic, represented
by P or π.
If we repeatedly take random samples of n observations and
calculate p for each sample, then we will have the
sampling distribution of proportions.
The sampling distribution of proportions has the following
characteristics:
276
CI for single proportion cont
The sampling distribution of proportions has the
following properties:
1. The mean of the sampling distribution of proportions = π (the population proportion P),
2. The SD (SE) of the sampling distribution of proportions:
$$\sigma_P = \sqrt{\frac{P(1-P)}{n}}$$
3. The distribution is approximately normal.
277
CI for single proportion cont..
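$$\Pr\left(-1.96 < \frac{p - P}{\sqrt{\frac{P(1-P)}{n}}} < 1.96\right) = 0.95$$
$$\Pr\left[p - 1.96\sqrt{\frac{P(1-P)}{n}} \le P \le p + 1.96\sqrt{\frac{P(1-P)}{n}}\right] = 0.95$$
$$95\%\ \text{CI for } P = p \pm 1.96\sqrt{\frac{P(1-P)}{n}}$$
$$\text{CI for } P = p \pm Z_{\alpha/2}\sqrt{\frac{P(1-P)}{n}}$$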
278
CI for single proportion cont..
Example 6.4:
In Addis Ababa, blood tests of 120 randomly selected
commercial sex workers revealed that 30 of them were
HIV positive. What is the 99% confidence interval for the
HIV prevalence among all commercial sex workers
in the city?
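A hedged Python sketch of Example 6.4. It assumes the sample proportion p is substituted for the unknown P inside the standard error, which is the usual practice. The function name is illustrative.

```python
import math
from statistics import NormalDist

def z_interval_proportion(events, n, conf=0.95):
    """CI for a single population proportion (normal approximation)."""
    p = events / n
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # about 2.576 for 99%
    se = math.sqrt(p * (1 - p) / n)                # sample p used in place of P
    return p, p - z * se, p + z * se

# Example 6.4: 30 HIV positive out of 120 tested, 99% confidence level
p, low, high = z_interval_proportion(30, 120, conf=0.99)
print(f"p = {p:.2f}; 99% CI: {low:.3f} to {high:.3f}")
```

Under these assumptions the prevalence estimate is 25%, with a 99% CI of roughly 14.8% to 35.2%.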
279
CI for difference between two
proportions
Consider two different populations X and Y.
The first population has proportion $\pi_x$ and the second
population has proportion $\pi_y$.
From the first population take a sample of size $n_x$ and
compute its sample proportion $p_x$. From the second
population take a sample of size $n_y$ and compute its
sample proportion $p_y$.
Then determine $p_x - p_y$.
Do the same for all pairs of samples that can be chosen
independently from the two populations.
The differences $p_x - p_y$ are a new set of scores which form
the sampling distribution of differences of proportions.
280
CI for difference between two
proportions cont
The sampling distribution of differences of proportions
has the following properties:
1. The mean of the sampling distribution of differences of
proportions equals the difference of the population
proportions ($\pi_1 - \pi_2$)
2. The SD (SE) is given as:
$$\sigma_{(p_1-p_2)} = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$$
3. The distribution is approximately normal.
281
CI for difference between two
proportions cont
$$\Pr\left(-1.96 < \frac{(p_1-p_2) - (\pi_1-\pi_2)}{\sqrt{\frac{p_1(1-p_1)}{n_1}+\frac{p_2(1-p_2)}{n_2}}} < 1.96\right) = 0.95$$
$$\Pr\left[(p_1-p_2) - 1.96\sqrt{\frac{p_1(1-p_1)}{n_1}+\frac{p_2(1-p_2)}{n_2}} \le \pi_1-\pi_2 \le (p_1-p_2) + 1.96\sqrt{\frac{p_1(1-p_1)}{n_1}+\frac{p_2(1-p_2)}{n_2}}\right] = 0.95$$
$$95\%\ \text{CI for } (\pi_1-\pi_2) = (p_1-p_2) \pm 1.96\sqrt{\frac{p_1(1-p_1)}{n_1}+\frac{p_2(1-p_2)}{n_2}}$$
$$\text{CI for } (\pi_1-\pi_2) = (p_1-p_2) \pm Z_{\alpha/2}\sqrt{\frac{p_1(1-p_1)}{n_1}+\frac{p_2(1-p_2)}{n_2}}$$
282
CI for difference between two
proportions cont
Example 6.5:
Among 200 randomly selected illiterate married women,
50 use contraceptives. Similarly, among 300 randomly
selected married women who can read and write,
150 use contraceptives.
Calculate the point estimate for the difference between
the population proportions.
Find the 95% CI for the difference between the
proportions.
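A brief Python sketch of Example 6.5 using the formula for the difference between two proportions. The function name is illustrative. The difference is taken as (illiterate minus literate), which is an assumption about the order.

```python
import math
from statistics import NormalDist

def z_interval_diff_proportions(x1, n1, x2, n2, conf=0.95):
    """Point estimate and CI for the difference between two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff, diff - z * se, diff + z * se

# Example 6.5: 50 of 200 illiterate women vs 150 of 300 literate women
diff, low, high = z_interval_diff_proportions(50, 200, 150, 300)
print(f"Difference: {diff:.2f}; 95% CI: {low:.3f} to {high:.3f}")
```

With these inputs the point estimate is -0.25 and the interval is roughly -0.33 to -0.17. It excludes 0.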
283
CI for OR and RR
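When the intention of measurement of association is to
make inference about a population parameter, the CI for OR
or RR can be calculated using the following formulas, where
a, b, c and d are the four cell frequencies of the 2 x 2 table.
Why do we need natural logarithm here?
$$\text{CI for OR} = \exp\left[\ln(OR) \pm Z_{\alpha/2}\sqrt{\frac{1}{a}+\frac{1}{b}+\frac{1}{c}+\frac{1}{d}}\right]$$
$$\text{CI for RR} = \exp\left[\ln(RR) \pm Z_{\alpha/2}\sqrt{\frac{1-\frac{a}{a+b}}{a}+\frac{1-\frac{c}{c+d}}{c}}\right]$$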
284
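The two formulas for the CI of OR and RR can be sketched in Python as below. The cell counts are hypothetical (not from the slides), and the layout is assumed to be a = exposed with outcome, b = exposed without, c = unexposed with outcome, d = unexposed without. The interval is built on the log scale because the sampling distribution of a ratio is skewed, while its logarithm is approximately normal.

```python
import math
from statistics import NormalDist

def ci_or_rr(a, b, c, d, conf=0.95):
    """CIs for the odds ratio and relative risk of a 2 x 2 table."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    odds_ratio = (a * d) / (b * c)
    risk_ratio = (a / (a + b)) / (c / (c + d))
    se_ln_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    se_ln_rr = math.sqrt((1 - a / (a + b)) / a + (1 - c / (c + d)) / c)
    # Build the interval on the log scale, then exponentiate back
    or_ci = tuple(math.exp(math.log(odds_ratio) + s * z * se_ln_or) for s in (-1, 1))
    rr_ci = tuple(math.exp(math.log(risk_ratio) + s * z * se_ln_rr) for s in (-1, 1))
    return odds_ratio, or_ci, risk_ratio, rr_ci

# Hypothetical table: 20/100 exposed and 10/100 unexposed develop the outcome
odds_ratio, or_ci, risk_ratio, rr_ci = ci_or_rr(a=20, b=80, c=10, d=90)
print(f"OR = {odds_ratio:.2f}, 95% CI {or_ci[0]:.2f} to {or_ci[1]:.2f}")
print(f"RR = {risk_ratio:.2f}, 95% CI {rr_ci[0]:.2f} to {rr_ci[1]:.2f}")
```

Note that the resulting intervals are not symmetric around the point estimate, which is expected when exponentiating a symmetric interval.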
CI for OR and RR Cont..
SPSS can compute OR and RR with their confidence intervals
provided the data are entered in the following manner.
Create 3 variables in the Variable View page:
Frequency (for the four cells),
Exposure (0 as Yes, 1 as No) and
Outcome (0 as Yes, 1 as No)
Enter the values into the Data View page as mentioned above.
Weight cases by the frequency variable.
Do the analysis in the following manner:
Analyze > Descriptive Statistics > Crosstabs > Put exposure as row
and outcome as column > Statistics > Check Risk >
Continue > OK
OR is given as "Odds Ratio for exposure (yes / no)"
RR is given as "For cohort disease = yes"
285
Unbiased and Biased Estimators
A statistic is called an unbiased estimator of a population
parameter if the mean of the sampling distribution of the
statistic is equal to the value of the parameter.
Because the mean of the sampling distribution of means equals the
population mean, the sample mean is an unbiased estimator of the
population mean.
If the mean value of an estimator is either less than or
greater than the true value of the quantity it estimates, then
the estimator is called biased.
A case of biased estimation occurs when the sample
variance is used to estimate the population variance using
the following formula (with n as the denominator):
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n}$$
286
Unbiased and Biased
Estimators Cont
The sample variance calculated using this formula is, on average,
less than the true population variance.
This is because sample observations are closer to their own
sample mean than to the population mean.
To compensate for this, n-1 is used as the denominator.
It is important to note that, even with n-1 as the denominator, the
sample standard deviation still remains a biased estimator of the
population standard deviation, but for large sample sizes
this bias is negligible.
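The bias can be checked exactly on a toy case. The sketch below uses a hypothetical five-value population (not from the slides) and enumerates every possible sample of size 2 drawn with replacement, then averages the two versions of the sample variance.

```python
from itertools import product
from statistics import mean, pvariance, variance

population = [1, 2, 3, 4, 5]        # hypothetical toy population
sigma2 = pvariance(population)      # true population variance

# Every possible sample of size 2 drawn with replacement (25 samples)
samples = list(product(population, repeat=2))

# Mean of the sampling distribution of each variance estimator
mean_var_n = mean(pvariance(s) for s in samples)   # denominator n
mean_var_n1 = mean(variance(s) for s in samples)   # denominator n-1

print(f"Population variance:    {sigma2}")
print(f"Average with n:         {mean_var_n}")
print(f"Average with n-1:       {mean_var_n1}")
```

In this toy case the n-denominator version averages to half the population variance, while the n-1 version averages to the population variance itself.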
287
Estimation of Sample Size for
Cross Sectional Studies
Why do we need to calculate sample size?
Representativeness vs cost
The sample size can be estimated based on a given confidence
level and margin of error.
288
Sample Size to Estimate a Single
Population Proportion
If the main objective of the study is to estimate a single
population proportion, then the sample size can be
determined using the formula:
$$n = \frac{Z_{\alpha/2}^{2}\,P(1-P)}{d^{2}}$$
Where;
n is the minimum sample size required for a very large
population (≥ 100,000)
Z is the critical value for a given confidence level
P is the expected proportion of the event to be studied (to
be estimated based on findings of previous studies)
d is the margin of error
289
Sample Size to Estimate a Single
Population Proportion Cont
NB:
If p is not known it has to be taken as 0.5. (Why?)
Depending on the nature of the study 10-15%
contingency should be added.
If the size of the population is less than 100,000 the
sample size should be corrected using the formula:
$$\text{Corrected sample size} = \frac{n \times N}{n + N}$$
Where:
n is the non-corrected sample size
N is the size of the source population
290
Sample Size to Estimate a Single
Population Proportion Cont
Example 6.6:
A researcher wants to determine the prevalence of
family planning use in Addis Ababa city. A previous
study indicates the prevalence is around 55%. If the
researcher wants to determine the sample size
at 95% confidence level and 5% margin of error, how many
women of reproductive age should be included in the
study?
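A minimal Python sketch of Example 6.6, including the optional correction for small source populations. The function name and the population size of 5,000 used in the second call are hypothetical. Rounding up to the next whole person is an assumption, and no contingency is added.

```python
import math
from statistics import NormalDist

def sample_size_proportion(p, d, conf=0.95, N=None):
    """Minimum sample size to estimate a single proportion."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    n = z**2 * p * (1 - p) / d**2
    if N is not None:               # correction for a small source population
        n = n * N / (n + N)
    return math.ceil(n)             # round up to a whole participant

# Example 6.6: P = 0.55, d = 0.05, 95% confidence level
n_large = sample_size_proportion(0.55, 0.05)
# Same design with a hypothetical source population of 5,000
n_small = sample_size_proportion(0.55, 0.05, N=5000)
print(n_large, n_small)
```

Trying p = 0.5 in this function gives the largest n for a fixed d, which is the reason 0.5 is used when p is unknown.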
291
Sample Size to Estimate Single
Population Mean
If the main objective of the study is to estimate a single
population mean, then the sample size can be determined
using the formula:
$$n = \left(\frac{Z_{\alpha/2}\,\sigma}{d}\right)^{2}$$
Where:
n is the minimum sample size required for a large
population
Z is the critical value for a given confidence level
σ is the expected SD of the event to be studied
d is the margin of error
292
Sample Size to Estimate Single
Population Mean
Example 6.7:
A researcher wants to determine the mean blood
glucose level among high school students. A previous
study indicates the mean is 85mg/dl with standard
deviation of 15mg/dl. If the researcher wants to
determine the sample size at 95% confidence level and tolerates a 2
mg/dl margin of error, how many students should
be included in the study?
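A minimal Python sketch of Example 6.7. The function name is illustrative, and rounding up to the next whole student is an assumption.

```python
import math
from statistics import NormalDist

def sample_size_mean(sigma, d, conf=0.95):
    """Minimum sample size to estimate a single mean."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return math.ceil((z * sigma / d) ** 2)

# Example 6.7: expected SD 15 mg/dl, margin of error 2 mg/dl
n_students = sample_size_mean(sigma=15, d=2)
print(n_students)
```

Note that the previous study's mean (85 mg/dl) does not enter the calculation. Only the SD and the margin of error do.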
293
Hypothesis Testing
What is a Hypothesis
A statistical hypothesis is an assumption or a statement,
which may or may not be true, concerning one or more
populations.
Setting up and testing hypotheses is an essential part of
statistical inference.
Examples of statistical hypothesis:
The mean pulse rate among AAU-HI students is 72/min.
The prevalence of HIV in AA is 12%.
The mean blood glucose level among Chinese and
Indians is the same.
The prevalence of Hypertension in US and UK is the
same.
The mean blood cholesterol level is the same before
and after taking a drug.
295
Steps in Hypothesis Testing
Hypothesis testing involves the following steps:
1. Choose the hypothesis to be tested,
2. Choose an alternative hypothesis which would be
accepted if the first hypothesis is rejected.
3. Decide on the appropriate test statistic for the
hypothesis (Z, t, $\chi^2$)
4. Decide the level of significance and corresponding
critical value.
5. Obtain the value of the test statistic.
6. Make a decision and interpret it.
296
The Null and Alternative
Hypothesis
In hypothesis testing two hypotheses are involved: The Null
Hypothesis and the Alternative Hypothesis.
Every hypothesis test requires the analyst to state a null
hypothesis and an alternative hypothesis.
They are mutually exclusive and complementary statements.
Both hypotheses are about the parameter, not about the
statistic.
The null hypothesis ($H_0$ or $H_N$):
The first hypothesis to be set by the researcher.
It commonly implies conclusions like "equals to", "no
effect", "no difference" or "no association".
297
The Null and Alternative
Hypothesis Cont..
Example;
The mean pulse rate among AAU-HI students is 72/min.
Drug A has no effect on the blood glucose level of
diabetic patients.
There is no difference in the prevalence of malaria in
region A and Region B.
There is no association between smoking and lung
cancer.
298
The Null and Alternative
Hypothesis Cont..
The alternative hypothesis ($H_A$ or $H_1$):
The hypothesis that will be accepted if $H_0$ is rejected.
Implies conclusions like "is not equal", "has an effect", "there
is a difference" and "there is an association".
Example:
The mean pulse rate among AAU-HI students is not
equal to 72/min.
Drug A has an effect on the blood glucose level of diabetic
patients.
There is a difference in the prevalence of malaria in region
A and B.
There is an association between smoking and lung cancer.
299
Test Statistic
In hypothesis testing we accept or reject the hypothesis
by calculating the probability of getting the
estimated sample value, given that the hypothesized
population value is true.
If the probability is very low we reject the null hypothesis.
The probability is calculated using a test statistic.
The most commonly used test statistics are the Z, Student's t
and $\chi^2$ tests.
The general formula to calculate a test statistic is:
$$\text{test statistic} = \frac{(\text{estimate}) - (\text{hypothesized value})}{SE}$$
300
Test Statistic
Student's t Distribution:
The use of the z-test requires knowledge of the variance of
the population from which the sample is taken.
It is somewhat strange that one can have knowledge of
the population variance and not know the value of the
population mean.
In statistics, as long as the sample size is large enough, most
datasets can be described by the standard normal distribution.
But when the sample size is small and the population SD is
not known, statisticians rely on the distribution of the t
statistic.
301
Test Statistic Cont
Student's t distribution was developed by William Gosset
(1876-1937) under the pseudonym "Student".
There are many different t distributions (the t distribution is a
family of distributions).
The particular form of the t distribution is determined by
its Degrees of Freedom (df).
The degrees of freedom (df) refers to the number of
independent observations in a dataset after some
restriction is made.
$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$
302
Test Statistic Cont
The t distribution has the following properties:
The mean of the distribution is equal to 0.
Symmetrical about the mean.
The variance is equal to v / (v - 2), where v is the df
(i.e. v > 2). In general the variance is greater than 1,
but approaches 1 as the sample size becomes large.
Extends from -infinity to +infinity.
Compared to the normal distribution, the t distribution is less
peaked in the center and has higher tails.
The t distribution approaches the normal distribution
as n-1 approaches infinity.
303
Test Statistic Cont
For the t distribution to apply strictly we need the
following two assumptions:
1. The observations are selected at random from the
population.
2. The population distribution is normal.
Sometimes the second assumption may not be met, but
the t test is robust to departures from the normal
distribution.
That means even when assumption 2 is not satisfied, the
probabilities calculated from the t table are still
approximately correct.
305
Test Statistic Cont
Chi Square Distribution ($\chi^2$):
Mainly developed by Karl Pearson (1857-1936)
A type of probability distribution like Z or t.
Represented by the Greek letter Chi ($\chi$)
It is the distribution of the sum of the squared values of
the observations drawn from the N(0,1) distribution.
Let $\{X_1, X_2, \ldots, X_n\}$ be n independent random variables,
all ~ N(0,1).
Then $\chi^2_n$ is defined as the distribution of the sum
$X_1^2 + X_2^2 + \cdots + X_n^2$.
306
Test Statistic Cont
Mainly used to check association between two
categorical variables.
It is the most frequently used statistical technique for
analysis of count or frequency data.
It is not a single distribution but rather a family of distributions,
indexed by the df.
The mathematical formula of the $\chi^2$ distribution with k df is given as
(where x ≥ 0):
$$Y = \frac{1}{\left(\frac{k}{2}-1\right)!}\left(\frac{1}{2}\right)^{k/2} x^{(k/2)-1}\, e^{-x/2}$$
307
Test Statistic Cont
[Figure: graph of the chi-square distribution]
308
Test Statistic Cont
The formula for the test statistic which approximates the $\chi^2$
distribution is (where O is the observed frequency and E
is the expected frequency):
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
It has the following characteristics:
Extends indefinitely to the right from 0.
Has only one tail.
As the df increases, the chi-square curve approaches
a normal distribution.
309
Errors in Hypothesis Testing
In testing a hypothesis, two types of errors can be
committed: Type I and Type II errors.
The probability of committing a type I error is denoted as
α. It is also called the level of significance (1 -
confidence level).
The probability of committing a type II error is denoted
as β (1 - power of the study).

                    Decision of the hypothesis testing
Null hypothesis     Accept H0          Reject H0
H0 true             Correct            Type I error
H0 false            Type II error      Correct
311
One and Two Tailed Hypothesis
Some hypotheses test whether one value is different
from another or not, without additionally predicting which
will be higher: Non-directional or two-tailed test
At times some hypotheses not only test difference of one
value from the other but also direction of the difference.
i.e. it would be lower or higher: Directional or one-tailed
test.
312
Level of Significance, Critical
Values and Critical Area
In practice, the level of significance (α) is chosen arbitrarily.
Three levels are commonly used: 0.01, 0.05 and 0.10 (corresponding to
confidence levels of 99%, 95% and 90%).
The smaller the level of significance, the stronger the evidence
required to reject the null hypothesis.
The level of significance determines the values of the test
statistic that would cause us to reject the hypothesis.
The test statistic values corresponding to the level of
significance are called the Critical Values.
In a probability distribution the area which lies to the
extreme right and/or left of the critical value is called the
Critical area (Rejection area).
The area between the two critical values is called the
Acceptance Area.
313
Level of Significance, Critical
Values and Critical Area
A level of significance has different critical values for one
and two tailed tests.
A level of significance of 0.05 has critical values of ±1.96 if
the test is two tailed.
However, if the test is one tailed the critical value is
1.645 (upper tail) or -1.645 (lower tail).
Note that critical values for a given level of significance
differ depending on the test statistic intended to be used.
315
Level of Significance, Critical
Values and Critical Area
α (level of       Two-tailed    One-tailed    One-tailed
significance)     test          test, <       test, >
0.10              ±1.645        -1.28         1.28
0.05              ±1.96         -1.645        1.645
0.01              ±2.58         -2.33         2.33
319
Interpretation and Conclusion
Interpretation is made based on comparisons between:
Test Statistic Calculated Vs Critical Value.
P value Vs significance level.
Conclusion (i.e. accepting and rejecting the null
hypothesis) should be made at the given level of
confidence.
320
Test of Hypothesis about Single
Population Mean
Shows how to test the null hypothesis that the population
mean is equal to some hypothesized value.
One begins with a statement that claims a particular
value for the unknown population mean.
The hypothesis testing for a single population mean either
accepts or rejects this statement.
The Z test and the t test are used:
Sample size ≥ 30: Z test
Sample size < 30 and population SD known: Z test
Sample size < 30 and population SD unknown: t test
321
Test of Hypothesis about Single
Population Mean Cont..
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$
$$t = \frac{\bar{X} - \mu}{S/\sqrt{n}}$$
322
Test of Hypothesis about Single
Population Mean Cont..
Example 7.1:
Researchers are interested in the mean level of an
enzyme in a certain population. They take a sample of
36 individuals, determine the level of enzyme in each
and compute a sample mean of 22. It is known that the
variable of interest is approximately normally distributed
with a standard deviation of 10. Let's say that they are
asking the following question: Can we conclude that the
mean enzyme level in this population is different from
25?
323
Test of Hypothesis about Single
Population Mean Cont..
Step 1 and 2: Define the $H_0$ and $H_1$:
$$H_0: \mu = 25 \qquad H_1: \mu \ne 25$$
Step 3: Decide appropriate test statistic:
Z test
Step 4: Decide the level of significance and critical value:
α value of 0.05.
±1.96 is the critical value.
Step 5: Obtain the value of the test statistic:
324
Test of Hypothesis about Single
Population Mean Cont..
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{22 - 25}{10/\sqrt{36}} = \frac{-3}{1.67} = -1.80$$
325
Test of Hypothesis about Single
Population Mean Cont..
Step 6: Make a decision and interpret it.
Accept the $H_0$ at 95% confidence level:
-1.80 is within the acceptance region (-1.96 to 1.96).
The one-tail P value of 0.036 is greater than the α/2 value of 0.025.
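The six steps of Example 7.1 can be sketched in Python as follows. The function name is illustrative. The sketch reports the one-tail area, which is the quantity the slides compare with α/2.

```python
import math
from statistics import NormalDist

def one_sample_z(xbar, mu0, sigma, n):
    """Z statistic and one-tail P value for a single mean."""
    z = (xbar - mu0) / (sigma / math.sqrt(n))
    p_tail = NormalDist().cdf(-abs(z))      # area beyond |z| in one tail
    return z, p_tail

alpha = 0.05
z, p_tail = one_sample_z(xbar=22, mu0=25, sigma=10, n=36)
critical = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed critical value
reject_h0 = abs(z) > critical                    # same decision as p_tail < alpha/2

print(f"Z = {z:.2f}, one-tail P = {p_tail:.3f}, reject H0: {reject_h0}")
```

Comparing |Z| with the critical value and comparing the one-tail P value with α/2 always lead to the same decision.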
326
Test of Hypothesis about Single
Population Mean Cont..
Example 7.2:
The researchers mentioned in example 7.1, instead of
asking if they could conclude that μ ≠ 25, asked: Can
we conclude that the mean enzyme level in this
population is less than 25?
Solution:
Step 1 and 2: Define the $H_0$ and $H_1$:
$$H_0: \mu \ge 25 \qquad H_1: \mu < 25$$
327
Test of Hypothesis about Single
Population Mean Cont..
Step 3: Decide appropriate test statistic:
Z test
Step 4: Decide the level of significance and critical
value:
α value of 0.05.
-1.645 is the critical value.
Step 5: Obtain the value of the test statistic:
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{22 - 25}{10/\sqrt{36}} = \frac{-3}{1.67} = -1.80$$
328
Test of Hypothesis about Single
Population Mean Cont..
Step 6: Make a decision and interpret it.
Reject the $H_0$ at 95% confidence level:
Test statistic -1.80 is within the rejection region (below -1.645).
P value of 0.036 is less than the α value of 0.05.
Conclusion: $\mu < 25$
329
Test of Hypothesis about Single
Population Mean Cont..
Example 7.3:
Serum amylase level determination was made on a
sample of 15 apparently healthy subjects. The sample
yielded a mean of 96 units/100 ml and a standard
deviation of 35 units/100 ml. The variance of the
population was unknown. We want to know whether we
can conclude that the mean of the population is different
from 120 units/100 ml.
330
Test of Hypothesis about Single
Population Mean Cont..
Step 1 and 2: Define the $H_0$ and $H_1$.
$$H_0: \mu = 120 \qquad H_1: \mu \ne 120$$
Step 3: Decide appropriate test statistic.
t test
Step 4: Decide level of significance and critical value.
α value of 0.05.
t value for α/2 of 0.025 at df of 14: 2.145
Step 5: Obtain the value of the test statistic.
$$t = \frac{\bar{X} - \mu}{S/\sqrt{n}} = \frac{96 - 120}{35/\sqrt{15}} = -2.65$$
331
Test of Hypothesis about Single
Population Mean Cont..
Step 6: Make a decision and interpret it.
We reject the null hypothesis b/c
The calculated test statistic -2.65 is in the rejection area
The corresponding P value of -2.65 (b/n 0.005 and
0.01) is less than the α/2 value of 0.025.
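A small Python sketch of Example 7.3. The Python standard library has no t distribution, so the critical value 2.145 is taken from the t table as quoted in the example rather than computed. The function name is illustrative.

```python
import math

def one_sample_t(xbar, mu0, s, n):
    """t statistic and degrees of freedom for a single mean."""
    t = (xbar - mu0) / (s / math.sqrt(n))
    return t, n - 1

T_CRITICAL = 2.145   # t table value for alpha/2 = 0.025 at df = 14 (from the example)

t, df = one_sample_t(xbar=96, mu0=120, s=35, n=15)
reject_h0 = abs(t) > T_CRITICAL
print(f"t = {t:.2f}, df = {df}, reject H0: {reject_h0}")
```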
332
Testing of Hypothesis about Two
Population Means
Compares the means of two populations.
$H_0$: there is no difference between the two means.
$H_1$: there is a difference between the two means.
Z or t test can be employed.
Sum up the sample sizes of the two groups: if the total is greater
than 30 use the Z test, if less than 30 use the t test.
$$Z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$$
333
Testing of Hypothesis about Two
Population Means Cont..
t test is carried out with df of n1+n2-2
$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{S^2}{n_1} + \frac{S^2}{n_2}}}$$
where $S^2$ is the pooled variance:
$$S^2 = \frac{(n_1-1)S_1^2 + (n_2-1)S_2^2}{n_1+n_2-2}$$
334
Testing of Hypothesis about Two
Population Means Cont..
Example 7.4:
A researcher wants to check whether the systolic blood
pressure (SBP) among males is different from that among females.
Among 50 males sampled the mean SBP was 100mmHg
with standard deviation of 5 mmHg. Among 60 females,
the mean SBP was 104mmHg with standard deviation of
10 mmHg. Is there a significant difference between the two
means?
335
Testing of Hypothesis about Two
Population Means Cont..
Step 1 and 2: Define the $H_0$ and $H_1$
$$H_0: \mu_m = \mu_f \qquad H_1: \mu_m \ne \mu_f$$
Step 3: Decide appropriate test statistic:
Z test
Step 4: Decide the level of significance and critical
value:
α value of 0.05.
±1.96 is the critical value.
Step 5: Obtain the value of the test statistic:
$$Z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} = \frac{100 - 104}{\sqrt{\frac{5^2}{50} + \frac{10^2}{60}}} = \frac{-4}{\sqrt{0.5 + 1.67}} = \frac{-4}{1.47} = -2.72$$
336
Testing of Hypothesis about Two
Population Means Cont..
Step 6: Make a decision and interpret it.
We reject the $H_0$ and accept the $H_1$ (at 95%
confidence level) b/c
The calculated test statistic -2.72 is in the rejection region.
The corresponding P value of -2.72 (0.0033) is less
than the α/2 value of 0.025.
Conclusion: $\mu_m \ne \mu_f$
337
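A Python sketch of Example 7.4 (two-sample Z test). The sample SDs are used in place of the population SDs, as the worked example does. The function name is illustrative.

```python
import math
from statistics import NormalDist

def two_sample_z(m1, s1, n1, m2, s2, n2):
    """Z statistic and one-tail P value for the difference of two means."""
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)
    z = (m1 - m2) / se
    return z, NormalDist().cdf(-abs(z))

alpha = 0.05
# Example 7.4: males (n=50, mean 100, SD 5) vs females (n=60, mean 104, SD 10)
z, p_tail = two_sample_z(100, 5, 50, 104, 10, 60)
reject_h0 = p_tail < alpha / 2
print(f"Z = {z:.2f}, one-tail P = {p_tail:.4f}, reject H0: {reject_h0}")
```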
Testing of Hypothesis about Two
Population Means Cont..
Example 7.5:
Serum amylase determination was made on a sample of
15 apparently healthy subjects and 12 hospitalized
subjects. Among healthy subjects, the mean was 96
units/100ml with standard deviation of 35 units/100 ml.
Among hospitalized patients, the mean was 120
units/100ml with standard deviation of 40 units/100 ml. Is
there a significant difference between the two mean
values?
338
Testing of Hypothesis about Two
Population Means Cont
Step 1 and 2: Define the $H_0$ and $H_1$
$$H_0: \mu_1 = \mu_2 \qquad H_1: \mu_1 \ne \mu_2$$
Step 3: Decide appropriate test statistic.
t test
Step 4: Decide level of significance and critical value.
α value of 0.01.
t value for α/2 of 0.005 at df of 25: 2.787
Step 5: Obtain the value of the test statistic
$$S^2 = \frac{(n_1-1)S_1^2 + (n_2-1)S_2^2}{n_1+n_2-2} = \frac{(14)(35^2) + (11)(40^2)}{25} = \frac{17150 + 17600}{25} = 1390, \qquad S = \sqrt{1390} = 37.3$$
339
Testing of Hypothesis about Two
Population Means Cont
$$t = \frac{96 - 120}{\sqrt{\frac{37.3^2}{15} + \frac{37.3^2}{12}}} = \frac{-24}{14.4} = -1.67$$
Step 6: Make a decision and interpret it.
We accept the null hypothesis (at 99% confidence level)
b/c:
The calculated test statistic -1.67 is in the acceptance
region.
The corresponding P value of -1.67 (which is b/n 0.05 and
0.1) is greater than the α/2 value of 0.005.
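A Python sketch of Example 7.5 (pooled-variance t test). The critical value 2.787 is the t table value quoted in the example and is not computed here. The function name is illustrative. Without intermediate rounding the statistic comes out near -1.66 rather than -1.67.

```python
import math

def pooled_t(m1, s1, n1, m2, s2, n2):
    """Pooled variance, t statistic and df for two independent means."""
    df = n1 + n2 - 2
    s2_pooled = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df
    t = (m1 - m2) / math.sqrt(s2_pooled / n1 + s2_pooled / n2)
    return s2_pooled, t, df

T_CRITICAL = 2.787   # t table value for alpha/2 = 0.005 at df = 25 (from the example)

# Example 7.5: healthy (n=15, mean 96, SD 35) vs hospitalized (n=12, mean 120, SD 40)
s2_pooled, t, df = pooled_t(96, 35, 15, 120, 40, 12)
reject_h0 = abs(t) > T_CRITICAL
print(f"S^2 = {s2_pooled:.0f}, t = {t:.2f}, df = {df}, reject H0: {reject_h0}")
```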
340
Testing of Hypothesis about Two
Population Means Cont
Paired t test for difference between two means:
Every observation in one sample has one matching
observation in the second sample.
Commonly used in evaluation of interventions like new
treatment modalities.
Hence pre and post intervention (treatment) results are
compared.
Usually t test is used since individuals involved in the
trial are few.
The null hypothesis: there is no significant difference
between the two tests.
341
Testing of Hypothesis about Two
Population Means Cont
The procedures of hypothesis testing are the same, except for
the formula for the test statistic:
$$t = \frac{\bar{d}}{SD/\sqrt{n}}$$
$\bar{d}$ = the mean of the differences between the two samples.
SD = the standard deviation of the differences
between the two samples.
n = the number of paired cases.
Note that the calculated test statistic is compared at
n-1 degrees of freedom.
342
Testing of Hypothesis about Two
Population Means Cont
Example 7.6:
A random sample of 10 young men was taken and the
pulse rate was measured before and after taking a cup
of coffee. The result is given as follows. Does coffee
have any effect on the heart rate? (Perform the hypothesis
test at 95% confidence level.)
343
Testing of Hypothesis Cont
Subject   PR before   PR after   Difference
1         68          74         +6
2         64          68         +4
3         52          60         +8
4         76          72         -4
5         78          76         -2
6         62          68         +6
7         66          72         +6
8         76          76          0
9         78          80         +2
10        60          64         +4
Mean      68          71         +3
344
Testing of Hypothesis about Two
Population Means Cont
$H_0$: Coffee intake has no effect on PR
$H_1$: Coffee intake has an effect on PR
Test statistic: t test (paired)
Critical value: 2.262 (α/2 of 0.025 at df of 9)
First calculate the SD then the test statistic:
$$SD = \sqrt{\frac{\sum (d_i - \bar{d})^2}{n-1}} = 3.92 \qquad t = \frac{3}{3.92/\sqrt{10}} = 2.4$$
Reject the null hypothesis (at 95% confidence level):
Coffee intake has an effect on PR.
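The paired t test of Example 7.6 can be reproduced from the table of pulse rates with a few lines of Python. The critical value 2.262 is the t table value used in the example. The variable names are illustrative.

```python
import math
from statistics import mean, stdev

# Pulse rates of the 10 subjects in Example 7.6
before = [68, 64, 52, 76, 78, 62, 66, 76, 78, 60]
after = [74, 68, 60, 72, 76, 68, 72, 76, 80, 64]

diffs = [a - b for a, b in zip(after, before)]   # after minus before
d_bar = mean(diffs)                              # mean of the differences
sd = stdev(diffs)                                # SD with n-1 denominator
t = d_bar / (sd / math.sqrt(len(diffs)))

T_CRITICAL = 2.262   # t table value for alpha/2 = 0.025 at df = 9 (from the example)
reject_h0 = abs(t) > T_CRITICAL
print(f"mean diff = {d_bar}, SD = {sd:.2f}, t = {t:.2f}, reject H0: {reject_h0}")
```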
345
Test of Hypothesis About Single
Population Proportion
The null hypothesis is that the population proportion is
equal to some hypothesized value.
One begins with a statement that claims a particular
value for the unknown population proportion.
The hypothesis testing for a single population proportion
either accepts or rejects this statement.
Here the Z test statistic is used. The formula is given as:
$$Z = \frac{p - \pi}{\sqrt{\frac{\pi(1-\pi)}{n}}}$$
where p is the sample proportion and π is the hypothesized
population proportion.
346
Test of Hypothesis on Means
Using SPSS
In SPSS the One-Sample T Test, Independent-Samples T Test and
Paired-Samples T Test are available under:
Analyze > Compare Means > One-Sample T Test or Independent-Samples T
Test or Paired-Samples T Test
347
Test of Hypothesis About Single
Population Proportion
Example 7.7:
A survey was conducted to determine the prevalence of
protein energy malnutrition in a rural kebele. Of 300
under five children assessed, 123 were stunted. Can we
conclude that the prevalence of PEM in the population is
50%?
348
Test of Hypothesis About Single
Population Proportion
Step 1 and 2: Define the $H_0$ and $H_1$
$$H_0: \pi = 0.5 \qquad H_1: \pi \ne 0.5$$
Step 3: Appropriate test statistic:
Z statistic
Step 4: Decide the level of significance and the
corresponding critical value:
Let's take an α value of 0.1. Hence ±1.645 is the critical
value.
Step 5: Obtain the value of the test statistic:
$$Z = \frac{p - \pi}{\sqrt{\frac{\pi(1-\pi)}{n}}} = \frac{0.41 - 0.5}{\sqrt{\frac{0.5(0.5)}{300}}} = \frac{-0.09}{\sqrt{\frac{0.25}{300}}} = -3.11$$
349
Test of Hypothesis About Single
Population Proportion
Step 6: Make a decision and interpret it.
At 90% confidence level we reject the null hypothesis
that P=0.5.
The calculated test statistic -3.11 is in the rejection
region.
The corresponding P value of -3.11 (i.e. 0.0009) is
less than the α/2 value of 0.05.
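A Python sketch of Example 7.7 (Z test for a single proportion). The function name is illustrative. The standard error uses the hypothesized proportion, as in the formula for this test.

```python
import math
from statistics import NormalDist

def one_sample_proportion_z(events, n, pi0):
    """Z statistic and one-tail P value for a single proportion."""
    p = events / n
    se = math.sqrt(pi0 * (1 - pi0) / n)     # SE under the null hypothesis
    z = (p - pi0) / se
    return p, z, NormalDist().cdf(-abs(z))

alpha = 0.10
# Example 7.7: 123 stunted children out of 300, hypothesized prevalence 50%
p, z, p_tail = one_sample_proportion_z(123, 300, 0.5)
reject_h0 = p_tail < alpha / 2
print(f"p = {p:.2f}, Z = {z:.2f}, one-tail P = {p_tail:.4f}, reject H0: {reject_h0}")
```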
350
Testing of Hypothesis About
Two Population Proportions
The null hypothesis is that a population proportion is equal
to another population proportion.
The hypothesis testing for two population proportions
either accepts or rejects this statement.
Here the Z test statistic is used. The formula is given as:
$$Z = \frac{p_1 - p_2}{\sqrt{P(1-P)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$
where P is the pooled proportion:
$$P = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2}$$
351
Testing of Hypothesis About
Two Population Proportions
Example 7.8:
The prevalence of malaria in two malaria endemic
kebeles, X and Y, was compared. In kebele X, among 120
samples 15 were positive. In kebele Y, among 100
samples 20 were positive. Is there any significant
difference between the prevalence of malaria in kebele X
and Y?
352
Testing of Hypothesis About
Two Population Proportions
Step 1 and 2: Define the H0 and H1:
H0: P1 = P2
H1: P1 ≠ P2
Step 3: Decide appropriate test statistic:
Z statistic
Step 4: Decide α value & the critical value:
Let's take α value of 0.05. Hence ±1.96 is the critical value.
Step 5: Obtain the value of the test statistic:
First calculate the proportions & the pooled proportion
p1 = 15/120 = 0.125, p2 = 20/100 = 0.2
353
Testing of Hypothesis About
Two Population Proportions
Then we calculate the pooled proportion and the test statistic:
$$P = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2} = \frac{120(0.125) + 100(0.2)}{120 + 100} = \frac{15 + 20}{220} = 0.159$$
$$Z = \frac{0.125 - 0.2}{\sqrt{0.159(1 - 0.159)\left(\frac{1}{120} + \frac{1}{100}\right)}} = \frac{-0.075}{\sqrt{(0.1337)(0.0183)}} = -1.51$$
Step 6: Make a decision and interpret it.
At 95% confidence level we accept the H0 that P1 = P2 because:
- The test statistic -1.51 is in the acceptance region (-1.96 to 1.96).
- The corresponding one-tailed P value (0.0655) is greater than the α/2 value of 0.025.
354
Test of Hypothesis on
Proportions Using SPSS
There is no point and click option in SPSS to do such
hypothesis testing on proportions.
Syntax based analysis can be done.
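Since SPSS offers no menu option for this test, the arithmetic of Example 7.8 can also be reproduced in a few lines of code. The following is a minimal Python sketch, not SPSS syntax; the function names are illustrative assumptions and the P value is taken from the standard normal distribution.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled Z test of H0: P1 = P2 (hypothetical helper)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)   # same as (n1*p1 + n2*p2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    z = (p1 - p2) / se
    one_tailed = normal_cdf(-abs(z))
    return z, pooled, one_tailed


# Example 7.8: 15/120 positive in kebele X, 20/100 positive in kebele Y
z, pooled, p_one = two_proportion_z_test(15, 120, 20, 100)
print(f"pooled P = {pooled:.3f}, Z = {z:.2f}, one-tailed P = {p_one:.4f}")
print("Reject H0 at alpha = 0.05:", abs(z) > 1.96)
```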
355
Test of Hypothesis about
Categorical Data
It is also possible to apply hypothesis testing on categorical data.
The Chi-square (χ²) test statistic is commonly used.
This test is usually applied to tabulated data.
The table contains two variables called the row and column variables.
The test measures the discrepancy between K observed frequencies (O) and the corresponding K expected frequencies (e), i.e. for all cells of the tabulation.
Expected frequencies are the frequencies that would occur if there were no association between the row and column variables.
356
Test of Hypothesis about
Categorical Data
The H0 of the Chi-square test is that there is no association between the row and column variables.
The H1 is that there is an association between the row and column variables.
The closer observed frequencies are to expected frequencies, the more likely the H0 is true.
$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - e_i)^2}{e_i}$$
$$e = \frac{\text{row total for the cell} \times \text{column total for the cell}}{\text{grand total}}$$
357
Test of Hypothesis about
Categorical Data
Assumptions of Chi-square test:
No cell of the table has an expected frequency less than 1,
No more than 20% of the expected frequencies should be less than 5.
The Chi-square statistic should be compared with the chi-square distribution with df of (R-1)(C-1).
Though the Chi-square distribution is one tailed, the test is always two tailed.
358
Test of Hypothesis about
Categorical Data
Example 7.9:
A researcher is interested in assessing the effect of literacy on family planning use. Accordingly he collected data and tabulated the findings in the following manner. Can we say there is an association between educational status and family planning use?
FP use    Educational Status
          Illiterate    Literate    Total
Yes       63            49          112
No        15            33          48
Total     78            82          160
359
Test of Hypothesis about
Categorical Data
Step 1 and 2: Define the H0 and H1:
H0: There is no association between literacy and family planning use.
H1: There is an association between literacy and family planning use.
Step 3: Decide appropriate test statistic:
χ² test.
Step 4: Decide α and the corresponding critical value:
Let's take α value of 0.01.
At df of 1 the critical value is 6.635.
Acceptance area is 0-6.635, rejection area is χ² > 6.635.
360
Test of Hypothesis about
Categorical Data
Step 5: Obtain the value of the test statistic:
First the expected frequency should be calculated:
Expected frequency for cell a: 78 x 112/160 = 54.6
Expected frequency for cell b: 82 x 112/160 = 57.4
Expected frequency for cell c: 78 x 48/160 = 23.4
Expected frequency for cell d: 82 x 48/160 = 24.6
Assumptions of the χ² test are fulfilled.
Then we calculate the Chi-square statistic.
$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - e_i)^2}{e_i}$$
361
Test of Hypothesis about
Categorical Data
$$\chi^2 = \frac{(63 - 54.6)^2}{54.6} + \frac{(49 - 57.4)^2}{57.4} + \frac{(15 - 23.4)^2}{23.4} + \frac{(33 - 24.6)^2}{24.6}$$
$$\chi^2 = 1.29 + 1.23 + 3.02 + 2.87 = 8.41$$
Step 6: Make a decision and interpret it.
At 99% confidence level we accept the H1 that the two variables are associated due to the following reasons:
The calculated test statistic 8.41 is in the rejection area.
The corresponding P value of 8.41 (between 0.002 and 0.005) is less than the α value of 0.01.
But what is the direction of the association?
362
Test of Hypothesis about
Categorical Data Using SPSS
In order to do the chi-square test using SPSS, follow these steps.
Analyze > Descriptive Statistics > Crosstabs > Put the two categorical variables as column and row > Statistics > Check Chi-square > OK.
The Chi-square test is given in a table as Pearson Chi-square.
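The hand calculation of Example 7.9 can be cross-checked outside SPSS. The following is a minimal Python sketch under stated assumptions: the function name `chi_square_test` is illustrative, and the P value shortcut based on `math.erfc` is valid only for df = 1 (a 2 x 2 table).

```python
import math


def chi_square_test(table):
    """Pearson chi-square for a two-way table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    expected = []
    for i, row in enumerate(table):
        expected_row = []
        for j, observed in enumerate(row):
            # expected = row total x column total / grand total
            e = row_totals[i] * col_totals[j] / grand_total
            expected_row.append(e)
            statistic += (observed - e) ** 2 / e
        expected.append(expected_row)
    return statistic, expected


# Example 7.9: rows are FP use (Yes, No), columns are (Illiterate, Literate)
statistic, expected = chi_square_test([[63, 49], [15, 33]])
# Upper-tail area of the chi-square distribution; this identity holds for df = 1 only
p_value = math.erfc(math.sqrt(statistic / 2.0))
print(f"chi-square = {statistic:.2f}, P = {p_value:.4f}")
print("Expected frequencies:", [[round(e, 1) for e in row] for row in expected])
```

The row percentages answer the question of direction: 63/78 of the illiterate group and 49/82 of the literate group reported family planning use.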
363
Fisher's exact test
Fisher's exact test is a statistical significance test used in the analysis of contingency tables when the sample size is small (i.e. when the assumptions of the chi-square test are not fulfilled).
It is named after its inventor, R. A. Fisher.
For hand calculations, the test is only feasible in the case of a 2 x 2 contingency table.
Its application to higher order tables is controversial.
H0: there is no association between the two variables
H1: there is an association between the two variables
The hypothesis is tested by comparing the probability of observing the given or more extreme tables, given that the null hypothesis is true, with the level of significance.
364
Fisher's exact test
The exact probability of observing a given table is given as:
$$P = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{N!\,a!\,b!\,c!\,d!}$$
where the cells and marginal totals are laid out as:
a        b        (a+b)
c        d        (c+d)
(a+c)    (b+d)    N
365
Fisher's exact test
Hypothesis testing using Fisher's exact test involves the following steps:
1. Calculate the probability of the observed table itself,
2. List all possible more extreme tables manually (keeping the marginal totals fixed),
3. Calculate their respective exact probabilities,
4. Calculate the probability of getting the observed or more extreme tables,
5. Multiply the total by 2 (to get the 2 tailed value),
6. Compare the value with the value of α.
366
Fisher's exact test
Example 7.10:
In the following tabulated data, is there any association between the treatment type and survival rate of patients? (Test the hypothesis at 95% confidence level)
Treatment type    Survived    Died    Total
A                 7           2       9
B                 5           6       11
Total             12          8       20
367
Fisher's exact test
H0: No association between the treatment modalities and survival rate.
H1: There is an association between the treatment modalities and survival rate.
Test statistic: Fisher's exact test, because two of the expected frequencies have values less than 5.
Level of significance: 5%
Calculate the probability of getting the given or more extreme tables.
368
Fisher's exact test
Observed table:
Treatment type    Survived    Died    Total
A                 7           2       9
B                 5           6       11
Total             12          8       20
Probability of observing this table = (9! 11! 12! 8!) / (20! 7! 2! 5! 6!) = 0.132
369
Fisher's exact test
First possible extreme table:
Treatment type    Survived    Died    Total
A                 8           1       9
B                 4           7       11
Total             12          8       20
Probability of observing this table = (9! 11! 12! 8!) / (20! 8! 1! 4! 7!) = 0.024
370
Fisher's exact test
Second possible extreme table:
Treatment type    Survived    Died    Total
A                 9           0       9
B                 3           8       11
Total             12          8       20
Probability of observing this table = (9! 11! 12! 8!) / (20! 9! 0! 3! 8!) = 0.001
371
Fisher's exact test
Probability of getting the observed or more extreme tables:
0.132 + 0.024 + 0.001 = 0.157 (one tailed)
Two tailed: 2 x 0.157 = 0.314
Conclusion and interpretation:
Since 0.314 is greater than 0.05, accept the null hypothesis at 95% confidence level.
There is no association between the treatment modalities and survival rate.
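The table probabilities of Example 7.10 follow directly from the factorial formula and can be verified with a few lines of code. The following is a minimal Python sketch; the function names are illustrative assumptions, and the loop assumes that "more extreme" means shifting counts towards a larger cell a while keeping the marginal totals fixed, as in this example.

```python
from math import factorial


def table_probability(a, b, c, d):
    """Exact probability of one 2 x 2 table with fixed margins."""
    n = a + b + c + d
    numerator = (factorial(a + b) * factorial(c + d)
                 * factorial(a + c) * factorial(b + d))
    denominator = (factorial(n) * factorial(a) * factorial(b)
                   * factorial(c) * factorial(d))
    return numerator / denominator


def one_tailed_p(a, b, c, d):
    """Sum the observed table and every more extreme table (larger a)."""
    total = 0.0
    while b >= 0 and c >= 0:
        total += table_probability(a, b, c, d)
        a, b, c, d = a + 1, b - 1, c - 1, d + 1   # margins stay unchanged
    return total


p_observed = table_probability(7, 2, 5, 6)
p_one = one_tailed_p(7, 2, 5, 6)
p_two = 2 * p_one   # doubling rule used in the steps of this section
print(f"observed table: {p_observed:.3f}")
print(f"one tailed: {p_one:.3f}, two tailed: {p_two:.3f}")
```

Statistical packages may define the two tailed value differently from the doubling rule, so small differences from software output are possible.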
372
Fisher's exact test using
SPSS
In order to do Fisher's exact test using SPSS, follow these steps.
Analyze > Descriptive Statistics > Crosstabs > Put the two categorical variables as column and row > Statistics > Check Chi-square > OK.
Fisher's exact test is given in a table titled Chi-square tests.
NB: SPSS doesn't do Fisher's exact test for higher order tables.
373
Summary
The interpretation of the hypothesis test is dependent on the
confidence level at which the test is conducted.
A hypothesis which is accepted at a lower level of confidence
can not be rejected at a higher level of confidence.
A hypothesis which is rejected at a lower level of confidence
can be accepted at a higher level of confidence.
A hypothesis which is rejected at a higher level of confidence
can not be accepted at a lower level of confidence.
A hypothesis which is accepted at a higher level of confidence
can be rejected at lower level of confidence.
374
Sample Size Calculation for
Comparative Studies.
The concept discussed in this chapter can be applied to the calculation of sample size for comparative studies.
For comparative studies like case control, cohort and interventional studies, the optimal size for the two groups is calculated using the formula:
$$n_1 = \frac{\left[Z_{\alpha/2}\sqrt{\left(1 + \frac{1}{r}\right)P(1-P)} + Z_{\beta}\sqrt{P_1(1-P_1) + \frac{P_2(1-P_2)}{r}}\right]^2}{(P_1 - P_2)^2}$$
Where
$$P = \frac{P_1 + rP_2}{1 + r}$$
375
Sample Size Calculation
Cont..
Where:
P is the pooled proportion
P1 is the expected 1st proportion
P2 is the expected 2nd proportion
r is the number of controls per case
Alpha (α) is the probability of type I error
Beta (β) is the probability of type II error
n1 is the sample size for the first group
NB: n2 is calculated by multiplying n1 by r.
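The sample size formula can be wrapped in a small function. The following is a minimal Python sketch under stated assumptions: a two-sided test is assumed, so the first Z value is taken at α/2; the input proportions (0.4 and 0.2), the 5% significance level and the 80% power are hypothetical values chosen only for illustration; and the function name is not from the slides.

```python
import math
from statistics import NormalDist


def sample_size_two_groups(p1, p2, r=1.0, alpha=0.05, power=0.80):
    """Sample sizes (n1, n2) for comparing two proportions."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided Z for alpha
    z_beta = NormalDist().inv_cdf(power)               # Z for beta = 1 - power
    p_bar = (p1 + r * p2) / (1.0 + r)                  # pooled proportion
    term_null = z_alpha * math.sqrt((1.0 + 1.0 / r) * p_bar * (1.0 - p_bar))
    term_alt = z_beta * math.sqrt(p1 * (1.0 - p1) + p2 * (1.0 - p2) / r)
    n1 = math.ceil((term_null + term_alt) ** 2 / (p1 - p2) ** 2)
    n2 = math.ceil(r * n1)                             # n2 = r x n1
    return n1, n2


# Hypothetical inputs: P1 = 0.4, P2 = 0.2, one control per case
n1, n2 = sample_size_two_groups(0.4, 0.2, r=1.0)
print("n1 =", n1, "n2 =", n2)
```

Results are rounded up because a fraction of a study subject cannot be recruited.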
376
Correlation and Linear
Regression
Regression and Correlation
Many medical investigations are concerned with:
Establishment of relationship between two variables.
The strength of a relationship.
Predicting one variable on the basis of another.
Controlling the effect of unwanted variables.
Such intentions can be addressed either by using
correlation or regression analysis.
378
Correlation Analysis
Initially developed by Sir Francis Galton (1888) and Karl Pearson (1896).
Correlation is the quantification of the degree to which two random quantitative variables are related, provided the relationship is linear.
Both of the variables should be measured on the same set of study units.
Strength of relationship measurement: Correlation Coefficient.
Most commonly used coefficient: Product Moment Correlation or Pearson Correlation Coefficient (r).
The symbol rho (ρ) is used to represent the population correlation coefficient.
Unitless measure.
379
Correlation Analysis
Does not imply cause and effect relationship.
The value of r ranges from -1 to +1.
If the correlation coefficient is greater than 0, the
variables are said to be positively correlated (i.e. as X
increases, Y tends to increase).
If the correlation coefficient is less than 0, the variables
are said to be negatively correlated (i.e. as X increases,
Y tends to decrease).
If the correlation coefficient is 0 then the variables are
said to be uncorrelated.
380
Correlation Analysis Cont
The formula for computing the sample correlation coefficient (r) for two variables X and Y is given as:
$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\left[\sum (x - \bar{x})^2\right]\left[\sum (y - \bar{y})^2\right]}}$$
Or
$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}}$$
Before computing r, a scatter plot of the two variables should be drawn. Why?
381
Correlation Analysis Cont
[Figure: scatter plots of y against x showing linear relationships and curvilinear relationships]
382
Correlation Analysis Cont
[Figure (continued): scatter plots of y against x showing strong relationships and weak relationships]
383
Correlation Analysis Cont
[Figure (continued): scatter plots of y against x showing no relationship]
384
Correlation Analysis Cont
Assumptions of correlation analysis:
Independent random samples are taken
Both variables are on interval/ratio scale
Linear association between X and Y
Paired measures for X and Y
Normal distribution for X and Y
Homogeneity of variance (Homoscedasticity)
In situations where its assumptions are violated,
correlation becomes inadequate to explain a given
relationship.
385
Correlation Analysis Cont
Example 8.1:
The data of a random sample of 20 countries are shown
in the following table. X represents the percentage of
children immunized by age one year and Y represents
the under five year mortality rate. Determine the strength
of association between the two variables.
386
Correlation Analysis Cont
387
Country     % Immunized (X)   CMR/1000LB (Y)   XY       Y²       X²
Bolivia     77                118              9086     13924    5929
Brazil      69                65               4485     4225     4761
Cambodia    32                184              5888     33856    1024
Canada      85                8                680      64       7225
China       94                43               4042     1849     8836
Czech       99                12               1188     144      9801
Egypt       89                55               4895     3025     7921
Ethiopia    13                208              2704     43264    169
Finland     95                7                665      49       9025
France      95                9                855      81       9025
Greece      54                9                486      81       2916
India       89                124              11036    15376    7921
Italy       95                10               950      100      9025
Japan       87                6                522      36       7569
Mexico      91                33               3003     1089     8281
Poland      98                16               1568     256      9604
Russia      73                32               2336     1024     5329
Senegal     47                145              6815     21025    2209
Turkey      76                87               6612     7569     5776
UK          90                9                810      81       8100
Total       1548              1180             68626    147118   130446
Correlation Analysis Cont
$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}}$$
$$r = \frac{20(68626) - (1548 \times 1180)}{\sqrt{\left[20(130446) - (1548)^2\right]\left[20(147118) - (1180)^2\right]}}$$
$$r = -0.79$$
There is a strong negative linear relationship between the two variables.
388
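The value of r in Example 8.1 can be recomputed from the raw columns of the table. The following is a minimal Python sketch that applies the computational formula for r; the function name `pearson_r` is an illustrative assumption.

```python
import math

# Data from Example 8.1: % immunized (x) and under-five mortality per 1000 LB (y)
x = [77, 69, 32, 85, 94, 99, 89, 13, 95, 95,
     54, 89, 95, 87, 91, 98, 73, 47, 76, 90]
y = [118, 65, 184, 8, 43, 12, 55, 208, 7, 9,
     9, 124, 10, 6, 33, 16, 32, 145, 87, 9]


def pearson_r(xs, ys):
    """Pearson correlation using sums of x, y, xy, x^2 and y^2."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(a * b for a, b in zip(xs, ys))
    sum_x2 = sum(a * a for a in xs)
    sum_y2 = sum(b * b for b in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2)
                            * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r = pearson_r(x, y)
print(f"r = {r:.2f}, r squared = {r * r:.2f}")
```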
Correlation Analysis Cont
Interpretation option:
r² x 100%:
Shows the proportion of variation of a variable explained by the other.
Rule of thumb:
Size of Coefficient    General Interpretation
0.8-1.0                Very strong relationship
0.6-0.8                Strong relationship
0.4-0.6                Moderate relationship
0.2-0.4                Weak relationship
0.0-0.2                Very weak or no relationship
389
Correlation Analysis Cont
Hypothesis Testing for a Correlation Coefficient
As with means and proportions, it is also possible to test the significance of the population correlation.
For two tailed test
H0: ρ is 0
H1: ρ is different from 0
The t test statistic is given as (with n-2 df):
$$t = r\sqrt{\frac{n-2}{1-r^2}}$$
390
Correlation Analysis Cont
Example 8.2:
At the 0.05 level of significance, can we claim the
correlation coefficient in example 8.1 indicates significant
negative relationship between immunization coverage
and child mortality?
391
Correlation Analysis Cont..
The critical t value for the 0.05 level of significance (one tailed) at 18 degrees of freedom is -1.734. Then we calculate the test statistic.
$$t = r\sqrt{\frac{n-2}{1-r^2}} = -0.79\sqrt{\frac{20-2}{1-(-0.79)^2}} = -0.79\sqrt{\frac{18}{0.3759}} = -5.47$$
Since -5.47 is less than -1.734, we accept the H1 that r indicates a significant negative relationship between immunization coverage and child mortality.
392
Correlation Analysis Cont..
Limitations:
Applied only to a linear relationship.
One must not extrapolate an observed correlation
beyond observed ranges of the x and y value.
Does not differentiate dependent and independent
variable.
Confounding by a third variable.
393
Correlation Analysis Cont..
Spearman's Rank Correlation
It is a nonparametric (distribution-free) rank statistic proposed by Charles Spearman in 1904 as a measure of the strength of the association between two variables.
Denoted as rs.
Is applied when:
Normality assumption is not satisfied or cannot be tested,
At least one of the variables is given in ordinal scale.
In the calculation of the coefficient, actual values of both variables should be changed into ranks.
394
Correlation Analysis Cont..
The formula for the Spearman Correlation Coefficient is (given that there is no tied rank):
$$r_s = 1 - \frac{6\sum D^2}{n(n^2 - 1)}$$
Where;
6 is a constant,
D is the difference between a subject's ranks on the two variables,
n is the number of subjects.
Consider the following example.
395
Correlation Analysis Cont..
The following table presents the MMR level and delivery service coverage in 10 developing countries.
Countries   MMR (per 100,000 LB)   MMR Rank   Delivery Service Coverage (%)   Rank   D     D²
1           315                    4          55                              6      -2    4
2           450                    6          40                              5      1     1
3           200                    1          70                              8      -7    49
4           250                    3          79                              10     -7    49
5           243                    2          75                              9      -7    49
6           830                    9          25                              3      6     36
7           850                    10         20                              2      8     64
8           656                    7          20                              1      6     36
9           701                    8          30                              4      4     16
10          410                    5          60                              7      -2    4
Total                                                                                      308
$$r_s = 1 - \frac{6\sum D^2}{n(n^2 - 1)}$$
= 1 - [(6 x 308) / (10(100 - 1))]
= 1 - [1848/990]
= 1 - 1.87
= -0.87
396
Correlation Analysis Cont..
Inference about rs
For hypothesis testing, a t score can be calculated (at df of n-2) using the formula:
$$t = \frac{r_s}{\sqrt{\frac{1 - r_s^2}{n - 2}}}$$
For the previous example the t score would be:
$$t = \frac{-0.87}{\sqrt{\frac{1 - (-0.87)^2}{10 - 2}}} \approx -5$$
If the hypothesis test is a two tailed test at the 0.05 level of significance, we reject the H0 as |t| = 5 > 2.306.
397
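The Spearman coefficient and its t score for the MMR example can be reproduced from the two rank columns. The following is a minimal Python sketch; the function name is an illustrative assumption, and the ranks are copied from the table exactly as given (countries 7 and 8 share a coverage of 20% but are ranked 2 and 1, so the no-ties formula is applied as in the slides). Because the code keeps the unrounded coefficient, its t score differs slightly from the value obtained with the rounded -0.87.

```python
import math

# Ranks copied from the MMR and delivery service coverage table
mmr_rank = [4, 6, 1, 3, 2, 9, 10, 7, 8, 5]
coverage_rank = [6, 5, 8, 10, 9, 3, 2, 1, 4, 7]


def spearman_from_ranks(rank_a, rank_b):
    """Spearman coefficient from two rank lists, assuming no ties."""
    n = len(rank_a)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    rs = 1.0 - 6.0 * sum_d2 / (n * (n ** 2 - 1))
    return rs, sum_d2


rs, sum_d2 = spearman_from_ranks(mmr_rank, coverage_rank)
n = len(mmr_rank)
t_score = rs / math.sqrt((1.0 - rs ** 2) / (n - 2))
print(f"sum of D squared = {sum_d2}, rs = {rs:.2f}, t = {t_score:.2f}")
print("Reject H0 (two tailed, alpha = 0.05, df = 8):", abs(t_score) > 2.306)
```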
Correlation Analysis Cont..
Partial Correlation
A method used to describe the relationship between two
variables while taking away the effects of another
variable, or several other variables, on this relationship.
Still requires meeting all the usual assumptions of Pearsonian correlation.
But the covariate need not necessarily be numeric.
398
Correlation Analysis Using
SPSS
In order to do correlation analysis using SPSS, follow these steps:
Analyze > Correlate > Bivariate > Put the two variables in the variable box > Select Pearson or Spearman (another option is also there) > OK.
Partial correlation can also be done.
Analyze > Correlate > Partial.
But before that, don't forget the scatter plot.
399
Regression Analysis
In correlation analysis the interest is to show how two
numeric variables are related.
However in regression analysis, we are interested in
explaining or modeling a dependent variable (Y) as a
function of one or more independent variables (X).
Regression analysis is used to:
Assess association between two variables.
Predict/explain the value of a dependent variable
based on the value of at least one independent
variable. (i.e. Mathematical modeling)
Control for confounding factors.
Show possible effect of interaction among variables.
400
Regression Analysis Cont..
The general regression equation is given as:
$$Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n$$
Where: Y is the value of the dependent variable,
X is the independent variable,
α is the intercept,
β is the coefficient of the independent variable
If the equation has only one independent variable the
regression is called Simple Regression
If multiple independent variables are involved it is called
Multiple Regression.
In public health the most commonly used types of
regression analysis are: Linear and Logistic Regression
401
Linear Regression
Also known as linear least squares regression.
It is by far the most widely used modeling method.
The dependent variable is assumed to be a linear function of one or more independent variables plus an error introduced to account for all other factors.
$$Y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \varepsilon$$
Where Y is the dependent variable, the x's are the independent variables and ε is the random error term.
The DV (Y) is given in continuous numeric scale while the IV/s (X) can be of any type (mostly numeric variables).
402
Linear Regression Cont..
The equation provides what value the DV would have for
a given value/s of the IV/s.
For example if we develop a linear model with the DV of
body height and the IV of serum growth hormone, we
can predict height for a person with a given value of
serum GH.
Can be simple or multiple regression.
It attempts to model the relationship between the
dependent and independent variables by fitting a linear
equation to observed data.
403
Linear Regression Cont..
A scatter plot is helpful to assess the presence of a linear trend of association.
Consider the following data showing the number of households in China with TV.
Year (X)               Households with TV
(0 represents 2000)    (millions)
0                      68
1                      72
2                      80
3                      83
404
Linear Regression Cont..
If we plot these data, we get the following graph.
405
Linear Regression Cont..
Although no straight line passes exactly through these
points, there are many straight lines that pass close to
them. Here is one of them.
406
Linear Regression Cont..
How would you draw a line through the points? How do
you determine which line fits best?
The most common method for fitting a regression line is
the method of least-squares.
This method calculates the best-fitting line for the
observed data by minimizing the sum of the squares of
the vertical deviations from each data point to the line.
Best fit means difference between actual Y values &
predicted Y values are minimum.
Hence, linear regression is a method of finding the linear
equation that comes closest to fitting a collection of data
points.
407
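Linear Regression Cont..
[Figure: scatter plot of Y against X with a fitted line and four residuals; the second point is labelled with the first equation below]
$$Y_2 = \hat{\beta}_0 + \hat{\beta}_1 X_2 + \hat{\varepsilon}_2$$
$$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$$
LS minimizes
$$\sum_{i=1}^{n} \hat{\varepsilon}_i^2 = \hat{\varepsilon}_1^2 + \hat{\varepsilon}_2^2 + \hat{\varepsilon}_3^2 + \hat{\varepsilon}_4^2$$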
408
Linear Regression Cont..
Suppose that we used the line rather than the data
points to estimate the number of households with TV.
Then we would get slightly different values from the
original observed values shown above. These values are
called predicted values.
Year (X)               Households with TV (millions)   Households with TV (millions)   Residual
(0 represents 2000)    Observed Values                  Predicted Values
0                      68                               62                              6
1                      72                               70                              2
2                      80                               78                              2
3                      83                               86                              -3
409
Linear Regression Cont..
The better our choice of line, the closer the predicted values will be to the observed values.
The difference between the observed value and the predicted value is called the residual.
Residual = Observed Value - Predicted Value
The best line is the line with the smallest sum of squares of error (SSE), i.e. least squares estimation.
$$SSE = \text{Sum of squares of residuals} = \sum (y_{observed} - y_{predicted})^2$$
410
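The idea of choosing the line with the smallest SSE can be illustrated with the households-with-TV data. The following is a minimal Python sketch under stated assumptions: the candidate line y = 62 + 8x is inferred from the predicted values 62, 70, 78 and 86 in the table, the least-squares line is obtained with the closed-form formulas for one independent variable, and the function names are illustrative.

```python
years = [0, 1, 2, 3]          # 0 represents 2000
observed = [68, 72, 80, 83]   # households with TV (millions)


def sse(intercept, slope, xs, ys):
    """Sum of squared residuals (observed minus predicted) for a line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


def least_squares(xs, ys):
    """Closed-form least-squares intercept and slope for one predictor."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    slope = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    intercept = sum_y / n - slope * sum_x / n
    return intercept, slope


candidate_sse = sse(62, 8, years, observed)     # line read off the table
b0, b1 = least_squares(years, observed)
best_sse = sse(b0, b1, years, observed)
print(f"candidate line SSE = {candidate_sse}")
print(f"least-squares line: y = {b0:.1f} + {b1:.1f}x, SSE = {best_sse:.1f}")
```

Any other straight line through these four points gives an SSE at least as large as that of the least-squares line.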
Linear Regression Cont..
The manual calculation for the coefficients of linear regression is possible when we have one independent variable, i.e.:
Y = α + βX
As in correlation analysis, here we should have a set of paired DV and IV values for all study units.
The line which represents the dataset (Y = α + βX) is calculated using the formula:
$$\beta = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}}$$
$$\alpha = \bar{y} - \beta\bar{x}$$
411
Linear Regression Cont..
Consider the following data.
x    y
1    1
2    1
3    2
4    2
5    4
First we should plot a scatter diagram.
412
Linear Regression Cont..
[Figure: scatter diagram of Y (axis from 0 to 4) against X (axis from 0 to 6) for the data above]
413
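Linear Regression Cont
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} X_i Y_i - \frac{\left(\sum_{i=1}^{n} X_i\right)\left(\sum_{i=1}^{n} Y_i\right)}{n}}{\sum_{i=1}^{n} X_i^2 - \frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}} = \frac{37 - \frac{(15)(10)}{5}}{55 - \frac{(15)^2}{5}} = 0.70$$
$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{X} = 2 - (0.70)(3) = -0.10$$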
414
Linear Regression Cont
One of the indices to measure model goodness of fit for simple linear regression is R-squared or the coefficient of determination.
It is the proportion of variation explained by the best line model.
It depends on the ratio of the sum of squared errors from the regression model (SSE) to the sum of squared differences around the mean (SST = sum of squares total):
$$R^2 = 1 - \frac{SSE}{SST}$$
Where:
$$SSE = \sum (y - \hat{y})^2, \qquad SST = \sum (y - \bar{y})^2$$
415
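The coefficients and R-squared for the five-point worked example can be computed together. The following is a minimal Python sketch; it assumes the data pairs (1,1), (2,1), (3,2), (4,2), (5,4), which are the values consistent with the sums 15, 10, 37 and 55 used in the hand calculation, and the function name is illustrative.

```python
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]


def simple_linear_regression(xs, ys):
    """Return intercept, slope and R-squared for one independent variable."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(a * b for a, b in zip(xs, ys))
    sum_x2 = sum(a * a for a in xs)
    slope = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    intercept = sum_y / n - slope * sum_x / n
    mean_y = sum_y / n
    # SSE: squared residuals around the fitted line
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(xs, ys))
    # SST: squared differences around the mean of y
    sst = sum((b - mean_y) ** 2 for b in ys)
    return intercept, slope, 1.0 - sse / sst


intercept, slope, r_squared = simple_linear_regression(x, y)
print(f"Y = {intercept:.2f} + {slope:.2f}X, R-squared = {r_squared:.3f}")
```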
Linear Regression Cont
For multiple linear regression adjusted r squared is used.
For general rule of thumb, the R-squared or adjusted R-
squared should be higher than 0.80 to produce a good
linear model.
If your R-squared is less than 0.5, it is recommended
that you consider other type of model rather than linear
model.
416
Linear Regression Cont
Interpretation of linear regression coefficient:
Let's consider the following simple linear regression equation:
Y = α + βX
β represents the slope, and α represents the y-intercept.
The slope represents the estimated average change in Y
when X increases by one unit.
The intercept represents the estimated average value of Y
when X equals zero. (Practically less important)
When we represent a binary independent variable (coded
as 0-1), the slope represents the estimated average
change in Y when you switch from 0 to 1.
417
Linear Regression Cont
Example 8.3:
Assume that the duration of breast feeding in weeks (Y) was found to be positively correlated with maternal age in years (X). A linear regression model was developed to explain the association. The equation is given as Y = 5.92 + 0.389X. How would you interpret the equation?
418
Linear Regression Cont
Assumptions:
Normal distribution: Regression assumes that variables
have normal distributions.
Homoscedasticity: The variance of the error terms is
constant for each value of x.
Linearity: The relationship between each x and y is linear.
Normally distributed error terms: The error terms follow the
normal distribution.
Independence of error terms: Successive residuals are not
correlated.
No multicollinearity: The independent variables are not correlated with each other.
419
Linear Regression Cont
Hypothesis testing in linear regression:
Questions to be answered through the hypothesis testing
are:
Does the entire set of independent variables contribute
significantly to the prediction of y?
Does the addition of one particular variable of interest
add significantly to the prediction of y achieved by the
other independent variables already in the model?
The null and alternative hypotheses are given as:
H0: β1 = β2 = ... = βp = 0
H1: βj ≠ 0 for at least one j.
420
Linear Regression Cont
F test and t test are used to test the hypothesis.
F is a test for statistical significance of the regression
equation as a whole. It is obtained by dividing the
explained variance by the unexplained variance.
(Given as ANOVA table)
The t test is used to see whether a specific variable is significant in explaining the dependent variable or not.
421
Linear Regression Using SPSS
Analyze > Regression > Linear Regression > Put the
dependent and independent variables > Select
appropriate statistics > Ok.
422
Logistic Regression
424
Introduction
Logistic Regression is a model used for predicting the probability of occurrence of a categorical event by fitting data to a Logistic Curve.
Common dichotomous dependent variables include disease status (healthy or ill), clinical outcome (alive or dead), treatment outcome (success or failure), utilization of health commodities (utilization or non-utilization) etc.
Application:
Modeling for risk prediction, identification of
determinants and health programming,
Controlling confounding and interacting factors.
425
Introduction Cont
Comparative advantage of Logistic Regression
Fewer assumptions,
Mathematically amenable,
Easier interpretation.
Classification of Logistic Regression (LR):
Binomial LR: Dependent variable is dichotomous.
Multinomial LR: Dependent variable with more than two classes.
Ordinal LR: Dependent variable with multiple and ranked classes.
426
Logistic Regression Function
Binary dependent variables are coded as 0 or 1.
The probability of the distribution is equal to the proportion of 1s in the distribution (P).
The logistic function associates the Independent Variable (IV) X with the probability of occurrence of the Dependent Variable (DV) Y.
The function is given as:
427
LR Function Cont
The function is represented by S shaped Sigmoid graph
which is called the Logistic Curve.
Examples:
428
LR Function Cont
Derivation of the function can be demonstrated with an example.
Suppose we want to predict a person's sex based on the person's height.
Let's say the probability of being male at a given height is 0.9.
Odds (P/(1-P)) of being male = 0.9/0.1 = 9
Odds of being female = 0.1/0.9 = 0.11
However the values look asymmetrical.
This can be corrected by the application of ln.
ln(9) = 2.197 and ln(0.11) = -2.197
The overall transformation is the Logit Transformation.
The log of odds is abbreviated as the Logit.
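The symmetry produced by the log transformation is easy to confirm numerically. The following is a minimal Python sketch; the function names `odds`, `logit` and `inverse_logit` are illustrative assumptions, and `inverse_logit` simply converts a logit back to a probability.

```python
import math


def odds(p):
    """Odds of an event with probability p."""
    return p / (1.0 - p)


def logit(p):
    """Natural log of the odds."""
    return math.log(odds(p))


def inverse_logit(z):
    """Convert a logit back to a probability (the logistic curve)."""
    return 1.0 / (1.0 + math.exp(-z))


p_male = 0.9
print(f"odds of male   = {odds(p_male):.2f}, logit = {logit(p_male):.3f}")
print(f"odds of female = {odds(1 - p_male):.2f}, logit = {logit(1 - p_male):.3f}")
print(f"back-transformed probability = {inverse_logit(logit(p_male)):.1f}")
```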
429
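LR Function Cont
Mathematically:
$$\ln\left(\frac{p}{1-p}\right) = \alpha + \beta x$$
$$\frac{p}{1-p} = e^{\alpha + \beta x}$$
$$P = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$$
$$P = \frac{1}{1 + e^{-z}} \quad \text{where } z = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$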
430
LR Function Cont
One of the advantages of Logistic Regression: it is possible to compute the OR from its coefficient.
Let's assume a researcher is interested in studying the effect of smoking as a predictor variable (X) on the dependent variable lung cancer (Y).
X can be present (X=1) or absent (X=0),
Y can be present (Y=1) or absent (Y=0),
$$\log\left[\frac{P(Y=1)}{1 - P(Y=1)}\right] = \alpha + \beta X$$
431
LR Function Cont
Hence;
$$\log\left[\text{odds}(Y=1 \mid X=1)\right] = \alpha + \beta(1)$$
$$\log\left[\text{odds}(Y=1 \mid X=0)\right] = \alpha + \beta(0)$$
The OR = Odds of smokers ÷ Odds of non-smokers
$$OR = \frac{e^{\alpha + \beta(1)}}{e^{\alpha}}$$
$$OR = e^{\beta}$$
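The identity OR = e^β can be checked numerically by computing the odds for X = 1 and X = 0 from the logistic function. The following is a minimal Python sketch; the values α = -2.0 and β = 0.7 are hypothetical and chosen only for illustration.

```python
import math

alpha, beta = -2.0, 0.7   # hypothetical intercept and coefficient


def probability(x):
    """P(Y = 1) from the logistic function for a single predictor x."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta * x)))


def odds(p):
    return p / (1.0 - p)


# Odds ratio computed two ways: from the odds and from the coefficient
or_from_odds = odds(probability(1)) / odds(probability(0))
or_from_coefficient = math.exp(beta)
print(f"OR from odds = {or_from_odds:.4f}")
print(f"OR from coefficient = {or_from_coefficient:.4f}")
```

The intercept cancels out of the ratio, which is why the OR depends on β alone.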
432
Assumptions of Logistic Regression
Logistic Regression has fewer assumptions than Linear
Regression:
The DV need not be normally distributed.
Normally distributed error terms are not assumed.
Error terms need not be homoscedastic for each level of the IVs.
433
Assumptions of LR Cont
But it has the following assumptions:
1. Data type: A dichotomous or polytomous DV.
2. Inclusion of all relevant variables and exclusion of the
irrelevant ones: i.e. Based on scientific framework or
statistical cutoff point (P=0.3).
3. No interaction: LR doesn't consider interaction effects except when interactions are created as a variable.
4. No outliers and influential cases: Such cases can affect the
model significantly.
434
Assumptions of LR Cont
5. No multicollinearity: As the IVs increase in correlation with
each other, the standard errors become inflated.
A standard error > 2.0.
Examining the correlations and associations b/n IVs
Tolerance and VIF.
6. No outliers and influential cases: Such cases can affect
the model significantly.
7. Large samples:
The minimum Ratio of Valid Cases to Variables
should be at least 10:1. The preferred ratio is 20:1.
435
Assumptions of LR Cont
8. Linearity:
Linear relationship b/n numeric IVs & the logit of the DV.
If not the model underestimates association, lacks power.
Box-Tidwell Test: If there is non-linearity for a numeric IV X, the [(X)*ln(X)] interaction term becomes significant in the model.
436
Fitting Logistic Model to a Dataset
In Linear Regression, the fit of the model to the dataset is achieved through Least Squares Estimation (LSE).
In Logistic Regression LSE can't be used.
In its place Maximum Likelihood Estimation (MLE) is
used.
MLE relies on the concept of Likelihood.
The likelihood of a set of data is the probability of obtaining
that particular set of data, using a given model.
437
Fitting Logistic Model Cont
For example:
Dataset B has five cases. Observed values for Y are
(1,0,1,0,1)
The model predicts the probability of occurrence of Y is 0.7
(i.e. Probability of Y=1 is 0.7, and Y=0 is 0.3)
Likelihood of B is the joint probability of predicting the correct observed value of Y for every case using the model.
i.e. L(B) = (0.7)(0.3)(0.7)(0.3)(0.7) = 0.03087
$$L(B) = \prod_{i=1}^{n} P^{y_i}(1 - P)^{1 - y_i}$$
438
Fitting Logistic Model Cont
Mathematically it is easier to work with the Log likelihood.
$$\ln L(B) = \sum_{i=1}^{n} \left[y_i \ln(P) + (1 - y_i)\ln(1 - P)\right]$$
Maximum Likelihood picks the values of the model parameters that make the data "more likely" than any other values of the parameters would make them.
The MLE of the parameter P is that value of P that maximizes L or ln L.
439
Fitting Logistic Model Cont
Iteration: Repeated testing of the data and tuning of the
model parameter to provide the best fitting equation.
Once P is determined, then α and β are estimated.
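The likelihood of dataset B and the search for the value of P that maximizes it can be sketched in code. The following is a minimal Python illustration, not the iterative algorithm used by statistical software: it assumes a single constant probability P for all five cases and finds the maximum by trying a grid of candidate values; the function names are illustrative.

```python
import math

observed = [1, 0, 1, 0, 1]   # dataset B from the example


def likelihood(p, ys):
    """Joint probability of the observed values for a constant P."""
    return math.prod(p if y == 1 else 1.0 - p for y in ys)


def log_likelihood(p, ys):
    """Sum of y*ln(P) + (1 - y)*ln(1 - P) over all cases."""
    return sum(y * math.log(p) + (1 - y) * math.log(1.0 - p) for y in ys)


print(f"L(B) at P = 0.7: {likelihood(0.7, observed):.5f}")

# Crude grid search over P = 0.01, 0.02, ..., 0.99 (illustrative only)
grid = [i / 100 for i in range(1, 100)]
p_hat = max(grid, key=lambda p: log_likelihood(p, observed))
print(f"P that maximizes ln L on the grid: {p_hat}")
```

In a real logistic model P varies from case to case through α and β, so the search runs over those parameters instead of a single P.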
440
Interpretation of Reg. Coefficients
$$P = \frac{1}{1 + e^{-z}} \quad \text{where } z = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$
α is called the Intercept and β1, β2, and so on, are called the Regression Coefficients of x1, x2, ..., respectively.
α is the value of z when the value of all risk factors is zero.
A positive coefficient means the risk factor increases the probability of the outcome, while a negative coefficient means the opposite.
A large coefficient means that the risk factor strongly influences the probability of the outcome, while a near-zero coefficient means the opposite.
441
Hypothesis Testing in Logistic Reg.
In Logistic Regression the t or F test statistic cannot be used for hypothesis testing since the dependent variable has a Bernoulli distribution.
Options:
The (log) Likelihood Ratio Statistic (-2LL),
The Wald Test,
Both test either of the following null hypotheses:
Ho: β1 = β2 = β3 = ... = βn = 0
Ho: Removing an IV from the model doesn't change its predictive ability.
442
Hypothesis Testing LR Cont.
A. Likelihood Ratio Test Statistic (-2LL):
Usually two nested models (the Full and Reduced Models) are presented.
Reduced model means a model from which a variable is purposely omitted.
Ho: The removed variable is not significant in the model.
-2 Log L = -2 [log L of reduced model - log L of full model]
$$\text{LR statistic} = -2\log\left(\frac{L \text{ of the reduced model}}{L \text{ of the full model}}\right)$$
443
Hypothesis Testing LR Cont.
If the full model explains the data 'much better' than the
reduced model, the difference will be 'large':
Reject the $H_0$ that the removed variable is non-significant.
If the reduced model explains the data as well as the full model,
the difference will be close to 0:
Fail to reject the $H_0$ that the removed variable is non-significant.
LRT ~ $\chi^2$ with df = number of removed variables.
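As a rough sketch, the likelihood ratio test can be computed from the two log-likelihoods that statistical software reports. The log-likelihood values below are hypothetical, and `chi2_sf` is a small helper written for this illustration (it uses the closed-form chi-square upper tail for whole-number df) so that no extra library is needed.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for chi-square with integer df."""
    half = x / 2.0
    if df % 2 == 0:
        # Even df: finite sum with Poisson-like terms.
        terms = sum(half ** i / math.factorial(i) for i in range(df // 2))
        return math.exp(-half) * terms
    # Odd df: normal-tail part plus a finite correction sum.
    m = (df - 1) // 2
    tail = sum(half ** (i - 0.5) / math.gamma(i + 0.5) for i in range(1, m + 1))
    return math.erfc(math.sqrt(half)) + math.exp(-half) * tail

def likelihood_ratio_test(loglik_reduced, loglik_full, removed_vars):
    """Return (-2 * (logL_reduced - logL_full), p-value)."""
    stat = -2.0 * (loglik_reduced - loglik_full)
    return stat, chi2_sf(stat, removed_vars)

# Hypothetical log-likelihoods; two variables were dropped from the full model.
stat, p_value = likelihood_ratio_test(-120.5, -115.2, removed_vars=2)
print(stat, p_value)
```

A small p-value here would suggest that the removed variables do contribute to the model.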
Hypothesis Testing LR Cont.
B. Wald Statistic:
Commonly used to test the significance of coefficients for
each independent variable.
$H_0$: A particular coefficient is zero.
$$\text{Wald test} = \frac{\beta^2}{\text{Variance of } \beta}$$
W ~ $\chi^2$ with 1 df.
For a particular IV, if W is significant, then the
parameter associated with this variable is not zero, so
it should be included in the model.
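The Wald statistic is simple enough to compute by hand from a coefficient and its standard error (the variance is the square of the standard error). The sketch below uses hypothetical values; the function name `wald_test` is an assumption made for this example.

```python
import math

def wald_test(beta, std_error):
    """W = beta^2 / Var(beta); p-value from chi-square with 1 df."""
    w = beta ** 2 / std_error ** 2
    # For 1 df the chi-square upper tail equals erfc(sqrt(w / 2)).
    p_value = math.erfc(math.sqrt(w / 2.0))
    return w, p_value

# Hypothetical coefficient 0.9 with standard error 0.3.
w, p_value = wald_test(0.9, 0.3)
print(w, p_value)
```

With these assumed numbers W is about 9, well above the 5% critical value of 3.84 for 1 df, so the coefficient would be judged different from zero.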
Pseudo R-Squares
In Linear Regression, $R^2$ measures the proportion of variance
of the DV explained by the predictors.
Ranges from 0 to 1.
Logistic Regression does not have an equivalent to $R^2$.
However, there are varieties of Pseudo $R^2$ which are
designed to simulate the real $R^2$.
Commonly used: Cox & Snell $R^2$ and Nagelkerke $R^2$.
Pseudo $R^2$ does not mean exactly what $R^2$ means in Linear
Regression: interpretation should be made with caution.
Pseudo R-Squares Cont.
A. Cox and Snell's Pseudo $R^2$
$$R^2 = 1 - \left\{ \frac{L(M_{Intercept})}{L(M_{Full})} \right\}^{2/N}$$
B. Nagelkerke Pseudo $R^2$
$$R^2 = \frac{1 - \left\{ \frac{L(M_{Intercept})}{L(M_{Full})} \right\}^{2/N}}{1 - L(M_{Intercept})^{2/N}}$$
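Both pseudo R-squares can be sketched in Python from the log-likelihoods of the intercept-only and full models. Working on the log scale avoids very small likelihood values. The numbers and the function name `pseudo_r_squared` are hypothetical.

```python
import math

def pseudo_r_squared(loglik_intercept, loglik_full, n):
    """Return (Cox & Snell R^2, Nagelkerke R^2) from log-likelihoods."""
    # (L_intercept / L_full)^(2/N) written with logs.
    cox_snell = 1.0 - math.exp(2.0 * (loglik_intercept - loglik_full) / n)
    # Largest value Cox & Snell can reach: 1 - L_intercept^(2/N).
    max_cox_snell = 1.0 - math.exp(2.0 * loglik_intercept / n)
    return cox_snell, cox_snell / max_cox_snell

# Hypothetical log-likelihoods from a sample of 200 observations.
cox_snell, nagelkerke = pseudo_r_squared(-130.0, -110.0, 200)
print(cox_snell, nagelkerke)
```

Nagelkerke's version rescales Cox and Snell's so that its maximum is 1, which is why it is always at least as large.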
Goodness of Fit Analysis
A. Hosmer-Lemeshow Statistic
The recommended test for the overall fit of a Logistic
Regression model.
A type of chi-square test, but considered more robust than the
traditional chi-square test, particularly if continuous
covariates are in the model or the sample size is small.
The HL procedure first sorts observations in increasing order of
their estimated event probability and divides them
into deciles based on the predicted probabilities.
HL statistic ~ $\chi^2$ with 8 df.
Goodness of Fit Analysis Cont
$$G^2_{HL} = \sum_{j=1}^{10} \frac{(O_j - E_j)^2}{E_j \left(1 - \frac{E_j}{n_j}\right)} \sim \chi^2 \text{ with 8 df}$$
Where
$n_j$ is the number of observations in the $j^{th}$ group
$O_j$ is the observed number of cases in the $j^{th}$ group
$E_j$ is the expected number of cases in the $j^{th}$ group
Non-significance means the model adequately fits the data.
A P value of 0.05 is used as the level of significance.
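The Hosmer-Lemeshow sum can be sketched as follows. The ten decile groups are hypothetical (made-up counts, not real data), and `hosmer_lemeshow` is a name chosen for this example. The result is compared with 15.507, the 5% critical value of chi-square with 8 df.

```python
def hosmer_lemeshow(groups):
    """Sum of (O_j - E_j)^2 / (E_j * (1 - E_j / n_j)) over the groups."""
    return sum((o - e) ** 2 / (e * (1 - e / n)) for n, o, e in groups)

# Hypothetical deciles: (n_j, observed cases O_j, expected cases E_j).
deciles = [
    (20, 1, 1.0), (20, 2, 2.0), (20, 3, 3.0), (20, 6, 4.0), (20, 5, 5.0),
    (20, 7, 7.0), (20, 9, 9.0), (20, 11, 11.0), (20, 14, 14.0), (20, 17, 17.0),
]

g_hl = hosmer_lemeshow(deciles)
CRITICAL_CHI2_8DF = 15.507  # 5% critical value, chi-square with 8 df

# True means non-significant, i.e. no evidence of poor fit.
print(g_hl, g_hl < CRITICAL_CHI2_8DF)
```

In this made-up example only one decile departs from its expected count, so the statistic is small and the model would be judged to fit adequately.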
Goodness of Fit Analysis Cont
B. Loglikelihood Statistics
A good model is the one that results in a high likelihood of
the observed results.
This translates into a small value for -2LL.
If a model fits perfectly, the -2LL would be 0.
Since there is no acceptable upper cutoff point for -2LL
test, it is difficult to interpret the meaning of the score.
Less commonly used.
Logistic Regression Using SPSS
Analyze > Regression > Binary Logistic >Put the
dependent and independent variables > Mark categorical
independent variables > check for the options > Ok.
Or
Analyze > Regression > Multinomial Logistic > Put the
dependent variable > Put the independent variables as
factors or covariates depending on their nature > check
for available options > Ok.
Analysis of Variance
(ANOVA)
ANOVA
Used to compare the mean of a quantitative variable across
different categories of a categorical variable.
With one categorical variable (factor), it is called One-way ANOVA.
If two factors are involved, it is called Two-way ANOVA.
If the categorical variable has only 2 values, a 2-sample t-test
can be used.
ANOVA allows for comparison among 3 or more groups.
ANOVA has an advantage over repeated two-sample t-tests:
doing multiple two-sample t-tests would result in a greatly
increased chance of committing a type I error.
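The inflation of the type I error can be illustrated with a small calculation. The sketch assumes, for simplicity, that the pairwise tests are independent (they are not exactly so in practice, so the figures are only approximate); `familywise_error` is a hypothetical helper name.

```python
from math import comb

def familywise_error(groups, alpha=0.05):
    """Number of pairwise t-tests and the chance of at least one false positive."""
    tests = comb(groups, 2)           # every pair of groups gets its own t-test
    return tests, 1 - (1 - alpha) ** tests

for k in (3, 4, 5):
    tests, error = familywise_error(k)
    print(k, tests, round(error, 3))
```

Under this assumption, 3 groups need 3 tests and the chance of at least one false positive is about 14%, and 5 groups need 10 tests with a chance of about 40%, far above the nominal 5%.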
ANOVA Cont
In ANOVA, whether the differences between the groups are
significant depends on:
The difference in the means
The standard deviations of each group
The sample sizes
ANOVA determines P-value from the F statistic.
Hypothesis:
$H_0$: The means of all the groups are equal.
$H_1$: Not all the means are equal.
It does not tell us which ones differ.
Once a global difference is detected, it should be followed
up with multiple comparisons (post hoc tests) to identify
specific differences.
ANOVA Cont
Assumptions of ANOVA:
Each group is approximately normally distributed,
Observed data constitute independent random samples
from the respective population,
Standard deviations of each group are approximately
equal
Rule of thumb: ratio of largest to smallest sample
standard deviation must be less than 2:1
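The rule of thumb for equal standard deviations is easy to check before running ANOVA. The sketch below uses hypothetical measurements; `sd_ratio` is a name introduced only for this example.

```python
import statistics

def sd_ratio(groups):
    """Ratio of the largest to the smallest sample standard deviation."""
    sds = [statistics.stdev(g) for g in groups]
    return max(sds) / min(sds)

# Hypothetical measurements for three groups.
groups = [
    [4.1, 5.0, 5.9, 5.2],
    [6.3, 7.1, 6.8, 7.6],
    [5.5, 6.9, 4.8, 6.0],
]

ratio = sd_ratio(groups)
print(ratio, ratio < 2)  # True suggests the equal-SD assumption is reasonable
```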
ANOVA Cont
ANOVA is a technique whereby the total variation
present in a dataset is segregated into several
components.
Variation is the sum of the squares of the deviations
between each value and the mean of the values.
Sum of square (SS) is another name for variation.
ANOVA measures two sources of variation in the data
and compares their relative sizes.
Between group variation
Within group variation
ANOVA Cont
Between group variation:
Is there some variation between the groups?
Sometimes called the variation due to the factor.
Denoted SS(B) for Sum of Squares (variation) between
the groups.
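Calculated as follows, where $n_i$ is the size of group $i$, $\bar{x}_i$ is its mean,
and $\bar{\bar{x}}$ (x double bar) is the grand mean:
$$SS(B) = \sum_{i=1}^{k} n_i (\bar{x}_i - \bar{\bar{x}})^2$$
$$SS(B) = n_1(\bar{x}_1 - \bar{\bar{x}})^2 + n_2(\bar{x}_2 - \bar{\bar{x}})^2 + \dots + n_k(\bar{x}_k - \bar{\bar{x}})^2$$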
ANOVA Cont
Within group variation:
Is there some variation within the groups?
Sometimes called the error variation, as it is the variation
that cannot be explained by the factor.
Denoted SS(W) for Sum of Squares (variation) within
the groups.
Calculated as follows, where $n_i$ is the sample size and $s_i$ is
the standard deviation of group $i$:
$$SS(W) = \sum_{i=1}^{k} (n_i - 1) s_i^2$$
$$SS(W) = (n_1 - 1) s_1^2 + (n_2 - 1) s_2^2 + \dots + (n_k - 1) s_k^2$$
ANOVA Cont
Variance:
Based on the variation (SS), a variance is calculated for
both sources (between and within groups).
The variance is also called the Mean of the Squares and
abbreviated MS, often with an accompanying label:
MS(B) or MS(W).
Calculated by dividing the variation by the df:
MS = SS / df
The between group df is one less than the number of
groups (k-1).
The within group df is the sum of the individual dfs of
each group; in other words, it is (n-k), where n is the total
number of observations.
ANOVA Cont
The F distribution:
Used as test of significance in ANOVA.
The F distribution is defined as the distribution of
(Z/n1)/(W/n2), where Z has a chi-square distribution with
n1 df, W has a chi-square distribution with n2 df, and Z
and W are statistically independent.
In ANOVA, the F test statistic is the ratio of two sample
variances (MSB/MSW).
The df for the numerator are the df for the between
group (k-1) and the df for the denominator are the df for
the within group (n-k).
A large F is evidence against $H_0$, since it indicates that
there is more difference between groups than within groups.
ANOVA Cont
Example:
Suppose we have three groups:
Group 1: 5.3, 6.0, 6.7
Group 2: 5.5, 6.2, 6.4, 5.7
Group 3: 7.5, 7.2, 7.9
Then we compute the ANOVA F statistic in the following
manner.
ANOVA Cont
                      WITHIN difference:       BETWEEN difference:
                      data - group mean        group mean - overall mean
data   group   mean   plain     squared        plain     squared
5.3    1       6.00   -0.70     0.490          -0.44     0.194
6.0    1       6.00    0.00     0.000          -0.44     0.194
6.7    1       6.00    0.70     0.490          -0.44     0.194
5.5    2       5.95   -0.45     0.203          -0.49     0.240
6.2    2       5.95    0.25     0.063          -0.49     0.240
6.4    2       5.95    0.45     0.203          -0.49     0.240
5.7    2       5.95   -0.25     0.063          -0.49     0.240
7.5    3       7.53   -0.03     0.001           1.09     1.195
7.2    3       7.53   -0.33     0.111           1.09     1.195
7.9    3       7.53    0.37     0.134           1.09     1.195
TOTAL                           1.757                    5.127
TOTAL/df                        0.250952                 2.563667
overall mean: 6.44
F = 2.563667/0.250952 = 10.21575
ANOVA Cont
ANOVA
Source of Variation   SS         df   MS         F          P-value    F crit
Between Groups        5.127333   2    2.563667   10.21575   0.008394   4.737416
Within Groups         1.756667   7    0.250952
Total                 6.884      9
Between Groups df: 1 less than the number of groups.
Within Groups df: number of data values - number of groups
(equals the df for each group added together).
Total df: 1 less than the number of individuals
(just like other situations).
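The ANOVA table for the three-group example can be reproduced with a short Python sketch. The function `one_way_anova` is a hypothetical helper written for this illustration. The p-value line relies on a closed form of the F upper tail that holds only when the numerator df is 2, as it is here; for other designs a statistics library would be needed.

```python
def one_way_anova(groups):
    """Return SS(B), SS(W), MS(B), MS(W) and F for a list of groups."""
    n = sum(len(g) for g in groups)          # total number of observations
    k = len(groups)                          # number of groups
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ss_between, ss_within, ms_between, ms_within, ms_between / ms_within

# Data from the example above.
groups = [
    [5.3, 6.0, 6.7],
    [5.5, 6.2, 6.4, 5.7],
    [7.5, 7.2, 7.9],
]

ssb, ssw, msb, msw, f_stat = one_way_anova(groups)

# Upper tail of F(2, 7); this shortcut is valid only for numerator df = 2.
df_within = 7
p_value = (1 + 2 * f_stat / df_within) ** (-df_within / 2)

print(ssb, ssw, msb, msw, f_stat, p_value)
```

Since F exceeds the critical value 4.737 (equivalently, the p-value is below 0.05), the hypothesis of equal group means is rejected for this example.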
ANOVA Using SPSS
Analyze > Compare means > One way ANOVA > Put
the continuous variable under Dependent list > Put the
categorical variable under Factor > Select Post hoc
tests > Ok.
Thank You