
Statistics Made Easy

Hani Tamim, MPH, PhD


Assistant Professor
Epidemiology and Biostatistics
Research Center / College of Medicine
King Saud bin Abdulaziz University for Health Sciences
Riyadh Saudi Arabia

Objective of medical research

Is treatment A better than treatment B for patients with


hypertension?

What is the survival rate among ICU patients?

What is the incidence of Down syndrome among a certain


group of people?

Is the use of Oral Contraceptives associated with an increased risk


of breast cancer?

Research Process?

Planning

Design

Data collection

Analysis

Data entry

Data cleaning

Data management

Data analysis

Reporting

Statistics is used in every one of these steps.

What is statistics?

Scientific methods for:

Collecting

Organizing

Summarizing

Presenting

Interpreting

data

Definition of some basic terms

Population: The largest collection of entities for which we have


interest at a particular time

Sample: A part of a population

Simple random sample: a sample of size n drawn from a
population of size N in such a way that every possible sample of size n
has the same chance of being selected

Definition of some basic terms

Variable: A characteristic of the subjects under observation that
takes on different values for different cases, examples: age, gender,
diastolic blood pressure

Quantitative variables: Are variables that can convey information


regarding amount

Qualitative variables: Are variables in which measurements


consist of categorization

Types of variables

Categorical variables

Continuous variables

Categorical variables
Nominal: unordered data

Death

Gender

Country of birth

Ordinal: Predetermined order among response classifications

Education

Satisfaction

Continuous variables
Continuous: Not restricted to integers

Age

Weight

Cholesterol

Blood pressure

Steps involved (data)

Data collection

Database structure

Data entry

Data cleaning

Data management

Data analyses

Data collection

Data collection:

Collection of information that will be used to answer the research


question
Could be done through questionnaires, interviews, data abstraction,
etc.

Data collection

Database structure

Database structure:

Structure the database (using SPSS) into which the data will be
entered

Data entry

Data entry:

Entering the information (data) into the computer

Most of the time it is done manually

Single data entry


Double data entry

Data cleaning

Data cleaning:

Identify any data entry mistakes

Correct such mistakes

Data management

Data management:

Create new variables based on different criteria

Such as:

BMI
Recoding
Categorizing age (less than 50 years, and 50 years and above)
Etc.
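As an illustrative sketch of the data-management step (not from the slides: the two example patients and variable names are made up), deriving BMI and the age category described above might look like:

```python
# Hypothetical illustration of data management: creating new variables
# (BMI, an age category) from variables already collected.

def bmi(weight_kg, height_m):
    """Body mass index = weight (kg) / height (m) squared."""
    return weight_kg / height_m ** 2

def age_group(age):
    """Recode age into the two categories used on the slide."""
    return "less than 50 years" if age < 50 else "50 years and above"

# Made-up example records
patients = [{"age": 42, "weight_kg": 80, "height_m": 1.75},
            {"age": 63, "weight_kg": 95, "height_m": 1.80}]

for p in patients:
    p["bmi"] = round(bmi(p["weight_kg"], p["height_m"]), 1)
    p["age_group"] = age_group(p["age"])
```

The same recoding would normally be done inside SPSS; the point is only that new variables are computed from existing ones before analysis.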

Data analyses

Data analyses:

Descriptive statistics: are the techniques used to describe the main


features of a sample

Inferential statistics: is the process of using the sample statistic to


make informed guesses about the value of a population parameter

Data analyses

Data analyses:

Univariate analyses

Bivariate analyses

Multivariate analyses

Bottom line

There are different statistical methods


for different types of variables

Descriptive statistics: categorical variables

Frequency distribution

Graphical representation

Descriptive statistics: categorical variables

Frequency distribution

A frequency distribution lists, for each value (or small range of


values) of a variable, the number or proportion of times that
observation occurs in the study population

Descriptive statistics: categorical variables

Frequency distribution:

How to describe a categorical variable (marital status)?

Descriptive statistics: categorical variables

Construct a frequency distribution

Title

Values

Frequency

Relative frequency (percent)

Valid relative frequency (valid percent)

Cumulative relative frequency (cumulative percent)

Descriptive statistics: categorical variables

Marital status of the 291 patients admitted to the Emergency Department

                      Frequency   Percent   Valid Percent   Cumulative Percent
Valid     Married        266        91.4        94.7              94.7
          Single          13         4.5         4.6              99.3
          Widow            2          .7          .7             100.0
          Total          281        96.6       100.0
Missing   System          10         3.4
Total                    291       100.0
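As a sketch of how such a frequency table is computed (not from the slides: the ten-record sample here is made up, with None marking a missing value):

```python
from collections import Counter

# Made-up sample; None marks a missing value, as in the SPSS table.
data = ["Married"] * 6 + ["Single"] * 3 + [None] * 1

counts = Counter(v for v in data if v is not None)
n_total = len(data)                 # all cases, including missing
n_valid = sum(counts.values())      # cases with a recorded value

cumulative = 0.0
for value, freq in counts.most_common():
    percent = 100 * freq / n_total        # relative frequency
    valid_percent = 100 * freq / n_valid  # excludes missing values
    cumulative += valid_percent           # cumulative relative frequency
    print(value, freq, round(percent, 1),
          round(valid_percent, 1), round(cumulative, 1))
```

Note how "Percent" divides by all 10 cases while "Valid Percent" divides by the 9 non-missing ones, exactly as in the SPSS output above.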

Example

Example: summarizing data

Descriptive statistics: categorical variables

Graphical representation

A graph lists, for each value (or small range of values) of a variable,
the number or proportion of times that observation occurs in the
study population

Descriptive statistics: categorical variables

Graphical representation:

Two types

Bar chart

Pie chart

Descriptive statistics: categorical variables

Construct a bar or pie chart

Title

Values

Frequency or relative frequency

Properly labelled axes

Descriptive statistics: categorical variables

[Figure: bar chart of marital status]

[Figure: pie chart of marital status]

Descriptive statistics: continuous variables

Central tendency

Dispersion

Graphical representation

Descriptive statistics: continuous variables

How to describe a continuous variable (Systolic blood pressure)?

Central tendency:

Mean

Median

Mode

Descriptive statistics: continuous variables

Mean:

Add up data, then divide by sample size (n)


The sample size n is the number of observations (pieces of
data)
Example

n = 5 systolic blood pressures (mmHg)

X1 = 120, X2 = 80, X3 = 90, X4 = 110, X5 = 95

X̄ = (120 + 80 + 90 + 110 + 95) / 5 = 99 mmHg

Descriptive statistics: continuous variables

Formula

X̄ = (Σ Xᵢ) / n, with the sum taken over i = 1 to n

Summation Sign

The summation sign (Σ) is just a mathematical shorthand for "add
up all of the observations":

Σ Xᵢ (i = 1, ..., n) = X1 + X2 + X3 + ... + Xn

Descriptive statistics: continuous variables

Also called sample average or arithmetic mean (X̄)

Sensitive to extreme values

One data point could make a great change in sample mean

Uniqueness

Simplicity

Descriptive statistics: continuous variables

Median: is the middle number, or the number that cuts the data in
half

80

90

95

110 120

The sample median is not sensitive to extreme values


For example: If 120 became 200, the median would remain the
same, but the mean would change to 115.

Descriptive statistics: continuous variables

If the sample size is an even number

80   90   95   110   120   125

Median = (95 + 110) / 2 = 102.5 mmHg

Descriptive statistics: continuous variables

Median: Formula

n = odd: Median = the middle value (the (n+1)/2-th observation)

n = even: Median = mean of the middle 2 values (the n/2-th and
(n+2)/2-th observations)

Properties:

Uniqueness
Simplicity
Not affected by extreme values

Descriptive statistics: continuous variables

Mode: Most frequently occurring number

80   90   95   95   120   125

Mode = 95
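All three measures of central tendency can be checked with Python's standard library; a quick sketch using the blood-pressure values from the slides:

```python
import statistics

# Mean and median for the five example systolic blood pressures.
bp = [120, 80, 90, 110, 95]

mean = statistics.mean(bp)      # (120 + 80 + 90 + 110 + 95) / 5
median = statistics.median(bp)  # middle value of the sorted data

# The mode needs a repeated value; this is the slide's mode example.
mode = statistics.mode([80, 90, 95, 95, 120, 125])

print(mean, median, mode)  # 99, 95, 95
```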

Descriptive statistics: continuous variables

Example:
Statistics
Systolic blood pressure
N
Valid
286
Missing
5
Mean
144.13
Median
144.50
Mode
155

Descriptive statistics: continuous variables

Central tendency measures do not tell the whole story

Example:

Data set 1:  21  22  23  23  23  24  24  25  28
Mean = 213/9 = 23.7, Median = 23

Data set 2:  15  18  21  21  23  25  25  32  33
Mean = 213/9 = 23.7, Median = 23

Descriptive statistics: continuous variables

How to describe a continuous variable (Systolic blood pressure)


in addition to central tendency?

Measures of dispersion:

Range

Variance

Standard Deviation

Descriptive statistics: continuous variables

Range

Range = Maximum − Minimum

Example:

X1 = 120, X2 = 80, X3 = 90, X4 = 110, X5 = 95

Range = 120 − 80 = 40

Descriptive statistics: continuous variables

Sample variance (s² or var or σ²)

The sample variance is the average of the squared deviations
about the sample mean:

s² = Σ (Xᵢ − X̄)² / (n − 1), with the sum taken over i = 1 to n

Sample standard deviation (s or SD or σ)

It is the square root of the variance:

s = √[ Σ (Xᵢ − X̄)² / (n − 1) ]

Descriptive statistics: continuous variables

Example: n = 5 systolic blood pressures (mm Hg)

Recall, from earlier: average = 99 mm Hg

X1 = 120, X2 = 80, X3 = 90, X4 = 110, X5 = 95

Σ (Xᵢ − X̄)² = (120 − 99)² + (80 − 99)² + (90 − 99)²
             + (110 − 99)² + (95 − 99)² = 1020

Descriptive statistics: continuous variables

Sample Variance

s² = Σ (Xᵢ − X̄)² / (n − 1) = 1020 / 4 = 255

Sample standard deviation (SD)

s = √s² = √255 = 15.97 (mm Hg)
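The same hand calculation can be reproduced with Python's standard library, which uses the same n − 1 denominator:

```python
import statistics

# Sample variance and standard deviation for the five blood pressures,
# matching the hand calculation: 1020 / 4 = 255, sqrt(255) = 15.97.
bp = [120, 80, 90, 110, 95]

var = statistics.variance(bp)  # sum of squared deviations / (n - 1)
sd = statistics.stdev(bp)      # square root of the variance

print(var, round(sd, 2))  # 255, 15.97
```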

Descriptive statistics: continuous variables

The bigger s, the more variability

s measures the spread about the mean

s can equal 0 only if there is no spread

All n observations have the same value

The units of s are the same as the units of the data (for example,
mm Hg)

Descriptive statistics: continuous variables

Example:

Statistics

Systolic blood pressure
N      Valid          286
       Missing          5
Mean               144.13
Median             144.50
Mode                  155
Std. Deviation     35.312
Variance         1246.916
Range                 202
Minimum                55
Maximum               257

Example: summarizing data

Descriptive statistics: continuous variables

Graphical representation:

Different types

Histogram

Descriptive statistics: continuous variables

Construct a chart

Title

Values

Frequency or relative frequency

Properly labelled axes

Descriptive statistics: continuous variables

Shapes of the Distribution

Three common shapes of frequency distributions:

Symmetrical and bell shaped

Positively skewed or skewed to the right

Negatively skewed or skewed to the left

Shapes of Distributions

Symmetric (Right and left sides are mirror images)

Left tail looks like right tail


Mean = Median = Mode


Shapes of Distributions

Left skewed (negatively skewed)

Long left tail


Mean < Median


Shapes of Distributions

Right skewed (positively skewed)

Long right tail


Mean > Median


Shapes of the Distribution

Three less common shapes of frequency distributions:

A: Bimodal

B: Reverse J-shaped

C: Uniform

Probability

Probability

Definition:

The likelihood that a given event will occur

It ranges between 0 and 1:

0 means the event cannot occur

1 means the event is certain to occur

How do we calculate it?

Frequentist Approach:

Probability: is the long-term relative frequency

Thus, it is an idealization based on imagining what would
happen to the relative frequencies in an indefinitely long
series of trials

Application in medicine

How does probability apply in medicine?

Probability is the most important theory behind biostatistics

It is used at different levels

Descriptive

Example: 4% chance of a patient dying after admission to


emergency department (from the previous example)

What do we mean?
Out of each 100 patients admitted to the emergency department, 4
will die, whereas 96 will be discharged alive

Example: 1 in 1000 babies are born with a certain abnormality!

Incidence and prevalence

Associations

Example: the association between cigarette smoking and death


after admission to the emergency department with an MI

Current Cigarette Smoking in association with death at discharge

Count

                                  Death at discharge
                                  Death   Discharged   Total
Current Cigarette    No             5        123        128
Smoking              Yes            5        154        159
Total                              10        277        287

Probability of being a smoker = 159 / 287 = 55.4%

Probability of dying if a smoker = 5 / 159 = 3.1%

Probability of dying if a non-smoker = 5 / 128 = 3.9%

Associations

Same is applied to:

Relative risk

Risk difference

Attributable risk

Odds ratio

Etc..

Bottom line

Probability is applied at all levels of statistical analyses

Probability distributions

Probability distributions list or describe probabilities for all possible


occurrences of a random variable

There are two types of probability distributions:

Categorical distributions

Continuous distributions

Probability distributions: categorical variables

Categorical variables

Frequency distribution

Other distributions, such as binomial

Probability distributions: continuous variables

Continuous variables

Continuous distribution

Such as Z and t distributions

Normal Distribution

Properties of a Normal Distribution

Also called Gaussian distribution

A continuous, Bell shaped, symmetrical distribution; both


tails extend to infinity

The mean, median, and mode are identical

The shape is completely determined by the mean and


standard deviation

Normal Distribution

A normal distribution can have any mean (μ) and any SD (σ):

e.g.: Age: μ = 40, σ = 10

The area under the curve represents 100% of all the observations

Mean = Median = Mode

Normal Distribution

Normal Distribution
Age distribution for a specific population

50%

50%

Mean=40
SD=10

Normal Distribution
Age distribution for a specific population

Age = 25

Mean=40
SD=10

Normal distribution

The formula used to calculate the area below a certain point in a
normal distribution is the probability density function of the
normal distribution with mean μ and variance σ²
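Written out (a standard result, stated here for completeness rather than taken from the slide), that density is:

```latex
f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}
```

where μ is the mean and σ² the variance; the area under f between two values gives the probability of falling between them.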

Normal distribution

Thus, for any normal distribution, once we have the mean and sd,
we can calculate the percentage of subjects:

Above a certain level

Below a certain level

Between different levels

But the problem is:

Calculation is very complicated and time consuming, so:

Standardized Normal Distribution

We standardize the normal distribution

What does this mean?

For one specific distribution, we calculate all possible probabilities,
and record them in a table

A normal distribution with μ = 0, σ = 1 is called a Standardized
Normal Distribution

Standardized Normal Distribution

Mean=0
SD=1

Area under the Normal Curve from 0 to Z (full table abridged; selected values)

  Z       Area from 0 to Z
 0.50         0.19146
 1.00         0.34134
 1.50         0.43319
 1.96         0.47500
 2.00         0.47725
 2.50         0.49379
 3.00         0.49865

Standardized Normal Distribution

Normal Distribution (Mean = μ, SD = σ)

TRANSFORM:  Z = (x − μ) / σ

Standardized Normal Distribution (Z) (Mean = 0, SD = 1)

Standardized Normal Distribution

Normal Distribution (Mean = 40, SD = 10)

TRANSFORM:  Z(40) = (x − μ) / σ = (40 − 40) / 10 = 0

Standardized Normal Distribution (Z) (Mean = 0, SD = 1)

Standardized Normal Distribution

Normal Distribution (Mean = 40, SD = 10), x = 30

TRANSFORM:  Z(30) = (x − μ) / σ = (30 − 40) / 10 = −1

Standardized Normal Distribution (Z) (Mean = 0, SD = 1)

Standardized Normal Distribution: summary

For any normal distribution, we can

Transform the values to the standardized normal distribution (Z)

Use the Z table to get the following areas

Above a certain level

Below a certain level

Between different levels
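The transform-then-look-up procedure can be sketched without a printed table, since the table's areas come from the standard normal CDF (here built from the error function in Python's math module); this reproduces the slide's age example (x = 30, mean 40, SD 10):

```python
from math import erf, sqrt

# Phi(z): standard normal cumulative distribution function.
def phi(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma = 40, 10
z = (30 - mu) / sigma   # transform: z = (x - mu) / sigma
below = phi(z)          # proportion of the population below age 30

print(z, round(below, 4))  # -1.0, 0.1587
```

The area below z = −1 is 0.1587, i.e. 0.5 minus the table entry 0.34134 for z = 1.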

Normal Distribution
Age distribution for a specific population (Mean = 40, SD = 10)

68% of observations lie between Mean − 1SD = 30 and Mean + 1SD = 50

95% of observations lie between Mean − 2SD = 20 and Mean + 2SD = 60

99.7% of observations lie between Mean − 3SD = 10 and Mean + 3SD = 70

Practical example

The 68-95-99.7 Rule for the Normal Distribution

68% of the observations fall within one standard deviation of the


mean

95% of the observations fall within two standard deviations of the


mean

99.7% of the observations fall within three standard deviations of


the mean

When applied to real data, these estimates are considered


approximate!
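The rule itself can be verified numerically; the probability that a normal variable falls within k standard deviations of its mean is erf(k/√2), which gives the slightly more exact values 68.3%, 95.4%, and 99.7%:

```python
from math import erf, sqrt

# P(|X - mu| < k*sigma) for a normal variable is erf(k / sqrt(2)).
for k in (1, 2, 3):
    print(k, round(100 * erf(k / sqrt(2)), 1))
```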

Distributions of Blood Pressure

[Figure: the 68-95-99.7 rule applied to the distribution of systolic
blood pressure in men; Mean = 125 mmHg, s = 14 mmHg; 68% within
111–139, 95% within 97–153, 99.7% within 83–167]

Data analyses

Data analyses:

Descriptive statistics: are the techniques used to describe the main


features of a sample

Inferential statistics: is the process of using the sample statistic to


make informed guesses about the value of a population parameter

Why do we carry out research?

population

sample

Inference: Drawing
conclusions on certain
questions about a
population from sample data

Inferential statistics

Since we are not taking the whole population, we have to draw


conclusions on the population based on results we get from the
sample

Simple example: Say we want to estimate the average systolic


blood pressure for patients admitted to the emergency department
after having an MI

Other more complicated measures might be quality of life,


satisfaction with care, risk of outcome, etc.

Inferential statistics

What do we do?

Take a sample (n=291) of patients admitted to emergency


department in a certain hospital

Calculate the mean and SD (descriptive statistics) of systolic blood


pressure
Statistics

Systolic blood pressure
N      Valid         286
       Missing         5
Mean              144.13
Std. Deviation    35.312

Inferential statistics

The next step is to make a link between the estimates we observed


from the sample and those of the underlying population (inferential
statistics)

What can we say about these estimates as compared to the


unknown true ones???

In other words, we are trying to estimate the average systolic blood
pressure for ALL patients admitted to the emergency department
after an MI

Inferential statistics

Sample data

N=291

Mean=144
SD=35

Inference

In statistical inference we usually encounter TWO issues

Estimate value of the population parameter. This is done through


point estimate and interval estimate (Confidence Interval)

Evaluate a hypothesis about a population parameter rather than


simply estimating it. This is done through tests of significance
known as hypothesis testing (P-value)

1- Confidence Interval

Confidence Intervals

A point estimate:
A single numerical value used to estimate a population parameter.

Interval estimate:
Consists of 2 numerical values defining a range of values that with
a specified degree of confidence includes the parameter being
estimated.
(Usually interval estimate with a degree of 95% confidence is
used)

Example

What is the average systolic blood pressure for patients admitted


to emergency departments after an MI?

Select a sample

Point estimate = mean = 144

Interval estimate = 95% CI = (140 – 148)

95% Confidence Interval:

Upper limit = x̄ + z(1−α/2) × SE = 144 + 1.96 × 35/√291 = 148
Lower limit = x̄ − z(1−α/2) × SE = 144 − 1.96 × 35/√291 = 140

Sampling distribution of the mean

N = 291

[Figure: 95% of sample means lie between μ − 2SE and μ + 2SE]

Standard error

SE = SD / √n

As sample size increases the standard error decreases

The estimation as measured by the confidence interval will be
better, i.e. a narrower confidence interval
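The interval quoted earlier can be reconstructed from the sample summary (mean 144, SD 35, n = 291); a short sketch:

```python
from math import sqrt

# 95% confidence interval for the mean from the summary statistics.
mean, sd, n = 144, 35, 291

se = sd / sqrt(n)           # standard error of the mean
lower = mean - 1.96 * se    # 1.96 = z value for 95% confidence
upper = mean + 1.96 * se

print(round(se, 2), round(lower), round(upper))  # 2.05, 140, 148
```

Doubling n would shrink SE by a factor of √2, narrowing the interval accordingly.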

Interpretation
95% Confidence Interval

There is 95% probability that the true parameter is within the
calculated interval

Thus, if we repeat the sampling procedure 100 times, the above
statement will be:

correct 95 times (the true parameter is within the interval)

wrong 5 times (the true parameter is outside the interval) (also
called the α error)

Notes on Confidence Intervals

Interpretation

It provides a range of values for the population average systolic
blood pressure, together with the level of confidence in that range

Are all CIs 95%?

No

It is the most commonly used

A 99% CI is wider

A 90% CI is narrower

Notes on Confidence Intervals

To be more confident you need a bigger interval

For a 99% CI, you need 2.6 SEM

For a 95% CI, you need 2 SEM

For a 90% CI, you need 1.65 SEM

2- P-value

Inference
P-value

Is related to another type of inference

Hypothesis testing

Evaluate a hypothesis about a population parameter rather than


simply estimating it

Hypothesis testing

Back to our previous example

We want to make inference about the average systolic blood


pressure of patients admitted to emergency department after MI

Assume that the normal systolic blood pressure is 120

The question is whether the average systolic blood pressure for


patients admitted to emergency departments is different than the
normal, which is 120

Hypothesis testing

Two types of hypotheses:

Null hypothesis: is a statement consistent with no difference

Alternative hypothesis: is a statement that disagrees with the null


hypothesis, and is consistent with presence of difference

The logic of hypothesis testing

To decide which of the hypotheses is true

Take a sample from the population

If the data are consistent with the null hypothesis, then we do not
reject the null hypothesis (conclusion = no difference)

If the sample data are not consistent with the null hypothesis, then
we reject the null (conclusion = difference)

Hypothesis testing

Example: is the systolic blood pressure for patients admitted to
emergency department after an MI normal (i.e. = 120)?

Ho: μ = 120

Ha: μ ≠ 120

How do we answer this question?

We take a sample and find that the mean is 144 mmHg

Can we consider that 144 is consistent with the normal value
(120 mmHg)?

Hypothesis testing

N = 291

Ho: μ = 120

[Figure: sampling distribution of the sample mean under Ho, with the
observed mean of 144 marked]

It looks like it is consistent with the null hypothesis

Is it still consistent with the null hypothesis?

[Figure: the same distribution with 95% of sample means between
μ − 2SE and μ + 2SE, and 2.5% in each tail]

Test statistic

It is the statistic used for deciding whether the null hypothesis


should be rejected or not

Used to calculate the probability of getting the observed results if


the null hypothesis is true.

This probability is called the p-value.

How to decide

We calculate the probability of obtaining a sample with mean of


144 if the true mean is 120 due to chance alone (p-value)

Based on p-value we make our decision:

If the p-value is low then this is taken as evidence that it is unlikely


that the null hypothesis is true, then we reject the null hypothesis (we
accept alternative one)

If the p-value is high, it indicates that most probably the null


hypothesis is true, and thus we do not reject the Ho

Problem!

We could be making the wrong decisions


Decision               Ho True              Ho False
Do not reject Ho       Correct decision     Type II error
Reject Ho              Type I error         Correct decision

Type I error: is rejecting the null hypothesis when it is true

Type II error: is not rejecting the null hypothesis when it is false

Error

Type I error:

Referred to as α

Probability of rejecting a true null hypothesis

Type II error:

Referred to as β

Probability of accepting a false null hypothesis

Power:

Represented by 1 − β

Probability of correctly rejecting a false null hypothesis

Significance level

The significance level, α, of a hypothesis test is defined as the
probability of making a type I error, that is the probability of
rejecting a true null hypothesis
It could be set to any value, as:

0.05

0.01

0.1

Statistical significance

If the p-value is less than some pre-determined cutoff (e.g. 0.05),
the result is called statistically significant

This cutoff is the α-level

The α-level is the probability of a type I error

It is the probability of falsely rejecting H0

Back to the example

To test whether the average systolic blood pressure for patients


admitted to the emergency department after an MI is different
than 120 (which is the normal blood pressure)

We carry out a test called the one sample t-test, which provides a p-value based on which we accept or reject the null hypothesis.

Back to the example


One-Sample Statistics

                           N     Mean    Std. Deviation   Std. Error Mean
Systolic blood pressure   286   144.13       35.312             2.088

One-Sample Test (Test Value = 120)

                            t       df    Sig. (2-tailed)   Mean Difference   95% CI of the Difference
Systolic blood pressure   11.558   285         .000             24.133            (20.02, 28.24)

Since p-value is less than 0.05, then the conclusion will be that the
systolic blood pressure for patients admitted to emergency
department after an MI is significantly higher than the normal
value which is 120
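The t statistic in that output can be recomputed from the summary values alone (mean 144.13, SD 35.312, n = 286, test value 120); a sketch:

```python
from math import sqrt

# One-sample t statistic from the SPSS summary values.
mean, sd, n, mu0 = 144.13, 35.312, 286, 120

se = sd / sqrt(n)       # standard error of the mean (about 2.088)
t = (mean - mu0) / se   # distance from 120 in standard-error units

print(round(se, 3), round(t, 2))
```

The result (about 11.56, versus SPSS's 11.558 from unrounded inputs) is far out in the tail of the t distribution with 285 df, hence the p-value reported as .000.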

p-values

p-values are probabilities (numbers between 0 and 1)

Small p-values mean that the sample results are unlikely when the
null is true

The p-value is the probability of obtaining a result as or more
extreme than you did by chance alone, assuming the null
hypothesis H0 is true

t-distribution

The t-distribution looks like a standard normal curve

A t-distribution is determined by its degrees of freedom (n-1), the


lower the degrees of freedom, the flatter and fatter it is
[Figure: standard normal curve, Normal(0,1), overlaid with t
distributions for 35 and 15 degrees of freedom]

t table (full table abridged; selected one-sided confidence levels by
degrees of freedom)

  df      90%      95%     97.5%    99.5%
   1     3.078    6.314    12.71    63.66
   5     1.476    2.015     2.571    4.032
  10     1.372    1.812     2.228    3.169
  20     1.325    1.725     2.086    2.845
 100     1.290    1.660     1.984    2.626
  ∞      1.282    1.645     1.960    2.576

Hypothesis Testing

Different types of hypothesis:

Mean (a) = Mean (b)

Proportion (a) = Proportion (b)

Variance (a) = Variance (b)

OR = 1

RR = 1

RD = 0

Test of homogeneity

Etc..

Example

Comparing two means: paired testing

In the previous example, is the heart rate at admission different


than the heart rate at discharge among the patients admitted to the
emergency department after an MI?
Statistics

                     Heart Rate at admission   Heart Rate at discharge
N       Valid                 286                        77
        Missing                 5                       214
Mean                         82.64                     76.99
Std. Deviation               22.598                    17.900

Is this decrease in heart rate statistically significant?

Thus, we have to make inference.

Comparing two means: paired testing

What type of test to be used?

Since the measurements of the heart rate at admission and at


discharge are dependent on each other (not independent), another
type of test is used

Paired t-test

Comparing two means: paired testing


Paired Samples Statistics

Pair 1                      Mean    N    Std. Deviation   Std. Error Mean
Heart Rate at admission    81.16    75       23.546            2.719
Heart Rate at discharge    76.72    75       17.973            2.075

Paired Samples Test

Pair 1: Heart Rate at admission − Heart Rate at discharge

Paired Differences: Mean = 4.440, Std. Deviation = 25.302,
Std. Error Mean = 2.922, 95% CI of the Difference = (−1.381, 10.261)

t = 1.520, df = 74, Sig. (2-tailed) = .133

95% CI = 4.4 ± 1.96 × 2.9

H0: μ(b) − μ(a) = 0
HA: μ(b) − μ(a) ≠ 0

P-value = 0.133, thus no significant difference

How Are p-values Calculated?

t = (sample mean − μ0) / SEM

t = 4.4 / 2.9 = 1.52

The value t = 1.52 is called the test statistic
Then we can compare the t-value in the table and get the
p-value, or get it from the computer (0.13)
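The same calculation, using the unrounded summary values from the paired-samples output (mean difference 4.440, standard error 2.922, 74 df), as a sketch:

```python
# Paired t statistic from the summary values on the slide.
mean_diff, se, df = 4.440, 2.922, 74

t = (mean_diff - 0) / se  # test statistic under H0: mean difference = 0

print(round(t, 2))  # 1.52
```

Since 1.52 is below the 97.5% cutoff of the t distribution (about 1.99 at 74 df), the two-sided p-value exceeds 0.05.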

Interpreting the p-value

The p-value in the example is 0.133

Interpretation: If there is no difference in heart rate between
admission and discharge from the emergency department, then the
chance of finding a mean difference as extreme or more extreme
than 4.4 in a sample of 75 patients with both measurements is 0.133

Thus, this probability is big (bigger than 0.05) which leads to saying
that the difference of 4.4 is due to chance

Notes

How to decide on significance from the 95% CI?

3 scenarios

[Figure: three 95% confidence intervals for a difference plotted on a
scale from −15 to 15; an interval that excludes 0 indicates a
significant difference, while one that includes 0 does not]

Comparing two means: Independent sample testing

In the previous example, is the systolic blood pressure different


between males and females among the patients admitted to the
emergency department after an MI?
Group Statistics

Systolic blood pressure    Sex      N     Mean    Std. Deviation   Std. Error Mean
                           Male    240   145.05       35.162            2.270
                           Female   44   138.64       35.753            5.390

Is this difference in systolic blood pressure statistically significant?

Thus, we have to make inference.

Comparing two means: Independent sample testing

Null hypothesis:

Ho: Mean SBP (Males) = Mean SBP (Females)

Ho: Mean SBP (Males) − Mean SBP (Females) = 0

Alternative hypothesis:

Ha: Mean SBP (Males) ≠ Mean SBP (Females)

Ha: Mean SBP (Males) − Mean SBP (Females) ≠ 0

Comparing two means: Independent sample testing

Thus, we carry out a test called: independent samples t-test

Formula to use is:

Comparing two means: Independent sample testing

What we need to know is that we can calculate a p-value out of the


t-test (based on the t-distribution)

Based on this p-value, make the decision:

P-value > 0.05, then do not reject the null (the two means are equal)

P-value < 0.05, then reject the null (the two means are different)
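The summary statistics above are enough to reproduce this test (a sketch; scipy's `ttest_ind_from_stats` works directly from means, SDs, and group sizes):

```python
from scipy import stats

# Summary statistics from the SPSS Group Statistics table
res = stats.ttest_ind_from_stats(
    mean1=145.05, std1=35.162, nobs1=240,  # males
    mean2=138.64, std2=35.753, nobs2=44,   # females
    equal_var=True,  # Levene's test gave p = .835, so equal variances assumed
)
```

res.statistic comes out near 1.109 and res.pvalue near 0.269, matching the "equal variances assumed" row of the SPSS output.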

Comparing two means: Independent sample testing


Group Statistics

Systolic blood pressure | N   | Mean   | Std. Deviation | Std. Error Mean
Male                    | 240 | 145.05 | 35.162         | 2.270
Female                  |  44 | 138.64 | 35.753         | 5.390

Independent Samples Test

Levene's Test for Equality of Variances (Systolic blood pressure,
equal variances assumed): F = .044, Sig. = .835

Two formulas for calculation of t-test


1- when variances are equal
2- when variances are not equal

t-test for Equality of Means

                            | t     | df     | Sig. (2-tailed) | Mean Difference | Std. Error Difference | 95% CI Lower | Upper
Equal variances assumed     | 1.109 | 282    | .269            | 6.409           | 5.781                 | -4.970       | 17.789
Equal variances not assumed | 1.096 | 59.267 | .278            | 6.409           | 5.848                 | -5.292       | 18.111

To know which one to use (hypothesis test)


Ho: variance(males) = variance(females)
Ha: variance(males) ≠ variance(females)

1- If p-value > 0.05, then variances are equal

2- If p-value < 0.05, then variances are not equal

Example

T-test

Ho: Mean1 = Mean2

T-test: P-value = 0.89

Ha: Mean1 ≠ Mean2

No significant difference

Chi square

Example

In the MI example, we would like to check if hypertension is


associated with gender.

In other words, are males at higher or lower risk of having


hypertension?
Sex * Hypertension Crosstabulation (Count)

Sex    | Hypertension: No | Yes | Total
Male   | 191              | 52  | 243
Female |  24              | 20  |  44
Total  | 215              | 72  | 287

Example

Sex * Hypertension Crosstabulation

Sex    |              | Hypertension: No | Yes   | Total
Male   | Count        | 191              | 52    | 243
       | % within Sex | 78.6%            | 21.4% | 100.0%
Female | Count        | 24               | 20    | 44
       | % within Sex | 54.5%            | 45.5% | 100.0%
Total  | Count        | 215              | 72    | 287
       | % within Sex | 74.9%            | 25.1% | 100.0%

Example

To answer the question, we do a hypothesis test:

H0: P1 = P2

(P1 - P2 = 0)

Ha: P1 ≠ P2

(P1 − P2 ≠ 0)

(Pearson's) Chi-Square Test (χ²)

Calculation is easy (can be done by hand)

Works well for big sample sizes

Can be extended to compare proportions between more than two


independent groups in one test

The Chi-Square Approximate Method

χ² = Σ over the 4 cells of (O − E)² / E

Looks at discrepancies between observed and expected cell counts

Expected refers to the values for the cell counts that would be
expected if the null hypothesis is true

O = observed

E = expected = (row total × column total) / grand total
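Applying this to the gender-by-hypertension table above (a sketch using scipy; `correction=False` gives the uncorrected Pearson chi-square):

```python
import numpy as np
from scipy import stats

# Observed counts: rows = Male, Female; columns = No hypertension, Yes
observed = np.array([[191, 52],
                     [24, 20]])

chi2, p, dof, expected = stats.chi2_contingency(observed, correction=False)
# expected[i, j] = row total * column total / grand total
```

chi2 comes out near 11.47 with df = 1, and the smallest expected count is 11.04, matching footnote b of the SPSS output shown later.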

The Chi-Square Approximate Method

The distribution of this statistic when the null is true is a
chi-square distribution with one degree of freedom

We can use this to determine how likely it was to get such a big
discrepancy between the observed and expected by chance alone

[Figure: the chi-square distribution with one degree of freedom;
χ² = 3.84 corresponds to p = 0.05]

Example of Calculations of
Chi-Square 2x2 Contingency Table

Test statistic: χ² = Σ over the 4 cells of (O − E)² / E

Chi-Square Tests

                             | Value   | df | Asymp. Sig. (2-sided) | Exact Sig. (2-sided) | Exact Sig. (1-sided)
Pearson Chi-Square           | 11.471b | 1  | .001                  |                      |
Continuity Correction(a)     | 10.227  | 1  | .001                  |                      |
Likelihood Ratio             | 10.366  | 1  | .001                  |                      |
Fisher's Exact Test          |         |    |                       | .001                 | .001
Linear-by-Linear Association | 11.431  | 1  | .001                  |                      |
N of Valid Cases             | 287     |    |                       |                      |

a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 11.04.

χ² = 11.471

[Figure: sampling distribution — chi-square with one degree of freedom;
the observed value 11.471 lies far in the upper tail]

Example

For one degree of freedom, the value that corresponds to 95%, or 5% error, is 3.841.

Thus we reject Ho since 11.471 > 3.841

We conclude that Ho is false and that there is a relationship


between gender and diagnosis with hypertension

The p-value is 0.001

Chi-square

Ho: Proportion1 = Proportion2


Ha: Proportion1 ≠ Proportion2

ChiSquare: P-value = 0.96


No significant difference

Relative Risk (RR):

Study the association between Vioxx use and Myocardial Infarction

Drug    | MI: Yes | No
Vioxx   | 71      | 52
Placebo | 29      | 48

Ho: RR = 1
Ha: RR ≠ 1

RR = 1.5, 95% CI = (1.1 - 1.9) (p-value = 0.01)
Significant association
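The RR and a confidence interval can be reproduced from the 2×2 counts (a sketch; the standard log-transformation method is used here, so the interval differs slightly from the slide's rounded figures):

```python
import math

a, b = 71, 52  # Vioxx:   MI yes / MI no
c, d = 29, 48  # Placebo: MI yes / MI no

risk_exposed = a / (a + b)    # risk of MI on Vioxx
risk_unexposed = c / (c + d)  # risk of MI on placebo
rr = risk_exposed / risk_unexposed

# Approximate 95% CI for RR via the log transformation
se_log_rr = math.sqrt(1/a - 1/(a + b) + 1/c - 1/(c + d))
rr_lower = math.exp(math.log(rr) - 1.96 * se_log_rr)
rr_upper = math.exp(math.log(rr) + 1.96 * se_log_rr)
```

rr comes out near 1.53, and the lower limit stays above 1, so the association is significant, as on the slide.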

Notes

How to decide on significance from the 95% CI?

3 scenarios

[Figure: three 95% confidence intervals plotted against the null value RR = 1]


Chi-square

Ho: Proportion1 = Proportion2


Ha: Proportion1 ≠ Proportion2

ChiSquare: P-value = 0.96


No significant difference

Example

We would like to check if there is an association between gender
and both Hypertension and Diabetes combined.

Sex * Hypertension and Diabetes combined Crosstabulation

Sex    |              | None  | Either HT or DM | Both HT and DM | Total
Male   | Count        | 145   | 67              | 28             | 240
       | % within Sex | 60.4% | 27.9%           | 11.7%          | 100.0%
Female | Count        | 13    | 12              | 19             | 44
       | % within Sex | 29.5% | 27.3%           | 43.2%          | 100.0%
Total  | Count        | 158   | 79              | 47             | 284
       | % within Sex | 55.6% | 27.8%           | 16.5%          | 100.0%

Ho: gender and the combined hypertension/diabetes status are independent.

Ha: the two variables are not independent.
Ho: P1 = P2 = P3
Ha: at least one proportion differs

Example

Conclusion

Sex * Hypertension and Diabetes combined Crosstabulation (as shown above)

Chi-Square Tests

                             | Value   | df | Asymp. Sig. (2-sided)
Pearson Chi-Square           | 28.691a | 2  | .000
Likelihood Ratio             | 24.336  | 2  | .000
Linear-by-Linear Association | 25.341  |    | .000
N of Valid Cases             | 284     |    |

a. 0 cells (.0%) have expected count less than 5. The
minimum expected count is 7.28.

Example

For two degrees of freedom, the value that corresponds to 95%, or 5% error, is 5.991.

Thus we reject Ho since 28.691 > 5.991

We conclude that Ho is false and that there is a relationship


between gender and diagnosis with hypertension and/or diabetes

The p-value is < 0.0001

ANOVA

The problem

We have samples from a number of independent groups.

We have a single numerical or ordinal variable and are interested


in whether the values of the variable vary between the groups.

Example: Does systolic blood pressure vary between men of
different smoking status?

The problem

One-way ANOVA can answer the question by comparing the
group means.

So the null and alternative hypotheses are:


H0: all group means in the population are equal
HA : at least two of the means are not equal

ANOVA is an extension of the comparison of 2 independent groups.

But the 2-group technique (the t-test) cannot simply be reused.

The problem
- If 5 groups are available, then 10 two-group t-tests would have to be performed.
- The high Type I error rate, resulting from the large number of
comparisons, means that we may draw incorrect conclusions.

Assumptions
Analysis of variance requires the following assumptions:

Independent random samples have been taken from each


population.

The populations are normal.

The population variances are all equal.

The ANOVA Table

The ANOVA table summarizes the calculations needed to test the main
hypothesis.

Source | df    | SS         | MS
Factor | k − 1 | SS(factor) | MS(factor) = SS(factor) / (k − 1)
Error  | n − k | SS(error)  | MS(error) = SS(error) / (n − k)
Total  | n − 1 | SS(total)  |

F = MS(factor) / MS(error)

Rationale

One-way ANOVA separates the total variability (SS(total)) in the
data into:

Differences between the individuals from the different groups
(between-group variation), SS(factor)

The random variation between the individuals within each group
(within-group variation), SS(error), also called unexplained

Rationale

These components of variation are measured using variances,


hence the name analysis of variance (ANOVA).

Under the null hypothesis that the group means are the same,
MS(factor) will be similar to MS(error).

The test is based on the ratio of these two variances.

If there are differences between groups, then the between-group
variance will be larger than the within-group variance.

Example

A new variable is created which combines diagnosis with


Hypertension and Diabetes together as follows:

Hypertension and Diabetes combined

        |                 | Frequency | Percent | Valid Percent | Cumulative Percent
Valid   | None            | 159       | 54.6    | 55.6          | 55.6
        | Either HT or DM | 80        | 27.5    | 28.0          | 83.6
        | Both HT and DM  | 47        | 16.2    | 16.4          | 100.0
        | Total           | 286       | 98.3    | 100.0         |
Missing | System          | 5         | 1.7     |               |
Total   |                 | 291       | 100.0   |               |

Example

We would like to check whether the systolic blood pressure is the


same for the three groups defined by their HT and DM status.

Ho: Mean1 = Mean2 = Mean3

Ha: at least two of the means are not equal



Example
Descriptives: Systolic blood pressure

                | N   | Mean   | Std. Deviation | Std. Error | 95% CI for Mean (Lower, Upper) | Minimum | Maximum
None            | 155 | 144.52 | 32.789         | 2.634      | (139.32, 149.73)               | 78      | 248
Either HT or DM | 79  | 142.97 | 39.634         | 4.459      | (134.10, 151.85)               | 56      | 257
Both HT and DM  | 47  | 146.55 | 36.360         | 5.304      | (135.88, 157.23)               | 55      | 235
Total           | 281 | 144.43 | 35.319         | 2.107      | (140.28, 148.57)               | 55      | 257

ANOVA: Systolic blood pressure

               | Sum of Squares | df  | Mean Square | F    | Sig.
Between Groups | 380.517        | 2   | 190.259     | .152 | .859
Within Groups  | 348908.2       | 278 | 1255.066    |      |
Total          | 349288.8       | 280 |             |      |
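The F ratio and its p-value follow directly from the sums of squares and degrees of freedom in this table (a sketch using scipy's F distribution):

```python
from scipy import stats

ss_between, df_between = 380.517, 2    # from the ANOVA table
ss_within, df_within = 348908.2, 278

ms_between = ss_between / df_between   # about 190.259
ms_within = ss_within / df_within      # about 1255.066

F = ms_between / ms_within             # about 0.152
p = stats.f.sf(F, df_between, df_within)  # upper-tail p-value, about 0.859
```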


Conclusion

We conclude that the average systolic blood pressure does not differ
significantly between the three groups (p = 0.859).

Bivariate analyses
DEPENDENT (outcome) \ INDEPENDENT (exposure) | 2 LEVELS             | > 2 LEVELS           | CONTINUOUS
2 LEVELS                                     | χ² (chi square test) | χ² (chi square test) | t-test
> 2 LEVELS                                   | χ² (chi square test) | χ² (chi square test) | ANOVA
CONTINUOUS                                   | t-test               | ANOVA                | Correlation / Linear regression

New scenario

If the dependent and independent variables are both continuous, then
we can't use the t-test, and we cannot use the chi-square.

Regression and Correlation

Describing association between two continuous variables

Scatterplot

Correlation coefficient

Simple linear regression

Correlation

Correlation

It is a measure of linear correlation

Called Pearson correlation coefficient (r)

Ranges between:

+1.0 (perfect positive correlation)

-1.0 (perfect negative correlation)

Scatter plot and correlation

The Correlation Coefficient (r)

Measures the direction and strength of the linear association


between x and y

The correlation coefficient is between -1 and +1

r>0

Positive association

r<0

Negative association

r=0

No association
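A quick sanity check of these properties (a sketch using scipy; the toy data are made up for illustration):

```python
from scipy import stats

x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]    # perfectly linear, increasing
y_down = [10, 8, 6, 4, 2]  # perfectly linear, decreasing

r_up, _ = stats.pearsonr(x, y_up)      # perfect positive correlation: +1
r_down, _ = stats.pearsonr(x, y_down)  # perfect negative correlation: -1
```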

[Figure: example scatter plots with r = 0.01, r = 0.68, r = 0.98, and r = -0.9]

Correlation in the Plasma Example

[Figure: scatter plot of plasma volume (liters, y) against body weight
(kg, x, 55–75); r = 0.76]

Correlation

Study the association between Heart Rate and Systolic Blood Pressure

Ho: Correlation = 0
Ha: Correlation ≠ 0

[Figure: scatter plot of systolic blood pressure (50–250) against
heart rate at admission (40–160)]

Correlation: r = 0.190
P-value = 0.001

Correlations

                        |                     | Systolic blood pressure | Heart Rate at admission
Systolic blood pressure | Pearson Correlation | 1                       | .190**
                        | Sig. (2-tailed)     |                         | .001
                        | N                   | 286                     | 285
Heart Rate at admission | Pearson Correlation | .190**                  | 1
                        | Sig. (2-tailed)     | .001                    |
                        | N                   | 285                     | 286

**. Correlation is significant at the 0.01 level (2-tailed).

Significant correlation
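Given only r and n, the p-value can be reproduced from the t-statistic for a correlation coefficient (a sketch; r = 0.190 and n = 285 are taken from the output above):

```python
import math
from scipy import stats

r, n = 0.190, 285  # from the SPSS Correlations table

# t-statistic for testing H0: population correlation = 0
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)
p = 2 * stats.t.sf(t_stat, n - 2)  # two-sided p-value
```

p comes out near 0.001, matching the reported Sig. (2-tailed).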

Problem

Important to note that correlation measures the strength of linear
association

There could be a strong non-linear relationship between y and x,
and r may not catch it

[Figure: a curved, non-linear relationship for which r = 0]

Correlation Coefficient

Outliers can really affect the correlation coefficient

One extreme point can change r sizably

[Figure: scatter plot where a single outlier yields r = 0.7]

Simple linear regression

Simple linear regression

Used to quantify the association between two variables

It is simple in the sense of having only one independent variable

The association is thought to be linear in nature

Formula: Dependent = β0 + β1 (independent)

The Equation of a Line

β0 and β1 are called regression coefficients

These two quantities are estimated by the least squares method

The intercept β0 is the estimated expected value of y when x is 0

The slope β1 is the estimated expected change in y corresponding
to a unit increase in x

The Slope

The slope β1 is the expected change in y corresponding to a unit
increase in x

β1 = 0: No association between y and x

β1 > 0: Positive association (as x increases, y tends to increase)

β1 < 0: Negative association (as x increases, y tends to decrease)

The Equation of a Line


[Figure: a straight line y = b0 + b1·x, with intercept b0 and slope b1]

The Slope

[Figure: three lines illustrating β1 > 0, β1 = 0, and β1 < 0]

Simple linear regression

Systolic blood pressure and age


Model Summary

Model | R     | R Square | Adjusted R Square | Std. Error of the Estimate
1     | .054a | .003     | -.001             | 35.387

a. Predictors: (Constant), Age

Correlation: R = 0.054

Simple linear regression

Simple linear regression

Coefficients(a)

Model 1    | Unstandardized B | Std. Error | Standardized Beta | t      | Sig.
(Constant) | 136.400          | 8.812      |                   | 15.479 | .000
Age        | .148             | .162       | .054              | .910   | .364

a. Dependent Variable: Systolic blood pressure

Simple linear regression:


SBP = 136.400 + 0.148 (Age)
If age = 0, then SBP = 136.400 + 0 = 136.400
As age increases by 1 year, SBP increases by 0.148 units
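The fitted equation can be used directly for prediction (a sketch in plain Python, with the coefficients from the output above):

```python
def predict_sbp(age):
    """Predicted systolic blood pressure from the fitted regression line."""
    return 136.400 + 0.148 * age

pred_60 = predict_sbp(60)                        # 136.4 + 0.148 * 60 = 145.28
slope_check = predict_sbp(61) - predict_sbp(60)  # 0.148 units per year of age
```

Remember, though, that the slope was not significant (p = 0.364), so these predictions barely differ from the overall mean.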

Simple Linear Regression

How do we decide if there is significant association between age


and SBP?

Hypothesis test
Ho: β1 = 0
Ha: β1 ≠ 0

SBP = β0 + β1 (Age)

If we reject Ho, then as age changes, SBP changes significantly

If Ho is not rejected, then as age changes, there is no significant effect on SBP

Multiple Linear Regression

The important aspect of linear regression is that we can include


more than 1 independent variable

This is to control for the effect of another variable

Study the association between Age and SBP while controlling for
gender

SBP = β0 + β1 (Age) + β2 (Gender)

Multiple Linear Regression


Coefficients(a)

Model 1    | Unstandardized B | Std. Error | Standardized Beta | t      | Sig.
(Constant) | 143.090          | 9.742      |                   | 14.688 | .000
Age        | .216             | .171       | .080              | 1.261  | .208
Sex        | -8.992           | 6.123      | -.093             | -1.469 | .143

a. Dependent Variable: Systolic blood pressure

Multiple linear regression:


SBP = 143.090 + 0.216 (Age) − 8.992 (Gender)
As age increases by 1 year, SBP increases by 0.216 units,
after adjusting for gender
The difference in SBP between males and females is 8.992 units,
after adjusting for age
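Prediction now needs both variables (a sketch in plain Python; the 0/1 gender coding below is a hypothetical assumption, since the actual SPSS coding is not shown on the slide):

```python
def predict_sbp(age, sex):
    """Predicted SBP; sex coding assumed 0 = male, 1 = female (hypothetical)."""
    return 143.090 + 0.216 * age - 8.992 * sex

# Whatever the coding, two groups one unit apart differ by 8.992 units
# of SBP at any fixed age
gender_gap = predict_sbp(60, 0) - predict_sbp(60, 1)
```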

Choosing the right statistical test

Choosing a statistical test

Choosing the right statistical test depends on:

Nature of the data

Sample characteristics

Inferences to be made

Choosing a statistical test

A consideration of the nature of data includes:

Number of variables

not for entire study, but for the specific question at hand

Type of data

numerical, continuous

dichotomous, categorical information

Choosing a statistical test

A consideration of the sample characteristics includes:

Number of groups

Sample type

normal distribution (parametric) or not (non-parametric)

independent or dependent

Choosing a statistical test

A consideration of the inferences to be made includes:

Data represent the population

The group means are different

There is a relationship between variables

Choosing a statistical test

Before choosing a statistical test, ask:

How many variables?

How many groups?

Is the distribution of data normal?

Are the samples (groups) independent?

What is your hypothesis or research question?

Is the data continuous, ordinal, or categorical?

Descriptive analyses

Type of variable        | Measure
Categorical             | Proportion (%)
Continuous (Normal)     | Mean (SD)
Continuous (Not Normal) | Median, inter-quartile range

Different types of statistics

Parametric vs non-parametric analyses

Parametric:

Assume data follows a specific probability distribution

More powerful

Non-parametric:

Also called distribution-free

No distributional assumptions required for the data

But they are robust

Univariate analyses

Type of variable        | Measure
Categorical             | Z (proportions)
Continuous (Normal)     | t-test
Continuous (Not Normal) | n > 30: t-test; n < 30: Kolmogorov-Smirnov test

Bivariate analyses

Type of variable | 2 levels    | > 2 levels  | Continuous
2 levels         | Chi squared | Chi squared | t-test
> 2 levels       | Chi squared | Chi squared | ANOVA
Continuous       | t-test      | ANOVA       | Correlation / Linear regression

Bivariate analyses

Type of variable | 2 levels                       | > 2 levels                     | Continuous
2 levels         | Fisher's test / McNemar's test | Fisher's test                  | Mann-Whitney / Wilcoxon test
> 2 levels       | Fisher's test                  | Fisher's test                  | Kruskal-Wallis / Friedman test
Continuous       | Mann-Whitney / Wilcoxon test   | Kruskal-Wallis / Friedman test | Correlation / Regression

Multivariate analyses

Type of variable         | Measure
Categorical (2 levels)   | Logistic regression
Categorical (> 2 levels) | Multinomial regression
Continuous               | Linear regression

Overview
Goal | Measurement (Gaussian) | Ordinal or Measurement (Non-Gaussian) | Binomial | Survival Time
Describe one group | Mean, SD | Median, interquartile range | Proportion | Kaplan Meier survival curve
Compare two unpaired groups | Unpaired t test | Mann-Whitney test | Fisher's test / Chi-square | Log-rank test or Mantel-Haenszel*
Compare two paired groups | Paired t test | Wilcoxon test | McNemar's test | Conditional proportional hazards regression*
Compare three or more unmatched groups | One-way ANOVA | Kruskal-Wallis test | Chi-square test | Cox regression
Compare three or more matched groups | Repeated-measures ANOVA | Friedman test | Cochran's Q** | Conditional proportional hazards regression*
Quantify association between two variables | Pearson correlation | Spearman correlation | Contingency coefficients** |
Predict value from another measured variable | Simple linear regression | Nonparametric regression** | Simple logistic regression* | Cox regression
Predict value from several measured or binomial variables | Multiple linear regression* | | Multiple logistic regression* | Cox regression

Sample size calculation

Sample size and power calculation

Important step in designing a study

If it is not done, then the sample size might be too high or too low:

If it is too low: the study will lack the precision to provide reliable answers

If it is too high: resources will be wasted for minimal gain

Sample size and power calculation

This step addresses two questions:

How precise will my parameter estimates tend to be if I select a


particular sample size?

How big a sample do I need to attain a desirable level of precision?

Sample size and power calculation: example

A cross-sectional survey of the prevalence of diabetes (diagnosed
or undiagnosed) among Native Americans would require a sample
size of 1421 to allow estimation of the prevalence within a
precision of 0.02 with 90% confidence, assuming a true
prevalence no larger than 30%.

Sample size and power calculation

Should be done at the DESIGN stage, i.e. before data is collected

Drives the whole study

To determine the sample size:

Objectives should be clearly defined

Main exposure and outcome should be specified

Analyses plan should be clarified

Sample size and power calculation

Different equations are used:

Depends on:

Study design

Objectives (prevalence, risk, etc.)

Types of variables

Following is an example of sample size calculation for comparing


the means in two groups

Sample size and power calculation: example

A randomized clinical trial of a new drug treatment vs. placebo
for decreasing blood pressure would require 126 patients for a
two-sided test at α = 0.05 to provide 80% power to detect a 5%
difference in blood pressure.

Sample size calculation: comparing two means

N = 2 × SD² × (zα + zβ)² / Difference²

N = the number of subjects in each group

α = level of significance (Type I error)

1 − β = power

Difference = minimal significant difference
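As a worked example of the formula (a sketch; SD = 10 and Difference = 5 are assumed illustration values, not figures from the slides):

```python
import math
from scipy import stats

sd = 10.0    # assumed standard deviation (illustration only)
diff = 5.0   # assumed minimal significant difference (illustration only)
alpha, power = 0.05, 0.80

z_alpha = stats.norm.ppf(1 - alpha / 2)  # about 1.96 for a two-sided 5% test
z_beta = stats.norm.ppf(power)           # about 0.84 for 80% power

# N = 2 * SD^2 * (z_alpha + z_beta)^2 / Difference^2, rounded up
n_per_group = math.ceil(2 * sd**2 * (z_alpha + z_beta)**2 / diff**2)
```

Under these assumptions, n_per_group comes out to 63 subjects per group.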

Sample size calculation: comparing two means

N = the number of subjects in each group

↑ N → more power, or a smaller detectable difference

↓ N → less power, or a larger detectable difference

Sample size calculation: comparing two means

α = level of significance (Type I error)

↑ α → more power, or a smaller N

↓ α → less power, or a larger N

Sample size calculation: comparing two means

1 − β = power

↑ (1 − β) → less Type II error, or a larger N

↓ (1 − β) → more Type II error, or a smaller N

Sample size calculation: comparing two means

Difference = minimal significant difference

↑ Difference → larger power, or a smaller N

↓ Difference → smaller power, or a larger N

Sample size calculation: comparing two means

N = to be found

α = level of significance (Type I error) = 0.05 or 5%

1 − β = power = 0.80 or 80%

Difference = minimal significant difference

Thank you
