
CHI-SQUARE TEST

by Dr. M. Supriya
Moderator: Dr. B. Aruna, M.D.(H).

Page 1

PLAN OF STUDY
NEED FOR STUDY
INTRODUCTION
APPLICATION
REQUIREMENTS
CHI-SQUARE DISTRIBUTION
CHI-SQUARE TEST
EXAMPLES
CONCLUSION
Page 2

INTRODUCTION
It is a non-parametric test.
It is useful for assessing the
association between discrete
(categorical) data.

Page 3

DEVELOPED BY
KARL
PEARSON

Page 4

APPLICATIONS
Proportion
Association
Goodness of fit

Page 5

REASONS FOR CALLING IT A
DISTRIBUTION-FREE STATISTIC
Rigid assumptions are not necessary in
regard to the type of population distribution.
Calculation of the mean and SD is not needed. It
is based only on the degrees of freedom (DF).
It is simple to understand.
It can also be used for simple ranking of values.
It can be used where the data are not exact.
It can be used with small samples (not more than 50).

Page 6

Chi-Square Test Requirements


Quantitative data.
One or more categories.
Independent observations.
Adequate sample size (at least 10).
Simple random sample.
Data in frequency form.
All observations must be used.
Page 7

Chi-Square Distribution

Page 8

The distribution of the
chi-square statistic is
called the chi-square
distribution.

Page 9

The Chi-Square Statistic


Select a random sample of size n from
a normal population having a standard
deviation equal to σ. We find that the
standard deviation in our sample is
equal to s. Given these data, we can
define a statistic, called chi-square,
using the following equation:
χ² = [ (n - 1) * s² ] / σ²
Page 10

Properties of chi-square
distribution :
The mean of the distribution
is equal to the number of
degrees of freedom: μ = v.
The variance is equal to two
times the number of degrees
of freedom: σ² = 2 * v.
When the degrees of freedom
are greater than or equal to
2, the maximum value for Y
(the height of the density curve)
occurs when χ² = v - 2.
As the degrees of freedom
increase, the chi-square
curve approaches a normal
distribution.
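
The first two properties can be checked empirically. The sketch below is an illustration that is not part of the original slides: it assumes a chi-square value with v degrees of freedom can be simulated as the sum of v squared standard normal draws, and the function name, seed and number of draws are arbitrary choices.

```python
import random
import statistics

def simulate_chi_square(df, draws=20000, seed=1):
    """Simulate chi-square values as sums of df squared standard normals."""
    rng = random.Random(seed)
    return [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))
            for _ in range(draws)]

v = 4  # degrees of freedom chosen for the illustration
samples = simulate_chi_square(v)
sample_mean = statistics.mean(samples)
sample_var = statistics.variance(samples)

# Theory: mean = v and variance = 2 * v.
print(f"mean     ~ {sample_mean:.2f} (theory: {v})")
print(f"variance ~ {sample_var:.2f} (theory: {2 * v})")
```

The simulated values should land close to 4 and 8, with small differences due to sampling noise.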

Page 11

PROBLEM
The Acme Battery Company has developed a
new cell phone battery. On average, the
battery lasts 60 minutes on a single
charge. The standard deviation is 4
minutes. Suppose the manufacturing
department runs a quality control test.
They randomly select 7 batteries. The
standard deviation of the selected batteries
is 6 minutes. What would be the chi-square
statistic represented by this test?
Page 12

SOLUTION
The standard deviation of the population is 4 minutes.
The standard deviation of the sample is 6 minutes.
The number of sample observations is 7.
To compute the chi-square statistic, use the equation below,
where χ² is the chi-square statistic, n is the sample
size, s is the standard deviation of the sample, and
σ is the standard deviation of the population.
χ² = [ (n - 1) * s² ] / σ²
χ² = [ (7 - 1) * 6² ] / 4² = 216 / 16 = 13.5
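
The same arithmetic can be written as a small function. This is a minimal sketch; the function name `chi_square_from_sd` is a hypothetical helper and not something from the slides.

```python
def chi_square_from_sd(n, sample_sd, population_sd):
    """Chi-square statistic (n - 1) * s^2 / sigma^2 for a sample standard deviation."""
    return (n - 1) * sample_sd ** 2 / population_sd ** 2

# Battery problem: n = 7, s = 6 minutes, sigma = 4 minutes.
battery_stat = chi_square_from_sd(n=7, sample_sd=6, population_sd=4)
print(battery_stat)  # 13.5
```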
Page 13

CALCULATION OF CHI-SQUARE:
The test consists of four steps:
(1) state the hypotheses
(2) formulate an analysis plan
(3) analyze sample data
(4) interpret results.

Page 14

State the Hypotheses


Every hypothesis test requires the analyst to state a null
hypothesis and an alternative hypothesis. The
hypotheses are stated in such a way that they are
mutually exclusive. That is, if one is true, the other must
be false; and vice versa.
For a chi-square goodness-of-fit test, the hypotheses take the following form.
H0: The data are consistent with a specified distribution.
Ha: The data are not consistent with a specified
distribution.
Typically, the null hypothesis specifies the proportion of
observations at each level of the categorical variable.
The alternative hypothesis is that at least one of the
specified proportions is not true.
Page 15

"The null hypothesis in a chi-square
goodness-of-fit test states that the
sample of observed
frequencies supports the claim about
the expected frequencies.
The alternative hypothesis states that
there is no support for the
claim pertaining to the expected
frequencies."
This deviates from our normal approach
of placing our expected (preferred)
outcome in the alternative hypothesis.
Just be aware of this.
Page 16

Formulate an Analysis Plan

The analysis plan describes
how to use sample data to
accept or reject the null
hypothesis.

Page 17

COMPONENTS
Significance level.
Often, researchers
choose significance levels equal to 0.01,
0.05, or 0.10; but any value between 0
and 1 can be used.
Test method.
Use the chi-square test to determine whether observed
sample frequencies differ significantly
from expected frequencies specified in
the null hypothesis.

Page 18

Analyze Sample Data

Page 19

DEGREES OF FREEDOM
Degrees of freedom (goodness-of-fit test). The degrees of
freedom (DF) is equal to the number of
levels (k) of the categorical variable minus 1:
DF = k - 1.
Degrees of freedom (test of independence or homogeneity). The degrees of
freedom (DF) is equal to:
DF = (r - 1) * (c - 1)
where r is the number of levels for one
categorical variable, and c is the number of
levels for the other categorical variable.
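
The two degrees-of-freedom rules can be sketched as two helpers. The function names are hypothetical and chosen only for this illustration.

```python
def df_goodness_of_fit(k):
    """DF = k - 1 for one categorical variable with k levels."""
    if k < 2:
        raise ValueError("need at least two levels")
    return k - 1

def df_contingency(r, c):
    """DF = (r - 1) * (c - 1) for an r x c contingency table."""
    if r < 2 or c < 2:
        raise ValueError("need at least two rows and two columns")
    return (r - 1) * (c - 1)

print(df_goodness_of_fit(3))  # 2
print(df_contingency(2, 3))   # 2
```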
Page 20

Expected frequency counts


The expected frequency counts at each
level of the categorical variable are equal to
the sample size times the hypothesized
proportion from the null hypothesis:
Ei = n * pi
where Ei is the expected frequency count for
the ith level of the categorical variable, n is
the total sample size, and pi is the
hypothesized proportion of observations in
level i.
Page 21

Test statistic
The test statistic is a chi-square
random variable (χ²) defined by the
following equation.
χ² = Σ [ (Oi - Ei)² / Ei ]
where Oi is the observed frequency
count for the ith level of the categorical
variable, and Ei is the expected
frequency count for the ith level of the
categorical variable.
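
The expected counts Ei = n * pi and the sum over levels can be sketched in a few lines. The die-roll counts below are hypothetical data invented for the illustration, and the function names are assumptions, not part of the slides.

```python
def expected_counts(n, proportions):
    """Expected count for each level: E_i = n * p_i."""
    return [n * p for p in proportions]

def chi_square_statistic(observed, expected):
    """Sum of (O_i - E_i)^2 / E_i over all levels."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical data (not from the slides): 60 rolls of a die assumed fair.
observed = [8, 12, 9, 11, 10, 10]
expected = expected_counts(sum(observed), [1 / 6] * 6)
stat = chi_square_statistic(observed, expected)
print(f"chi-square = {stat:.2f}, DF = {len(observed) - 1}")
```

Each expected count is 10, so the statistic is (4 + 4 + 1 + 1 + 0 + 0) / 10 = 1.00 with DF = 5.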
Page 22

P-value
The P-value is the probability of
observing a sample statistic at least as
extreme as the test statistic. Since
the test statistic is a chi-square, use
a chi-square distribution
calculator (or a chi-square table) to assess the probability
associated with the test statistic. Use
the degrees of freedom computed
above.
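
Where no calculator is at hand, the upper-tail probability can be computed directly. The sketch below is one possible approach using only the Python standard library, not the calculator the slides refer to: it starts from the closed forms for 1 and 2 degrees of freedom and applies the recurrence Q(x; v + 2) = Q(x; v) + (x/2)^(v/2) * e^(-x/2) / Γ(v/2 + 1). The function name `chi2_sf` is an assumption.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(chi-square with df degrees of freedom >= x)."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        q, nu = math.exp(-half), 2            # closed form for DF = 2
    else:
        q, nu = math.erfc(math.sqrt(half)), 1  # closed form for DF = 1
    while nu < df:
        # Step from nu to nu + 2 degrees of freedom.
        q += half ** (nu / 2.0) * math.exp(-half) / math.gamma(nu / 2.0 + 1.0)
        nu += 2
    return q

# Each of these is the tabulated 0.05 critical value, so P is close to 0.05.
for x, df in [(3.841, 1), (5.991, 2), (7.815, 3), (9.488, 4)]:
    print(f"DF = {df}, chi-square = {x}: P = {chi2_sf(x, df):.3f}")
```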
Page 23

Critical values of chi-square

   | PROBABILITY LEVEL (alpha)
DF | 0.5      0.10     0.05     0.02     0.01     0.001
--------------------------------------------------------
1  | 0.455    2.706    3.841    5.412    6.635    10.827
2  | 1.386    4.605    5.991    7.824    9.210    13.815
3  | 2.366    6.251    7.815    9.837    11.345   16.268
4  | 3.357    7.779    9.488    11.668   13.277   18.465
5  | 4.351    9.236    11.070   13.388   15.086   20.517

Page 24

Interpret Results

Reject Ho if
calculated χ² >
critical χ²
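
The decision rule amounts to a table lookup followed by a comparison. The sketch below stores the standard critical values for 1 to 5 degrees of freedom; the names `CRITICAL_VALUES` and `reject_null` and the statistic 4.2 are hypothetical, used only for illustration.

```python
# Critical chi-square values indexed by DF, then by probability level (alpha).
CRITICAL_VALUES = {
    1: {0.5: 0.455, 0.10: 2.706, 0.05: 3.841, 0.02: 5.412, 0.01: 6.635, 0.001: 10.827},
    2: {0.5: 1.386, 0.10: 4.605, 0.05: 5.991, 0.02: 7.824, 0.01: 9.210, 0.001: 13.815},
    3: {0.5: 2.366, 0.10: 6.251, 0.05: 7.815, 0.02: 9.837, 0.01: 11.345, 0.001: 16.268},
    4: {0.5: 3.357, 0.10: 7.779, 0.05: 9.488, 0.02: 11.668, 0.01: 13.277, 0.001: 18.465},
    5: {0.5: 4.351, 0.10: 9.236, 0.05: 11.070, 0.02: 13.388, 0.01: 15.086, 0.001: 20.517},
}

def reject_null(statistic, df, alpha=0.05):
    """Return True when the calculated chi-square exceeds the critical value."""
    return statistic > CRITICAL_VALUES[df][alpha]

# Hypothetical statistic of 4.2: significant with DF = 1, not with DF = 2.
print(reject_null(4.2, df=1))  # True, since 4.2 > 3.841
print(reject_null(4.2, df=2))  # False, since 4.2 < 5.991
```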

Page 25

Types of Data:

Page 26

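2 x 2 Contingency Table

           | Variable 2
           | Data type 1   Data type 2   Totals
------------------------------------------------
Category 1 | a             b             a+b
Category 2 | c             d             c+d
------------------------------------------------
Total      | a+c           b+d           a+b+c+d = N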

Page 27

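FORMULAE
For a 2 x 2 table with cells a, b, c, d and total N:
χ² = N * (ad - bc)² / [ (a+b) * (c+d) * (b+d) * (a+c) ]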

Page 28

EXAMPLE
Suppose you ran a drug trial on a group of animals and
hypothesized that the animals receiving the
drug would show increased heart rates
compared to those that did not receive the
drug.
Ho: The proportion of animals whose heart
rate increased is independent of drug
treatment.
Ha: The proportion of animals whose heart
rate increased is associated with drug
treatment.
Page 29

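Hypothetical drug trial results.

            | Heart Rate   No Heart Rate
            | Increased    Increase        Total
-------------------------------------------------
Treated     | 36           14              50
Not treated | 30           25              55
-------------------------------------------------
Total       | 66           39              105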

Page 30

FORMULAE

Page 31

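Chi square = 105 * [ (36)(25) - (14)(30) ]² / [ (50)(55)(39)(66) ]
= 105 * (480)² / 7078500 =
3.418
With DF = (2 - 1) * (2 - 1) = 1, the critical value at the 0.05
level is 3.841. Since 3.418 is less than 3.841, Ho is not rejected.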
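
The 2 x 2 calculation can be reproduced with a short function. This is a minimal sketch of the shortcut formula N(ad - bc)² / [(a+b)(c+d)(a+c)(b+d)] with no continuity correction; the function name is a hypothetical choice.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square for a 2 x 2 table [[a, b], [c, d]], without continuity correction."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator

# Drug trial: treated 36 / 14, not treated 30 / 25.
drug_stat = chi_square_2x2(36, 14, 30, 25)
print(f"{drug_stat:.3f}")  # 3.418
```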

Page 32

Chi Square-Goodness of Fit


This test allows us to compare a
collection of categorical data with
some theoretical expected
distribution.
This test is often used in genetics to
compare the results of a cross with
the theoretical distribution based on
genetic theory.
Page 33

problem
Acme Toy Company prints baseball cards.
The company claims that 30% of the
cards are rookies, 60% veterans, and
10% are All-Stars. The cards are sold in
packages of 100.
Suppose a randomly-selected package of
cards has 50 rookies, 45 veterans, and 5
All-Stars. Is this consistent with Acme's
claim? Use a 0.05 level of significance.
Page 34

State the hypotheses


Null hypothesis: The proportion of rookies,
veterans, and All-Stars is 30%, 60% and 10%,
respectively.

Alternative hypothesis: At least one of the
proportions in the null hypothesis is false.

Page 35

Formulate an analysis plan

the significance level is 0.05

Page 36

Analyze Sample Data

Page 37

DF = k - 1 = 3 - 1 = 2
(Ei) = n * pi
(E1) = 100 * 0.30 = 30
(E2) = 100 * 0.60 = 60
(E3) = 100 * 0.10 = 10

Page 38

χ² = Σ [ (Oi - Ei)² / Ei ]

χ² = [ (50 - 30)² / 30 ] + [ (45 - 60)² / 60 ] + [ (5 - 10)² / 10 ]

χ² = (400 / 30) + (225 / 60) + (25 / 10) =
13.33 + 3.75 + 2.50
= 19.58

Page 39

Interpret results
With DF = 2 and a significance level of 0.05, the critical
value from the chi-square table is 5.991. The calculated
value 19.58 is greater than 5.991, so the null hypothesis is
rejected: the package is not consistent with Acme's claim.

Page 40

Results of a monohybrid cross between two
heterozygotes for the 'a' gene.

       |               Totals
------------------------------
       | 10     42     52
       | 33     15     48
------------------------------
Totals | 43     57     100

Page 41

The observed phenotypic ratio is 85 of the A-type
and 15 of the a-type (homozygous
recessive). In a monohybrid cross
between two heterozygotes,
however, we would have predicted
a 3:1 ratio of phenotypes.
In other words, we would
have expected to get 75 A-type
and 25 a-type. Are our results
different?
Page 42

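       | Observed   Expected   |O - E|   (O - E)²   (O - E)²/E
----------------------------------------------------------------
A-type | 85         75         10        100        1.33
a-type | 15         25         10        100        4.0
----------------------------------------------------------------
Total  | 100        100                             5.33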

Page 43

Interpret results
With DF = 2 - 1 = 1 and a significance level of 0.05, the
critical value from the chi-square table is 3.841. The
calculated value 5.33 is greater than 3.841, so the observed
counts differ significantly from the expected 3:1 ratio.

Page 44

Chi Square Test of
Independence
applied when you have
two categorical
variables from a single
population.
Page 45

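              | Category I   Category II   Category III   Row Totals
---------------------------------------------------------------------
Sample A      | a            b             c              a+b+c
Sample B      | d            e             f              d+e+f
Sample C      | g            h             i              g+h+i
---------------------------------------------------------------------
Column Totals | a+d+g        b+e+h         c+f+i          a+b+c+d+e+f+g+h+i = N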

Page 46

PROBLEM
A public opinion poll surveyed a
simple random sample of 1000
voters. Respondents were
classified by gender (male or
female) and by voting preference
(Republican, Democrat, or
Independent).
Page 47

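Voting Preferences

             | Republican   Democrat   Independent   Row total
---------------------------------------------------------------
Male         | 200          150        50            400
Female       | 250          300        50            600
---------------------------------------------------------------
Column total | 450          450        100           1000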

Page 48

DF = (r - 1) * (c - 1) = (2 - 1) * (3 - 1) = 2

Page 49

Er,c = (nr * nc) / n
E1,1 = (400 * 450) / 1000 = 180000/1000 = 180
E1,2 = (400 * 450) / 1000 = 180000/1000 = 180
E1,3 = (400 * 100) / 1000 = 40000/1000 = 40
E2,1 = (600 * 450) / 1000 = 270000/1000 = 270
E2,2 = (600 * 450) / 1000 = 270000/1000 = 270
E2,3 = (600 * 100) / 1000 = 60000/1000 = 60

Page 50

χ² = Σ [ (Or,c - Er,c)² / Er,c ]
χ² = (200 - 180)²/180 + (150 - 180)²/180 + (50 - 40)²/40
+ (250 - 270)²/270 + (300 - 270)²/270 + (50 - 60)²/60
χ² = 400/180 + 900/180 + 100/40 + 400/270 + 900/270
+ 100/60
χ² = 2.22 + 5.00 + 2.50 + 1.48 + 3.33 + 1.67 =
16.2
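
For any r x c table, the expected counts and the statistic follow from the row and column totals. The sketch below is one way to organize the calculation; the function name and its return shape are assumptions made for this illustration.

```python
def chi_square_contingency(table):
    """Return (statistic, df, expected) for an r x c table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # E[r][c] = (row total * column total) / grand total
    expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]
    statistic = sum((o - e) ** 2 / e
                    for obs_row, exp_row in zip(table, expected)
                    for o, e in zip(obs_row, exp_row))
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, expected

# Voting preferences: rows are Male / Female,
# columns are Republican / Democrat / Independent.
voting = [[200, 150, 50],
          [250, 300, 50]]
vote_stat, vote_df, vote_expected = chi_square_contingency(voting)
print(f"chi-square = {vote_stat:.1f}, DF = {vote_df}")  # chi-square = 16.2, DF = 2
```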
Page 51

Interpret results
With DF = 2 and a significance level of 0.05, the critical
value from the chi-square table is 5.991. The calculated
value 16.2 is greater than 5.991, so the null hypothesis that
voting preference is independent of gender is rejected.

Page 52

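          | Asia   Africa   South America   Totals
---------------------------------------------------
Malaria A | 31     14       45              90
Malaria B | 2      5        53              60
Malaria C | 53     45       2               100
---------------------------------------------------
Totals    | 86     64       100             250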

Page 53

Observed   Expected   |O - E|   (O - E)²   (O - E)²/E
-------------------------------------------------------
31         30.96      0.04      0.0016     0.0000516
14         23.04      9.04      81.72      3.546
45         36.00      9.00      81.00      2.25
2          20.64      18.64     347.45     16.83
5          15.36      10.36     107.33     6.99
53         24.00      29.00     841.00     35.04
53         34.40      18.60     345.96     10.06
45         25.60      19.40     376.36     14.70
2          40.00      38.00     1444.00    36.10
Page 54

Interpret results
The (O - E)²/E values sum to χ² = 125.516, with
DF = (3 - 1) * (3 - 1) = 4. At a significance level of 0.05,
the critical value from the chi-square table is 9.488. The
calculated value is far greater than 9.488, so the null
hypothesis of no association between malaria type and
region is rejected.

Page 55

Chi-Square Test of
Homogeneity
applied to a
single categorical
variable from two different
populations.

Page 56

problem
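Viewing Preferences

             | Lone Ranger   Sesame Street   The Simpsons   Row total
----------------------------------------------------------------------
Boys         | 50            30              20             100
Girls        | 50            80              70             200
----------------------------------------------------------------------
Column total | 100           110             90             300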

Page 57

Null hypothesis: The null hypothesis states that
the proportion of boys who prefer the Lone
Ranger is identical to the proportion of girls.
Similarly, for the other programs. Thus,
H0: P(boys who prefer Lone Ranger) = P(girls who prefer Lone Ranger)
H0: P(boys who prefer Sesame Street) = P(girls who prefer Sesame Street)
H0: P(boys who prefer The Simpsons) = P(girls who prefer The Simpsons)

Alternative hypothesis: At least one of the null
hypothesis statements is false.

Page 58

DF = (r - 1) * (c - 1) = (2 - 1) * (3 - 1) = 2
Er,c = (nr * nc) / n
E1,1 = (100 * 100) / 300 = 10000/300 = 33.3
E1,2 = (100 * 110) / 300 = 11000/300 = 36.7
E1,3 = (100 * 90) / 300 = 9000/300 = 30.0
E2,1 = (200 * 100) / 300 = 20000/300 = 66.7
E2,2 = (200 * 110) / 300 = 22000/300 = 73.3
E2,3 = (200 * 90) / 300 = 18000/300 = 60.0
χ² = Σ [ (Or,c - Er,c)² / Er,c ]
χ² = (50 - 33.3)²/33.3 + (30 - 36.7)²/36.7 + (20 - 30)²/30
+ (50 - 66.7)²/66.7 + (80 - 73.3)²/73.3 + (70 - 60)²/60
χ² = (16.7)²/33.3 + (-6.7)²/36.7 + (-10.0)²/30 + (-16.7)²/66.7 +
(6.7)²/73.3 + (10)²/60
χ² = 8.38 + 1.22 + 3.33 + 4.18 + 0.61 + 1.67 =
19.39
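
The hand calculation rounds each expected count to one decimal place, which shifts the result slightly. The sketch below is an illustration, with hypothetical function and parameter names, that computes the statistic both ways: with unrounded expected counts it comes to about 19.32, and with expected counts rounded to one decimal it reproduces 19.39. Either value leads to the same decision at DF = 2.

```python
def chi_square_table(table, round_expected_to=None):
    """Chi-square for an r x c table; optionally round expected counts first."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for r, obs_row in enumerate(table):
        for c, observed in enumerate(obs_row):
            expected = row_totals[r] * col_totals[c] / n
            if round_expected_to is not None:
                expected = round(expected, round_expected_to)
            statistic += (observed - expected) ** 2 / expected
    return statistic

# Viewing preferences: rows are Boys / Girls,
# columns are Lone Ranger / Sesame Street / The Simpsons.
viewing = [[50, 30, 20],
           [50, 80, 70]]
exact_stat = chi_square_table(viewing)
rounded_stat = chi_square_table(viewing, round_expected_to=1)
print(f"unrounded expected counts: {exact_stat:.2f}")
print(f"expected counts rounded to 1 decimal: {rounded_stat:.2f}")
```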

Page 59

         | Heads   Tails   Total
---------------------------------
Observed | 108     92      200
Expected | 100     100     200
---------------------------------
Total    | 208     192     400

Page 60

Chi-squared = (100 - 108)²/100 + (100 - 92)²/100 =
(-8)²/100 + (8)²/100 =
0.64 + 0.64 =
1.28

Page 61

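(Observed counts)

                      | Colours
                      | Red    Yellow   Green   Blue   Totals
--------------------------------------------------------------
Introvert personality | 20     6        30      44     100
Extrovert personality | 180    34       50      36     300
--------------------------------------------------------------
Totals                | 200    40       80      80     400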

Page 62

H0: Colour preference is not associated
with personality, and
H1: Colour preference is associated with
personality

Page 63

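(Expected counts)

                      | Colours
                      | Red    Yellow   Green   Blue   Totals
--------------------------------------------------------------
Introvert personality | 50     10       20      20     100
Extrovert personality | 150    30       60      60     300
--------------------------------------------------------------
Totals                | 200    40       80      80     400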

Page 64

The chi-squared test statistic is 71.20

Page 65

        | School Area
Goals   | Rural   Suburban   Urban   Total
-------------------------------------------
Grades  | 57      87         24      168
Popular | 50      42         6       98
Sports  | 42      22         5       69
-------------------------------------------
Total   | 149     151        35      335

Page 66

[Figure: bar plots comparing the percentages of students' choices by school area]

Page 67

H0 assumes that there is no
association between the
variables (in other words, one
variable does not vary
according to the other
variable), while the alternative
hypothesis Ha claims that
some association does exist.
Page 68

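Expected counts are printed below observed counts

      | Rural   Suburban   Urban   Total
-----------------------------------------
1     | 57      87         24      168
      | 74.72   75.73      17.55
2     | 50      42         6       98
      | 43.59   44.17      10.24
3     | 42      22         5       69
      | 30.69   31.10      7.21
-----------------------------------------
Total | 149     151        35      335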

Page 69

Chi-Sq = 4.203 + 1.679 + 2.369 +
0.943 + 0.107 + 1.755 +
4.168 + 2.663 + 0.677 = 18.564
DF = 4, P-Value = 0.001
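
The printed output can be reproduced from the observed counts alone. The sketch below is an illustration with hypothetical function names: it recomputes the statistic and then obtains the P-value from the closed-form upper tail of the chi-square distribution that holds when DF is even, P = e^(-x/2) * Σ (x/2)^k / k! for k = 0 to DF/2 - 1.

```python
import math

def chi_square_rxc(table):
    """Return (statistic, df) for an r x c table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = sum((o - rt * ct / n) ** 2 / (rt * ct / n)
                    for row, rt in zip(table, row_totals)
                    for o, ct in zip(row, col_totals))
    return statistic, (len(row_totals) - 1) * (len(col_totals) - 1)

def p_value_even_df(x, df):
    """Upper-tail chi-square probability; this closed form is valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k)
                                 for k in range(df // 2))

# Goals by school area: rows are Grades / Popular / Sports,
# columns are Rural / Suburban / Urban.
goals = [[57, 87, 24],
         [50, 42, 6],
         [42, 22, 5]]
goal_stat, goal_df = chi_square_rxc(goals)
goal_p = p_value_even_df(goal_stat, goal_df)
print(f"Chi-Sq = {goal_stat:.3f}, DF = {goal_df}, P-Value = {goal_p:.3f}")
```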
Page 70

Applications of the χ² Statistic in
Epidemiology

Cohort study (2 samples)
Case-control study (2 samples)
Matched case-control study
(paired cases and controls)

Page 71

The chi-squared statistic provides a
test of the association between two
or more groups, populations, or
criteria.
The chi-square test can be used to
test whether exposure and disease
are associated in a
cohort study, an unmatched case-control study, or a cross-sectional
study.
Page 72

CONCLUSION
The chi-square test of significance is
useful as a tool to determine whether
or not it is worth the researcher's
effort to interpret a contingency
table.
A significant result of this test
means that the cells of a contingency
table should be interpreted.
Page 73

A non-significant test means that no
effects were discovered and chance
could explain the observed
differences in the cells. In this case,
an interpretation of the cell
frequencies is not useful.

Page 74


Page 75
