
Categorical Data Analysis

Suprihatin
Department of Agro-Industrial Technology
Faculty of Agricultural Technology
Bogor Agricultural University
Odd Semester 2017/2018

Categorical Data Analysis 1


Objectives:
Students are able to:
• Define categorical data
• Understand the principles of the contingency table
• Apply the chi-square (χ²) test for goodness of fit
• Apply the chi-square (χ²) test of independence between two categorical variables (association test)

Non-Parametric Statistics
• Distribution-free methods: they do not assume a particular form for the population distribution
• Used for small samples (n < 30) when the population is not normally distributed
• If both parametric and non-parametric tests are applicable, the parametric test should be used because it is more efficient



Categorical Data
Categorical data consists of:
• Nominal scale
• Ordinal scale



Nominal Data:
• The numbers serve only as substitutes for names or designations; they are just symbols.
• Also called a classification (categorical) scale: the numbers are used only to classify an object by its nature or type.
• The categories have no order. The nominal scale only distinguishes one category from another.
• For example: men = 1 and women = 2; yes = 1 and no = 0
• Suitable statistics for this type of data are descriptive statistics, for example frequency of occurrence, percentage, mode, and proportion.
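As a small illustration of the descriptive statistics listed above, the sketch below computes frequencies, percentages, and the mode for nominal data coded as on the slide (men = 1, women = 2). The eight coded observations are hypothetical and serve only to show the calculation.

```python
from collections import Counter

# Hypothetical nominal data, coded as on the slide: 1 = men, 2 = women.
codes = [1, 2, 2, 1, 2, 2, 1, 2]
labels = {1: "men", 2: "women"}

counts = Counter(codes)                            # frequency of each category
n = len(codes)
proportions = {c: counts[c] / n for c in counts}   # share of each category
mode_code = counts.most_common(1)[0][0]            # most frequent category

for code in sorted(counts):
    print(f"{labels[code]}: n = {counts[code]}, {100 * proportions[code]:.1f}%")
print("mode:", labels[mode_code])
```

Note that a mean of the codes would be meaningless here, because the numbers are only labels.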



Ordinal Data:
• Observations are classified into categories, and the categories have an order.
– Nature of the data: distinguishing and ordering
– The numbers replace names and also indicate rank.
• It is also called the ranking scale.
• For example, a person is asked to rank three products by level of preference: rank 1, 2, and 3; or good = 3, medium = 2, and bad = 1.
• When the categories are expressed as numbers, the distances between successive values are not necessarily equal and have no quantitative meaning.
• The types of statistics that are suitable for processing this data
are descriptive statistics, or non-parametric statistics.



Contingency table
• Data are often presented in a table to facilitate analysis.
• A table generally consists of rows and columns that describe the variables and their frequencies.
• Categorical data on several variables are presented in a contingency table, in which the rows and columns correspond to the categories of the variables.
• A contingency table can present two or more variables, each with multiple categories.



Chi-square (χ²) Test for Goodness of Fit
• This method was developed by Pearson in 1900 and is also called the Pearson test.
• The null hypothesis specifies the expected pattern of frequencies across the categories.
• The test compares the observed frequencies with the theoretical (expected) frequencies and measures how far the observed frequencies deviate from the expected ones.
• If the value of χ² is small, the two sets of frequencies are very close, which leads to acceptance of the null hypothesis (Ho).

χ² Test
• Used to test a descriptive hypothesis when there are k > 2 categories and the data are in nominal form.
• Tests whether the observed frequencies are close to the expected (theoretical) frequencies.
• Ho: the proportion of objects falling in each category equals the proportion expected in the population.



Ho is tested using:

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$

where
$O_i$ = number of cases observed in the i-th category
$E_i$ = expected number of cases in the i-th category under Ho
$\sum_{i=1}^{k}$ = sum over all k categories



• If the observed and expected frequencies do not differ much, the squared differences (Oi − Ei)² are small and consequently χ² is small.
• The greater the differences (Oi − Ei)², the greater χ², and the more likely it is that the observed frequencies do not come from the population expected under Ho.
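The behaviour described above can be checked numerically. The following is a minimal sketch of the goodness-of-fit statistic; the function name `chi_square_gof` and the two sets of counts are hypothetical and chosen only to contrast a close fit with a poor fit.

```python
def chi_square_gof(observed, expected):
    """Pearson goodness-of-fit statistic: sum of (O_i - E_i)^2 / E_i."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts for k = 3 categories, 20 expected in each.
expected = [20, 20, 20]
close_fit = chi_square_gof([19, 21, 20], expected)   # observed close to expected
poor_fit = chi_square_gof([10, 35, 15], expected)    # observed far from expected

print(f"close fit: {close_fit:.2f}, poor fit: {poor_fit:.2f}")
```

The statistic grows with the squared deviations, so the second set of counts gives a much larger χ² than the first.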



Example:
A rice mill distributor divides its market into four regions (A, B, C, and D). It is claimed that sales of rice mills are evenly distributed across the regions. To check this claim, a sample of 40 sold rice mills was collected: 6, 12, 14, and 8 units were sold in regions A, B, C, and D, respectively. Use a 5 percent significance level to test the hypothesis that rice mill sales are evenly distributed across the four regions.



Problem statement: the researcher wants to test whether the rice mill market is evenly distributed.
1. Hypotheses
Ho: rice mill sales are evenly distributed across the four regions
H1: rice mill sales are not evenly distributed across the four regions

2. Critical value
The number of categories (k) is 4 (A, B, C, and D) → df = k − 1 = 4 − 1 = 3
The significance level is 0.05 (5%)

→ Critical region: χ² > χ²(0.05; 3) = 7.81



3. Calculation

                      Region
                      A     B     C     D     Total
Observed (O)          6    12    14     8      40
Expected (E)         10    10    10    10      40

χ² is calculated from:

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} = \frac{(6-10)^2}{10} + \frac{(12-10)^2}{10} + \frac{(14-10)^2}{10} + \frac{(8-10)^2}{10} = 4.0$$



4. Conclusion

χ²-calculated = 4.0 < 7.81 → Accept Ho: the distribution of sold rice mills across the four regions is even (not significantly different from an even distribution).
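The four steps of the rice mill example can be reproduced in a few lines. This is a sketch using only the counts and the table value 7.81 given in the example; the variable names are illustrative.

```python
# Rice mills sold per region, from the example.
observed = {"A": 6, "B": 12, "C": 14, "D": 8}

n = sum(observed.values())     # 40 sampled rice mills
k = len(observed)              # 4 regions
expected = n / k               # 10 per region under Ho (even distribution)

chi2 = sum((o - expected) ** 2 / expected for o in observed.values())
df = k - 1
critical = 7.81                # chi-square table value for alpha = 0.05, df = 3

reject_h0 = chi2 > critical
print(f"chi2 = {chi2:.2f}, df = {df}, reject Ho: {reject_h0}")
```

Because 4.0 does not exceed 7.81, the sketch reaches the same decision as the slide: Ho is not rejected.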



Example:

A study aimed to find out whether teaching system A and teaching system B are equally favored by students. For this purpose, a random sample of 300 students was studied. In the sample, 200 students chose teaching system A and 100 students chose teaching system B. Perform a statistical test to determine whether the difference is significant (use α = 0.05).



Answer:
1. Ho: there is no difference in student preference between teaching systems A and B (the expected frequencies of A and B are equal).
2. H1: there is a difference in student preference between teaching systems A and B (the expected frequencies of A and B are not equal).
3. α = 0.05
4. Critical region: Ho is rejected if χ²-calculated is greater than χ²-table at the specified significance level. From Table A.5 with df = 2 − 1 = 1 and α = 0.05 → χ²-table = 3.84.



5. Calculation

Alternative    Observed frequency (Oi)    Expected frequency (Ei)
System A               200                        150
System B               100                        150
Total                  300                        300

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} = \frac{(200-150)^2}{150} + \frac{(100-150)^2}{150} = 33.3$$

6. Conclusion: χ²-calculated (= 33.3) > χ²-table (= 3.84), so Ho is rejected, and it is concluded that the students tend to choose teaching system A.



Example:
A food industry wants to know the most
preferred color of product packaging. For this
purpose, a randomized sample of 3000 people
was selected and asked their preference. The
survey results show that 1000 people prefer the
light blue color, 900 people prefer the heart red
color, 600 people like white, and 500 people like
the brown color.



Answer:
1. Ho: There is no difference in preference among the four packaging colors
2. H1: There is a difference in preference among the four packaging colors
3. α = 0.05
4. Rejection criterion (critical region): Ho is rejected if χ²-calculated is greater than χ²-table at the specified significance level. From Table A.5 with df = 4 − 1 = 3 and α = 0.05 → χ²-table = 7.81



5. Calculation:

Alternative    Observed frequency (Oi)    Expected frequency (Ei)
Light blue            1000                        750
Heart red              900                        750
White                  600                        750
Brown                  500                        750
Total                 3000                       3000

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} = \frac{(1000-750)^2}{750} + \frac{(900-750)^2}{750} + \frac{(600-750)^2}{750} + \frac{(500-750)^2}{750} = 226.7$$

6. Conclusion: χ²-calculated (= 226.7) > χ²-table (= 7.81), so Ho is rejected, and it is concluded that consumers have different preferences for the packaging colors, with light blue being the favorite color.



Example:

• 144 randomly selected panelists were asked to choose their most preferred product among 8 available product types. Of the 144 panelists, 30 chose product 1, 19 chose product 2, 18 chose product 3, and so on (see the following table). Use a χ² test to determine whether the number of panelists choosing each product differs among the eight products.



Product type                                        1    2    3    4    5    6    7    8
Number of panelists who prefer the product (Oi)    30   19   18   24   17    8   17   11



Answer:
• χ²-calculated = 18.44 (expected frequency Ei = 144/8 = 18 for each product)
• χ²-table = 14.067 (df = 8 − 1 = 7, α = 0.05)
• Conclusion: Reject Ho; there is a significant difference in product preference
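The slide reports only the final value, so the intermediate steps are sketched here. The code assumes, as Ho implies, that each of the 8 products is expected to receive 144/8 = 18 votes; the per-product contributions are printed to show which products drive the result.

```python
# Panelists per product type, from the table.
observed = [30, 19, 18, 24, 17, 8, 17, 11]

n = sum(observed)            # 144 panelists
k = len(observed)            # 8 product types
expected = n / k             # 18 per product if all are equally preferred

# Contribution of each product to the statistic: (O - E)^2 / E
contributions = [(o - expected) ** 2 / expected for o in observed]
chi2 = sum(contributions)
df = k - 1
critical = 14.067            # table value for alpha = 0.05, df = 7 (from the slide)

for product, (o, c) in enumerate(zip(observed, contributions), start=1):
    print(f"product {product}: O = {o:2d}, contribution = {c:.2f}")
print(f"chi2 = {chi2:.2f}, df = {df}, reject Ho: {chi2 > critical}")
```

Products 1 and 6 deviate most from 18 and account for most of the statistic.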



Friedman Test
• Also known as the Friedman two-way analysis of variance by ranks
• It can be used to test a comparative hypothesis for k paired (related) samples when the data are ordinal (ranks).
• Interval or ratio data must first be converted into ordinal form.
• For example, a measurement yields the values 4, 7, 9, and 6. These are interval data. Converted into ordinal form (ranks), they become 1, 3, 4, and 2.
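The conversion from interval values to ranks can be sketched as follows. The helper name `to_ranks` is hypothetical, and giving tied values their average rank is an assumption added here (the slide's example contains no ties).

```python
def to_ranks(values):
    """Rank values from smallest (rank 1) to largest.

    Tied values receive the average of the ranks they occupy
    (an assumption; the slide example has no ties).
    """
    ranks = []
    for v in values:
        smaller = sum(1 for u in values if u < v)   # values strictly below v
        equal = sum(1 for u in values if u == v)    # v itself plus any ties
        ranks.append(smaller + (equal + 1) / 2)
    return ranks

print(to_ranks([4, 7, 9, 6]))   # measurements from the slide
```

For the slide's values 4, 7, 9, and 6 this gives the ranks 1, 3, 4, and 2.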



Formula to be used:

$$\chi^2 = \frac{12}{Nk(k+1)} \sum_{j=1}^{k} R_j^2 - 3N(k+1)$$

• Where N = number of groups, k = number of categories, and Rj = sum of ranks in the j-th category.
• If χ²-calculated ≥ χ²-table (Table A.5), Ho is rejected and H1 is accepted.



Example
The research aims to determine the level of consumer preference for three types of products (A, B, and C). Preference is measured with an instrument consisting of 20 criteria. Each criterion is scored 1, 2, 3, or 4, meaning dislike very much, dislike, like, and like very much. Each product can therefore receive a total score between 1 × 20 = 20 and 4 × 20 = 80 from a panelist. Fifteen randomly selected panelists were used. The panelists' scores for the three product types are presented in the following table.



Scores by product type
Panelist/Group A B C

1 76 70 75
2 71 65 77
3 56 57 74
4 67 60 59
5 70 56 76

6 77 71 73
7 45 47 78
8 60 67 62
9 63 60 75
10 60 59 74

11 61 57 60
12 56 60 75
13 59 54 70
14 74 72 71
15 66 63 65
Total 961 918 1064



Answer:
• Ho: Consumers have the same degree of preference for all three product types
• H1: Consumers have different degrees of preference for the three product types
• α = 0.05
• Critical region: χ² ≥ 5.991, from the table with df = k − 1 = 3 − 1 = 2 and α = 0.05
• Calculation: For the analysis, the scores of the three product types, which are in interval form, are converted into ordinal data (ranks within each panelist). For example, for panelist/group 1, the scores 76, 70, and 75 are converted into 3, 1, and 2. The conversion results are presented in the following table:



Ranks by product type
Panelist/Group A B C

1 3 1 2
2 2 1 3
3 1 2 3
4 3 2 1
5 2 1 3

6 3 1 2
7 1 2 3
8 1 3 2
9 2 1 3
10 2 1 3

11 3 1 2
12 1 2 3
13 2 1 3
14 3 2 1
15 3 1 2

Total (Rj) 32 22 36



$$\chi^2 = \frac{12}{Nk(k+1)} \sum_{j=1}^{k} R_j^2 - 3N(k+1)$$

$$\chi^2 = \frac{12}{(15)(3)(3+1)} \left(32^2 + 22^2 + 36^2\right) - 3(15)(3+1) = 6.93$$

• Conclusion:
χ²-calculated (= 6.93) is greater than χ²-table (= 5.991), so it lies within the critical region. Ho is therefore rejected, and it is concluded that consumer preference differs among the three product types.
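The whole Friedman calculation, from raw scores to decision, can be sketched as below. The scores are those of the 15 panelists in the example; the helper `rank_row` is hypothetical and assumes no ties within a panelist, which holds for this data set.

```python
# Scores (A, B, C) for each of the 15 panelists, from the example table.
scores = [
    (76, 70, 75), (71, 65, 77), (56, 57, 74), (67, 60, 59), (70, 56, 76),
    (77, 71, 73), (45, 47, 78), (60, 67, 62), (63, 60, 75), (60, 59, 74),
    (61, 57, 60), (56, 60, 75), (59, 54, 70), (74, 72, 71), (66, 63, 65),
]

def rank_row(row):
    """Ranks 1..k within one panelist (assumes no tied scores in the row)."""
    order = sorted(range(len(row)), key=lambda j: row[j])
    ranks = [0] * len(row)
    for rank, j in enumerate(order, start=1):
        ranks[j] = rank
    return ranks

ranked = [rank_row(row) for row in scores]
N = len(scores)        # number of panelists (groups)
k = len(scores[0])     # number of product types (categories)

rank_sums = [sum(r[j] for r in ranked) for j in range(k)]   # R_j per product
chi2 = 12 / (N * k * (k + 1)) * sum(R ** 2 for R in rank_sums) - 3 * N * (k + 1)
critical = 5.991       # table value for alpha = 0.05, df = k - 1 = 2

print("rank sums:", rank_sums)
print(f"chi2 = {chi2:.2f}, reject Ho: {chi2 >= critical}")
```

The rank sums come out as 32, 22, and 36, matching the totals in the rank table.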



Rank Correlation Coefficient

• Also known as the Spearman rank correlation coefficient
• It is a non-parametric measure of the relationship between two variables X and Y
• It is calculated by the equation:

$$r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$

• di = the difference between the ranks given to xi and yi, and n = number of pairs of data



• rs ranges from −1 to +1
• rs = −1 or rs = +1 means perfect correlation between X and Y
• If rs is close to 0 → X and Y are not correlated
• rs = +1 for identical rankings and rs = −1 for exactly opposite rankings

• Example:
The following table shows the nicotine and tar contents found in 10 brands of cigarettes. Calculate the rank correlation coefficient to determine the degree of correlation between the nicotine and tar contents of the cigarettes.



• Nicotine and tar contents:

Brand            Nicotine content (mg/kg)    Tar content (mg/kg)
Viceroy                   14                        0.9
Marlboro                  17                        1.1
Chesterfield              28                        1.6
Kool                      17                        1.3
Kent                      16                        1.0
Raleigh                   13                        0.8
Old Gold                  24                        1.5
Philip Morris             25                        1.4
Oasis                     18                        1.2
Player                    31                        2.0



xi = rank of nicotine content, yi = rank of tar content (tied values receive the average rank)

Brand            xi      yi      di      di²
Viceroy          2       2       0       0
Marlboro         4.5     4       0.5     0.25
Chesterfield     9       9       0       0
Kool             4.5     6      −1.5     2.25
Kent             3       3       0       0
Raleigh          1       1       0       0
Old Gold         7       8      −1       1
Philip Morris    8       7       1       1
Oasis            6       5       1       1
Player           10      10      0       0
Total                                    5.5



$$r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} = 1 - \frac{(6)(5.5)}{10(10^2 - 1)} = 0.97$$

• rs = 0.97 shows a high positive correlation between nicotine and tar contents
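The rank table and the coefficient can be reproduced from the raw measurements. This is a sketch under the assumption, consistent with the rank 4.5 shown for the two brands with a nicotine value of 17, that tied values share the average rank; `average_ranks` is a hypothetical helper.

```python
# Values from the example table, in brand order (Viceroy ... Player).
nicotine = [14, 17, 28, 17, 16, 13, 24, 25, 18, 31]
tar = [0.9, 1.1, 1.6, 1.3, 1.0, 0.8, 1.5, 1.4, 1.2, 2.0]

def average_ranks(values):
    """Rank from smallest to largest; tied values share the average rank."""
    ranks = []
    for v in values:
        smaller = sum(1 for u in values if u < v)
        equal = sum(1 for u in values if u == v)
        ranks.append(smaller + (equal + 1) / 2)
    return ranks

x = average_ranks(nicotine)
y = average_ranks(tar)

sum_d2 = sum((xi - yi) ** 2 for xi, yi in zip(x, y))   # sum of squared rank differences
n = len(x)
rs = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

print("sum of d^2 =", sum_d2)
print(f"rs = {rs:.2f}")
```

The sum of squared rank differences is 5.5, and the coefficient rounds to 0.97.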

The advantages of rs compared to r (the correlation coefficient in linear regression):
- It does not assume a linear relationship between X and Y
- It does not assume that X and Y are normally distributed
- It can be used when the measurements cannot be expressed numerically but only as ranks



• To test Ho: ρ = 0 (there is no correlation between X and Y), the critical values in Table A.14 are used.
• For a two-sided alternative hypothesis, the critical region of size α is split between the two tails. For a one-sided alternative, the critical region lies entirely in the left tail if H1 is ρ < 0 and entirely in the right tail if H1 is ρ > 0.

Example:
Using the data in the example above, test at the 0.01 significance level the hypothesis that there is no correlation between nicotine and tar contents (ρ = 0) against the alternative hypothesis that the correlation is greater than 0.



Answer:
1) Ho: ρ = 0
2) H1: ρ > 0
3) α = 0.01
4) Critical region: rs > 0.745, from the Spearman table with n = 10 and α = 0.01
5) Calculation: from the example above, rs = 0.97
6) Decision: Reject Ho

If n > 30, the significance of the correlation is tested by calculating

$$z = \frac{r_s - 0}{1/\sqrt{n-1}} = r_s \sqrt{n-1}$$

and comparing it with the critical value of the standard normal distribution in Table A.3.
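A minimal sketch of the large-sample test follows. The slides give no data set with n > 30, so the inputs rs = 0.5 and n = 37 are hypothetical and chosen only to make the arithmetic easy to follow; `spearman_z` is an illustrative name.

```python
import math

def spearman_z(rs, n):
    """Large-sample statistic z = rs * sqrt(n - 1) for testing Ho: rho = 0."""
    return rs * math.sqrt(n - 1)

# Hypothetical inputs for illustration only (not taken from the slides).
z = spearman_z(0.5, 37)      # sqrt(36) = 6, so z = 3.0
z_critical = 2.326           # one-sided standard normal critical value, alpha = 0.01

print(f"z = {z:.2f}, reject Ho: {z > z_critical}")
```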



Test for Independence
(χ² test of independence of two factors)
• It is a hypothesis test for the presence or absence of a relationship (association) between two factors.
• Example: Is student academic achievement (GPA) related to the timeliness of completion of the student final project (skripsi)?
• If there is no relationship between the two factors, the two factors are said to be independent of each other.
• The null hypothesis states that the two factors are independent (unrelated, not associated).
• Therefore, Ho: There is no relationship/association between X and Y

Contingency Coefficient

• Used to measure the strength of the relationship between two variables when the data are in nominal form.
• Closely related to the χ² test for a comparative hypothesis with k > 2 independent samples.
• The formula used to calculate the contingency coefficient is:

$$C = \sqrt{\frac{\chi^2}{N + \chi^2}}, \quad \text{where} \quad \chi^2 = \sum_{i=1}^{k} \sum_{j=1}^{r} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$



• N = number of samples
• k = number of levels of the independent variable A
• r = number of levels of the dependent variable B
• i = index of the level of the independent variable A
• j = index of the level of the dependent variable B
• Oij = observed frequency at the i-th level of A and the j-th level of B
• Eij = expected frequency at the i-th level of A and the j-th level of B
• Criterion: Ho is accepted if χ²-calculated is less than χ²-table with df = (k − 1)(r − 1)



Example:
A study was conducted to obtain information about the relationship between GPA (Grade Point Average) and the time students take to complete their final project (skripsi). The results for 100 students are presented in the table below (α = 0.05).
a) Determine whether there is a relationship between GPA and the duration of skripsi completion, and
b) Determine the contingency coefficient of the relationship

                                  Duration of skripsi completion
GPA                               On time (≤ 6 months)    More time (> 6 months)
Very satisfactory (GPA ≥ 2.75)            55                       20
Less satisfactory (GPA < 2.75)            10                       15
Observation (O):
                                  Duration of skripsi completion
GPA                               On time (≤ 6 months)    More time (> 6 months)    Total
Very satisfactory (GPA ≥ 2.75)            55                       20                 75
Less satisfactory (GPA < 2.75)            10                       15                 25
Total                                     65                       35                100

Expectation (E):
                                  Duration of skripsi completion
GPA                               On time (≤ 6 months)     More time (> 6 months)    Total
Very satisfactory (GPA ≥ 2.75)    75 × 65/100 = 48.75      75 × 35/100 = 26.25        75
Less satisfactory (GPA < 2.75)    25 × 65/100 = 16.25      25 × 35/100 = 8.75         25
Total                                     65                       35                100

$$\chi^2 = \sum_{i=1}^{k} \sum_{j=1}^{r} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} = \frac{(55-48.75)^2}{48.75} + \frac{(20-26.25)^2}{26.25} + \frac{(10-16.25)^2}{16.25} + \frac{(15-8.75)^2}{8.75} = 9.15751$$

$$C = \sqrt{\frac{\chi^2}{N + \chi^2}} = \sqrt{\frac{9.15751}{100 + 9.15751}} = 0.29$$

α = 0.05
df = (k − 1) × (r − 1) = (2 − 1) × (2 − 1) = 1
χ²(0.05; 1) = 3.84 (chi-square table)

χ²(0.05; 1) < χ²-calculated → reject Ho

→ The two variables are not independent: there is a relationship between student achievement (GPA) and the time taken to complete the skripsi.

The contingency coefficient C = 0.29 → it is significant compared to 0.
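The GPA example can be checked with a short script that derives the expected frequencies from the row and column totals and then evaluates χ² and C = √(χ² / (N + χ²)). This is a sketch with illustrative variable names; evaluating the formula with χ² = 9.1575 and N = 100 gives C ≈ 0.29.

```python
import math

# Observed frequencies: rows = GPA level, columns = (on time, more time).
observed = [
    [55, 20],   # very satisfactory (GPA >= 2.75)
    [10, 15],   # less satisfactory (GPA < 2.75)
]

row_totals = [sum(row) for row in observed]            # 75, 25
col_totals = [sum(col) for col in zip(*observed)]      # 65, 35
n = sum(row_totals)                                    # 100 students

# Expected frequency under independence: row total * column total / N
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2 = sum(
    (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(len(row_totals))
    for j in range(len(col_totals))
)
df = (len(row_totals) - 1) * (len(col_totals) - 1)
C = math.sqrt(chi2 / (n + chi2))                       # contingency coefficient
critical = 3.84                                        # table value, alpha = 0.05, df = 1

print("expected:", expected)
print(f"chi2 = {chi2:.4f}, df = {df}, C = {C:.2f}, reject Ho: {chi2 > critical}")
```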
Example
A study aims to determine whether there is a relationship between type of profession and the type of sport most often played. Professions are grouped into four types: doctors, lawyers, lecturers, and businessmen. Sports are also grouped into four types: golf, tennis, badminton, and football. For this purpose, random samples of 58 doctors, 75 lawyers, 68 lecturers, and 81 businessmen were taken. The data are as follows:



                      Profession (A)
Type of sport (B)     Doctor    Lawyer    Lecturer    Businessmen    Total
Golf                    17        23         10           30           80
Tennis                  23        14         17           26           80
Badminton               12        26         18           14           70
Football                 6        12         23           11           52
Total                   58        75         68           81          282



Answer:
1. Ho: There is no relationship between type of profession and type of sport played
2. H1: There is a relationship between type of profession and type of sport played
3. α = 0.05
4. Criterion: Ho is accepted if χ²-calculated is less than χ²-table
5. Calculation:



Type of sport     Total    Proportion
Golf (B1)           80     80/282 = 0.284
Tennis (B2)         80     80/282 = 0.284
Badminton (B3)      70     70/282 = 0.248
Football (B4)       52     52/282 = 0.184
Total              282     1.000



                   Doctor (A1)      Lawyer (A2)      Lecturer (A3)    Businessmen (A4)
Type of sport      Oij    Eij       Oij    Eij       Oij    Eij       Oij    Eij
Golf (B1)          17   16.472      23   21.300      10   19.312      30   23.004
Tennis (B2)        23   16.472      14   21.300      17   19.312      26   23.004
Badminton (B3)     12   14.384      26   18.600      18   16.864      14   20.088
Football (B4)       6   10.672      12   13.800      23   12.512      11   14.904
Total              58               75               68               81



Expected frequencies (proportion of the sport × total of the profession):

Type of sport      Doctor (A1)            Lawyer (A2)            Lecturer (A3)          Businessmen (A4)       Total
Total              58                     75                     68                     81                     282
Golf (B1)          0.284 × 58 = 16.472    0.284 × 75 = 21.300    0.284 × 68 = 19.312    0.284 × 81 = 23.004    80
Tennis (B2)        0.284 × 58 = 16.472    0.284 × 75 = 21.300    0.284 × 68 = 19.312    0.284 × 81 = 23.004    80
Badminton (B3)     0.248 × 58 = 14.384    0.248 × 75 = 18.600    0.248 × 68 = 16.864    0.248 × 81 = 20.088    70
Football (B4)      0.184 × 58 = 10.672    0.184 × 75 = 13.800    0.184 × 68 = 12.512    0.184 × 81 = 14.904    52



In this case, O (observed) = fo and E (expected) = fh

$$\chi^2 = \sum_{i=1}^{k} \sum_{j=1}^{r} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} = \frac{(17 - 16.472)^2}{16.472} + \frac{(23 - 21.300)^2}{21.300} + \dots + \frac{(11 - 14.904)^2}{14.904} = 29.881$$

$$C = \sqrt{\frac{\chi^2}{N + \chi^2}} = \sqrt{\frac{29.881}{282 + 29.881}} = 0.31$$

χ²-table = 16.92 (with α = 0.05 and df = (k − 1)(r − 1) = (4 − 1)(4 − 1) = 9)

χ²-calculated (= 29.88) > χ²-table (= 16.92) → Reject Ho
6. Conclusion: Profession is related to the type of sport played. The contingency coefficient of 0.31 is significant compared to 0.
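The 4 × 4 calculation can be sketched as follows. The slides round each sport's proportion to three decimals before computing the expected frequencies; the code reproduces that choice and, for comparison, also uses unrounded proportions, which gives a slightly smaller χ² (about 29.86). The function name `chi_square` and its `round_proportion` flag are illustrative.

```python
import math

# Observed frequencies: rows = sport, columns = (doctor, lawyer, lecturer, businessmen).
observed = {
    "Golf":      [17, 23, 10, 30],
    "Tennis":    [23, 14, 17, 26],
    "Badminton": [12, 26, 18, 14],
    "Football":  [6, 12, 23, 11],
}

col_totals = [sum(col) for col in zip(*observed.values())]   # 58, 75, 68, 81
n = sum(col_totals)                                          # 282

def chi_square(round_proportion):
    """Chi-square statistic; optionally round row proportions to 3 decimals as on the slides."""
    total = 0.0
    for row in observed.values():
        proportion = sum(row) / n
        if round_proportion:
            proportion = round(proportion, 3)
        for o, c in zip(row, col_totals):
            e = proportion * c            # expected frequency for this cell
            total += (o - e) ** 2 / e
    return total

chi2_slide = chi_square(round_proportion=True)    # reproduces the slide's arithmetic
chi2_exact = chi_square(round_proportion=False)   # unrounded expected frequencies
df = (len(observed) - 1) * (len(col_totals) - 1)
C = math.sqrt(chi2_slide / (n + chi2_slide))      # contingency coefficient
critical = 16.92                                  # table value, alpha = 0.05, df = 9

print(f"chi2 (rounded proportions) = {chi2_slide:.3f}")
print(f"chi2 (exact proportions)   = {chi2_exact:.3f}")
print(f"df = {df}, C = {C:.2f}, reject Ho: {chi2_slide > critical}")
```

Either version of the statistic exceeds the table value by a wide margin, so the rounding does not affect the decision.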
