Intermediate R - Analysis of Categorical Data

Analysis of Categorical Data
Violeta I. Bartolome
Senior Associate Scientist-Biometrics
PBGB-CRIL
v.bartolome@cgiar.org

Types of data
Categorical data (classification data)
o Nominal
o Ordinal
Quantitative data (measurement or scale data)
o Interval
o Ratio
Categorical Data
The objects being studied are grouped into categories based on some qualitative trait.
They are often recorded as counts of objects in each category.
Examples:
o Hair color
  • blonde, brown, red, black
o Growth duration
  • early, medium, long
o Smoking status
  • smoker, non-smoker

Nominal Data
A type of categorical data in which objects fall into unordered categories.
Examples:
o Rice variety group
  • indica, japonica, javanica
o Gender
  • male, female
o Smoking status
  • smoker, non-smoker
Ordinal Data
A type of categorical data in which order is important.
Examples:
o Growth duration
  • early, medium, long
o Degree of resistance
  • resistant, moderately resistant, susceptible
o Nitrogen rate
  • none, low, high

Binary Data
A type of categorical data in which there are only two categories.
Binary data can be either nominal or ordinal.
Examples:
o Gender
  • male, female
o Attendance
  • present, absent
o Insect state after treatment application
  • dead, alive
Quantitative Data
The objects being studied are "measured" based on some quantitative trait.
The resulting data are a set of numbers.
Types of quantitative data
o Interval data: ordinal, and distances between values are comparable (e.g. temperature, IQ)
o Ratio data: interval data with a true zero point as its origin (e.g. grain yield, age, number of trees in a forest)

Example: Adoption of Nitrogen Fertilizer
             Level of Adoption
Education    No    Yes    Total
Low          51    22     73
High          6    21     27
Total        57    43     100

Is there an association between level of education and adoption of nitrogen fertilizer?
If there is no association, the observed frequencies should be the same as the expected frequencies.

Expected frequencies:
             Level of Adoption
Education    No                       Yes                      Total
Low          57 * 73 / 100 = 41.61    43 * 73 / 100 = 31.39    73
High         57 * 27 / 100 = 15.39    43 * 27 / 100 = 11.61    27
Total        57                       43                       100

The chi-square test compares the observed and expected frequencies.
$\chi^2 = \sum \frac{(|O - E| - 0.5)^2}{E} = 16.3596$, P < .0001

chisq.test()
chisq.test(x,                         # a vector or matrix
           y = NULL,                  # a vector, ignored if x is a matrix
           correct = TRUE,            # a logical indicating whether to
                                      # apply continuity correction when
                                      # computing the test statistic for
                                      # 2x2 tables: one half is subtracted
                                      # from all |O-E| differences. No
                                      # correction is done if
                                      # simulate.p.value = TRUE
           simulate.p.value = FALSE,  # a logical indicating whether to
                                      # compute p-values by Monte Carlo
                                      # simulation
           B = 2000)                  # an integer specifying the number
                                      # of replicates used in the Monte
                                      # Carlo test
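The expected frequencies and the corrected chi-square statistic can be checked with a short script. This is a minimal sketch, not the original slide code: the object names (`observed`, `expected`, `chi_sq`, `result`) are illustrative, and the counts are typed in from the education-by-adoption table.

```r
# Observed counts from the education-by-adoption table
observed <- matrix(c(51, 22,
                      6, 21),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Education = c("Low", "High"),
                                   Adoption  = c("No", "Yes")))

# Expected count in each cell: (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-square statistic with the continuity correction:
# one half is subtracted from every |O - E| before squaring
chi_sq <- sum((abs(observed - expected) - 0.5)^2 / expected)

# The same test through chisq.test(); correct = TRUE is the default
result <- chisq.test(observed, correct = TRUE)

print(round(expected, 2))
print(round(chi_sq, 4))
print(result)
```

Setting `correct = FALSE` would drop the 0.5 adjustment and give a somewhat larger statistic.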
fisher.test()
fisher.test(x,                         # a vector or matrix
            y = NULL,                  # a vector, ignored if x is a matrix
            or = 1,                    # the hypothesized odds ratio.
                                       # Only used in the 2 by 2 case
            conf.int = TRUE,           # logical indicating if a confidence
                                       # interval should be computed
            conf.level = 0.95,         # confidence level for the returned
                                       # confidence interval. Only used in
                                       # the 2 by 2 case if conf.int = TRUE.
            simulate.p.value = FALSE,
            B = 2000)

Sample data set
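As an illustration of these arguments, Fisher's exact test can be run on the education-by-adoption counts from the nitrogen fertilizer example. This is a sketch only; the object names `adoption` and `ft` are illustrative and do not come from the slides.

```r
# Education-by-adoption counts from the nitrogen fertilizer example
adoption <- matrix(c(51, 22,
                      6, 21),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Education = c("Low", "High"),
                                   Adoption  = c("No", "Yes")))

# Exact test of the null hypothesis that the odds ratio equals 1
ft <- fisher.test(adoption, or = 1, conf.int = TRUE, conf.level = 0.95)

print(ft$p.value)    # exact two-sided p-value
print(ft$estimate)   # conditional maximum likelihood estimate of the odds ratio
print(ft$conf.int)   # 95% confidence interval for the odds ratio
```

The odds ratio reported by `fisher.test()` is a conditional estimate, so it need not equal the simple cross-product ratio of the table.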
Read data and tabulate

Frequencies to Percentage
The second argument in the prop.table function is the margin index: 1 for rows and 2 for columns.
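The tabulation and percentage steps might look like the sketch below. The original data file is not shown, so the data frame `farmers` and its column names are hypothetical; it is rebuilt here from the counts in the education-by-adoption table, one row per farmer.

```r
# Hypothetical raw data: one row per farmer, rebuilt from the table counts
farmers <- data.frame(
  education = rep(c("Low", "Low", "High", "High"), times = c(51, 22, 6, 21)),
  adoption  = rep(c("No", "Yes", "No", "Yes"),     times = c(51, 22, 6, 21))
)
farmers$education <- factor(farmers$education, levels = c("Low", "High"))

# Cross-tabulate education (rows) by adoption (columns)
counts <- table(farmers$education, farmers$adoption)

# Convert frequencies to percentages
row_pct <- prop.table(counts, 1) * 100   # margin 1: each row sums to 100
col_pct <- prop.table(counts, 2) * 100   # margin 2: each column sums to 100

print(counts)
print(round(row_pct, 1))
print(round(col_pct, 1))
```

With real data, the `data.frame()` call would be replaced by a file reader such as `read.csv()`.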
Chi-square test
The test is significant, indicating that adoption and level of education are not independent.

Logistic Regression
Used to predict a two-category outcome from a set of independent variables
Response variable: binary (0, 1)
Can handle more than 1 independent variable, which can be
o Categorical
o Quantitative
o Mixture of both
Logit Model
Odds: ratio of success to failure
$\ln\left(\frac{p}{1-p}\right) = \alpha + \beta x$
p = probability of being an adopter
x = 1 if high level of education
x = 0 if low level of education
Logistic model
$p = \frac{\exp(\alpha + \beta x)}{1 + \exp(\alpha + \beta x)}$ or $p = \frac{1}{1 + \exp(-\alpha - \beta x)}$

Logistic Regression in R using 2 x 2 data
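A logit model for the 2 x 2 adoption data could be fitted as sketched below. The data frame `adopt` and its column names are assumptions made for illustration; the counts come from the education-by-adoption table, with x coded 0 for low and 1 for high education.

```r
# Grouped binomial data: one row per education level
adopt <- data.frame(
  x   = c(0, 1),     # 0 = low education, 1 = high education
  yes = c(22, 21),   # adopters
  no  = c(51, 6)     # non-adopters
)

# Logit model: ln(p / (1 - p)) = alpha + beta * x
fit <- glm(cbind(yes, no) ~ x, family = binomial, data = adopt)

print(coef(fit))                    # alpha and beta on the log-odds scale
print(exp(coef(fit)))               # baseline odds and odds ratio
print(anova(fit, test = "Chisq"))   # likelihood-ratio test of the model

# Estimated probability of adoption at each education level
p_hat <- predict(fit, newdata = data.frame(x = c(0, 1)), type = "response")
print(round(p_hat, 4))
```

Because the model has as many parameters as there are education groups, the fitted probabilities equal the observed proportions 22/73 and 21/27.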
Test significance of the model
Test is significant, indicating non-independence between adoption and level of education.

R Output
$\ln(\text{odds}) = -0.8408 + 2.0935x$
$\frac{p}{1-p} = e^{-0.8408} e^{2.0935x} = (0.43136)(8.113^{x})$
$p = \frac{(0.43136)(8.113^{x})}{1 + (0.43136)(8.113^{x})}$
low: x = 0; p = 0.3014
high: x = 1; p = 0.7778
Another Example
Effect of insecticide usage (low, high) and smoking history on the incidence of pulmonary ailments in farm workers.

Never Smoked:
Insecticide    Pulmonary Ailment
Rate           No      Yes     Total
Low            5376    55      5431
High           5392    96      5488
$\chi^2 = 10.86$, P < .001
Shows strong evidence that rate of insecticide has an effect on pulmonary ailment.

Past Smokers:
Insecticide    Pulmonary Ailment
Rate           No      Yes     Total
Low            4310    63      4373
High           4276    105     4381

Current Smokers:
Insecticide    Pulmonary Ailment
Rate           No      Yes     Total
Low            1192    21      1213
High           1188    37      1225
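The chi-square test can be applied to each smoking-history table separately, as in the sketch below. The helper `make_table()` and the other names are illustrative. Using `correct = FALSE` is an assumption: the reported value of 10.86 appears to correspond to the uncorrected statistic for the never-smoked table.

```r
# Helper (illustrative) that builds a 2x2 table of rate by pulmonary ailment
make_table <- function(low, high) {
  matrix(c(low, high), nrow = 2, byrow = TRUE,
         dimnames = list(Rate = c("Low", "High"), Ailment = c("No", "Yes")))
}

# One table per smoking history, counts taken from the slides
strata <- list(
  never   = make_table(c(5376, 55), c(5392, 96)),
  past    = make_table(c(4310, 63), c(4276, 105)),
  current = make_table(c(1192, 21), c(1188, 37))
)

# Chi-square test without continuity correction in each stratum
tests <- lapply(strata, chisq.test, correct = FALSE)
stats <- sapply(tests, function(res) unname(res$statistic))
pvals <- sapply(tests, function(res) res$p.value)

print(round(stats, 2))
print(signif(pvals, 3))
```

Separate tests ignore smoking history as a factor, which is the motivation for a logit model with more than one independent variable.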
Logit Model with more than 1 Independent Variable
$\ln(\text{odds}) = b_0 + b_1 R + b_2 P + b_3 C$
R=1 if using high rate of insecticide, R=0 if using low rate of insecticide
P=1 if past smoker, P=0 if not a past smoker
C=1 if current smoker, C=0 if not a current smoker
From R:
b0=-4.574 b1=0.541 b2=0.335 b3=0.219
$\ln(\text{odds}) = -4.574 + 0.541R + 0.335P + 0.219C$
$\frac{p}{1-p} = e^{-4.574} e^{0.541R} e^{0.335P} e^{0.219C}$

R Script for logistic regression with more than 1 independent variable
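A logit model with insecticide rate and smoking history might be fitted as in the sketch below. The original script is not shown, so the data frame `workers` and its layout are assumptions. The coding of P and C follows the table of predicted probabilities in these slides, where current smokers are entered with P=1 and C=1; that reading is itself an assumption.

```r
# Grouped data: one row per combination of insecticide rate and smoking history
workers <- data.frame(
  R   = c(0, 1, 0, 1, 0, 1),   # 1 = high insecticide rate
  P   = c(0, 0, 1, 1, 1, 1),   # assumed coding: 1 for past and current smokers
  C   = c(0, 0, 0, 0, 1, 1),   # 1 = current smoker
  yes = c(55, 96, 63, 105, 21, 37),              # with pulmonary ailment
  no  = c(5376, 5392, 4310, 4276, 1192, 1188)    # without pulmonary ailment
)

# ln(odds) = b0 + b1*R + b2*P + b3*C
fit <- glm(cbind(yes, no) ~ R + P + C, family = binomial, data = workers)

print(round(coef(fit), 4))           # parameter estimates
print(anova(fit, test = "Chisq"))    # sequential chi-square tests of the effects
print(summary(fit)$coefficients)     # estimates with standard errors
```

Under this coding, the coefficient of C measures the additional effect of current smoking over past smoking, not the difference from workers who never smoked.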
Test significance of model
Result indicates that model is significant.

Chi-square to test effects
Indicates that rate of insecticide and past smoking have an effect on pulmonary ailment. Current smoking has no significant effect on pulmonary ailment.

Parameter estimates
$\ln(\text{odds}) = -4.5739 + 0.5407R + 0.3346P + 0.2188C$
$\frac{p}{1-p} = e^{-4.5739} e^{0.5407R} e^{0.3346P} e^{0.2188C}$
$\frac{p}{1-p} = (0.01032)(1.71721^{R})(1.39738^{P})(1.24482^{C})$
$\frac{p}{1-p} = (0.01032)(1.71721^{R})(1.39738^{P})(1.24482^{C})$

For non-smokers (P=0, C=0)
$\frac{p}{1-p} = (0.01032)(1.71721^{R})$
        PA
Rate    No      Yes     Total
Low     5376    55      5431
High    5392    96      5488
$\hat{\theta} = 1.71721$
R = 0: estimated odds = 0.01032, so $\hat{p} = 0.01032 / 1.01032 = 0.0102$; actual p = 55/5431 = 0.0101
R = 1: estimated odds = 0.0177, so $\hat{p} = 0.0174$

For past smokers (P=1, C=0)
$\frac{p}{1-p} = (0.01442)(1.71721^{R})$
        PA
Rate    No      Yes     Total
Low     4310    63      4373
High    4276    105     4381
$\hat{\theta} = 1.71721$
R = 0: estimated odds = 0.01442, so $\hat{p} = 0.01442 / 1.01442 = 0.0142$; actual p = 63/4373 = 0.0144
R = 1: estimated odds = 0.0248, so $\hat{p} = 0.0242$

The estimated and actual values are very close, an indication that the model fits well.
Predict probabilities of success

Conditions                    R    P    C    Est. p    Actual p
High Rate, Never Smoked       1    0    0    0.0174    0.0174
High Rate, Past Smoker        1    1    0    0.0242    0.0240
High Rate, Current Smoker     1    1    1    0.0299    0.0299
Low Rate, Never Smoked        0    0    0    0.0102    0.0101
Low Rate, Past Smoker         0    1    0    0.0142    0.0144
Low Rate, Current Smoker      0    1    1    0.0176    0.0176

On average, the expected probability of having a pulmonary ailment is higher for those who use a high rate of insecticide. The probability is further increased if the farmer is a past or current smoker.
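The estimated probabilities can be computed directly from the reported parameter estimates. The sketch below is illustrative only: the names `b` and `conditions` are not from the slides, and `plogis()` is used as a shortcut for exp(z) / (1 + exp(z)).

```r
# Parameter estimates reported on the slides
b <- c(b0 = -4.5739, R = 0.5407, P = 0.3346, C = 0.2188)

# The six conditions, coded as in the table of predicted probabilities
conditions <- data.frame(
  label = c("High Rate, Never Smoked", "High Rate, Past Smoker",
            "High Rate, Current Smoker", "Low Rate, Never Smoked",
            "Low Rate, Past Smoker", "Low Rate, Current Smoker"),
  R = c(1, 1, 1, 0, 0, 0),
  P = c(0, 1, 1, 0, 1, 1),
  C = c(0, 0, 1, 0, 0, 1)
)

# Log-odds for each condition, then the inverse logit to get probabilities
log_odds <- b[["b0"]] + b[["R"]] * conditions$R +
  b[["P"]] * conditions$P + b[["C"]] * conditions$C
conditions$est_p <- round(plogis(log_odds), 4)

print(conditions)
```

The `est_p` column should agree with the Est. p column of the table up to rounding.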
Another Example
Effect of two types of treatment to control disease incidence.

Treatment    No Disease    With Disease    Total
Control      3             17              20
A            15            5               20
B            14            6               20
Total        32            28              60

Create two dummy binary variables for treatment:
T1=1 if treatment=A, T1=0 otherwise
T2=1 if treatment=B, T2=0 otherwise
Logit model: $\ln\left(\frac{p}{1-p}\right) = b_0 + b_1 T_1 + b_2 T_2$

R script for logistic regression with an independent variable with more than 2 levels
Read data
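The dummy-variable model for the treatment data might be fitted as sketched below. The data frame `trial` and its column names are assumptions, since the original script and data file are not shown; the counts are taken from the treatment table.

```r
# Grouped data: one row per treatment, counts from the treatment table
trial <- data.frame(
  treatment = factor(c("Control", "A", "B"), levels = c("Control", "A", "B")),
  diseased  = c(17, 5, 6),
  healthy   = c(3, 15, 14)
)

# Dummy variables with Control as the reference level
trial$T1 <- as.numeric(trial$treatment == "A")
trial$T2 <- as.numeric(trial$treatment == "B")

# Logit model: ln(p / (1 - p)) = b0 + b1*T1 + b2*T2
fit <- glm(cbind(diseased, healthy) ~ T1 + T2, family = binomial, data = trial)

print(round(coef(fit), 3))           # b0, b1 (A vs control), b2 (B vs control)
print(anova(fit, test = "Chisq"))    # test significance of the model

# Expected probability of disease for each treatment
trial$expected_p <- predict(fit, type = "response")
print(trial[, c("treatment", "T1", "T2", "expected_p")])
```

Passing `treatment` to `glm()` directly as a factor would create the same two dummy variables automatically, with the first factor level as the reference.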
Logit model
o Test significance of model
o A vs control
o B vs control

From R Output:
b0=1.735 b1=-2.833 b2=-2.582
$\ln\left(\frac{p}{1-p}\right) = 1.735 - 2.833T_1 - 2.582T_2$
$p = \frac{(5.669)(0.0588^{T_1})(0.0756^{T_2})}{1 + (5.669)(0.0588^{T_1})(0.0756^{T_2})}$

Script to compute expected probabilities

Treatment    T1    T2    Expected Probabilities
A            1     0     0.25
B            0     1     0.30
Control      0     0     0.85
The probability of disease incidence is highest if no treatment is applied.
Between the two treatments, the probability of disease incidence is slightly higher with treatment B than with treatment A.
Logistic Models with quantitative Independent Variable
Example: Effect of farm size on the adoption of improved fallow
o Y: adoption (1=Yes, 0=No)
o X: farm size

Why Regression Analysis is not appropriate when dependent variable is binary.
[Figure: Use of improved fallow (0 to 1) versus Farm Size (0 to 9)]
o May produce predicted values which are negative or greater than 1.
o Predicted values of Y can assume a continuous range of values, but Y can only be 0 or 1.
What should be done?
o Fit a model on the probability of adoption
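The first problem with ordinary regression on a binary response can be seen with a tiny example. The data below are hypothetical and made up purely for illustration; they are not the farm-size data from the slides.

```r
# Hypothetical data: ten farms, adoption coded 1 = yes, 0 = no
farm_size <- 0:9
adopted   <- c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)

# Ordinary linear regression of the binary response on farm size
linear_fit  <- lm(adopted ~ farm_size)
linear_pred <- predict(linear_fit)

# Fitted values at the smallest and largest farm sizes fall outside [0, 1]
print(round(linear_pred, 3))
print(range(linear_pred))
```

For these made-up data, the fitted straight line goes below 0 for the smallest farms and above 1 for the largest, so its predictions cannot be read as probabilities.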
Probability of adoption increases as farm size increases.
[Figure: Probability (0 to 1) versus Farm Size (0 to 9)]
Curve is not a straight line but an S-shaped curve.
Model for this curve is:
$p = \frac{\exp(\alpha + \beta x)}{1 + \exp(\alpha + \beta x)}$
Estimate α and β using logit regression equation:
$\ln\left(\frac{p}{1-p}\right) = \alpha + \beta x$

R script for logistic regression with quantitative independent variable
Read data
Logit model

Parameter estimates
$\ln\left(\frac{p}{1-p}\right) = -2.503 + 0.453x$
$p = \frac{(0.0818)(1.573^{x})}{1 + (0.0818)(1.573^{x})}$
If farm size = 5 acres, p = 0.44; that is, the probability that the farmer is an adopter is 0.44.
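The probability for any farm size follows from the reported estimates. The sketch below uses only those two numbers; the function name `adoption_prob` is illustrative, and the farm-size data themselves are not available here.

```r
# Parameter estimates reported for the farm-size model
alpha <- -2.503
beta  <- 0.453

# Probability of adoption for a given farm size (inverse logit of alpha + beta*x)
adoption_prob <- function(farm_size) {
  plogis(alpha + beta * farm_size)
}

# Probability for a 5-acre farm
p5 <- adoption_prob(5)
print(round(p5, 2))

# Points along the S-shaped curve for farm sizes 0 to 9
curve_points <- data.frame(farm_size = 0:9,
                           p = round(adoption_prob(0:9), 3))
print(curve_points)
```

Because beta is positive, the probabilities in `curve_points` increase with farm size.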
Modeling Responses with More than Two Categories
Re-organize or ignore some of the categories temporarily, to reduce to a binary response.
Divide categories into a series of binary categories.
Use multinomial logistic regression as an extension of binary logistic regression.
Use log-linear models if all variables are categorical.

Thank you!
