
Analysis of Categorical Data

Violeta I. Bartolome
Senior Associate Scientist-Biometrics
PBGB-CRIL
v.bartolome@cgiar.org

Types of data

▪ Categorical data (classification data)
  o Nominal
  o Ordinal
▪ Quantitative data (measurement or scale data)
  o Interval
  o Ratio

Categorical Data

▪ The objects being studied are grouped into categories based on some qualitative trait.
▪ They are often recorded as counts of objects in each category.
▪ Examples:
  o Hair color
    • blonde, brown, red, black
  o Growth duration
    • early, medium, long
  o Smoking status
    • smoker, non-smoker

Nominal Data

▪ A type of categorical data in which objects fall into unordered categories.
▪ Examples:
  o Rice variety group
    • indica, japonica, javanica
  o Gender
    • male, female
  o Smoking status
    • smoker, non-smoker
Ordinal Data

▪ A type of categorical data in which order is important.
▪ Examples:
  o Growth duration
    • early, medium, long
  o Degree of resistance
    • resistant, moderately resistant, susceptible
  o Nitrogen rate
    • none, low, high

Binary Data

▪ A type of categorical data in which there are only two categories.
▪ Binary data can be either nominal or ordinal.
▪ Examples:
  o Gender
    • male, female
  o Attendance
    • present, absent
  o Insect state after treatment application
    • dead, alive

Quantitative Data

▪ The objects being studied are "measured" based on some quantitative trait.
▪ The resulting data are a set of numbers.
▪ Types of quantitative data
  o Interval data: ordinal data in which distances between values are comparable (e.g. temperature, IQ)
  o Ratio data: interval data with a true zero point as the origin (e.g. grain yield, age, number of trees in a forest)

Example: Adoption of Nitrogen Fertilizer

             Level of Adoption
Education    No     Yes    Total
Low          51     22     73
High          6     21     27
Total        57     43     100

Is there an association between level of education and adoption of nitrogen fertilizer?
If there is no association, the observed frequencies should be the same as the expected frequencies.

Expected frequencies:

             Level of Adoption
Education    No                       Yes                      Total
Low          57 × 73 / 100 = 41.61    43 × 73 / 100 = 31.39    73
High         57 × 27 / 100 = 15.39    43 × 27 / 100 = 11.61    27
Total        57                       43                       100

The chi-square test compares the observed (O) and expected (E) frequencies. With the continuity correction for 2x2 tables:

$$\chi^2 = \sum \frac{(|O - E| - 0.5)^2}{E} = 16.3596, \quad P < .0001$$

chisq.test()

chisq.test(x,                         # a vector or matrix
           y = NULL,                  # a vector, ignored if x is a matrix
           correct = TRUE,            # a logical indicating whether to
                                      # apply continuity correction when
                                      # computing the test statistic for
                                      # 2x2 tables: one half is subtracted
                                      # from all |O-E| differences. No
                                      # correction is done if
                                      # simulate.p.value = TRUE
           simulate.p.value = FALSE,  # a logical indicating whether to
                                      # compute p-values by Monte Carlo
                                      # simulation
           B = 2000)                  # an integer specifying the number
                                      # of replicates used in the Monte
                                      # Carlo test
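The calculation above can be reproduced with a short R sketch. The object names `adoption` and `result` are illustrative choices, not taken from the slides; the counts are those of the adoption example.

```r
# Observed 2x2 table from the adoption example
# (rows: level of education, columns: adoption of nitrogen fertilizer)
adoption <- matrix(c(51, 22,
                      6, 21),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Education = c("Low", "High"),
                                   Adoption  = c("No", "Yes")))

# Pearson chi-square test with the continuity correction (the default)
result <- chisq.test(adoption, correct = TRUE)

result$expected   # expected frequencies under no association
result$statistic  # corrected chi-square statistic
result$p.value    # p-value on 1 degree of freedom
```

The `expected` component should match the hand-computed table (41.61, 31.39, 15.39, 11.61), and the statistic should match the corrected value of 16.3596 quoted on the slide.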

fisher.test()

fisher.test(x,                         # a vector or matrix
            y = NULL,                  # a vector, ignored if x is a matrix
            or = 1,                    # the hypothesized odds ratio.
                                       # Only used in the 2 by 2 case
            conf.int = TRUE,           # logical indicating if a confidence
                                       # interval should be computed
            conf.level = 0.95,         # confidence level for the returned
                                       # confidence interval. Only used in
                                       # the 2 by 2 case if conf.int = TRUE.
            simulate.p.value = FALSE,
            B = 2000)

Sample data set
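As an illustration of the arguments listed above, the sketch below applies `fisher.test()` to the same 2x2 adoption table. The slides do not report Fisher's exact test for this table, so this is only a usage example; the object names are illustrative.

```r
# Observed 2x2 table from the adoption example
adoption <- matrix(c(51, 22,
                      6, 21),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Education = c("Low", "High"),
                                   Adoption  = c("No", "Yes")))

# Fisher's exact test with a 95% confidence interval for the odds ratio
exact <- fisher.test(adoption, or = 1, conf.int = TRUE, conf.level = 0.95)

exact$p.value   # exact p-value for the null hypothesis odds ratio = 1
exact$estimate  # conditional maximum likelihood estimate of the odds ratio
exact$conf.int  # confidence interval for the odds ratio
```

Note that `estimate` is a conditional maximum likelihood estimate, so it generally differs slightly from the sample odds ratio (51 × 21) / (22 × 6).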
Read data and tabulate

Frequencies to Percentage

The second argument of the prop.table function is the margin index: 1 for rows and 2 for columns.
Chi-square test

The test is significant, indicating that adoption and level of education are not independent.

Logistic Regression

▪ Used to predict a two-category outcome from a set of independent variables
▪ Response variable: binary (0, 1)
▪ Can handle more than 1 independent variable, which can be
  o Categorical
  o Quantitative
  o Mixture of both
Logit Model

Odds: ratio of success to failure

$$\ln\left(\frac{p}{1-p}\right) = \alpha + \beta x$$

p = probability of being an adopter
x = 1 if high level of education
x = 0 if low level of education

Logistic model

$$p = \frac{\exp(\alpha + \beta x)}{1 + \exp(\alpha + \beta x)} \quad \text{or} \quad p = \frac{1}{1 + \exp(-\alpha - \beta x)}$$

Logistic Regression in R using 2 x 2 data
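The two forms of the logistic model are algebraically the same, and both invert the logit. A small R check, using hypothetical values for α and β chosen only for illustration:

```r
# Hypothetical parameter values, for illustration only
alpha <- -0.5
beta  <- 1.2
x     <- c(0, 1)

# First form: exp(a + bx) / (1 + exp(a + bx))
p_form1 <- exp(alpha + beta * x) / (1 + exp(alpha + beta * x))

# Second form: 1 / (1 + exp(-a - bx))
p_form2 <- 1 / (1 + exp(-alpha - beta * x))

# Base R's logistic distribution function gives the same probabilities
p_builtin <- plogis(alpha + beta * x)

# Taking the logit of p recovers the linear predictor a + bx
logit_p <- log(p_form1 / (1 - p_form1))
```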

Test significance of the model

The test is significant, indicating non-independence between adoption and level of education.

R Output

$$\ln(\text{odds}) = -0.8408 + 2.0935x$$

$$\frac{p}{1-p} = e^{-0.8408} e^{2.0935x}$$

$$p = \frac{(0.43136)(8.113)^x}{1 + (0.43136)(8.113)^x}$$

low: x = 0; p = 0.3014
high: x = 1; p = 0.7778
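The estimates above can be reproduced from the 2x2 adoption table. This is a sketch under assumptions: the grouped-data layout and the names `adoption`, `yes`, `no`, and `fit` are illustrative, and the original script may have read the data differently.

```r
# Grouped adoption data: one row per education level
adoption <- data.frame(
  x   = c(0, 1),    # 0 = low education, 1 = high education
  yes = c(22, 21),  # adopters
  no  = c(51, 6)    # non-adopters
)

# Logistic regression on grouped binomial counts
fit <- glm(cbind(yes, no) ~ x, family = binomial, data = adoption)

coef(fit)       # alpha and beta on the log-odds scale
exp(coef(fit))  # odds at x = 0 and the odds ratio for high vs low

# Estimated probability of adoption at each education level
p_hat <- predict(fit, newdata = data.frame(x = c(0, 1)), type = "response")

# Likelihood-ratio chi-square test for the education effect
anova(fit, test = "Chisq")
```

Because the model has as many parameters as there are groups, the estimated probabilities equal the observed proportions 22/73 and 21/27.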
Another Example

Effect of insecticide usage (low, high) and smoking history on the incidence of pulmonary ailments in farm workers.

Never Smoked:

Insecticide    Pulmonary Ailment
Rate           No      Yes     Total
Low            5376    55      5431
High           5392    96      5488

$$\chi^2 = 10.86, \quad P < .001$$

Shows strong evidence that rate of insecticide has an effect on pulmonary ailment.

Past Smokers:

Insecticide    Pulmonary Ailment
Rate           No      Yes     Total
Low            4310    63      4373
High           4276    105     4381

Current Smokers:

Insecticide    Pulmonary Ailment
Rate           No      Yes     Total
Low            1192    21      1213
High           1188    37      1225
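The reported chi-square of 10.86 can be checked against the never-smoked table. The value matches the Pearson statistic without the continuity correction; that is an inference from the numbers rather than something the slides state. Object names are illustrative.

```r
# Never-smoked farm workers: insecticide rate by pulmonary ailment
never_smoked <- matrix(c(5376, 55,
                         5392, 96),
                       nrow = 2, byrow = TRUE,
                       dimnames = list(Rate    = c("Low", "High"),
                                       Ailment = c("No", "Yes")))

# Pearson chi-square test without the continuity correction
test_never <- chisq.test(never_smoked, correct = FALSE)

test_never$statistic  # uncorrected chi-square statistic
test_never$p.value    # p-value on 1 degree of freedom
```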

Logit Model with more than 1 Independent Variable

$$\ln(\text{odds}) = b_0 + b_1 R + b_2 P + b_3 C$$

R=1 if using high rate of insecticide, R=0 if using low rate of insecticide
P=1 if past smoker, P=0 if not a past smoker
C=1 if current smoker, C=0 if not a current smoker

From R:
b0 = -4.574, b1 = 0.541, b2 = 0.335, b3 = 0.219

$$\ln(\text{odds}) = -4.574 + 0.541R + 0.335P + 0.219C$$

$$\frac{p}{1-p} = e^{-4.574} e^{0.541R} e^{0.335P} e^{0.219C}$$

R Script for logistic regression with more than 1 independent variable
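A sketch of how this model could be fitted from the three tables of counts. The grouped layout and the names `ailment`, `yes`, `no`, and `fit` are assumptions, since the original script is not reproduced. Current smokers are coded P = 1 and C = 1, following the conditions table in these slides.

```r
# One row per combination of insecticide rate and smoking history
ailment <- data.frame(
  R   = c(0, 1, 0, 1, 0, 1),  # 1 = high rate of insecticide
  P   = c(0, 0, 1, 1, 1, 1),  # 1 = past or current smoker (per conditions table)
  C   = c(0, 0, 0, 0, 1, 1),  # 1 = current smoker
  yes = c(55, 96, 63, 105, 21, 37),             # with pulmonary ailment
  no  = c(5376, 5392, 4310, 4276, 1192, 1188)   # without pulmonary ailment
)

# Logistic regression on grouped binomial counts
fit <- glm(cbind(yes, no) ~ R + P + C, family = binomial, data = ailment)

round(coef(fit), 3)        # b0, b1, b2, b3 on the log-odds scale
summary(fit)$coefficients  # estimates, standard errors, Wald z tests
round(exp(coef(fit)), 5)   # multiplicative effects on the odds
```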
Test significance of model

Result indicates that the model is significant.

Chi-square to test effects

Indicates that rate of insecticide and past smoking have an effect on pulmonary ailment. Current smoking has no significant effect on pulmonary ailment.

Parameter estimates

$$\ln(\text{odds}) = -4.5739 + 0.5407R + 0.3346P + 0.2188C$$

$$\frac{p}{1-p} = e^{-4.5739} e^{0.5407R} e^{0.3346P} e^{0.2188C}$$

$$\frac{p}{1-p} = (0.01032)(1.71721^R)(1.39738^P)(1.24482^C)$$
$$\frac{p}{1-p} = (0.01032)(1.71721^R)(1.39738^P)(1.24482^C)$$

For non-smokers (P=0, C=0)

$$\frac{p}{1-p} = (0.01032)(1.71721^R), \quad \hat{\theta} = 1.71721$$

       Pulmonary Ailment
Rate   No      Yes     Total
Low    5376    55      5431
High   5392    96      5488

R = 0: estimated odds = 0.01032, so p̂ = 0.0102; actual p = 55/5431 = 0.0101
R = 1: estimated odds = 0.0177, so p̂ = 0.0174

For past smokers (P=1, C=0)

$$\frac{p}{1-p} = (0.01442)(1.71721^R), \quad \hat{\theta} = 1.71721$$

       Pulmonary Ailment
Rate   No      Yes     Total
Low    4310    63      4373
High   4276    105     4381

R = 0: estimated odds = 0.01442, so p̂ = 0.0142; actual p = 63/4373 = 0.0144
R = 1: estimated odds = 0.0248, so p̂ = 0.0242

Estimated and actual values are very close, an indication that the model fits well.

Predict probabilities of success

Conditions                     R    P    C    Est. p    Actual p
High Rate, Never Smoked        1    0    0    0.0174    0.0174
High Rate, Past Smoker         1    1    0    0.0242    0.0240
High Rate, Current Smoker      1    1    1    0.0299    0.0299
Low Rate, Never Smoked         0    0    0    0.0102    0.0101
Low Rate, Past Smoker          0    1    0    0.0142    0.0144
Low Rate, Current Smoker       0    1    1    0.0176    0.0176

On average, the expected probability of having a pulmonary ailment is higher for those who use a high rate of insecticide. The probability is further increased if the farmer is a past or current smoker.
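The estimated probabilities in the table follow directly from the reported parameter estimates. A minimal sketch, with the coefficients typed in from the slides and with illustrative object names:

```r
# Parameter estimates reported on the slides
b <- c(b0 = -4.5739, R = 0.5407, P = 0.3346, C = 0.2188)

# The six conditions, coded as in the table
conditions <- data.frame(
  label = c("High rate, never smoked", "High rate, past smoker",
            "High rate, current smoker", "Low rate, never smoked",
            "Low rate, past smoker", "Low rate, current smoker"),
  R = c(1, 1, 1, 0, 0, 0),
  P = c(0, 1, 1, 0, 1, 1),
  C = c(0, 0, 1, 0, 0, 1)
)

# Linear predictor on the log-odds scale
log_odds <- b[["b0"]] + b[["R"]] * conditions$R +
  b[["P"]] * conditions$P + b[["C"]] * conditions$C

# Inverse logit converts log-odds to probabilities
conditions$est_p <- round(plogis(log_odds), 4)
conditions
```

With a fitted `glm` object available, `predict(..., type = "response")` would give the same probabilities without typing the coefficients by hand.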
Another Example

Effect of two types of treatment to control disease incidence.

Treatment    No Disease    With Disease    Total
Control      3             17              20
A            15            5               20
B            14            6               20
Total        32            28              60

Create two dummy binary variables for treatment:
T1=1 if treatment=A, T1=0 otherwise
T2=1 if treatment=B, T2=0 otherwise

Logit model:

$$\ln\left(\frac{p}{1-p}\right) = b_0 + b_1 T_1 + b_2 T_2$$

R script for logistic regression with an independent variable with more than 2 levels

Read data
Logit model

Test significance of model
A vs control
B vs control

From R Output:
b0 = 1.735, b1 = -2.833, b2 = -2.582

$$\ln\left(\frac{p}{1-p}\right) = 1.735 - 2.833T_1 - 2.582T_2$$

$$p = \frac{(5.669)(0.0588^{T_1})(0.0756^{T_2})}{1 + (5.669)(0.0588^{T_1})(0.0756^{T_2})}$$

Script to compute expected probabilities

Treatment    T1    T2    Expected Probabilities
A            1     0     0.25
B            0     1     0.30
Control      0     0     0.85

The probability of disease incidence is higher if no treatment is applied. However, the probability of disease incidence is slightly higher if treatment B is used than if treatment A is used.
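A sketch of how these estimates and expected probabilities could be obtained from the treatment table. The names `disease`, `with_disease`, `no_disease`, and `fit` are assumptions. Instead of building T1 and T2 by hand, the sketch relies on R's default treatment contrasts, which create the equivalent dummy variables when Control is the first factor level.

```r
# Grouped disease data: one row per treatment
disease <- data.frame(
  treatment    = factor(c("Control", "A", "B"),
                        levels = c("Control", "A", "B")),
  with_disease = c(17, 5, 6),
  no_disease   = c(3, 15, 14)
)

# Logistic regression; "treatmentA" and "treatmentB" play the roles of T1 and T2
fit <- glm(cbind(with_disease, no_disease) ~ treatment,
           family = binomial, data = disease)

round(coef(fit), 3)        # b0, b1 (A vs control), b2 (B vs control)
summary(fit)$coefficients  # Wald tests for A vs control and B vs control

# Expected probability of disease for each treatment
disease$expected_p <- fitted(fit)
disease
```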

Logistic Models with Quantitative Independent Variable

▪ Example: Effect of farm size on the adoption of improved fallow
  o Y: adoption (1=Yes, 0=No)
  o X: farm size

[Figure: use of improved fallow (0 or 1) plotted against farm size (0 to 9)]

▪ Why regression analysis is not appropriate when the dependent variable is binary:
  o It may produce predicted values which are negative or greater than 1.
  o Predicted values of Y can assume a continuous range of values, but Y can only be 0 or 1.
▪ What should be done?
  o Fit a model on the probability of adoption
[Figure: probability of adoption (0 to 1) plotted against farm size (0 to 9)]

▪ Probability of adoption increases as farm size increases.
▪ The curve is not a straight line but an S-shaped curve.
▪ The model for this curve is:

$$p = \frac{\exp(\alpha + \beta x)}{1 + \exp(\alpha + \beta x)}$$

Estimate α and β using the logit regression equation:

$$\ln\left(\frac{p}{1-p}\right) = \alpha + \beta x$$

R script for logistic regression with quantitative independent variable

Read data

Logit model

Parameter estimates

$$\ln\left(\frac{p}{1-p}\right) = -2.503 + 0.453x$$

$$p = \frac{(0.0818)(1.573^x)}{1 + (0.0818)(1.573^x)}$$

If farm size = 5 acres, p = 0.44; that is, the probability that the farmer is an adopter is 0.44.
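The probability at a given farm size can be computed from the reported estimates. A minimal sketch; the farm-level data are not shown in the slides, so only the reported coefficients are used, and the variable names are illustrative.

```r
# Parameter estimates reported on the slides
alpha <- -2.503
beta  <- 0.453

# Multiplicative form of the odds: exp(alpha) * exp(beta)^x
odds_at_zero    <- exp(alpha)  # about 0.0818
odds_multiplier <- exp(beta)   # about 1.573 per additional acre

# Estimated probability of adoption over the plotted range of farm sizes
farm_size <- 0:9
p_adopt   <- plogis(alpha + beta * farm_size)

# Probability of adoption for a 5-acre farm
p_at_5 <- p_adopt[farm_size == 5]
round(p_at_5, 2)
```

The `p_adopt` values trace the S-shaped curve: they increase with farm size but stay between 0 and 1.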
Modeling Responses with More than Two Categories

▪ Re-organize or ignore some of the categories temporarily, to reduce to a binary response.
▪ Divide categories into a series of binary categories.
▪ Use multinomial logistic regression as an extension of binary logistic regression.
▪ Use log-linear models if all variables are categorical.

Thank you!
