What is Classification?
Assigning an object to a certain class based on its similarity to previous examples of other objects.
Can be done with reference to the original data or based on a model of that data.
E.g. Me: It's round, green, delicious and crunchy. You: It's an apple!
Examples
Classifying transactions as genuine or fraudulent, e.g. credit card usage, insurance claims, cell phone calls
Classifying prospects as good or bad customers
Classifying engine faults by their symptoms
Classifying people as healthy or sick based on their symptoms
Classifying cell lines as tumour or normal based on DNA mutations and gene expression
(Un)Certainty
As with most data mining solutions, a classification usually comes with a degree of certainty.
It might be the probability of the object belonging to the class, or it might be some other measure of how closely the object resembles other examples from that class.
Techniques
Non-parametric, e.g. k-nearest neighbours
Mathematical models, e.g. neural networks, LDA, logistic regression
Rule-based models, e.g. decision trees
Support vector machines
etc.
Prediction: models continuous-valued functions, i.e., predicts unknown or missing values
Typical Applications
credit approval
target marketing
medical diagnosis
treatment effectiveness analysis
Training data:

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no
Classifier (Model):
IF rank = professor OR years > 6
THEN tenured = yes
The classifier is then applied to testing data and to unseen data, e.g. (Jeff, Professor, 4).
Testing data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Tenured?
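The learned rule can be applied mechanically to the testing data and to the unseen example. A minimal sketch in Python (the function name and string encoding of ranks are my own):

```python
def classify(rank, years):
    # The learned rule: IF rank = professor OR years > 6 THEN tenured = yes
    return 'yes' if rank == 'Professor' or years > 6 else 'no'

# Unseen example from the slides: (Jeff, Professor, 4)
print(classify('Professor', 4))  # → 'yes'

# Testing data; note the rule misclassifies Merlisa (predicted yes, actual no),
# which is exactly what accuracy estimation on a test set is meant to reveal
for name, rank, years in [('Tom', 'Assistant Prof', 2),
                          ('Merlisa', 'Associate Prof', 7),
                          ('George', 'Professor', 5),
                          ('Joseph', 'Assistant Prof', 7)]:
    print(name, classify(rank, years))
```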
Data transformation: generalize and/or normalize the data
Robustness
Scalability
Interpretability: goodness of the rules
Classification Algorithms
[Figure: nearest-neighbour illustration — a query point xq surrounded by positive (+) and negative (−) training examples]
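The nearest-neighbour idea can be sketched as a minimal k-NN classifier; the data points, labels, and k below are illustrative:

```python
import math
from collections import Counter

def knn_classify(query, examples, k=3):
    """Classify `query` by majority vote among its k nearest labelled examples."""
    # examples: list of (point, label) pairs; distance is Euclidean
    neighbours = sorted(examples, key=lambda e: math.dist(query, e[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy data in the spirit of the figure: '+' and '-' points around a query xq
examples = [((1.0, 1.0), '+'), ((1.2, 0.8), '+'), ((0.9, 1.1), '+'),
            ((3.0, 3.0), '-'), ((3.2, 2.9), '-'), ((2.8, 3.1), '-')]
xq = (1.1, 1.0)
print(knn_classify(xq, examples, k=3))  # → '+'
```

k-NN is non-parametric: no model is built; classification is done with direct reference to the stored training data, as described earlier.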
Logistic Regression
In logistic regression the outcome variable is binary.
The purpose is to assess the effects of multiple explanatory variables, which can be numeric and/or categorical, on the outcome variable.
It is used because a categorical outcome variable violates the linearity assumption of normal regression.
The model is expressed in terms of the odds, P / (1 − P), via the logit transform:

    logit(P) = ln( P / (1 − P) )
Logistic Regression
Logistic regression analysis requires that the dependent variable be
dichotomous.
Logistic Regression
Response - Presence/Absence of characteristic
Predictor - Numeric variable observed for each case
Model - p(x), the probability of the event (P) at predictor value x:

    p(x) = e^(a + bx) / (1 + e^(a + bx))
b = 0 P(Presence) is the same at each level of x
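The logistic curve can be evaluated directly; a small sketch where the intercept a and slope b are illustrative values, not estimates from any data:

```python
import math

def p(x, a=-4.0, b=2.0):
    """Logistic model: p(x) = e^(a+bx) / (1 + e^(a+bx)), always in (0, 1)."""
    return math.exp(a + b * x) / (1.0 + math.exp(a + b * x))

print(round(p(0.0), 3))  # far below the midpoint: probability near 0
print(round(p(2.0), 3))  # at x = -a/b the probability is exactly 0.5
print(round(p(4.0), 3))  # far above the midpoint: probability near 1
```

With b = 0, the curve is flat and P(Presence) is identical at every x, matching the statement above.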
Odds Ratio
Interpretation of Regression Coefficient (b):
In linear regression, the slope coefficient is the change in the mean
response as x increases by 1 unit
In logistic regression, we can show that:

    odds(x + 1) / odds(x) = e^b

where

    odds(x) = p(x) / (1 − p(x))
If b = 0, the odds and probability are the same at all x levels (e^b = 1)
If b > 0, the odds and probability increase as x increases (e^b > 1)
If b < 0, the odds and probability decrease as x increases (e^b < 1)
Adjusted odds ratio for raising x_i by 1 unit, holding all other predictors constant:

    OR_i = e^(b_i)
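The constant odds-ratio property can be checked numerically; a sketch with illustrative intercept a and slope b:

```python
import math

def p(x, a=-1.0, b=0.5):
    # Logistic model with illustrative coefficients
    return math.exp(a + b * x) / (1.0 + math.exp(a + b * x))

def odds(x):
    return p(x) / (1.0 - p(x))

# odds(x + 1) / odds(x) equals e^b at every x, not just at one point
for x in (-2.0, 0.0, 3.0):
    print(round(odds(x + 1) / odds(x), 6))  # e^0.5 ≈ 1.648721 each time
```

This is why a logistic coefficient has a single, x-independent interpretation as a log odds ratio.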
Many models have nominal/ordinal predictors, and make wide use of dummy variables.
The coefficients are estimated by maximizing the log-likelihood:

    LL = Σ_{i=1}^{n} [ Y_i ln P(Y_i) + (1 − Y_i) ln(1 − P(Y_i)) ]
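The log-likelihood of a logistic model can be computed directly; a sketch where the data and coefficient values are illustrative:

```python
import math

def p(x, a=-1.0, b=0.5):
    # Logistic model with illustrative coefficients
    return math.exp(a + b * x) / (1.0 + math.exp(a + b * x))

def log_likelihood(data, a=-1.0, b=0.5):
    # LL = sum over i of  Y_i ln p(x_i) + (1 - Y_i) ln(1 - p(x_i))
    return sum(y * math.log(p(x, a, b)) + (1 - y) * math.log(1 - p(x, a, b))
               for x, y in data)

data = [(0.0, 0), (1.0, 0), (2.0, 1), (3.0, 1)]
print(log_likelihood(data))  # always <= 0; closer to 0 means a better fit
```

Maximum-likelihood estimation searches for the (a, b) that make this quantity as large as possible.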
Hosmer-Lemeshow Statistic
Measure of lack of fit.
Null hypothesis: there is no difference between the observed and model-predicted values.
If the p-value of the H-L goodness-of-fit test is greater than 0.05, the model's estimates fit the data at an acceptable level, i.e. the model's predictions are not significantly different from the observed values.
The model chi-square statistic compares the new model against the baseline model, with degrees of freedom

    df = k_new − k_baseline

If the significance of this chi-square statistic is less than 0.05, then the model is a significantly better fit to the data.
The significance of an individual coefficient is assessed with the Wald statistic:

    z = b / SE(b)
Example
A trial (based on 2,000 patients) with the probability of dying within 30 days as response (π) and age, sex (F = 1, M = 0), and treatment (C = 0, Tr = 1) as regressors.
Estimated multiple logistic regression model:

    logit(π) = −7.65 + 0.073 age + 0.69 sex + 0.17 treatment
                (P < 0.0001)     (P = 0.007)  (P = 0.45)
Interpretation:
The treatments have no significantly different effect (taking age and sex into account)
The older the patient, the higher the probability of dying before (or at) 30 days (taking sex and treatment into account)
Women have a significantly higher 30-day mortality rate (taking age and treatment into account)
Example
Further interpretation:
e^0.17 = 1.19: odds ratio for treatment (rt-PA), but not significant
e^0.69 = 1.99: odds ratio for sex = female
e^0.073 = 1.08: odds ratio for an increase in age of 1 year
e^0.73 = 2.08: odds ratio for an increase in age of 10 years
All odds ratios are adjusted for the other factors.
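These odds ratios follow directly from exponentiating the estimated coefficients:

```python
import math

# Coefficients from the fitted 30-day mortality model; the age (+10 yr)
# entry is 10 x 0.073
coefficients = {'treatment (rt-PA)': 0.17, 'sex (female)': 0.69,
                'age (+1 yr)': 0.073, 'age (+10 yr)': 0.73}
for name, b in coefficients.items():
    print(f"{name}: OR = {math.exp(b):.2f}")
# → 1.19, 1.99, 1.08, 2.08, matching the slide
```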
Another example
The following table shows the results of a diagnostic test, such as an x-ray or computed tomography (CT) scan, where the true disease status of the patient is known (Altman and Bland, 1994a).
The PPV and the NPV are dependent on the prevalence of the disease
in the patient population being studied (Altman and Bland, 1994b).
Prevalence
The probability of currently having the disease regardless of the
duration of time one has had the disease.
Obtained by dividing the number of people who currently have the
disease by the number of people in the study population.
Cumulative incidence
The probability that a person with no prior disease will develop a new
case of the disease over some specified time period.
Another Example
Suppose 84% of hypertensives and 23% of normotensives are classified as hypertensive by an automated blood pressure machine. What are the PV+ and PV− of the machine, assuming 20% of the adult population is hypertensive?
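The predictive values follow from Bayes' theorem; a worked sketch of this exercise (sensitivity 0.84, specificity 1 − 0.23 = 0.77, prevalence 0.20):

```python
def predictive_values(sens, spec, prev):
    """Bayes' theorem: PV+ = P(disease | test+), PV- = P(no disease | test-)."""
    ppv = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
    npv = spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)
    return ppv, npv

ppv, npv = predictive_values(sens=0.84, spec=0.77, prev=0.20)
print(round(ppv, 3), round(npv, 3))  # → 0.477 0.951
```

Note how PV+ drops well below the sensitivity once prevalence is only 20%, which is exactly the dependence on prevalence noted above.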
ROC Curve
ROC
For the example, the area under the ROC curve is 0.89.
This means that the radiologist reading the CT scan has an 89% chance of correctly distinguishing a normal from an abnormal patient based on the ordering of the CT ratings.
If the models perform similarly, and if the Gleason score is both easier to measure and more cost-effective than the other two factors, then one can go with a parsimonious model using just the Gleason score as a predictor.
Penalized Logistic Regression
Logistic model: as before, but with the coefficients estimated by maximizing a penalized log-likelihood.
ℓ1 penalization (Lasso): penalty λ Σ_j |b_j|, which shrinks the coefficients and can set some exactly to zero.
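A sketch of ℓ1-penalized (lasso) logistic regression fitted by proximal gradient descent (ISTA); the toy data, learning rate, and λ are illustrative assumptions, not from the slides:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lasso_logistic(X, y, lam, lr=0.1, iters=2000):
    """L1-penalized logistic regression via proximal gradient (ISTA):
    a gradient step on the average logistic loss, then soft-thresholding,
    which can set coefficients exactly to zero."""
    n, d = len(X), len(X[0])
    b = [0.0] * d
    for _ in range(iters):
        g = [0.0] * d                      # gradient of the average loss
        for xi, yi in zip(X, y):
            p = sigmoid(sum(bj * xj for bj, xj in zip(b, xi)))
            for j in range(d):
                g[j] += (p - yi) * xi[j] / n
        for j in range(d):                 # gradient step + soft-threshold
            bj = b[j] - lr * g[j]
            b[j] = math.copysign(max(abs(bj) - lr * lam, 0.0), bj)
    return b

# Feature 0 drives the outcome; feature 1 is noise: the lasso keeps b[0]
# and shrinks b[1] exactly to zero, performing variable selection.
X = [[1.0, 0.5], [2.0, -0.3], [3.0, 0.2],
     [-1.0, 0.4], [-2.0, -0.5], [-3.0, 0.1]]
y = [1, 1, 1, 0, 0, 0]
b = lasso_logistic(X, y, lam=0.3)
print(b)  # b[0] > 0, b[1] == 0.0
```

The exact zeros are what make the lasso useful for building parsimonious models like the Gleason-score-only model discussed earlier.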
AUC: 0.86