
Discriminative classifier and
logistic regression

Le Song

Machine Learning
CS 7641, CSE/ISYE 6740, Fall 2015

Classification

Represent each data point as a feature vector x.

A label is provided for each data point, e.g., y ∈ {-1, +1}.

A classifier maps a data point x to a label y.

Boys vs. Girls (demo)

How to come up with a decision boundary?

Given the class conditional distributions P(x | y = 1), P(x | y = -1) and the class priors P(y = 1), P(y = -1), how do we compute the posterior P(y | x)?

Use Bayes rule:

P(y | x) = P(x | y) P(y) / P(x)

posterior = likelihood × prior / normalization constant

Bayes Decision Rule

Learning: class prior P(y), class conditional distribution P(x | y).

The posterior probability of a test point x: q_i(x) := P(y = i | x) ∝ P(x | y = i) P(y = i).

Bayes decision rule:

If q_1(x) > q_{-1}(x), then y = 1, otherwise y = -1.

Alternatively:

If the ratio P(x | y = 1) / P(x | y = -1) > P(y = -1) / P(y = 1), then y = 1, otherwise y = -1.

Or look at the log-likelihood ratio h(x) = ln( q_1(x) / q_{-1}(x) ): if h(x) > 0, then y = 1, otherwise y = -1.
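A minimal sketch of the Bayes decision rule above, assuming (purely for illustration) one-dimensional Gaussian class conditionals; the distributions and all numbers are hypothetical, not taken from the slides.

```python
from scipy.stats import norm

# Hypothetical class conditionals P(x | y) and class priors P(y);
# the Gaussian choice and the numbers are illustrative assumptions.
prior = {+1: 0.4, -1: 0.6}
cond = {+1: norm(loc=2.0, scale=1.0),    # P(x | y = +1)
        -1: norm(loc=-1.0, scale=1.5)}   # P(x | y = -1)

def bayes_rule(x):
    """Pick the label with the larger posterior q_i(x) ∝ P(x | y = i) P(y = i)."""
    score_pos = cond[+1].pdf(x) * prior[+1]
    score_neg = cond[-1].pdf(x) * prior[-1]
    return +1 if score_pos > score_neg else -1

print([bayes_rule(x) for x in (-2.0, 0.3, 3.0)])
```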

More on Bayes error of Bayes rule

Bayes error is the lower bound on the probability of classification error.
Bayes decision rule is the theoretically best classifier, the one that minimizes the probability of classification error.
However, computing the Bayes error or the Bayes decision rule is in general a very complex problem. Why?
Need density estimation.
Need to do an integral, e.g., P(error) = ∫ min( P(x | y = 1) P(y = 1), P(x | y = -1) P(y = -1) ) dx.

What do people do in practice?

Use simplifying assumptions for P(x | y):
Assume P(x | y = 1) is Gaussian, N(μ, Σ)
Assume P(x | y = 1) is fully factorized, Π_i P(x_i | y = 1)

Use geometric intuitions:
k-nearest neighbor classifier
Support vector machine

Directly go for the decision boundary h(x) = ln( q_1(x) / q_{-1}(x) ):
Logistic regression
Neural networks

Naïve Bayes Classifier

Use the Bayes decision rule for classification.

But assume P(x | y = 1) is fully factorized:

P(x | y = 1) = Π_{i=1}^{n} P(x_i | y = 1)

In other words, the variables corresponding to each dimension of the data are independent given the label.

Naïve Bayes classifier is a generative model

Once you have the model, you can generate samples from it. For each data point i:
Sample a label y_i ∈ {1, 2} according to the class prior P(y).
Sample the value of x_i from the class conditional P(x | y_i).
Naïve Bayes: conditioned on y_i, generate the first dimension x_i1, the second dimension x_i2, ..., independently (see the sketch below).

Difference from mixture of Gaussian models:
Purpose is different (density estimation vs. classification)
Data is different (with/without labels)
Learning is different (EM or not)

(Figure: graphical model in which the label generates each of the n dimensions independently.)
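A minimal sketch of sampling from a Naïve Bayes model, assuming (for illustration) two classes and Bernoulli class conditionals per dimension; all parameter values are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Naive Bayes parameters (illustrative assumptions):
# class prior P(y) and per-dimension Bernoulli class conditionals P(x_j = 1 | y).
class_prior = np.array([0.3, 0.7])           # P(y = 1), P(y = 2)
cond_prob = np.array([[0.9, 0.2, 0.5],       # P(x_j = 1 | y = 1), j = 1..3
                      [0.1, 0.8, 0.5]])      # P(x_j = 1 | y = 2)

def sample(m):
    """Generate m labeled points: sample y from the prior,
    then sample each dimension of x independently given y."""
    ys = rng.choice([1, 2], size=m, p=class_prior)
    xs = (rng.random((m, cond_prob.shape[1])) < cond_prob[ys - 1]).astype(int)
    return xs, ys

X, y = sample(5)
print(X, y)
```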

K-nearest neighbors

k-nearest neighbor classifier: assign a label by taking a majority vote over the k training points closest to x.
For k > 1, the k-nearest neighbor rule generalizes the nearest neighbor rule.
To define this more mathematically, let N_k(x) be the indices of the k training points closest to x.
If y_i ∈ {-1, +1}, then we can write the k-nearest neighbor classifier as:

f(x) = sign( Σ_{i ∈ N_k(x)} y_i )
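A minimal sketch of this majority-vote rule, assuming labels in {-1, +1} and Euclidean distance; the toy data is an illustrative assumption.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k):
    """k-nearest neighbor classifier: sign of the sum of the labels
    of the k training points closest to x (labels in {-1, +1})."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]          # indices N_k(x)
    return 1 if y_train[nearest].sum() >= 0 else -1

# Toy data (illustrative assumption, not from the slides).
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
y_train = np.array([-1, -1, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.6, 1.6]), k=3))
```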

Examples

(Figures: k-nearest neighbor decision boundaries for K = 1, 3, 5, 25, 51, and 101.)

Computations in K-NN

Similar to KDE, there is essentially no training or learning phase; the computation is needed when applying the classifier.
Memory: O(nd)
Test computation: O(nd)
Finding the nearest neighbors out of a set of millions of examples is still pretty hard.

Use smart data structures and algorithms to index the training data, e.g., KD-tree, Ball tree, Cover tree (see the sketch below):
Memory: O(n)
Training computation: O(n log n)
Test computation: O(log n)
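A minimal sketch of querying an indexed training set, using scikit-learn's KDTree as one such data structure (assuming scikit-learn is installed); the toy data and query are made up.

```python
import numpy as np
from sklearn.neighbors import KDTree

# Toy training data: 1000 points in 5 dimensions (illustrative assumption).
rng = np.random.default_rng(0)
X_train = rng.random((1000, 5))
y_train = rng.choice([-1, 1], size=1000)

tree = KDTree(X_train)                         # build the index once
x_test = rng.random((1, 5))
dist, idx = tree.query(x_test, k=5)            # find the 5 nearest neighbors
print(1 if y_train[idx[0]].sum() >= 0 else -1) # majority vote over the neighbors
```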

Discriminative classifier

Directly estimate the decision boundary h(x) = ln( q_1(x) / q_{-1}(x) ) or the posterior distribution P(y | x).
Logistic regression, Neural networks

Do not estimate P(x | y) and P(y).
h(x) or P(y | x) is a function of x, and does not have a probabilistic meaning for x, hence it cannot be used to sample data points.

Why discriminative classifiers?
Avoid the difficult density estimation problem.
Empirically achieve better classification results.

What is the logistic regression model?

Assume that the posterior distribution P(y = 1 | x) takes a particular form:

P(y = 1 | x, w) = 1 / (1 + exp(-w^T x))

Logistic function: σ(a) = 1 / (1 + exp(-a))

(Figure: plot of the logistic function.)
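A minimal sketch of evaluating this model; the weight vector and test point are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    """Logistic function: sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def posterior(x, w):
    """P(y = 1 | x, w) under the logistic regression model."""
    return sigmoid(w @ x)

w = np.array([0.5, -1.0, 2.0])    # made-up weights
x = np.array([1.0, 0.2, 0.3])     # made-up test point
print(posterior(x, w))            # probability of the positive class
```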

Learning parameters in logistic regression

Find w such that the conditional likelihood of the labels is maximized:

max_w  l(w) := log Π_{i=1}^{m} P(y_i | x_i, w)

Good news: l(w) is a concave function of w, and there is a single global optimum.

Bad news: no closed form solution (resort to numerical methods).

The objective function l(w)

Logistic regression model (for labels y ∈ {0, 1}):

P(y = 1 | x, w) = 1 / (1 + exp(-w^T x))

Note that

P(y = 0 | x, w) = exp(-w^T x) / (1 + exp(-w^T x))

Plug in:

l(w) = log Π_{i=1}^{m} P(y_i | x_i, w) = Σ_{i=1}^{m} [ y_i w^T x_i − log( 1 + exp(w^T x_i) ) ]

The gradient of l(w)

l(w) = Σ_{i=1}^{m} [ y_i w^T x_i − log( 1 + exp(w^T x_i) ) ]

Gradient:

∂l(w)/∂w = Σ_{i=1}^{m} x_i [ y_i − exp(w^T x_i) / (1 + exp(w^T x_i)) ] = Σ_{i=1}^{m} x_i [ y_i − P(y = 1 | x_i, w) ]

Setting it to 0 does not lead to a closed form solution.
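A minimal NumPy sketch of this objective and its gradient, assuming labels y_i ∈ {0, 1} and the data points x_i stacked as rows of X; the shapes are illustrative assumptions.

```python
import numpy as np

def log_likelihood(w, X, y):
    """l(w) = sum_i [ y_i w^T x_i - log(1 + exp(w^T x_i)) ], y_i in {0, 1}."""
    scores = X @ w
    return np.sum(y * scores - np.log1p(np.exp(scores)))

def gradient(w, X, y):
    """dl/dw = sum_i x_i (y_i - P(y = 1 | x_i, w))."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))    # P(y = 1 | x_i, w) for each row
    return X.T @ (y - p)
```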

Gradient descent/ascent

One way to solve an unconstrained optimization problem is gradient descent.
Given an initial guess, we iteratively refine the guess by taking a step in the direction of the negative gradient.
Think about going down a hill by taking the steepest direction at each step.

Update rule: w^{t+1} = w^t − η_t ∇f(w^t)

η_t is called the step size or learning rate.

Gradient Ascent/Descent algorithm (for logistic regression)

Initialize the parameter w^0.
Do
  w^{t+1} = w^t + η Σ_{i=1}^{m} x_i [ y_i − exp((w^t)^T x_i) / (1 + exp((w^t)^T x_i)) ]
While ||w^{t+1} − w^t|| > ε
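A minimal gradient-ascent training loop matching the update above; the step size, tolerance, and toy data are illustrative assumptions.

```python
import numpy as np

def fit_logistic(X, y, eta=0.01, tol=1e-6, max_iter=10000):
    """Gradient ascent on l(w): w <- w + eta * sum_i x_i (y_i - P(y=1|x_i, w))."""
    w = np.zeros(X.shape[1])                      # initialize w^0
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))        # current posteriors
        w_new = w + eta * (X.T @ (y - p))         # ascent step
        if np.linalg.norm(w_new - w) <= tol:      # stopping criterion
            return w_new
        w = w_new
    return w

# Toy data (illustrative assumption): labels in {0, 1}.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)
print(fit_logistic(X, y))
```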

Boys vs girls (demo)


Naïve Bayes vs. logistic regression

Consider y ∈ {1, -1} and x ∈ R^n.

Number of parameters:

Naïve Bayes: 2n + 1, when all random variables are binary;
4n + 1 for Gaussians: 2n means, 2n variances, and 1 for the prior.

Logistic regression: n + 1 weights w_0, w_1, w_2, ..., w_n.

Naïve Bayes vs. logistic regression II

When the model assumptions are correct:
Both Naïve Bayes and logistic regression produce good classifiers.

When the model assumptions are incorrect:
Logistic regression is less biased: it does not assume conditional independence.
Logistic regression has fewer parameters and is expected to outperform Naïve Bayes in practice.

Naïve Bayes vs. logistic regression III

Estimation method:
Naïve Bayes parameter estimates are decoupled (super easy).
Logistic regression parameter estimates are coupled (less easy).

How to estimate the parameters in logistic regression?
Maximum likelihood estimation.
More specifically, maximize the conditional likelihood of the labels.

Handwritten digits (demo)


Multiclass logistic regression

Assign input vectors x^i, i = 1, ..., m, to one of K classes C_k, k = 1, ..., K.

Assume that the posterior distribution takes a particular form (softmax):

P(y = C_k | x, w_1, ..., w_K) = exp(w_k^T x) / Σ_{k'} exp(w_{k'}^T x)

Now, let's introduce some notation:

q_k(x) := P(y = C_k | x, w_1, ..., w_K)
t_k := 1 if y = C_k, and 0 otherwise (a one-hot encoding of the label)
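A minimal sketch of the softmax posterior q_k(x), assuming the K weight vectors are stacked as rows of a matrix W; the numbers are illustrative assumptions.

```python
import numpy as np

def softmax_posterior(x, W):
    """q_k(x) = exp(w_k^T x) / sum_k' exp(w_k'^T x), for W with rows w_1..w_K."""
    scores = W @ x
    scores -= scores.max()            # subtract the max for numerical stability
    e = np.exp(scores)
    return e / e.sum()

W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])   # made-up weights, K = 3
x = np.array([0.5, 2.0])
print(softmax_posterior(x, W))        # posterior over the 3 classes; sums to 1
```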

Learning parameters in multiclass logistic regression

Given all the input data (x^i, y^i), i = 1, ..., m, the log-likelihood can be written as:

l(w_1, ..., w_K) = log Π_{i=1}^{m} Π_{k=1}^{K} q_k(x^i)^{t_k^i}
                 = Σ_{i=1}^{m} Σ_{k=1}^{K} t_k^i log q_k(x^i)
                 = Σ_{i=1}^{m} Σ_{k=1}^{K} t_k^i [ w_k^T x^i − log Σ_{k'} exp(w_{k'}^T x^i) ]

Learning parameters in multiclass logistic regression

Find w_1, ..., w_K such that the conditional likelihood of the labels is maximized.

−l(w_1, ..., w_K) is also known as the cross-entropy error function for multiclass classification.

Compute the gradient of l with respect to one parameter vector w_k:

∂l/∂w_k = Σ_{i=1}^{m} ( t_k^i − q_k(x^i) ) x^i