Statistical Approach to PR
Various PR approaches
Discriminant function
Case 1: We must make a decision without seeing the fish, i.e. with no features.
Decision rule:
if P(ω1) > P(ω2), choose ω1; otherwise choose ω2
P(error) = P(choose ω2 | ω1)·P(ω1) + P(choose ω1 | ω2)·P(ω2), and under this rule P(error) = min [P(ω1), P(ω2)]
e.g. with P(ω1) = 0.9 and P(ω2) = 0.1, we always choose ω1, so
P(error) = 0 * 0.9 + 1 * 0.1 = 0.1
i.e. the probability of error is 10%, which is the minimum achievable error without features.
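As a quick check of the arithmetic above, a minimal sketch in plain Python of the prior-only decision rule and its error, assuming the priors 0.9 and 0.1 from the example:

# Decide using priors only (no features observed).
priors = {"w1": 0.9, "w2": 0.1}

# Optimal prior-only rule: always pick the class with the larger prior.
decision = max(priors, key=priors.get)

# Error = probability mass of every class we never choose = min of the priors here.
p_error = sum(p for c, p in priors.items() if c != decision)

print(decision, p_error)   # w1 0.1  -> 10% error, the minimum without features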
Bayes' Theorem
To derive Bayes' theorem, start from the definition of conditional probability. The probability of the event A given the event B is
P(A | B) = P(A ∩ B) / P(B), and likewise P(B | A) = P(A ∩ B) / P(A), so
P(A | B) · P(B) = P(A ∩ B) = P(B | A) · P(A)
Discarding the middle term and dividing both sides by P(B), provided that neither P(B) nor P(A) is 0, we obtain Bayes' theorem:
P(A | B) = P(B | A) · P(A) / P(B)   ----------(1)
Bayes' Theorem
Bayes' theorem is often completed by noting that, according to the Law of Total Probability,
P(B) = P(B | A) · P(A) + P(B | A̅) · P(A̅)   -------(2)
More generally, the Law of Total Probability states that, given a partition {Ai} of the event space,
P(B) = Σi P(B | Ai) · P(Ai)   -------(3)
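A minimal numeric sketch of equations (1) and (3) in plain Python; the priors and likelihoods below are made-up illustration values, not taken from the slides:

# Partition {A1, A2, A3} of the event space, with P(Ai) and P(B|Ai).
prior = [0.5, 0.3, 0.2]          # P(Ai), must sum to 1
likelihood = [0.9, 0.5, 0.1]     # P(B|Ai)

# Law of Total Probability (3): P(B) = sum_i P(B|Ai) P(Ai)
p_b = sum(l * p for l, p in zip(likelihood, prior))

# Bayes' theorem (1): P(Ai|B) = P(B|Ai) P(Ai) / P(B)
posterior = [l * p / p_b for l, p in zip(likelihood, prior)]

print(p_b, posterior)            # the posteriors sum to 1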
Given a pattern x = (x1, x2, ..., xd)^T, a d-dimensional feature vector, assign it to one of n classes in the set C = {ω1, ω2, ..., ωn}.
Fig: A test pattern x with class priors P(ω1), P(ω2), P(ω3) and class-conditional densities p(x|ω1), p(x|ω2), p(x|ω3).
Posterior Probability
P(ωi|X), the posterior probability that a test pattern X belongs to class ωi, is given as:

P(ωi | X) = P(X | ωi) · P(ωi) / P(X)

i.e. Posterior = (Likelihood · Prior) / Evidence

i.e. To classify a test pattern with attribute vector X, we assign it to the class that is most probable given X.
This means we estimate P(ωi|X) for each class i = 1 to n, and then assign the pattern to the class with the highest posterior probability; that is, we maximize P(ωi|X).

Bayes classifier: P(ωi | X) = P(X | ωi) · P(ωi) / P(X)

So maximizing the posterior means maximizing the right-hand side of the above equation. All of its terms can be estimated from training data.
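A minimal sketch of this idea in Python with numpy: priors and (here, 1-D Gaussian) class-conditional likelihoods are estimated from labelled training data, and a test pattern is assigned to the class with the largest posterior. The Gaussian form of p(x|ωi) and the toy data are assumptions for illustration, not fixed by the slides:

import numpy as np

def gaussian_pdf(x, mean, var):
    # 1-D normal density used as the class-conditional likelihood p(x|wi).
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def fit(train_x, train_y):
    # Estimate P(wi), and the mean and variance of p(x|wi), per class.
    model = {}
    for c in np.unique(train_y):
        xc = train_x[train_y == c]
        model[c] = (len(xc) / len(train_x), xc.mean(), xc.var())
    return model

def classify(model, x):
    # Bayes rule: pick the class maximizing p(x|wi) * P(wi); P(x) is the same for all classes.
    scores = {c: gaussian_pdf(x, m, v) * prior for c, (prior, m, v) in model.items()}
    return max(scores, key=scores.get)

# toy data: class 1 around 2.0, class 2 around 6.0
train_x = np.array([1.8, 2.1, 2.4, 1.9, 5.8, 6.2, 6.1])
train_y = np.array([1, 1, 1, 1, 2, 2, 2])
print(classify(fit(train_x, train_y), 5.0))   # -> 2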
P(ωi | x) = p(x | ωi) · P(ωi) / p(x),  where  p(x) = Σ (i = 1 to c) p(x | ωi) · P(ωi)

For two classes: P(ω1 | x) + P(ω2 | x) = 1

Decision Rule:
Decide ω1 if P(ω1 | x) > P(ω2 | x), otherwise decide ω2;
equivalently, decide ω1 if p(x | ω1) · P(ω1) > p(x | ω2) · P(ω2), otherwise decide ω2.
Probability of Error
Remember that the goal is to minimize the error.
Whenever we observe a particular x, the probability of error is:
P(error | x) = P(ω1 | x) if we decide ω2
P(error | x) = P(ω2 | x) if we decide ω1
Therefore, under the Bayes decision rule,
P(error | x) = min [P(ω1 | x), P(ω2 | x)]
Fig: Components of the error probability for equal priors and a non-optimal decision point x*. The complete pink area (including the triangular region) corresponds to the probability of deciding ω1 when the true class is ω2, and the gray area corresponds to the converse.
If we select xB (the Bayes decision boundary) instead of x*, we eliminate the reducible error (the triangular region) and minimize the total error.
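A minimal numeric sketch (Python/numpy, with two made-up 1-D Gaussian classes and equal priors) showing that the total error is smallest when the decision boundary is placed where the posteriors cross:

import numpy as np

def pdf(x, mean, var):
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

xs = np.linspace(-6, 10, 4001)
dx = xs[1] - xs[0]
g1 = 0.5 * pdf(xs, 0.0, 1.0)     # P(w1) * p(x|w1)
g2 = 0.5 * pdf(xs, 3.0, 1.0)     # P(w2) * p(x|w2)

def error(boundary):
    # decide w1 left of the boundary, w2 right of it
    left, right = xs < boundary, xs >= boundary
    return (g2[left].sum() + g1[right].sum()) * dx

print(error(1.5))   # Bayes boundary xB (posteriors cross at 1.5): about 0.067
print(error(2.5))   # a non-optimal x*: larger error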
Generalization:
Allowing more than two features: replace the scalar x with a vector x from the d-dimensional feature space R^d.
Posterior probability:
p(x) is not taken into account because it is the same for all classes and does not affect the decision. Furthermore, if the a priori probabilities are equal, they also do not affect the comparison of the posterior values.
So the decision rule becomes: max [ p(x | ω1), p(x | ω2) ]   # also called the maximum-likelihood rule
i.e. The search for the maximum now rests on the values of the class-conditional pdfs evaluated at x.
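A one-line illustration of the maximum-likelihood rule in plain Python; the density values are made-up numbers standing in for p(x|ωi) evaluated at the observed x:

# p(x|w1) and p(x|w2) evaluated at the observed x (illustration values)
likelihoods = {"w1": 0.12, "w2": 0.30}

# With equal priors, pick the class whose conditional pdf is largest at x.
print(max(likelihoods, key=likelihoods.get))   # -> w2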
The discriminant function for class ωi can be taken as the posterior itself:
gi(x) = P(ωi | x) = p(x | ωi) · P(ωi) / p(x)   -------(1)
Since p(x) is the same for all classes, we can drop it, so
gi(x) = p(x | ωi) · P(ωi)   ----(2)
Taking logarithms (which does not change the ranking of the classes):
gi(x) = ln p(x | ωi) + ln P(ωi)   ----(3)
For normally distributed class-conditional densities p(x | ωi) ~ N(μi, Σi), this becomes:
gi(x) = -(1/2)(x - μi)^T Σi^-1 (x - μi) - (d/2) ln 2π - (1/2) ln|Σi| + ln P(ωi)   ----(4)
Case 1:
Σi = Σ = σ²·I (I is the identity matrix: diagonal case), i.e. equal covariance matrices for all classes, proportional to I.
Case 2:
Σi = Σ = an arbitrary covariance matrix, but identical for all classes.
Case 3:
Σi = an arbitrary, class-specific covariance matrix, not identical across classes.
(A sketch of the general Gaussian discriminant of equation (4) follows this list.)
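A minimal numpy sketch of the general Gaussian discriminant gi(x) of equation (4); the means, covariances and priors below are made-up illustration values:

import numpy as np

def g(x, mean, cov, prior):
    # gi(x) = -1/2 (x-mu)^T Sigma^-1 (x-mu) - d/2 ln 2pi - 1/2 ln|Sigma| + ln P(wi)
    d = len(mean)
    diff = x - mean
    maha = diff @ np.linalg.inv(cov) @ diff
    return -0.5 * maha - 0.5 * d * np.log(2 * np.pi) \
           - 0.5 * np.log(np.linalg.det(cov)) + np.log(prior)

x = np.array([1.0, 2.0])
classes = [
    (np.array([0.0, 0.0]), np.eye(2),              0.5),   # (mu1, Sigma1, P(w1))
    (np.array([3.0, 3.0]), np.array([[2.0, 0.5],
                                     [0.5, 1.0]]), 0.5),   # (mu2, Sigma2, P(w2))
]
scores = [g(x, m, c, p) for m, c, p in classes]
print(int(np.argmax(scores)) + 1)   # the class (1-based) with the largest gi(x)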
Case 1: Σi = σ²·I
Assuming equal covariance matrices (i.e. the classes differ only through their mean vectors μi), we can drop the class-independent constant terms (d/2) ln 2π and (1/2) ln|Σi| from (4), so:

gi(x) = -(1/2)(x - μi)^T Σi^-1 (x - μi) + ln P(ωi)   --------(5)

With Σi = σ²·I, and writing ||x - μi||² = (x - μi)^T (x - μi), this becomes:

gi(x) = -||x - μi||² / (2σ²) + ln P(ωi)   --------(6)

If P(ωi) is the same for all c classes, then the ln P(ωi) term becomes another unimportant additive constant that can be ignored:

gi(x) = -||x - μi||² / (2σ²)

To classify a feature vector x, measure the Euclidean distance ||x - μi||² between x and each of the mean vectors, and assign x to the class of the nearest mean. Such a classifier is called a minimum-distance classifier.
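A minimal numpy sketch of the minimum-distance classifier; the class means are made-up illustration values:

import numpy as np

means = {1: np.array([0.0, 0.0]),      # class means mu_i (illustration values)
         2: np.array([4.0, 4.0]),
         3: np.array([0.0, 5.0])}

def classify(x):
    # Assign x to the class whose mean is nearest in Euclidean distance,
    # i.e. maximize gi(x) = -||x - mu_i||^2 / (2 sigma^2).
    dists = {c: np.linalg.norm(x - m) for c, m in means.items()}
    return min(dists, key=dists.get)

print(classify(np.array([1.0, 4.0])))   # -> 3 (the nearest mean is mu_3)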
Starting from (6),
gi(x) = -||x - μi||² / (2σ²) + ln P(ωi)
and expanding the squared norm,
||x - μi||² = x^T x - 2 μi^T x + μi^T μi
we get
gi(x) = -(x^T x - 2 μi^T x + μi^T μi) / (2σ²) + ln P(ωi)
The quadratic term x^T x is the same for all i, making it an ignorable additive constant. Thus we obtain the equivalent linear discriminant function
gi(x) = wi^T x + wi0
where  wi = μi / σ²  and  wi0 = -μi^T μi / (2σ²) + ln P(ωi).
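A small numpy sketch computing the linear discriminant weights wi and wi0 above and checking that they rank classes exactly as the distance form does; the means, σ² and priors are made-up values:

import numpy as np

sigma2 = 1.0
means = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]   # mu_1, mu_2
priors = [0.5, 0.5]

# wi = mu_i / sigma^2,  wi0 = -mu_i^T mu_i / (2 sigma^2) + ln P(wi)
w  = [m / sigma2 for m in means]
w0 = [-(m @ m) / (2 * sigma2) + np.log(p) for m, p in zip(means, priors)]

def g_linear(x, i):
    return w[i] @ x + w0[i]

def g_distance(x, i):
    return -np.dot(x - means[i], x - means[i]) / (2 * sigma2) + np.log(priors[i])

x = np.array([2.0, 0.0])
print(g_linear(x, 0) - g_linear(x, 1))       # same value as the line below:
print(g_distance(x, 0) - g_distance(x, 1))   # the two forms differ only by a class-independent constant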
Since
gi(x) = -||x - μi||² / (2σ²) + ln P(ωi),
setting gi(x) = gj(x) for two categories gives a hyperplane decision boundary of the form
w^T (x - x0) = 0,  with  w = μi - μj  and
x0 = (1/2)(μi + μj) - [σ² / ||μi - μj||²] · ln[P(ωi) / P(ωj)] · (μi - μj)
If P(ωi) = P(ωj), then the point x0 is halfway between the means, and the hyperplane is the perpendicular bisector of the line between the means (see the numeric check below).
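A quick numpy check of the boundary point x0 above for two made-up means and unequal priors: gi(x0) and gj(x0) should be equal, and with equal priors x0 would fall at the midpoint of the means:

import numpy as np

sigma2, mu_i, mu_j = 1.0, np.array([0.0, 0.0]), np.array([4.0, 0.0])
p_i, p_j = 0.7, 0.3

def g(x, mu, p):
    return -np.dot(x - mu, x - mu) / (2 * sigma2) + np.log(p)

diff = mu_i - mu_j
x0 = 0.5 * (mu_i + mu_j) - sigma2 / np.dot(diff, diff) * np.log(p_i / p_j) * diff

print(x0)                                                # shifted toward the less probable class
print(np.isclose(g(x0, mu_i, p_i), g(x0, mu_j, p_j)))    # True: x0 lies on the boundary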
1-D Case
2-D Case
3-D Case
Case 2: Σi = Σ (arbitrary but identical for all classes)
Because the covariance matrices are identical across classes, the class-independent terms of (4) can again be dropped, giving:
gi(x) = -(1/2)(x - μi)^T Σ^-1 (x - μi) + ln P(ωi)   --------(7)
If the prior probabilities P(ωi) are the same for all classes, the ln P(ωi) term can also be ignored.
In this case the covariance matrix influences the distance measurement, so this is not the simple Euclidean distance; it is called the Mahalanobis distance from the feature vector x to μi. It is given as:
r² = (x - μi)^T Σ^-1 (x - μi)
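A minimal numpy sketch of classification by Mahalanobis distance with a shared covariance matrix; the means and Σ are made-up values:

import numpy as np

cov = np.array([[2.0, 0.8],
                [0.8, 1.0]])                            # shared covariance matrix Sigma
cov_inv = np.linalg.inv(cov)
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]    # mu_1, mu_2

def mahalanobis_sq(x, mu):
    # r^2 = (x - mu)^T Sigma^-1 (x - mu)
    d = x - mu
    return d @ cov_inv @ d

x = np.array([1.5, 2.0])
r2 = [mahalanobis_sq(x, m) for m in means]
print(r2, 1 + int(np.argmin(r2)))   # assign x to the class with the smaller r^2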
2-D Case
3-D Case
Case 3: Σi = arbitrary, class-specific covariance matrices
Because the first term now involves an arbitrary, class-specific covariance matrix, the resulting discriminant functions are inherently quadratic. Expanding the first term of (4) gives:
gi(x) = x^T Wi x + wi^T x + wi0
where  Wi = -(1/2) Σi^-1,  wi = Σi^-1 μi,  and  wi0 = -(1/2) μi^T Σi^-1 μi - (1/2) ln|Σi| + ln P(ωi).
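A small numpy sketch of the quadratic discriminant above, checking that the expanded form x^T Wi x + wi^T x + wi0 matches the direct form of equation (4); the mean, covariance and prior are made-up values:

import numpy as np

mu = np.array([1.0, 2.0])
cov = np.array([[1.5, 0.3],
                [0.3, 0.8]])
prior = 0.4
cov_inv = np.linalg.inv(cov)

# Expanded quadratic form: gi(x) = x^T Wi x + wi^T x + wi0
W  = -0.5 * cov_inv
w  = cov_inv @ mu
w0 = -0.5 * mu @ cov_inv @ mu - 0.5 * np.log(np.linalg.det(cov)) + np.log(prior)

def g_quadratic(x):
    return x @ W @ x + w @ x + w0

def g_direct(x):
    # Equation (4) without the class-independent (d/2) ln 2pi term
    d = x - mu
    return -0.5 * d @ cov_inv @ d - 0.5 * np.log(np.linalg.det(cov)) + np.log(prior)

x = np.array([2.0, -1.0])
print(np.isclose(g_quadratic(x), g_direct(x)))   # True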
Decision boundaries
In the two-category case, the decision surfaces are hyperquadrics, which can assume any of the following general forms:
1) Hyperplanes,
2) Pairs of hyperplanes,
3) Hyperspheres,
4) Hyperellipsoids,
5) Hyperparaboloids, and
6) Hyperhyperboloids.
1-D Case
2-D Case
Fig: Arbitrary Gaussian distributions lead to Bayes decision boundaries that are general hyperquadrics. Conversely, given any hyperquadric, one can find two Gaussian distributions whose Bayes decision boundary is that hyperquadric.
3-D Case
Fig: The decision regions for four normal distributions. Even with such a low
number of categories, the shapes of the boundary regions can be rather complex.
In some cases, classification errors are not equally costly: some mistakes cost more than others, e.g. in the signal-vs-noise example below.
Let αi denote the action or decision taken as per the above four cases (the four combinations of true class and decision), e.g.:
α1 = fire a missile if the observation is a signal (ω1)
α2 = do not fire a missile if the observation is noise (ω2)
But this Bayes formulation for error does not describe the risk. For the risk:
The probability that we select action αi when ωj is the true class is given as:
P(αi, ωj) = P(αi | ωj) · P(ωj)
The term P(αi | ωj) depends on the chosen mapping α(x) -> αi, which in turn depends on x. The conditional risk of taking action αi given x is:

R(αi | x) = Σ (j = 1 to n) λ(αi | ωj) · P(ωj | x) = Σ (j = 1 to n) λij · P(ωj | x)

where λ(αi | ωj) = λij is the loss incurred for taking action αi when the true class is ωj.
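A minimal numpy sketch of the minimum-risk rule implied by this conditional risk: compute R(αi|x) for every action from a loss matrix λij and the posteriors P(ωj|x), then pick the action with the smallest risk. The loss values and posteriors are made-up illustration numbers:

import numpy as np

# Loss matrix: lam[i, j] = cost of taking action a_i when the true class is w_j.
# Row 0 = fire, row 1 = do not fire; column 0 = signal, column 1 = noise.
lam = np.array([[0.0, 1.0],      # firing at noise is mildly costly
                [10.0, 0.0]])    # not firing at a real signal is very costly

posterior = np.array([0.2, 0.8])   # P(w1|x), P(w2|x) for the observed x

risk = lam @ posterior              # R(a_i|x) = sum_j lam[i, j] * P(w_j|x)
print(risk, int(np.argmin(risk)))   # choose the action with minimum risk

Note how the asymmetric loss makes "fire" the minimum-risk action even though P(ω1|x) is only 0.2.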
Thanks