Soft-margin formulation:

$$\min_{\mathbf{w},\,b,\,\xi}\ \frac{1}{2}\lVert \mathbf{w} \rVert^{2} + C\sum_{i=1}^{k}\xi_i$$

subject to

$$y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 - \xi_i,\qquad \xi_i \ge 0,\qquad C > 0,\qquad i = 1,\ldots,k$$
Cortes and Vapnik (1995)
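Below is a minimal sketch of how this soft-margin formulation is exercised in practice, via scikit-learn's SVC with a linear kernel. The toy two-cloud dataset and the choice C = 1.0 are assumptions for illustration only.

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-class data (assumed for illustration): two noisy point clouds.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) - 1, rng.randn(20, 2) + 1])
y = np.array([-1] * 20 + [1] * 20)

# C controls the trade-off between margin width and slack (misclassification):
# small C -> wider margin, more slack; large C -> narrower margin, less slack.
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print("support vectors:", clf.support_vectors_.shape[0])
print("w =", clf.coef_[0], "b =", clf.intercept_[0])
```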
Nonlinear SVM
Nonlinear classification via linear separation in a higher-dimensional space:
http://www.youtube.com/watch?v=9NrALgHFwTo
SVM with polynomial kernel visualization:
http://www.youtube.com/watch?v=3liCbRZPrZA

Dual formulation (maximized over the Lagrange multipliers $\alpha_i$):

$$L(\alpha) = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\,\alpha_j\,y_i\,y_j\,K(\mathbf{x}_i,\mathbf{x}_j)$$

Final classification function:

$$f(\mathbf{x}) = \operatorname{sign}\!\left(\sum_i \alpha_i\,y_i\,K(\mathbf{x}_i,\mathbf{x}) + b\right)$$
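As a check on the formula above, the sketch below fits an RBF-kernel SVC on assumed toy data and reconstructs f(x) from the stored dual coefficients; in scikit-learn, dual_coef_ holds the products alpha_i * y_i.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.RandomState(1)
X = rng.randn(40, 2)
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)  # not linearly separable

clf = SVC(kernel="rbf", gamma=0.5, C=10.0).fit(X, y)

# Reconstruct f(x) = sum_i alpha_i * y_i * K(x_i, x) + b from the fitted model.
K = rbf_kernel(clf.support_vectors_, X, gamma=0.5)
f = clf.dual_coef_ @ K + clf.intercept_

assert np.allclose(f.ravel(), clf.decision_function(X))
print("predictions match:", np.array_equal(np.sign(f).ravel(), clf.predict(X)))
```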
Advantages
Margin theory suggests accuracy is largely unaffected by the dimensionality of the input space
Uses only a subset of the training data (the support vectors)
QP solution, so no local minima
Few user-defined parameters
But with real data:
[Figure: classification accuracy (%) vs. number of features, for training set sizes of 8, 15, 25, 50, 75 and 100 pixels per class.]
Mahesh Pal and Giles M. Foody, 2010, Feature selection for classification of hyperspectral data by
SVM. IEEE Transactions on Geoscience and Remote Sensing, Vol. 48, No. 5, 2297-2306.
Training set size per class

                                     8 pixels    15 pixels   25 pixels   50 pixels   75 pixels   100 pixels
Peak accuracy, % (number of features)  74.79 (35)  81.21 (35)  84.45 (35)  88.47 (40)  91.13 (50)  92.53 (50)
Accuracy with 65 features (%)          69.79       77.05       81.66       87.58       90.63       91.76
Difference in accuracy (%)             5.00        4.16        2.79        0.89        0.50        0.77
Z value                                6.04        5.35        4.02        1.69        1.48        2.22
Disadvantages
Designed for two-class problems; different methods are needed to create a multi-class classifier
Choice of kernel function and kernel-specific parameters
The kernel function is required to satisfy the Mercer condition
Choice of the parameter C
Output is not naturally probabilistic
Multiclass results
Multiclass approach            Classification accuracy (%)   Training time
One against one                87.90                         6.4 sec
One against rest               86.55                         30.37 sec
Directed Acyclic Graph         87.63                         6.5 sec
Bound constrained approach     87.29                         79.6 sec
Crammer and Singer approach    87.43                         347 min 18 sec
ECOC (exhaustive approach)     89.00                         806.6 min
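A hedged sketch of the two most common strategies, using scikit-learn's multiclass wrappers on an assumed synthetic dataset; the timings and accuracies in the table above come from the study, not from this code. One-against-one trains k(k-1)/2 binary SVMs, one-against-rest trains k.

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

# Assumed synthetic stand-in for a multiclass remote-sensing dataset.
X, y = make_classification(n_samples=500, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)

ovo = OneVsOneClassifier(SVC(kernel="rbf")).fit(X, y)   # 6 binary SVMs for k=4
ovr = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)  # 4 binary SVMs for k=4

print("one-against-one accuracy: ", ovo.score(X, y))
print("one-against-rest accuracy:", ovr.score(X, y))
```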
Choice of kernel function and parameter selection
Grid search and trial-and-error are the most commonly used approaches, but they are computationally expensive (see the sketch below).
Other approaches: genetic algorithms, particle swarm optimization, and their combination with grid search.
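A minimal grid-search sketch using scikit-learn's GridSearchCV; the grid values and synthetic data are assumptions. The cost grows with the product of the grid sizes and the number of cross-validation folds.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Exhaustive search over C and the RBF kernel width gamma, scored by 5-fold CV.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)

print("best parameters:", search.best_params_)
print("best CV accuracy:", search.best_score_)
```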
SVR (support vector regression)
http://www.saedsayad.com/support_vector_machine_reg.htm
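A short illustrative sketch of support vector regression with scikit-learn's SVR, on an assumed noisy sine curve; the epsilon and C values are arbitrary.

```python
import numpy as np
from sklearn.svm import SVR

# Assumed 1-D toy regression problem: noisy sine curve.
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 2 * np.pi, 100)).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * rng.randn(100)

# epsilon defines the tube within which errors are not penalised;
# points outside the tube become support vectors.
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
print("support vectors:", len(svr.support_))
print("prediction at pi:", svr.predict([[np.pi]]))
```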
Relevance Vector Machines
Based on a Bayesian formulation of a linear model (Tipping, 2001).
Produces a sparser solution than the SVM (i.e. fewer relevance vectors than support vectors)
Able to use non-Mercer kernels
Probabilistic output
No need to define the parameter C
For a 2-class problem, the maximum a posteriori (MAP) estimate of the weights can be obtained by maximizing the following objective function:

$$f(w_0, w_1, \ldots, w_n) = \sum_{i=1}^{n} \log p(c_i \mid w) + \sum_{i=1}^{n} \log p(w_i)$$

http://www.cs.uoi.gr/~tzikas/papers/EURASIP06.pdf
http://www.tristanfletcher.co.uk/RVM%20Explained.pdf
RVM
The solution involves calculating the gradient of f with respect to w.
Only training data with non-zero coefficients w_i (called relevance vectors) contribute to the decision function.
An iterative analysis is used to find the set of weights that maximizes the objective function (a minimal sketch follows).
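A minimal sketch of that iterative analysis, assuming a logistic likelihood, an RBF kernel design matrix and fixed prior precisions alpha; a full RVM also re-estimates alpha between weight updates, which is what drives most weights to zero.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Minimal MAP-weight sketch for a 2-class RVM-style model: logistic
# likelihood plus an independent Gaussian prior on each weight.
rng = np.random.RandomState(0)
X = rng.randn(50, 2)
c = (X[:, 0] + X[:, 1] > 0).astype(float)                 # labels in {0, 1}
Phi = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))  # RBF kernel matrix
alpha = np.ones(Phi.shape[1])   # prior precisions, held fixed in this sketch

w = np.zeros(Phi.shape[1])
for _ in range(200):                                      # iterative analysis
    grad = Phi.T @ (c - sigmoid(Phi @ w)) - alpha * w     # gradient of f wrt w
    w += 0.01 * grad                                      # gradient-ascent step

print("non-negligible weights (relevance-vector candidates):",
      int((np.abs(w) > 1e-3).sum()))
```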
Major difference from SVM:
The selected points are anti-boundary (they lie away from the decision boundary).
Support vectors represent the least prototypical examples (close to the boundary, difficult to classify).
Relevance vectors are the most prototypical (more representative of their class).
Location of the useful training cases for classification by SVM and RVM:
[Figure: two scatter plots of Band 5 vs. Band 1 for wheat, sugar beet and oilseed rape samples, showing where the SVM support vectors and the RVM relevance vectors lie.]
Mahesh Pal and G. M. Foody, 2012, Evaluation of SVM, RVM and SMLR for accurate image classification with limited ground data. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 5(5).
Mahalanobis distance to each class centroid, and the difference of the two smallest Mahalanobis distances:

Class (number of useful    Wheat       Sugar beet   Oilseed rape   Difference of two
training cases)                                                    smallest distances
Support vectors
Wheat (4)                  4.8697      15.8246      100.2179       10.9549
Sugar beet (8)             51.9803     3.9906       47.6909        31.0740
Oilseed rape (7)           89.3444     20.9320      6.2782         15.8113
Relevance vectors
Wheat (1)                  12.9498     31.8135      171.6667       18.8637
Sugar beet (2)             68.8468     4.4170       144.2734       64.4298
Oilseed rape (4)           112.0943    35.5128      4.3981         31.1147
Disadvantages
Higher computational cost than SVM
Designed for two-class problems, like SVM
Choice of kernel
May suffer from local minima
Random forest algorithm
A multistage or hierarchical algorithm
Breaks a complex decision into a union of several simpler decisions
Uses different subsets of features/data at different decision levels
Tree-based algorithms
Node types: root node, internal nodes and terminal (leaf) nodes.
A tree-based algorithm requires:
Splitting rules for tree creation (called attribute selection; see the sketch after this list). Most popular are:
a) Gain ratio criterion (Quinlan, 1993)
b) Gini index (Breiman et al., 1984)
Termination/pruning rules. Most popular are:
a) Error-based pruning (Quinlan, 1993)
b) Cost-complexity pruning (Breiman et al., 1984)
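A small sketch of the two splitting criteria named above, computed on an assumed toy split:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array, used by information-gain splitting."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    """Gini impurity, the splitting criterion of Breiman et al. (1984)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, left, right):
    """Reduction in entropy achieved by splitting parent into left/right."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Toy split on a hypothetical attribute (labels assumed for illustration).
parent = np.array([0, 0, 0, 1, 1, 1])
print(information_gain(parent, parent[:3], parent[3:]))  # 1.0: a perfect split
print(gini(parent))                                      # 0.5: maximally mixed
```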
Attribute selection measure   Information gain   Information gain ratio   Gini index   Chi-square measure
Accuracy (%)                  83.7               84.54                    83.9         83.65
Mahesh Pal and P.M. Mather, 2003, An Assessment of the Effectiveness of Decision Tree Methods for
Land Cover Classification. Remote Sensing of Environment. 86, 554-565
Random forest
An ensemble of tree-based classifiers
Uses a random subset of features (i.e. input variables) for each tree
Uses a bootstrapped sample of the original data for each tree
A bootstrapped sample contains ~63% of the original data
The remaining ~37% is left out and called the out-of-bag (OOB) data
Naturally multiclass and requires no pruning
Parameters (see the sketch below):
a) Number of trees to grow
b) Number of attributes (features) used for each tree
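A minimal sketch of these two parameters in scikit-learn's RandomForestClassifier, with the ~37% out-of-bag data used for validation; the dataset and parameter values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Assumed synthetic stand-in for a multiclass remote-sensing dataset.
X, y = make_classification(n_samples=500, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)

# n_estimators = number of trees; max_features = features tried per split;
# oob_score evaluates each tree on the samples it never saw.
rf = RandomForestClassifier(n_estimators=500, max_features=3,
                            oob_score=True, random_state=0).fit(X, y)

print("out-of-bag accuracy:", rf.oob_score_)
```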
[Figure: test data accuracy (%) vs. number of features used (1-6); reported values are 87.78, 87.48, 88.37, 88.27, 88.07 and 87.92.]
[Figure: test data accuracy (%) vs. number of trees (up to 14000).]
Mahesh Pal, 2005, Random Forest Classifier for Remote Sensing Classifications. International Journal of Remote Sensing, 26(1), 217-222.
Classification Results
Classifier used                 Random forest classifier    Support vector machines
Accuracy (%) (Kappa value)      88.37 (0.86)                87.9 (0.86)
Training time                   12.98 seconds on P-IV       0.30 minutes on Sun machine
Can be used for:
Feature selection (see the sketch below)
Clustering of data
Outlier detection
Prediction/regression
Can handle categorical data and data with missing values
Performance comparable to SVM
Computationally efficient
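As a sketch of the feature-selection use, the snippet below ranks assumed synthetic features by the forest's impurity-based importance scores and keeps the top few.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=12, n_informative=4,
                           random_state=0)

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

# Rank features by mean impurity decrease across the forest's trees --
# one simple way to use the forest for feature selection.
ranking = np.argsort(rf.feature_importances_)[::-1]
print("features ranked by importance:", ranking)
print("top 4 selected:", ranking[:4])
```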
Mahesh Pal, 2006, Support Vector Machines Based Feature Selection for land cover classification: a case study with DAIS Hyperspectral Data. International Journal of Remote Sensing, 27(14), 2877-2894.
Outliers
[Figure: outlier value vs. sample index (0-3000), plotted by class (classes 1-7).]
An outlier is an observation that lies at an abnormal distance from other values in
the dataset
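The plot above is based on a forest-derived outlier measure; scikit-learn does not expose forest proximities directly, so the sketch below substitutes IsolationForest, a related tree-ensemble outlier detector, on assumed toy data with two planted outliers.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Assumed toy data: one dense cloud plus two points at abnormal distances.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(100, 2), [[8.0, 8.0], [-9.0, 7.0]]])

iso = IsolationForest(random_state=0).fit(X)
scores = iso.score_samples(X)                # lower score = more anomalous
print("most anomalous samples:", np.argsort(scores)[:2])  # indices 100, 101
```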
Clustering
[Figure: scatter plot of the 2nd vs. 1st scaling coordinates of the data, by class (classes 1-7).]
Extreme Learning Machines
Example applications: comparison of ELM with SVR for reservoir permeability prediction; modelling permeability prediction using ELM.
A neural network classifier
Uses one hidden layer only
No parameters except the number of hidden nodes
Global solution
Performance comparable to SVM and better than back-propagation neural networks
Very fast
http://www.ntu.edu.sg/home/egbhuang/pdf/ELM-WCCI2012.pdf
Huang, G.-B., Zhu, Q.-Y. and Siew, C.-K., 2006, Extreme learning machine: Theory and applications. Neurocomputing, 70, 489-501.
Output of an ELM with L hidden nodes:

$$f(\mathbf{x}) = \sum_{i=1}^{L} \beta_i\, g(\mathbf{w}_i \cdot \mathbf{x} + b_i)$$
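A from-scratch sketch of this model: the hidden weights w_i and biases b_i are drawn at random, and only the output weights beta_i are learned, by a least-squares fit. The data, tanh activation and node count are assumptions.

```python
import numpy as np

def elm_train(X, T, n_hidden, rng):
    """Minimal ELM: random hidden layer, least-squares output weights."""
    W = rng.randn(X.shape[1], n_hidden)     # random input weights w_i
    b = rng.randn(n_hidden)                 # random biases b_i
    H = np.tanh(X @ W + b)                  # hidden-layer outputs g(.)
    beta = np.linalg.pinv(H) @ T            # output weights via pseudoinverse
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta

# Toy 2-class problem with one-hot targets (assumed data).
rng = np.random.RandomState(0)
X = rng.randn(200, 4)
y = (X[:, 0] * X[:, 1] > 0).astype(int)
T = np.eye(2)[y]                            # one-hot target matrix

W, b, beta = elm_train(X, T, n_hidden=50, rng=rng)
pred = elm_predict(X, W, b, beta).argmax(axis=1)
print("training accuracy:", (pred == y).mean())
```

Because only the linear output layer is solved for, training reduces to a single pseudoinverse, which is why ELM is so fast compared with back-propagation.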
Disadvantages
Input weights are randomly assigned, so accuracy varies widely across trials with the same number of hidden nodes
Results are difficult to replicate
Mahesh Pal, 2009, Extreme learning machine based land cover classification. International Journal of Remote Sensing, 30(14), 3835-3841.
[Figure: classification accuracy (%) vs. number of nodes in the hidden layer (25-450).]

Training time: extreme learning machine, 1.25 sec; back-propagation neural network, 336.20 sec.
Kernelised ELM
A kernel function can be used in place of the hidden layer by modifying the optimization problem.
Multiclass
Can be used for classification and regression
The same kernel functions as used with SVM/RVM can be used
Encouraging results for classification and prediction: better than SVM in terms of accuracy and computational cost (a sketch follows the reference below)
Huang, G.-B., Zhou, H., Ding, X. and Zhang, R., 2012, Extreme Learning Machine for Regression and Multiclass Classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 42, 513-529.
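A sketch of the closed-form kernelised ELM solution reported by Huang et al. (2012), beta = (I/C + K)^(-1) T, on assumed toy data; the kernel and C values are arbitrary.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Assumed toy 2-class data with one-hot targets T.
rng = np.random.RandomState(0)
X = rng.randn(150, 4)
y = (X[:, 0] + X[:, 2] > 0).astype(int)
T = np.eye(2)[y]

C, gamma = 10.0, 0.5
K = rbf_kernel(X, X, gamma=gamma)                  # kernel replaces hidden layer
beta = np.linalg.solve(np.eye(len(X)) / C + K, T)  # closed form, no iteration

pred = (K @ beta).argmax(axis=1)                   # f(x) = k(x)^T beta
print("training accuracy:", (pred == y).mean())
```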
No Free Lunch Theorem
"No algorithm performs better than any other when their performance is averaged uniformly over all possible problems of a particular type" (Wolpert and Macready, 1995).
An algorithm must be designed for a particular domain; there is no such thing as a general-purpose algorithm.
Performance is data dependent.
http://www.tristanfletcher.co.uk/SVM%20Explained.pdf
http://www.youtube.com/watch?v=eHsErlPJWUU
{SVM by Prof. Yaser Abu-Mostafa, Caltech}
http://www.youtube.com/watch?v=s8B4A5ubw6c
{SVM by Prof. Andrew Ng, Stanford}
http://videolectures.net/mlss03_tipping_pp/
{RVM, video lecture by Tipping}
http://www.ntu.edu.sg/home/egbhuang/pdf/ELM-WCCI2012.pdf
Questions?