
Beyond Neural Network: New Algorithms for

Classification and Prediction



MAHESH PAL


Department of Civil Engineering
National Institute of Technology
Kurukshetra, 136119, INDIA


Neural networks
Support vector machines
Relevance vector machines
Random forest classifier
Extreme learning machines
3D GEOLOGICAL MODELING: SOLVING AS A CLASSIFICATION PROBLEM
WITH THE SUPPORT VECTOR MACHINE

3-D SEISMIC-BASED LITHOLOGY PREDICTION USING IMPEDANCE
INVERSION AND NEURAL NETWORKS APPLICATION: CASE-STUDY
FROM THE MANNVILLE GROUP IN EAST-CENTRAL ALBERTA, CANADA

EVALUATING CLASSIFICATION TECHNIQUES FOR MAPPING VERTICAL
GEOLOGY USING FIELD-BASED HYPERSPECTRAL SENSORS

FLOW UNIT PREDICTION WITH LIMITED PERMEABILITY DATA USING
ARTIFICIAL NEURAL NETWORK ANALYSIS (WVU, PhD, 2002)

SUBSURFACE CHARACTERIZATION WITH SUPPORT VECTOR
MACHINES

SUPPORT VECTOR MACHINES FOR DELINEATION OF GEOLOGIC
FACIES FROM POORLY DIFFERENTIATED DATA

SUPERIORITIES OF SUPPORT VECTOR MACHINE IN FRACTURE
PREDICTION AND GASSINESS EVALUATION
DYNAMICS OF WATER TRANSPORT THROUGH CATCHMENT OF DANUBE
RIVER TRACED BY ³H AND ¹⁸O - THE NEURAL NETWORK APPROACH

A COMBINED STABLE ISOTOPE AND MACHINE LEARNING APPROACH TO
QUANTIFY AND CLASSIFY NITRATE POLLUTION SOURCES IN WATER

USING GEOCHEMISTRY AND NEURAL NETWORKS TO MAP GEOLOGY
UNDER GLACIAL COVER

POROSITY AND PERMEABILITY ESTIMATION USING NEURAL NETWORK
APPROACH FROM WELL LOG DATA

ILLINOIS STATEWIDE MONITORING WELL NETWORK FOR PESTICIDES IN
SHALLOW GROUNDWATER (AQUIFER SENSITIVITY TO CONTAMINATION
BY PESTICIDE LEACHING USING NN).

APPLICATION OF ARTIFICIAL NEURAL NETWORKS IN HYDROGEOLOGY:
IDENTIFICATION OF UNKNOWN POLLUTION SOURCES IN
CONTAMINATED AQUIFERS


Classification has been a major research area using
remote sensing images.

A major input in GIS based studies.

Several approaches are used.
Classification Algorithms
Supervised - requires labelled training data


Unsupervised- searches for natural groups of
data, called clusters.
Parametric

Maximum likelihood classifier



Nonparametric

Neural network, Support vector machines,
Relevance vector machines, Random Forest
classifier, extreme learning machine
For classification/regression, a training sample is
made available to the learning algorithm (e.g.
neural network, SVM, RVM, random forest,
extreme learning machine).

After training, the learning algorithm outputs a
model or function, which is called
the hypothesis.

This hypothesis can be considered a machine
that outputs a prediction for new test data.
[Diagram: training samples → learning algorithm → model/function (also called the hypothesis) → output values for testing samples. The hypothesis can be considered a machine that provides predictions for test data.]
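A minimal sketch of this train/predict workflow, assuming scikit-learn and the iris data purely for illustration (any of the learning algorithms listed above could stand in for the SVC learner):

```python
# Sketch: training data -> learning algorithm -> hypothesis -> predictions.
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)                       # labelled samples
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

learner = SVC(kernel="rbf")                # the learning algorithm
model = learner.fit(X_train, y_train)      # training outputs the hypothesis
predictions = model.predict(X_test)        # hypothesis predicts new test data
print(model.score(X_test, y_test))         # accuracy on the testing samples
```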
Neural Network

A major research area during 1990-2000 for
classification/regression, and still in use.

No assumption about the data distribution.

Works well with different data, including remote sensing
data.
[Diagram: multilayer perceptron with input, hidden and output layers; interconnecting weights w_ij between the input and hidden layers and w_jk between the hidden and output layers]
The interconnecting weights are determined during the
training process.
A number of algorithms can be used to adjust the
interconnecting weights.
Back-propagation is the most commonly used method:
The error between actual and predicted values is fed
backwards through the network towards the input layer.
The connecting weights change in relation to the
magnitude of the error.
An iterative process is used to minimise the error.
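A hedged sketch of a back-propagation-trained network, assuming scikit-learn's MLPClassifier and illustrative parameter values (one hidden layer, learning rate, momentum, iteration limit):

```python
# Sketch: an MLP trained with back-propagation (stochastic gradient descent).
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=10, random_state=1)

mlp = MLPClassifier(hidden_layer_sizes=(20,),   # one hidden layer, 20 nodes
                    learning_rate_init=0.01,    # learning rate
                    momentum=0.9,               # momentum factor
                    max_iter=500,               # iterations
                    solver="sgd",
                    random_state=1)
mlp.fit(X, y)        # weights adjusted iteratively from the back-propagated error
print(mlp.score(X, y))
```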
Problems
Identifying user-defined parameters:
Number of hidden layers and nodes
Learning rate
Momentum factor
Number of iterations
Local minima, due to the use of a non-convex,
unconstrained minimisation problem
http://mnemstudio.org/neural-networks-multilayer-perceptron-design.htm
Support Vector Machines (SVM)
Basic theory: 1965
Margin-based classifier: 1992
Support vector network: 1995

Since 1998, the support vector network has been called the
Support Vector Machine (SVM) - used as an
alternative to neural networks.

First application: Gualtieri and Cromp (1998),
for hyperspectral image classification.
SVM: structural risk minimisation (SRM), from the
statistical learning theory proposed in the 1960s
by Vapnik and co-workers.
SRM: minimise the probability of
misclassifying unknown data drawn
randomly.

Neural network: empirical risk minimisation -
minimise the misclassification error on the
training data.
SVM
Maps the data from the original input feature
space to a very high dimensional feature
space (possibly infinite).
The data become linearly separable, but the problem
becomes computationally difficult to solve.
A kernel function allows the SVM to work in the feature
space without knowing the mapping or the
dimensionality of that space.
A kernel function:

$K(\mathbf{x}_i, \mathbf{x}_j) = \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j)$

SVM kernels need to satisfy Mercer's
theorem: any continuous, symmetric, positive
semi-definite kernel function can be expressed
as a dot product in a high-dimensional space.

Linear classification in the new space is
equivalent to non-linear classification in the
original space.
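A small numeric sketch of this identity for a degree-2 polynomial kernel; the explicit mapping below is an illustrative assumption, used only to show that the kernel value equals the dot product in the mapped space:

```python
# Sketch: a degree-2 polynomial kernel equals a dot product in a mapped
# feature space, so the mapping never has to be computed explicitly.
import numpy as np

def poly_kernel(x, z):
    return np.dot(x, z) ** 2              # K(x, z) = (x . z)^2

def explicit_map(x):
    # phi(x) for 2-D input: [x1^2, x2^2, sqrt(2) x1 x2]
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])
print(poly_kernel(x, z))                           # 16.0
print(np.dot(explicit_map(x), explicit_map(z)))    # 16.0 (same value)
```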
Linearly separable classes
For a two-class classification problem, the training
patterns are linearly separable if:

$\mathbf{w} \cdot \mathbf{x}_i + b \geq +1$  for all $y_i = +1$
$\mathbf{w} \cdot \mathbf{x}_i + b \leq -1$  for all $y_i = -1$

where $\mathbf{w}$ provides the orientation of the discriminating plane and
$b$ its offset from the origin.
The classification function will be:

$f_{\mathbf{w},b}(\mathbf{x}) = \operatorname{sign}(\mathbf{w} \cdot \mathbf{x} + b)$
To classify the dataset
There can be a large number of
discriminating planes.

SVM tries to find the plane farthest from
both classes.

Assume two supporting planes and
maximise the distance (called the margin)
between them.

A plane supports a class if all
points in that class are on
one side of that plane.
This is posed as a convex optimisation
problem: push the parallel planes apart
until they collide with a few
data points from each class.

These data points are called
support vectors.
The other training examples are
of no use.
[Diagram: optimal hyperplane w·x + b = 0 with supporting planes w·x + b = ±1, the margin between them, and the support vectors x_i]
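A short sketch, assuming scikit-learn, showing that after training only the support vectors (and the resulting w and b) define the optimal hyperplane:

```python
# Sketch: only the support vectors define the separating plane; the
# remaining training points play no role in the decision function.
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=200, centers=2, random_state=3)
clf = SVC(kernel="linear", C=10.0).fit(X, y)

print(clf.n_support_)               # number of support vectors per class
print(clf.support_vectors_)         # the points that "support" the two planes
print(clf.coef_, clf.intercept_)    # w and b of the optimal hyperplane
```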
The margin is defined by $2/\|\mathbf{w}\|$.
Maximising the margin is equivalent to
minimising the following quadratic program:

$\min_{\mathbf{w},\,b} \; \tfrac{1}{2}\|\mathbf{w}\|^{2}$
subject to $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \geq 0$

Solved by QP techniques using Lagrange
multipliers; the dual problem is to maximise

$L = \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j)$  for $\alpha_i \geq 0$
Linearly non-separable data
New optimisation problem:

$\min_{\mathbf{w},\,b,\,\xi_1 \ldots \xi_k} \; \tfrac{1}{2}\|\mathbf{w}\|^{2} + C \sum_{i=1}^{k} \xi_i$

with $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$.

$C$ is a positive constant ($C > 0$);
a larger $C$ means a higher penalty on errors.
Cortes and Vapnik (1995)
Nonlinear SVM
With the mapping $\Phi$, the dual problem becomes: maximise

$L = \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j)$

Final classification function:

$f(\mathbf{x}) = \operatorname{sign}\!\left( \sum_i \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b \right)$

Nonlinear classification via linear separation in a higher
dimensional space:
http://www.youtube.com/watch?v=9NrALgHFwTo

SVM with polynomial kernel visualization:
http://www.youtube.com/watch?v=3liCbRZPrZA
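A hedged sketch of nonlinear classification with an RBF kernel; the dataset (two concentric circles) and the parameter values are illustrative assumptions:

```python
# Sketch: two concentric circles cannot be separated by a line in the input
# space, but an RBF-kernel SVM separates them via the implicit mapping.
from sklearn.svm import SVC
from sklearn.datasets import make_circles

X, y = make_circles(n_samples=300, factor=0.4, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf", gamma=2.0, C=1.0).fit(X, y)

print("linear kernel accuracy:", linear_svm.score(X, y))   # roughly chance
print("RBF kernel accuracy:", rbf_svm.score(X, y))         # close to 1.0
```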
Advantages
Margin theory suggests no effect of the
dimensionality of the input space.
Uses only a small subset of the training data (called
support vectors).
QP solution, so no chance of local minima.
Not many user-defined parameters.
But with real data:
[Figure: classification accuracy (%) vs. number of features for training sets of 8, 15, 25, 50, 75 and 100 pixels per class]
Mahesh Pal and Giles M. Foody, 2010, Feature selection for classification of hyperspectral data by
SVM. IEEE Transactions on Geoscience and Remote Sensing, Vol. 48, No. 5, 2297-2306.
Training set size per class           8 pixels     15 pixels    25 pixels    50 pixels    75 pixels    100 pixels
Peak accuracy, % (no. of features)    74.79 (35)   81.21 (35)   84.45 (35)   88.47 (40)   91.13 (50)   92.53 (50)
Accuracy with 65 features (%)         69.79        77.05        81.66        87.58        90.63        91.76
Difference in accuracy (%)            5.00         4.16         2.79         0.89         0.50         0.77
Z value                               6.04         5.35         4.02         1.69         1.48         2.22

Disadvantages
Designed for two-class problems; different methods
are needed to build a multi-class classifier.
Choice of kernel function and kernel-specific
parameters.
The kernel function is required to satisfy the
Mercer condition.
Choice of the parameter C.
Output is not naturally probabilistic.
Multiclass results
Multiclass approach             Classification accuracy (%)   Training time
One against one                 87.90                         6.4 sec
One against rest                86.55                         30.37 sec
Directed acyclic graph          87.63                         6.5 sec
Bound constrained approach      87.29                         79.6 sec
Crammer and Singer approach     87.43                         347 min 18 sec
ECOC (exhaustive approach)      89.00                         806.6 min
Choice of kernel function
Parameter selection

Grid search and trial & error methods:
the most commonly used approach
computationally expensive (a grid-search sketch follows below)

Other approaches:
Genetic algorithm
Particle swarm optimization
their combination with grid search
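A minimal grid-search sketch over C and the RBF gamma, assuming scikit-learn and an illustrative parameter grid:

```python
# Sketch: cross-validated grid search over the SVM penalty C and RBF gamma.
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

param_grid = {"C": [0.1, 1, 10, 100],
              "gamma": [1e-4, 1e-3, 1e-2, 1e-1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)    # chosen C and gamma
print(search.best_score_)     # cross-validated accuracy
```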
SVR
http://www.saedsayad.com/support_vector_machine_reg.htm
Relevance Vector Machines

Based on a Bayesian formulation of a linear
model (Tipping, 2001).
Produces a sparser solution than the SVM
(i.e. fewer relevance vectors).
Ability to use non-Mercer kernels.
Probabilistic output.
No need to define the parameter C.
For a two-class problem, the maximum a
posteriori estimate of the weights can be
obtained by maximising the following
objective function (the data log-likelihood plus the
log of the prior over the weights):

$f(w_1, w_2, \ldots, w_n) = \sum_{i=1}^{n} \log p(c_i \mid w_i) + \sum_{i=1}^{n} \log p(w_i \mid \alpha_i)$

http://www.cs.uoi.gr/~tzikas/papers/EURASIP06.pdf
http://www.tristanfletcher.co.uk/RVM%20Explained.pdf
RVM

The solution involves calculating the gradient
of f with respect to w.
Only the training data with non-zero
coefficients $w_i$ (called relevance vectors)
contribute to the decision function.
An iterative analysis is followed to find the set of
weights that maximises the objective function.
Major difference from SVM

Selected points are anti-boundary (away from
the boundary).

Support vectors represent the least
prototypical examples (closer to the boundary,
difficult to classify).

Relevance vectors are the most prototypical
(more representative of the class).

Location of the useful training cases for
classification by SVM and RVM:

[Figure: scatter plots of Band 5 vs. Band 1 for wheat, sugar beet and oilseed rape, marking the useful training cases (support vectors and relevance vectors) selected by SVM and RVM]
Mahesh Pal and G.M. Foody, 2012, Evaluation of SVM, RVM and SMLR for accurate image classification with limited
ground data, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 5(5).
Class (no. of useful        Mahalanobis distance to class centroid       Difference of two smallest
training cases)             Wheat       Sugar beet    Oilseed rape       Mahalanobis distances

Support vectors
Wheat (4)                   4.8697      15.8246       100.2179           10.9549
Sugar beet (8)              51.9803     3.9906        47.6909            31.0740
Oilseed rape (7)            89.3444     20.9320       6.2782             15.8113

Relevance vectors
Wheat (1)                   12.9498     31.8135       171.6667           18.8637
Sugar beet (2)              68.8468     4.4170        144.2734           64.4298
Oilseed rape (4)            112.0943    35.5128       4.3981             31.1147
Disadvantages
Requires a large computational cost in
comparison to SVM.

Designed for two-class problems, similar to
SVM.

Choice of kernel.

May have a problem of local minima.
Random forest algorithm
A multistage or hierarchical algorithm.
Breaks a complex decision into a union of
several simpler decisions.
Uses different subsets of features/data at
various decision levels.

Tree-based algorithm
[Diagram: decision tree with a root node, internal nodes and terminal nodes]
A tree-based algorithm requires:
Splitting rules / tree creation (called attribute selection)
Most popular are:
a) Gain ratio criterion (Quinlan, 1993)
b) Gini index (Breiman et al., 1984)
Termination rules / pruning rules
Most popular are:
a) Error-based pruning (Quinlan, 1993)
b) Cost-complexity pruning (Breiman et al., 1984)

Attribute selection measure    Information gain   Information gain ratio   Gini index   Chi-square measure
Accuracy (%)                   83.7               84.54                    83.9         83.65
Mahesh Pal and P.M. Mather, 2003, An Assessment of the Effectiveness of Decision Tree Methods for
Land Cover Classification. Remote Sensing of Environment. 86, 554-565
Random forest

An ensemble of tree-based classifiers.
Uses a random set of features (i.e. input
variables).
Uses a bootstrapped sample of the original data.
A bootstrapped sample consists of ~63% of the
original data.
The remaining ~37% is left out and called the
out-of-bag (OOB) data.
Multiclass, and requires no pruning.

Parameters (see the sketch below):

a) Number of trees to grow
b) Number of attributes (features) used for each tree
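A minimal sketch, assuming scikit-learn and illustrative parameter values, of a random forest with these two parameters and an out-of-bag accuracy estimate:

```python
# Sketch: a random forest using the OOB samples for an internal accuracy check.
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

rf = RandomForestClassifier(n_estimators=500,   # number of trees to grow
                            max_features=3,     # features tried at each split
                            oob_score=True,     # evaluate on the ~37% OOB data
                            random_state=0)
rf.fit(X, y)
print(rf.oob_score_)                 # OOB accuracy estimate
print(rf.feature_importances_[:5])   # can also be used to rank features
```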

[Figures: test data accuracy (%) vs. number of features used (87.78, 87.48, 88.37, 88.27, 88.07 and 87.92 for 1-6 features) and test data accuracy (%) vs. number of trees (up to ~14000)]
Mahesh Pal, 2005, Random Forest Classifier for Remote Sensing Classifications. International Journal of
Remote sensing, 26(1), 217-222.
Classification results
Classifier used                 Random forest classifier    Support vector machines
Accuracy (%) and kappa value    88.37 (0.86)                87.9 (0.86)
Training time                   12.98 seconds on P-IV       0.30 minutes on a Sun machine
Can be used for:
Feature selection
Clustering of data
Outlier detection (a proximity-based sketch follows below)
Predictions/regression
Can handle categorical data and data with
missing values.
Performance comparable to SVM.
Computationally efficient.
Mahesh Pal, 2006, Support Vector Machines Based Feature Selection for land cover classification: a case
study with DIAS Hyperspectral Data. International Journal of Remote Sensing, 27(14), 2877-2894.
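A hypothetical sketch of the proximity-based outlier measure behind these uses: two samples are "close" when they fall in the same terminal node of a tree; the usual class-wise normalisation of the outlier values is omitted here for brevity:

```python
# Sketch: Breiman-style proximities and outlier measures from a fitted forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

leaves = rf.apply(X)                      # (n_samples, n_trees) leaf indices
prox = np.zeros((len(X), len(X)))
for t in range(leaves.shape[1]):          # fraction of trees sharing a leaf
    prox += (leaves[:, t][:, None] == leaves[:, t][None, :])
prox /= leaves.shape[1]

# Outlier measure: n / (sum of squared proximities to same-class samples)
outlier = np.array([len(X) / np.sum(prox[i, y == y[i]] ** 2)
                    for i in range(len(X))])
print(outlier[:10])
```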
Outliers

[Figure: random forest outlier values for ~3000 samples from seven classes]
An outlier is an observation that lies at an abnormal distance from other values in
the dataset
Clustering
[Figure: samples from the seven classes plotted against the first and second scaling coordinates]
Extreme Learning Machines
Comparison of ELM with SVR for reservoir permeability prediction
Modelling permeability prediction using ELM

A neural network classifier.
Uses one hidden layer only.
No parameter except the number of hidden nodes.
Global solution.
Performance comparable to SVM and better
than a back-propagation neural network.
Very fast.
http://www.ntu.edu.sg/home/egbhuang/pdf/ELM-WCCI2012.pdf
Huang, G.-B., Zhu, Q.-Y. and Siew, C.-K., 2006, Extreme learning machine: Theory and
applications, Neurocomputing, 70, 489-501.

For L hidden nodes, the ELM output satisfies $\sum_{i=1}^{L} \beta_i \, g(\mathbf{w}_i \cdot \mathbf{x}_j + b_i) = o_j$, $j = 1, \ldots, N$,
where the input weights $\mathbf{w}_i$ and biases $b_i$ are assigned randomly and only the output weights $\beta_i$ are learned.
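A minimal ELM sketch in NumPy, assuming a sigmoid hidden layer and illustrative sizes (this is not the authors' code): the input weights are random and the output weights follow from a pseudo-inverse:

```python
# Sketch: ELM trains in one step by solving a least-squares problem.
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
T = np.eye(len(np.unique(y)))[y]                 # one-hot target matrix

n_hidden = 50                                    # number of hidden nodes
rng = np.random.default_rng(0)
W = rng.normal(size=(X.shape[1], n_hidden))      # random input weights
b = rng.normal(size=n_hidden)                    # random biases

H = 1.0 / (1.0 + np.exp(-(X @ W + b)))           # hidden output g(w.x + b)
beta = np.linalg.pinv(H) @ T                     # output weights by pseudo-inverse

pred = np.argmax(H @ beta, axis=1)               # predictions on the training data
print((pred == y).mean())
```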


Disadvantages
Weights are randomly assigned, giving a large variation in
accuracy for the same number of hidden nodes over
different trials.

Difficult to replicate results.
Mahesh Pal, 2009, Extreme learning machine based land cover classification, International Journal of
Remote Sensing, 30(14), 3835-3841.
[Figure: classification accuracy (%) vs. number of nodes in the hidden layer (25 to 450) for the extreme learning machine]

Training time: extreme learning machine 1.25 sec; back-propagation neural network 336.20 sec
Kernelised ELM
A kernel function can be used in place of the hidden layer by
modifying the optimisation problem.
Multiclass.
Can be used for classification and regression.
The same kernel functions as used with SVM/RVM can be
used.

Encouraging results for classification and
prediction: better than SVM in terms of accuracy
and computational cost.

Huang, G.-B., Zhou, H., Ding, X. and Zhang, R., 2012, Extreme Learning Machine for Regression and Multiclass
Classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 42, 513-529.
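A hedged sketch of a kernelised ELM in the spirit of Huang et al. (2012), assuming an RBF kernel and illustrative C and gamma values; the output weights take the closed form (I/C + K)^(-1) T, where K is the kernel matrix over the training data:

```python
# Sketch: kernel ELM with a closed-form solution for the output weights.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics.pairwise import rbf_kernel

X, y = load_iris(return_X_y=True)
T = np.eye(len(np.unique(y)))[y]                       # one-hot targets

C, gamma = 10.0, 0.5
K = rbf_kernel(X, X, gamma=gamma)                      # kernel matrix
alpha = np.linalg.solve(np.eye(len(X)) / C + K, T)     # (I/C + K)^(-1) T

scores = K @ alpha                                     # f(x) = k(x)^T alpha
print((np.argmax(scores, axis=1) == y).mean())         # training accuracy
```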
No Free Lunch Theorem
"No algorithm performs better than any other when their
performance is averaged uniformly over all possible
problems of a particular type" (Wolpert and Macready, 1995).

Algorithms must be designed for a particular domain;
there is no such thing as a general-purpose algorithm.

Performance is data dependent.


http://www.tristanfletcher.co.uk/SVM%20Explained.pdf
http://www.youtube.com/watch?v=eHsErlPJWUU
{SVM by Prof. Yasser, CalTech}
http://www.youtube.com/watch?v=s8B4A5ubw6c
{SVM by Prof. Andrew Ng, Stanford}
http://videolectures.net/mlss03_tipping_pp/
{ RVM, Video lecture by Tipping}
http://www.ntu.edu.sg/home/egbhuang/pdf/ELM-WCCI2012.pdf



Questions?
