
SUPERVISED LEARNING WITH HYPERSPECTRAL DATA

By

SOUMYADIP CHANDRA


DEPARTMENT OF CIVIL ENGINEERING
INDIAN INSTITUTE OF TECHNOLOGY KANPUR

July 2010

SUPERVISED LEARNING WITH HYPERSPECTRAL DATA

A Dissertation Submitted in Partial Fulfillment of
the Requirements for the Degree of
Master of Technology

By

SOUMYADIP CHANDRA
(Y8103044)


DEPARTMENT OF CIVIL ENGINEERING
INDIAN INSTITUTE OF TECHNOLOGY KANPUR

July 2010

 
ABSTRACT
Hyperspectral data (HD) has the ability to provide a larger amount of spectral information than multispectral data. However, it suffers from problems such as the curse of dimensionality and data redundancy, and the size of the data set is also very large. Consequently, it is difficult to process these data sets and obtain satisfactory classification results.

The objectives of this thesis are to find the best feature extraction (FE) techniques and to improve the accuracy and time of classification of HD using parametric (Gaussian maximum likelihood (GML)), non-parametric (k-nearest neighbor (KNN)) and support vector machine (SVM) algorithms. In order to achieve these objectives, experiments were performed with different FE techniques, namely segmented principal component analysis (SPCA), kernel principal component analysis (KPCA), orthogonal subspace projection (OSP) and projection pursuit (PP). A DAIS-7915 hyperspectral sensor data set was used for the investigations in this thesis work.

From the experiments performed with the parametric and non-parametric classifiers, the GML classifier was found to give the best results, with an overall kappa value (k-value) of 95.89%. This was achieved by using 300 training pixels (TP) per class and 45 bands of the SPCA feature extracted data set.

The SVM algorithm with the quadratic programming (QP) optimizer gave the best results amongst all optimizers and approaches. An overall k-value of 96.91% was achieved by using 300 TP per class and 20 bands of the SPCA feature extracted data set. However, the supervised FE techniques such as KPCA and OSP failed to improve the results obtained by SVM significantly.

The best results obtained for GML, KNN and SVM were compared by one-tailed hypothesis testing. It was found that the SVM classifier performed significantly better than the GML classifier for a statistically large set of TP (300). For statistically exact (100) and sufficient (200) sets of TP, the performance of SVM on the SPCA extracted data set is statistically not better than the performance of the GML classifier.

ACKNOWLEDGEMENTS
I express my deep gratitude to my thesis supervisor, Dr. Onkar Dikshit, for his involvement, motivation and encouragement throughout and beyond the thesis work. His expert direction has inculcated in me qualities which I will treasure throughout my life. His patient hearing and critical comments on the research problem made me do better every time. His valuable suggestions at all stages of the thesis work helped me overcome various shortcomings in my work. I also express my sincere thanks for his effort in going through the manuscript carefully and making it more readable. It has been a great learning and life-changing experience working with him.

I would like to express my sincere tribute to Dr. Bharat Lohani for his friendly nature, excellent guidance and teaching during my stay at IITK.

I would like to specially thank Sumanta Pasari for his valuable comments on and corrections to the manuscript of my thesis.

I would like to thank all of my friends, especially Shalabh, Pankaj, Amar, Saurabh, Chotu, Manash, Kunal, Avinash, Anand, Sharat, Geeta and all other GI people, especially Shitlaji, Mauryaji and Mishraji, who made my stay a very joyous, pleasant and memorable one.

In closing, I express my cordial homage to my parents and my best friend for their unwavering support and encouragement to complete my study at IITK.

SOUMYADIP CHANDRA

July 2010

 
 

CONTENTS
CERTIFICATE………………………………………………………………………….. ii 

ABSTRACT........................................................................................................... iii 

ACKNOWLEDGEMENTS……………………………………………………………. iv 

CONTENTS………………………………………………………………………………...v

LIST OF TABLES………………………………………………………………………..ix

LIST OF FIGURES..................................................................................................x

LIST OF ABBREVIATIONS…………………………………………………………xiii

CHAPTER 1 - Introduction......................................................................... 1

1.1 High dimensional space....................................................................................... 2

1.1.1 What is hyperspectral data? ......................................................................... 2

1.1.2 Characteristics of high dimensional space .................................................. 3

1.1.3 Hyperspectral imaging ................................................................................. 4

1.2 What is classification? ......................................................................................... 5

1.2.1 Difficulties in hyperspectral data classification.......................................... 5

1.3 Background of work ............................................................................................. 6

1.4 Objectives ............................................................................................................. 7

1.5 Study area and data set used .............................................................................. 7

1.6 Software details ................................................................................................... 9


 
1.7 Structure of thesis ............................................................................................... 9

CHAPTER 2 – Literature Review ........................................................ 10

2.1 Dimensionality reduction by feature extraction .................................................. 10

2.1.1 Segmented principal component analysis (SPCA) ........................................ 11

2.1.2 Projection pursuit (PP) ............................................................................... 11

2.1.3 Orthogonal subspace projection (OSP) ..................................................... 12

2.1.4 Kernel principal component analysis (KPCA) ......................................... 12

2.2 Parametric classifiers ........................................................................................ 13

2.2.1 Gaussian maximum likelihood (GML)....................................................... 13

2.3 Non–parametric classifiers .............................................................................. 14

2.3.1 KNN ............................................................................................................. 14

2.3.2 SVM .............................................................................................................. 15

2.4 Conclusions from literature review .................................................................. 19

CHAPTER 3 – Mathematical Background ................................... 21

3.1 What is kernel? .................................................................................................. 21

3.2 Feature extraction techniques .......................................................................... 24

3.2.1 Segmented principal component analysis (SPCA) .................................... 25

3.2.2 Projection pursuit (PP) ............................................................................... 27

3.2.3 Kernel principal component analysis (KPCA) .......................................... 34

3.2.4 Orthogonal subspace projection (OSP) ...................................................... 38

 
3.3 Supervised classifier .......................................................................................... 43

3.3.1 Bayesian decision rule ................................................................................ 43

3.3.2 Gaussian maximum likelihood classification (GML): ............................... 44

3.3.3 k – nearest neighbor classification ............................................................. 44

3.3.4 Support vector machine (SVM): ................................................................. 46

3.4 Analysis of classification results ....................................................................... 58

3.4.1 One tailed hypothesis testing ..................................................................... 59

CHAPTER 4 - Experimental Design .................................................. 61

4.1 Feature extraction technique ............................................................................ 62

4.1.1 SPCA ............................................................................................................ 62

4.1.2 PP ................................................................................................................. 62

4.1.3 KPCA............................................................................................................ 63

4.1.4 OSP............................................................................................................... 64

4.2 Experimental design .......................................................................................... 64

4.3 First set of experiment (SET-I) using parametric and non-parametric

classifier ........................................................................................................................ 66

4.4 Second set of experiment (SET-II) using advance classifier ............................... 67

4.5 Parameters ...................................................................................................... 68

CHAPTER 5 - Results .................................................................................... 69

5.1 Visual inspection of feature extraction techniques ......................................... 69

 
5.2 Results for parametric and non-parametric classifiers ................................... 75

5.2.1 Results of classification using GML classifier (GMLC) ........................... 75

5.2.2 Class-wise comparison of result for GMLC ............................................... 81

5.2.3 Classification results using KNN classifier (KNNC) ................................ 82

5.2.4 Class wise comparison of results for KNNC ............................................. 91

5.3 Experiment results for SVM based classifiers ................................................. 92

5.3.1 Experiment results for SVM_QP algorithm .............................................. 93

5.3.2 Experiment results for SVM_SMO algorithm ........................................... 97

5.3.3 Experiment results for KPCA_SVM algorithm ....................................... 100

5.3.4 Class wise comparison of the best result of SVM ................................... 103

5.3.5 Comparison of results for different SVM algorithms ............................. 104

5.4 Comparison of best results of different classifiers......................................... 105

5.5 Ramifications of results ................................................................................... 107

CHAPTER 6 - Summary of Results and Conclusions ....... 109

6.1 Summary of results.......................................................................................... 109

6.2 Conclusions....................................................................................................... 112

6.3 Recommendations for future work ................................................................. 112

REFERENCES………………………………………………….……………….115

APPENDIX A……………………………………………………………………..120 
 

 
LIST OF TABLES

Table Title Page


2.1   Summary of literature review   18
3.1   Examples of common kernel functions   23
4.1   List of parameters   68
5.1   The time taken for each FE technique   71
5.2   The best kappa values and z-statistic (at 5% significance level) for GML   80
5.3   Ranking of FE techniques and time required to obtain the best k-value   80
5.4   Classification with KNNC on OD and feature extracted data sets   84
5.5   The best k-values and z-statistic for KNNC   89
5.6   Rank of FE techniques and time required to obtain the best k-value   90
5.7   The best kappa accuracy and z-statistic for SVM_QP on different feature modified data sets   95
5.8   The best k-value and z-statistic for SVM_SMO on OD and different feature modified data sets   100
5.9   The best k-value and z-statistic for KPCA_SVM on original and different feature modified data sets   104
5.10  Comparison of the best k-values with different FE techniques, classification time, and z-statistic for different SVM algorithms   106
5.11  Statistical comparison of different classifiers' results obtained for different data sets   107
5.12  Ranking of different classification algorithms depending on classification accuracy and time (Rank 1 indicates the best)   109

 
LIST OF FIGURES

Figure Title Page


1.1   Hyperspectral image cube   2
1.2   Fractional volume of a hypersphere inscribed in a hypercube decreases as dimension increases   4
1.3   Study area in La Mancha region, Madrid, Spain (Pal, 2002)   8
1.4   FCC obtained from the first 3 principal components and superimposed reference image showing training data available for classes identified for the study area   8
1.5   Google Earth image of study area   9
3.1   Overview of FE methods   24
3.2   Formation of blocks for SPCA   26
3.2a  Chart of multilayered segmented PCA   27
3.3   Layout of the regions for the chi-square projection index   30
3.4   (a) Input points before kernel PCA (b) Output after kernel PCA; the three groups are distinguishable using the first component only   37
3.5   Outline of KPCA algorithm   38
3.6   KNN classification scheme   45
3.7   Outline of KNN algorithm   46
3.8   Linear separating hyperplane for linearly separable data   49
3.9   Non-linear mapping scheme   52
3.10  Brief description of SVM_QP algorithm   54
3.11  Overview of KPCA_SVM algorithm   58
3.12  Definitions and values used in applying one-tailed hypothesis testing   60
4.1   SPCA feature extraction method   62
4.2   Projection pursuit feature extraction method   63
4.3   KPCA feature extraction method   63
4.4   OSP feature extraction method   64
4.5   Overview of classification procedure   66
4.6   Experimental scheme for Set-I experiments   67
4.7   Experimental scheme for advanced classifier (Set-II)   68
5.1   Correlation image of the original data set consisting of three blocks of 32, 6 and 27 bands respectively   70
5.2   Projection of the data points: (a) most interesting projection direction (b) second most interesting projection direction   71
5.3   First six segmented principal components (SPCs); (b) shows water body and salt lake   72
5.4   First six kernel principal components (KPCs) obtained by using 400 TP   72
5.5   First six features obtained by using eight end-members   73
5.6   Two components of most interesting projections   73
5.7   Correlation images after applying various feature extraction techniques   74
5.8   Overall kappa value observed for GML classification on different feature extracted data sets using different numbers of selected bands   78
5.9   Comparison of kappa values and classification times for GML classification method   81
5.10  Best producer accuracy of individual classes observed for GMLC on different feature extracted data sets with respect to different sets of TP   82
5.11  Overall accuracy observed for KNN classification of OD and feature extracted data sets for 25 TP   85
5.12  Overall accuracy observed for KNN classification of OD and feature extracted data sets for 100 TP   86
5.13  Overall accuracy observed for KNN classification of OD and feature extracted data sets for 200 TP   87
5.14  Overall accuracy observed for KNN classification of OD and feature extracted data sets for 300 TP   88
5.15  Time comparison for KNN classification: time for different bands at different numbers of neighbors for (a) 300 TP (b) 200 TP per class   91
5.16  Comparison of best k-value and classification time for original and feature extracted data sets   91
5.17  Class-wise accuracy comparison of OD and different feature extracted data for KNNC   92
5.18  Overall kappa values observed for classification of FE modified data sets using SVM and QP optimizer   94
5.19  Classification time comparison using 200 and 300 TP per class   97
5.20  Overall kappa values observed for classification of original and FE modified data sets using SVM with SMO optimizer   100
5.21  Comparison of classification time for different sets of TP with respect to number of bands for SVM_SMO classification algorithm   101
5.22  Overall kappa values observed for classification of original and feature modified data sets using KPCA_SVM algorithm   103
5.23  Comparison of classification accuracy of individual classes for different SVM algorithms   105

 
LIST OF ABBREVIATIONS

AC Advanced classifier
DAFE Discriminant analysis feature extraction
DAIS Digital airborne imaging spectrometer
DBFE Decision boundary feature extraction
FE Feature extraction
GML Gaussian maximum likelihood
HD Hyperspectral data
ICA Independent component analysis
KNN k-nearest neighbors
k-value Kappa value
KPCA Kernel principal component analysis
KPCA_SVM Support vector machine with Kernel principal component
analysis
MS Multispectral data
NWFE Nonparametric weighted feature extraction
Ncri Critical value
OD Original data
OSP Orthogonal subspace projection
PCA Principal component analysis
PCT Principal component transform
PP Projection pursuit
rbf Radial basis function
SPCA Segmented principal component analysis
SV Support vectors
SVM Support vector machine
SVM_QP Support vector machine with quadratic programming optimizer

 
SVM_SMO Support vector machine with sequential minimal optimization
TP Training pixels

Dedicated
to
my family & guide

 
CHAPTER 1
INTRODUCTION

Remote sensing technology has brought a new dimension to the fields of earth observation, mapping and many other application areas. In the early days of this technology, multispectral sensors were used for capturing data. Multispectral sensors capture data in a small number of bands with broad wavelength intervals. Because of the few spectral bands, their spectral resolution is insufficient to discriminate amongst many earth objects. If, however, the spectral measurement is performed using hundreds of narrow wavelength bands, then many earth objects can be characterized precisely. This is the key concept of hyperspectral imagery.

Compared to a multispectral (MS) data set, hyperspectral data (HD) has a larger information content, is voluminous and is also different in its characteristics. The extraction of this huge amount of information from HD therefore remains a challenge, and cost-effective, computationally efficient procedures are required to classify HD. Data classification is the categorization of data for its most effective and efficient use. The desired result of classification is a high-accuracy thematic map, and HD has the potential to provide it.
This chapter introduces the concepts of high dimensional space and HD and the difficulties in classifying HD. The next part focuses on the objectives of the thesis, followed by an overview of the data set used in this work. Details of the software used are mentioned next, followed by the structure of the thesis.

1.1 High dimensional space


In mathematics, an n-dimensional space is a topological space whose dimension is n (where n is a fixed natural number). A typical example is the n-dimensional Euclidean space, which describes Euclidean geometry in n dimensions. n-dimensional spaces with large values of n are sometimes called high-dimensional spaces (Werke, 1876). Many familiar geometric objects can be generalized to any number of dimensions. For example, the two-dimensional triangle and the three-dimensional tetrahedron can be seen as specific instances of the n-dimensional simplex. In addition, the circle and the sphere are particular forms of the n-dimensional hypersphere for n = 2 and n = 3 respectively (Wikipedia, 2010).

1.1.1 What is hyperspectral data?


When the spectral measurement is made using hundreds of narrow contiguous wavelength intervals, the captured image is called a hyperspectral image. A hyperspectral image is most often represented by a hyperspectral image cube (Figure 1.1). In this cube, the x and y axes specify the size of the image and the λ axis specifies the dimension, or the number of bands. The hyperspectral sensor collects the information as a set of images, one corresponding to each band, and each image represents a range of the electromagnetic spectrum.

Figure 1.1: Hyperspectral image cube (Richards and Jia, 2006)

These images are then combined to form a three-dimensional hyperspectral cube. As the dimensionality of HD is very high, it is comparable with a high dimensional space, and HD exhibits the characteristics of high dimensional spaces described in the following section.


 
1.1.2 Characteristics of high dimensional space
High dimensional spaces, i.e. spaces with a dimensionality greater than three, have properties that are substantially different from our normal sense of distance, volume and shape. In particular, in a high-dimensional Euclidean space, volume expands far more rapidly with increasing diameter than in lower-dimensional spaces, so that, for example:

(i) Almost all of the volume within a high-dimensional hypersphere lies in a thin shell near its outer "surface".
(ii) The volume within a high-dimensional hypersphere relative to a hypercube of the same width tends to zero as the dimensionality tends to infinity, and almost all of the volume of the hypercube is concentrated in its "corners".

The above characteristics have two immediate and important consequences for high dimensional data. The first is that high dimensional space is mostly empty. As a consequence, high dimensional data can be projected to a lower dimensional subspace without losing significant information in terms of separability among the different statistical classes (Jimenez and Landgrebe, 1995). The second consequence is that normally distributed data will have a tendency to concentrate in the tails; similarly, uniformly distributed data will be more likely to be collected in the corners, making density estimation more difficult. Local neighborhoods are almost empty, requiring the bandwidth of estimation to be large and producing the effect of losing detailed density estimation (Abhinav, 2009).


 
Figure 1.2: Fractional volume of a hypersphere inscribed in a hypercube decreases as dimension increases (modified after Jimenez and Landgrebe, 1995)
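For a hypersphere inscribed in a hypercube in n dimensions, the fraction of the cube's volume occupied by the sphere is π^(n/2) / (2^n Γ(n/2 + 1)), which is the quantity plotted in Figure 1.2. The short MATLAB sketch below (an illustration only, not part of the thesis code) evaluates this fraction for increasing dimension.

% Fraction of a hypercube's volume occupied by the inscribed hypersphere,
% f(n) = pi^(n/2) / (2^n * gamma(n/2 + 1)).  Illustrative sketch only.
n = 1:20;
f = pi.^(n/2) ./ (2.^n .* gamma(n/2 + 1));
disp([n' f']);      % e.g. f(3) is about 0.52 while f(10) is about 0.0025
plot(n, f, '-o'); xlabel('Dimension n'); ylabel('Volume fraction');

Even by ten dimensions the inscribed hypersphere occupies well under one percent of the hypercube, which illustrates the emptiness of high dimensional space exploited by the feature extraction techniques discussed later.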

1.1.3 Hyperspectral imaging


Hyperspectral imaging collects and processes information using the electromagnetic spectrum. Hyperspectral imagery can differentiate between many types of earth objects which may appear as the same color to the human eye, since hyperspectral sensors look at objects using a vast portion of the electromagnetic spectrum. The whole process of hyperspectral imaging can be divided into three steps: preprocessing, radiance to reflectance transformation, and data analysis (Varshney and Arora, 2004).

In particular, preprocessing is required to convert the raw radiance to sensor radiance. The preprocessing steps include operations such as spectral calibration, geometric correction, geo-coding and signal-to-noise adjustment. The radiometric and geometric accuracy of hyperspectral data differs significantly from one band to another (Varshney and Arora, 2004).


 
1.2 What is classification?
Classification means putting data into groups according to their characteristics. In the case of spectral classification, the areas of the image that have similar spectral reflectance are put into the same group or class (Abhinav, 2009). Classification can also be seen as a means of compressing image data by reducing the large range of digital numbers (DN) in several spectral bands to a few classes in a single image. Classification reduces this large spectral space into relatively few regions and obviously results in a loss of numerical information from the original image. Depending on the availability of information about the imaged region, supervised or unsupervised classification methods are used.

1.2.1 Difficulties in hyperspectral data classification


Though HD can provide a higher accuracy thematic map than MS data, there are some difficulties in the classification of high dimensional data, as listed below:

1. Curse of dimensionality and Hughes phenomenon: As the dimensionality of the data set increases with the number of bands, the number of training pixels (TP) required for training a specific classifier must also be increased to achieve the desired classification accuracy. It becomes very difficult and expensive to obtain a large number of TP for each class. This has been termed the "curse of dimensionality" by Bellman (1960), which leads to the concept of the "Hughes phenomenon" (Hughes, 1968).
2. Characteristics of high dimensional space: The characteristics of high dimensional space have been discussed above (Sec. 1.1.2). For these reasons, the algorithms that are used to classify multispectral data often fail for hyperspectral data.
3. Large number of highly correlated bands: A hyperspectral sensor uses a large number of contiguous spectral bands, and some of these bands are highly correlated. Such correlated bands do not provide good classification results. Therefore, an important task is to select the uncorrelated bands or to make the bands uncorrelated by applying feature reduction algorithms (Varshney and Arora, 2004).
4. Optimum number of features: It is very difficult to select the optimum number of bands out of a large number of bands (e.g. 224 bands for an AVIRIS image) to use in classification. To date there is no definitive algorithm or rule for selecting the optimal number of features.
5. Large data size and high processing time due to classifier complexity: A hyperspectral imaging system provides a large amount of data, so a large memory and a powerful system, which are generally expensive, are necessary to store and handle the data.

1.3 Background of work


This thesis work is an extension of the work done by Abhinav Garg (2009) in his M.Tech thesis. In his thesis, he showed that among the conventional classifiers (Gaussian maximum likelihood (GML), spectral angle mapper (SAM) and FISHER), GML provides the best result. The performance of GML improved significantly after applying feature extraction (FE) techniques. Principal component analysis (PCA) was found to work best among all the FE techniques considered (discriminant analysis FE (DAFE), decision boundary FE (DBFE), non-parametric weighted FE (NWFE) and independent component analysis (ICA)) in improving the classification accuracy of GML.

For the advanced classifiers, he showed that SVM's results do not depend strongly on the choice of parameters, whereas ANN's do. He also showed that SVM's results were improved by using PCA and ICA techniques, while supervised FE techniques like NWFE and DBFE failed to improve them significantly.

He noted some drawbacks of advanced classifiers like SVM and suggested some FE techniques which may improve the results for conventional classifiers (CC) as well as advanced classifiers (AC). In particular, for a large number of TP (e.g. 300 per class), SVM takes much more processing time than for a small number of TP. The objectives of this thesis work are to address these problems and to find the best FE technique, which will improve the classification results for HD. The objectives are described in the next section.


 
1.4 Objectives
This thesis has investigated the following two objectives pertaining to
classification with hyperspectral data:

Objective-1:
To evaluate various FE techniques for classification of hyperspectral data.

Objective-2:
To study the extent to which advanced classifiers can reduce problems related to the classification of hyperspectral data.

1.5 Study area and data set used


The study area for this research is located within an area known as 'La Mancha Alta', covering approximately 8000 sq. km to the south of Madrid, Spain (Fig. 1.3). The area is mainly used for the cultivation of wheat, barley and other crops such as vines and olives. The HD was acquired by the DAIS 7915 airborne imaging spectrometer on 29 June 2000 at 5 m resolution.

Data were collected over 79 wavebands ranging from 0.4 μm to 12.5 μm, with the exception of 1.1 μm to 1.4 μm. The first 72 bands, in the wavelength range 0.4 μm to 2.5 μm, were selected for further analysis (Pal, 2002). Striping problems were observed between bands 41 and 72. All 72 bands were visually examined and 7 bands (41, 42 and 68 to 72) were found useless due to very severe striping and were removed. Finally, 65 bands were retained and an area of 512 pixels by 512 pixels covering the area of interest was extracted (Abhinav, 2009).

The data set available for this research work includes the 65 bands retained after pre-processing and the reference image, generated with the help of field data collected by local farmers as briefed in Pal (2002). The area included in the imagery was found to be divided into eight different land cover types, namely wheat, water body, salt lake, hydrophytic vegetation, vineyards, bare soil, pasture lands and built-up area.
 


 
Figure 1.3: Study area in La Mancha region, Madrid, Spain (Pal, 2002)

Figure 1.4: FCC obtained from the first 3 principal components and superimposed reference image showing training data available for classes identified for the study area (Pal, 2002)


 
Figure 1.5: Google earth image of study area (Google earth, 2007)

1.6 Software details


For the processing of HD, a very powerful system is required due to the size of the data set and the complexity of the algorithms. The machine used for this thesis work has a 2.16 GHz Intel processor with 2 GB RAM, running the Windows 7 operating system. Matlab 7.8.0 (R2009a) was used for coding the different algorithms. All results reported here were obtained on the same machine so that the different algorithms can be compared.

1.7 Structure of thesis


The present thesis is organized into six chapters. Chapter 1 focuses on the characteristics of high dimensional space, the challenges of HD classification and an outline of the experiments in this thesis work; it also describes the study region, the data set and the software used. Chapter 2 presents a detailed description of HD classification and the previous research work related to this domain. Chapter 3 describes the detailed mathematical background of the different processes used in this work. Chapter 4 outlines the detailed methodology followed in this thesis work. Chapter 5 presents the experiments conducted for this thesis, followed by their interpretation. Chapter 6 provides the conclusions of the present work and the scope for future work.


 
CHAPTER 2
LITERATURE REVIEW 

This chapter outlines the important research works and major achievements in the field of high dimensional data analysis and data classification. The chapter begins with some of the FE techniques and classification approaches suggested by various researchers for solving problems related to HD classification. The results of relevant experiments with HD are also included to highlight the usefulness and reliability of these approaches; these results are presented in tabulated form. Some other issues related to the classification of HD are discussed at the end of this chapter.

2.1 Dimensionality reduction by feature extraction


Swain and Davis (1978) give details of various separability measures for multivariate normal class models. Various statistical classes are found to be overlapping, which causes misclassification errors, as most classifiers use a decision boundary approach for classification. The idea was to obtain a separability measure which could give an overall estimate of the range of classification accuracies achievable using a sub-set of selected features, so that the sub-set of features corresponding to the highest classification accuracy can be selected for classification (Abhinav, 2009).

FE is the process of transforming the given data from a higher dimensional space to a lower dimensional space while conserving the underlying information (Fukunaga, 1990). The philosophy behind such a transformation is to re-distribute the underlying information spread in the high dimensional space by containing it in a comparatively smaller number of dimensions without losing a significant amount of useful information. FE techniques, in the case of classification, try to enhance class separability while reducing data dimensionality (Abhinav, 2009).

 
2.1.1 Segmented principal component analysis (SPCA)
The principal component transform (PCT) has been successfully applied to multispectral data for feature reduction. It can also be used as a tool for image enhancement and digital change detection (Lodwick, 1979). For dimension reduction of HD, PCA outperforms those FE techniques which are based on class statistics (Muasher and Landgrebe, 1983). Further, as the number of TP is limited and its ratio to the number of dimensions is low for HD, the class covariance matrices cannot be estimated properly. To overcome these problems, Jia (1996) proposed the scheme of segmented principal component analysis (SPCA), which applies the PCT to each of several highly correlated blocks of bands. This approach also reduces the processing time by splitting the complete set of bands into several blocks of highly correlated bands. Jensen and James (1999) reported that SPCA-based compression generally outperforms PCA-based compression in terms of detection and classification accuracy on decompressed HD. PCA works efficiently for highly correlated data sets, but SPCA works efficiently for both highly correlated and weakly correlated data sets (Jia, 1996).

Jia (1996) compared SPCA and PCA extracted features for target detection and concluded that SPCA is a better FE technique than PCA. She also showed that both feature extracted data sets are identical and there is no loss of variance in the intermediate stages, as long as no components are removed.

2.1.2 Projection pursuit (PP)


Projection pursuit (PP) methods were originally posed and experimented with by Kruskal (1969, 1972). The PP approach was first implemented successfully by Friedman
and Tukey (1974). They described PP as a way of searching for and exploring
nonlinear structure in multi-dimensional data by examining many 2-D projections.
Their goal was to find interesting views of high dimensional data set. The next stages
in the development of the technique were presented by Jones (1983) who, amongst
other things, developed a projection index based on polynomial moments of the data.
Huber (1985) presented several aspects of PP, including the design of projection
indices. Friedman (1987) derived a transformed projection index. Hall (1989)
developed an index using methods similar to Friedman, and also developed

 
theoretical notions of the convergence of PP solutions. Posse (1995a, 1995b)
introduced a projection index called the chi-square projection pursuit index. Posse
(1995a, 1995b) used a random search method to locate a plane with an optimal value
of the projection index and combined it with the structure removal of Friedman
(1987) to get a sequence of interesting 2-D projections. Each projection found in this
manner shows a structure that is less important (in terms of the projection index)
than the previous one. More recently, the PP technique has also been used to obtain 1-D projections (Martinez, 2005). In this research work, Posse's method, which reduces an n-dimensional data set to 2-dimensional data, is followed.

2.1.3 Orthogonal subspace projection (OSP)


Harsanyi and Chang (1994) proposed the orthogonal subspace projection (OSP) method, which simultaneously reduces the data dimensionality, suppresses undesired or interfering spectral signatures, and detects the presence of a spectral signature of interest. The concept is to project each pixel vector onto a subspace which is orthogonal to the undesired signatures. For OSP to be effective, the number of bands must not be less than the number of signatures, which is a significant limitation for multispectral images. To overcome this, Ren and Chang (2000) presented the generalized OSP (GOSP) method, which relaxes this constraint so that OSP can be extended to multispectral image processing in an unsupervised fashion. OSP can be used to classify hyperspectral images (Lentilucci, 2001) and also for magnetic resonance image classification (Wang et al., 2001).

2.1.4 Kernel principal component analysis (KPCA)


Linear PCA cannot always detect all the structure in a given data set. By using a suitable nonlinear feature extractor, more information can be extracted from the data set. Kernel principal component analysis (KPCA) can be used as a powerful nonlinear FE method (Scholkopf and Smola, 2002), which maps the input vectors to a feature space and then applies PCA to the mapped vectors. KPCA is also a powerful preprocessing step for classification algorithms (Mika et al., 1998). Rosipal et al. (2001) proposed the application of the KPCA technique for feature selection in a high-dimensional feature space where the input variables were mapped by a Gaussian kernel. In contrast to linear PCA, KPCA is capable of capturing part of the higher-order statistics. To capture these higher-order statistics, a large number of TP is required. This causes problems for KPCA, since KPCA requires storing and manipulating the kernel matrix, whose size is the square of the number of TP. To overcome this problem, a new iterative algorithm for KPCA, the Kernel Hebbian Algorithm (KHA), was introduced by Scholkopf et al. (2005).

2.2 Parametric classifiers


Parametric classifiers (Fukunaga, 1990) require some parameters to develop
the assumed density function model for the given data. These parameters are
computed with the help of a set of already classified or labeled data points called
training data. It is a subset of given data for which the class labels are known and is
chosen by sampling techniques (Abhinav, 2009). It is used to compute some class
statistics to obtain the assumed density function for each class. Such classes are
referred to as statistical classes (Richards and Jia, 2006) as these are dependent upon
the training data and may differ from the actual classes.

2.2.1 Gaussian maximum likelihood (GML)


The maximum likelihood method is based on the assumption that the frequency distribution of class membership can be approximated by the multivariate normal probability distribution (Mather, 1987). Gaussian maximum likelihood (GML) is one of the most popular parametric classifiers and has been used conventionally for the classification of remotely sensed data (Landgrebe, 2003). The advantages of the GML classification method are that it can achieve the minimum classification error under the assumption that the spectral data of each class are normally distributed, and that it considers not only the class centre but also its shape, size and orientation, by calculating a statistical distance based on the mean values and covariance matrices of the clusters (Lillesand et al., 2002).
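As a concrete illustration of this decision rule, under the Gaussian assumption a pixel x is assigned to the class cj that maximises the discriminant gj(x) = ln p(cj) − ½ ln|Σj| − ½ (x − μj)^T Σj^(−1) (x − μj), where μj and Σj are the class mean and covariance estimated from the TP and p(cj) is the class prior. A minimal MATLAB sketch of this rule follows; it is illustrative only, with a hypothetical helper name, and is not the thesis implementation.

% Minimal sketch of the GML decision rule for a single pixel x.
% mu{j} (n-by-1) and Sigma{j} (n-by-n) are the mean and covariance of class j
% estimated from training pixels; prior(j) is the prior probability of class j.
function label = gml_sketch(x, mu, Sigma, prior)
    nc = numel(mu);
    g = zeros(nc, 1);
    for j = 1:nc
        d = x - mu{j};
        g(j) = log(prior(j)) - 0.5 * log(det(Sigma{j})) ...
               - 0.5 * (d' / Sigma{j}) * d;   % log prior plus log Gaussian density (constant term dropped)
    end
    [gmax, label] = max(g);                   % class with the largest discriminant
end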
Lee and Landgrebe (1993) compared the results of the GML classifier on PCA and DBFE feature extracted data sets and concluded that the DBFE feature extracted data set provides better accuracy than the PCA feature extracted data set. NWFE and DAFE FE techniques were compared for the classification accuracy achieved by nearest neighbor and GML classifiers by Kuo and Landgrebe (2004); they concluded that NWFE is a better FE technique than DAFE. Abhinav (2009) investigated the effect of PCA, ICA, DAFE, DBFE and NWFE feature extracted data sets on the GML classifier. He showed that PCA is the best FE technique for HD among the mentioned feature extractors for the GML classifier. He also suggested that FE techniques such as KPCA, OSP, SPCA and PP may improve the classification results obtained using the GML classifier.

2.3 Non–parametric classifiers


Non-parametric classifiers (Fukunaga, 1990) use some control parameters, carefully chosen by the user, to estimate the best fitting function by using an iterative or learning algorithm. They may or may not require training data for estimating the PDF. The Parzen window (Parzen, 1962) and k-nearest neighbor (KNN) (Cover and Hart, 1967) are two popular classifiers in this category. Edward (1972) gave brief descriptions of many non-parametric approaches for the estimation of data density functions.

2.3.1 KNN
The KNN algorithm (Fix and Hodges, 1951) has proven to be effective in pattern recognition. The technique can achieve high classification accuracy in problems with unknown and non-normal distributions. However, it has a major drawback: a large number of TP is required by the classifier, resulting in high computational complexity for classification (Hwang and Wen, 1998).

Pechenizkiy (2005) compared the performance of the KNN classifier on PCA and random projection (RP) feature extracted data sets and concluded that KNN performs better on the PCA feature extracted data sets. Zhu et al. (2007) showed that KNN works better on ICA feature extracted data than on the original data set (OD) (the OD was captured by a hyperspectral imaging system developed by the ISL). The ICA-KNN method with a few wavelengths had the same performance as the KNN classifier alone using information from all wavelengths.
Some more non-parametric classifiers based on geometrical approaches to data classification were found during the literature survey. These approaches consider the data points to be located in a Euclidean space and exploit the geometrical patterns of the data points for classification. Such approaches are grouped into a new class of classifiers known as machine learning techniques. Support vector machines (SVM) (Boser et al., 1992) and k-nearest neighbors (KNN) (Fix and Hodges, 1956) are among the popular classifiers of this kind. These do not make any assumptions regarding the data density function or the discriminating functions and hence are purely non-parametric classifiers. However, these classifiers also need to be trained using the training data.

2.3.2 SVM
SVM is considered an advanced classifier. It belongs to a new generation of classification techniques based on statistical learning theory, has its origins in machine learning, and was introduced by Boser, Vapnik and Guyon (1992). Vapnik (1995, 1998) discussed SVM based classification in detail. SVM improves learning by empirical risk minimization (ERM), to minimize the learning error, and minimizes the upper bound on the overall expected classification error by structural risk minimization (SRM). SVM makes use of the principle of optimal separation of classes to find a separating hyperplane that separates the classes of interest to the maximum extent by maximizing the margin between the classes (Vapnik, 1992). This technique differs from the estimation of effective decision boundaries used by Bayesian classifiers, as only the data vectors near the decision boundary (known as support vectors) are required to find the optimal hyperplane. A linear hyperplane may not be enough to classify the given data set without error. In such cases, the data are transformed to a higher dimensional space using a non-linear transformation that spreads the data apart such that a linear separating hyperplane may be found. Kernel functions are used to reduce the computational complexity that arises due to the increased dimensionality (Varshney and Arora, 2004).
The advantages of SVM (Varshney and Arora, 2004) lie in its high generalization capability and its ability to adapt its learning characteristics through kernel functions, due to which it can adequately classify data in a high-dimensional feature space with a limited number of training samples and is not affected by the Hughes phenomenon and other effects of dimensionality. The ability to classify using even a limited number of training samples makes SVM a very powerful classification tool for remotely sensed data. Thus, SVM has the potential to produce accurate classifications from HD with a limited number of training samples. SVMs are believed to be better learning machines than neural networks, which tend to overfit classes and cause misclassification (Abhinav, 2009), as they rely on margin maximization rather than finding a decision boundary directly from the training samples.
For a conventional SVM, an optimizer based on quadratic programming (QP) or linear programming (LP) methods is used to solve the optimization problem. The major disadvantage of the QP algorithm is the requirement to store the kernel matrix in memory. When the kernel matrix is large, it requires a huge amount of memory that may not always be available. To overcome this, Bennett and Campbell (2000) suggested an optimization method which sequentially updates the Lagrange multipliers, called the kernel adatron (KA) algorithm. Another approach is the decomposition method, which updates many Lagrange multipliers in each iteration, unlike other methods that update one parameter at a time (Varshney and Arora, 2004). The QP optimizer used here updates the Lagrange multipliers on a fixed-size working data set. The decomposition method uses a QP or LP optimizer to solve the problem of a huge data set by considering many small data sets rather than a single huge data set (Varshney, 2001). The sequential minimal optimization (SMO) algorithm (Platt, 1999) is a special case of the decomposition method in which the size of the working data set is fixed such that an analytical solution can be derived in very few numerical operations; it does not use QP or LP optimization methods. This method needs more iterations but requires a small number of operations per iteration, thus resulting in faster optimization for very large data sets.
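To make the QP formulation concrete, the soft-margin SVM dual can be written as minimising ½ α^T H α − Σ αi with Hij = yi yj k(xi, xj), subject to 0 ≤ αi ≤ C and Σ αi yi = 0. The minimal MATLAB sketch below solves this dual with the generic quadprog solver; it is an illustration only (hypothetical helper name) and is not the SVM_QP implementation used in this thesis.

% Minimal sketch of binary soft-margin SVM training via the dual QP, using
% quadprog (Optimization Toolbox). K is the m-by-m kernel (Gram) matrix of the
% training pixels, y an m-by-1 vector of +1/-1 labels, C the penalty parameter.
function [alpha, b] = svm_qp_sketch(K, y, C)
    m   = length(y);
    H   = (y * y') .* K;                % Hessian of the dual objective
    f   = -ones(m, 1);                  % minimise 0.5*a'*H*a - sum(a)
    Aeq = y';  beq = 0;                 % equality constraint sum(alpha_i*y_i) = 0
    lb  = zeros(m, 1);  ub = C * ones(m, 1);
    alpha = quadprog(H, f, [], [], Aeq, beq, lb, ub);
    sv = find(alpha > 1e-6 & alpha < C - 1e-6);     % unbounded support vectors
    b  = mean(y(sv) - K(sv, :) * (alpha .* y));     % bias from the KKT conditions
end

A new pixel x is then labelled by the sign of Σ αi yi k(xi, x) + b, so only the support vectors (training pixels with non-zero αi) contribute to the decision, and storing the full m × m kernel matrix is exactly the memory bottleneck described above.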
The speed of SVM classification decreases as the number of support vectors (SV) increases. By using kernel mapping, different SVM algorithms have successfully incorporated effective and flexible nonlinear models, but major difficulties arise for large data sets due to the calculation of the nonlinear kernel matrix. To overcome these computational difficulties, some authors have proposed low rank approximations to the full kernel matrix (Wiens, 1992). As an alternative, Lee and Mangasarian (2002) proposed the reduced support vector machine (RSVM), which reduces the size of the kernel matrix; however, selecting the number of support vectors (SV) remained a problem. In 2009, Sundaram proposed a method which reduces the number of SV through the application of KPCA. This method differs from the other proposed methods in that the exact choice of support vectors is not important as long as the vectors span a fixed subspace.
Benediktsson et al. (2000) applied KPCA to the ROSIS-03 data set, then used a linear SVM on the feature extracted data set and showed that KPCA features are more linearly separable than the features extracted by conventional PCA. Shah et al. (2003) compared SVM, GML and ANN classifiers for accuracy at full dimensionality and using DAFE and DBFE FE techniques on an AVIRIS data set, and concluded that SVM gives higher accuracies than GML and ANN at full dimensionality but poor accuracies for features extracted by DAFE and DBFE. Abhinav (2009) compared SVM, GML and ANN on the OD and on PCA, ICA, NWFE, DBFE and DAFE feature extracted data sets. He concluded that SVM provides better results for the OD than GML. SVM works best with PCA and ICA feature extracted data sets, whereas ANN works better with DBFE and NWFE feature extracted data sets.

The work done by various researchers with different hyperspectral data sets, using different classifiers and FE methods, and the results obtained by them are summarized in Table 2.1.
 

 
Table 2.1: Summary of literature review

Lee and Landgrebe (1993)
  Data set: Field Spectrometer System (airborne hyperspectral sensor)
  Method: GML classifier used to compare classification accuracies obtained with DBFE and PCA FE
  Result: Features extracted by DBFE produce better classification accuracies than those obtained from PCA and Bhattacharya feature selection methods

Jimenez and Landgrebe (1998)
  Data set: Simulated and real AVIRIS data
  Method: Hyperspectral data characteristics studied with respect to the effects of dimensionality and the order of data statistics used in supervised classification techniques
  Result: The Hughes phenomenon was observed as an effect of dimensionality, and classification accuracy was observed to increase with the use of higher order statistics; lower order statistics, however, were less affected by the Hughes phenomenon

Benediktsson et al. (2001)
  Data set: ROSIS-03
  Method: KPCA and PCA feature extracted data sets used for classification with a linear SVM
  Result: KPCA features are more linearly separable than features extracted by conventional PCA

Shah et al. (2003)
  Data set: AVIRIS
  Method: Compared SVM, GML and ANN classifiers for accuracy at full dimensionality and using DAFE and DBFE feature extraction techniques
  Result: SVM was found to give higher accuracies than GML and ANN at full dimensionality, but poor accuracies were obtained for features extracted by DAFE and DBFE

Kuo and Landgrebe (2004)
  Data set: Simulated and real data (HYDICE image of DC Mall, Washington, US)
  Method: NWFE and DAFE FE techniques compared for the classification accuracy achieved by nearest neighbor and GML classifiers
  Result: NWFE was found to produce better classification accuracies than DAFE

Pechenizkiy (2005)
  Data set: 20 data sets with different characteristics taken from the UCI machine learning repository
  Method: KNN classifier used to compare classification accuracies obtained by PCA and random projection FE
  Result: PCA gave better results than random projection

Zhu et al. (2007)
  Data set: Hyperspectral imaging system developed by ISL
  Method: ICA ranking methods used to select the optimal wavelengths before applying KNN; KNN alone was also used on all bands
  Result: The ICA-KNN method with a few bands had the same performance as the KNN classifier alone using all bands

Sundaram (2009)
  Data set: The adult data set, part of the UCI Machine Learning Repository
  Method: KPCA applied to the support vectors, then the usual SVM algorithm used
  Result: Significantly reduced processing time without affecting classification accuracy

Abhinav (2009)
  Data set: DAIS 7915
  Method: GML, SAM and MDM classification techniques used on PCA, ICA, NWFE, DBFE and DAFE feature extracted data sets
  Result: GML was the best among these techniques and performs best on the PCA extracted data set

Abhinav (2009)
  Data set: DAIS 7915
  Method: SVM and GML classification techniques used on the OD and on PCA, ICA, NWFE, DBFE and DAFE feature extracted data sets to compare accuracy
  Result: GML performed much worse than SVM on the OD; SVM provides better accuracy than GML and performs best on PCA and ICA extracted data sets

2.4 Conclusions from literature review


 

1. From Table 2.1 it can be concluded that FE techniques like PCA, ICA, DAFE, DBFE and NWFE perform well in improving classification accuracies when used with GML. However, the features extracted by DBFE and DAFE failed to improve the results obtained by SVM, implying a limitation of these techniques for advanced classifiers. KNN works best with PCA and ICA feature extracted data sets. In the surveyed literature, however, the effects of PP, SPCA, KPCA and OSP extracted features on the classification accuracy obtained from advanced classifiers like SVM, parametric classifiers like GML and the non-parametric classifier KNN have not been studied.
2. Another important aspect found missing in the literature is a comparison of classification times for SVM classifiers, because SVM takes a long time for training with a large number of TP. Many SVM approaches have been proposed to reduce the classification time, but there is no conclusion about the best SVM algorithm in terms of both classification accuracy and processing time.
3. Although KNN is an effective classification technique for HD, there is no guideline for classification time or suggestion of the best FE techniques for the KNN classifier. The effect of different parameters, such as the number of nearest neighbors, the number of TP and the number of bands, has also not been established for KNN.
4. It was further found during the literature survey that there is no recommendation of the best FE techniques for the different SVM algorithms, GML and KNN.

These missing aspects will be investigated in this thesis work, and guidelines for choosing an efficient and less time-consuming classification technique shall be presented as the result of this research.

This chapter presented the FE and classification techniques for mitigating the effects of dimensionality. These techniques are the result of different approaches used to deal with the problem of high dimensionality and to improve the performance of advanced, parametric and non-parametric classifiers. The approaches were applied to real-life HD, and the comparative results reported in the literature were compiled and presented here. In addition, the important aspects found missing in the literature survey, which this thesis work shall try to investigate, were highlighted. The mathematical rationale and algorithms used to apply these techniques are discussed in detail in the next chapter.
 

 
CHAPTER 3
MATHEMATICAL BACKGROUND

This chapter provides the detailed mathematical background of each of the techniques used in this thesis. Starting with some basic concepts of kernels and kernel space, it describes the unsupervised and supervised FE techniques, followed by the classification and optimization rules for the supervised classifiers. Finally, the scheme of statistical analysis used for comparing the results of the different classification techniques is discussed.
The notation followed in this chapter for matrices and vectors is given below:

X        A two-dimensional matrix whose columns represent the data points (m) and whose rows represent the number of bands (n), i.e. X = X[n, m].
xi       An n-dimensional single-pixel column vector, where X = [x1, x2, ..., xm] and xi = [x1i, x2i, ..., xni]^T.
cj       Represents the jth class.
Φ(z)     Mapping of the input vector z into the kernel space, using some kernel function.
⟨a, b⟩   The inner product of the vectors a and b.
∈        Belongs to.
Rn       The set of n-dimensional real vectors.
N        The set of natural numbers.
[ ]^T    Denotes the transpose of a matrix.
∀        For all.

3.1 What is kernel?


Before defining a kernel, consider the following two definitions:
• Input space: the space in which the data points originally lie.
• Feature space: the space spanned by the transformed data points (from the original space), which were mapped by some function.

A kernel is the dot product in the feature space H reached via a map Φ from the input space, such that Φ : X → H. A kernel can be defined as k(x, x') = ⟨Φ(x), Φ(x')⟩, where x, x' are elements of the input space, Φ(x), Φ(x') are the corresponding elements of the feature space, k is called the kernel and Φ is called the feature map associated with k (Φ may also be called the kernel function). The space containing these dot products is called the kernel space. The mapping from the input space to the feature space is nonlinear and increases the internal distance between two points in a data set. This means that a data set which is nonlinearly separable in the input space becomes linearly separable in the kernel space. A few definitions related to kernels are given below:

Gram matrix: Given a kernel k and inputs x1, x2, ..., xn ∈ X, the n × n matrix K := (k(xi, xj))ij is called the Gram matrix of k with respect to x1, x2, ..., xn.

Positive definite matrix: A real n × n symmetric matrix K satisfying x1^T K x1 ≥ 0 for every column vector x1 = (x11, x21, ..., xn1)^T ∈ Rn is called positive definite. If equality holds only for x11 = x21 = ... = xn1 = 0, then the matrix is called strictly positive definite.

Positive definite kernel: Let X be a nonempty set. A function k : X × X → R which, for all n ∈ N and all xi ∈ X (i = 1, ..., n), gives rise to a positive definite Gram matrix is called a positive definite kernel. A function k : X × X → R which, for all n ∈ N and all distinct xi ∈ X, gives rise to a strictly positive definite Gram matrix is called a strictly positive definite kernel.

Definitions of some commonly used kernel functions are shown in Table 3.1.

Table 3.1: Examples of common kernel functions (Modified after Varshney and Arora, 2004)

Kernel function type        Definition K(x, x_i)                 Parameters                           Performance depends on
Linear                      x · x_i                              none                                 Decision boundary, either linear or non-linear
Polynomial with degree n    (x · x_i + 1)^n                      n is a positive integer              User-defined parameters
Radial basis function       exp( -||x - x_i||^2 / (2σ^2) )       σ is a user-defined value            User-defined parameters
Sigmoid                     tanh( k(x · x_i) + Θ )               k and Θ are user-defined parameters  User-defined parameters

All the above definitions are illustrated with the following simple example.

Let X = [x_1  x_2  x_3] = \begin{bmatrix} 1 & 2 & 1 \\ 2 & 1 & 3 \\ 1 & 1 & 3 \end{bmatrix} be a matrix in input space whose columns (x_i, i = 1, 2, 3) are the data points and whose rows are the dimensions of the data points. Let this matrix be mapped into feature space using the Gaussian kernel function, and let ⟨x_i, x_j⟩ denote the inner product of the columns of X under this kernel. Then the Gram (kernel) matrix K takes the form

K = \begin{bmatrix} ⟨x_1, x_1⟩ & ⟨x_1, x_2⟩ & ⟨x_1, x_3⟩ \\ ⟨x_2, x_1⟩ & ⟨x_2, x_2⟩ & ⟨x_2, x_3⟩ \\ ⟨x_3, x_1⟩ & ⟨x_3, x_2⟩ & ⟨x_3, x_3⟩ \end{bmatrix}

and its numerical value is

K = \begin{bmatrix} 1.0000 & 0.0498 & 0.0821 \\ 0.0498 & 1.0000 & 0.6065 \\ 0.0821 & 0.6065 & 1.0000 \end{bmatrix}

K is a symmetric matrix. If K turns out to be positive definite, then k is called a positive definite kernel, and if K is strictly positive definite, then k is called a strictly positive definite kernel.
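A minimal Python sketch of how such a Gram matrix can be computed and checked is given below. The kernel width σ is an assumption (the value used to obtain the numbers quoted above is not stated), so the printed entries will not necessarily match them.

    import numpy as np

    # Columns of X are the three data points of the example above.
    X = np.array([[1, 2, 1],
                  [2, 1, 3],
                  [1, 1, 3]], dtype=float)

    def gaussian_gram(X, sigma=1.0):
        # Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)) over the columns of X.
        pts = X.T                                    # one row per data point
        sq_dists = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-sq_dists / (2.0 * sigma ** 2))

    K = gaussian_gram(X, sigma=1.0)
    print(K)                                         # symmetric, ones on the diagonal
    print(np.all(np.linalg.eigvalsh(K) > 0))         # True -> strictly positive definite here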

3.2 Feature extraction techniques

FE techniques are based on the simple assumption that a given data sample (x ∈ X ⊂ R^n), belonging to an unknown probability distribution in n-dimensional space, can be represented in some coordinate system in an m-dimensional space (Carreira-Perpinan, 1997). Thus, the FE techniques aim at finding an optimal coordinate system such that, when the data points from the higher dimensional space are projected onto it, a dimensionally compact representation of these data points is obtained. The following two main conditions must be satisfied to obtain an optimal dimension reduction (Carreira-Perpinan, 1997):

(i) Elimination of dimensions with very low information content. Features with low information content can be discarded as noise.
(ii) Removal of redundancy among the dimensions of the data space, i.e. the reduced feature set should be spanned by orthogonal vectors.

Both unsupervised and supervised FE techniques have been investigated in this research work (Figure 3.1). Segmented principal component analysis (SPCA) and projection pursuit (PP) are used as unsupervised techniques, while kernel principal component analysis (KPCA) and orthogonal subspace projection (OSP) are used as supervised FE techniques. The next sub-sections discuss the assumptions used by these FE techniques in detail.

Figure 3.1: Overview of FE methods

3.2.1 Segmented principal component analysis (SPCA)
The principal component transform (PCT) has been successfully applied in multispectral data analysis, where it is used as a powerful tool for FE. For hyperspectral image data, PCT outperforms those FE techniques which are based on class statistics. The main advantage of using a PCT is that global statistics are used to determine the transform functions. However, implementing PCT on a high dimensional data set requires a high computational load. SPCA can overcome the problem of long processing time by partitioning the complete data set into several highly correlated subgroups (Jia, 1996).
The complete data set is first partitioned into K subgroups with respect to the correlation of the bands. From the correlation image of HD, it can be seen that blocks are formed by highly correlated bands (Figure 3.2); these blocks are selected as the subgroups. Let n_1, n_2 and n_k be the number of bands in subgroups 1, 2 and k respectively (Figure 3.2a). PCT is then applied to each subgroup of data, and significant features are selected from each subgroup using the variance information of each component. The PCs which together contain about 99% of the variance are chosen for each block; the selected features can then be regrouped and transformed again to compress the data further.

Figure 3.2: Formation of blocks for SPCA. Here, 3 blocks, containing 32, 6 and 27
bands respectively, corresponding to highly correlated bands have been
formed from the correlation image of HYDICE hyperspectral sensor data.

Segmented PCT retains all the variance, as with the conventional PCT. No information is lost whether the transformation is conducted on the complete vector at once or a few sub-vectors are transformed separately (Jia, 1996). When the new components obtained from each segmented PCT are gathered and transformed again, the resulting data variance and covariance are identical to those of the conventional PCT. The main effect is that the data compression rate is lower in the intermediate stages compared to the no-segmentation case. However, the difference in compression rate is relatively small if the segmented transformation is developed on subgroups which have poor correlation with each other.
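The two-stage procedure described above (block-wise PCT, variance-based selection, regrouping and a second PCT) can be sketched in Python as follows; the block boundaries, the 99% threshold and the function name are illustrative parameters, not the exact implementation used in this work.

    import numpy as np

    def spca(data, block_edges, var_keep=0.99):
        # data: (n_bands, n_pixels); block_edges: band index where each correlated
        # block starts, plus n_bands as the final entry.
        selected = []
        for lo, hi in zip(block_edges[:-1], block_edges[1:]):
            block = data[lo:hi, :]                           # one highly correlated subgroup
            block = block - block.mean(axis=1, keepdims=True)
            vals, vecs = np.linalg.eigh(np.cov(block))
            order = np.argsort(vals)[::-1]                   # eigenvalues, decreasing
            vals, vecs = vals[order], vecs[:, order]
            k = np.searchsorted(np.cumsum(vals) / vals.sum(), var_keep) + 1
            selected.append(vecs[:, :k].T @ block)           # PCs retaining ~99% variance
        regrouped = np.vstack(selected)
        # second-stage PCT on the regrouped features to compress the data further
        regrouped = regrouped - regrouped.mean(axis=1, keepdims=True)
        vals, vecs = np.linalg.eigh(np.cov(regrouped))
        return vecs[:, np.argsort(vals)[::-1]].T @ regrouped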

Figure 3.2a: Chart of multilayered segmented PCA

3.2.2 Projection pursuit (PP)


Projection pursuit (PP) refers to a technique first described by Friedman and
Tukey (1974) for exploring the nonlinear structure of high dimensional data sets by
means of selected low dimensional linear projections. To reach this goal, an objective
function is assigned, called projection index, to every projection characterizing the
structure present in the projection. Interesting projections are then automatically
picked up by optimizing the projection index numerically. The notion of interesting
projections has usually been defined as the ones exhibiting departure from normality
(normal distribution function) (Diaconis and Freedman, 1984; Huber, 1985).
Posse (1990) proposed an algorithm based on a random search and a chi-
squared projection index for finding the most interesting plane (two-dimensional
view). The optimization method was able to locate in general the global maximum of
the projection index over all two-dimensional projections (Posse, 1995). The chi-
squared index was efficient, being fast to compute and sensitive to departure from
normality in the core rather than in the tail of the distribution. In this investigation
only chi-squared (Posse, 1995a, 1995b) projection index has been used.
Projection pursuit exploratory data analysis (PPEDA) consists of the following two parts:

(i) A projection pursuit index measures the degree of departure from normality.
(ii) A method for finding the projection that yields the highest value for the index.

Posse (1995a, 1995b) used a random search to locate a plane with an optimal
value of the projection index and combined it with the structure removal of Friedman
(1987) to get a sequence of interesting 2-D projections. The interesting projections are
found in decreasing order of the value of the PP index. This implies that each
projection found in this manner shows a structure that is less important (in terms of
the projection index) than the previous one. In the following discussion, first the chi-
squared PP index has been described followed by the structure finding procedure.
Finally, the structure removal procedure is illustrated.

3.2.2.1 Posse chi-square index


Posse proposed an index based on the chi-square statistic. The plane is first divided into 48 regions or boxes B_k, k = 1, 2, ..., 48, that are distributed in the form of rings (Figure 3.3). The inner boxes have the same radial width R/5, equal to (2 log 6)^{1/2}/5, and all boxes have the same angular width of 45°; R is chosen so that the boxes have approximately the same weight under normally distributed data. The outer boxes each have weight 1/48 under normally distributed data. This choice of radial width provides regions with approximately the same probability for the standard bivariate normal distribution (Martinez, 2001). The projection index is given as:
PI_{\chi^2}(\alpha,\beta) = \frac{1}{9}\sum_{j=0}^{8}\sum_{k=1}^{48}\frac{1}{c_k}\left[\frac{1}{n}\sum_{i=1}^{n} I_{B_k}\!\left(z_i^{\alpha(\lambda_j)},\, z_i^{\beta(\lambda_j)}\right) - c_k\right]^2          (3.1)

Where,
φ              The standard bivariate normal density.
c_k            Probability over the kth region under the normal density, c_k = \iint_{B_k} \varphi \, dz_1\, dz_2.
B_k            Box (region) in the projection plane.
λ_j            λ_j = jπ/36, j = 0, ..., 8, the angle by which the data are rotated in the plane before being assigned to regions.
α, β           Orthonormal p-dimensional vectors which span the projection plane (they can be the first two PCs or two randomly chosen pixels of the OD set).
P(α, β)        The plane consisting of the two orthonormal vectors α, β.
z_i^α, z_i^β   Sphered observations projected onto the vectors α and β (z_i^α = Z_i^T α and z_i^β = Z_i^T β).
α(λ_j)         α cos λ_j − β sin λ_j
β(λ_j)         α sin λ_j + β cos λ_j
I_{B_k}        The indicator function for region B_k.
PI_{χ²}(α, β)  The chi-square projection index evaluated using the data projected onto the plane spanned by α and β.

The chi-square projection index is not affected by the presence of outliers.


However, it is sensitive to distributions that have a hole in the core, and it will also
yield projections that contain clusters. The chi-square projection pursuit index is fast
and easy to compute, making it appropriate for large sample sizes. Posse (1995a)
provides a formula to approximate the percentiles of the chi-square index.

Figure 3.3: Layout of the regions for the chi-square projection index; the eight outer regions each have weight 1/48, the inner rings have radial width R/5 and all regions span 45° (Modified after Posse, 1995a).

3.2.2.2 Finding the structure (PPEDA algorithm)


For PPEDA, the projection pursuit index PI_{χ²}(α, β) must be optimized over all possible projections onto 2-D planes. Posse (1990) proposed a random search for locating the global maximum of the projection index. Combined with the structure-removal procedure, this gives a sequence of interesting two-dimensional views of decreasing importance. Starting with random planes, the algorithm tries to improve the current best solution (α*, β*) by considering two candidate planes (a_1, b_1) and (a_2, b_2) within a neighborhood of (α*, β*). These candidate planes are given by,

a_1 = \frac{\alpha^* + c\,v}{\lVert \alpha^* + c\,v \rVert}, \qquad b_1 = \frac{\beta^* - (a_1^T \beta^*)\,a_1}{\lVert \beta^* - (a_1^T \beta^*)\,a_1 \rVert}
a_2 = \frac{\alpha^* - c\,v}{\lVert \alpha^* - c\,v \rVert}, \qquad b_2 = \frac{\beta^* - (a_2^T \beta^*)\,a_2}{\lVert \beta^* - (a_2^T \beta^*)\,a_2 \rVert}          (3.2)

where c is a scalar that determines the size of the neighborhood visited, and v is a unit p-vector uniformly distributed on the unit p-dimensional sphere. The idea is to start with a global search and then to concentrate on the region of the global maximum by decreasing the value of c. After a specified number of steps, called half, without an increase in the projection index, the value of c is halved. When this value is small enough, the optimization is stopped. Part of the search still remains global, to avoid being trapped in a spurious local optimum. The complete search for the best plane consists of m such random searches with different random starting planes. The goal of the PP algorithm is to find the best projection plane.

The steps for PPEDA are given below:


1. Sphere the OD set; let Z be the matrix of the sphered data.
2. Generate a random starting plane (α_0, β_0), where α_0 and β_0 are orthonormal. Consider this as the current best plane (α*, β*).
3. Evaluate the projection index PI_{χ²}(α*, β*) for the starting plane.
4. Generate two candidate planes (a_1, b_1) and (a_2, b_2) according to Eq. (3.2), as illustrated in the sketch after this list.
5. Calculate the projection index for these candidate planes.
6. Choose the candidate plane with the higher value of the projection pursuit index as the current best plane (α*, β*).
7. Repeat steps 4 through 6 while there are improvements in the projection pursuit index.
8. If the index does not improve for the specified number of steps (half), halve the value of c.
9. Repeat steps 4 to 8 until c becomes smaller than some small number (say 0.01).
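As a concrete illustration of step 4, the Python sketch below generates the two candidate planes of Eq. (3.2); evaluating the chi-square index of Eq. (3.1) is not reproduced here, and drawing the uniform direction v by normalizing a Gaussian vector is an implementation assumption.

    import numpy as np

    def candidate_planes(alpha, beta, c, rng):
        # Two candidate planes of Eq. (3.2) in a neighborhood of the current best
        # plane (alpha, beta); c controls the size of the neighborhood visited.
        v = rng.standard_normal(alpha.size)
        v /= np.linalg.norm(v)                     # unit p-vector, uniform on the sphere
        planes = []
        for sign in (+1.0, -1.0):
            a = alpha + sign * c * v
            a /= np.linalg.norm(a)
            b = beta - (a @ beta) * a              # re-orthogonalize beta against a
            b /= np.linalg.norm(b)
            planes.append((a, b))
        return planes

    # usage: start from two orthonormal vectors; halve c whenever the index stalls
    rng = np.random.default_rng(0)
    alpha, beta = np.eye(5)[0], np.eye(5)[1]
    (a1, b1), (a2, b2) = candidate_planes(alpha, beta, c=0.5, rng=rng)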

3.2.2.3 Structure removal


There may be more than one interesting projection, and there may be other
views that reveal insights about the hyperspectral data. To locate other views,
Friedman (1987) proposed a method called structure removal. In this approach, the PP algorithm is first run on the data set to obtain the structure, i.e. the optimal projection plane. The approach then removes the structure found at that projection and repeats the projection pursuit process to find a projection that yields another maximum value of the projection pursuit index. Proceeding in this manner gives a sequence of projections providing informative views of the data. The procedure repeatedly transforms the projected data to standard normal until they stop becoming more normal, as measured by the projection pursuit index. One starts with a p × p matrix whose first two rows are the vectors of the projection obtained from PPEDA; the remaining rows have '1' on the diagonal and '0' elsewhere. For example, if p = 4, then
U^* = \begin{bmatrix} \alpha_1^* & \alpha_2^* & \alpha_3^* & \alpha_4^* \\ \beta_1^* & \beta_2^* & \beta_3^* & \beta_4^* \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}          (3.3)

The Gram-Schmidt orthonormalization process (Strang, 1988) makes the rows of U^* orthonormal; let U be the resulting orthonormal matrix. The next step in the structure removal process is to transform the Z matrix using the following equation,

T = U Z^T          (3.4)

where T is a p × n matrix. With this transformation, the first two rows of T contain, for every observation, the projection onto the plane given by (α*, β*). Structure removal is performed by applying a transformation Θ which transforms the first two rows of T to a standard normal and leaves the rest unchanged (Martinez, 2004). This is where the structure is removed, making the data normal in that projection (the first two rows). The transformation is defined as follows,

\Theta(T_1) = \phi^{-1}\!\left[ F(T_1) \right]
\Theta(T_2) = \phi^{-1}\!\left[ F(T_2) \right]
\Theta(T_i) = T_i, \quad i = 3, 4, \ldots, p          (3.5)

where \phi^{-1} is the inverse of the standard normal cumulative distribution function, T_1 and T_2 are the first two rows of the matrix T, and F is the function defined by the procedure in Eqs. (3.7) and (3.8). From Eq. (3.3) it is seen that only the first two rows of T change. T_1 and T_2 can be written as,

T_1 = \left( z_1^{\alpha^*}, z_2^{\alpha^*}, \ldots, z_j^{\alpha^*}, \ldots, z_n^{\alpha^*} \right)
T_2 = \left( z_1^{\beta^*}, z_2^{\beta^*}, \ldots, z_j^{\beta^*}, \ldots, z_n^{\beta^*} \right)          (3.6)

where z_j^{\alpha^*} and z_j^{\beta^*} are the coordinates of the jth observation projected onto the plane spanned by (α*, β*). Next, a rotation about the origin through an angle γ is defined as follows,

z_j^{1(t)} = z_j^{1(t)} \cos\gamma + z_j^{2(t)} \sin\gamma
z_j^{2(t)} = z_j^{2(t)} \cos\gamma - z_j^{1(t)} \sin\gamma          (3.7)

where γ = 0, π/4, π/8, 3π/8 and z_j^{1(t)} represents the jth element of T_1 at the tth iteration of the process. Applying the following transformation to the rotated points replaces each rotated observation by its normal score in the projection,

z_j^{1(t+1)} = \phi^{-1}\!\left( \frac{r\!\left(z_j^{1(t)}\right) - 0.5}{n} \right)
z_j^{2(t+1)} = \phi^{-1}\!\left( \frac{r\!\left(z_j^{2(t)}\right) - 0.5}{n} \right)          (3.8)

where r\!\left(z_j^{1(t)}\right) represents the rank of z_j^{1(t)}.

With this procedure, the projection index is reduced by making the data more normal. During the first few iterations, the projection index should decrease rapidly (Friedman, 1987). After approximate normality is obtained, the index might oscillate with small changes. Usually, the process takes between 5 and 15 complete iterations to remove the structure. Once the structure is removed using this process, the data are transformed back using the following equation,

Z' = U^T \,\Theta\!\left( U Z^T \right)          (3.9)

From matrix theory (Strang, 1988), it is known that all directions orthogonal to the structure (i.e., all rows of T other than the first two) have not been changed, whereas the structure has been Gaussianized and then transformed back. The next section summarizes the steps of PP.

3.2.2.4 Steps of PP
1. Load the data and set the values of the parameters, such as the number of best projection planes (N), the number of random starts in each neighborhood (m), the value of c and half.
2. Sphere the data and obtain the Z matrix.
3. Find each of the desired number of projection planes (structures) using the Posse chi-square index (Section 3.2.2.2).
4. Remove the structure (to reduce the effect of local optima) and find another structure (Section 3.2.2.3) until the projection pursuit index stops changing.
5. Continue the process until the best projection planes (orthogonal to each other) are obtained.

3.2.3 Kernel principal component analysis (KPCA)


Kernel principal component analysis (KPCA) means conducting the PCT in feature space (kernel space). KPCA is applied to variables which are nonlinearly related to the input variables. In this section, the KPCA algorithm is described by way of the PCA algorithm.
First, m TP (x_i ∈ R^n, i = 1, ..., m) are chosen. PCA finds the principal axes by diagonalizing the following covariance matrix,

C = \frac{1}{m}\sum_{j=1}^{m} x_j x_j^T          (3.10)

The covariance matrix C is positive semi-definite; hence, its eigen values are non-negative and satisfy

\lambda v = C v          (3.11)

For PCA, the eigen values are first sorted in decreasing order and the corresponding eigen vectors are found; the test points are then projected onto the eigen vectors, and the PCs are obtained in this manner. The next step is to rewrite PCA in terms of dot products. Substituting Eq. (3.10) in Eq. (3.11),

C v = \frac{1}{m}\sum_{j=1}^{m} x_j x_j^T v = \lambda v

Thus,

v = \frac{1}{m\lambda}\sum_{j=1}^{m} x_j x_j^T v = \frac{1}{m\lambda}\sum_{j=1}^{m} (x_j \cdot v)\, x_j          (3.12)

since (x\, x^T)\, v = (x \cdot v)\, x. In Eq. (3.12) the term (x_j \cdot v) is a scalar. This means that all solutions v with λ ≠ 0 lie in the span of x_1, ..., x_m, i.e.

v = \sum_{i=1}^{m} \alpha_i x_i          (3.13)

Steps for KPCA


1. For KPCA, first transform the TP to the feature space H using a kernel function Φ. The data set (Φ(x_i), i = 1, ..., m) in feature space is assumed to be centered, to reduce the complexity of the calculation. The covariance matrix of the data set in H then takes the form

C = \frac{1}{m}\sum_{j=1}^{m} \Phi(x_j)\,\Phi(x_j)^T          (3.14)

2. Find the eigen values λ ≥ 0 and the corresponding non-zero eigen vectors v ∈ H \ {0} of the covariance matrix C from the equation

\lambda v = C v          (3.15)

3. As shown previously (for PCA), all solutions v with λ ≠ 0 lie in the span of Φ(x_1), ..., Φ(x_m), i.e.,

v = \sum_{i=1}^{m} \alpha_i \Phi(x_i)          (3.16)

Therefore,

C v = \lambda v = \lambda \sum_{i=1}^{m} \alpha_i \Phi(x_i)          (3.17)

Substituting Eq. (3.14) and Eq. (3.16) in Eq. (3.17),

m\lambda \sum_{j=1}^{m} \alpha_j \Phi(x_j) = \sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_j\, \Phi(x_i)\,\Phi(x_i)^T \Phi(x_j)          (3.18)

4. Define the kernel inner product by K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j). Substituting this in Eq. (3.18), the following equation is obtained,

m\lambda \sum_{j=1}^{m} \alpha_j \Phi(x_j) = \sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_j\, \Phi(x_i)\, K(x_i, x_j)          (3.19)

5. To express the relationship in Eq. (3.19) entirely in terms of the inner-product kernel, premultiply both sides by \Phi(x_k)^T for all k = 1, ..., m. Define the m × m kernel matrix K whose ijth element is the inner-product kernel K(x_i, x_j), and the vector α of length m whose jth element is the coefficient α_j.
6. Finally, Eq. (3.19) can be written as

\lambda \sum_{j=1}^{m} \alpha_j\, \Phi(x_k)^T \Phi(x_j) = \frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_j\, \Phi(x_k)^T \Phi(x_i)\, \Phi(x_i)^T \Phi(x_j), \quad \forall\; k = 1, 2, \ldots, m          (3.20)

Using K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j), Eq. (3.20) can be transformed into

m\lambda K\alpha = K^2 \alpha          (3.21)

To find the solutions of Eq. (3.21), the following eigen value problem needs to be solved,

m\lambda \alpha = K \alpha          (3.22)

7. Solution of Eq. (3.22) provides the eigen values and eigen vectors of the kernel
matrix K. Let λ1 ≥ λ2 ≥ ........ ≥ λm be the eigen values of K and β1 , β2 ,......., βm be

the corresponding set of eigen vectors with λ p being the last non zero eigen

value.
 

Figure 3.4: (a) Input points before kernel PCA (b) Output after kernel PCA. The three groups are distinguishable using the first component only (Wikipedia, 2010).

8. To extract the principal components, the projections onto the eigen vectors β_n in H (n = 1, ..., p) need to be computed. Let x be a test point, with image Φ(x) in H. Then

\left\langle \beta_n, \Phi(x) \right\rangle = \sum_{i=1}^{m} \beta_{ni} \left\langle \Phi(x_i), \Phi(x) \right\rangle          (3.23)

where β_{ni} is the ith component of the eigen vector β_n.
9. In the above algorithm, it has been assumed that the data set is centered, but it is difficult to obtain the mean of the mapped data in the feature space H (Schölkopf, 2004), and therefore problematic to center the mapped data in feature space. However, this can be handled by slightly modifying the equations for kernel PCA: instead of K, the following kernel matrix \tilde{K} is diagonalized,

\tilde{K}_{ij} = \left( K - 1_m K - K 1_m + 1_m K 1_m \right)_{ij}, \quad \text{where } (1_m)_{ij} := \frac{1}{m} \;\; \forall\, i, j          (3.24)

Figure 3.5 provides the outline of the KPCA algorithm.

Figure 3.5: Outline of KPCA algorithm
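A compact Python sketch of the KPCA steps above (Gaussian kernel, centering as in Eq. (3.24), eigen decomposition and projection as in Eq. (3.23)) is given below; the kernel width, the eigenvalue threshold and the function names are illustrative assumptions rather than the exact implementation used in this work.

    import numpy as np

    def kpca(X_train, X_all, sigma=1.0, n_components=10):
        # X_train: (m, n) training pixels used to build the kernel model;
        # X_all:   (N, n) pixels to be projected onto the principal directions.
        def gauss(A, B):
            d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
            return np.exp(-d2 / (2 * sigma ** 2))

        m = X_train.shape[0]
        K = gauss(X_train, X_train)
        one_m = np.full((m, m), 1.0 / m)
        Kc = K - one_m @ K - K @ one_m + one_m @ K @ one_m     # centering, Eq. (3.24)

        vals, vecs = np.linalg.eigh(Kc)
        order = np.argsort(vals)[::-1]
        vals, vecs = vals[order], vecs[:, order]
        keep = vals > 1e-12                                    # drop numerically zero eigenvalues
        alphas = vecs[:, keep] / np.sqrt(vals[keep])           # normalize the directions in H

        # project every pixel onto the leading principal directions, Eq. (3.23)
        K_all = gauss(X_all, X_train)
        ones_N = np.full((X_all.shape[0], m), 1.0 / m)
        K_all_c = K_all - ones_N @ K - K_all @ one_m + ones_N @ K @ one_m
        return K_all_c @ alphas[:, :n_components]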

3.2.4 Orthogonal subspace projection (OSP)


The idea behind orthogonal subspace projection is to eliminate all unwanted or undesired spectral signatures (background) within a pixel and then use a matched filter to extract the desired spectral signature (endmember) present in that pixel.
 

3.2.4.1 Automated target generation process algorithm (ATGP)
In hyperspectral image analysis a pixel may encompass many different materials; such pixels are called mixed pixels, and they contain multiple spectral signatures. Let a column vector r_i represent the ith mixed pixel through the linear model,

r_i = M\alpha_i + n_i          (3.25)

where r_i is an l × 1 column vector representing the ith mixed pixel and l is the number of spectral bands. Each distinct material in the mixed pixel is called an endmember. Assume that there are p spectrally distinct endmembers in the ith mixed pixel. M is a matrix of dimension l × p made up of linearly independent columns, denoted by (m_1, m_2, ..., m_j, ..., m_p); the system is considered to be overdetermined (l > p) and m_j denotes the spectral signature of the jth distinct material or endmember. Let α_i be a p × 1 column vector given by (α_1, α_2, ..., α_j, ..., α_p)^T, where the jth element represents the fraction of the jth signature present in the ith mixed pixel. n_i is an l × 1 column vector representing white Gaussian noise with zero mean and covariance matrix σ²I, where I is the l × l identity matrix.
In Eq. (3.25), each r_i is assumed to be a linear combination of the p endmembers with weight coefficients given by the fraction vector α_i. The term Mα_i can be rewritten to separate the desired spectral signature from the undesired signatures; in other words, the target is separated from the background. When searching for a single spectral signature, this can be written as:

M\alpha = d\alpha_p + U\gamma          (3.26)

where d is the l × 1 desired signature of interest (the column vector m_p), while α_p is the 1 × 1 fraction of the desired signature. The matrix U is composed of the remaining column vectors of M, which are the undesired spectral signatures or background information; it is given by U = (m_1, m_2, ..., m_j, ..., m_{p−1}), with dimension l × (p − 1), and γ is a column vector containing the remaining (p − 1) components (fractions) of α.

Suppose P is an operator which eliminates the effects of U, the undesired signatures. To do this, an operator (the orthogonal subspace operator) is developed that projects r onto a subspace orthogonal to the columns of U. This results in a vector that only contains energy associated with the target d and the noise n. The operator used is the l × l matrix

P = \left( I - U (U^T U)^{-1} U^T \right)          (3.27)

The operator P maps d into a space orthogonal to the space spanned by the uninteresting signatures in U. Applying the operator P to the mixed pixel r of Eq. (3.25) gives

P r = P d\alpha_p + P U\gamma + P n          (3.28)

It should be noted that P operating on Uγ reduces the contribution of U to zero (close to zero in real data applications). Therefore, from the above rearrangement we have

P r = P d\alpha_p + P n          (3.29)

3.2.4.2 Signal-to-Noise Ratio (SNR) Maximization

The second step in deriving the pixel classification operator is to find the 1 × l operator X^T that maximizes the SNR. Operating on Eq. (3.28), we get

X^T P r = X^T P d\alpha_p + X^T P U\gamma + X^T P n          (3.30)

The operator X^T acting on Pr produces a scalar (Ientilucci, 2001). The SNR is given by

\lambda = \frac{X^T P d\, \alpha_p^2\, d^T P^T X}{X^T P\, E\!\left[ n n^T \right] P^T X}          (3.31)

\lambda = \left( \frac{\alpha_p^2}{\sigma^2} \right) \frac{X^T P d\, d^T P^T X}{X^T P P^T X}          (3.32)

where E[·] denotes the expected value. Maximization of this quotient is the generalized eigenvector problem

P d\, d^T P^T X = \lambda' P P^T X          (3.33)

where \lambda' = \lambda\left( \sigma^2 / \alpha_p^2 \right). The value of X^T which maximizes λ can be determined in general using the techniques outlined by Miller, Farison and Shin (1992) and the idempotent and symmetric properties of the interference rejection operator. It turns out that the value of X^T which maximizes the SNR is

X^T = k\, d^T          (3.34)

where k is an arbitrary scalar. Substituting the result of Eq. (3.34) into Eq. (3.30), it is seen that the overall classification operator for a desired hyperspectral signature in the presence of multiple undesired signatures and white noise is given by the 1 × l vector

q^T = d^T P          (3.35)

This result first nulls the interfering signatures and then uses a matched filter for the desired signature to maximize the SNR. When the operator is applied to all of the pixels in a hyperspectral scene, each l × 1 pixel is reduced to a scalar which is a measure of the presence of the signature of interest. The ultimate aim is to reduce the l images that make up the hyperspectral image cube to a single image in which pixels with high intensity indicate the presence of the desired signature.
This operator can easily be extended to seek out k signatures of interest. The vector operator simply becomes a k × l matrix operator given by,

Q = \left( q_1, q_2, \ldots, q_j, \ldots, q_k \right)          (3.36)

When the operator in Eq. (3.36) is applied to all of the pixels in a hyperspectral scene, each l × 1 pixel is reduced to a k × 1 vector. Thus, for each desired signature, the l-dimensional hyperspectral image reduces to a single feature-extracted image in which high-intensity pixels indicate the presence of that signature, and for k desired signatures the hyperspectral image is reduced to a k-dimensional feature-extracted image, with each band corresponding to one desired signature.
The above algorithm is illustrated with the following example:

Concrete = \begin{bmatrix} 0.26 \\ 0.30 \\ 0.31 \\ 0.31 \\ 0.31 \\ 0.31 \end{bmatrix} \qquad Tree = \begin{bmatrix} 0.07 \\ 0.07 \\ 0.11 \\ 0.54 \\ 0.55 \\ 0.54 \end{bmatrix} \qquad Water = \begin{bmatrix} 0.07 \\ 0.13 \\ 0.19 \\ 0.25 \\ 0.30 \\ 0.34 \end{bmatrix}
Suppose the image consists of 100 pixels, ordered from left to right, and let the 40th pixel be

pixel_{40} = (0.08)\,\text{concrete} + (0.75)\,\text{tree} + (0.07)\,\text{water} + \text{noise}          (3.37)

Let us assume that the noise is zero. If all the pixel mixture fractions have been defined, a particular class spectrum can be chosen for extraction from the image. Suppose the concrete material has to be extracted throughout the image; the same procedure can then be followed to extract the tree and water materials.
Assume that pixel_{40} is made up of some weighted linear combination of the endmembers,

pixel_{40} = M\alpha + \text{noise}          (3.38)

Now Mα can be broken up into the desired (dα_p) and undesired (Uγ) signatures, by assigning the desired spectrum to d and the undesired spectra to U. Let concrete be the vector d, and let tree and water be the column vectors of the matrix U. The fractions of mixing are unknown to us, but it is known that pixel_{40} is made up of some combination of d and U:

d = [concrete]  and  U = [tree, water]

It is now required to reduce the effect of U. To do this, a projection operator P is needed that, when operated on U, reduces its contribution to zero. To find concrete (d), pixel_{40} is projected onto a subspace that is orthogonal to the columns of U using the operator P. In other words, P maps d into a space orthogonal to the space spanned by the undesired signatures while simultaneously minimizing the effects of U. If P is operated on U, which contains tree and water, it is seen that the effect of U is minimized,

P U = \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 0 \end{bmatrix}          (3.39)

Now let r_1 = pixel_{40} and n = noise; then, from Eq. (3.29),

P r_1 = P d\alpha_p + P n          (3.40)

Next, the operator x^T that maximizes the signal-to-noise ratio (SNR) needs to be found. The operator x^T acting on P r_1 produces a scalar and, as stated before, the value of x^T which maximizes the SNR is X^T = k d^T. This leads to the overall OSP operator of Eq. (3.35). In this way the matrix Q in Eq. (3.36) can be formed; every pixel vector can then be projected using Q, and the OSP feature-extracted image is formed.
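As a small, self-contained illustration of the operators in Eqs. (3.27), (3.35) and (3.39) applied to the example above, consider the Python sketch below; it assumes the noise term is zero and uses the three six-band signatures listed earlier.

    import numpy as np

    # Endmember signatures from the example above (6 bands each).
    concrete = np.array([0.26, 0.30, 0.31, 0.31, 0.31, 0.31])
    tree     = np.array([0.07, 0.07, 0.11, 0.54, 0.55, 0.54])
    water    = np.array([0.07, 0.13, 0.19, 0.25, 0.30, 0.34])

    d = concrete.reshape(-1, 1)                     # desired signature
    U = np.column_stack([tree, water])              # undesired signatures

    # Orthogonal subspace operator, Eq. (3.27): P = I - U (U^T U)^-1 U^T
    P = np.eye(6) - U @ np.linalg.inv(U.T @ U) @ U.T
    print(np.round(P @ U, 10))                      # ~zero matrix, cf. Eq. (3.39)

    # Overall OSP operator for the desired signature, Eq. (3.35): q^T = d^T P
    qT = d.T @ P

    # Apply it to the mixed pixel of Eq. (3.37), with the noise assumed zero.
    pixel40 = 0.08 * concrete + 0.75 * tree + 0.07 * water
    print(qT @ pixel40)                             # scalar measuring the presence of concrete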

3.3 Supervised classifier


This section describes the mathematical background of supervised classifiers.
First, it will describe the Bayesian decision rule followed by the decision rule for
Gaussian maximum likelihood classifier (GML). Afterwards it will describe the k-
nearest neighbor (KNN) and Support vector machine (SVM) classification rules.

3.3.1 Bayesian decision rule


In pattern recognition, patterns need to be classified. There are plenty of decision rules available in the literature, but only Bayes decision theory is optimal (Riggi and Harmouche, 2004). It is based on the well-known Bayes theorem. Suppose there are K classes, let f_k(x) be the distribution function of the kth class, 1 ≤ k ≤ K, and let P(c_k) be the prior probability of the kth class, such that \sum_{k=1}^{K} P(c_k) = 1. For any class k, the posterior probability for a pixel vector x is denoted by p_k(c_k | x) and defined by (assuming all classes are mutually exclusive):

p_k(c_k \mid x) = \frac{P(x \mid c_k)\, P(c_k)}{\sum_{j=1}^{K} f_j(x)\, P(c_j)}          (3.41)

where f_j(x) = P(x \mid c_j).

Therefore, the Bayes decision rule is:

x \in c_i \;\; \text{if} \;\; p_i(c_i \mid x) = \max_{k}\, p_k(c_k \mid x)          (3.41a)

 
3.3.2 Gaussian maximum likelihood classification (GML):
Gaussian maximum likelihood classifier assumes that the distribution of the data points is
Gaussian (normally distributed) and classifies an unknown pixel based on the variance and
covariance of the spectral response patterns. The classification is based on the probability density functions associated with the training data: each pixel is assigned to the most likely class by comparing its posterior probabilities for the signatures being considered.
Under this assumption, the distribution of a category response pattern can be completely described
by the mean vector and the covariance matrix. With these parameters, the statistical probability of
a given pixel value being a member of a particular land cover class can be computed (Lillesand et
al., 2002). GML classification can obtain minimum classification error under the assumption that
the spectral data of each class is normally distributed. It considers not only the cluster centre but
also its shape, size and orientation by calculating a statistical distance based on the mean values
and covariance matrix of the clusters. The decision boundary for the GML classification is:

g_k(x) = -\frac{1}{2}\left[ \ln\left| \hat{\Sigma}_k \right| + (x - \hat{\mu}_k)^T\, \hat{\Sigma}_k^{-1}\, (x - \hat{\mu}_k) \right]          (3.42)

and the final Bayesian decision rule is:

x \in c_j \;\; \text{if} \;\; g_j(x) = \max_{k}\, g_k(x)

where g_k(x) is the decision boundary function for the kth class.
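A minimal Python sketch of this discriminant (with equal priors, as implied by Eq. (3.42)) is given below; the function names are illustrative, not the implementation used in the thesis.

    import numpy as np

    def gml_train(X_train, y_train):
        # Estimate mean vector and covariance matrix for every class.
        # X_train: (n_pixels, n_bands); y_train: integer class labels (NumPy array).
        return {c: (X_train[y_train == c].mean(axis=0),
                    np.cov(X_train[y_train == c], rowvar=False))
                for c in np.unique(y_train)}

    def gml_classify(X, stats):
        # Assign each pixel to the class with the largest discriminant g_k(x), Eq. (3.42).
        classes = sorted(stats)
        scores = np.empty((X.shape[0], len(classes)))
        for j, c in enumerate(classes):
            mu, cov = stats[c]
            diff = X - mu
            inv = np.linalg.inv(cov)
            maha = np.einsum('ij,jk,ik->i', diff, inv, diff)   # (x-mu)^T Sigma^-1 (x-mu)
            scores[:, j] = -0.5 * (np.linalg.slogdet(cov)[1] + maha)
        return np.array(classes)[np.argmax(scores, axis=1)]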

3.3.3 k – nearest neighbor classification


The KNN algorithm (Fix and Hodges, 1951) is a nonparametric classification technique which has proven to be effective in pattern recognition. However, its inherent limitations and disadvantages restrict its practical applications; one of its shortcomings is lazy learning, which makes the traditional KNN time-consuming. In this thesis work the traditional KNN procedure has been applied (Fix and Hodges, 1951).
The k-nearest neighbor classifier is commonly based on the Euclidean distance between a test pixel and the specified TP. The TP are vectors in a multidimensional feature space, each with a class label. In the classification phase, k is a user-defined constant, and an unlabelled vector, i.e. a test pixel, is classified by assigning the label which is most frequent among the k training samples nearest to that test pixel.

          
 
 

Figure 3.6: KNN classification scheme. The test pixel (circle) should be classified
either to the first class of squares or to the second class of triangles. If k
= 3, it is classified to the second class because there are 2 triangles and
only 1 square inside the inner circle. If k = 5, it is classified to first class
(3 squares vs. 2 triangles inside the outer circle). If k = 11, it is classified
to first class (6 squares vs. 5 triangles) (Modified after Wikipedia, 2009).

Let x be an n-dimensional test pixel and y_i (i = 1, 2, ..., p) the n-dimensional TP. The Euclidean distance between them is defined by:

d_i(x, y_i) = \sqrt{ (x_{11} - y_{i1})^2 + (x_{12} - y_{i2})^2 + \ldots + (x_{1n} - y_{in})^2 }          (3.43)

where x = (x_{11}, x_{12}, ..., x_{1n}), y_i = (y_{i1}, y_{i2}, ..., y_{in}) and D = \{ d_1, d_2, ..., d_p \}, p being the number of TP.
The final KNN decision rule is:

x \in c_j \;\; \text{if, among the } k \text{ smallest elements of } D, \text{ the number corresponding to } c_j \text{ is at least } \begin{cases} \dfrac{k}{2} + 1, & k \text{ even} \\[4pt] \left\lceil \dfrac{k}{2} \right\rceil, & k \text{ odd} \end{cases}          (3.44)

In the case of a tie, the test pixel is assigned to the tied class c_j whose mean vector is closest to it.

Here k is a user-defined parameter specifying the number of nearest neighbors chosen for classification. The outline of the KNN classification algorithm is given in Figure 3.7.

Figure 3.7: Outline of KNN algorithm
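A minimal Python sketch of this decision rule, including the tie-break on class mean vectors, is given below; the function name and array layout are assumptions for illustration.

    import numpy as np
    from collections import Counter

    def knn_classify(x, train_pixels, train_labels, k=5):
        # Majority vote among the k nearest TP, using the Euclidean distance of Eq. (3.43).
        # train_pixels: (p, n) NumPy array; train_labels: (p,) NumPy array of class labels.
        d = np.sqrt(((train_pixels - x) ** 2).sum(axis=1))
        nearest = np.argsort(d)[:k]
        votes = Counter(train_labels[nearest]).most_common()
        best = votes[0][1]
        tied = [label for label, count in votes if count == best]
        if len(tied) == 1:
            return tied[0]
        # tie-break: pick the tied class whose mean vector is closest to the test pixel
        means = {c: train_pixels[train_labels == c].mean(axis=0) for c in tied}
        return min(tied, key=lambda c: np.linalg.norm(x - means[c]))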

3.3.4 Support vector machine (SVM):


The foundations of support vector machines (SVM) were developed by Vapnik (1995). The formulation embodies the structural risk minimization (SRM) principle, which has been shown to be superior (Gunn et al., 1997) to the traditional empirical risk minimization (ERM) principle employed by conventional neural networks: SRM minimizes an upper bound on the expected risk, whereas ERM minimizes the error on the training data. SVMs were developed to solve the classification problem, but recently they have been extended to the domain of regression problems (Vapnik et al., 1997).
SVM is basically a linear learning machine based on the principle of optimal separation of classes. The aim is to find a hyperplane which linearly separates the classes of interest. The linear separating hyperplane is placed between the classes in such a way that it satisfies two conditions:
(i) All data vectors that belong to the same class are placed on the same side of the separating hyperplane.
(ii) The distance between the closest data vectors in the two classes is maximized (Vapnik, 1982).

The main aim of SVM is thus to define an optimal hyperplane between the two classes which maximizes the margin between them. For each class, the data vectors forming the boundary of the class are called the support vectors (SV), and the hyperplane is called the decision surface (Pal, 2002).

3.3.4.1 Statistical learning theory

The goal of statistical learning theory (Vapnik, 1998) is to create a mathematical framework for learning from input training data with known classes and for predicting the outcome for data points with unknown identity. Two induction principles are used: the first is ERM, whose aim is to reduce the training error, and the second is SRM, whose goal is to minimize the upper bound on the expected error over the whole data set. The empirical risk differs from the expected risk in two ways (Haykin, 1999). First, it does not depend on the unknown cumulative distribution function. Secondly, it can be minimized with respect to the parameters used in the decision rule.

3.3.4.2 Vapnik-Chervonenkis dimension (VC-dimension)

The VC dimension is a measure of the capacity of a set of classification functions. The VC-dimension, generally denoted by h, is an integer that represents the largest number of data points that can be separated by a set of functions f_α in all possible ways. For example, for an arbitrary binary classification problem, the VC-dimension is the maximum number of points k which can be separated into two classes without error in all possible 2^k ways (Varshney and Arora, 2004).

3.3.4.3 Support vector machine algorithm with quadratic optimization method (SVM_QP)

The procedure for obtaining a separating hyperplane by SVM is explained for a simple linearly separable case with two classes that can be separated by a hyperplane; it can be extended to the multiclass classification problem. The procedure can also be extended to the case where a hyperplane cannot separate the two classes, which is the kernel method for SVM.
Let there be n training samples obtained from two classes, represented as (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), where x_i ∈ R^m, m is the dimension of the data vector, and each sample belongs to either of the two classes labeled by y ∈ {−1, +1}. These samples are said to be linearly separable if there exists a hyperplane in m-dimensional space whose orientation is given by a vector w and whose location is determined by a scalar b, the offset of this hyperplane from the origin (Figure 3.8). If such a hyperplane exists, then the given set of training data points must satisfy the following inequalities:

w \cdot x_i + b \geq +1, \quad \forall\, i : y_i = +1          (3.45)
w \cdot x_i + b \leq -1, \quad \forall\, i : y_i = -1          (3.46)

Thus, the equation of the hyperplane is given by w \cdot x + b = 0.

Figure 3.8: Linear separating hyperplane for linearly separable data (Modified after
Gunn, 1998).

The inequalities in Eq. (3.45) and Eq. (3.46) can be combined into a single inequality
as:
yi (w.xi + b) ≥ 1 (3.47)

Thus, the decision rule for the linearly separable case can be defined in the following form: the class of x_i is given by

\text{sign}(w \cdot x_i + b)          (3.48)

Where, sign(.) is the signum function whose value is +1 for any element greater than

or equal to zero, and –1 if it is less than zero. The signum function, thus, can easily
represent the two classes given by labels +1 and –1.
The separating hyperplane (Figure 3.8) will be able to separate the two classes
optimally when its margin from both the classes is equal and maximum (Varshney,
2004) i.e. the hyperplane should be located exactly in the middle of the two classes.

The distance D(x; w, b) is used to express the margin of separation for a point x from the hyperplane defined by w and b. It is given by

D(x; w, b) = \frac{\left| w \cdot x + b \right|}{\lVert w \rVert_2}          (3.49)

where \lVert \cdot \rVert_2 denotes the second norm, equivalent to the Euclidean length of the vector for which it is computed, and |·| is the absolute value. Let d be the value of the margin between the two separating planes. To maximize the margin, the value of d is expressed as

d = \frac{(w \cdot x + b + 1)}{\lVert w \rVert_2} - \frac{(w \cdot x + b - 1)}{\lVert w \rVert_2} = \frac{2}{\lVert w \rVert_2} = \frac{2}{\sqrt{w^T w}}          (3.49a)

To obtain an optimal hyperplane, the margin d = 2 / \lVert w \rVert_2 should be maximized, which is equivalent to minimizing the 2-norm of the vector w. Thus, the objective function Φ(w) for finding the best separating hyperplane reduces to

\Phi(w) = \frac{1}{2} w^T w          (3.50)
A constrained optimization problem can be constructed for minimizing the objective
function in Eq. (3.50) under the constraints given in Eq. (3.47). This kind of
constrained optimization problem with a convex objective function of w and linear
constraints is called a primal problem and can be solved using standard Quadratic
Programming (QP) optimization techniques. The QP optimization technique can be
implemented by replacing the inequalities in a simpler form by transforming the
problem into a dual space representation using Lagrange multipliers ( λi )

(Luenberger, 1984). The vector w can be defined in terms of the Lagrange multipliers ( λi )

as shown:

w = \sum_{i=1}^{n} \lambda_i y_i x_i, \qquad \sum_{i=1}^{n} \lambda_i y_i = 0          (3.51)

The dual optimization problem obtained through the Lagrange multipliers (λ_i) thus becomes

\max_{\lambda} L(w, b, \lambda) = \sum_{i=1}^{n} \lambda_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \lambda_i \lambda_j y_i y_j (x_i \cdot x_j)          (3.52)

subject to the constraints:

\sum_{i=1}^{n} \lambda_i y_i = 0          (3.53)

\lambda_i \geq 0, \quad i = 1, 2, \ldots, n          (3.54)

The solution of the optimization problem is obtained in terms of the Lagrange multipliers. According to the Karush-Kuhn-Tucker (KKT) optimality conditions (Taylor, 2000), some of the Lagrange multipliers will be zero; the training vectors whose multipliers are nonzero are the SVs. The result from an optimizer, also called an optimal solution, is a set of unique and independent multipliers λ^o = (λ_1^o, λ_2^o, ..., λ_{n_s}^o), where n_s is the number of support vectors found. These are substituted in Eq. (3.51) to obtain the orientation of the optimal separating hyperplane (w_0) as


w_0 = \sum_{i=1}^{n_s} y_i \lambda_i^0 x_i          (3.55)

The offset from the origin (b_0) is determined from the equation given below,

b_0 = -\frac{1}{2}\left[ w_0 \cdot x_{+1}^0 + w_0 \cdot x_{-1}^0 \right]          (3.56)

where x_{+1}^0 and x_{-1}^0 are support vectors of class labels +1 and −1 respectively. The following decision rule (obtained from Eq. (3.48)) is then applied to classify the data vectors into the two classes +1 and −1:

f(x) = \text{sign}\left( \sum_{\text{support vectors}} y_i \lambda_i^0 (x_i \cdot x) + b_0 \right)          (3.57)

Eq. (3.57) implies that x is assigned to the class given by

\text{sign}\left( \sum_{\text{support vectors}} y_i \lambda_i^0 (x_i \cdot x) + b_0 \right)          (3.58)

Generally, it may not be possible to separate the classes optimally by a linear hyperplane, and thus a non-linear manifold in hyperspace would be required for optimal separation among the classes. The data present in the m-dimensional space can be mapped into a higher dimensional space, where they spread out and can be separated by a linear hyperplane, as shown in Figure 3.9.
Suppose the non-linear transformation function φ maps the data into a higher dimensional space in which a data point x of the original m-dimensional space is represented as φ(x). Thus, the dual optimization problem in Eq. (3.52) is modified as:


\max_{\lambda} L(w, b, \lambda) = \sum_{i=1}^{n} \lambda_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \lambda_i \lambda_j y_i y_j \left( \phi(x_i) \cdot \phi(x_j) \right)          (3.59)

The computation of the dot product φ ( x i ) ⋅ φ ( x j ) will be computationally very

expensive as computations will be done in a higher dimensional space. So, kernel


functions are used to substitute the value of dot product of the transformed vectors
according to Mercer’s Theorem (Mercer, 1909). Suppose there exists a kernel function
K such that
K ( x i , x j ) = φ ( x i ) ⋅ φ( x j ) (3.60)

 
(a) Input space                (b) Feature space

Figure 3.9: Non-linear mapping scheme. φ is a nonlinear mapping which transforms the pixels from input space to feature space; the φ(x_i) are pixels in feature space. Pixels that are linearly non-separable in input space become linearly separable in feature space (Cristianini, 2000).

Putting Eq. (3.60) into Eq. (3.59), the modified form of the dual optimization problem becomes:

\max_{\lambda} L(w, b, \lambda) = \sum_{i=1}^{n} \lambda_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \lambda_i \lambda_j y_i y_j K(x_i, x_j)          (3.61)

subject to the constraints:

\sum_{i=1}^{n} \lambda_i y_i = 0          (3.62)

Similarly, the final decision rule can be modified as: x is assigned to the class given by

\text{sign}\left( \sum_{i=1}^{n_s} y_i \lambda_i^o K(x_i, x) + b_o \right)          (3.63)

Some of the commonly used kernel functions for classification are presented in Table 3.2. Selection of a suitable kernel function is essential for better classification of a particular data set; details on the effects of different kernel functions on classification accuracy are available in Varshney and Arora (2004).
Originally, SVMs were developed to perform binary classification. They have since been extended to multiclass classification, where the number of classes is more than two. Pal (2004) discusses two multiclass classification methods: one against the rest, and pairwise classification. In the first, K binary classifiers are created for a K-class classification problem, where each classifier is trained to distinguish one class from the other K − 1 classes. The second approach considers one pair of classes at a time and performs SVM-based binary classification to assign all pixels to one of the two classes under consideration. A total of K(K − 1)/2 pairs of classes are possible for a K-class problem, and thus that many SVM binary classifiers have to be created. A pixel is finally assigned to the class to which it is allocated by the largest number of the K(K − 1)/2 SVM classifiers (Varshney and Arora, 2004).
Figure 3.10 shows a summary of the SVM classification algorithm.
 

Figure 3.10: Brief description of SVM_QP algorithm
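For illustration only, the following Python sketch shows how a kernel SVM of this type can be applied with an off-the-shelf library; the arrays are hypothetical placeholders and this is not the MATLAB implementation used in the thesis.

    import numpy as np
    from sklearn.svm import SVC

    # Hypothetical data: training pixels with labels, plus the image pixels to classify.
    rng = np.random.default_rng(0)
    train_pixels = rng.random((800, 20))          # e.g. 100 TP per class, 20 bands
    train_labels = rng.integers(0, 8, 800)        # 8 classes
    image_pixels = rng.random((5000, 20))

    # RBF-kernel SVM; scikit-learn's SVC solves the dual of Eq. (3.61) with an
    # SMO-type optimizer and handles multiclass problems by pairwise (one-against-one)
    # voting, i.e. K(K-1)/2 binary classifiers for K classes.
    clf = SVC(kernel='rbf', C=1.0, gamma='scale')
    clf.fit(train_pixels, train_labels)
    thematic_map = clf.predict(image_pixels)
    print(clf.n_support_)                         # number of support vectors per class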

3.3.4.4 SMO optimization for SVM
Sequential Minimal Optimization (SMO) is a simple algorithm that can quickly
solve the SVM QP problem without any extra matrix storage and without using
numerical QP optimization steps at all. SMO decomposes the overall QP problem into
QP sub-problems, using Osuna’s theorem (Osuna, 1997) to ensure convergence.
Unlike the previous methods, SMO chooses to solve the smallest possible
optimization problem at every step. For the standard SVM QP problem, the smallest
possible optimization problem involves two Lagrange multipliers, because the
Lagrange multipliers must obey a linear equality constraint. At every step, SMO
chooses two Lagrange multipliers to jointly optimize, finds the optimal values for
these multipliers, and updates the SVM to reflect the new optimal values. The
advantage of SMO lies in the fact that solving for two Lagrange multipliers can be
done analytically. Thus, numerical QP optimization is avoided entirely. Even though
more optimization sub-problems are solved in the course of the algorithm, each sub-
problem is so fast that the overall QP problem is solved quickly. In addition, SMO
requires no extra matrix storage at all. Thus, very large SVM training problems can
fit inside the memory of an ordinary personal computer or workstation. Because no
matrix algorithms are used in SMO, it is less susceptible to numerical precision
problems. There are two components to SMO: an analytic method for solving for the
two Lagrange multipliers, and a heuristic for choosing which multipliers to optimize.
In this thesis, all computations involving the SMO optimization method have been carried out with the MATLAB built-in function "SVMSMOSET".

3.3.4.5 KPCA-SVM
Nonlinear SVM is more accurate than linear SVM. However, it is slower, and the time taken for classification increases linearly with the number of SVs. Reduced-set methods try to speed up SVM classification by reducing the number of SVs (Burges and Scholkopf, 1996). This section presents a technique for reducing the number of SVs using the KPCA algorithm (Sundaram, 2009). It should be kept in mind that the space spanned by the original set of SVs should always be equivalent to the space spanned by the reduced set of SVs; this is the criterion for choosing the minimum number of SVs to improve the classification time.

The solution of the optimization problem in Eq. (3.52) is obtained in terms of the Lagrange multipliers, and the SVs are extracted by solving Eq. (3.52). The algorithm for this method is stated below.
1. First choose an appropriate kernel function. Then calculate the kernel matrix K_{xx} from the set of SVs x_i, i = 1, 2, ..., N,

K_{xx}(i, j) = K(x_i, x_j), \quad i, j = 1, 2, \ldots, N          (3.64)
2. Center the kernel matrix K_{xx},

K_{xx}^c = H K_{xx} H          (3.65)

where H = I − (1/N)J is the centering matrix, I is the N × N identity matrix and J is the N × N matrix of ones. Sundaram (2009) used Eq. (3.65) to center the kernel matrix; however, according to the literature the kernel matrix should be centered using Eq. (3.24), which is the standard procedure for centering a kernel matrix.
3. Perform kernel PCA by implementing an eigen value decomposition of the centered kernel matrix K_{xx}^c,

K_{xx}^c = A \Lambda A^T          (3.66)

where A is the matrix of eigen vectors and Λ is a diagonal matrix of eigen values whose diagonal elements are λ_1, λ_2, ..., λ_N.

4. Sort the eigen values and corresponding eigen vectors. Discard eigen values
smaller than a threshold. A value of 10−5 has been used in this thesis work.
This was done to prevent numerical problems in the later stages of the
algorithm.
5. Calculate the normalized principal directions,

V_k = \frac{1}{\sqrt{\lambda_k}} \sum_{j=1}^{N} a_{jk}\, \tilde{\Phi}(x_j)          (3.67)

where \tilde{\Phi}(x_j) = \Phi(x_j) - \frac{1}{N}\sum_{i=1}^{N} \Phi(x_i). In matrix form this becomes

V = K A \Lambda^{-1/2}          (3.68)

Select the first M principal directions which retain a total of 99% of the variance.

6. Calculate the new SVs by choosing their projections on the principal directions from a uniform distribution U[−σ_k, +σ_k], where σ_k = \sqrt{\lambda_k / N}. In matrix form this becomes

\tilde{V} = V R          (3.69)

where R = \frac{1}{\sqrt{N}} \Lambda^{1/2} U and U is a matrix of points chosen from the uniform distribution U[−1, +1].

7. Each column of \tilde{V} corresponds to a new SV. Now project the images of the old SVs (Φ(x_i)) along the directions of the new set of SVs (i.e. along the directions of the PCs),

\Phi(z_k) = \sum_{i=1}^{N} \tilde{V}_{ik}\, \Phi(x_i)          (3.70)

8. Calculate the approximate pre-images z_k of the points Φ(z_k) obtained in the previous step according to the formula given below (Scholkopf, 1996),

z_k = \frac{ \sum_{i=1}^{N} \tilde{V}_{ik} \left( \tfrac{1}{2}\left( 1 - \tilde{V}_k^T K_{xx} \tilde{V}_k + 2 \tilde{V}_k^T k_{x_i} \right) \right) x_i }{ \sum_{i=1}^{N} \tilde{V}_{ik} \left( \tfrac{1}{2}\left( 1 - \tilde{V}_k^T K_{xx} \tilde{V}_k + 2 \tilde{V}_k^T k_{x_i} \right) \right) }          (3.71)

where k_{x_i} = \left[ K(x_i, x_1)\; K(x_i, x_2)\; \ldots\; K(x_i, x_N) \right]^T.

9. Calculate the new coefficients β by solving

K_{zz} \beta = K_{zx} \alpha          (3.72)

This ensures that both SVMs produce the same results for all the z_k, k = 1, 2, ..., M (Scholkopf and Mika, 1999).


A new set of SVs z_k, k = 1, 2, ..., M, and the new coefficients β_i, i = 1, 2, ..., M, of these SVs are therefore obtained, and the general SVM classification algorithm is then applied to the new set of SVs. Figure 3.11 describes the outline of the above algorithm.

Figure 3.11: Overview of KPCA_SVM algorithm
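A minimal Python sketch of the final coefficient-recomputation step, Eq. (3.72), is given below; the kernel function, the variable names and the use of a least-squares solve (in case K_zz is ill-conditioned) are assumptions.

    import numpy as np

    def new_coefficients(kernel, old_svs, old_alphas, new_svs):
        # Solve K_zz beta = K_zx alpha (Eq. (3.72)) so that the reduced-set SVM
        # reproduces the original decision values at the new support vectors.
        K_zz = np.array([[kernel(z1, z2) for z2 in new_svs] for z1 in new_svs])
        K_zx = np.array([[kernel(z, x) for x in old_svs] for z in new_svs])
        return np.linalg.lstsq(K_zz, K_zx @ old_alphas, rcond=None)[0]

    # usage with a Gaussian kernel (the width is an arbitrary choice here), e.g.
    # beta = new_coefficients(gauss, old_svs, old_alphas, new_svs)
    gauss = lambda a, b, s=1.0: np.exp(-np.sum((a - b) ** 2) / (2 * s ** 2))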

3.4 Analysis of classification results


The classification results obtained using the various classification techniques are expressed as standard confusion matrices (Landgrebe, 2003) showing the class-wise user (k_{ua}), producer (k_{pa}) and overall (k) kappa measures (Congalton, 1991). The overall kappa (k) values obtained from different classification techniques were used in one-tailed hypothesis testing (Congalton, 1991) for comparing any two classification results, while the class-wise producer's kappa (k_{pa}) values were used to check the performance of the different classification techniques in separating different classes (Abhinav, 2009).

3.4.1 One tailed hypothesis testing


The z-statistic (Congalton, 1991) is computed from the kappa values obtained for the two classification techniques being compared:

Z_{12} = \frac{\hat{k}_1 - \hat{k}_2}{\sqrt{\hat{\sigma}_1^2 + \hat{\sigma}_2^2}}          (3.73)

where \hat{k}_1 and \hat{k}_2 are the kappa estimates obtained for the two classification techniques under consideration, and \hat{\sigma}_1^2, \hat{\sigma}_2^2 are the respective estimates of the variances of the observed kappa values. The z-statistic is used for one-tailed hypothesis testing with the following null (H_0) and alternate (H_1) hypotheses:

H_0 : k_1 - k_2 \leq 0
H_1 : k_1 - k_2 > 0          (3.74)

The null hypothesis chosen here is that, of the two classification results \hat{k}_1 and \hat{k}_2, \hat{k}_1 is not significantly better than \hat{k}_2, which means that the first classification technique is not significantly better than the second. The alternate hypothesis states that the two classification results are statistically different and that the result corresponding to \hat{k}_1 is statistically better than that corresponding to \hat{k}_2; it can then be said that the first classification technique is significantly better than the second (Abhinav, 2009).


The z-statistic in Eq. (3.73) follows the standard normal distribution (Congalton, 1991). Thus, according to one-tailed hypothesis testing (Figure 3.12), if the value of the Z_{12}-statistic is greater than the critical value (1.65) for a confidence level of 95%, the null hypothesis can be rejected, and it can be said with 95% confidence that the two classification results are statistically different, with the first one performing better than the second (Abhinav, 2009).

Figure 3.12: Non-rejection and rejection regions for H_0 in the one-tailed hypothesis test, with critical value Z_c = 1.65 (Abhinav, 2009).
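The test can be sketched in Python as follows; the kappa values and variances used in the final line are hypothetical placeholders, not results from this thesis.

    import math

    def z_statistic(k1, var1, k2, var2):
        # Z12 of Eq. (3.73) from two kappa estimates and their variances.
        return (k1 - k2) / math.sqrt(var1 + var2)

    def first_is_better(k1, var1, k2, var2, z_critical=1.65):
        # One-tailed test at the 95% level: reject H0 (k1 - k2 <= 0) if Z12 > Zc.
        return z_statistic(k1, var1, k2, var2) > z_critical

    print(first_is_better(0.9691, 2.5e-5, 0.9589, 3.0e-5))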

CHAPTER 4
EXPERIMENTAL DESIGN

This chapter will address the methodology followed for this thesis work.
Experiments were designed to investigate the best FE technique, classification
algorithm and best time saving strategy for HD. On the basis of conclusions from the
literature survey and recommendations for future work by Abhinav (2009), several
FE and classification algorithms have been tested which have potential for improving
classification accuracy and time for HD. The theoretical background of these
algorithms was presented in Chapter 3.

The following FE methods and classification algorithms have been tested:

(1) Feature extraction algorithms
    • Unsupervised feature extraction algorithms
        a) Segmented principal component analysis (SPCA) (Jia, 1996).
        b) Projection pursuit (PP) (Friedman and Tukey, 1974).
    • Supervised feature extraction algorithms
        a) Kernel principal component analysis (KPCA) (Scholkopf, 1995).
        b) Orthogonal subspace projection (OSP) (Ientilucci, 2001).
(2) Classification algorithms
    • Parametric classification approach
        a) Gaussian maximum likelihood (GML) (Savage, 1976).
    • Non-parametric classification approach
        a) k-nearest neighborhood (KNN) (Fix and Hodges, 1951).
    • Advanced classification approach
        a) Support vector machine with the quadratic programming optimization method (SVM_QP) (Vapnik, 1995).
        b) Support vector machine with the sequential minimal optimization method (SVM_SMO) (Platt, 1999).
        c) Kernel principal component analysis support vector machine (KPCA_SVM) (Sundaram, 2009).

This chapter starts with experimental details for different FE and selection
techniques. It then explains the classification methodology for the parametric and non-parametric classifiers, followed by the advanced classifiers.
 

4.1 Feature extraction technique


  Two types of FE techniques, unsupervised and supervised, were used in this
experiment. SPCA, PP are unsupervised FE techniques and KPCA, OSP are
supervised FE techniques. The details of FE methods are given below.

4.1.1 SPCA
For SPCA, the complete data set is subgrouped on the basis of the correlation between bands.
Then PCA is applied separately on each subgroup of data. Feature selection from the
new data set is obtained after the first subgroup transformation by variance
information (first few PCs retaining 99% variance were selected). Then selected
features are regrouped and transformed again to compress the data further. The
flowchart of SPCA method is shown in Figure 4.1.
 

Figure 4.1: SPCA feature extraction method

4.1.2 PP
For PP, Posse’s (1995a) algorithm was used in this research work where OD (n-
dimension) is projected on two dimensional space. Thus the dimension of the PP

feature extracted data set is two. Chi-square projection pursuit index was chosen
here. The methodology adopted for PP method is shown in Figure 4.2.

Figure 4.2: PP feature extraction method

4.1.3 KPCA
The number of PCs is equal to the number of TP used for FE. In this experiment, a total of up to 400 TP have been used for FE with the KPCA method. Hence,
the dimension of the KPCA feature extracted data set is up to 400. Firstly, TP are
mapped into feature space using different kernel function (linear, polynomial and
Gaussian) in the form of gram matrix. Then eigen values and eigen vectors of gram
matrix are calculated. Afterwards, OD is mapped in kernel space using the same
kernel function (used for TP) and projected along the direction of eigen vectors.
Finally, KPCA feature extracted data set is obtained. The outline of KPCA method is
shown in Figure 4.3.
 

 
Figure 4.3: KPCA feature extraction method

4.1.4 OSP
  The dimensionality of feature extracted data set depends upon the number of
classes present in the OD. OSP starts with finding the endmembers by automated
target generation process (ATGP). Then OD is projected along the endmembers and
feature extracted data set is obtained. The data set used for this thesis has eight
classes, so the number of endmembers is also eight. The dimension of feature
extracted data set is equal to the number of endmembers. The brief description of
OSP method is shown in Figure 4.4.
 

Figure 4.4: OSP feature extraction method 

4.2 Experimental design


This section provides the detailed methodology of the classification followed in this research work. Feature-extracted data or OD, TP and the selected bands are given as input to the classifier. In this thesis work, the same set of TP has been used for every data set to train the classifier; for example, to perform classification using 200 TP per class on the SPCA-modified data set, the same 200 TP were also used for the OD. To vet the results obtained by Abhinav (2009), the same sets of TP are also used here; those TP were obtained by a multinomial TP selection algorithm. The statistically sufficient sample size for training and testing was calculated at a confidence level of 99% and a desired precision of 4% using the formula suggested by Tortora (1976). Following this approach, a minimum of 99 TP per class has to be chosen to train a classifier.
Experiments were performed with GML, KNN and the advanced classifier (SVM).
For each classifier, two types of experiments were performed: the first type of
classification experiment was implemented on the OD and the second type was carried
out on the feature-extracted data sets. For each set of experiments, the classifier was
trained with 25, 100, 200 and 300 TP per class. Using the same set of TP ensures that
there is no discrepancy due to different training data sets while comparing
classification results. These numbers were chosen in order to consider the following
cases of training sample size.

a) Statistically insufficient training sample size (25 TP)


b) Statistically exact training sample size (100 TP)
c) Statistically sufficient training sample size (200 TP)
d) Very large training sample size (300 TP)

Each classifier produces a thematic map as the output of classification. These maps were
used to obtain the test accuracy of the classifiers in terms of a confusion matrix. Accuracy
analysis of the resulting maps was performed using the kappa value; the different
algorithms were compared with a z-statistic for a one-tailed hypothesis test at the
95% confidence level (Congalton, 1991).
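For reference, the kappa value and the z-statistic used for these comparisons can be sketched as below. The variance shown is a simple large-sample approximation, not Congalton's full delta-method expression, and the function names are illustrative.

```python
import numpy as np

def kappa_and_var(cm):
    """Kappa and an approximate variance from a confusion matrix cm
    (rows = reference, columns = classified)."""
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()
    po = np.trace(cm) / n                      # observed agreement
    pe = (cm.sum(0) @ cm.sum(1)) / n ** 2      # chance agreement
    kappa = (po - pe) / (1 - pe)
    var = po * (1 - po) / (n * (1 - pe) ** 2)  # first-order approximation
    return kappa, var

def z_statistic(k1, v1, k2, v2):
    """One-tailed comparison of two independent kappa values;
    reject equality at 5% significance when z > 1.645."""
    return (k1 - k2) / np.sqrt(v1 + v2)
```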
For each classification technique, five bands of the OD or feature-extracted
data set (except the OSP and PP feature-extracted data sets) were chosen initially. The
number of bands was then incremented in steps of five up to the number of available
bands (which may differ between feature-extracted data sets), and the classification
was repeated to evaluate whether there was any improvement in accuracy.
This was performed for each set of TP.
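The band-increment protocol described above amounts to the following loop (a trivial sketch; train_and_kappa stands for any of the classifiers used in this work and is not an actual function of this thesis).

```python
def band_sweep(X_feat, y_train, idx_train, train_and_kappa, step=5):
    """Evaluate a classifier on the first 5, 10, 15, ... bands of a
    feature-extracted data set and record the resulting kappa values."""
    results = {}
    for n_bands in range(step, X_feat.shape[1] + 1, step):
        results[n_bands] = train_and_kappa(X_feat[:, :n_bands],
                                           y_train, idx_train)
    return results
```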
The dimension of the OSP feature-extracted data set is equal to the number of classes
present in the OD, and each band of the OSP feature-extracted data set contains
information corresponding to one class. Therefore, for classification, all bands of the
OSP feature-extracted data set should be taken together; otherwise, classification
errors may be produced. For all experiments in this thesis work, the eight bands of the
OSP feature-extracted data set were taken together.
The dimension of the PP feature-extracted data set is two, so the maximum
number of bands available for the PP feature-extracted data set is two. For all
experiments on the PP feature-extracted data set, both bands were taken together.
The methodology of the classification procedure for this thesis work is shown in
Figure 4.5.

Figure 4.5: Overview of classification procedure

4.3 First set of experiments (Set-I) using parametric and non-parametric classifiers
 
The Set-I experimental set-up was designed to investigate the results of the parametric
(GML) and non-parametric (KNN) classifiers. The classification was performed by
selecting different parameters of KNN and GML.
For KNN, initially three neighboring pixels were chosen; the neighborhood size was
then increased by one up to 11, and an additional run was performed with a
neighborhood size of 15. The experiment was conducted to study the effect of the
number of neighboring pixels on accuracy; however, there were negligible improvements
in accuracy for more than five neighboring pixels.
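For reference, the KNN decision rule itself can be sketched as follows (Euclidean distance and majority voting; the names are illustrative and this is not the implementation used in this work).

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, X_test, k=3):
    """Plain KNN sketch: for every test pixel, compute the distance to all
    TP and take a majority vote among the k nearest."""
    X_train, y_train = np.asarray(X_train), np.asarray(y_train)
    labels = np.empty(len(X_test), dtype=y_train.dtype)
    for i, x in enumerate(np.asarray(X_test)):
        d = np.linalg.norm(X_train - x, axis=1)   # n distances per pixel
        nearest = y_train[np.argsort(d)[:k]]
        labels[i] = Counter(nearest).most_common(1)[0][0]
    return labels
```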
The best classification results for KNN and GML for the feature-extracted data sets
as well as the OD were observed independently, along with the parameters responsible for
the best result. The experimental scheme is given in Figure 4.6.

Figure 4.6: Experimental scheme for Set-I experiments

4.4 Second set of experiments (Set-II) using advanced classifiers
 
The second set of experiments was designed with the advanced classifier, i.e. SVM
algorithms. Different optimization techniques and algorithms for SVM were chosen
to compare the accuracy and the time taken to train the classifier. In this thesis work,
SVM_QP, SVM_SMO and another approach, KPCA_SVM, were used to compare the
classification accuracy and time. As mentioned before, all these algorithms were
applied to the OD as well as to the feature-extracted data sets.

The purpose of this experiment is summarized below:

(i) Investigation of the best classification algorithm among these SVM
algorithms, with respect to accuracy and processing time
(ii) Identification of the best FE technique for the SVM classifier

For KPCA_SVM, the SV were initially extracted by solving the dual optimization
problem using the quadratic programming (QP) optimization method. The KPCA
algorithm with a Gaussian kernel was then applied to the SV, and the PCs were arranged in
descending order with respect to the eigenvalues of the kernel matrix. These PCs form the
new set of SV. In this research work, for all experiments related to KPCA_SVM,
about 70% of the original number of SV were chosen from the new set of SV (for details, see
section 3.2.3.4), because about 99% of the variance was stored in the first 70% of the PCs.
Finally, the SVM decision rule was applied to the new set of SV to obtain the classified map.

For SVM_QP and SVM_SMO, quadratic programming optimization and
sequential minimal optimization methods were used respectively to solve the dual
optimization problem. The classification scheme for Set-II experiment is given in
Figure 4.7.

Figure 4.7: The experimental scheme for advanced classifier (Set-II)
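The dual optimization problem solved by the QP optimizer can be written as a standard quadratic program. The sketch below shows the binary soft-margin case with an rbf kernel; it is not the exact implementation used in this thesis, cvxopt is used only as an example of an off-the-shelf QP solver, and C, gamma and all names are illustrative. Labels y are assumed to be +1/-1.

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_qp_train(X, y, C=1.0, gamma=1.0):
    """Solve the SVM dual:  min 0.5 a'Pa + q'a  s.t.  0 <= a <= C, y'a = 0."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    n = len(y)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * d2)                          # rbf kernel matrix
    P = matrix(np.outer(y, y) * K)
    q = matrix(-np.ones(n))
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))   # box constraints
    h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
    A = matrix(y.reshape(1, -1))
    b = matrix(0.0)
    solvers.options['show_progress'] = False
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])
    sv = alpha > 1e-6                                # support vectors
    bias = np.mean(y[sv] - (alpha * y) @ K[:, sv])   # rough bias estimate
    return alpha, sv, bias
```

In practice the multi-class problem of this work would be handled by combining several such binary machines (e.g. one-against-one), which is why the training time grows quickly with the number of TP.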

4.5 Parameters
Parameters also play an important role in HD classification, so choosing them is an
important task. All the parameters chosen for the different FE techniques
and classification algorithms are listed in Table 4.1.

FE technique     Parameters
SPCA             Correlation matrix of the bands
PP               No. of random searches – 5; half – 15; stopping value – 0.01
KPCA             Kernel function – rbf
OSP              No. of endmembers – 8

Classifier       Parameters
GML              Confidence interval – 99%
KNN              Neighbors – 3, 4, 5, ..., 11 and 15
SVM              Kernel function – rbf

Table 4.1: List of parameters

CHAPTER-5
RESULTS

This chapter provides observations from the various experiments and their
interpretation. Starting with the visual interpretation of the feature-extracted data sets, the
chapter discusses the results of the GML classifier on the feature-extracted data sets. These
results are compared with the best result for GML as observed by Abhinav (2009).
It then discusses the effect of the KNN classification algorithm on the OD and feature-
extracted data sets, followed by a discussion of the results of the different SVM
algorithms.

5.1 Visual inspection of feature extraction techniques
 

Apart from comparison of k-values, the features extracted by the various FE
techniques can be visually inspected using grayscale views of the first few features.
The image form of the correlation matrix is also used for this purpose.
From the correlation image of the OD (Figure 5.1), it is clear that there are three
highly correlated blocks of bands. The first block contains 32 bands, the second 6
bands and the last 27 bands (Figure 5.1). The average correlation values for the
blocks are 0.931, 0.997 and 0.941 respectively. The OD was therefore segmented into
these three blocks of bands, and the PCT was applied using the correlation matrix of
each block, giving the SPCA feature-extracted data set. The total time taken to
complete this process was about 8 seconds.
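The block structure reported above can be reproduced with a few lines of NumPy (a sketch; the block sizes follow the description of Figure 5.1, and the function name is illustrative).

```python
import numpy as np

def correlation_blocks(X, block_sizes=(32, 6, 27)):
    """X: (pixels x bands). Returns the band correlation matrix and the
    average absolute off-diagonal correlation within each block."""
    R = np.corrcoef(X, rowvar=False)
    averages, start = [], 0
    for size in block_sizes:
        block = R[start:start + size, start:start + size]
        off_diag = block[~np.eye(size, dtype=bool)]   # drop the unit diagonal
        averages.append(np.abs(off_diag).mean())
        start += size
    return R, averages
```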

Figure 5.1: Correlation image of the OD, consisting of three blocks of 32, 6 and 27
bands respectively.

In the PP process, two-dimensional structures are found sequentially from the most
to the least interesting. Two such structures (the first being the most interesting) are
given in decreasing order in Figure 5.2. The PP index after five
random searches was 0.3825 and the size of the neighborhood (c) around the best
projection plane was 0.011. The total time taken to complete the whole process was about
11.30 hours. Table 5.1 presents the time required for each FE technique under
different constraints.

Table 5.1: The time taken for each FE technique

FE method            Time
SPCA                 6-8 seconds
KPCA (rbf kernel)    4 minutes for 25 TP; 5.5 minutes for 100 TP; 6.3 minutes for 200 TP;
                     8.5 minutes for 300 TP; 10 minutes for 400 TP
OSP                  90 seconds for 8 endmembers
PP                   11.30 hours

Figure 5.2: Projection of the data points onto the plane (α*, β*). (a) Most interesting
projection direction. (b) Second most interesting projection direction.

The grayscale images of the feature-extracted data obtained using the various FE techniques are
provided in Figures 5.3 to 5.6, followed by the corresponding correlation images
shown in Figure 5.7.

Figure 5.3: First six segmented principal components (SPCs), (a) SPCA-1 to (f) SPCA-6;
(b) shows water body and salt lake.

Figure 5.4: First six kernel principal components (KPCs), (a) KPCA-1 to (f) KPCA-6,
obtained by using 400 TP.

Figure 5.5: First six OSP features, (a) OSP-1 to (f) OSP-6, obtained by using eight
endmembers; (b) shows vineyards and wheat, (c) shows bare soil, (d) shows salt lake.

Figure 5.6: Two components of the most interesting projections, (a) PP-1 and (b) PP-2;
(a) shows salt lake.

Figure 5.7: Correlation images after applying various FE techniques: (a) SPCA, (b) KPCA,
(d) OSP, (e) PP (grayscale from 0.0 to 1.0).

The following were observed based on visual inspection of the feature-extracted
data sets (Figures 5.3 to 5.6) and their correlation images (Figure 5.7):

(i) Since the extracted SPCs are ranked according to their eigenvalues, most of the
information is concentrated in the first four SPCs; no interesting structures could
be visually identified beyond the 4th SPC. As SPCA uses the local correlation of
the bands rather than the global correlation (as in PCA), it decorrelates the
involved bands more strongly than PCA, so a better classification result is
expected from the SPCs. It was also visually observed that SPCA-2 is
associated with the water body and salt lake classes.
(ii) The first few features extracted by KPCA were visually inferior to those
obtained by SPCA (not revealing any class clearly). Some features, like KPCA-1
and KPCA-2, show water body and salt lake prominently, but other classes are
also present there.

(iii) OSP is generally used to extract the same number of features as the number of
classes present in the data set (in this case eight classes; hence eight features).
Although the number of features extracted by OSP is low, it can identify some
structures prominently. For example, OSP-4 identifies salt lake, OSP-2
identifies vineyards and wheat, and OSP-3 shows bare soil. From the OSP
algorithm, it can be suggested that each band of the OSP extracted data set is
associated with one of the predefined classes. Therefore, OSP is expected to
perform well for classification.
(iv) The dimension of the PP extracted feature set is two. Salt lake can be
identified very clearly from the first extracted feature, but the second feature
contains no identifiable structures and gives a hazy appearance.
(v) The quality improvement of the features extracted by the different FE techniques can
be observed by comparing the correlation images of the OD (Figure 5.1) and the
feature-extracted data (Figure 5.7). The correlation matrices of the SPCA
and PP extracted data sets are found to be perfectly diagonal, with diagonal values
equal to unity and all off-diagonal elements zero. On the other hand, the data
extracted using the supervised FE techniques (OSP, KPCA) are correlated.
This is because the SPCA and PP algorithms extract only orthogonal features,
while the FE criterion is different for OSP, so highly correlated features are
observed for OSP. For the correlation image of the KPCA feature-extracted data
set, it can be observed that the correlation is unity along the diagonal and decreases
with increasing distance from the diagonal, except for bands 80 to 100, which are
observed to be fully uncorrelated.

5.2 Results for parametric and non-parametric classifiers

This section presents the results of the GML and KNN classifiers using
different data sets. First, it describes the results for the GML classifier, followed by
KNN.

5.2.1 Results of classification using GML classifier (GMLC)


   

The performance of GMLC with the feature modified data sets (SPCA, KPCA,
OSP and PP FE methods) was compared with the best result obtained by Abhinav (2009)
for the GML classifier, to evaluate the improvement in classification due to these FE
techniques. It may be noted that he obtained his best results with the PCA modified data
set. Figure 5.8 shows the k-values obtained for the different feature modified data sets.
The following observations can be made from Figure 5.8:

(i) Considering the cases with sufficient TP (100, 200, 300), the k-values
obtained for the PCA, SPCA and OSP extracted data sets were observed to be
higher than those for the PP and KPCA modified data sets.
(ii) For statistically insufficient TP (25), GML performs poorly for the SPCA, PCA
and OSP modified data sets. As the number of bands increases, beyond a
certain number of bands the k-value for the PCA and SPCA modified data sets
becomes negative for 25 TP per class. This is because, to obtain a numerically
well-conditioned inverse of a p × p matrix, at least p + 1 sample points are
required. Due to this, GML fails when more than 25 bands are used with 25 TP
per class, as these are insufficient for computing the inverse of the class
covariance matrices (a small numerical illustration follows this list).
(iii) An interesting phenomenon can be observed for the k-values of the KPCA modified
data set. The k-value increases for the first 35 bands, then falls suddenly at
40 bands, and from 45 bands onwards it starts to increase again. The result
for the KPCA modified data set is observed up to 65 bands (the dimension of the OD is
65).
(iv) The k-values obtained for SPCA and OSP appear to outperform those
obtained for PCA and KPCA.
(v) The performance of PP is found to be very poor due to the very low number of
features (two). Hence, PP was not considered any further for
classification.
(vi) For all FE techniques (except KPCA and OSP), the k-values increase
significantly with the number of bands up to a critical number of
bands (say, Ncri), after which no improvement could be observed in the k-values.
This is due to the fact that the features extracted by these techniques are
arranged in decreasing order of eigenvalues, so the useful information is
stored in the first few features only, while the lower-order features contain
little useful information and are very noisy. When noisy bands
are added, the probability of misclassification increases and, as a result, the
classification accuracy becomes stagnant.
(vii) Ncri is different for different sets of TP: when the number of TP increases, Ncri
increases. Because of the Hughes phenomenon, classification with a large number of
bands provides poor results unless the number of TP is large.
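The covariance-inversion issue noted in observation (ii) can be illustrated numerically with synthetic data (a sketch, not the DAIS data itself):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 30, 25                          # 30 bands but only 25 TP per class
samples = rng.standard_normal((n, p))
cov = np.cov(samples, rowvar=False)
# the sample covariance has rank at most n - 1 = 24 < 30, so it is singular
print(np.linalg.matrix_rank(cov))      # prints 24; GML cannot invert it
```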

Figure 5.8: Overall kappa value observed for GML classification on different feature
extracted data sets (PCA, SPCA, KPCA, OSP, PP) using different selected bands.
To confirm these observations, statistical analysis was performed. The k-
values obtained for each FE technique are given in Table 5.2. The best results
obtained by GML classification on the different feature extracted data sets for three
training data sets (100, 200 and 300 TP) were selected for comparison with the best
GML result obtained with the PCA extracted data set. The criterion for selecting the
best classification result (best k-value) is the least number of bands after which
no statistically significant improvement in k-value could be achieved. A comparison of
the best results between the PCA and other FE modified data sets, and among the
various FE techniques, is presented in Table 5.2 in terms of z-statistic values obtained
from one-tailed hypothesis testing at the 5% significance level.

The following observations can be made from Table 5.2.

(i) PCA and SPCA were found to give statistically similar results for 100 and
300 TP per class, while SPCA provides a statistically significantly better result
than PCA for 200 TP per class. SPCA can thus be regarded as an improvement over PCA.
(ii) In the case of OSP, a statistically better result could not be achieved for the statistically
exact TP set (100 TP per class), but when the number of TP increases it provides
a statistically better result than PCA. For the large TP set (300), a statistically
similar result to PCA is obtained.
(iii) For the 200 and 300 TP sets, SPCA and OSP provide statistically similar results, but for
the small set of TP (100) SPCA provides a better result than OSP.
Since the SPCA extracted data set is more nearly orthogonal than the OSP extracted data
set, it can be concluded that SPCA is a better FE technique than OSP for
GML classification.
(iv) The PP extracted data set always provides statistically much poorer results than OSP,
because of the low dimensionality (dimension 2) of the PP extracted data set.
(v) KPCA always fails (for large or small TP) to provide a statistically better result
than PCA or OSP, and OSP is statistically better than PP for all sets of TP.
Again, SPCA provides statistically better results than PCA or OSP. Therefore, it
can be concluded that SPCA is the best FE technique among PCA and the other
techniques OSP, PP and KPCA.

(vi) The best kappa accuracy for GML classifier is obtained by using SPCA
extracted data set with 300 TP. The kappa value is 0.9589 and the number of
bands used for classification is 45.

Table 5.2: Best kappa values and z-statistics (at 5% significance level) for GML

TP    PCA           SPCA          KPCA          OSP           PP            z-statistic
      k1     NB*    k2     NB     k3     NB     k4     NB     k5     NB     Z12    Z13    Z14    Z24    Z34     Z45
100   0.9362 20     0.9384 20     0.8489 35     0.9205 8      0.2220 2      -1.35  41.95  8.87   10.51  -41.95  222.81
200   0.9460 20     0.9579 30     0.8332 35     0.9505 8      0.2146 2      -3.97  53.47  -4.07  0.99   -53.47  304.51
300   0.9568 40     0.9589 45     0.8569 35     0.9572 8      0.2228 2      -1.45  50.65  -0.28  1.20   -50.65  290.75

NB* = number of bands used; $Z_{12} = (\hat{k}_1 - \hat{k}_2)/\sqrt{\hat{\sigma}_1^2 + \hat{\sigma}_2^2}$, and similarly for the other $Z_{ij}$.

From Table 5.3 it is observed that the best results for the PCA, SPCA and KPCA
extracted data sets were obtained for 30-45 features at 300 TP, and for the OSP extracted
data set for 8 features at 300 TP. During the experiments, it was seen that GMLC took
around 55-70 seconds to process 30-45 bands for 300 TP per class for the SPCA and
PCA extracted data sets, and about 32 seconds for the OSP extracted data. OSP
provides a statistically similar result to PCA and SPCA for 300 TP, but its processing
time is much shorter than for the other FE techniques. Therefore, considering both
accuracy and processing time, OSP can be rated as the most effective FE technique for
GMLC at 300 TP. For statistically insufficient TP (25) and statistically sufficient TP (200),
SPCA is rated as the best FE technique, and for 100 TP per class the performance of PCA
and SPCA for GMLC is the same. From Figure 5.9, it can be observed that GMLC on OSP
data is faster than on the output of any other FE technique, while PCA and SPCA take
about the same time to provide their best k-values.

Table 5.3: Ranking of FE techniques and time required to obtain the best k-value

TP    SPCA                   PCA                    KPCA                   OSP                    PP
      k1*    Time(s)* Rank    k2     Time(s) Rank    k3     Time(s) Rank    k4     Time(s) Rank    k5     Time(s) Rank
25    0.8409 53.6     1       0.8296 53.6    2       0.8215 59.7    2       0.2700 35.4    3       0.1960 -       4
100   0.9384 60.6     1       0.9362 60.6    1       0.8489 75.6    3       0.9205 39.2    2       0.2220 -       5
200   0.9579 65.2     1       0.9460 59.4    3       0.8332 74.4    4       0.9505 36.7    2       0.2146 -       5
300   0.9589 83.5     1       0.9568 72.3    1       0.8569 62.8    2       0.9572 39.8    1       0.2228 -       3

Time(s)* = time in seconds to obtain the best k-value; ki* = k-value for the ith FE technique; Rank 1 indicates the best.

Figure 5.9: Comparison of kappa values and classification times for the GML
classification method.

5.2.2 Class-wise comparison of results for GMLC

The class-wise accuracy for GMLC has been observed for the different feature
extracted data sets. From Figure 5.10, the following can be observed:

(i) For all sizes of TP, GMLC can extract the salt lake class from all feature
extracted data sets with a very high k-value. The water class is also extracted with a very
high k-value for all feature modified data sets (except PP). From the PP modified
data set, only the salt lake class can be separated with a satisfactory k-value,
because the first feature of the PP modified data set distinguishes salt lake very
clearly (Figure 5.6).
(ii) Besides salt lake and water body, GMLC separates the hydrophytic veg class from the other
classes with a very high k-value for all feature extracted data sets and all sets of
TP. For 300 TP, GMLC separates the hydrophytic veg class with very high
accuracy from the SPCA modified data set.
(iii) GMLC classifies vineyards and wheat with about the same k-value. Some
vineyard pixels have been classified as wheat due to the presence of mixed
pixels. The classification accuracy for the vineyards, bare soil, pasture land and built-
up area classes is about the same for 200 and 300 TP for the SPCA, PCA and OSP
modified data sets, and lower for the KPCA modified data set.
Figure 5.10: Best producer accuracy of individual classes observed for GMLC on different
feature extracted data sets with respect to different sets of TP (25, 100, 200 and 300
training pixels). Class abbreviations: WT – water, SLT – salt lake, HV – hydrophytic veg,
WHT – wheat, VY – vineyards, BS – bare soil, PL – pasture land, BUA – built-up area.

5.2.3 Classification results using KNN classifier (KNNC)

To understand the effect of FE techniques on the KNN classifier, experiments were
performed with the OD as well as the feature extracted data. The same sets of TP, as used
in GML classification, were chosen to compare classification accuracy. Observations
from Figures 5.11 to 5.14 are as follows:

(i) In the case of KNN, poor performance is observed for the statistically insufficient TP
set (i.e. 25 TP). However, KNN on the OD performs better than on the PCA, OSP and SPCA
extracted data sets. The maximum k-value was obtained for 65 bands and three
neighbors. For the KPCA extracted data set, the k-value was comparatively better
than for the OD when 50 bands were taken, for all numbers of neighbors. PP was not taken into
the accuracy analysis because, due to its very low dimensionality, it would not be able to
provide good k-values.
(ii) For statistically exact TP (100 TP), the performance of KNN on the OD is better
than on any other feature extracted data set. Increasing the number of bands increases the
k-values for all feature extracted data sets except SPCA, for which no
significant change was observed. However, changes were easily observed when the
number of neighbors was increased: after a critical number of
neighbors (say, Nnbd), the k-value starts decreasing, independently of the
number of bands. This may be due to the effect of noisy points present in the training
data set, since a large number of neighbors increases the chance of using
noisy TP and, consequently, misclassification errors accumulate.
(iii) For 200 TP per class, no improvement over the OD is observed for the PCA, KPCA and
OSP extracted data sets, but an improvement was observed for the SPCA
extracted data set. No appreciable change was observed for the PCA and KPCA
extracted data sets for KNNC with the 100 and 200 TP sets respectively. The effect of
the neighborhood on accuracy can be viewed from Table 5.4; the highest k-value is
always achieved for the first few neighbors for all sets of TP.
(iv) For the large training data set (300 TP), the k-values obtained with the PCA and SPCA
extracted data sets are better than those for the OD. After a
certain threshold neighborhood, the k-value decreases monotonically for the PCA,
OSP and SPCA extracted data sets.
(v) The KPCA extracted data set provides better results at high dimension, since it is
more refined than the PCA or SPCA extracted data sets.
(vi) For all training data sets except the statistically insufficient one, the k-value for the OSP
extracted data set varies only a little (0.02 - 0.05) because of its very low
dimensionality. If the number of extracted endmembers were large enough, the result
could be improved further.

(vii) Another important aspect was observed for the feature-extracted data sets: the
difference between the k-values (for all sets of TP) obtained using the minimum and
maximum number of bands is about 0.15 to 0.20. This could be because most
of the information is gathered in the first few bands of the feature-extracted data
sets, so additional bands cannot provide enough useful information to change the k-
value significantly.

Table 5.4: Classification with KNNC on OD and feature extracted data sets

Data sets    100 TP          200 TP          300 TP
             Bnd*   NN*      Bnd    NN       Bnd    NN
Original     55     3        35     3        30     3
PCA          35     5        45     5        20     3
SPCA         10     3        15     3        40     3
KPCA         35     3        45     3        30     6
OSP          8      3        8      3        8      3
PP           2      15       2      11       2      15

Bnd* = number of bands for which the best k-value was obtained; NN* = number of neighbors for which the best k-value was obtained.

Figure 5.11: Overall accuracy observed for KNN classification of OD and feature extracted
data sets (Original, PCA, SPCA, KPCA, OSP, PP) for 25 TP. NNb: number of nearest neighbors.
Figure 5.12: Overall accuracy observed for KNN classification of OD and feature extracted
data sets (Original, PCA, SPCA, KPCA, OSP, PP) for 100 TP. NNb: number of nearest neighbors.
Figure 5.13: Overall accuracy observed for KNN classification of OD and feature extracted
data sets (Original, PCA, SPCA, KPCA, OSP, PP) for 200 TP. NNb: number of nearest neighbors.
Figure 5.14: Overall accuracy observed for KNN classification of OD and feature extracted
data sets (Original, PCA, SPCA, KPCA, OSP, PP) for 300 TP. NNb: number of nearest neighbors.
The k-values for the classification of these data sets were analyzed to select the
best results for each data set. A similar approach to that used for GML was followed
here. The z-statistic values obtained for the selected best k-values are shown in Table 5.5.
The following can be inferred from these results:

(i) Results obtained using the PCA and SPCA modified data sets were found to be
significantly better than those obtained using the OD for the large training data
size (300). However, SPCs and PCs were still found to perform worse than the
OD for 100 TP. Statistically similar results were obtained for the OD and SPCA
modified data sets using a training data set of 200 TP. For the other feature
extracted data sets and for all sets of training data, the OD provides statistically
better results for KNN classification.
(ii) The best results were obtained with the OD using 30 to 55 bands and three
neighbors. For 300 TP, statistically better results than the OD were obtained using
the SPCA (40 bands) and PCA (20 bands) modified data sets with three neighbors.
For 200 TP, the SPCA modified data set (15 features and 3 neighbors) provides
statistically similar results to the OD.
(iii) The SPCA extracted data sets were observed to perform statistically
significantly better than the PCA extracted data sets with the smaller training data sets,
whereas the best results obtained with the 300 TP training data set using SPCs were
statistically similar to those obtained with PCs.
(iv) SPCs were also observed to perform significantly better than the KPCA and
OSP modified data sets for all training data sets. In addition, the best results
for PCA and OSP were found to be statistically poor for all training data sizes.

Table 5.5: The best k-values and z-statistics for KNNC

TP    OD            KPCA          SPCA          PCA           OSP           z-statistic
      k1     NB*    k2     NB     k3     NB     k4     NB     k5     NB     Z12    Z13    Z14    Z23     Z34    Z45
100   0.8889 55     0.7773 35     0.8669 10     0.7715 35     0.8268 8      42.51  9.42   44.72  -34.98  37.24  -20.55
200   0.9037 35     0.7881 45     0.9040 15     0.8062 45     0.8514 8      48.98  0.15   41.31  -49.10  11.43  -17.72
300   0.9244 30     0.8141 30     0.9325 40     0.9320 20     0.8701 8      47.91  -4.58  -4.29  -52.68  0.29   30.95

NB* = number of bands used to obtain the best k-value.

The time taken to train the KNN classifier is highly affected by the number of TP.
This is due to the fact that a distance needs to be computed between each test
pixel and every TP; increasing the number of TP therefore extends the calculation time,
i.e. for n TP and m test pixels, nm distances are calculated. However,
increasing the number of neighbors has a much smaller effect on run time: the times
taken for classification with three and with 15 neighbors are almost
similar (maximum difference 60-120 seconds) (Figure 5.15). It is also
noticed that increasing the number of bands affects the calculation time roughly
proportionally (Figure 5.15). From Figure 5.16, it can be observed that PCA takes the least
time compared to the OD and SPCA extracted data to provide its best result. Considering
the time constraint and the k-value, PCA can be chosen as the best FE technique, followed
by SPCA, among the available techniques for KNN classification. Figure 5.15 shows
the comparison of time between 200 TP and 300 TP for the same number of bands and
neighbors. The rank of the FE techniques with respect to accuracy for KNNC for each set of
TP can be inferred from Table 5.6.
From Table 5.6, it is further observed that for the statistically exact TP size (i.e.
100), KNNC produced its best result with the OD. For statistically sufficient TP (i.e. 200),
SPCA secured the first rank, and for statistically large TP (i.e. 300), SPCA and PCA
both perform best. Therefore, it is concluded that, among all the data sets (feature
modified and original), SPCA and PCA provide the best results for KNNC, which,
considering time as well, makes PCA the best FE technique among these for KNNC.

Table 5.6: Rank of FE techniques and time required to obtain the best k-value (Rank 1
indicates the best)

TP    Original                KPCA                   SPCA                   PCA                    OSP
      k1     Time(s)* Rank     k2     Time(s) Rank    k3     Time(s) Rank    k4     Time(s) Rank    k5     Time(s) Rank
100   0.8889 875.1    1        0.7773 722.9   4       0.8669 661.2   2       0.7715 789.6   5       0.8268 655.2   3
200   0.9037 1200.6   1        0.7881 1271.1  4       0.9040 1122.1  1       0.8062 1272.0  3       0.8514 1022.7  2
300   0.9244 1574.6   2        0.8141 1556.0  4       0.9325 1712.5  1       0.9320 1434.0  1       0.8701 1291.9  3

Time(s)* = required time in seconds.

Figure 5.15: Time comparison for KNN classification: time for different bands at different
numbers of neighbors (NNb) for (a) 300 TP and (b) 200 TP training data per class.

Figure 5.16: Comparison of best k-value and classification time for original and
feature extracted data sets.

5.2.4 Class-wise comparison of results for KNNC

From Figure 5.17, the following observations can be made for the class-wise accuracy
of KNNC. KNNC extracts the water and salt lake classes with very high accuracy for both the
feature modified data and the OD. However, the built-up area is classified very poorly
due to the presence of a large number of mixed pixels; some built-up area pixels are
classified into the hydrophytic veg, wheat and pasture land classes for all data sets.
The performance of the OD, KPCA and OSP modified data sets is lower than that of the
SPCA and PCA modified data sets for classification of the hydrophytic veg class for all
sets of TP, and similarly for the vineyards and built-up area classes for all data sets and TP.

Figure 5.17: Class-wise accuracy comparison of OD and different feature extracted data
for KNNC (100, 200 and 300 training pixels). Class abbreviations: WT – water, SLT – salt
lake, HV – hydrophytic veg, WHT – wheat, VY – vineyards, BS – bare soil, PL – pasture land,
BUA – built-up area.

5.3 Experiment results for SVM based classifiers

In this section, the results of the different SVM algorithms are described. First,
it describes the results of the SVM_QP algorithm, followed by SVM_SMO and
KPCA_SVM. The section also provides a comparison of the classification times of the
different SVM algorithms.
5.3.1 Experiment results for SVM_QP algorithm
Using the optimal set of parameter values (Table 4.5, recommended by
Abhinav, 2009) for the SVM classifiers, classifications were performed on the feature modified
data sets. The results from these experiments are compared with the best result obtained
by Abhinav (2009) for the SVM classifier; he noted that the performance of SVM_QP was
best for the PCA extracted data set. The same training and input data sets were used as
for the GML and KNN classifiers. The classification results obtained by SVM are
presented in Figure 5.18, from which the following observations can be made:
(i) The k-values are seen to improve with increasing training data size for all
input data set types (PCA, SPCA, KPCA, OSP and PP modified data sets).
(ii) The best classification results were obtained with the PCA and SPCA modified data
sets. For the KPCA modified data set, the k-values increase as the number of bands
increases; it is possible that at very high dimension the KPCA extracted data set could
provide k-values as high as the SPCA or PCA extracted data sets.
(iii) An increase in k-values was observed for the PCs and SPCs, which stagnates after a
critical number of features and then starts to decrease gradually. This
could be due to the same reason discussed for the GML classification algorithm.
(iv) A similarity can be observed for the KPCA, PCA and SPCA modified data sets: for
statistically insufficient TP (25), the k-values suddenly drop to about zero for
classification using 50 bands. The reason is not clear; probably, with this
number of bands and TP, SVM_QP was unable to find a proper decision
boundary.
(v) The best results for the KPCA and OSP extracted data sets are about similar for each
set of TP, except for 25 TP.

Figure 5.18: Overall kappa values observed for classification of FE modified data sets
((a) PCA, (b) SPCA, (c) KPCA, (d) PP, (e) OSP) using SVM with the QP optimizer.

The k-values for the classification of these data sets were statistically analyzed
to select the best results for each data set. The approach was similar to that followed
in the case of GML. The z-statistic values obtained for the best k-values are shown in Table
5.7. The following can be inferred from these results:

(i) PCA and SPCA were found to give statistically similar results for all sets of
TP. On the other hand, PCA always provides statistically significantly better
results than the KPCA and OSP modified data sets for all sets of TP for the SVM_QP
classifier.
(ii) Classification with the SPCA modified data set always performs statistically better
than with the KPCA modified data set for all sets of TP. OSP performs
statistically better than the KPCA modified data set for 100 and 200 TP per class,
while for the large set of TP (300) OSP performs statistically similarly to the KPCA
modified data set.
(iii) It is also observed from Table 5.7 that the SPCA modified data
set always performs statistically better than the OSP modified data set.
(iv) It can be concluded that PCs and SPCs have a better ability to improve the k-
value than the other FE techniques, and that KPCA performs the worst among all the
FE techniques.

Table 5.7: The best kappa accuracy and z-statistics for SVM_QP on different feature
modified data sets

TP    PCA           KPCA          SPCA          OSP           z-statistic
      k1*    NB*    k2     NB     k3     NB     k4     NB     Z12    Z13    Z14    Z23     Z24    Z34
100   0.9408 15     0.8703 55     0.9408 15     0.8874 8      36.30  0.00   28.70  -36.30  -7.79  28.70
200   0.9621 15     0.8901 65     0.9573 15     0.9050 8      7.89   0.53   6.26   -33.39  -7.26  30.40
300   0.9643 15     0.9090 60     0.9691 20     0.9069 8      6.07   -0.59  6.30   -7.40   1.06   7.65

NB* = number of bands used to achieve the best k-value; ki* = k-value for the ith FE technique.

During the above experiments, it was observed that the time taken to train the SVM
based classifier is strongly affected by the number of training samples used, because a
kernel matrix has to be computed for every pair of TP. There were only very small changes
in training times with increasing number of bands.
Generally, the total time taken to perform SVM based classification was
observed to range from 23 to 102 seconds when bands were increased from 5 to
65 for 25 TP. The same range for 100 TP was observed as 82 to 273 seconds, and 522
to 615 seconds for 200 TP.
An important aspect has been observed for the classification time using 200 TP
with the SPCA modified data set (Figure 5.19). When the bands are increased, after a
critical number of bands (30 bands), the classification time decreases monotonically.
The same trend was observed for 300 TP per class. This could be due to the fact that, by
using a large number of TP and a large number of bands, SVM_QP was unable to find a
sufficient number of support vectors required for classification. For the SPCA or PCA
modified data sets, except for the first few bands, all remaining bands contain a large amount
of noise. Due to the presence of this noise, the optimization problem might not be solved
properly for a large number of bands with a large set of TP for the SPCA or PCA modified
data sets, which means that a sufficient number of SV could not be found. When the
number of SV is smaller, the classification time is also smaller and the k-values may also
decrease. This is supported by Figure 5.18 (a), (b), where it is observed that for the
SPCA and PCA modified data sets the k-values start to decrease after 25 bands.
Exceptionally high times, of the order of 2600 seconds, were observed when
the training data size was increased to 300 TP. Such high times were observed due
to the QP optimizer used. Varshney and Arora (2004) suggested a few better
optimizers which would give the same classification accuracies in shorter training
times. It is known that the same performance would be achieved irrespective of the choice of
optimizer in the case of SVM, as it makes use of statistical learning theory, as pointed
out by Varshney and Arora (2004).
Figure 5.19: Classification time comparison using 200 and 300 TP per class.

5.3.2 Experiment results for SVM_SMO algorithm


The classification results obtained using SVM with SMO optimization
techniques are presented in Figure 5.20. The rbf kernel function is used for
classification of different data sets using SVM_SMO algorithm. The following
observations can be made on the basis of k-value presented in Figure 5.20:

(i) The k-values can be seen to improve with increasing training data size
(except for 200 TP) for all input data sets.
(ii) Like SVM_QP, a sudden decrease in k-value is observed with 25 TP for the
OD, SPCA, KPCA and OSP extracted data sets. For all data sets, this
happens at 50 features.
(iii) For all data sets (except the KPCA extracted data), the statistically sufficient
training data set (200 TP) is unable to provide a positive k-value. This could
be due to a failure in solving the optimization problem for these data sets using
200 TP. For the KPCA extracted data set, the first few bands provide a very low k-
value for 200 TP; from 20 bands onwards, the k-value provided by the KPCA
extracted data set for 200 TP is acceptable.
(iv) Increasing k-values were observed for the original and KPCA modified data sets,
stagnating after a critical number of features and then starting to
decrease, for the same reason as reported for the GML classifier. For
the OD and KPCA modified data sets, however, the k-values increase almost monotonically for
100 and 300 TP per class.
(v) For the PP modified data set, very low k-values are observed. Therefore, all
the results for the PP extracted data set are ignored in the comparison of results of the
SVM_SMO classifier.

The k-values for the classification of these data sets were statistically analyzed
to select the best results for each data set. The approach was similar to the one
followed in previous cases. The z-statistic values are obtained to compare each data
set. The best k-values are shown in Table 5.8. The following can be inferred from
these results:

(i) The best results obtained using the feature modified data sets were found to be
significantly better than those obtained using the OD for the large training
data size (300 TP). For the OSP modified data set the improvement is marginal, but it can
still be said to be significantly better than the OD. The performance of the OD, SPCA and
OSP modified data is very poor for the 200 TP training data, whereas the performance
of the KPCA modified data is very high. SPCs were found to perform statistically
better than the OD for 100 TP per class and statistically similar to the OD for
200 TP.
(ii) The best results were obtained with the OD using 50-60 bands, while
significantly better results than the OD were obtained using the SPCA modified
data sets with 15-30 features. For 300 TP, a statistically similar result to the OD
is obtained using the OSP modified data set with eight bands.
(iii) KPCs were observed to perform significantly better than the SPCA and
OSP modified data sets for 200 TP. For 100 and 300 TP, the best results
obtained with the SPCA modified data set are significantly better than those for the OSP and
KPCA modified data sets.
(iv) Classification with OSP is found to be significantly better than with KPCA for
100 TP, while KPCA is observed to be statistically better than the OSP modified
data for 200 and 300 TP. Thus it can be said that SPCA performs better
than the OD and any other feature extracted data, and that the performance of OSP is the
worst for SVM_SMO based classification.

Figure 5.20: Overall kappa values observed for classification of original and FE
modified data sets (Original, SPCA, KPCA, OSP and PP) using SVM with the SMO optimizer.

Table 5.8: The best k-values and z-statistics for SVM_SMO on OD and different feature
modified data sets

TP    OD            KPCA          SPCA          OSP           z-statistic
      k1     NB*    k2     NB     k3     NB     k4     NB     Z12     Z13     Z14    Z23     Z24     Z34
100   0.8955 50     0.8626 40     0.9304 15     0.8739 8      15.00   -15.91  9.85   -33.90  -4.99   28.25
200   0.1694 5      0.8826 50     0.1694 5      0.0001 8      -336.2  0.00    12.90  475.46  630.00  -
300   0.8934 60     0.9013 50     0.9436 30     0.8999 8      -3.80   -26.98  1.65   -23.75  5.56    28.86

NB* = number of bands used to obtain the best k-value; ki* = k-value for the ith FE technique.
Figure 5.21: Comparison of classification time for different sets of TP with respect to
the number of bands for the SVM_SMO classification algorithm.

The total time taken to perform SVM_SMO based classification was observed
to range from 55-90 seconds when bands were increased from 5 to 65 for 25 TP;
the same range for 100 TP was observed as 145-194 seconds, 350-409 seconds for 200
TP and 1184 to 1814 seconds for 300 TP (Figure 5.21). Unlike SVM_QP, it is observed that
the classification time increases as the number of bands increases. The time
requirement for a large number of TP for SVM_SMO is observed to be significantly less
than for the SVM classification method based on the QP optimizer. This is due to the SMO
optimization method: the solution derived by SMO needs very few
numerical operations per iteration. The method needs a larger number of iterations, but each requires a
small number of operations, thus resulting in an increase in optimization speed for
very large data sets.

5.3.3 Experiment results for KPCA_SVM algorithm

The classification results obtained using the KPCA_SVM algorithm (QP optimizer)
are presented in Figure 5.22. The rbf kernel function is used for classification of the
different data sets using the KPCA_SVM algorithm. The following observations can be
made on the basis of the k-values presented in Figure 5.22:
(i) For the OD and KPCA extracted data, unpredictable behavior of the KPCA_SVM
classifier is observed for all sets of TP and numbers of bands. The maximum k-
value for the OD is obtained for 200 TP with 35 bands, and for KPCA for 200 TP with
25 bands.
(ii) For the SPCA extracted data set, the k-values drop to about zero after 20 bands for
each set of TP. The maximum k-value obtained with SPCA is better than that obtained with
the OD and KPCA extracted data sets, and for each set of TP it is
obtained with five bands.
(iii) For the OSP extracted data set, the highest k-value is obtained for 200 TP. This value
is higher than the k-values obtained with the other feature modified data sets for
200 TP; the reverse is seen for the OSP modified data set
with 300 TP.
(iv) One important phenomenon is observed for the KPCA_SVM algorithm: for the large
set of TP (300), KPCA_SVM provides a very low k-value. The best k-value is
obtained for all data sets using 200 TP per class.

Figure 5.22: Overall kappa values observed for classification of original and feature
modified data sets (Original, SPCA, KPCA, OSP) using the KPCA_SVM algorithm.

The k-values for the classification of these data sets were analyzed to select the
best results for each data set. The approach was similar to that followed in previous
cases. The following can be inferred from these results (Table 5.9):
(i) The best results obtained using the feature modified data sets (except KPCA) were
found to be significantly better than those obtained using the OD for 200
TP. For 300 TP, the OD provides statistically better results than the other feature
modified data. The performance of the OD, KPCA and OSP modified data is not good;
however, the performance of the SPCA modified data is very high for 100 TP. SPCA is
observed to perform statistically better than the OD for 100 TP per class.
(ii) The best results were obtained with the OD using 50-60 bands, while
significantly better results than the OD were obtained using the SPCA modified data
sets with five to ten features for 100 and 200 TP per class. For the OSP modified
data set, a statistically better result than the OD is obtained using 200 TP with eight
bands.
(iii) SPCs were observed to perform significantly better than OSP for 100 and
300 TP, while OSP performs statistically better than SPCs for 200 TP. KPCs
perform statistically better than OSP for 100 TP; however, the performance of
KPCs for the 200 and 300 TP training data is statistically significantly lower than that of OSP.
(iv) SPCs always perform statistically better than KPCs, and OSP performs better
than SPCA only for 200 TP. It can be concluded that for 100, 200 and 300 TP,
KPCA_SVM performs best with the SPCA modified data set, the OSP modified data set and
the OD respectively. KPCA_SVM provides low k-values compared to the SVM_QP and
SVM_SMO algorithms.

Table 5.9: The best k-values and z-statistics for KPCA_SVM on original and different
feature modified data sets

TP    OD            KPCA          SPCA          OSP           z-statistic
      k1     NB*    k2     NB     k3     NB     k4     NB     Z12    Z13     Z14     Z34
100   0.7110 50     0.5150 25     0.7565 10     0.4192 8      62.93  -15.32  93.69   96.77
200   0.6736 45     0.6514 30     0.6976 5      0.7917 8      6.98   -6.64   -39.72  -21.83
300   0.7142 55     0.5109 45     0.5340 5      0.3488 8      61.15  5.10    203.10  104.90

NB* = number of bands used to obtain the best k-value.

5.3.4 Class-wise comparison of the best results of SVM

The ability of the SVM classifiers to separate the different classes can be observed from
Figure 5.23.

(i) The ability of all SVM classifiers to distinguish the salt lake class is about the same.
(ii) The accuracy of separation of the wheat class by the SVM_QP and SVM_SMO
classifiers is about the same. However, the performance of KPCA_SVM in separating
the other classes (except salt lake) is much lower than that of the other two classifiers.
(iii) SVM_SMO separates all other classes with slightly lower accuracy than
SVM_QP.
(iv) SVM_QP is the best classifier; it has the ability to separate all classes with a
high k-value.

Figure 5.23: Comparison of classification accuracy of individual classes for
different SVM algorithms. WT – water, SLT – salt lake, HV – hydrophytic veg,
WHT – wheat, VY – vineyards, BS – bare soil, PL – pasture land, BUA – built-up area.

5.3.5 Comparison of results for different SVM algorithms

The overall best results obtained by the different SVM algorithms were compared
statistically to find the best SVM classification method in terms of classification
accuracy. The same was done for the observed time scales in order
to compare the practical applicability of these methods.

(i) From Table 5.10, it is observed that the SVM_QP method is statistically better
than all other SVM algorithms for all sets of TP. The best results of SVM_QP are
obtained with the SPCA and PCA modified data sets for 100 and 200 TP per class,
and with the SPCA modified data set for 300 TP. The
classification time ranges from 148 seconds (for 100 TP) to 2596 seconds (300
TP), which is a very long time.
(ii) From Table 5.10, it is observed that the SVM_SMO algorithm is the second best
SVM decision rule. The best k-values for SVM_SMO are obtained with 100, 200
and 300 TP using the SPCA, KPCA and SPCA modified data sets respectively. The k-
values obtained for 100 and 300 TP are a little less than for SVM_QP, though
the classification time required for 300 TP is about two-thirds of that of SVM_QP.
Although SVM_SMO needs more bands than SVM_QP to obtain its best k-values
for different sets of TP, its processing time is much less than that of SVM_QP.
(iii) KPCA_SVM is the poorest method among the three SVM approaches. The highest k-
value for KPCA_SVM is obtained using the OSP modified data set with 200 TP.
When the number of TP is large, the performance of KPCA_SVM degrades.

From the above discussion, it can be concluded that SVM_QP is the best
classifier with respect to accuracy. Considering both classification time and
accuracy, SVM_SMO can be considered the most effective SVM classifier. The best
accuracy is obtained by SVM_QP using 300 TP with the first 20 bands of the SPCA
modified data set. For SVM_SMO, the best accuracy is obtained using 300 TP with
the first 30 bands of the SPCA modified data set.

Table 5.10: Comparison of the best k-values with different FE techniques,
classification time, and z-statistics for different SVM algorithms

TP    SVM_QP                               SVM_SMO                        KPCA_SVM                       z-statistic
      k1     FEA*       Time(s)* NB*       k2     FEA   Time(s) NB        k3     FEA   Time(s) NB        Z12    Z13     Z23
100   0.9408 PCA, SPCA  122.6    15        0.9304 SPCA  148.1   15        0.7565 SPCA  94.3    10        6.14   77.61   71.94
200   0.9621 PCA, SPCA  585.7    15        0.8836 KPCA  363.9   50        0.7927 OSP   262.3   8         45.16  77.51   36.4
300   0.9691 SPCA       2596.2   20        0.9446 SPCA  1694.8  30        0.7142 OD    1190.2  55        18.38  113.47  97.01

ki = best k-value for the ith classifier; FEA* = feature extraction algorithm; NB* = number of bands used to
obtain the best k-value; Time(s)* = time required to obtain the best k-value, in seconds.

5.4 Comparison of best results of different classifiers

The best results obtained by the parametric (GML), non-parametric (KNN) and
advanced (SVM) classifiers with the different feature modified data sets have already been
presented in Tables 5.2, 5.5 and 5.10. The best advanced classifier (SVM_QP) was chosen
by statistically comparing all the advanced classifiers. A statistical comparison of the
parametric, non-parametric and best advanced classifiers was carried out in order to
identify the best of these classifiers with respect to classification
accuracy and time. The corresponding z-statistics are presented in Table 5.11.

The following is observed from Table 5.11:
(i) GML performs statistically better than the KNN classifier for all sets of TP. In
addition, the classification time of GMLC is negligible compared with that of KNNC.
(ii) GMLC performs statistically similarly to SVM_QP for 100 and 200 TP. For the
large set of TP (300), the performance of the SVM_QP classifier is statistically
significantly better than that of GMLC; however, the required classification time is very
high for the SVM classifier.
(iii) SVM_QP provides statistically better results than KNNC for all sets of TP. From
this it can be concluded that SVM_QP is the best classifier on the basis of
classification accuracy, with GML ranked as the second best classifier.
(iv) It is also observed that the best results are obtained by all the classifiers using
the SPCA modified data set, so SPCA is concluded to be the best
feature reduction technique among all the techniques for all classifiers.
(v) The processing time of GMLC is much less than that of any other classifier, and GMLC
provides only a slightly poorer k-value than SVM_QP for 300 TP. Considering both
classification time and accuracy, it can be concluded that GMLC is the best
overall classifier.

Table 5.11: Statistical comparison of different classifiers' results obtained for
different data sets

TP    GML                         KNN                          SVM_QP                            z-statistic
      k1     FEA*  Time(s)* NB*   k2     FEA   Time(s) NB      k3     FEA        Time(s) NB      Z12    Z13    Z23
100   0.9384 SPCA  60.6     20    0.8669 SPCA  661.2   10      0.9408 SPCA, PCA  122.6   15      36.82  -1.54  -38.06
200   0.9579 SPCA  64.7     30    0.9040 SPCA  1122.1  15      0.9573 SPCA, PCA  585.7   15      31.33  0.42   -30.98
300   0.9589 SPCA  82.6     45    0.9325 SPCA  1712.5  40      0.9691 SPCA       2596.2  20      16.00  -7.97  -25.37

ki = best k-value for the ith classifier; FEA* = feature extraction algorithm; NB* = number of bands used to
obtain the best k-value; Time(s)* = time required to obtain the best k-value, in seconds.

The difference in performance of the GML, KNN and SVM classifiers can be
attributed to the difference in their classification mechanisms. GML and KNN are
capable of forming only simple decision boundaries, whereas SVM can form highly
complex non-linear decision boundaries. In the given data, different kinds of class
separability were observed for different classes. The water and salt lake classes were
found to be easily separable from the rest of the classes; about 100% classification
accuracy was observed for these classes with a very small number of features for all
the classifiers. After these, the wheat, vineyards and bare soil classes showed
slightly lower accuracy values, which means they are a little more difficult to separate. The
lowest accuracies were observed for the pasture land, built-up area and hydrophytic
vegetation classes. These classes are very poorly separated, and thus complex decision
boundaries are required to separate them. For the large set of TP, SVM_QP is able
to achieve higher classification accuracies than the parametric and non-parametric
classifiers because the latter were not able to separate the poorly separable classes as well.
Classified maps corresponding to the best results of different classifiers are
shown in Appendix A (Figure A.1).

5.5 Ramifications of results


HD classification is a crucial task due to the characteristics and large volume of the
data. It is clear from the analysis that, depending on the availability of TP, the
selection of FE techniques and classification algorithms is very important for
classification of HD. Another important aspect to keep in mind is the time consumed
by the classification and FE procedures. This thesis work points out
some important guidelines for classification of HD (Table 5.12).
(i) When only statistically insufficient TP are available, it is suggested to apply
either the SVM_QP algorithm with the OSP FE technique or the SVM_SMO algorithm
with the SPCA FE technique. This will provide high classification accuracy in minimum time.
(ii) GML applied to the SPCA modified data set is strongly recommended to achieve
very high accuracy in very little time for statistically exact and statistically
sufficient training data sets.
(iii) For a statistically large training data set, high accuracy can be achieved by
implementing SVM_QP on the SPCA modified data set. Nevertheless, this method
takes a very long processing time, so it is strongly recommended to apply
GML to the SPCA modified data set; the achieved classification accuracy is
a little less than that of SVM_QP, but the processing time is negligible in comparison.
SVM_SMO could also be used for a large set of TP on the SPCA modified data set.
(iv) Among all the popular FE techniques for HD, SPCA is the most effective; it
can be used to achieve high classification accuracy for HD with all classification
techniques.


Table 5.12: Ranking of different classification algorithms depending on classification
accuracy and time (Rank 1 indicates the best)

Ranking depending on accuracy
TP    GML          KNN          SVM_QP              SVM_SMO      KPCA_SVM
      Rank  FEA    Rank  FEA    Rank  FEA           Rank  FEA    Rank  FEA
25    2     SPCA   3     KPCA   1     SPCA, OSP     1     SPCA   4     -
100   1     SPCA   3     SPCA   1     PCA, SPCA     2     SPCA   4     SPCA
200   1     SPCA   2     SPCA   1     PCA, SPCA     3     KPCA   4     OSP
300   2     SPCA   4     SPCA   1     SPCA          3     SPCA   5     OD

Ranking depending on accuracy and time
TP    GML          KNN          SVM_QP              SVM_SMO      KPCA_SVM
      Rank  FEA    Rank  FEA    Rank  FEA           Rank  FEA    Rank  FEA
25    2     SPCA   3     SPCA   1     OSP           1     SPCA   4     SPCA
100   1     SPCA   4     SPCA   2     PCA, SPCA     3     SPCA   5     SPCA
200   1     SPCA   4     SPCA   2     PCA, SPCA     3     KPCA   5     OSP
300   1     SPCA   4     SPCA   3     SPCA          2     SPCA   5     OD

CHAPTER 6
SUMMARY OF RESULTS AND
CONCLUSIONS

Starting with a summary of the observations noted in the previous chapter, this chapter mainly aims to summarize the conclusions corresponding to the main objectives defined in the first chapter. It also suggests some areas and methods for further research.
 
6.1 Summary of results
This research work is an extension of the work done by Abhinav (2009). For this research work, DAIS 7915 hyperspectral sensor data was used for testing different FE techniques and classification algorithms. The best results obtained from these experiments were compared with those obtained by Abhinav (2009). Based on the conclusions from the literature survey and the recommendations for future work by Abhinav (2009), several FE techniques (SPCA, KPCA, OSP, PP) and classification algorithms (KNN, GML and SVM-based classifiers) were tested to achieve the objectives mentioned in section 1.4.
For the parametric classifier (GML), experiments were performed on the different feature-extracted data sets mentioned above. The best result obtained from these experiments was compared with the best result obtained by Abhinav (2009) to assess the improvement. For the non-parametric classifier (KNN), the first experiment was performed with the OD; the algorithm was then applied to the different feature-modified data sets. The best results for the OD and the feature-extracted data were compared to obtain the best result for the non-parametric classifier. For the advanced classifiers (SVM_QP, SVM_SMO and KPCA_SVM), experiments were performed on the OD as well as on the feature-modified data sets. For SVM_QP, as for GML, the best result was compared with the best result obtained by Abhinav (2009). The best results of the different SVM classifiers were examined to identify the best SVM algorithm.

Lastly, the best results for the parametric, non-parametric and advanced classifiers were compared to find the best classifier for HD. All the comparisons were performed using one-tailed hypothesis testing at the 5% significance level.
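The form of this test can be made concrete with a small sketch. The following is a minimal example, assuming that each classification result is summarized by its kappa value and the estimated variance of kappa from the error-matrix analysis; the numerical values used are placeholders, not results from this thesis.

# A minimal sketch, not the thesis code: one-tailed Z-test for comparing two
# kappa coefficients, assuming their estimated variances are available from
# the error-matrix analysis.  The numbers below are placeholders.
from math import sqrt

def kappa_z_statistic(kappa_1, var_1, kappa_2, var_2):
    """Z statistic for H0: kappa_1 <= kappa_2 against H1: kappa_1 > kappa_2."""
    return (kappa_1 - kappa_2) / sqrt(var_1 + var_2)

# Placeholder kappa values and variances for two classifiers.
z = kappa_z_statistic(kappa_1=0.9691, var_1=2.0e-5, kappa_2=0.9589, var_2=2.5e-5)
critical_value = 1.645  # one-tailed test at the 5% significance level
print(f"Z = {z:.2f}; classifier 1 significantly better: {z > critical_value}")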
Classification experiments were performed using four FE techniques, namely SPCA, KPCA, OSP and PP. From the statistical analysis of the classification results obtained using these feature-modified data sets, it can be concluded that the SPCA-modified data set provides the best results among the four techniques. These results were also compared with the best classification results obtained by Abhinav (2009) using different FE techniques. SPCA performs better because it uses local rather than global statistics.
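To make the idea of local statistics concrete, the following is a minimal sketch of segmented PCA (not the thesis implementation): PCA is applied separately to each contiguous group of highly correlated bands and the leading local components are concatenated. The segment boundaries, band count and number of components per segment below are illustrative assumptions.

# A minimal sketch of segmented PCA (SPCA), not the thesis implementation:
# PCA is run on each contiguous block of correlated bands separately and the
# leading local components are concatenated.  The segment boundaries and the
# number of components per segment are illustrative assumptions.
import numpy as np
from sklearn.decomposition import PCA

def segmented_pca(X, segments, n_components_per_segment):
    """X: (n_pixels, n_bands); segments: list of (start, stop) band indices."""
    pieces = []
    for (start, stop), k in zip(segments, n_components_per_segment):
        local = PCA(n_components=k).fit_transform(X[:, start:stop])
        pieces.append(local)          # local statistics of this band group
    return np.hstack(pieces)          # concatenated SPCA features

# Illustrative example: 72 bands split into three correlated groups.
X = np.random.rand(5000, 72)          # stand-in for DAIS-7915 pixel spectra
segments = [(0, 32), (32, 60), (60, 72)]
X_spca = segmented_pca(X, segments, n_components_per_segment=[8, 8, 4])
print(X_spca.shape)                   # (5000, 20) reduced feature set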
Analyzing the results of the different classifiers, it is observed that the results obtained from the PCA-modified data set sometimes compete with those obtained from the SPCA-modified data set. Generally, the different classifiers provide their best results using 15 to 30 bands of the SPCA- or PCA-modified data sets, which effectively reduces the classification time. Owing to their very low dimensionality, OSP and PP generally fail to produce satisfactory results. However, the results obtained using eight bands of the OSP-modified data set are reasonably good, though they are not always statistically significantly better than those of the SPCA- or PCA-modified data sets. There is a possibility of improving the results by increasing the dimension of the OSP-modified data set, i.e., by extracting a larger number of endmembers. The performance of the KPCA-modified data set was observed to be consistently poor. KPCA can produce satisfactory results if its dimension is increased, but this also increases the classification time proportionally. Therefore, KPCA is not considered an effective FE technique.
From the experiments performed with the parametric classifier (GML), it was observed that the performance of GML improved significantly after applying FE techniques. Comparing the obtained results with the best result obtained by Abhinav (2009), SPCA was found to work best among all the available FE techniques in improving the classification accuracy of GML.
Moving on to the non-parametric classifier, it was observed that the result of the KNN classifier depends on the choice of the number of bands and the number of neighbors. The best results for KNN with and without FE techniques were selected, and it was found that the result of KNN was enhanced by the PCA and SPCA techniques, while the supervised FE techniques such as KPCA and OSP failed to do so.
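This dependence on both parameters can be examined with a small grid search. The following is a minimal sketch, assuming the feature-extracted bands are ordered by decreasing importance (as with PCA or SPCA components) and using synthetic data in place of the real imagery; the band counts and neighbor values are illustrative.

# A minimal sketch, not the thesis code: grid search over the number of
# leading feature-extracted bands and the number of neighbors for KNN.
# Synthetic data stands in for the DAIS-7915 training/test pixels.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import cohen_kappa_score

X, y = make_classification(n_samples=2000, n_features=30, n_informative=15,
                           n_classes=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

best = (-1.0, 0, 0)
for n_bands in (5, 10, 15, 20, 30):          # leading bands kept
    for k in (1, 3, 5, 7, 9):                # number of neighbors
        clf = KNeighborsClassifier(n_neighbors=k).fit(X_tr[:, :n_bands], y_tr)
        kappa = cohen_kappa_score(y_te, clf.predict(X_te[:, :n_bands]))
        best = max(best, (kappa, n_bands, k))
print("best kappa = %.3f with %d bands and k = %d" % best)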
The SVM algorithm was selected as the advanced classifier. It is based on statistical learning theory and is expected to produce more consistent and optimal results than the parametric and non-parametric classifiers. Different SVM algorithms (SVM_QP, SVM_SMO and KPCA_SVM) were tested to reach this goal. For SVM-based classifiers, it was observed that the dimension of the data set and the choice of optimizer significantly affect the results. The best result of SVM_QP was achieved on the SPCA feature-extracted data set with 20 bands. It was also observed that the classification result of the advanced classifier improved on the best result obtained by Abhinav (2009), who had obtained his best result using PCA-modified data sets; this result was further improved by using the SPCA-modified data set. This shows that, with suitably selected FE techniques, the classification results of the advanced classifier can be improved further. It was observed that supervised FE techniques such as KPCA and OSP could not improve the result of SVM, while the unsupervised FE technique (SPCA) did improve it. The best results of SVM_SMO and KPCA_SVM were obtained using the SPCA- and OSP-modified data sets, respectively. Comparing the best results of the different SVM algorithms, SVM_QP is concluded to be the best SVM classifier.
On comparing the best results obtained by the SVM classifiers with the best results of the parametric and non-parametric classifications, it was found that the advanced classifier performs significantly better on both the original and the feature-extracted data sets. The reason for the better performance of this classifier is the improvement in separating a few classes that show poor k-values when parametric or non-parametric classifiers are used. This observation is expected because of the difference in the way the decision boundaries are formed. The decision boundaries formed by parametric or non-parametric classifiers are simpler, and for this reason they are unable to separate the poorly separable classes efficiently. The advanced classifier is able to form complex, non-linear decision boundaries, which helps it to separate these classes.
Compared to the parametric classifier, SVM requires more computation time and memory. In spite of these difficulties, the advanced classifier showed significant improvement over the parametric and non-parametric classifiers. This strongly suggests that SVM has the ability to mitigate the problems associated with HD classification.

6.2 Conclusions
Based on these results, the following conclusions are drawn:

1. Out of the various FE techniques for the classification of HD, SPCA is the best, followed by PCA. In addition, orthogonal subspace projection can be regarded as an effective FE technique if its dimension can be increased.
2. Although advanced classifiers need long processing times, they address the problems associated with the classification of HD much better than the parametric or non-parametric classifiers. For statistically exact and sufficient sets of TP, the performance of SVM_QP is not statistically better than that of the parametric classifier. For a large set of TP, SVM_QP produces statistically better results than all the other classifiers. In addition, the SPCA FE technique was found to increase the accuracy significantly for the advanced, parametric and non-parametric classifiers alike.

6.3 Recommendations for future work


During the literature survey, some additional methods were found that are not included in this thesis work. These appear to offer scope for improving the accuracy and computation time of the advanced classifiers presented in this thesis. The following methods are recommended for future work:

(i). In this thesis work, the high memory and computation time required by the SVM methods were reduced to some extent by using different optimizers and algorithms. There is still scope to reduce the computation time of the SVM algorithm by using the Lagrangian SVM algorithm (Mangasarian and Musicant, 2000); this requires further testing. In addition, optimization techniques such as the Kernel Adatron (Bennett and Campbell, 2000) and Successive Overrelaxation (SOR) (Mangasarian and Musicant, 1998) should also be tested, as they may reduce the computation time significantly.

(ii). For a large set of TP, the KPCA method takes considerable time. Lima and Zen (2005) suggested a method called Sparse KPCA, which may reduce the computation time; this needs to be tested.
(iii). KNN was found to require high computation time in this thesis work because a large number of distance computations is needed to classify a single pixel, and this cost grows rapidly for large data sets. To reduce it, a hash-table approach could be applied, in which each query pixel is compared only with the training pixels stored in its hash bucket, so that the number of computations is reduced; a sketch of one such approach is given below.
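As one possible realization of this idea (an illustration, not a method from the thesis), the following minimal sketch uses random-projection hashing to bucket pixel spectra, so that a query is compared only against the training pixels that fall in the same bucket; the data, band count and number of hyperplanes are placeholder assumptions.

# A minimal sketch (an assumption, not the thesis method): hash-table KNN via
# random-projection (locality-sensitive) hashing.  Each pixel spectrum is
# hashed to a bucket key, and a query is compared only with training pixels
# in the same bucket instead of with the whole training set.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

def hash_key(x, hyperplanes):
    """Sign pattern of projections onto random hyperplanes -> bucket key."""
    return tuple((x @ hyperplanes.T > 0).astype(int))

n_train, n_bands, n_hyperplanes = 5000, 20, 6
X_train = rng.random((n_train, n_bands))                 # stand-in training spectra
y_train = rng.integers(0, 8, n_train)                    # stand-in class labels
hyperplanes = rng.normal(size=(n_hyperplanes, n_bands))

# Build the hash table once.
table = defaultdict(list)
for i, x in enumerate(X_train):
    table[hash_key(x, hyperplanes)].append(i)

def knn_predict(x, k=5):
    cand = np.array(table.get(hash_key(x, hyperplanes), []), dtype=int)
    if cand.size < k:                                    # fall back to a full search
        cand = np.arange(n_train)
    d = np.linalg.norm(X_train[cand] - x, axis=1)
    nearest = cand[np.argsort(d)[:k]]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                     # majority vote

print(knn_predict(rng.random(n_bands)))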

REFERENCES
Barros, A. S. and Rutledge, D. N. (2005) ‘Segmented principal component transform–principal component analysis,’ Chemometrics and Intelligent Laboratory Systems, Vol. 78, pp. 125–137.

Bhattacharyya, A. (1943) ‘On a measure of divergence between two statistical populations defined by probability distributions,’ Bulletin of Calcutta Mathematical Society, Vol. 35, pp. 99-109.

Ben-Dor, E., Patkin, K., Banin, A. and Karnieli, A. (2002) ‘Mapping of several soil properties using DAIS-7915 hyperspectral scanner data – a case study over clayey soils in Israel,’ International Journal of Remote Sensing, Vol. 23, No. 6, pp. 1043-1062.

Bierwirth, P., Huston, D., and Blewett, R. (2002) ‘Hyperspectral mapping of mineral assemblages associated with gold mineralization in the Central Pilbara, Western Australia,’ Economic Geology and the Bulletin of the Society of Economic Geologists, Vol. 97, No. 4, pp. 819-826.

Boser, H., Guyon, I. M., and Vapnik, V. N. (1992) ‘A training algorithm for optimal margin classifiers,’ Proceedings of the 5th Annual Workshop on Computational Learning Theory, ACM, New York, NY, USA, pp. 144-152.

Carreira-Perpinan, M. A. (1997) ‘A review of dimension reduction techniques,’ Technical Report, Vol. 9, No. CS-96, Department of Computer Science, University of Sheffield.

Cha, G. H. (2005) ‘Kernel principal component analysis for content based image retrieval,’ PAKDD 2005, LNAI 3518, Springer-Verlag, Berlin Heidelberg, pp. 844-849.

Chang, C. I., Sun, T. L. E., and Althouse, M. L. G. (1998) ‘An unsupervised interference rejection approach to target detection and classification for hyperspectral imagery,’ Optical Engineering, Vol. 37, pp. 735-743.

Chang, C. I. (2005) ‘Orthogonal subspace projection (OSP) revisited: A comprehensive study and analysis,’ IEEE Transactions on Geoscience and Remote Sensing, Vol. 43, No. 3.

Cristianini, N. and Shawe-Taylor, J. (2000) An introduction to support vector machines and other kernel-based learning methods, Cambridge University Press, Cambridge, UK.
Congalton, R. G. (1991) ‘A review of assessing the accuracy of classifications of remotely sensed data,’ Remote Sensing of Environment, Elsevier Science (pub.), Vol. 37, No. 1, pp. 35-46.

Cover, T. M. and Hart, P. E. (1967) ‘Nearest neighbor pattern classification,’ IEEE Transactions on Information Theory, Vol. IT-13, No. 1, pp. 21-27.

Curran, P. J. and Dungan, J. L. (1989) ‘Estimation of signal-to-noise – a new procedure applied to AVIRIS data,’ IEEE Transactions on Geoscience and Remote Sensing, Vol. 27, No. 5, pp. 620-628.

Dasarathy, B. V. (1991) Nearest neighbour (NN) norms: NN pattern classification techniques, IEEE Computer Society Press, Los Alamitos, CA.

Devijver, P. and Kittler, J. (1982) Pattern recognition: A statistical approach, Englewood Cliffs, New Jersey.

Dundar, M. M. and Landgrebe, D. A. (2004) ‘Toward an optimal supervised classifier for the analysis of hyperspectral data,’ IEEE Transactions on Geoscience and Remote Sensing, Vol. 42, No. 1, pp. 271-277.

Friedman, J. H. (1987) ‘Exploratory projection pursuit,’ Journal of the American Statistical Association, Vol. 82, pp. 249-266.

Fukunaga, K. (1990) Introduction to statistical pattern recognition, Rheinboldt, W. (ed.), II edn., Academic Press, Inc., San Diego, USA.

Garg, A. (2009) Investigations on classification techniques for hyperspectral imagery, M. Tech. Thesis, Indian Institute of Technology, Kanpur.

Harsanyi, J. C. and Chang, C. I. (1994) ‘Hyperspectral image classification and dimensionality reduction: An orthogonal subspace projection,’ IEEE Transactions on Geoscience and Remote Sensing, Vol. 32, pp. 779-785.

Harsanyi, J. C. (1993) Detection and classification of subpixel spectral signatures in hyperspectral image sequences, Ph.D. dissertation, Dept. Elect. Eng., Univ. Maryland Baltimore County, Baltimore, MD.

Huber, P. J. (1985) ‘Projection pursuit,’ The Annals of Statistics, Vol. 13, pp. 435-475.

Hughes, G. (1968) ‘On the mean accuracy of statistical pattern recognizers,’ IEEE Transactions on Information Theory, Vol. IT-14, No. 1, pp. 55-63.

Hwang, W. J. and Wen, K. W. (1998) ‘Fast KNN classification algorithm based on partial distance search,’ Electronics Letters, Vol. 34, No. 21.
Hwang, J., Lay, S., and Lippman, A. (1994) ‘Nonparametric multivariate density estimation: A comparative study,’ IEEE Transactions on Signal Processing, Vol. 42, No. 10, pp. 2795-2810.

Ifarraguerri, A. and Chang, C. I. (2000) ‘Unsupervised hyperspectral image analysis with projection pursuit,’ IEEE Transactions on Geoscience and Remote Sensing, Vol. 38, No. 6.

Jia, X. (1996) Classification techniques for hyperspectral remote sensing data, Ph.D. Thesis, University of Canberra.

Jones, M. C. and Sibson, R. (1987) ‘What is projection pursuit?,’ Journal of the Royal Statistical Society, Ser. A, Vol. 150, pp. 1-38.

Jimenez, L. O. and Landgrebe, D. A. (1998) ‘Supervised classification in high dimensional space: Geometrical, statistical and asymptotic properties of multivariate data,’ IEEE Transactions on Systems, Man and Cybernetics – Part C: Applications and Reviews, Vol. 28, No. 1, pp. 39-54.

Kim, K. I., Franz, F. O., and Scholkopf, B. (2005) ‘Iterative kernel principal component analysis for image modeling,’ IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 27, No. 9.

Kohram, M. and Sap, M. N. M. (2008) ‘Composite kernel for support vector classification of hyperspectral data,’ MICAI 2008, LNAI 5317, Springer-Verlag, Berlin Heidelberg, pp. 360-370.

Kolahdouzan, M. and Shahabi, C. (2004) ‘Voronoi-based K nearest neighbor search for spatial network databases,’ Proceedings of the 30th VLDB Conference, Toronto, Canada.

Lee, Y. J. and Huang, S. Y. (2005) ‘Reduced support vector machines: A statistical theory,’ Taiwan.

Landgrebe, A. (1971) ‘Description and results of the LARS/GE data compression study,’ LARS Information Note, Vol. 21171.

Luenberger, D. (1984) Linear and nonlinear programming, II edn., Addison-Wesley, Menlo Park, California.

Luttrell, R. D. and Vogt, F. (2008) ‘Accelerating kernel principal component analysis (KPCA) by utilizing two dimensional wavelet compression: applications to spectroscopic imaging,’ Wiley InterScience.

Martinez, W. L. and Martinez, A. R. (2004) Exploratory data analysis with MATLAB, Chapman and Hall/CRC.
Mercer, J. (1909) ‘Functions of positive and negative type, and their connection with the theory of integral equations,’ Transactions of the London Philosophical Society, Vol. 209, No. A, pp. 415-446.

Nilsson, N. J. (1990) The mathematical foundations of learning machines, Morgan Kaufmann Publishers Inc., San Mateo, CA.

Pal, M. (2002) Factors influencing the accuracy of remote sensing classifications: A comparative study, Ph.D. Thesis, University of Nottingham.

Pechenizkiy, M. (2005) ‘The impact of feature extraction on the performance of a classifier: kNN, Naïve Bayes and C4.5,’ in Kégl, B. and Lapalme, G. (eds.), AI 2005, LNAI 3501, Springer-Verlag, Berlin Heidelberg, pp. 268-279.

Ping, X., Guo, G., and Chen, G. (2006) ‘A fast document classification algorithm based on improved KNN,’ IEEE Transaction.

Posse, C. (1995) ‘Tools for two-dimensional exploratory projection pursuit,’ Journal of Computational and Graphical Statistics, Vol. 4, No. 2, pp. 83-100.

Richards, J. A. and Jia, X. (2006) Remote sensing digital image analysis: An introduction, IV edn., Springer, Berlin.

Robila, S. A. and Varshney, P. K. (2002) ‘Target detection in hyperspectral images based on independent component analysis,’ Proceedings of SPIE: Automatic Target Recognition XII, SPIE–International Society for Optical Engineering, Vol. 4726, pp. 173-182.

Schraudolph, N. N., Gunter, S. S., and Vishwanathan, V. N. ‘Fast iterative kernel PCA,’ Statistical Machine Learning, National ICT Australia.

Smola, A. J. and Scholkopf, B. (1997) ‘On a kernel-based method for pattern recognition, regression, approximation, and operator inversion,’ GMD Technical Report 1064.

Sundaram, N. (2009) ‘Support vector machine approximation using kernel PCA,’ Technical Report No. UCB/EECS-2009-94.

Vapnik, V. N. (1995) The nature of statistical learning theory, Springer, NY.

Vapnik, V. N. (1998) Statistical learning theory, John Wiley and Sons, NY.

Varshney, P. K. and Arora, M. K. (2004) Advanced image processing techniques for remotely sensed hyperspectral data, Springer, NY.

Wegman, E. J. (1990) ‘Hyperdimensional data analysis using parallel coordinates,’ Journal of the American Statistical Association, Vol. 85, No. 411, pp. 664-675.
Welling, M. ‘Kernel principal component analysis,’ Department of Computer Science, University of Toronto.

Zhu, B., Jiang, L., Jin, F., Qin, L., Vogel, A., and Tao, Y. (2007) ‘Walnut shell and meat differentiation using fluorescence hyperspectral imagery with ICA-KNN optimal wavelength selection,’ Sens. & Instrumen. Food Qual., Vol. 1, pp. 123-131, DOI 10.1007/s11694-007-9015-z, Springer Science+Business Media, LLC.
APPENDIX A
[Figure A.1 appears here: classified maps produced by the GML, KNN, SVM_QP, SVM_SMO and KPCA_SVM classifiers, shown with a common legend.]

Figure A.1: Classified maps corresponding to the best results of different classifiers
