1 INTRODUCTION
The term cancer does not refer to one disease, but rather to many
diseases that can occur in various regions of the body. Every
type of cancer is characterized by abnormal cell growth. Cancer is
the third most common disease and the second leading cause of
death in the world. Detection of cancer is therefore a research
topic of significant importance. Gene array techniques have
been shown to provide insight into cancer, and molecular
profiling based on gene expression array technology is expected
to deliver precise cancer detection and classification. One of
the most important problems in the treatment of cancer is the
early detection of the disease. If cancer is detected in its later
stages, it has already compromised the function of one or more
organ systems and spread throughout the body. Methods for
the early detection of cancer are of utmost importance and are an
active area of current research. An important step in the
diagnosis of cancer is the classification of unknown malignant
tumors into known classes.
IJTET2015
2 RELATED WORKS
Wang et al. (2005) demonstrate various gene selection methods
that improve classification performance. Because microarray data
contain a small number of samples and a large number of
genes (features), it is very challenging to choose the features
that are relevant to the various types of cancer. In that study,
several Feature Selection (FS) algorithms, namely wrappers,
filters and Correlation-based Feature Selection (CFS), are
statistically analyzed to extract useful information about genes
and to reduce the dimensionality.
Two data sets, Tumors and Brain Tumor, are used to check the
classifiers' performance. Experimental results for the sparse
representation of the gene expression data are compared with
those of SVMs. The proposed sparse representation method is
implemented in MATLAB R14. The SVM results are
calculated with GEMS (Gene Expression Model Selector), which
provides a graphical user interface for classifying data. This
GUI is available at http://www.gemssystem.org/. Accuracies are
compared with the SVM results, and no differences are found
apart from a slight improvement. Hence the classification
performance of sparse representation is quite similar to that
of SVM.
3 PROPOSED SYSTEM
Principal Component Analysis (PCA) is a statistical procedure that
uses an orthogonal transformation to convert a set of
observations of possibly correlated variables into a set of values
of linearly uncorrelated variables called principal components.
The number of principal components is less than or equal to the
number of original variables. The datasets used in this work have
a large number of features m and a small number of samples n. It
is very computationally intensive to calculate the covariance of
such data. To overcome this issue, the covariance is computed
through the idea of eigenfaces. The eigenvectors of the covariance
matrix are computed by projecting the data into the feature space
that spans the variations among known classes, which solves the
memory issue caused by the large number of features. Projecting
the data into feature space makes some mathematical assumptions.
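The reduction from m to n described above can be sketched as
follows. This is a minimal illustration in Python with NumPy, not
the authors' implementation: the function name pca_small_sample,
the random test data and the choice of k are assumptions made for
the example. It relies on the standard identity that if v is an
eigenvector of the n x n matrix X X^T, then X^T v is an
eigenvector of the m x m matrix X^T X with the same eigenvalue.

```python
import numpy as np


def pca_small_sample(X, k):
    """Top-k principal components of X (n samples x m features, m >> n).

    Hypothetical helper: works on the n x n Gram matrix instead of the
    m x m covariance matrix, as in the eigenfaces approach.
    """
    Xc = X - X.mean(axis=0)                 # centre every feature (gene)
    n = Xc.shape[0]
    gram = Xc @ Xc.T / (n - 1)              # n x n, cheap when n is small
    evals, evecs = np.linalg.eigh(gram)     # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:k]     # indices of the k largest
    evals, evecs = evals[order], evecs[:, order]
    # Map each small eigenvector v back to feature space: u = Xc^T v,
    # then normalise u to unit length.
    components = Xc.T @ evecs               # m x k
    components /= np.linalg.norm(components, axis=0)
    return evals, components


# Synthetic stand-in for microarray data: 20 samples, 500 genes.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 500))
evals, comps = pca_small_sample(X, k=5)
print(comps.shape)
```

The eigen-decomposition here runs on a 20 x 20 matrix rather than
a 500 x 500 one. k must not exceed the number of non-zero
eigenvalues, which is at most n - 1 after centring.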
The eigenvectors of the original covariance matrix are used to
transform the data matrix to a low-dimensional space. In this
case, with large m and small n, the size of the calculation is
reduced from the number of features to the number of samples.
PCA is one of the most famous dimensionality reduction
techniques and is used for datasets that contain redundant
information. In PCA class information is not given, whereas in
the proposed scheme class information is given and taken into
consideration for the projection of the data. The scheme is
therefore not exactly PCA but a variation of it: although it
reduces the dimension of the data, it preserves the information
in the data. The term eigenfaces was coined because the original
work dealt with face images for recognition purposes. In this
study, however, transformed genes are used for classification,
so the term eigengenes is used instead of eigenfaces. It is
difficult to compute the inverse of the covariance matrix because
it is not a diagonal matrix. The Karhunen-Loève (KL) theorem is
used to make the covariance matrix of the data approximately
diagonal. The KL transformation is mostly used to make highly
correlated datasets uncorrelated. In the proposed scheme, the
data are KL-transformed by multiplying each sample with the
transformation matrix. The computations are greatly reduced
because there is less correlation between the transformed
features.
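The decorrelating effect of the KL transformation can be checked
with a short sketch. The helper kl_transform and the synthetic
data below are hypothetical, and the eigenvectors are obtained
from an SVD of the centred data purely for numerical convenience;
that choice is an assumption and is not stated in the paper.

```python
import numpy as np


def kl_transform(X, k):
    """Project X (n samples x m features) onto its first k KL basis vectors.

    Hypothetical helper: the rows of Vt from the SVD of the centred data
    are the eigenvectors of the sample covariance matrix.
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:k].T                  # m x k transformation matrix
    return Xc @ W, W, mean        # each sample is multiplied by W


# Synthetic, highly correlated features: 3 latent factors mixed into 50.
rng = np.random.default_rng(1)
latent = rng.normal(size=(30, 3))
X = latent @ rng.normal(size=(3, 50)) + 0.1 * rng.normal(size=(30, 50))

Y, W, mean = kl_transform(X, k=3)
cov_Y = np.cov(Y, rowvar=False)
off_diag = cov_Y - np.diag(np.diag(cov_Y))
print(np.abs(off_diag).max())     # close to zero: features are uncorrelated
```

Because the covariance of the transformed features is diagonal,
inverting it only requires taking the reciprocal of each diagonal
entry.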
4 PERFORMANCE ANALYSES
The experimental results of the proposed classification scheme
are plotted in the graph. Simulation results for the
performance indices are shown below. Classifier error rates are
computed for each dataset using the bootstrap method. This
method estimates the error rate as a weighted average of the
resubstitution error, that is, the error on the training set, and
the error on the samples that are not used to train the
classifier, in the manner of leave-one-out cross-validation. For
the Adeno, Colon and SRBCT datasets the error rate is high
compared with the other datasets used for classification. The
number of errors on these datasets represents the deficiency of
the process. The classification boundary is nonlinear, as it
incorporates different data spreads between classes as well as
gene types.
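The bootstrap error estimate described above can be sketched as
follows. The text does not give the weights of the average, so
the sketch assumes the common 0.632 bootstrap weighting (0.368 for
the resubstitution error and 0.632 for the out-of-sample error).
The nearest-centroid classifier, the function names and the
synthetic data are likewise stand-ins chosen for illustration,
not the classifier or datasets of the paper.

```python
import numpy as np


def nearest_centroid_fit(X, y):
    """Stand-in classifier: store one mean vector per class."""
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids


def nearest_centroid_predict(model, X):
    classes, centroids = model
    # Squared Euclidean distance from every sample to every centroid.
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[d.argmin(axis=1)]


def bootstrap_632_error(X, y, n_boot=100, seed=0):
    """Weighted average of resubstitution and out-of-bootstrap error.

    The 0.368 / 0.632 weights are an assumption (the usual .632 rule).
    Out-of-bootstrap errors are averaged per replicate for simplicity.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    full_model = nearest_centroid_fit(X, y)
    resub = np.mean(nearest_centroid_predict(full_model, X) != y)
    oob_errors = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)            # sample with replacement
        oob = np.setdiff1d(np.arange(n), idx)       # samples left out
        if oob.size == 0 or np.unique(y[idx]).size < np.unique(y).size:
            continue                                # skip degenerate draws
        model = nearest_centroid_fit(X[idx], y[idx])
        pred = nearest_centroid_predict(model, X[oob])
        oob_errors.append(np.mean(pred != y[oob]))
    err_oob = float(np.mean(oob_errors))
    return 0.368 * resub + 0.632 * err_oob


# Two synthetic, well separated classes with 20 samples each.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 1.0, size=(20, 5)),
               rng.normal(3.0, 1.0, size=(20, 5))])
y = np.array([0] * 20 + [1] * 20)
err = bootstrap_632_error(X, y)
print(f"estimated error rate: {err:.3f}")
```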
[Figure: error rate (Err); vertical axis from 0 to 1.5]
5 CONCLUSION
The classification scheme for both binary and
multiclass cancer diagnosis involves the transformation
of microarray data to the Mahalanobis space after
performing the KL transformation. The strength of the
proposed scheme is that the final classifier becomes a
simple Euclidean classifier.
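The reason a Euclidean classifier suffices after the
transformation can be illustrated numerically: once the data are
rotated onto the eigenvectors of the covariance matrix and each
axis is scaled by the inverse square root of its eigenvalue, the
plain Euclidean distance equals the Mahalanobis distance in the
original space. The sketch below is an illustration under
assumptions; the helper whitening_matrix, the mixing matrix A and
the data are hypothetical, and a single pooled covariance matrix
is assumed.

```python
import numpy as np


def whitening_matrix(X):
    """Return W such that X @ W has identity covariance, plus the covariance."""
    cov = np.cov(X, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    # Scale every eigenvector (column) by 1 / sqrt(its eigenvalue).
    return evecs / np.sqrt(evals), cov


# Synthetic correlated data: 200 samples, 4 features, fixed mixing matrix.
rng = np.random.default_rng(2)
A = np.array([[2.0, 1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0, 1.0],
              [0.0, 0.0, 0.0, 0.5]])
X = rng.normal(size=(200, 4)) @ A

W, cov = whitening_matrix(X)
diff = X[0] - X[1]

# Mahalanobis distance in the original space (needs the inverse covariance).
mahalanobis_sq = diff @ np.linalg.inv(cov) @ diff
# Plain Euclidean distance in the transformed space (no inverse needed).
euclidean_sq = np.sum((diff @ W) ** 2)

print(np.isclose(mahalanobis_sq, euclidean_sq))
```

This form requires an invertible covariance matrix, so with many
more genes than samples it would be applied after the dimension
has been reduced.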
The classification boundary is also nonlinear, as it incorporates
different data spreads between classes as well as gene types.
Gene selection is applied to improve the performance of the
algorithm. In both gene selection and cancer classification,
researchers have been curious about the main causes behind the
occurrence of a disease, so a lot of research has been
done on the analysis of such data.
A number of issues were encountered during the
analysis, so the accuracy and speed of the proposed technique
are checked on different published datasets. Results show that
the proposed technique is extremely efficient in both
classification performance and computation.
6 FUTURE WORKS
In future, this work can be extended by applying ensemble
techniques. Ensemble methods are sets of machine learning
techniques whose decisions are combined in some way to
improve the performance of the overall system. Two important
aspects need attention in ensemble approaches. The first is
how to generate diverse base classifiers. Traditionally,
re-sampling has been widely used to generate training datasets
for base classifier learning. This method is highly random and,
owing to the small number of samples, the resulting datasets may
be very similar. The second aspect is how to combine the base
classifiers. An intelligent approach for constructing ensemble
classifiers is proposed. The method first trains the base
classifiers with the Particle Swarm Optimization (PSO)
algorithm and then selects the appropriate classifiers to
construct a high-performance classification committee with the
Estimation of Distribution Algorithm (EDA).