
Journal of Computer Applications (JCA) ISSN: 0974-1925, Volume VI, Issue 3, 2013

A Comparative Study of Feature Selection Methods for Cancer Classification using Gene Expression Dataset
S.Gilbert Nancy a,*, Dr.S.Appavu alias Balamurugan b,1
Abstract - In gene expression datasets, classification involves high dimensionality and risk, since a large number of features are irrelevant or redundant. Classification therefore requires both a feature selection method and a classifier; this paper proposes a way of choosing a suitable combination of attribute selection and classification algorithms for good accuracy as well as computational efficiency, generalization performance, and feature interpretability. The comparative study covers several well-known feature selection methods, namely FCBF, ReliefF, SVM-t-RFE, and random selection, together with a classification algorithm called kernel-penalized SVM [2], which selects relevant features during classifier construction by penalizing each feature's use in the dual formulation of the support vector machine, and is applied to a lung cancer dataset. The experimental results show that the ReliefF feature selection method yields the best features for classification. The work can be extended to handle multiclass datasets, and further datasets can be applied.
Index Terms - SVM-t-RFE, FCBF.

I. INTRODUCTION
Microarray data classification [1] is a task involving high dimensionality and small sample sizes. A common criterion to decide on the number of selected genes is maximizing accuracy, which risks overfitting and usually selects more genes than actually needed. Relaxing the maximum-accuracy criterion means selecting the combination of attribute selection and classification algorithm that, using fewer attributes, has an accuracy not significantly worse than the best. This paper also compares the results in order to choose a suitable combination of attribute selection and classification algorithms that gives good accuracy while using a low number of gene expressions. The study uses some well-known attribute selection methods (FCBF, ReliefF, SVM-t-RFE, random selection) along with kernel-penalized SVM classification on a cancer dataset. The attributes selected by each algorithm are then subjected to the kernel-penalized classifier [2], which selects features further. Finally, the selected features are evaluated with k-fold accuracy calculation, and the algorithm with the best accuracy is considered the most suitable for cancer classification.
Manuscript received 10 September 2013. S. Gilbert Nancy, Assistant Professor, Department of Information Technology, Thiagarajar College of Engineering, Madurai, India. Dr. S. Appavu alias Balamurugan, Professor and Head, Department of Electronics and Communication Engineering, K.L.N. College of Information Technology, Madurai, India.

Data mining is a multidisciplinary effort to extract knowledge from data. A wide range of data sets is now available, which poses several challenges to data mining; examples include microarrays in genomics and proteomics, and networks in social computing and systems biology. Feature selection [3], the process of selecting a feature subset according to certain criteria, is an important dimensionality reduction technique. It is divided into four steps: determining the possible feature subsets to carry out the problem representation, evaluating the feature subsets generated in the previous step, checking whether the selected subset satisfies the search stopping criterion, and verifying the quality of the selected feature subset. It reduces the number of features and removes irrelevant features, redundancy, and noisy data. Feature selection is needed because practical classification problems require selecting a subset of attributes or features to represent the patterns to be classified. The objectives of feature selection are manifold, the most important ones being: a) to avoid overfitting and improve model performance, i.e., prediction performance; b) to provide faster and more cost-effective models; c) to gain deeper insight into the underlying processes that generated the data. At the same time, in many cases it is difficult or impossible to know without training which features are relevant to a given task and which are effectively noise. As a result, the ability to select features from a huge feature set is critical.

The training data can be labeled, unlabeled, or partially labeled, leading to the development of supervised, unsupervised, and semi-supervised feature selection algorithms. Supervised feature selection determines feature relevance by evaluating each feature's correlation with the class; without labels, unsupervised feature selection uses data variance and separability to evaluate feature relevance. Semi-supervised algorithms can use both labeled and unlabeled data. In addition, a large number of features requires a large amount of memory storage. Ideally, feature selection methods search through the subsets of features and try to find the best one among the competing candidate subsets according to some evaluation function. However, this procedure is exhaustive if it tries to find the single best subset, so stopping criteria are used: (i) whether the addition (or deletion) of a feature fails to produce a better subset, and (ii) whether an optimal subset according to some evaluation function has been obtained.

There are three main feature selection techniques based on the computational model: filter, wrapper, and embedded methods. The filter method relies on the general characteristics of the data and evaluates features without involving any learning algorithm. Because filters rely on intrinsic properties of the data, they are independent of the classifier used and are fast and scalable to high-dimensional data sets [4]. In most cases a feature relevance score is calculated and low-scoring features are removed. The key advantages of this approach are time efficiency, since filters are independent of the classifier, and applicability to supervised, semi-supervised, or unsupervised learning for both binary and multi-class problems. The main disadvantage, compared with wrapper approaches, is sub-optimality, since the selected subset of features may not be the best for any particular classifier; in some cases the number of selected features could be reduced further without significantly penalizing accuracy.

Wrapper techniques search for subsets of features that are evaluated by the performance of the intended classification method: they search the space of all possible feature subsets, assessing the quality of each subset by training and testing a specific classification model. To search the space of all feature subsets, a search algorithm is wrapped around the classification model. However, as the space of feature subsets grows exponentially with the number of features, heuristic search methods are used to guide the search for an optimal subset. These search methods can be divided into two classes, deterministic and randomized search algorithms. The advantages of this approach include the interaction between feature subset search and model selection, and the ability to take feature dependencies into account. A common drawback is a higher risk of overfitting than filter techniques and a high computational cost, especially if building the classifier is expensive, because of the iterative loop around the learning process. The wrapper method consists of two phases. The first phase is feature subset selection, which selects the best subset using a classifier's accuracy on the training data as the criterion. The second phase is learning and testing: a classifier is learned from the training data with the best feature subset and is tested on the test data.

In the embedded method, the search for an optimal subset of features is built into the classifier construction and can be seen as a search in the combined space of feature subsets and hypotheses. Embedded methods offer a good trade-off between filter and wrapper approaches, being more computationally efficient than wrappers while still modeling feature dependencies. Like wrapper approaches, embedded techniques are specific to a given learning algorithm and measure feature subset usefulness, but at lower computational cost; the key point is that variable selection is integrated, or embedded, in the learning algorithm. In addition, variable ranking, which is a filter technique, is also used because of its simplicity and scalability. Embedded methods have the advantage of including the interaction with the classification model while being far less computationally intensive than wrapper methods. Their drawbacks are a higher risk of overfitting than deterministic algorithms, computational intensity, and the restriction to a particular classifier, i.e., classifier-dependent selection.
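To make the distinction concrete, the following illustrative sketch contrasts a simple filter (univariate scoring, independent of any classifier) with a wrapper (recursive feature elimination wrapped around a linear SVM). It uses scikit-learn on synthetic data and is not the experimental setup of this paper; the feature counts and scoring function are arbitrary choices for illustration.

```python
# Illustrative contrast of a filter and a wrapper method (not the paper's exact setup).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif, RFE
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=200, n_informative=10, random_state=0)

# Filter: score each feature independently of any classifier, keep the top k.
filter_selector = SelectKBest(score_func=mutual_info_classif, k=20).fit(X, y)
filter_idx = filter_selector.get_support(indices=True)

# Wrapper: repeatedly train a linear SVM and drop the lowest-weighted features.
wrapper_selector = RFE(estimator=SVC(kernel="linear"), n_features_to_select=20).fit(X, y)
wrapper_idx = np.where(wrapper_selector.support_)[0]

print("filter keeps :", filter_idx)
print("wrapper keeps:", wrapper_idx)
```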

Classification is one of the important data mining tasks. It depends mainly on the appropriate selection of relevant features, which improves the performance of the classifier. With high-dimensional data, typically many features are irrelevant or redundant for a given learning task, with harmful consequences in terms of performance and computational cost. Classification maps the data onto predefined targets; it is supervised learning because the targets are predefined. The aim of classification is to build a classifier based on some cases, with some attributes describing the objects or one attribute describing the group of the objects. Among existing classification methods, support vector machines (SVMs) provide several advantages, such as adequate generalization to new objects, absence of local minima, and a representation that depends on only a few parameters. A support vector machine is an algorithm that attempts to find a linear separator (hyperplane) between the data points of two classes in a multidimensional space. SVMs are well suited to dealing with interactions among features and with redundant features. The kernel-penalized SVM (KP-SVM) simultaneously determines a classifier with high classification accuracy and an adequate feature subset by penalizing each feature's use in the dual formulation of the respective mathematical model; in numerical experiments, KP-SVM has outperformed existing approaches.

A common criterion for selecting the required patterns or features is maximizing accuracy; since overfitting and selecting more genes than actually needed affect accuracy, the goal here is to select a set of attributes such that classification using fewer attributes has an accuracy not statistically significantly worse than the best. This paper also gives some guidance on choosing a suitable combination of attribute selection and classification algorithms for good accuracy. The comparative study uses some well-known attribute selection methods (FCBF, ReliefF, and SVM-t-RFE, plus a random selection used as a baseline technique) and the KP-SVM classifier. KP-SVM is an embedded method that simultaneously selects relevant features during classifier construction by penalizing each feature's use in the dual formulation of support vector machines (SVM). Additionally, it has the advantage of employing an explicit stopping condition, avoiding the elimination of features that would negatively affect the classifier's performance. The experiments performed on a lung cancer dataset compare the feature selection techniques; KP-SVM outperformed the alternative approaches and consistently determined fewer relevant features [2]. The dataset is taken from the UCI repository [12], a website where numerous datasets are available in many formats, both in training form and in test form. The training data set contains missing values as well as noisy data; the training data is further converted into a test dataset for use in our algorithms.

II. ALGORITHMS FOR COMPARISON
2.1 Measures
2.1.1 To Calculate Entropy
Entropy is calculated to find the amount of information present in the dataset, as in equation (1):
H(X) = - Σi P(xi) log2 P(xi),   H(X | Y) = - Σj P(yj) Σi P(xi | yj) log2 P(xi | yj)   ---(1)
where x denotes the attributes in the relapse and non-relapse classes and y denotes the class labels (relapse and non-relapse).
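As a small, hedged illustration of equation (1), the following Python snippet estimates the entropy of the relapse/non-relapse class labels from their empirical probabilities; the label vector is a made-up example rather than the actual dataset.

```python
# Minimal sketch of equation (1): entropy of a discrete variable from empirical probabilities.
import numpy as np

def entropy(values):
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# Hypothetical class labels: 1 = relapse, 0 = non-relapse.
y = np.array([1, 0, 0, 1, 1, 0, 0, 0, 1, 0])
print(entropy(y))   # about 0.971 bits for a 4/6 split
```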

2.1.2 To Calculate Information Gain
The information gain used for feature selection is computed using equation (2):
IG(X | Y) = H(X) - H(X | Y)   ---(2)
where IG(X | Y) is the information gain and H(X) is the entropy calculated for the lung cancer dataset.

2.1.3 Calculating the Weight Factor
A weight factor is a value given to something according to how important or significant it is. Here the weight factor for each attribute is calculated by equation (3):
w = Σi αi yi xi   ---(3)
where α is a constant, y takes values in the range -1 to +1 (-1 for non-cancerous cells and +1 for cancerous cells), and x takes the value of each attribute in both sample sets.

2.1.4 To Calculate the T-Statistic Value
A statistic is a fact obtained by analyzing information expressed in numbers. Here the mean and variance obtained from the two samples are used in equation (4):
t = (x̄1 - x̄2) / sqrt(S1/n1 + S2/n2)   ---(4)
where x̄1 and x̄2 are the means of samples 1 and 2, S1 and S2 are the variances of samples 1 and 2, and n1 and n2 are the numbers of elements in samples 1 and 2.

2.1.5 Calculating the Rank
The rank is a grade that helps in selecting the list of best attributes. The attributes with higher ranks are finally chosen by equation (5):
r = λ·Wi + (1 - λ)·t   ---(5)
where Wi is the weight factor, λ is a constant, and t is the T-statistic value.

2.2 FCBF Algorithm
The Fast Correlation-Based Filter algorithm [6] finds a set of predominant features for the class. It consists of two major parts: the first part calculates the SU value for each feature, selects relevant features into a list based on a predefined threshold, and orders them in descending order of SU value. The second part further processes the ordered list to remove redundant features and keeps only the predominant ones among all the selected relevant features.
Algorithm:
Inputs: Lung cancer dataset with 1000 attributes (relapse and non-relapse cancerous cells).
Output: A limited number of attributes.
Procedure:
Step 1: Calculate the entropy and information gain, using equations (1) and (2), for every attribute in the lung cancer dataset.
Step 2: Create an array list which contains the 1000 attributes.
Step 3: Fix a threshold value; since the measure ranges between 0 and 1, the threshold is chosen in this range.
Step 4: Within the array list, compute SU (equation (6)) from the entropy and information gain of each attribute, and retain the attributes whose SU value exceeds the threshold.
Step 5: Check the relevancy between the attributes; the attributes with higher relevancy are chosen.
Flowchart of FCBF: (see figure)

Symmetry is a desired property for a measure of correlation between features. However, information gain is biased in favor of features with more values. Furthermore, the values have to be normalized to ensure they are comparable and have the same effect. Therefore symmetrical uncertainty, defined in equation (6), is chosen:
SU(X, Y) = 2 [ IG(X | Y) / (H(X) + H(Y)) ]   ---(6)
It compensates for information gain's bias toward features with more values and normalizes its values to the range [0, 1], with the value 1 indicating that knowledge of the value of either variable completely predicts the value of the other and the value 0 indicating that X and Y are independent. In addition, it still treats a pair of features symmetrically. These entropy-based measures require nominal features, but they can also be applied to measure correlations between continuous features if the values are discretized properly in advance (Fayyad & Irani, 1993; Liu et al., 2002a). Therefore, symmetrical uncertainty is used in this work.
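The following sketch shows how symmetrical uncertainty (equation (6)) can be computed and used for the relevance-ranking part of an FCBF-style filter. The discretized data, the threshold of 0.1, and the omission of the redundancy-removal second stage are simplifying assumptions for illustration only.

```python
# Sketch of the FCBF relevance step: rank features by symmetrical uncertainty (equation 6).
import numpy as np

def entropy(v):
    _, c = np.unique(v, return_counts=True)
    p = c / c.sum()
    return float(-(p * np.log2(p)).sum())

def conditional_entropy(x, y):
    h = 0.0
    for cls in np.unique(y):
        mask = (y == cls)
        h += mask.mean() * entropy(x[mask])
    return h

def symmetrical_uncertainty(x, y):
    ig = entropy(x) - conditional_entropy(x, y)          # information gain, equation (2)
    return 2.0 * ig / (entropy(x) + entropy(y))          # equation (6), value in [0, 1]

# Hypothetical discretized gene-expression matrix (rows = samples, columns = genes).
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(20, 50))
y = rng.integers(0, 2, size=20)

su = np.array([symmetrical_uncertainty(X[:, j], y) for j in range(X.shape[1])])
threshold = 0.1                                          # assumed relevance threshold
relevant = np.argsort(su)[::-1]
relevant = relevant[su[relevant] > threshold]
print("relevant genes in descending SU order:", relevant)
```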


2.3 ReliefF Algorithm
The ReliefF algorithm [8,9] is based on random sampling and belongs to the filter methods. It is the only compared algorithm that never uses the values of information gain and entropy; its calculation is based on hit and miss ratios, the difference between the two is computed, a manual threshold is set, and the features are selected based on it.
Algorithm:
Inputs: Lung cancer dataset with 1000 attributes (relapse and non-relapse cancerous cells).
Output: The attributes which are the outcome of feature selection.
Procedure:
Step 1: The method is based on a random selection process.
Step 2: The threshold value is fixed at 0.5; this is a user-defined value.
Step 3: The value of one attribute is compared with all the other attributes.
Step 4: Set the upper and lower limits for the values of both classes.
Step 5: After the iterative process, the appropriate features are obtained as the result.
This feature selection method comes under the filter category, and it gave the best set of attributes for classification. The ReliefF (Relief-F) algorithm is not limited to two-class problems, is more robust, and can deal with incomplete and noisy data. Similarly to Relief, ReliefF randomly selects an instance Ri, but then searches for k of its nearest neighbors from the same class, called nearest hits Hj, and also k nearest neighbors from each of the different classes, called nearest misses Mj(C). It updates the quality estimation W[A] for all attributes A depending on their values for Ri, the hits Hj, and the misses Mj(C). The update formula is similar to that of Relief, except that the contributions of all the hits and all the misses are averaged. The contribution for each class of the misses is weighted with the prior probability of that class, P(C), estimated from the training set. Since the contributions of hits and misses in each step should be in [0, 1] and symmetric, the miss probability weights must sum to 1. As the class of the hits is missing from the sum, each probability weight is divided by the factor 1 - P(class(Ri)), which represents the sum of the probabilities of the miss classes. The process is repeated m times. The differences for the hits and misses are calculated by equations (7) and (8):
Diff_h = ((selVal - hitVal) / (maxVal - minVal)) / (m·k)   ---(7)
Diff_m = ((selVal - missVal) / (maxVal - minVal)) / (m·k)   ---(8)
Flowchart of ReliefF: (see figure)
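A minimal ReliefF-style sketch for the two-class case follows, based on the weight update described above (equations (7) and (8)); the data, the number of iterations m, the number of neighbors k, and the selection threshold are illustrative assumptions rather than the settings used in this paper.

```python
# Minimal two-class ReliefF-style weight estimation (illustrative, not optimized).
import numpy as np

def relieff_weights(X, y, m=20, k=3, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    n, d = X.shape
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0                       # avoid division by zero for constant features
    w = np.zeros(d)
    for _ in range(m):
        i = rng.integers(n)
        diff = np.abs(X - X[i]) / span          # normalized per-feature differences
        dist = diff.sum(axis=1)
        dist[i] = np.inf                        # never pick the sampled instance itself
        same = np.where(y == y[i])[0]
        other = np.where(y != y[i])[0]
        hits = same[np.argsort(dist[same])][:k]
        misses = other[np.argsort(dist[other])][:k]
        # Decrease weights for features that differ among hits, increase for misses (eq. 7, 8).
        w -= diff[hits].sum(axis=0) / (m * k)
        w += diff[misses].sum(axis=0) / (m * k)
    return w

# Hypothetical data: 30 samples, 10 features, binary class labels.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 10))
y = rng.integers(0, 2, size=30)
w = relieff_weights(X, y)
selected = np.where(w > 0.0)[0]                 # assumed selection threshold of 0
print("weights:", np.round(w, 3), "selected:", selected)
```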

2.4 Random Selection Algorithm
In this algorithm [10,11], a sample is selected at random from a larger set, giving all the individuals in the set an equal chance of being chosen. The features are chosen at random and not more than once, to prevent a bias that would negatively affect the validity of the experimental result.
Algorithm:
Inputs: Lung cancer dataset with 1000 attributes (relapse and non-relapse cancerous cells).
Outputs: The attributes which are selected after feature selection.
Procedure:
Step 1: Attributes are selected at random from the dataset.
Step 2: Entropy and information gain are computed for all selected attributes.
Step 3: Attributes with distinct information gain are retained.
Step 4: Each attribute is compared with all other attributes and the count of each unique attribute set is listed.
Step 5: Attributes with a count of 15 are considered as the selected features.
Flowchart of Random Selection: (see figure)
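The baseline can be sketched in a few lines: a fixed-size subset of feature indices is drawn without replacement, so that no feature is chosen more than once. The subset size below is an arbitrary illustrative value, not a setting from this paper.

```python
# Baseline: pick a fixed-size random subset of features without replacement.
import numpy as np

n_features = 1000          # size of the lung cancer feature set used in this paper
n_selected = 20            # assumed subset size, for illustration only

rng = np.random.default_rng(42)
selected = rng.choice(n_features, size=n_selected, replace=False)
print(np.sort(selected))
```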

2.5 SVM-t-RFE Algorithm
The SVM-t-RFE algorithm [11] extends the support vector machine recursive feature elimination (SVM-RFE) algorithm [5] by incorporating the T-statistic. It identifies more differentially expressed genes and achieves the highest prediction accuracy using an equal or smaller number of selected genes. The t-test is a well-known statistical method for detecting differentially expressed genes between two samples in microarray data.
Algorithm:
Inputs: Lung cancer dataset with 1000 attributes (relapse and non-relapse cancerous cells).
Outputs: The attributes with the highest ranking scores are the selected features.
Procedure:
Step 1: Calculate and normalize the T-statistic for the features in the samples using the weight factor.
Step 2: Repeat steps 3-6 until the current gene subset S is empty.
Step 3: Train a linear SVM on the training data, with the current gene subset S as input variables.
Step 4: Calculate and normalize the weight vector.
Step 5: Calculate the ranking score ri for each gene.
Step 6: Remove the gene with the lowest ranking score from S.
Step 7: Based on the ranking scores, select the attributes with the highest ranking as the final selected features.

Flowchart of SVM-t-RFE:
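The sketch below illustrates the SVM-t-RFE loop: at each iteration a linear SVM is trained, its weight magnitudes are normalized and blended with normalized t-statistics into the ranking score r = λ·Wi + (1 - λ)·t of equation (5), and the lowest-ranked gene is removed. The value of λ, the synthetic data, and the stopping size are assumptions for illustration, not the paper's settings.

```python
# Illustrative SVM-t-RFE loop: rank genes by a blend of SVM weights and t-statistics (eq. 5).
import numpy as np
from sklearn.svm import SVC

def t_statistic(X, y):
    a, b = X[y == 1], X[y == -1]
    num = a.mean(axis=0) - b.mean(axis=0)
    den = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return num / (den + 1e-12)                      # equation (4), guarded against zero variance

def svm_t_rfe(X, y, lam=0.5, n_keep=10):
    genes = list(range(X.shape[1]))
    while len(genes) > n_keep:
        svm = SVC(kernel="linear").fit(X[:, genes], y)
        w = np.abs(svm.coef_).ravel()
        w = w / (np.linalg.norm(w) + 1e-12)         # normalized weight vector
        t = np.abs(t_statistic(X[:, genes], y))
        t = t / (np.linalg.norm(t) + 1e-12)         # normalized t-statistic
        r = lam * w + (1 - lam) * t                 # ranking score, equation (5)
        genes.pop(int(np.argmin(r)))                # drop the lowest-ranked gene
    return genes

# Hypothetical expression data with labels +1 (cancerous) and -1 (non-cancerous).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 60))
y = np.repeat([1, -1], 20)
print("surviving genes:", svm_t_rfe(X, y))
```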

2.6 Kernel Penalized Support Vector Machine
Kernel Penalized Support Vector Machine [2] is an embedded method that simultaneously selects relevant features during classifier construction by penalizing each feature's use in the dual formulation of support vector machines (SVM). This approach optimizes the shape of an anisotropic RBF kernel, eliminating features that have low relevance for the classifier. Additionally, it employs an explicit stopping condition, avoiding the elimination of features that would negatively affect the classifier's performance.
Algorithm:
Step 1: The dataset is loaded from the database; all attributes are treated as predictors since the columns contain numerical values.
Step 2: The mean values for both the relapse and non-relapse classes are computed, and the information gain value for each attribute is calculated.
Step 3: After obtaining the information gain values, the variance and standard deviation are computed.
Step 4: Since the data taken from the UCI repository are already a prepared dataset, no further preprocessing is needed.
Step 5: Vector values are computed by multiplying e and the standard deviation; the value of e is constant since it is a mathematical constant.
Step 6: The threshold value is set to 0.05, and the negative values that fall below this threshold are converted to zeros.
Step 7: Every selected attribute is sorted in descending order, and every sorted value is compared with all the other values.
Step 8: After comparing all the values, zero values are replaced with the previous one so that every value of the attribute is non-negative.
Kernel-penalized SVM (KP-SVM) thus simultaneously determines a classifier with high classification accuracy and an adequate feature subset by penalizing each feature's use in the dual formulation of the respective mathematical model.

III. OVERALL DESIGN
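Since the KP-SVM of [2] is not available in standard libraries, the following sketch only illustrates the overall design described in the introduction: feature selection followed by SVM classification, evaluated with k-fold accuracy. A univariate selector and a standard RBF-kernel SVC stand in for the compared selectors and for KP-SVM; the number of retained genes and all other parameters are assumptions for illustration.

```python
# Sketch of the overall design: feature selection followed by SVM classification,
# evaluated with 10-fold cross-validation. SelectKBest stands in for the compared
# selectors (FCBF, ReliefF, SVM-t-RFE, random selection) and the RBF-kernel SVC
# stands in for KP-SVM, whose penalized dual formulation is not in scikit-learn.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=1000, n_informative=20, random_state=0)

pipeline = make_pipeline(
    SelectKBest(score_func=f_classif, k=50),   # assumed number of retained genes
    SVC(kernel="rbf", C=1.0, gamma="scale"),
)

# 10-fold accuracy; selection is refit inside each fold to avoid information leakage.
scores = cross_val_score(pipeline, X, y, cv=10, scoring="accuracy")
print("mean 10-fold accuracy: %.3f" % scores.mean())
```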

IV. RESULTS AND DISCUSSION
Figure 4 shows the number of selected features after applying each algorithm. The ReliefF algorithm gave better results than the other feature selection algorithms: the 1000 attributes are reduced to 246 attributes by ReliefF, 249 attributes by FCBF, 407 by SVM-t-RFE, and 283 by random selection. The memory space occupied by irrelevant and redundant attributes is freed, so a large amount of memory is saved; reducing the memory footprint enhances the performance of the system, and a further advantage is that the processing speed can be increased.


Figure 1. Calculation of Information and Entropy.

Figure 4. Selected number of features after applying Feature selection.

Figure 1 shows the information gain and entropy values used by three of the feature selection methods (FCBF, SVM-t-RFE and Random Selection).

Figure 5. Graph that shows the result of selected features or attributes.

Figure 2. Mean calculation for the specified 2 classes.

A comparative graph showing the result of each feature selection method (number of selected features) is presented as a chart.

Figure 2 represents the mean calculation for the two classes, relapse and non-relapse, which helps in reducing features when applied in a suitable algorithm.

Figure 6. Classified attributes before and after feature selection.
Figure 3. Weight and similarity calculation.

Figure 3 shows the calculation of weight and similarity (rank) required by the above feature selection methods.

Classification of the lung dataset is performed and the result is shown in the first box of the interface, while classification on the result of the best feature selection algorithm is shown in the second. Both comparatively show the selected attributes with their respective counts.


REFERENCES
[1] Carlos J. Alonso-González, Q. Isaac Moro-Sancho, "Microarray gene expression classification with few genes: Criteria to combine attribute selection and classification methods", Expert Systems with Applications 39 (2012) 7270-7280.
[2] Sebastián Maldonado, Richard Weber, Jayanta Basak, "Simultaneous feature selection and classification using kernel-penalized support vector machines", Information Sciences 181 (2011) 115-128.
[3] Zheng Zhao, Shashvata Sharma, "Advancing Feature Selection Research", ASU Feature Selection Repository.
[4] Artur J. Ferreira, Mário A. T. Figueiredo, "Efficient feature selection filters for high-dimensional data", Pattern Recognition Letters 33 (2012) 1794-1804.
[5] Isabelle Guyon, Jason Weston, "Gene Selection for Cancer Classification using Support Vector Machines", Machine Learning, 46, 389-422, 2002.
[6] Lei Yu, Huan Liu, "Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution", Twentieth International Conference on Machine Learning (ICML-2003), Washington DC, 2003.
[7] Marko Robnik-Šikonja, Igor Kononenko, "Theoretical and Empirical Analysis of ReliefF and RReliefF", Machine Learning Journal (2003) 53:23-69.
[8] Y. Wang and F. Makedon, "Application of Relief-F Feature Filtering Algorithm to Selecting Informative Genes for Cancer Classification Using Microarray Data", in Proceedings of the 2004 IEEE Computational Systems Bioinformatics Conference, IEEE Computer Society, Washington, DC, USA, 2004.
[9] Rainer Gemulla, "Sampling Algorithms for Evolving Datasets", Ph.D. thesis, 2008.
[10] "A Sampling-based Framework for Parallel Data Mining", PPoPP'05, June 15-17, 2005, Chicago, Illinois, USA, ACM.
[11] Xiaobo Li, Sihua Peng, Jian Chen, Bingjian Lu, Honghe Zhang, Maode Lai, "SVM-T-RFE: A novel gene selection algorithm for identifying metastasis-related genes in colorectal cancer using gene expression profiles", Biochemical and Biophysical Research Communications, Volume 419, Issue 2 (March 9, 2012), pp. 148-153, Elsevier Science.
[12] UCI Machine Learning Repository website.

Figure 7: Calculated accuracy for feature selection methods

Figure 8: Graph comparing the accuracy of classification only versus feature selection followed by classification.

Figure 9: Graph showing classification accuracy before and after feature selection.

Figure 9 shows the classification accuracy of the dataset with and without feature selection. Accuracy is higher, at 73%, when the dataset is classified using the feature-selected attributes, compared with nearly 65% when the complete attribute set is classified.

V. CONCLUSION
A comparison of the feature selection techniques shows that the ReliefF feature selection method selected better features than the other methods, and the classifier's efficiency increases when the cancer dataset is subjected to feature selection before classification. In addition, memory space and error rate are reduced.

FUTURE WORK
The work can be extended to handle multiclass datasets, and further datasets can be applied. Future work can proceed in several directions. First, it would be interesting to use the proposed method in combination with variations of SVM, such as regression or multi-class formulations. Also interesting would be the application of this approach with other kernel functions, such as the polynomial kernel, or with weighted support vector machines to compensate for the undesirable effects caused by unbalanced data sets in model construction, an issue which occurs, for example, in the domains of credit scoring and fraud detection.

BIBLIOGRAPHY
Ms. S. Gilbert Nancy, Assistant Professor, Department of Information Technology, Thiagarajar College of Engineering, Madurai, India. She received the B.E. degree from the Department of Computer Science and Engineering at Sethu Institute of Technology, Madurai, and the M.E. degree from the Department of Computer Science and Engineering at Thiagarajar College of Engineering, Madurai, India. She is currently working toward the Ph.D. degree in the Department of Information and Communication Engineering at Anna University, Madurai, India. Her research interests include machine learning, pattern recognition and their applications.

Dr. S. Appavu alias Balamurugan, Professor and Head, K.L.N. College of Information Technology, Madurai, India. He received his Ph.D. in Data Mining from Anna University. His research has been supported by UGC, India. He is a member of IEEE and CSI. He has published 30 research papers in reputed journals, including Elsevier journals.
