
ACEEE Int. J. on Information Technology, Vol. 4, No. 1, March 2014

An Approach to predict Breast Cancer and Drug Suggestion using Machine Learning Techniques

Megha Rathi1 and Chetna Gupta2
1 Jaypee Institute of Information Technology, Noida, India
Email: megha.rathi@jiit.ac.in
2 Jaypee Institute of Information Technology, Noida, India
Email: Chetna.gupta@jiit.ac.in

Abstract: Breast cancer is one of the major causes of death in women globally. In this paper, we
propose a tool framework for predicting breast cancer with the help of machine learning techniques.
The proposed predictive tool helps oncologists diagnose breast cancer quickly and then supports
their decisions on the treatment method. To find the best classifier, the breast cancer data set is
pre-processed and various classification algorithms are applied to it one after another so that
their results can be compared. The tool then uses these results to identify which algorithms
perform better for the various attributes used to predict this deadly disease. The main objective
of this research is to detect this life-threatening disease at an initial stage using the best
classifier. This research suggests that machine learning and data mining techniques can be
effective in both diagnosis and treatment decisions for medical applications.
Index Terms: Breast Cancer, Machine Learning, Data Pre-processing, Prediction.

I. INTRODUCTION
Breast cancer is one of the most life-threatening diseases among women worldwide. According to [1],
over the past few decades breast cancer has been one of the top cancer killers among women
globally. Breast cancer is a malignant tumor (a group of cancer cells) that originates in breast
tissue, especially in the milk ducts. Studies indicate that breast cancer accounts for 30% of all
cancers in women globally [2]. According to [2], breast cancer caused 458,503 deaths worldwide in
2008. One million new cases were diagnosed in 2009 [3], and one and a half million new cases were
diagnosed in women in 2010. When doctors detect the disease early and treat it, breast cancer
patients are still alive six years after diagnosis [3]. Early prediction of this disease can
therefore reduce mortality rates. The study in [4] suggests that the survival rate is 88% five
years after diagnosis and 80% after ten years, so it is essential to detect this life-threatening
disease at an initial stage to increase the survivability of breast cancer patients. One of the
major clinical problems of breast cancer is judging the type of cancer the person is suffering
from. The main objective of this research is thus to develop a model that can detect the disease at
an initial stage so that proper treatment and drugs can be provided to the patient. Clinically,
many tests are required to diagnose breast cancer; a tool that can predict the disease from
certain attribute values can substantially reduce the time needed for prediction. Based on its
results, the patient can consult a doctor for immediate medical aid.
DOI: 01.IJIT.4.1.3
Association of Computer Electronics and Electrical Engineers, 2014

Controlling the mortality rate and enhancing the survivability of breast cancer patients is the
main objective of this model. The model could help in predicting breast cancer at an initial stage,
which saves a lot of valuable time. The proposed model uses a data cleaning algorithm to obtain
error-free, accurate and complete data for prediction. The machine learning classifiers are
therefore trained with error-free data, which enhances their prediction accuracy when they are
evaluated on test data. Moreover, when good oncologists are not available to the patient, a
predictive model built with machine learning techniques can support other doctors in decision
making and in starting the patient's initial treatment without delay.
It is difficult, however, to compare the accuracy of the techniques and determine the best one, as
the performance of a classifier is data dependent. Few studies have compared data mining
techniques and statistical techniques for solving prediction problems. These comparisons mainly
considered a specific data set or the distribution of the dependent variable. Predicting a disease
at an initial stage is crucial so that treatment can start as early as possible. Applying a
treatment method at an early stage decreases the damage to the patient's health, especially when
the disease is life-threatening, like cancer and AIDS.
In this paper we propose a tool framework that will help doctors in the diagnosis of breast cancer.
The proposed tool framework uses various machine learning techniques, such as SVM, Naive Bayes,
End Meta, and Function Trees, for precise and accurate prediction. The tool uses a feature
selection algorithm to improve parameters such as accuracy, precision and the kappa statistic. The
next step is to train the classifiers and judge their performance on the particular data set so
that the best classifier can be selected.
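The kappa statistic mentioned above measures how far a classifier's agreement with the true labels
exceeds the agreement expected by chance. The following is a minimal sketch for the two-class
case, assuming malignant is the positive class; the function names and the confusion-matrix counts
are illustrative and are not results from this study.

```python
def cohen_kappa(tp, fp, fn, tn):
    """Cohen's kappa for a 2x2 confusion matrix (malignant = positive class)."""
    total = tp + fp + fn + tn
    observed = (tp + tn) / total  # observed agreement, i.e. accuracy
    # Chance agreement: product of the marginal rates, summed over both classes.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    return (observed - expected) / (1 - expected)


def precision(tp, fp):
    """Fraction of samples predicted malignant that really are malignant."""
    return tp / (tp + fp)


# Illustrative counts only (assumption), not taken from the paper.
tp, fp, fn, tn = 40, 5, 10, 45
kappa = cohen_kappa(tp, fp, fn, tn)
prec = precision(tp, fp)
```

The sketch does not handle the degenerate case in which the expected agreement equals 1, where
kappa is undefined.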
The remainder of the paper is organized as follows. Section II presents the literature review.
Section III describes the application scenario, and Section IV presents the proposed framework.
Section V describes the system and methods, and Section VI presents the conclusions.
II. LITERATURE REVIEW
Abdulkadir Cakir and Burcin Demirel developed a software tool called Treatment Assistant, which
uses data mining techniques to assist doctors in choosing treatment methods for breast cancer
patients [5]. In [6], the authors developed a prediction tool for the classification of
antimicrobial peptides. Their algorithm, ClassAMP, combines two data mining techniques, and its
performance depends on the available information on the target specification of various
antimicrobial peptides. In [7], a hybrid approach is proposed for the detection of breast cancer
using a neural network together with feature selection based on Sequential Forward Selection and
Sequential Backward Selection. Two hybrid feature selection techniques are used, which combine
sequential forward and backward selection with principal component analysis and use the quadratic
discriminant analysis classification algorithm for breast cancer diagnosis. The results show good
accuracy: sequential forward selection with a neural network classifier achieves 97.57% accuracy,
and sequential backward selection with a neural network classifier achieves 98.57% accuracy.
In [8], a novel neural network classifier is implemented that uses the floating centroid method,
with particle swarm optimization with inertia weight as the optimizer, to improve the performance
of the neural network classifier. In a conventional neural network classifier, the positions of
the centroids and the classes are set manually, and the number of centroids is fixed with respect
to the number of classes. This fixed-centroid constraint decreases the chance of finding the
optimal neural network. The new approach uses particle swarm optimization with inertia weight to
improve the search performance of the floating centroid method. The results are very promising,
with an accuracy of 96.47, which is higher than that of other conventional neural network
classifiers.
A remote health monitoring approach for heart failure, based on data mining with the CART method,
is developed in [9]; it enhances the efficiency of home monitoring for detecting any severe heart
problem of the patient. The remote monitoring platform identifies the severity of the heart
problem and classifies heart failure as severe or mild. For this purpose, two classification and
regression trees (CARTs) are combined in a telemedicine platform to detect heart failure and its
severity. The data mining tasks involved in this study are preprocessing, feature extraction, and
patient classification. The software achieves accuracy and precision of 96.39% and 100% in
detecting heart failure, and of 79.31% and 82.35% in classifying heart failure as severe or mild.
In [10], Francesco Folino and Clara Pizzuti presented an approach for disease prediction that
combines data mining techniques such as clustering, Markov models and association analysis. The
medical records of patients are first clustered, and a Markov model is then generated for each
cluster. The resulting system, known as CORE, integrates the clustering, association analysis and
Markov model approaches. Paper [11] presents a series of emerging technologies for improving the
healthcare services provided to patients. It emphasizes the provision of personalized healthcare
services, which include: (1) pattern recognition methods for signal pattern classification for
diagnosis and prediction of disease, (2) body sensor networks, (3) algorithms for the analysis of
patient-specific physiological signals, (4) ontologies and context-based electronic health
records, (5) methodologies for the integration of clinical, imaging, and genetic data, (6)
diagnostic and therapeutic systems based on physiological signals, (7) modeling of physiological
signals, (8) monitoring and treatment support tools for chronic disease, (9) patient-specific
multiscale modeling, and (10) integrated e-healthcare solutions.
In [12], a Least Square Support Vector Machine (LS-SVM) classifier is used for breast cancer
detection and achieves 98.53% accuracy using tenfold cross-validation. Yeh et al. [13] proposed a
new hybrid technique using discrete particle swarm optimization and a statistical method for
breast cancer pattern mining, achieving an accuracy of 98.71%. Xu et al. [14] proposed a linear
orthogonal transform algorithm for breast cancer diagnosis, with an accuracy of 98.53%.
In [15], the authors describe the relationship between feature selection and classification
accuracy. The accuracy of a classifier depends on the features selected, so it is essential to
select relevant features from the data set. Reducing the number of attributes is important for
both supervised and unsupervised learning. The study investigates the relationship between various
feature selection techniques and the resulting classification accuracy; features are extracted by
filter and wrapper methods, and their relationship with classifier accuracy is explored. In [16],
the accuracy of classification is improved by automatically extracting training data.
Classification is a very important aspect of data mining, especially in the healthcare domain, so
it is essential to know how accurate a classifier is and how its accuracy can be improved. The
authors apply NLP for knowledge discovery, examine how to identify data sources that provide
training data at low cost, and examine how to extract training data automatically. A survey of
data mining techniques on medical data for finding locally frequent diseases is presented in [17].
It focuses on analyzing data mining techniques for finding frequent patterns for diseases such as
cancer and heart disease. In [18], the authors review feature selection for cancer classification
on gene expression data sets. The original data contain many errors, and because of the
high-dimensional input and small sample size a technique is needed to extract the relevant data
from the data source, so feature selection is needed for microarray data. The authors also review
support vector machines for the classification of cancer on microarray data sets and find that the
classifier achieves 100% accuracy when combined with feature selection techniques.
Nidhi Bhatia and Kiran Jyoti presented an analysis of heart disease prediction using a data mining
approach [19]. Heart disease is a major disease today, and predicting it is a very important
aspect of healthcare. The paper analyzes various data mining approaches for the prediction of
heart disease and finds that the neural network achieves the highest accuracy, 100%; the decision
tree also achieves a good accuracy of 99.625, and the accuracy achieved by the genetic algorithm
is 99.2%. Various cancers are predicted using XLMiner in [20]. Data mining is gaining popularity
day by day in the medical field, and this study shows the application of data mining techniques
for the detection of cancer (bone cancer, bladder cancer, stomach cancer, kidney cancer, and
uterus cancer) using XLMiner.
III. APPLICATION SCENARIO
In general, many clinical tests are required to detect even a simple condition like fever. To
predict a particular disease like cancer, we have to investigate various factors such as age,
race, family history, environment, other diseases and the patient's past history, so many
attributes are required for study and analysis. To predict precise and accurate results, an
error-free, accurate and concise data set is therefore needed to train the classifier. These days,
data mining techniques are used extensively in the medical field for patient diagnosis, drug
suggestion, treatment methods, etc. Prediction of a disease like cancer is crucial, especially at
an initial stage, so that treatment can start without delay. Medical data consist of many
attributes and many instances, and several data sets may have to be investigated, so mechanisms
are needed that can manipulate and analyze the data effectively. In general, no person can go for
cancer diagnosis regularly, so the chances increase that a patient's cancer is diagnosed at a
later stage, when his or her health degrades or cancer-related problems appear. For better
prediction, data cleaning is applied to the data set, as noisy and inconsistent data can produce
wrong predictions. One of the crucial steps in this framework is therefore the collection of the
breast cancer data set so that the cleaning algorithm can be applied to it. Data cleaning
algorithms clean the data by filling in missing values, smoothing noisy data, identifying or
removing outliers, and resolving inconsistencies. Figure-1 presents the characteristics and
quality of good data, which are essential for any prediction. Dirty data can confuse the
classifier, resulting in inaccurate output.
[Figure 1 labels: Data Processing, Aggregation, Data Transformation, Duplicate elimination;
Quality, Characteristics; Accurate, Complete, Consistent, Uniform, Validity, Nominal, Relevant,
Reliable, Continuous]
Figure 1. Quality and characteristics of good data
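Two of the cleaning operations named above, filling missing values and duplicate elimination, can
be sketched as follows. This is a minimal illustration, assuming that missing values are
represented as None and that the column median is an acceptable fill value; the paper does not
specify which cleaning algorithm is used, and the records shown are made up.

```python
from statistics import median


def drop_duplicates(rows):
    """Remove exact duplicate records while keeping first-seen order."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def fill_missing_with_median(rows):
    """Replace None entries in each column with that column's median."""
    n_cols = len(rows[0])
    medians = []
    for c in range(n_cols):
        observed = [row[c] for row in rows if row[c] is not None]
        medians.append(median(observed))
    return [[medians[c] if v is None else v for c, v in enumerate(row)]
            for row in rows]


# Made-up records on the 1-10 scale (assumption); None marks a missing value.
raw = [[5, 1, None], [3, 1, 2], [5, 1, None], [8, 10, 10]]
clean = fill_missing_with_median(drop_duplicates(raw))
```

A column in which every value is missing would need separate handling, since no median can be
computed for it.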

Applying data cleaning techniques helps to ensure that the resulting data is complete, consistent
and error free. Once the accurate data set is available, it is split into two parts: training
data, which is used to train the classifiers, and test data, on which the classifiers are applied
one by one to validate their performance. Once a classifier is trained, the tool is able to
predict breast cancer for new patients. Doctors only need to enter some data into the relevant
fields of the tool, and the outcome is available to the oncologist almost immediately. This saves
time and money, as clinical tests for cancer are expensive. Early diagnosis allows early medical
aid for breast cancer patients, and the mortality rate decreases because the disease is diagnosed
at its initial stage: first-stage cancer is more curable than second- and third-stage cancer.
Figure-2 presents the application of the proposed system.
[Figure 2 labels: Patient; Consult with oncologist; Symptoms and other feedback taken from the
patient; Input data for cancer detection; Proposed prediction tool using Machine Learning
Techniques; Classification; Cancer Diagnosed (Yes / No); Tool determines the drug and treatment
method using past history of patients; Treatment and Drug Suggestion]
Figure 2. Application Scenario of proposed system

IV. PROPOSED FRAMEWORK


In this study, we present an approach for detecting breast cancer at an initial stage and for
determining treatment methods and drug suggestions based on the past history of similar patients.
A classifier is trained to help the oncologist in cancer detection, and the tool then suggests to
the doctor which drugs and which treatment methods are best suited to the particular patient.
First, we collect the breast cancer data set and clean it using a data cleaning algorithm. Second,
we split the data set into two parts: training data and test data. The classifiers are trained
using the past history, and we then apply the various classifiers to check which one produces
better results in terms of accuracy on the breast cancer data set. Once the disease is confirmed,
the tool also provides doctors with the list of drugs given to such patients and determines the
treatment method. For this purpose we again train classifiers using the past history of such
patients, so that when a new case arrives the tool is able to suggest the best drug and treatment
method. The proposed study can be very useful in the healthcare domain because it aids the
prediction of the disease, its treatment method and drug suggestion without delay. Even a doctor
who is new or from another domain can start treatment immediately with help from the tool, which
reduces the mortality rate and improves the survivability of breast cancer patients.
Figure-3 depicts the implementation details of the tool. As shown in the figure, we first collect
the data set. Because data sets originate from various sources, we apply a cleaning algorithm to
remove inconsistencies and errors, so that the data is consistent and error free. After that we
pre-process the data set to make it suitable for classification, applying normalization and
summarization techniques. The data set is then divided into two parts, a training data set and a
test data set. The accuracy of a classifier depends on how it is trained, so the training data
plays a crucial role in classification and prediction. Next, the framework uses a feature
selection algorithm to select only the relevant attributes from the entire set of attributes. It
has been found that results improve when data mining classifiers are combined with feature
selection techniques, so only the relevant attributes are input to the system. After that we
apply the classifiers to the test data set for the detection of cancer. This framework aims to
detect cancer at an early stage so that proper treatment steps can be taken to cure the disease at
its initial stage. Once the disease is confirmed by the system, the framework also helps in
suggesting various drugs for the patient based on the patient's medical history.
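The division of the cleaned data set into training and test parts can be sketched as below. The
paper does not state the split ratio, so the 30% test fraction, the fixed random seed and the
function name are assumptions made only for illustration.

```python
import random


def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and cut it into train and test parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Integer stand-ins for patient records (assumption).
records = list(range(100))
train, test = train_test_split(records)
```

Shuffling before cutting avoids a split that merely reflects the order in which the records were
collected.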
V. SYSTEM AND METHOD
The proposed framework consists of four main modules: Data Pre-processing, Feature Selection,
Classification, and Drug and Treatment Suggestion. The framework detects breast cancer at an early
stage and then, based on the patient's condition, suggests drugs and an appropriate treatment
method. The proposed software tool diagnoses the cancer quickly, so medical aid can be given to
the patient from the very initial stage of breast cancer, which enhances the survivability of
breast cancer patients and decreases the mortality rate. The framework is organized into four main
sub-modules: Data Collection, Data Pre-processing, Feature Selection, and Data Classification and
Prediction. In this section, we provide a brief overview of the techniques and the data set used.
A. Data Collection
Data collection is the first step. We collect data from the Wisconsin breast cancer database in
the UCI ML Repository [21]. This data set is commonly employed by researchers who investigate
machine learning techniques for breast cancer classification and prediction. The data set contains
a total of 699 instances. Each instance has nine attributes, each represented as an integer
between 1 and 10. These attributes, which are helpful for cancer classification and prediction,
are presented in Table I.
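A minimal sketch of reading such records is given below. It assumes a comma-separated layout with
a sample identifier first, the nine attributes of Table I next, and a class code last (2 for
benign, 4 for malignant), with "?" marking a missing value; this layout is an assumption about
the repository file and should be checked against the actual download. The two rows are made up.

```python
import csv
import io

ATTRIBUTES = [
    "Clump_Thickness", "Cell_Size_Uniformity", "Cell_Shape_Uniformity",
    "Marginal_Adhesion", "Single_Epi_Cell_size", "Bare_Nuclei",
    "Bland_Chromatin", "Normal_Nucleoli", "Mitoses",
]


def parse_records(text):
    """Parse 'id, nine attributes, class' rows into (features, label) pairs."""
    records = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        # Assumed layout: column 0 is the id, columns 1-9 are the attributes.
        values = [None if v.strip() == "?" else int(v) for v in row[1:10]]
        # Assumed class coding: 4 = malignant, anything else = benign.
        label = "malignant" if row[10].strip() == "4" else "benign"
        records.append((dict(zip(ATTRIBUTES, values)), label))
    return records


# Made-up rows in the assumed layout, not actual repository records.
sample = "1,5,1,1,1,2,1,3,1,1,2\n2,8,10,10,8,7,?,9,7,1,4\n"
records = parse_records(sample)
```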
TABLE I: DATA SET ATTRIBUTES
Attribute                 Domain
Clump_Thickness           [1,10] integer
Cell_Size_Uniformity      [1,10] integer
Cell_Shape_Uniformity     [1,10] integer
Marginal_Adhesion         [1,10] integer
Single_Epi_Cell_size      [1,10] integer
Bare_Nuclei               [1,10] integer
Bland_Chromatin           [1,10] integer
Normal_Nucleoli           [1,10] integer
Mitoses                   [1,10] integer

[Figure 3 labels: Data Set Repository; Data Cleaning Algorithm; Processed Data; Training data set;
Test Data Set; Feature Selection; Reduced data set; Classification; Classifier Performance
(Improved / Not Improved); Validation of Result; Input patient data; Prediction of drug and
treatment; Drug and treatment suggestion]
Figure 3. Proposed Framework architecture

B. Data Preprocessing
Data preprocessing is the second step, in which we process the data so that high-quality data,
correct and free of errors, is available, because classification accuracy also depends on the
quality of the data. If the attributes of the data set contain missing or noisy values the output
will be incorrect, so we apply techniques to cleanse the collected data. Data pre-processing helps
in removing inconsistent, dirty and noisy data from the data set. It is well known that quality
decisions depend on quality data, which is achieved through data pre-processing. Data extraction,
cleaning and transformation comprise the main tasks in data pre-processing. In this module, we
first apply data cleaning techniques to remove noisy and dirty data, handle missing data and
remove inconsistencies from the data set. In the second step we integrate the data if it is
collected from two or more sources, and we detect and resolve conflicts: real-world data taken
from different sources come in different formats, and to map two different file formats we need to
integrate them. In the third step data transformation is applied and the data is normalized.
Finally, a data reduction technique is applied to decrease the dimensionality of the data set, so
that redundant or unnecessary data can be removed. Figure-4 presents the data pre-processing
techniques.
[Figure 4 labels: Data Preprocessing; Data Cleaning (Missing Data, Noisy Data, Redundancy); Data
Integration (Integrate Data, Resolving data value conflict); Data Transformation (Normalization,
Aggregation); Data Reduction]
Figure 4. Data Processing Concepts and Technique
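The normalization step of the transformation stage can be illustrated with min-max scaling. The
paper does not name the normalization method, so the choice of min-max scaling to [0, 1] is an
assumption, as are the function name and the example rows.

```python
def min_max_normalize(rows):
    """Rescale every column to [0, 1] using its own minimum and maximum."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in rows:
        normalized.append([
            (v - lo) / (hi - lo) if hi > lo else 0.0  # constant column maps to 0
            for v, lo, hi in zip(row, lows, highs)
        ])
    return normalized


# Made-up rows with two attributes on the 1-10 scale (assumption).
rows = [[1, 10], [4, 10], [10, 10]]
scaled = min_max_normalize(rows)
```

Because all nine attributes already share the 1 to 10 range, the main effect of such scaling on
this data set would be to put values on a common unit interval.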

C. Feature Selection
Feature selection is the process of selecting the subset of variables (attributes) that is input
to the system. A data set often contains redundant and irrelevant data. Redundant features are
those that provide no further useful information in any context of the system. Feature selection
is very important in breast cancer prediction because it removes unnecessary and useless data,
which in turn increases the quality of prediction. The accuracy of a classifier depends on the
features of the data set, so it is necessary to select the important features that contribute to
the prediction of breast cancer. In this study, we select the relevant features from the data set
using the technique known as MRMR (Maximum Relevance and Minimum Redundancy) [22]. Minimum
redundancy feature selection is frequently used to accurately identify the characteristics of
genes and phenotypes and to narrow down their relevance. In this technique, features are selected
to be mutually far apart from each other while still having a high correlation with the
classification variable. This scheme is more powerful than the maximum relevance scheme, in which
we simply select the features that correlate most strongly with the classification variable. After
selecting the subset of features, we apply the classification algorithms to this reduced data set.
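The idea of trading relevance against redundancy can be sketched with a greedy selection that
scores each candidate feature by its mutual information with the class label minus its mean
mutual information with the features already chosen. This difference form is one common variant
of the criterion in [22]; whether the tool uses this or another variant is not stated, and the
feature names and toy data below are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def mrmr_select(features, labels, k):
    """Greedily pick k feature names by relevance minus mean redundancy."""
    relevance = {name: mutual_information(col, labels)
                 for name, col in features.items()}
    selected = []
    while len(selected) < k:
        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(
                mutual_information(features[name], features[s]) for s in selected
            ) / len(selected)
            return relevance[name] - redundancy

        candidates = [name for name in features if name not in selected]
        selected.append(max(candidates, key=score))
    return selected


# Hypothetical toy data: A1_copy duplicates A1, A2 is weaker but less redundant.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "A1": [0, 0, 0, 0, 1, 1, 1, 0],
    "A1_copy": [0, 0, 0, 0, 1, 1, 1, 0],
    "A2": [0, 0, 1, 0, 1, 1, 0, 1],
}
chosen = mrmr_select(features, labels, k=2)
```

In this toy example a pure maximum-relevance ranking would pick A1 and its duplicate, whereas the
redundancy penalty makes the second pick A2.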
D. Classification and Prediction
Classification is the process of assigning data records to one of a set of predefined classes. It
is an important component of machine learning, used to extract rules and patterns from data that
can then be used for prediction. Prediction can be thought of as assigning an attribute value to
one of a set of possible classes; more precisely, prediction is usually viewed as forecasting a
continuous value, while classification forecasts a discrete value. Data classification is a
two-step process, as shown in Figure-5 and Figure-6. In the first step, a model is built that
describes the set of data classes. The model is constructed by analyzing data set tuples described
by their attributes or features. Each row is assumed to belong to one of the existing classes, as
determined by the class label attribute.

[Figure 5 labels: Training data with attributes A1, A2, ..., An; Classification Techniques (SVM,
Naive Bayes, End Meta, Function Tree); Classification Rules: if attribute in range 0 to 5 then
benign, else if > 5 malignant]
Figure 5. Learning (Training data analyzed by classification algorithm)
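The learning step can be illustrated with Naive Bayes, one of the classifiers named in this
framework. The sketch below is a plain categorical Naive Bayes with add-one smoothing, assuming
integer attribute values on the 1 to 10 scale; the class name, the smoothing choice and the
training rows are assumptions, not the implementation used by the tool.

```python
from collections import Counter, defaultdict
from math import log


class NaiveBayes:
    """Categorical Naive Bayes with add-one smoothing over n_values levels."""

    def __init__(self, n_values=10):
        self.n_values = n_values

    def fit(self, rows, labels):
        self.class_counts = Counter(labels)
        # counts[label][feature_index][value] -> number of training rows
        self.counts = defaultdict(lambda: defaultdict(Counter))
        for row, label in zip(rows, labels):
            for i, value in enumerate(row):
                self.counts[label][i][value] += 1
        return self

    def predict(self, row):
        total = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, n_label in self.class_counts.items():
            score = log(n_label / total)  # log prior of the class
            for i, value in enumerate(row):
                seen = self.counts[label][i][value]
                score += log((seen + 1) / (n_label + self.n_values))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training rows with two attributes on the 1-10 scale (assumption).
rows = [[1, 2], [2, 1], [3, 2], [8, 9], [9, 10], [10, 8]]
labels = ["benign"] * 3 + ["malignant"] * 3
model = NaiveBayes().fit(rows, labels)
```

Calling model.predict on a new row of attribute values returns the class with the highest
smoothed log-probability.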

In the second step, the predictive accuracy of the model is first estimated. The accuracy of a
model on a given test data set is the percentage of test set samples that are correctly classified
by the model. For each test sample, the known class label is compared with the learned model's
class prediction for that sample. In the proposed framework we use SVM, Naive Bayes, End Meta and
Function Trees for classification and prediction.
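The accuracy estimate defined above, and its use for choosing among candidate classifiers, can be
sketched as follows. The two classifiers are toy stand-ins written for this illustration (a
threshold rule in the spirit of the rule shown in Figure 5, and a trivial baseline), not the SVM,
Naive Bayes, End Meta or Function Tree models; the test samples are made up.

```python
def accuracy(classifier, test_rows, test_labels):
    """Percentage of test samples whose predicted label matches the known one."""
    correct = sum(
        1 for row, label in zip(test_rows, test_labels)
        if classifier(row) == label
    )
    return 100.0 * correct / len(test_rows)


def mean_rule(row):
    """Toy rule (assumption): mean attribute value above 5 means malignant."""
    return "malignant" if sum(row) / len(row) > 5 else "benign"


def always_benign(row):
    """Trivial baseline that ignores its input."""
    return "benign"


def best_classifier(classifiers, test_rows, test_labels):
    """Return (name, accuracy) of the classifier with the highest test accuracy."""
    scores = {name: accuracy(fn, test_rows, test_labels)
              for name, fn in classifiers.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Made-up test samples, not data from the paper.
test_rows = [[1, 2, 1], [9, 8, 10], [2, 3, 1], [7, 9, 6]]
test_labels = ["benign", "malignant", "benign", "malignant"]
name, score = best_classifier(
    {"mean_rule": mean_rule, "always_benign": always_benign},
    test_rows, test_labels,
)
```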
[Figure 6 labels: Classification Rule; Test data with attributes A1, A2, ..., An; New data; Input
attribute values: A1...A9 > 5: Class Malignant]
Figure 6. Classification on test data

VI. CONCLUSIONS
Breast cancer detection and treatment is a major area of concern these days, and machine learning
can be useful for medical diagnosis and related applications. In this study, we have presented an
approach for the detection of breast cancer and for treatment method and drug suggestion using
machine learning techniques. The approach leads to a tool that assists doctors in diagnosing
breast cancer and helps in suggesting drugs and treatment at the initial stage of the disease. The
main concern is to save time, which is very important in a life-threatening disease like cancer.
The proposed tool suggests proper treatment methods for the disease without delay, so medical help
starts at the initial stage of breast cancer, which increases the survivability of patients and
decreases the mortality rate.
REFERENCES
[1] Breast cancer awareness (http://www.notouchbreastscan.com/awareness globaldisease.html).
[2] "World Cancer Report". International Agency for Research on Cancer, 2008. Retrieved 2011-02-26.
[3] Breast cancer statistics (http://www.worldwidebreastcancer.com/learn/breast-cancer-statistics-worldwide).
[4] American Cancer Society, Breast cancer facts and figures 2005-2006.
[5] Abdulkadir Cakir, Burcin Demirel, "A Software Tool for determination of Breast Cancer Treatment methods using
Data Mining Approach". Springer, 2010.
[6] Shaini Joseph, Shreyas Karnik, Pravin Nilwae, V.K. Jayaram, and Susan Idicula-Thomas, "ClassAMP: A Prediction
Tool for classification of Antimicrobial Peptides". IEEE/ACM, 2012.
[7] Mustafa Serter Uzer, Onur Inan, Nihat Yilmaz, "A hybrid breast cancer detection system via neural network and
feature selection based on SBS, SFS, and PCA". Springer Journal of Neural Computing and Application, 2012.
[8] Lei Zhang, Lin Wang, Xujiewen Wang, Keke Liu, and Ajith Abraham, "Research of Neural Network Classifier
Based on FCM and PSO for Breast Cancer Classification". Springer, 2012.
[9] Leandro Pecchia, Paolo Meilillo, and Marcello Bracale, "Remote Health Monitoring of Heart Failure with Data
Mining via CART Method on HRV Feature". IEEE Transaction on Biomedical Engineering, Vol. 58, 2011.
[10] Francesco Folino and Clara Pizzuti, "Combining Markov Models and Association Analysis for Disease Prediction".
pp. 39-52, ITBAM-2011.
[11] "Emerging Technologies for Patient Specific Healthcare", IEEE Transactions on Information Technology in
Biomedicine, Vol. 16, No. 2, 2012.
[12] Polat K, Gunes S, "Breast Cancer Diagnosis using least square support vector machine". Digit Signal Process
17(4):694-701, 2007.
[13] Yeh WC, Chang WW, Chung YY, "A new hybrid approach for mining breast cancer pattern using discrete particle
swarm optimization and statistical method". Expert System Application, 2009.
[14] Xu Y, Qi Z, Wang J, "Breast Cancer diagnosis based on kernel orthogonal transform". Neural Computing
Application, 2011.
[15] Andreas G. K. Janecek, Wilfried N. Gansterer, "On the Relationship Between Feature Selection and Classification
Accuracy", Journal of Machine Learning Research: Workshop and Conference Proceedings 4, pp. 90-105, 2008.
[16] Ariel Fuxman, Anitha Kannan, Andrew B. Goldberg, "Improving Classification Accuracy Using Automatically
Extracted Training Data", Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining, pp. 1145-1154, 2009.
[17] Mohammed Abdul Khaleel, Sateesh Kumar Pradham, G.N. Dash, "A Survey of Data Mining Techniques on Medical
Data for Finding Locally Frequent Diseases", International Journal of Advanced Research in Computer Science
and Software Engineering, Vol. 3, Issue 8, 2013.
[18] G. Victo Sudha George, V. Cyril Raj, "Review On Feature Selection Techniques And The Impact Of Svm For Cancer
Classification Using Gene Expression Profile", International Journal of Computer Science & Engineering Survey,
Vol. 2, No. 3, 2011.
[19] Nidhi Bhatia, Kiran Jyoti, "An analysis of Heart Disease Prediction using Different Data Mining Techniques",
International Journal of Engineering Research & Technology, ISSN: 2278-0181, Vol. I, Issue 8, 2012.
[20] S. Jothi, S. Anita, "Data Mining Classification Applied for Cancer Disease - A Case Study using XLMiner",
International Journal of Engineering Research & Technology, Vol. I, Issue 8, 2012.
[21] UCI Machine Learning Repository Spambase Dataset. http://archive.ics.uci.edu/ml/datasets.
[22] Hanchuan Peng, Fuhui Long, and Chris Ding, "Feature selection based on mutual information: criteria of
max-dependency, max-relevance, and min-redundancy," IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. 27, No. 8, pp. 1226-1238, 2005.
