A PROJECT REPORT
Submitted by
B. Tech
in
APRIL, 2018
School of Computer Science and Engineering
DECLARATION
I hereby declare that the project entitled “Medical Data Analysis Using Machine
Learning Techniques” submitted by me to the School of Computer Science and Engineering,
Vellore Institute of Technology, Vellore-14 towards the partial fulfillment of the requirements for
the award of the degree of Bachelor of Technology in Computer Science and Engineering
(Specialization in Bio Informatics) is a record of bonafide work carried out by me under the
supervision of Prof. Rajkumar R., Designation. I further declare that the work reported in this
project has not been submitted and will not be submitted, either in part or in full, for the award of
any other degree or diploma of this institute or of any other institute or university.
Signature
CERTIFICATE
The project report entitled “Medical Data Analysis Using Machine Learning Techniques”,
prepared and submitted by Devansh Bhasin (Register No: 14BCB0045), has been found
satisfactory in terms of scope, quality and presentation as partial fulfillment of the requirements
for the award of the degree of Bachelor of Technology in Computer Science and Engineering.
Internal Examiner (Name & Signature) External Examiner (Name & Signature)
ACKNOWLEDGEMENT
This project report could not have been accomplished without the splendid support and cooperation
of my internal guide, Professor Rajkumar R. I would like to express my gratitude to him for
his unwavering support and the generous amount of time he spent helping me with my project.
It is my privilege to express heartfelt thanks to the Dean, School of Computer Science and
Engineering, Dr. Saravanan R, for his kind encouragement in all endeavors on this project.
I would also like to express my sincere gratitude to my college, Vellore Institute of Technology
University, Vellore, for providing me with the infrastructure and the opportunity to undertake and
complete such an interesting project report. My special thanks go to the Chancellor of VIT, Dr. G.
Viswanathan, who gave me the opportunity to pursue my studies in this prestigious university.
TABLE OF CONTENTS
2. Literature Survey 3
2.1 Survey of the existing models/work 3
2.2 Summary/gaps identified in the survey 5
3. Overview of the proposed system 6
3.1 Introduction and Related Concepts 6
3.2 Framework for the Proposed System (with explanation) 7
3.3 Overview of Algorithms Used 8
4. Proposed System Analysis and Design 13
4.1 Introduction 13
4.2 Requirement Analysis 13
4.2.1 Functional Requirements 13
4.2.1.1 Product Perspectives 13
4.2.1.2 Product features 14
4.2.1.3 User characteristics 15
4.2.1.4 Assumption and Dependencies 15
4.2.1.5 Domain Requirements 15
4.2.1.6 User Requirements 16
4.2.2 Non-functional Requirements 16
4.2.2.1 Product Requirements 16
4.2.2.1.1 Efficiency 16
4.2.2.1.2 Reliability 16
4.2.2.1.3 Portability 16
4.2.2.1.4 Usability 17
4.2.3 Engineering Standard Requirements 17
Economic 17
Legality 17
Social 17
Ethical 17
Health 17
4.2.4 System Requirements 18
4.2.4.1 H/W Requirements 18
4.2.4.2 S/W Requirements 18
5. Results and Discussion 20
6. Conclusion, Limitations and Scope for Future Work 39
7. References 40
LIST OF FIGURES
Fig. 27: Decision Tree (Untuned) CV Result with 5 Parameters 35
Fig. 28: Decision Tree Tuned Parameter Result 36
Fig. 29: SVM (Untuned) CV Result with 5 Parameters 36
Fig. 30: SVM Tuned Parameter Result 37
Fig. 31: KNN (Untuned) CV Result with 5 Parameters 37
Fig. 32: KNN Tuned Parameter Result 39
Fig. 33: Random Forest (Untuned) CV Result with 5 Parameters 40
LIST OF TABLES
LIST OF ABBREVIATIONS
Abbreviation Expansion
ANN Artificial Neural Network
ML Machine Learning
AI Artificial Intelligence
NB Naïve Bayes
RF Random Forest
OS Operating System
IT Information Technology
ABSTRACT
Breast cancer takes a heavy toll on human lives each year. It is among the most common of all
cancers and a major cause of death among women around the world. As per reports, Indian women
are highly affected by this disease. With recent advances in computation, data mining methods
using machine learning are reliable and effective ways to make predictions on data. Especially in
the medical field, these methods are widely used in diagnosis and analysis to support decisions.
In this project, a comparative study has been conducted between the following machine learning
algorithms: Decision Tree Regression (DTR), Decision Tree Entropy (DTE), Naive Bayes (NB),
K-Nearest Neighbors (KNN) and Artificial Neural Networks (ANN) on the Wisconsin Breast Cancer
(Kaggle, 2016) dataset. The project presents a performance comparison of these algorithms based
on accuracy, specificity and sensitivity. Visualization has been used at every step to make the
methodologies easier to understand. Further, after statistical analysis of the data, feature
selection is performed, reducing 32 classification attributes to 5 while maintaining considerable
accuracy. More advanced algorithms, Support Vector Machine (SVM) and Random Forest (RF), have
also been introduced in the feature selection part of the project. All the scripts of the project
are written in Python, use open-source machine learning libraries and run using tools of the
Anaconda Distribution.
1. Introduction
Among industries that deal with tremendous amounts of information, healthcare ranks near
the top owing to several new strategies and methodologies for data collection, such as
real-time sensor-generated data, genomic mapping and cell-image data.
Machine learning enables building efficient models that rapidly analyze data and deliver
reliable results, utilizing both real-time and time-stamped data. By harnessing the power
of machine learning, service providers in the medical and health industry can make better
decisions on patients’ diagnoses and treatment options, which would lead to an overall
improvement of healthcare services.
1.2. Motivation:
Earlier, it was difficult for physicians to collect and analyze huge volumes of data and
use them for making effective predictions and treatment decisions, since the IT industry
wasn’t mature enough and advanced sensor and imaging technology was unavailable.
Now, with machine learning and modern computers, this has become relatively easy, as
data-mining and visualization technologies such as Hive, Hadoop, Python and R are mature
enough for wide-scale adoption. A large number of data scientists now perform analytics on
volumes of data which, in their view, could not be handled before.
Machine learning models can also provide healthcare providers with crucial insights,
real-time data and advanced analytics on a patient’s disease, lab test outcomes,
blood pressure, family history, clinical trial data and more.
Hence my motivation lies in the questions:
What if this data could be used to predict a certain disease or the risk of developing a disease?
And how much cost benefit can be provided to the medical community by leveraging the
power of machine learning on initial test data?
3. Build model to predict whether breast cell tissue is malignant or benign.
2. Literature Survey
2.1. Survey of the Existing Models/Work
• Prediction/classification using traditional data mining models usually involves a machine
learning algorithm (e.g., Decision Tree, Random Forest, Support Vector Machine, Naïve
Bayes etc.), and particularly a supervised or semi-supervised learning algorithm that uses
labelled training data to train the model.
• In the part of the dataset used for testing, patients can be classified into high-risk or
low-risk groups, each group requiring a different level of treatment. These models are
valuable in the medical industry and are widely studied.
• In [1] it is stated that the Naive Bayes classifier greatly simplifies the learning process
by assuming that features are independent given the class. Although independence is
generally an unreliable assumption, in practice Naive Bayes often competes well with more
sophisticated models. Low-entropy feature distributions yield better performance for
Naive Bayes, and it also works well for certain functional feature dependencies, thus
reaching its best performance in two cases: completely independent features (which is
expected) and functionally dependent features. The accuracy of Naive Bayes is not directly
related to the degree of feature dependency measured as the class-conditional mutual
information between the features. Rather, a better predictor of Naive Bayes accuracy is
the amount of information about the class that is lost because of the independence
assumption.
• According to [2], with data mining coming into prominence, the decision tree is considered
one of the most reliable models for data mining and data analysis in certain cases. The
decision tree process uses a set of training data to generate a decision tree that correctly
classifies the training data itself. If the learning process is successful, this decision tree
should then classify newly entered data correctly as well. Decision trees are of different
types depending on various dimensions such as splitting criterion, stopping rules, branch
condition (multivariate, univariate), style of branch operation and type of final tree.
Decision tree models based on the information content of attributes are biased toward
multi-valued attributes, which carry more but likely unreliable information content.
Attributes that have additional values can be of less significance for various applications
of decision trees. This problem affects the accuracy of ID3 decision tree models and
generates unclassified regions.
• Williams et al. [3] focused on two data mining techniques, Naïve Bayes and J48
decision trees, to predict breast cancer risk in Nigerian patients. The analysis aims to
determine the most efficient and effective model. The authors collected the dataset
from the cancer registry of LASUTH, Ikeja in Lagos, Nigeria, which contains 69 instances with
17 attributes along with the class label. The dataset holds 11 non-modifiable factors and
five modifiable factors. The experiment was conducted in Weka, and the authors claim that
the J48 decision tree is better for the prediction of breast cancer risk based on its
accuracy (90.2%), precision, recall and error rate.
• In [4] a comparative study has been carried out between Naïve Bayes, Decision Tree and
Nearest Neighbor algorithms. Comparing the listed data-mining classification algorithms
shows that most Decision Tree algorithms are more accurate and have a lower error rate
than other algorithms such as K-NN and Bayesian Network. The information in a decision
tree is depicted as a collection of IF-THEN rules, which are easier for humans to
comprehend. Each algorithm has its own set of advantages and disadvantages as well as its
own area of application. No algorithm is considered perfect or able to satisfy all
constraints and criteria.
• Chaurasia et al. [5] have investigated the performance of BFTree (Best First Tree), IBK
(K-nearest neighbor classifier) and SMO (Sequential Minimal Optimization) classification
techniques on breast cancer data. The authors conducted the experiment in the Weka data
mining tool and used three evaluation criteria (time, correctly classified instances and
accuracy) to assess the superiority of each algorithm. The authors state that the SMO
algorithm performed better than the other two algorithms in terms of accuracy and error
rate. The authors have also identified the most important features for enhancing the
prediction accuracy.
• Sivakami [6] has presented a disease status prediction model using a hybrid
methodology based on Support Vector Machines (SVM). To flag the severity of the disease,
the system consists of two main parts: information treatment and option extraction, and
decision tree-support vector machines. The author has compared the results of the proposed
model with Instance-based Learning (IBL), Sequential Minimal Optimization (SMO) and
Naïve Bayes (NB) and reports that the proposed algorithm works better than the compared
algorithms, with 91% accuracy.
• Shajahaan et al. [7] have explored the applicability of decision trees for breast cancer
prediction and also analyzed performance of conventional supervised learning models such
as CART, ID3, C4.5 and Naïve Bayes. The experiment is conducted through Weka tool.
The authors have highlighted five meaningful attributes that can be considered for the
prediction. The authors have concluded that the random tree serves as the best classification
algorithm for breast cancer with higher accuracy in prediction.
• Most of these existing models run on all 30+ attributes of the dataset to achieve a
good accuracy score. However, these implementations don’t take into account the time and
cost required to accurately measure this data at the cell level. Reducing the number of
attributes while maintaining accuracy could make the entire process considerably simpler
and more cost-effective.
This project aims at identifying the breast cancer type (benign or malignant) using different
machine learning models. Through thorough research, a novel approach is provided in
order to improve the efficiency and time consumption of these models. Data of cancer
patients was collected from the Wisconsin dataset of the UCI Machine Learning Repository.
This dataset consists of 30+ attributes, on which I applied Naive Bayes, Entropy Decision
Tree, K-Nearest Neighbors, Regression Decision Tree and Artificial Neural Network
(ANN) classification algorithms and calculated performance measures such as accuracy,
sensitivity and specificity. After this, different feature selection methods helped me to
reduce the number of attributes required and also to improve the accuracy of some models by
removing some lower-ranked attributes. Not only was the contribution of these attributes
very small, but their inclusion sometimes had a negative impact on the classification
algorithms.
The whole project is written in Python. All the parameters of the algorithms used for the
comparative study are tuned as well as possible, and the accuracy reached is around 94%. To
be precise, the lowest result obtained is around 90% and the best is over 97%, with 94% as
the mean. Further, notebooks have been created that apply feature selection models, reducing
the number of attributes from 32 to a mere 5.
3.2. Framework for the Proposed System (with explanation):
Given below is a diagram of the framework of the predictive data mining process. For
predictive analytics to be effective, data analysts should follow the principle of “living the
process” to best understand the type of data, the required workflow, the target audience
and the action that would be prompted by knowing the prediction.
1. Step one: define the problem to be addressed, then gather the necessary initial
data and evaluate different models that can be used on the data.
2. Step two: select one of the best-performing models and test it with a separate data
set to validate the approach.
3. Step three: run the model in real-world scenarios with novel input values.
In particular, prediction should link carefully to clinical priorities and measurable events
such as cost effectiveness, clinical protocols or patient outcomes.
Data Flow Diagram for the proposed system:
Decision Trees
A decision tree, as the name suggests, is a tree-shaped visual representation of how one can
reach a particular decision by laying down all options and their probability of occurrence.
Decision trees are easy to understand and interpret. At each node of the tree, one can interpret
what would be the consequence of selecting that node or option.
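The abstract names an entropy-based decision tree among the compared models. As a minimal sketch of how such a tree could score a candidate split, the entropy of a node and the information gain of a split can be computed as below. The helper names and the toy labels are illustrative assumptions and are not taken from the project scripts.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy reduction obtained by splitting `parent` into `left` and `right`."""
    total = len(parent)
    weighted = (len(left) / total) * entropy(left) + (len(right) / total) * entropy(right)
    return entropy(parent) - weighted


# Hypothetical node holding 4 malignant (M) and 4 benign (B) samples.
parent = ["M", "M", "M", "M", "B", "B", "B", "B"]
# A candidate split that separates the two classes perfectly.
left, right = parent[:4], parent[4:]
gain = information_gain(parent, left, right)
print(f"parent entropy = {entropy(parent):.3f} bits, gain = {gain:.3f} bits")
```

A split with a higher information gain leaves purer child nodes, which is why an entropy-based tree would prefer it.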
Neural Networks
A neural network (also known as an Artificial Neural Network) is inspired by the human nervous
system and the way it absorbs and processes complex information. Just like humans, neural
networks learn by example and are configured for a specific application.
Neural networks are used to find patterns in complex data and thus to forecast and classify
data points. Neural networks are normally organized in layers. Layers are made up of a number of
interconnected ‘nodes’. Patterns are presented to the network via the ‘input layer’, which
communicates with one or more ‘hidden layers’ where the actual processing is done. The hidden
layers then link to an ‘output layer’ where the answer is output as shown in the graphic below.
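The layered flow described above (input layer, hidden layer, output layer) can be illustrated with a small forward pass. This is only a sketch: the weights, the layer sizes and the sigmoid output are made-up assumptions for illustration, and no training is performed.

```python
from math import exp, tanh


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(w . inputs + b) for every node."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def sigmoid(z):
    """Squash a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + exp(-z))


# Hypothetical weights for a network with 2 inputs, 3 hidden nodes and 1 output.
hidden_w = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
hidden_b = [0.0, 0.1, -0.1]
output_w = [[0.7, -0.5, 0.2]]
output_b = [0.05]

features = [1.0, 2.0]                                # input layer
hidden = dense(features, hidden_w, hidden_b, tanh)   # hidden layer
output = dense(hidden, output_w, output_b, sigmoid)  # output layer
print(hidden, output)
```

Training would adjust the weights and biases so that the output matches the known labels; that step is omitted here.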
Naive Bayes
It is a classification technique based on Bayes’ theorem with an assumption of independence
between predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a
particular feature in a class is unrelated to the presence of any other feature. For example, a fruit
may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these
features depend on each other or upon the existence of the other features, a naive Bayes classifier
would consider all of these properties to independently contribute to the probability that this fruit
is an apple.
The Naive Bayes model is easy to construct and particularly useful for very large data sets.
Despite its simplicity, Naive Bayes can sometimes outperform even highly sophisticated
classification methods.
Bayes’ theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x)
and P(x|c):
P(c|x) = P(x|c) · P(c) / P(x)
Here,
P(c|x) is the posterior probability of class (target) given predictor (attribute).
P(c) is the prior probability of class.
P(x|c) is the likelihood which is the probability of predictor given class.
P(x) is the prior probability of predictor.
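As a worked example of the four quantities defined above, the sketch below computes a posterior from a prior and two likelihoods. All probabilities are invented for illustration and do not come from the dataset.

```python
def posterior(prior_c, likelihood_x_given_c, evidence_x):
    """Bayes' theorem: P(c|x) = P(x|c) * P(c) / P(x)."""
    return likelihood_x_given_c * prior_c / evidence_x


# Hypothetical values: c = "malignant", x = "feature value is high".
p_c = 0.4              # P(c), prior probability of the class
p_x_given_c = 0.9      # P(x|c), likelihood of the predictor given the class
p_x_given_not_c = 0.2  # P(x|not c), needed to obtain P(x)

# Law of total probability: P(x) = P(x|c)P(c) + P(x|not c)P(not c)
p_x = p_x_given_c * p_c + p_x_given_not_c * (1 - p_c)

p_c_given_x = posterior(p_c, p_x_given_c, p_x)
print(f"P(x) = {p_x:.2f}, P(c|x) = {p_c_given_x:.2f}")
```

With these numbers P(x) = 0.36 + 0.12 = 0.48, so the posterior is 0.36 / 0.48 = 0.75.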
These distance functions can be Euclidean, Manhattan, Minkowski and Hamming distance. The first
three are used for continuous variables and the fourth (Hamming) for categorical
variables. If K = 1, then the case is simply assigned to the class of its nearest neighbor. At
times, choosing K turns out to be a challenge while performing KNN modeling.
KNN can easily be mapped to our real lives. If you want to learn about a person about whom you
have no information, you might find out about their close friends and the circles they move in,
and thereby gain access to information about that person.
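The distance functions and the voting rule mentioned above can be sketched in a few lines. The function names, the training points and the choice of K = 3 are assumptions made for illustration; the project scripts may be organised differently.

```python
from collections import Counter


def minkowski(a, b, p):
    """Minkowski distance; p=1 gives Manhattan and p=2 gives Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def hamming(a, b):
    """Number of positions at which two categorical vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_predict(train, query, k, p=2):
    """Majority vote among the k training points closest to `query`."""
    nearest = sorted(train, key=lambda item: minkowski(item[0], query, p))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical two-feature training points with class labels (B or M).
train = [
    ((1.0, 1.0), "B"), ((1.2, 0.8), "B"), ((0.9, 1.1), "B"),
    ((3.0, 3.2), "M"), ((3.1, 2.9), "M"),
]
prediction = knn_predict(train, (2.8, 3.0), k=3)
print(prediction)
```

Here the two closest neighbours of the query point are labelled M and the third is labelled B, so the vote goes to M. A different K can change the outcome, which is why choosing K needs care.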
Support vectors are the co-ordinates of the individual observations that lie closest to the
separating frontier. A Support Vector Machine finds the frontier (a hyperplane, or a line in two
dimensions) which best segregates the two classes.
The difference between the Random Forest algorithm and the decision tree algorithm is that in
Random Forest, the processes of finding the root node and splitting the feature nodes are
randomized.
4. Proposed System Analysis and Design
4.1. Introduction
This project started with the goal of using machine learning algorithms, learning how to optimize
their tuning parameters and, hopefully, helping with diagnosis. It is aimed to perform a
comparative analysis of different machine learning algorithms an
The goal of the project is to correctly predict whether the breast cancer is malignant or not. The
project is based on Python and can be reproduced in an Anaconda environment. All the parameters
of the algorithms are tuned as well as possible, and the accuracy reached is around 94%. To be
precise, the lowest result obtained is around 90% and the best is over 97%, with 94% as the mean.
These differing results are caused by the shuffling of the rows, which is necessary to make the
evaluation more realistic. Each algorithm is trained on 70% of the initial dataset and tested on
the remaining 30%. Further, the necessary analysis of the data is done to understand which
features are redundant and which are more important, allowing us to reduce the feature set to 5
attributes while maintaining a good accuracy score with the chosen algorithm.
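The shuffling and the 70/30 split described above can be sketched with the standard library alone. The function name, the seed argument and the 100 stand-in rows are assumptions for illustration; the actual scripts may rely on a library helper instead.

```python
import random


def shuffled_split(rows, train_fraction=0.7, seed=None):
    """Shuffle a copy of `rows` and split it into a training and a test part."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(round(train_fraction * len(rows)))
    return rows[:cut], rows[cut:]


# Hypothetical stand-in for the dataset rows.
rows = list(range(100))
train, test = shuffled_split(rows, seed=0)
print(len(train), len(test))
```

Leaving `seed` as `None` gives a different shuffle on every run, which reproduces the run-to-run variation in accuracy mentioned above. Passing a fixed seed makes a run repeatable.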
Each script makes predictions on the test set and computes the accuracy, sensitivity,
specificity and time elapsed for the given algorithm. Further, the project tries to
identify which attributes carry more weight for prediction and reduces the
number of attributes to 5. This process is accompanied by a thorough visual analysis
of the dataset and the introduction of 2 new algorithms:
Random Forest
Support Vector Machine
o Feature Selection: Reducing the number of attributes required to a maximum of 5
using different ways without much loss of accuracy. Different attribute sets
obtained by feature selection are also compared based on their feasibility and
accuracy.
4.2.1.6. User Requirements:
Good technical computer know-how
Basic programming knowledge
Basic statistical knowledge
Hands-on experience with Anaconda, Python scripts and Jupyter notebooks
4.2.2.1.1. Security:
The scripts of the project would not leave any cookies/cache on the system
computer unless authorized to do so by the admin rights. The system’s back-end
servers should only be accessible to authenticated users.
4.2.2.1.2. Reliability:
The project comprises various scripts, and the reliability of the complete project
depends on the reliability of these separate scripts. The main pillar of reliability of
the system is the database, which is continuously maintained with the most accurate
medical values and updated to reflect the most recent changes. The system will also
be functioning inside a terminal. Thus the overall stability of the system depends
on the stability of the configured environment and its underlying operating system.
4.2.2.1.3. Portability:
An open-source database is used as the training/testing dataset. In case of a system
crash or failure, re-initialization of the project can be done. Also the project is
designed keeping modularity in mind so that portability can be provided with
relative ease.
4.2.2.1.4. Usability:
The scripts and supporting modules of the system will be well documented and
easy to understand through the use of comments. The system runs on free, open-source Python
environments, which have several guides and documentation available online.
4.2.3. System Requirements
The project is written in Python and requires the following libraries to function
properly:
• NumPy stands for Numerical Python. The most powerful feature of NumPy is the
n-dimensional array. This library also contains basic linear algebra functions, Fourier
transforms, advanced random number capabilities and tools for integration with other
low-level languages like Fortran, C and C++.
• SciPy stands for Scientific Python. SciPy is built on NumPy. It is one of the most
useful libraries for a variety of high-level science and engineering modules such as discrete
Fourier transform, linear algebra, optimization and sparse matrices.
• Matplotlib for plotting a vast variety of graphs, from histograms to line plots
to heat plots. You can use the Pylab feature in the IPython notebook (ipython notebook
--pylab=inline) to use these plotting features inline. If you omit the inline option, then
pylab converts the IPython environment into an environment very similar to Matlab. You can
also use LaTeX commands to add math to your plot.
• Pandas for structured data operations and manipulations. It is extensively used for
data munging and preparation. Pandas was added relatively recently to the Python ecosystem
and has been instrumental in boosting Python’s usage in the data science community.
• scikit-learn for machine learning. Built on NumPy, SciPy and matplotlib, this
library contains a lot of efficient tools for machine learning and statistical modeling
including classification, regression, clustering and dimensionality reduction.
• Statsmodels for statistical modeling. Statsmodels is a Python module that allows
users to explore data, estimate statistical models, and perform statistical tests. An
extensive list of descriptive statistics, statistical tests, plotting functions, and result
statistics are available for different types of data and each estimator.
• Seaborn for statistical data visualization. Seaborn is a library for making attractive
and informative statistical graphics in Python. It is based on matplotlib. Seaborn aims
to make visualization a central part of exploring and understanding data.
5. Results and Discussion:
The scripts used for the comparative study were run in Anaconda Prompt using the command:
python script_name.py
Each algorithm is trained on 70% of the initial dataset and tested on the remaining 30%. The
scripts print accuracy, sensitivity, specificity, confusion matrix and time elapsed as general
output. The rows of the dataset are randomly shuffled, so the output differs each time a script
is run. Hence, we take the average of every measure (accuracy, sensitivity and specificity) for
each algorithm used on the cancer dataset as the arithmetic mean over 10 iterations.
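For reference, the three measures and their averaging over repeated runs can be sketched as follows. The function names and the toy labels are assumptions for illustration; only two runs are shown instead of ten, and the numbers are not results from the project.

```python
from statistics import mean


def evaluate(y_true, y_pred, positive=1):
    """Accuracy, sensitivity and specificity from true and predicted labels.

    Assumes both classes occur in `y_true`, otherwise a division by zero occurs.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # share of malignant cases detected
        "specificity": tn / (tn + fp),  # share of benign cases recognised
    }


def average_runs(runs):
    """Arithmetic mean of each measure over a list of per-run results."""
    return {name: mean(run[name] for run in runs) for name in runs[0]}


# Hypothetical labels from two runs (1 = malignant, 0 = benign).
run_a = evaluate([1, 1, 0, 0, 0], [1, 0, 0, 0, 0])
run_b = evaluate([1, 1, 0, 0, 0], [1, 1, 0, 1, 0])
averaged = average_runs([run_a, run_b])
print(averaged)
```

Sensitivity is computed on the malignant class and specificity on the benign class, so the two can move in opposite directions even when accuracy stays the same, as in the two toy runs above.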
The average results of each algorithm are listed along with the screenshots as follows:
Fig. 11: Screenshot of Decision Tree Regressor Algorithm
2) Decision Tree Entropy:
Average Accuracy = 92.606
Average Sensitivity = 0.9062
Average Specificity = 0.94
Fig. 13: Screenshot of ANN Algorithm
4) K Nearest Neighbors:
Average Accuracy = 92.634
Average Sensitivity = 0.862
Average Specificity = 0.96
5) Naïve Bayes:
Average Accuracy = 90.642
Average Sensitivity = 0.81
Average Specificity = 0.96
Fig. 15: Screenshot of Naïve Bayes Algorithm
For inference, the results have been tabulated in the table below for better visualization:
In terms of accuracy, the results show that Artificial Neural Networks is the best algorithm
for malignant breast cancer prediction. This algorithm, based on the multi-layer perceptron
model, provided the highest accuracy score of 97% and averaged about 94.5%. Both Decision Tree
algorithms and KNN also fared well, while the accuracy of Naïve Bayes was the lowest, probably
due to its overly simplistic nature.
However, since our dataset is imbalanced and detection of malignant cancer stages carries much
greater weight, the sensitivity scores are also highly significant. Judged by this measure, both
Decision Tree Entropy and Decision Tree Regressor appear to be more robust algorithms, as they
are more sensitive in detecting critical stages of cancer, which is important for saving a
patient’s life.
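The variant of neural network used in our model is the Multi-Layer Perceptron. The classifier
parameters are tuned as follows to achieve the best accuracy:
from sklearn.neural_network import MLPClassifier

clf = MLPClassifier(
    activation='tanh',            # hidden-layer activation function
    solver='lbfgs',               # quasi-Newton optimizer
    alpha=1e-5,                   # L2 regularization strength
    early_stopping=False,         # only effective with solver 'sgd' or 'adam'
    hidden_layer_sizes=(40, 40),  # two hidden layers of 40 nodes each
    max_iter=20000,               # upper bound on optimizer iterations
    learning_rate_init=1e-5,      # only used with solver 'sgd' or 'adam'
    power_t=0.5,                  # only used with solver 'sgd'
    tol=1e-4,                     # tolerance for the optimization
    beta_1=0.9,                   # only used with solver 'adam'
    beta_2=0.999,                 # only used with solver 'adam'
    epsilon=1e-08,                # only used with solver 'adam'
)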
Data Analysis and Visualization:
The given dataset contains 33 columns in total, some of which are completely unrelated to
prediction. Moreover, including them can reduce our accuracy and increase the time
required for training/testing of the model.
Here, ID and Unnamed: 32 are the ineffectual columns, which have been dropped from the dataset
in the data cleaning process.
Also, by their nature, the attributes can be divided into three
parts, which are:
Features mean
Features standard error
Features worst
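The cleaning and grouping step described above could look roughly like the sketch below. It assumes that column names end in `_mean`, `_se` or `_worst`; only the `_mean` suffix appears explicitly in this report, so the other two suffixes, the function name and the example header are assumptions.

```python
def group_columns(columns, drop=("ID", "Unnamed: 32")):
    """Drop unused columns and group the rest by feature-name suffix."""
    groups = {"mean": [], "se": [], "worst": [], "other": []}
    for name in columns:
        if name in drop:
            continue  # ineffectual column, removed during cleaning
        suffix = name.rsplit("_", 1)[-1]
        groups[suffix if suffix in groups else "other"].append(name)
    return groups


# Hypothetical header with a handful of columns.
header = ["ID", "radius_mean", "texture_mean", "radius_se", "radius_worst", "Unnamed: 32"]
groups = group_columns(header)
print(groups)
```

Any column that matches none of the three suffixes, such as the class label, ends up in the `other` group.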
Fig. 17: Screenshot of Division of Features into 3 Parts
For the analysis and prediction process, a single set of features could suffice and provide
effective results, given the three distinct parts of the dataset.
As the mean is generally considered a representative central measure and suitable for medical
data analysis, we select that set first. A snapshot of the statistical features of the mean
attribute set looks like:
Various measures of data distribution are displayed here for each column.
Next, we look at the frequency of the two cancer classes we want to predict, namely benign
and malignant.
Fig. 19: Count of Malignant/Benign Classes
From this graph it is visible that there is a larger number of benign cases, which can be
treated with relative ease and fewer resources.
(Note: I have mapped the Malignant (M) class to 1 and the Benign class to 0, as we have a binary
target class.)
Feature Selection:
As we have already divided the data into 3 parts and have chosen one set (features mean, initially)
for making predictions, we have already reduced the feature set to one third of its original size.
Further, a correlation graph has been made to remove multicollinearity between the features.
That is, columns which depend strongly on each other are considered redundant, as they add
little to the predictive analysis.
The following is a heat map representing the correlation between the attributes of features mean:
Here, values closer to 1 represent a high correlation between the attributes (these values are
directly dependent on each other and their presence in a dataset increases redundancy).
Observation:
The radius, perimeter and area are highly correlated, as expected from their geometric
relationship, so we will select only one of them. Similarly, compactness_mean,
concavity_mean and concavepoint_mean are highly correlated, so we
will use only compactness_mean from this set.
So the selected parameters for use are perimeter_mean, texture_mean, compactness_mean,
symmetry_mean and smoothness_mean.
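The idea of dropping features that are highly correlated with one already kept can be sketched as below. The 0.9 threshold, the greedy keep-first rule and the synthetic values (perimeter and area derived from a circle's radius) are assumptions for illustration, not the procedure or data used in the project.

```python
from math import pi, sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def drop_correlated(features, threshold=0.9):
    """Keep a feature only if it is not highly correlated with one already kept."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical values: perimeter and area are computed from the radius of a circle.
radius = [6.0, 9.0, 12.0, 15.0, 18.0, 21.0]
features = {
    "perimeter_mean": [2 * pi * r for r in radius],
    "radius_mean": radius,
    "area_mean": [pi * r * r for r in radius],
    "texture_mean": [17.0, 12.0, 21.0, 14.0, 19.0, 15.0],
}
selected = drop_correlated(features)
print(selected)
```

Because the rule keeps the first feature of each correlated group, the order of the dictionary decides which representative survives. Here perimeter_mean is listed first, matching the choice made in the text.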
Next, we check the accuracy of prediction with these chosen parameters using two new models:
Random Forest (RF) and Support Vector Machine (SVM).
With the 5 selected parameters, a moderate accuracy score was achieved, but there was visible
scope for improvement.
Now, we test the RF and SVM models with the whole mean features set (10 attributes).
The following accuracy was achieved taking all mean features:
Random Forest = 95%
Support Vector Machine = 69%
Taking all mean features, the accuracy of the Random Forest model turned out to be good,
but the increase in accuracy was slight. Occam’s razor states that when there are two
explanations for an occurrence, the simpler one is usually better.
The accuracy of the SVM model showed a sharp decrease, and the model is considered unreliable.
This method uses forests of trees to evaluate the importance of features on an artificial
classification task. It is a well-defined and useful property of the Random Forest classifier
which can be used for feature selection.
Now for selection, we chose the top 5 features from above and applied the two models.
Accuracy achieved:
Random Forest = 94%
SVM = 77%
Here it is visible that the chosen parameters suffer from a high degree of collinearity, which
seems to have a profound negative impact on SVM but doesn’t affect Random Forest, due to its
random tree-building nature.
A similar test case is reproduced on the Features Worst set, which gives the following order of
features:
Using the top 5 features from the above set also gives similar accuracy scores to the previous
test, because parameters of similar collinearity are selected.
Accuracy Scores:
Random Forest = 95%
SVM = 70%
It can be concluded that, for simplicity, Random Forest is a better algorithm for breast cancer
prediction even after reducing the set to 5 features.
Taking the features mean set, the following scatter plot is generated:
From the scatter plot, it can be observed which features can be used to separate the two
categories, Malignant (red) and Benign (blue).
A better representation of the attributes to be selected can be provided by using the seaborn
library and visualizing the data in the form of a swarm plot. Below is the swarm plot analysis of
the features mean set:
To develop the prediction models using feature selection, the attributes that appear separable in
the above plot must be selected. For example, consider radius_mean and symmetry_mean. While the
values of radius_mean appear largely separated between the two classes, considerable overlap can be
observed in the symmetry_mean column. Hence, radius_mean is the more suitable attribute for
determining results.
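The comparison of these two attributes can be sketched with seaborn as shown below. The sketch assumes scikit-learn's bundled copy of the data (columns "mean radius" and "mean symmetry") and standardizes both features to z-scores so that they share one axis; the standardization, marker size and file name are choices made for this illustration. It also prints the distance between the class means as a crude numerical counterpart to the visual impression.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, so the script also runs without a display
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer(as_frame=True)
features = ["mean radius", "mean symmetry"]
X = data.data[features]

# z-scores, so that features with different units can share the y axis
X_std = (X - X.mean()) / X.std()
X_std["diagnosis"] = data.target.map({0: "Malignant", 1: "Benign"})

# Distance between the class means, in standard deviations, per feature
class_means = X_std.groupby("diagnosis")[features].mean()
gap = (class_means.loc["Malignant"] - class_means.loc["Benign"]).abs()
print(gap)

# Long form: one row per (sample, feature) pair, as swarmplot expects
long_form = X_std.melt(id_vars="diagnosis", var_name="feature", value_name="value")

plt.figure(figsize=(8, 6))
ax = sns.swarmplot(data=long_form, x="feature", y="value", hue="diagnosis",
                   palette={"Malignant": "red", "Benign": "blue"}, size=3)
plt.savefig("features_mean_swarm.png")  # hypothetical output file name
```

A larger gap between the class means corresponds to the visibly separated swarm; a smaller gap corresponds to the overlapping one.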
Similarly, a swarm plot analysis of features SE set shows why it isn’t the most suitable feature set
for making predictions.
Cross Validation Testing with Feature Selection and Parameter Tuning for Certain Algorithms:
Deriving from the feature importance model and confirming the results with the scatter correlation
plot and swarm plot analysis, the following prediction variables have been selected for the testing:
Cross Validation is a method which involves reserving a particular sample of a data set on which
you do not train the model. Later, you test the model on this sample before finalizing the model.
Here are the steps involved in cross validation:
1. You reserve a sample data set.
2. Train the model using the remaining part of the data set.
3. Use the reserved sample of the data set as the test (validation) set. This helps you gauge
the effectiveness of the model's performance.
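The three steps above can be sketched as follows. The 20% reserve, the stratified split and the choice of Random Forest are assumptions made for this illustration; the report does not state them.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Step 1: reserve a sample of the data set (20% here, an assumed proportion).
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Step 2: train the model on the remaining part of the data set.
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Step 3: use the reserved sample as the validation set.
val_accuracy = model.score(X_val, y_val)
print(f"Validation accuracy: {val_accuracy:.2%}")
```

k-fold cross validation repeats this procedure k times, each time reserving a different part of the data, and reports the score of every repetition.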
5-fold cross validation has been applied to different stock machine learning models. Then, using
the GridSearchCV function, which exhaustively searches over the specified parameter values, the
algorithms have been tuned to be more robust and to perform better with the 5 selected features.
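A minimal sketch of the first stage, 5-fold cross validation of stock (untuned) models, is given below. The five column names are a hypothetical selection standing in for the report's list, and the three models shown are examples with library defaults.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

data = load_breast_cancer(as_frame=True)

# Hypothetical 5-feature selection, used only for illustration.
selected = ["mean radius", "mean perimeter", "mean area",
            "mean concavity", "mean concave points"]
X, y = data.data[selected], data.target

# Stock models with default parameters.
models = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "SVM": SVC(),
    "K Nearest Neighbors": KNeighborsClassifier(),
}

cv_means = {}
for name, model in models.items():
    # cv=5 gives five stratified folds for a classifier; one accuracy per fold.
    fold_scores = cross_val_score(model, X, y, cv=5)
    cv_means[name] = fold_scores.mean()
    print(f"{name}: folds={fold_scores.round(3)}, mean={cv_means[name]:.3f}")
```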
Here, 100% accuracy is not necessarily a good statistic. It indicates over-fitting for this
model. Also, the cross validation scores are only moderate, so this algorithm needs to be
tuned.
The parameters for grid search are:
Result:
SVM shows really poor cross validation results and also requires parameter tuning. The
parameter grid taken is:
Result:
This is one of the best showcases of running a parameter search. With good parameters, the SVM
works well compared with its previous implementations, which displayed really unreliable results.
The accuracy jumped from almost 70% to 94%.
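A parameter search of this kind might look like the sketch below. The grid values, the five column names and the feature-scaling step are all assumptions made for this illustration and are not the grid used in the report; the resulting score is not expected to reproduce the figure above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

data = load_breast_cancer(as_frame=True)

# Hypothetical 5-feature selection, used only for illustration.
selected = ["mean radius", "mean perimeter", "mean area",
            "mean concavity", "mean concave points"]
X, y = data.data[selected], data.target

# Scaling is added here because RBF kernels are sensitive to feature ranges.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Assumed grid; make_pipeline names the SVC step "svc".
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.1, 1],
}

# Every combination is evaluated with 5-fold cross validation.
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X, y)

print("Best parameters:", grid.best_params_)
print(f"Best mean cross validation accuracy: {grid.best_score_:.3f}")
```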
K Nearest Neighbors:
Again, the cross validation scores for this model are not very good. The parameter grid for this
model is taken as:
Result:
Both the accuracy and the cross validation scores for the Random Forest model are good enough.
Also, due to the random nature of this algorithm, it does not require any parameter tuning, as the
results are already good.
Inference: The Random Forest algorithm is the best machine learning model for predicting the
malignant or benign stages of breast cancer using our reduced feature set. After parameter tuning,
the other models have also achieved quite reliable accuracy scores and can be applied in specific
cases.
However, as is generally the case with medical datasets, it takes significant time and cost to
measure accurate breast cancer cell values to feed into the machine learning models. For this
reason, feature selection has been carried out in this project, minimizing the number of
attributes to just 5, which include correlated attributes. The most robust algorithm for this
reduced feature set is Random Forest, which yields accuracy scores as high as 96%. With this, tools
can be built for physicians as an effective and cost-efficient mechanism for early detection and
diagnosis of breast cancer, which can improve the survival rate of patients.
At this stage, however, close supervision by qualified researchers and doctors would be required
when using these algorithms, as the accuracy scores are still not 100% and the results of
prediction are directly linked to the lives of patients. Any automatic prediction in the medical
field must be 100% reliable and should not require any human intervention. Also, only a limited
amount of open data is currently available to train the machine learning models. In the future, as
more data becomes readily available for use, the accuracy must be improved to 100%.
7. References:
Journal:
1. Rish, I. (2001, August). An empirical study of the naive Bayes classifier. In IJCAI 2001
workshop on empirical methods in artificial intelligence (Vol. 3, No. 22, pp. 41-46).
IBM.L. Eschenauer, V. D. Gligor and J. Bara, ‘On Trust Establishment in Mobile Ad Hoc
Networks’, Security Protocols Springer, (2004), pp. 47-66.
2. Singh, D. (2014). Analysis of Data Mining Classification with Decision tree Technique
(Doctoral dissertation, MPUAT, Udaipur)
3. Williams, Kehinde, Peter Adebayo Idowu, Jeremiah AdemolaBalogun, and
AdeniranIsholaOluwaranti. “Breast cancer risk prediction using data mining classification
techniques.” Transactions on Networks and Communications3(2), 01-11, 2015.
4. Jadhav, S. D., & Channe, H. P. (2016). Comparative Study of K-NN, Naive Bayes and
Decision Tree Classification Techniques. International Journal of Science and
Research, 5(1).
5. Chaurasia, Vikas, and Saurabh Pal. “A novel approach for breast cancer detection using
data mining techniques.” International Journal of Innovative Research in Computer and
Communication Engineering2(1), 2456-2465, 2014.
6. Sivakami, K. “Mining Big Data: Breast Cancer Prediction using DT-SVM Hybrid
Model.”International Journal of Scientific Engineering and Applied Science (IJSEAS)–
1(5), 418-429, 2015
7. Shajahaan, S. Syed, S. Shanthi, and V. M. Chitra. “Application of Data Mining techniques
to model breast cancer data.” International Journal of Emerging Technology and Advanced
Engineering3(11), 362-369, 2013
8. Thein, HtetThazin Tike, and Khin Mo Mo Tun. “An Approach for Breast Cancer Diagnosis
Classification Using Neural Network.” Advanced Computing6(1), 1-10, 2015.
9. Aalaei, S., Shahraki, H., Rowhanimanesh, A., & Eslami, S. (2016). Feature selection using
genetic algorithm for breast cancer diagnosis: experiment on three different datasets.
Iranian Journal of Basic Medical Sciences, 19(5), 476–482.
10. Venkatesan, E., and T. Velmurugan. “Performance analysis of decision tree algorithms for
breast cancer classification.” Indian Journal of Science and Technology8(29), 1-8, 2015.
40
Weblinks:
1. https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/
2. https://www.healthcatalyst.com/predictive-analytics
3. https://pythonprogramming.net/machine-learning-tutorial-python-introduction/
4. https://www.kaggle.com/uciml/breast-cancer-wisconsin-data
5. https://www.analyticsvidhya.com/
6. https://www.scipy.org/docs.html
7. https://seaborn.pydata.org/
8. https://stackoverflow.com/
9. http://dataaspirant.com/2017/05/22/random-forest-algorithm-machine-learing/
10. https://www.tutorialspoint.com/python/index.htm
Books:
1. Hearty, J. (2016). Advanced Machine Learning with Python. Packt Publishing Ltd.
2. E.Coiera. The Guide to Health Informatics. 2nd ed.London, U.K.: Arnold, October
2003.
3. Jiawei Han, Micheline Kamber (2011). Data Mining: Concepts and Techniques, 3rd
Edition. Morgan Kaufmann Publishers Inc.
41