
Medical Data Analysis Using Machine Learning Techniques

A PROJECT REPORT

Submitted by

DEVANSH BHASIN (14BCB0045)

in partial fulfillment for the award of the degree of

B. Tech

in

Computer Science and Engineering


(Specialization in Bioinformatics)

School of Computer Science and Engineering

Vellore-632014, Tamil Nadu, India

APRIL, 2018

School of Computer Science and Engineering

DECLARATION

I hereby declare that the project entitled “Medical Data Analysis Using Machine
Learning Techniques” submitted by me to the School of Computer Science and Engineering,
Vellore Institute of Technology, Vellore-14 towards the partial fulfillment of the requirements for
the award of the degree of Bachelor of Technology in Computer Science and Engineering
(Specialization in Bioinformatics) is a record of bonafide work carried out by me under the
supervision of Prof. Rajkumar R. I further declare that the work reported in this
project has not been submitted and will not be submitted, either in part or in full, for the award of
any other degree or diploma of this institute or of any other institute or university.

Signature

Name : Devansh Bhasin


Reg. No: 14BCB0045

School of Computer Science and Engineering

CERTIFICATE

The project report entitled “Medical Data Analysis Using Machine Learning Techniques”, prepared and submitted by Devansh Bhasin (Register No: 14BCB0045), has been found satisfactory in terms of scope, quality and presentation as partial fulfillment of the requirements for the award of the degree of Bachelor of Technology in Computer Science and Engineering (Specialization in Bioinformatics) at Vellore Institute of Technology, Vellore-14, India.

Guide (Name & Signature)

Internal Examiner (Name & Signature) External Examiner (Name & Signature)

ACKNOWLEDGEMENT

This project report could not have been accomplished without the splendid support and cooperation of my internal guide, Professor Rajkumar R., and I would like to express my gratitude to him for his unwavering support and the considerable time he spent helping me complete this project.

I am very thankful to my respected Head of Department, Professor Lakshmanan K., for the confidence he had in me regarding this project and for inspiring and motivating me to bring out a successful project.

It is my privilege to express heartfelt thanks to the Dean, School of Computer Science and Engineering, Dr. Saravanan R., for his kind encouragement in all endeavors upon this project.

I would also like to express my sincere gratitude to my college, Vellore Institute of Technology, Vellore, for providing me with the infrastructure and the opportunity to undertake and complete such an interesting project. My special thanks to the Chancellor of VIT, Dr. G. Viswanathan, who gave me the opportunity to pursue my studies in this prestigious university.

TABLE OF CONTENTS

S. NO. TITLE PAGE NO.


Title Page i
Declaration ii
Certificate iii
Acknowledgement iv
Table of Contents v
List of Figures vii
List of Tables viii
List of Abbreviations ix
Abstract x
1. Introduction 1

1.1 Theoretical Background 1


1.2 Motivation 1
1.3 Aim of the proposed Work 2

1.4 Objective(s) of the proposed Work 2

2. Literature Survey 3
2.1 Survey of the existing models/work 3
2.2 Summary/gaps identified in the survey 5
3. Overview of the proposed system 6
3.1 Introduction and Related Concepts 6
3.2 Framework for the Proposed System (with explanation) 7
3.3 Overview of Algorithms Used 8
4. Proposed System Analysis and Design 13
4.1 Introduction 13
4.2 Requirement Analysis 13
4.2.1 Functional Requirements 13
4.2.1.1 Product Perspectives 13

4.2.1.2 Product features 14
4.2.1.3 User characteristics 15
4.2.1.4 Assumption and Dependencies 15
4.2.1.5 Domain Requirements 15
4.2.1.6 User Requirements 16
4.2.2 Non-functional Requirements 16
4.2.2.1 Product Requirements 16
4.2.2.1.1 Efficiency 16
4.2.2.1.2 Reliability 16
4.2.2.1.3 Portability 16
4.2.2.1.4 Usability 17
4.2.3 Engineering Standard Requirements 17
• Economic 17
• Legality 17
• Social 17
• Ethical 17
• Health 17
4.2.4 System Requirements 18
4.2.4.1 H/W Requirements 18
4.2.4.2 S/W Requirements 18
5. Results and Discussion 20
6. Conclusion, Limitations and Scope for Future Work 39

7. References 40

LIST OF FIGURES

Title Page Number


Fig. 1: General Framework of the proposed system 7
Fig. 2: Data Flow Diagram of the System 8
Fig. 3: Decision Tree Representation 8
Fig. 4: Diagram of a Basic Neural Network 9
Fig. 5: Bayes Theorem Equation 10
Fig. 6: K-Nearest Neighbors Representation 11
Fig. 7: Support Vector Machine Representation 11
Fig. 8: Random Forest Representation 12
Fig. 9: Architecture Diagram of the given Project 13
Fig. 10: Formulas of Different Evaluation Parameters 20
Fig. 11: Screenshot of Decision Tree Regressor Algorithm 21
Fig. 12: Screenshot of Decision Tree Entropy Algorithm 21
Fig. 13: Screenshot of ANN Algorithm 22
Fig. 14: Screenshot of KNN Algorithm 22
Fig. 15: Screenshot of Naïve Bayes Algorithm 23
Fig. 16: Screenshot of 1st 2 rows of the dataset 25
Fig. 17: Screenshot of Division of Features into 3 Parts 26
Fig. 18: Statistical Features of the Mean Attribute Set 26
Fig. 19: Count of Malignant/Benign Classes 27
Fig. 20: Heatmap of Attributes of Features Mean 28
Fig. 21: Rows and Columns of Train/Test Data (Divided) 29
Fig. 22: Features Mean Set in Order of Descending Importance 30
Fig. 23: Features Worst Set in Order of Descending Importance 31
Fig. 24: Scatter Plot Analysis of Features Mean Set 32
Fig. 25: Swarm Plot Analysis of Features Mean Set 33
Fig. 26: Swarm Plot Analysis of Features SE Set 34

Fig. 27: Decision Tree (Untuned) CV Result with 5 Parameters 35
Fig. 28: Decision Tree Tuned Parameter Result 36
Fig. 29: SVM (Untuned) CV Result with 5 Parameters 36
Fig. 30: SVM Tuned Parameter Result 37
Fig. 31: KNN (Untuned) CV Result with 5 Parameters 37
Fig. 32: KNN Tuned Parameter Result 39
Fig. 33: Random Forest (Untuned) CV Result with 5 Parameters 40

LIST OF TABLES

Title Page Number


Table 1: Comparative Study of Different Algorithms 23

LIST OF ABBREVIATIONS

Abbreviation Expansion
ANN Artificial Neural Network

ML Machine Learning

KNN K-Nearest Neighbors

AI Artificial Intelligence

NB Naïve Bayes

RF Random Forest

SVM Support Vector Machine

CSV Comma Separated Values

UML Unified Modeling Language

DFD Data Flow Diagram

DTE Decision Tree Entropy

DTR Decision Tree Regressor

OS Operating System

IT Information Technology

ABSTRACT

Breast cancer is one of those illnesses that takes a heavy toll on human lives each year. It is one of the most common types of all cancers and a noteworthy cause of female deaths around the world. As per reports, Indian women are highly affected by this disease. With recent advancements in computation, data mining methods using machine learning are reliable and effective ways to make predictions on data. Especially in the medical field, these methods are widely used in diagnosis and analysis to make analytical decisions. In this project, a comparative study has been conducted between the following machine learning algorithms: Decision Tree Regression (DTR), Decision Tree Entropy (DTE), Naive Bayes (NB), K-Nearest Neighbors (KNN) and Artificial Neural Networks (ANN) on the Wisconsin Breast Cancer (Kaggle, 2016) dataset. The project presents a performance comparison of these algorithms based on factors such as accuracy, specificity and sensitivity. Effective visualization strategies have been used at every step to better understand the methodologies used. Further, after statistical analysis of the data, feature selection is performed, reducing 32 classification attributes to 5 while maintaining considerable accuracy. More advanced algorithms such as Support Vector Machine (SVM) and Random Forest (RF) have also been introduced in the feature selection part of the project. All the scripts of the project are written in Python, use open-source machine learning libraries, and run using tools of the Anaconda Distribution.

1. Introduction

1.1. Theoretical Background


Our current era, with its dominating presence of computers and other information systems, is an age of data-based results and decisions. It is said that “the segments or industries that generate more data will grow faster, and the organizations that utilize this data to make important decisions will be ahead of the curve.”

With regard to industries that deal with tremendous amounts of information, healthcare ranks among the top as a result of several new strategies/methodologies for data collection, such as real-time sensor-generated data, genomic mapping, cell-image data, etc.

Machine learning enables building efficient models for rapidly analyzing data and delivering reliable results, utilizing both real-time and time-stamped data. By harnessing the power of machine learning, service providers of the medical and health industry can make better decisions on patients' diagnoses and treatment options, which would lead to an overall improvement of healthcare services.

1.2. Motivation:
Earlier, it was difficult for physicians and doctors to collect and analyze huge volumes of data and use them to make effective predictions and treatments, since the IT industry was not mature enough and advanced sensor and imaging technology was unavailable. However, with machine learning and modern computers, this has become relatively easy, as data-mining and visualization technologies such as Hive, Hadoop, Python, R, etc. are mature enough for wide-scale adoption. A large number of data scientists now perform analytics on voluminous data in ways that were not possible before.

Machine learning models can also be useful in providing crucial insights, real-time data,
and advanced analytics information in terms of the patient’s disease, lab test outcomes,
blood pressure, family lineage, clinical trial data, and more to healthcare providers.
Hence my motivation lies in the following questions: What if this data could be used to predict a certain disease or the risk of developing a disease? And how much cost benefit can be provided to the medical community by leveraging the power of machine learning on initial test data?

1.3. Aim of the proposed Work


The aim of the project is an organized analysis of structured and unstructured medical data (Breast Cancer Dataset) using the most relevant machine learning models. Different factors of various machine learning models would be compared, and the best models applicable to real-world scenarios would be published in the research findings. The project further applies feature selection to the given dataset, reducing the number of attributes multiple-fold while maintaining considerable accuracy.

1.4. Objective(s) of the proposed work


The repository is a learning exercise to:
• Apply the fundamental concepts of machine learning on an available dataset (Breast Cancer – UCI).
• Evaluate and interpret my results and justify my interpretation based on the observed data set.
• Perform the necessary statistical data analysis for a more comprehensive understanding of the process, and write notebooks that serve as computational records and document my thought process.
• Apply feature selection algorithms on the given database to produce more accurate and time-efficient results.

The analysis mainly consists of the following processes:


1. Identifying the problem and studying the data resources
2. Exploratory Data Analysis
3. Building a model to predict whether breast cell tissue is malignant or benign.

2. Literature Survey
2.1. Survey of the Existing Models/Work
• Prediction/classification using traditional data mining models usually involves a machine learning algorithm (e.g., Decision Tree, Random Forest, Support Vector Machine, Naïve Bayes, etc.), particularly a supervised or semi-supervised learning algorithm that uses labeled training data to train the model.
• Using the part of the dataset reserved for testing, patients can be classified into groups of either high risk or low risk, each group requiring different levels of treatment. These models are valuable in the medical industry and are widely studied.
• In [1] it has been stated that the Naive Bayes classifier oversimplifies the learning process by assuming that features are independent given the class. Although independence is generally an unreliable assumption, in practice Naive Bayes often competes well with more sophisticated models. A low-entropy distribution of features yields better performance for Naive Bayes, and the Bayesian Network works well for certain functional feature dependencies, thus reaching its best performance in two cases: completely independent features (which is expected) and functionally dependent features. The accuracy of Naive Bayes is not directly related to the extent of feature dependencies measured as the mutual class-conditional information between the features. Rather, a better estimator of Naive Bayes accuracy is the amount of information about the class that is lost because of the independence assumption.
• According to [2], with data mining coming into prominence, the decision tree is considered one of the most reliable models in the process of data mining and data analysis. The decision tree process involves using a set of training data to generate a decision tree that correctly classifies the training data itself. If the learning process is successful, this decision tree will then correctly classify newly entered data as well. Decision trees are of different types depending on various dimensions, such as the splitting criterion, stopping rules, branch condition (multivariate, univariate), style of branch operation and type of final tree. Decision tree models based on attribute information are biased toward multi-valued attributes, which have more, but likely unreliable, information content. Attributes that have additional values can be of less significance for various applications of decision trees. This problem affects the accuracy of ID3 decision tree models and generates unclassified regions.
• Williams et al. [3] have focused on two data mining techniques, namely naïve Bayes and J48 decision trees, to predict breast cancer risks in Nigerian patients. The analysis is made to determine the most efficient and effective model. The authors collected the dataset from the cancer registry of LASUTH, Ikeja in Lagos, Nigeria, which contains 69 instances with 17 attributes along with the class label. The dataset holds 11 non-modifiable factors and five modifiable factors. The experiment was conducted using Weka, and the authors have claimed that the J48 decision tree is better for the prediction of breast cancer risks, with an accuracy of 90.2% along with the corresponding precision, recall and error rates.
• In [4] a comparative study has been carried out between Naïve Bayes, Decision Tree and Nearest Neighbor algorithms. Comparing the different data mining classification algorithms listed shows that most decision tree algorithms are more accurate and have a lower error rate than other algorithms such as K-NN and Bayesian Networks. The information in a decision tree is depicted in the form of a collection of IF-THEN rules, which are easier for humans to comprehend. Each algorithm has its own set of advantages and disadvantages as well as its own area of implementation. None of the algorithms is perfect or can satisfy all constraints and criteria.
• Chaurasia et al. [5] have investigated the performance of the BFTree (Best First Tree), IBK (K-nearest neighbor classifier) and SMO (Sequential Minimal Optimization) classification techniques on breast cancer data. The authors conducted the experiment in the Weka data mining tool and took three evaluation criteria, namely time, correctly classified instances and accuracy, for assessing the superiority of each algorithm. The authors have stated that the performance of the SMO algorithm has been better than the other two algorithms in terms of accuracy and low error rate. The authors have also identified the most important features for enhancing the prediction accuracy.
• Sivakami [6] has presented a disease status prediction model using a hybrid methodology of Support Vector Machines (SVM). To alert on the severity of the disease, the strategy of the system consists of two main parts, namely information treatment and option extraction, and a decision tree - support vector machine. The author has compared the results of the proposed model with Instance-based Learning (IBL), Sequential Minimal Optimization (SMO) and Naïve Bayes (NB) and has shown that the proposed algorithm works better than the comparative algorithms, with 91% accuracy.
• Shajahaan et al. [7] have explored the applicability of decision trees for breast cancer prediction and also analyzed the performance of conventional supervised learning models such as CART, ID3, C4.5 and Naïve Bayes. The experiment was conducted using the Weka tool. The authors have highlighted five meaningful attributes that can be considered for prediction, and have concluded that the random tree serves as the best classification algorithm for breast cancer, with higher accuracy in prediction.

2.2. Summary/Gaps identified in the Survey:


• In the existing systems, the datasets of patients are typically small, and experience is used to describe the characteristics of diseases with specific conditions. However, these pre-selected characteristics may not capture the changes in the disease over time or its influencing factors.
• The Naive Bayes classifier is highly dependent on the shape of the data distribution, i.e., it assumes any two features are independent given the output class. Because of this, unreliable or incorrect results can be obtained at times, which can be catastrophic in the healthcare field.
• Another problem arises for continuous features present in the dataset. It is common to use a binning procedure to make continuous features discrete, but if this process is not carried out carefully, a lot of information can be lost.
• The fidelity of the information in a decision tree depends on feeding precise internal and external information at the onset. Even a small change in the input data can at times cause large changes in the tree. Changing variables, excluding duplicate information, or altering the sequence midway can lead to major changes and might require redrawing the tree.
• Among the major disadvantages of decision trees is their complexity. Decision trees are easy to use compared to other decision-making models, but preparing decision trees, especially large ones with many branches, is a complex and time-consuming affair. Implementation of decision trees can also be very costly.

• Most of these existing models run on all 30+ attributes of the dataset to achieve a good accuracy score; however, these implementations do not take into account the time and cost required to accurately measure this data at the cell level. Reducing the number of attributes while maintaining accuracy could make the entire process considerably more cost-effective and simple.

3. Overview of the Proposed System


3.1. Introduction and Related Concepts
Medical diagnosis is an ongoing research area in the medical field, where the prediction of various diseases, such as those of the heart and lungs and various cancers, is supported by previously collected data. For the given project, I have applied my knowledge to the UCI Breast Cancer Dataset.

This project aims at finding the breast cancer type (benign or malignant) using different machine learning models. Through thorough research, a novel approach is provided in order to improve the efficiency and time consumption of these models. Data of cancer patients was collected from the Wisconsin dataset of the UCI Machine Learning Repository. This dataset consists of a total of 30+ attributes, on which I applied the Naive Bayes, Entropy Decision Tree, K-Nearest Neighbors, Regression Decision Tree and Artificial Neural Network (ANN) classification algorithms and calculated various measures of efficiency like accuracy, sensitivity and specificity. After this, different feature selection methods helped me to reduce the number of attributes required and also improve the accuracy of some models by removing some lower-ranked attributes. Not only were the contributions of these attributes very small, but sometimes their inclusion had a negative impact on the classification algorithms.

The whole project is written in Python. All the parameters of the algorithms used for the comparative study are tuned as best as possible, and the accuracy reached is around 94%. To be precise, the lowest obtained result is around 90% and the best is over 97%, with 94% as the mean. Further, notebooks have been created applying feature selection models, reducing the dimension scope of the attributes from 32 to a mere 5.

3.2. Framework for the Proposed System (with explanation):
Given below is a diagram of the framework of the predictive data mining process. For predictive analytics to be effective, data analysts should follow the principle of “living the process” to best understand the type of data, the required workflow, the target audience and the action that would be prompted by knowing the prediction.

Fig. 1: General Framework of the proposed system

1. Step one: skillfully define the problem that is to be addressed, then gather the necessary initial data and evaluate the different models that can be used on the data.
2. Step two: select one of the best-performing models and test it with a separate data set to validate the approach.
3. Step three: run the model in real-world scenarios with novel input values.

A more advanced process is prescriptive analytics, which includes evidence, recommendations and actions for each prediction made by qualified professionals. In particular, predictions should link carefully to clinical priorities and measurable events such as cost effectiveness, clinical protocols or patient outcomes.
Data Flow Diagram for the proposed system:

Fig. 2: Data Flow Diagram of the system

3.3 Overview of Algorithms Used

Decision Trees
Decision trees, as the name suggests, are a tree-shaped visual representation of how one can reach a particular decision by laying down all options and their probability of occurrence. Decision trees are extremely easy to understand and interpret. At each node of the tree, one can interpret what the consequence of selecting that node or option would be.
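As a minimal illustration of training such a tree with scikit-learn (a sketch only: X and y are assumed to hold the dataset's features and malignant/benign labels, and the 70/30 split mirrors the setup described later in Section 5):

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# X: feature matrix, y: malignant/benign labels (assumed already loaded)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# criterion='entropy' gives the Decision Tree Entropy variant compared in this report
clf = DecisionTreeClassifier(criterion='entropy')
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))   # mean accuracy on the 30% test split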

Fig. 3: Decision Tree Representation

Neural Networks
A neural network (also known as an Artificial Neural Network) is inspired by the human nervous system and by how complex information is absorbed and processed by that system. Just like humans, neural networks learn by example and are configured for a specific application.

Fig. 4: Diagram of a Basic Neural Network

Neural networks are used to find patterns in complex data and thus provide forecasts and classify data points. Neural networks are normally organized in layers. Layers are made up of a number of interconnected ‘nodes’. Patterns are presented to the network via the ‘input layer’, which communicates with one or more ‘hidden layers’ where the actual processing is done. The hidden layers then link to an ‘output layer’ where the answer is output, as shown in the graphic above.

Naive Bayes
It is a classification technique based on Bayes’ theorem with an assumption of independence
between predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a
particular feature in a class is unrelated to the presence of any other feature. For example, a fruit
may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these
features depend on each other or upon the existence of the other features, a naive Bayes classifier
would consider all of these properties to independently contribute to the probability that this fruit
is an apple.

The Naive Bayesian model is easy to construct and particularly useful for very big data sets. Along with its simplicity, Naive Bayes is known to outperform even highly sophisticated classification methods.
Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x) and P(x|c), as the equation below shows:

Fig. 5: Bayes Theorem Equation

Here,
P(c|x) is the posterior probability of class (target) given predictor (attribute).
P(c) is the prior probability of class.
P(x|c) is the likelihood which is the probability of predictor given class.
P(x) is the prior probability of predictor.
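A minimal sketch of this computation with scikit-learn's Gaussian Naive Bayes, reusing the X_train/X_test split assumed in the decision tree sketch above; predict_proba returns the posterior P(c|x) = P(x|c)P(c)/P(x) for each class:

from sklearn.naive_bayes import GaussianNB

# X_train/X_test/y_train/y_test: the 70/30 split from the earlier sketch
nb = GaussianNB()
nb.fit(X_train, y_train)               # estimates P(c) and per-feature P(x|c)
posteriors = nb.predict_proba(X_test)  # posterior P(c|x) for each class
print(nb.score(X_test, y_test))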

KNN (K-Nearest Neighbors)


It can be used for both classification and regression problems; however, it is more widely used for classification problems in industry. K-Nearest Neighbors is a simple algorithm that stores all available cases and classifies new cases by a majority vote of its k neighbors. The case is assigned to the class most common among its K nearest neighbors, measured by a distance function.

These distance functions can be the Euclidean, Manhattan, Minkowski and Hamming distances. The first three are used for continuous variables and the fourth (Hamming) for categorical variables. If K = 1, then the case is simply assigned to the class of its nearest neighbor. At times, choosing K turns out to be a challenge while performing KNN modeling.

Fig. 6: K-Nearest Neighbors Representation

KNN can easily be mapped to our real lives. If you want to learn about a person of whom you have no information, you might like to find out about their close friends and the circles they move in, and thereby gain access to information about them!
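As a hedged sketch of the classifier (k and the distance metric are the main choices discussed above; the train/test variables are reused from the earlier sketches, and the values shown are illustrative, not the tuned ones):

from sklearn.neighbors import KNeighborsClassifier

# p=2 selects the Euclidean distance; p=1 would give Manhattan
knn = KNeighborsClassifier(n_neighbors=5, p=2)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))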

Support Vector Machine (SVM)


“Support Vector Machine” (SVM) is a supervised machine learning algorithm which can be used for both classification and regression challenges. However, it is mostly used for classification problems. In this algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features) with the value of each feature being the value of a particular coordinate. Then, we perform classification by finding the hyperplane that best differentiates the two classes (see the figure below).

Fig. 7: Support Vector Machine Representation

Support vectors are simply the coordinates of individual observations, and the Support Vector Machine finds the frontier (hyperplane/line) that best segregates the two classes.
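A minimal scikit-learn sketch of such a classifier; the kernel and C values below are illustrative defaults, not the tuned values used later in the report:

from sklearn.svm import SVC

# kernel and C are the parameters explored later with grid search
svm = SVC(kernel='rbf', C=1.0)
svm.fit(X_train, y_train)
print(svm.score(X_test, y_test))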

Random Forest (RF)


The Random Forest algorithm is a supervised classification algorithm. As the name suggests, it creates a forest of trees and makes it random. There is a direct relationship between the number of trees in the forest and the results it can get: the larger the number of trees, the more accurate the result. One thing to note is that creating the forest is not the same as constructing a single decision tree with the information gain or Gini index approach.

Fig. 8: Random Forest Representation

The difference between the Random Forest algorithm and the decision tree algorithm is that in Random Forest, the processes of finding the root node and splitting the feature nodes run randomly.
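A minimal sketch of the classifier (n_estimators, the number of trees, is an illustrative value; as noted above, more trees generally improve the result at extra cost):

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100)  # 100 randomized trees
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))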

4. Proposed System Analysis and Design

4.1. Introduction
This project started with the goal of using machine learning algorithms, learning how to optimize their tuning parameters, and hopefully helping with diagnosis. It aims to perform a comparative analysis of different machine learning algorithms on the given dataset.

The goal of the project is to correctly predict whether the breast cancer is malignant or not. The project is based on Python and can be reproduced in an Anaconda environment. All the parameters of the algorithms are tuned as best as possible, and the accuracy reached is around 94%. To be precise, the lowest obtained result is around 90% and the best is over 97%, with 94% as the mean.

These different results are caused by the shuffling of the elements, which is necessary to make the data more realistic. Each algorithm is trained on 70% of the initial dataset and tested on the remaining 30%. Further, the necessary analysis of the data is done to understand which features are redundant and which are the more important ones, allowing us to reduce the feature set to 5 attributes while maintaining a good accuracy score with the chosen algorithm.

4.2. Requirement Analysis

4.2.1. Functional Requirements

4.2.1.1. Product Perspective:


The project provides the following collection of scripts to train and test on the given UCI breast cancer dataset:
• Artificial Neural Networks
• K-Nearest Neighbors
• Naïve Bayes
• Decision Tree - Entropy
• Decision Tree - Regression

Each script makes predictions on the test set and computes the accuracy, sensitivity, specificity and time elapsed for the given algorithm. Further, the project tries to identify which of the attributes carry more weight for prediction and reduces the number of attributes to 5. This process is accompanied by a thorough visual analysis of the dataset and the introduction of 2 new algorithms:
• Random Forest
• Support Vector Machine

4.2.1.2. Product features:

Fig. 9: Architecture Diagram of the given Project

The project scripts provide the following features:

o Read CSV File: provides a function to read a given CSV file and convert it into a pandas DataFrame for the application of machine learning techniques.
o Random Shuffler: the scripts shuffle the data in the file randomly each time a script is run, enabling analysis over multiple iterations.
o Training the data: trains the given breast cancer dataset on one of the 5 machine learning models mentioned above.
o Making Predictions: makes predictions on the test dataset, i.e., tests the given model.
o Evaluating the model: computes the accuracy of a given model based on hits and misses.

o Feature Selection: reduces the number of attributes required to a maximum of 5, using different approaches, without much loss of accuracy. The different attribute sets obtained by feature selection are also compared based on their feasibility and accuracy.

4.2.1.3. User characteristics


• Educational level: the user of the program should be well versed in English and should have basic knowledge of programming and working with development environments.
• Technical expertise: the user should be comfortable with the Anaconda software package and with basic usage of Python machine learning libraries.

4.2.1.4. Assumption & Dependencies:


The system is designed with the following assumptions:
• The user working with the program is granted the necessary admin rights.
• The program requires the user to have sufficient technical knowledge of the Anaconda command prompt and Jupyter.
• The software package may require internet access to keep its libraries updated.

4.2.1.5. Domain Requirements:


Since the data here is patient-related, such data needs to be tested thoroughly and stored in a secure enclave. Only after such thorough testing should the system be used in real cases. Also, such systems need to meet the regulatory requirements of governments and healthcare systems.

4.2.1.6. User Requirements:
• Good computer technical know-how
• Basic programming knowledge
• Basic statistical knowledge
• Hands-on experience with Anaconda, Python scripts and Jupyter notebooks

4.2.2. Non-Functional Requirements:

4.2.2.1. Product Requirements:

4.2.2.1.1. Security:
The scripts of the project do not leave any cookies/cache on the system computer unless authorized to do so via admin rights. The system's back-end servers should only be accessible to authenticated users.

4.2.2.1.2. Reliability:
The project comprises various scripts, and the reliability of the complete project depends on the reliability of these separate scripts. The main pillar of reliability of the system is the database, which is continuously maintained with the most accurate medical values and updated to reflect the most recent changes. Also, the system functions inside a terminal; thus the overall stability of the system depends on the stability of the configured environment and its underlying operating system.

4.2.2.1.3. Portability:
An open-source database is used as the training/testing dataset. In case of a system crash or failure, the project can simply be re-initialized. The project is also designed with modularity in mind, so that portability can be provided with relative ease.

4.2.2.1.4. Usability:
The scripts and their supporting modules are well documented and easy to understand through the use of comments. The project runs on free, open-source Python environments, which have several guides and documentation available online.

4.2.2.2. Operational Requirements:


• Economic – The scripts are lightweight and capable of running efficiently even on low-cost computers. These scripts can save both patients and doctors considerable time and cost, leading to widespread economic benefits for the community.
• Legality – The programs run on open-source Python libraries freely available from sources like GitHub. Even the datasets used are freely available for research through the UCI Machine Learning Repository and Kaggle.
• Social – Research in this field would promote further use of machine learning on a variety of health data. It would encourage organizations to release more data publicly, enabling the creation of better training models. Successful implementation would coax increased funding into this field.
• Ethical – Every new data point or result produced in this project would be uploaded on open-source forums online to assist further research in this field.
• Health – The research in this project could help the medical community save a large sum of monetary funds in initial testing procedures.

4.2.3. System Requirements

4.2.3.1. H/W Requirements:


The following are recommended minimum hardware requirements:
• Processor: i3 Processor or newer
• System Storage: 250GB HDD or more
• Monitor: 14 inch+ LED/TFT Panel
• Input Devices: Keyboard, Mouse
• Memory: 4GB or more

4.2.3.2. S/W Requirements:


• Operating system: Windows 7,10
• Coding Language: Python 3.6
• IDE: Spyder, JuPyter (Anaconda)
• Database: CSV, SQL

The project is written in Python and requires the following libraries to function
properly:
• NumPy stands for Numerical Python. The most powerful feature of NumPy is the n-dimensional array. This library also contains basic linear algebra functions, Fourier transforms, advanced random number capabilities and tools for integration with other low-level languages like FORTRAN, C and C++.
• SciPy stands for Scientific Python. SciPy is built on NumPy. It is one of the most useful libraries for a variety of high-level science and engineering modules like discrete Fourier transforms, linear algebra, optimization and sparse matrices.
• Matplotlib for plotting a vast variety of graphs, from histograms to line plots to heat plots. You can use the Pylab feature in the IPython notebook (ipython notebook --pylab=inline) to use these plotting features inline. If you ignore the inline option, Pylab converts the IPython environment into an environment very similar to MATLAB. You can also use LaTeX commands to add math to your plots.
• Pandas for structured data operations and manipulations. It is extensively used for data munging and preparation. Pandas was added relatively recently to Python and has been instrumental in boosting Python's usage in the data science community.
• Scikit-learn for machine learning. Built on NumPy, SciPy and matplotlib, this library contains a lot of efficient tools for machine learning and statistical modeling, including classification, regression, clustering and dimensionality reduction.
• Statsmodels for statistical modeling. Statsmodels is a Python module that allows
users to explore data, estimate statistical models, and perform statistical tests. An
extensive list of descriptive statistics, statistical tests, plotting functions, and result
statistics are available for different types of data and each estimator.
• Seaborn for statistical data visualization. Seaborn is a library for making attractive
and informative statistical graphics in Python. It is based on matplotlib. Seaborn aims
to make visualization a central part of exploring and understanding data.

Dataset Attribute Information:


1) ID number
2) Diagnosis (M = malignant, B = benign)
3-32) Ten real-valued features are computed for each cell nucleus:
a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension ("coastline approximation" - 1)

5. Results and Discussion:

Comparative study of algorithms:

The scripts used for the comparative study were run in the Anaconda prompt using the command:
python script_name.py

Each algorithm is trained on 70% of the initial dataset and tested on the remaining 30%. The scripts print the accuracy, sensitivity, specificity, confusion matrix and time elapsed as general output. The rows of the dataset are randomly shuffled, resulting in a different output each time a script is run. Hence, we take the average of every parameter (accuracy, sensitivity and specificity) of each algorithm used on the cancer dataset by taking the arithmetic mean over 10 iterations.

Fig. 10: Formulas of Different Evaluation Parameters
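A sketch of the evaluation loop described above (assumptions: X and y hold the cleaned features and 0/1 diagnosis labels, and the decision tree stands in for any of the compared models); the metric formulas follow the standard confusion-matrix definitions shown in Fig. 10:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier

accs, sens, specs = [], [], []
for _ in range(10):                              # 10 shuffled iterations
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, shuffle=True)       # fresh random 70/30 split
    clf = DecisionTreeClassifier()               # any of the compared models
    clf.fit(X_train, y_train)
    tn, fp, fn, tp = confusion_matrix(y_test, clf.predict(X_test)).ravel()
    accs.append((tp + tn) / (tp + tn + fp + fn)) # accuracy
    sens.append(tp / (tp + fn))                  # sensitivity (true positive rate)
    specs.append(tn / (tn + fp))                 # specificity (true negative rate)

print(np.mean(accs), np.mean(sens), np.mean(specs))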

The average results of each algorithm are listed along with the screenshots as follows:

1) Decision Tree Regressor:


Average Accuracy = 93.1
Average Sensitivity = 0.9212
Average Specificity = 0.934

Fig. 11: Screenshot of Decision Tree Regressor Algorithm
2) Decision Tree Entropy:
Average Accuracy = 92.606
Average Sensitivity = 0.9062
Average Specificity = 0.94

Fig. 12: Screenshot of Decision Tree Entropy Algorithm

3) Artificial Neural Network:


Average Accuracy = 94.384
Average Sensitivity = 0.89
Average Specificity = 0.9768

Fig. 13: Screenshot of ANN Algorithm

4) K Nearest Neighbors
Average Accuracy = 92.634
Average Sensitivity = 0.862
Average Specificity = 0.96

Fig. 14: Screenshot of KNN Algorithm

5) Naïve Bayes
Average Accuracy = 90.642
Average Sensitivity = 0.81
Average Specificity = 0.96

Fig. 15: Screenshot of Naïve Bayes Algorithm

For inference, the results have been tabulated in the table below for better visualization:

Accuracy Sensitivity Specificity

Artificial Neural Networks 94.384 0.89 0.9768

Decision Tree Regression 93.1 0.9212 0.934

Decision Tree Entropy 92.606 0.9062 0.94

K Nearest Neighbors 92.634 0.862 0.96

Naive Bayes 90.642 0.81 0.96

Table 1: Comparative Study of Different Algorithms

Takeaway from study:

In terms of the accuracy results, Artificial Neural Networks is the best algorithm for malignant breast cancer prediction. This algorithm, based on the multi-layer perceptron model, provided the highest accuracy score of 97% and averaged about 94.5%. Both decision tree algorithms and KNN also fared well, while the accuracy of Naïve Bayes was the lowest, probably due to its overly simplistic nature.

However, since our dataset is imbalanced and the detection of malignant cancer stages holds a much larger value, the sensitivity scores also hold high significance. Judging by this parameter, both Decision Tree Entropy and Decision Tree Regressor appear to be more robust algorithms, as they are more sensitive in detecting critical stages of cancer, which is important to save a patient's life.

The variant of neural network used in our model is the Multi-Layer Perceptron. The classifier parameters are tuned as follows to achieve the best accuracy:

from sklearn.neural_network import MLPClassifier

clf = MLPClassifier(
    activation='tanh',            # hyperbolic tangent activation
    solver='lbfgs',               # quasi-Newton solver, suited to small datasets
    alpha=1e-5,                   # L2 regularization strength
    early_stopping=False,
    hidden_layer_sizes=(40, 40),  # two hidden layers of 40 neurons each
    max_iter=20000,
    learning_rate_init=1e-5,
    power_t=0.5,
    tol=1e-4,                     # optimization tolerance
    beta_1=0.9,                   # Adam coefficients (kept as tuned; unused by lbfgs)
    beta_2=0.999,
    epsilon=1e-08
)

Data Analysis and Visualization:

The given dataset contains 33 columns in total, out of which some are completely unrelated to prediction. Moreover, their inclusion can reduce our accuracy and increase the time required for training/testing the model.

Fig. 16: Screenshot of 1st 2 rows of the dataset

Here, id and Unnamed: 32 are the ineffectual columns, which have been dropped from the dataset in the data cleaning process.
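A minimal pandas sketch of this cleaning step, assuming the Kaggle CSV layout (the file name is illustrative); the binary mapping of the diagnosis column noted later is included here as well:

import pandas as pd

df = pd.read_csv('data.csv')                 # file name is illustrative
df = df.drop(['id', 'Unnamed: 32'], axis=1)  # drop the two ineffectual columns
df['diagnosis'] = df['diagnosis'].map({'M': 1, 'B': 0})  # binary target class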

Looking at the nature of the attributes, the data can be divided into three parts (a sketch of this split follows the list):
• Features mean
• Features standard error
• Features worst
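A sketch of the split, assuming the cleaned DataFrame from above, in which the diagnosis column is followed by the 10 mean, 10 standard-error and 10 worst attributes in order:

features_mean = list(df.columns[1:11])     # 10 mean attributes
features_se = list(df.columns[11:21])      # 10 standard-error attributes
features_worst = list(df.columns[21:31])   # 10 worst attributes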

Fig. 17: Screenshot of Division of Features into 3 Parts

For the analysis and prediction process, only one set of features could suffice and still provide effective results, given the 3-part nature of the dataset.

As the mean is generally considered a robust central measure and suitable for medical data analysis, we select the features mean set first. A snapshot of the statistical features of the mean attribute set looks like:

Fig. 18: Statistical Features of the Mean Attribute Set

Various measures of the data distribution are displayed here for each column.

Next, we look at the frequency of the two cancer classes we want to predict, namely benign and malignant:

Fig. 19: Count of Malignant/Benign Classes

From this graph it is visible that there is a larger number of benign cases of cancer, which can be cured with relative ease and fewer resources.

(Note: I have mapped Malignant (M) classes to 1 and Benign classes to 0, as we have a binary target class.)

Feature Selection:

As we have already divided the data into 3 parts and have chosen one set (features mean, initially) for making predictions, we have already reduced the feature set to one third of its original size.

Further, a correlation graph has been made to remove multicollinearity between the features. That is, columns which depend on each other are considered redundant, as keeping all of them adds no new information to the predictive analysis.

The following heat-map represents the correlation between the attributes of the features mean set:

Fig. 20: Heatmap of Attributes of Features Mean

Here, values closer to 1 represent a high correlation between the attributes (such attributes are directly dependent on each other, and their joint presence in a dataset increases redundancy).
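A short seaborn sketch of how such a heatmap can be produced (df and features_mean are assumed from the earlier steps):

import matplotlib.pyplot as plt
import seaborn as sns

corr = df[features_mean].corr()   # pairwise correlation matrix
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()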

Observation:
• The radius, perimeter and area are highly correlated, as expected from their relationship, so from these we will select only one.
• compactness_mean, concavity_mean and concavepoint_mean are highly correlated, so we will use only compactness_mean from this set.
• The selected parameters for use are therefore perimeter_mean, texture_mean, compactness_mean, symmetry_mean and smoothness_mean.

Next, we check the accuracy of prediction with these chosen parameters using two new models: Random Forest (RF) and Support Vector Machine (SVM).

The training and testing split is again kept at 70-30.

Fig. 21: Rows and Columns of Train/Test Data (Divided)

With Random Forest, accuracy achieved = 91%

With Support Vector Machine, accuracy achieved = 86%

With the 5 selected parameters, a reasonable accuracy score was achieved, but there was visible scope for improvement.

Now, we test the Random Forest and SVM models with the whole features mean set (10 attributes).

The following accuracy was achieved taking all mean features:
Random Forest = 95%
Support Vector Machine = 69%

Taking all mean features, the accuracy of the Random Forest model turned out to be considerably good, but the increase in accuracy was subtle. Moreover, Occam's razor states that if there exist two explanations for an occurrence, the simpler one is usually better.

The accuracy of the SVM model displayed a sharp decrease, and the model is considered unreliable.

Feature selection using model.feature_importances_:

This method uses forests of trees to evaluate the importance of features on a classification task. It is a well-defined and useful property of the Random Forest classifier which can be used for feature selection.
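A brief sketch of this selection step (rf is assumed to be a RandomForestClassifier already fitted on the features mean set, df[features_mean]):

import pandas as pd

# one importance score per column, in the order the model was trained on
importances = pd.Series(rf.feature_importances_, index=features_mean)
print(importances.sort_values(ascending=False))   # descending importance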

The features, in descending order of their importance, are:

Fig. 22: Features Mean Set in Order of Descending Importance

Now, for selection, we chose the top 5 features from above and applied the two models.

Accuracy achieved:
Random Forest = 94%
SVM = 77%

Here it is visible that the chosen parameters suffer from a high degree of collinearity, which seems to have a profound negative impact on SVM but does not affect Random Forest, due to its random tree-building nature.

A similar test case is reproduced on the features worst set, which gives the following order of features:

Fig. 23: Features Worst Set in Order of Descending Importance

Using the top 5 features from the above set also gives accuracy scores similar to the previous test, since the selected parameters have similar collinearity.

Accuracy scores:
Random Forest = 95%
SVM = 70%

It can be concluded that, for simplicity, Random Forest is the better algorithm for breast cancer prediction even after reducing the set to 5 features.

Scatter Plot Analysis:

Taking the features mean set, the following scatter plot is generated:

Fig. 24: Scatter Plot Analysis of Features Mean Set

From the scatter plot, it can be observed which features can be used to separate the two categories, Malignant (red) and Benign (blue).

Here, it can be seen that radius, area and perimeter have a strong linear relationship, as expected. The graph also shows that features like texture_mean, smoothness_mean, symmetry_mean and fractal_dimension_mean cannot be used to classify the two categories, because the two categories are mixed and there is no separable plane.

A better representation of the attributes to be selected can be provided by using the seaborn library and visualizing the data in the form of a swarm plot. Below is the swarm plot analysis of the features mean set:
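First, a sketch of how such a plot can be produced with seaborn (df and features_mean as before; standardizing the features so they share one scale is an assumption about the plotting setup, not something the report states):

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

data = df[features_mean]
data = (data - data.mean()) / data.std()          # standardize to one scale
data = pd.concat([df['diagnosis'], data], axis=1)
data = pd.melt(data, id_vars='diagnosis',         # long format for seaborn
               var_name='features', value_name='value')
sns.swarmplot(x='features', y='value', hue='diagnosis', data=data)
plt.xticks(rotation=90)
plt.show()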

Fig. 25: Swarm Plot Analysis of Features Mean Set

For developing the prediction models using feature selection, those attributes must be selected which appear separable in the above plot. For example, look at radius_mean and symmetry_mean here: while the plot values of radius_mean appear largely separated, a lot of overlap can be observed in the symmetry_mean column. Hence, it is clear that radius_mean is a more suitable attribute for determining results.

Similarly, a swarm plot analysis of the features SE set shows why it is not the most suitable feature set for making predictions.

Fig. 26: Swarm Plot Analysis of Features SE Set

Cross Validation Testing with Feature Selection and Parameter Tuning for Certain
Algorithms:

Deriving from the feature importance model and confirming the results with the scatter correlation plot and the swarm plot analysis, a set of 5 prediction variables has been selected for the testing.

Cross validation is a method which involves reserving a particular sample of a data set on which you do not train the model; later, you test the model on this sample before finalizing it. The steps involved in cross validation are:
1. Reserve a sample data set.
2. Train the model using the remaining part of the data set.
3. Use the reserved sample as the test (validation) set. This will help you judge the effectiveness of the model's performance.

5-fold cross validation has been applied to different stock machine learning models, and then, using the GridSearchCV function, which explores a model exhaustively by applying different parameter values, the algorithms have been tuned to be more robust and more efficient with the 5 selected features.
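A hedged sketch of this step; the parameter grid below is illustrative rather than the exact grid shown in the screenshots, and X_train/y_train are assumed to be the 70% training split over the 5 selected features:

from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.svm import SVC

# untuned 5-fold cross validation on the training split
print(cross_val_score(SVC(), X_train, y_train, cv=5))

# exhaustive search over a small, illustrative parameter grid
param_grid = {'C': [0.1, 1, 10, 100],
              'kernel': ['rbf', 'linear'],
              'gamma': ['scale', 0.01, 0.1]}
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)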

The following are the results:


Decision Tree Classifier:

Fig. 27: Decision Tree (Untuned) CV Result with 5 Parameters

Here, the 100% accuracy is not necessarily a good statistic; it shows a case of over-fitting for this model. Also, the cross validation scores are intermediate, hence there exists a need to tune this algorithm.

The parameters for grid search are:

Result:

Fig. 28: Decision Tree Tuned Parameter Result

Support Vector Machine:

Fig. 29: SVM (Untuned) CV Result with 5 Parameters

SVM shows really poor cross validation results and also requires tuning of its parameters. The parameter grid taken is:

Result:

Fig. 30: SVM Tuned Parameter Result

This is one of the best showcases of running a parameter search. The SVM works well with good parameters, compared to its previous implementations, which displayed really unreliable results; the accuracy jumped from almost 70% to 94%.

K Nearest Neighbors:

Fig. 31: KNN (Untuned) CV Result with 5 Parameters

Again, the cross validation scores for this model are not very good. The parameter grid for this
model is taken as:

Result:

Fig. 32: KNN Tuned Parameter Result


The result of parameter tuning for this algorithm does not show significant improvements.

Random Forest Classifier:

Fig. 33: Random Forest (Untuned) CV Result with 5 Parameters

Both the accuracy and the cross validation scores for this model are good enough. Also, due to the random nature of this algorithm, it does not require any parameter tuning, as the results are already good.

Inference: the Random Forest algorithm is the best machine learning model for predicting the malignant or benign stages of breast cancer using our reduced feature set. After parameter tuning, the other models have also achieved quite reliable accuracy scores and can be applied in specific cases.

6. Conclusion, Limitations and Scope for Future Work:


The results obtained from this project clearly showcase that prediction of breast cancer stages is within the scope of machine learning algorithms in data mining. These algorithms build a knowledge prediction model using the training dataset, which is then used for prediction on the test data. This project encompassed an extensive survey of 5 types of classification models. Among them, Artificial Neural Networks is the most predominant algorithm, achieving best accuracy scores of 97% and average scores of 94%. With these accuracy scores, data mining techniques can support doctors in the diagnosis decision-making process. Leveraging the power of modern computers, a diagnosis prediction result can be obtained in a matter of seconds with considerable accuracy.

However, as is generally the case with many medical datasets, it takes significant time and cost to measure the accurate breast cancer cell values to feed into the machine learning models. For this purpose, feature selection has been carried out in this project, in which we have minimized the number of attributes to just 5, which includes correlated attributes. The most robust algorithm for this reduced feature set is Random Forest, which produces accuracy scores as high as 96%. With this, tools can be built for physicians which can be used as an effective and cost-efficient mechanism for the early detection and diagnosis of breast cancer, resulting in an enhancement of the survival rate of patients.

But the fact is that, at this current stage, close supervision by qualified researchers and doctors would be required for using these algorithms, as the accuracy scores are still not 100% and the results of prediction are directly linked to the lives of patients. Any automatic prediction in the medical field must be 100% reliable and should not require any human intervention. Also, currently only a limited amount of open data is available to train the machine learning models. In the future, as more data becomes readily available for use, the accuracy must be improved towards 100%.

7. References:
Journal:
1. Rish, I. (2001, August). An empirical study of the naive Bayes classifier. In IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence (Vol. 3, No. 22, pp. 41-46). IBM.
2. Singh, D. (2014). Analysis of Data Mining Classification with Decision Tree Technique (Doctoral dissertation, MPUAT, Udaipur).
3. Williams, K., Idowu, P. A., Balogun, J. A., & Oluwaranti, A. I. (2015). Breast cancer risk prediction using data mining classification techniques. Transactions on Networks and Communications, 3(2), 01-11.
4. Jadhav, S. D., & Channe, H. P. (2016). Comparative study of K-NN, Naive Bayes and Decision Tree classification techniques. International Journal of Science and Research, 5(1).
5. Chaurasia, V., & Pal, S. (2014). A novel approach for breast cancer detection using data mining techniques. International Journal of Innovative Research in Computer and Communication Engineering, 2(1), 2456-2465.
6. Sivakami, K. (2015). Mining big data: Breast cancer prediction using DT-SVM hybrid model. International Journal of Scientific Engineering and Applied Science (IJSEAS), 1(5), 418-429.
7. Shajahaan, S. S., Shanthi, S., & Chitra, V. M. (2013). Application of data mining techniques to model breast cancer data. International Journal of Emerging Technology and Advanced Engineering, 3(11), 362-369.
8. Thein, H. T. T., & Tun, K. M. M. (2015). An approach for breast cancer diagnosis classification using neural network. Advanced Computing, 6(1), 1-10.
9. Aalaei, S., Shahraki, H., Rowhanimanesh, A., & Eslami, S. (2016). Feature selection using genetic algorithm for breast cancer diagnosis: Experiment on three different datasets. Iranian Journal of Basic Medical Sciences, 19(5), 476-482.
10. Venkatesan, E., & Velmurugan, T. (2015). Performance analysis of decision tree algorithms for breast cancer classification. Indian Journal of Science and Technology, 8(29), 1-8.

Weblinks:

1. https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/
2. https://www.healthcatalyst.com/predictive-analytics
3. https://pythonprogramming.net/machine-learning-tutorial-python-introduction/
4. https://www.kaggle.com/uciml/breast-cancer-wisconsin-data
5. https://www.analyticsvidhya.com/
6. https://www.scipy.org/docs.html
7. https://seaborn.pydata.org/
8. https://stackoverflow.com/
9. http://dataaspirant.com/2017/05/22/random-forest-algorithm-machine-learing/
10. https://www.tutorialspoint.com/python/index.htm

Books:
1. Hearty, J. (2016). Advanced Machine Learning with Python. Packt Publishing Ltd.
2. Coiera, E. (2003). The Guide to Health Informatics (2nd ed.). London, U.K.: Arnold.
3. Han, J., & Kamber, M. (2011). Data Mining: Concepts and Techniques (3rd ed.). Morgan Kaufmann Publishers Inc.
