
INTRODUCTION:

Breast cancer is one of the deadliest forms of cancer and is very common in women, who at present have a 10% chance of developing it in their lifetime (Kharya, 2012). It is a leading cause of death among women, including in well-developed countries, and the death rate can be reduced by detecting the cancer in its initial stages (Kharya, 2012). Overall, 7.9 million women died worldwide in 2007, and some statistics project that the figure will reach 12 million by 2030 (Devi, 2011). Some clustering algorithms from data mining are used to diagnose cancer patients (Joshi, Doshi, & Patel, 2014). Along with early detection, the data mining tools now in use have increased patients' hope (Devi, 2011). Tools such as Treatment Assistant visualize the data to give a clear understanding of the problem (Cakir & Demirel, 2010).

Previous research on predicting breast cancer with data mining tools has used efficient techniques such as Logistic Regression, Support Vector Machines, Decision Trees, Neural Networks, and Naïve Bayes. An analysis of these techniques shows that Support Vector Machines are among the most efficient, predicting and diagnosing breast cancer with 97.13% accuracy and performing well in terms of specificity, precision, and sensitivity (Hiba, Moatassime, & Noel, 2016). There are many studies on predicting breast cancer, but they are not integrated. My aim is therefore to bring those studies together and examine them in one study, giving a clear picture of how each data mining technique performs in predicting and diagnosing breast cancer, so that I can suggest the technique with the highest accuracy. Many women are losing their lives to breast cancer. The time taken to recognize the cancer from its symptoms is a serious problem, because the symptoms are open to differing assumptions and subject to volatile effects. These issues motivated the authors to study this life-threatening disease, and they proposed methods such as genetic search and greedy stepwise search to get a clear overview of the breast cancer dataset using data mining techniques that could help in early diagnosis (Devi, 2011).

The purpose of this article is to bring together and organize the existing separate studies on predicting breast cancer. Decision trees can produce the best outcomes, with 93.62% accuracy on both the benchmark dataset and the SEER dataset (Kharya, 2012). The Bayesian network is another well-known technique that predicts breast cancer with good results and minimal errors (Kharya, 2012). The research question is: to what extent can data mining tools and methods make the diagnosis and prediction of breast cancer effective?

Breast Cancer and its symptoms:

Breast cancer develops from breast tissue when cells grow out of control. These cells form a malignant tumor that grows from the cells of the breast (Guptha, Kumar, & Sharma, 2011). Breast cancer most often arises in the ducts that carry milk to the nipple.

The symptoms of breast cancer include discharge from the nipple, yellow skin, a swollen or warm breast, dimpling of the skin, shortness of breath, a change in breast shape, and many more. Many women are losing their lives to this disease: 7.9 million women died worldwide in 2007, and some statistics project that the figure will reach 12 million by 2030 (Devi, 2011).

Most researchers state that it is very difficult to detect the symptoms of cancer at an early stage, which makes it difficult to draw conclusions in time for initial treatment. However, researchers have identified some of the risk factors that increase the likelihood of breast cancer developing in women (Guptha, Kumar, & Sharma, 2011; Kharya, 2012). Some of these risk factors are avoidable and some are not. Because the cancer is so hard to detect at a very early stage, effective data mining techniques are needed to predict it.

Based on the observed symptoms, several efficient data mining techniques can support the analysis needed to predict and diagnose breast cancer to a satisfactory level.

DATA MINING AND KNOWLEDGE DISCOVERY PROCESS:

Data mining is the process of discovering and extracting patterns from huge data sets (Joshi, Doshi, & Patel, 2014). Shelly et al. (2011) define data mining as the extraction of data patterns, which is equivalent to knowledge discovery. Knowledge discovery in databases (KDD) involves several stages for extracting the required data from a huge amount of data: data cleaning (removing noise and invalid data), data integration (combining data from multiple sources into a single repository), data selection (collecting the data relevant to the analysis), data transformation (converting the data into an understandable format), data mining (extracting the patterns), pattern evaluation (evaluating the data patterns), and knowledge representation (presenting the data visually to the user).
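The KDD stages listed above can be sketched as a chain of small functions. This is only a minimal illustration in plain Python: the records, field names, and the threshold rule standing in for the "data mining" step are all invented for the example and do not come from any of the cited studies.

```python
# Toy records; every field name and value here is invented for illustration.
hospital_a = [
    {"id": 1, "clump_thickness": 5, "cell_size": 1, "diagnosis": "benign"},
    {"id": 2, "clump_thickness": 8, "cell_size": None, "diagnosis": "malignant"},
]
hospital_b = [
    {"id": 3, "clump_thickness": 9, "cell_size": 7, "diagnosis": "malignant"},
    {"id": 4, "clump_thickness": 2, "cell_size": 1, "diagnosis": "benign"},
]

def clean(records):
    """Data cleaning: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: combine several sources into one repository."""
    return [r for source in sources for r in source]

def select(records, fields):
    """Data selection: keep only the fields relevant to the analysis."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Data transformation: encode the diagnosis as 1 (malignant) or 0 (benign)."""
    return [{**r, "diagnosis": 1 if r["diagnosis"] == "malignant" else 0}
            for r in records]

def mine(records, threshold):
    """Data mining: a trivial stand-in pattern, flagging thick clumps as malignant."""
    return [1 if r["clump_thickness"] > threshold else 0 for r in records]

def evaluate(records, predictions):
    """Pattern evaluation: share of records where the pattern matches the label."""
    correct = sum(p == r["diagnosis"] for p, r in zip(predictions, records))
    return correct / len(records)

data = transform(select(integrate(clean(hospital_a), clean(hospital_b)),
                        ["clump_thickness", "diagnosis"]))
predictions = mine(data, threshold=6)
accuracy = evaluate(data, predictions)

# Knowledge representation: present the result to the user.
print(f"{len(data)} records, rule accuracy {accuracy:.2f}")
```

Each function corresponds to one stage, so a real pipeline would swap in a proper cleaning policy and a real classifier without changing the overall shape.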

Figure 1: Knowledge Discovery in Databases (KDD) model (Shrivastava, Sant, & Aharwal, 2013).

The data may be collected from various sources or by whatever methods the researchers follow. Cakir and Demirel (2010) explained how they collected data from numerous sources: in the first stage, they load the appropriate data into a database. Even when web log data is downloaded from various web servers, data still needs to be gathered from other sources. Data mining applications can visualize and analyze typical data, which can then be processed by data mining techniques (Guptha, Kumar, & Sharma, 2011). Deeper analysis is needed to understand the format of the data so that quality issues are avoided, and visualization is needed to understand the data more clearly.

The data mining process is defined in five stages. The first stage is problem definition: to choose the correct tools, we need to be clear about the goals, because the tools for the model are chosen based on those goals. The second stage is exploration, which covers collecting the data and deciding how to store it (Nakte & Himmatramaka, 2016). The data must be stored securely so that the analysis can work on it continuously.

Figure 2: Data Mining Process Representation (Nakte & Himmatramaka, 2016).


The third stage is data preparation, in which the data is cleaned and transformed so that missing data is removed before further analysis (Nakte & Himmatramaka, 2016). The fourth stage is modeling, in which data mining algorithms such as decision trees and neural networks are applied to analyze the data (Nakte & Himmatramaka, 2016). The final stage is evaluation and deployment, which takes the output of the modeling stage as input and analyzes it to produce the key conclusions (Nakte & Himmatramaka, 2016).

CLASSIFICATION METHODS:

This paper explores some of the important data mining classification methods used to diagnose and predict breast cancer. The five classification methods are Decision Tree, Neural Network, Naïve Bayes, Support Vector Machine, and Logistic Regression, which are among the most efficient data mining techniques.

Neural Network: This is one of the more interesting and efficient data mining classification methods for predicting and diagnosing breast cancer. Neural networks are excellent at dealing with complex non-linear functions (Kumar, Ramachandra, & Nagamani, 2013). For example, the relation between two points may be polynomial rather than linear. Several researchers explain how a neural network works: each artificial neuron has inputs and an output and computes a function, typically the sum of all its inputs, and it produces an output only if that sum exceeds a threshold value (Kumar, Ramachandra, & Nagamani, 2013; Guptha, Kumar, & Sharma, 2011; Kharya, 2012).

Furthermore, the researchers state that the output of one neuron becomes the input to another neuron within the network (Kharya, 2012; Kumar, Ramachandra, & Nagamani, 2013), so different results are produced depending on which neurons feed which inputs. Kumar et al. (2013) describe a two-layer neural network, its output, and its accuracy, whereas Senturk and Kara (2014) explain that a multilayer perceptron is also used and can be built either by hand or by an algorithm. The network can be monitored, and changes can be made while it is running.
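The threshold behaviour of a neuron, and the way one neuron's output feeds the next, can be illustrated with a very small hand-built network. This is a sketch under stated assumptions: the weights and thresholds below are chosen by hand for the example (the cited papers do not give them), and no learning takes place.

```python
def neuron(inputs, weights, threshold):
    """Return 1 only if the weighted sum of the inputs exceeds the threshold."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total > threshold else 0

def tiny_network(x):
    # Hidden layer: two neurons read the same raw inputs.
    both_on = neuron(x, [0.6, 0.6], threshold=1.0)    # fires only when both inputs are 1
    either_on = neuron(x, [1.0, 1.0], threshold=0.5)  # fires when at least one input is 1
    # Output layer: the hidden outputs become the inputs of the next neuron.
    return neuron([both_on, either_on], [-1.0, 1.0], threshold=0.5)

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, "->", tiny_network(x))
```

The combined network outputs 1 when exactly one input is on, a non-linear relation that a single threshold neuron cannot represent, which is why stacking neurons in layers matters.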

Using a feed-forward neural network model with the back-propagation learning algorithm, the neural network can be trained on breast cancer data from the database. On a dataset of 699 instances, the average success ratio is 81.24% and the accuracy of the results is 86.5% (Kharya, 2012).

Logistic Regression: Predicting breast cancer is a challenging task. Logistic regression is a statistical approach to modeling binary or multi-class dependent variables (Kumar, Ramachandra, & Nagamani, 2013). It has been used to study the data in the SEER database, a public breast cancer database consisting of 433,272 records and 72 variables (Kharya, 2012). The main idea is not only to reduce errors but also to be cost efficient. Instead of predicting the value of the outcome directly, the model predicts the odds of its occurrence (Kumar, Ramachandra, & Nagamani, 2013). Binary data uses “0” and “1” to represent false and true. This model achieved a classification accuracy of 89.02%, with a specificity of 87.8% and a sensitivity of 90.01%.
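The link between the predicted probability and the odds can be made concrete with a short sketch. The weights, bias, and feature values below are hypothetical and chosen only for illustration; a fitted model would estimate them from data.

```python
import math

def linear_score(features, weights, bias):
    """Weighted sum of the features; in logistic regression this is the log-odds."""
    return bias + sum(w * x for w, x in zip(weights, features))

def probability(features, weights, bias):
    """Logistic (sigmoid) function applied to the linear score."""
    return 1.0 / (1.0 + math.exp(-linear_score(features, weights, bias)))

def odds(p):
    """Odds of the event: probability it occurs divided by probability it does not."""
    return p / (1.0 - p)

# Hypothetical model with two features.
weights, bias = [0.8, 0.4], -3.0
patient = [4.0, 2.0]

p = probability(patient, weights, bias)
label = 1 if p >= 0.5 else 0  # binary outcome coded as 1 or 0
print(f"probability={p:.3f}, odds={odds(p):.3f}, label={label}")
```

Because the odds equal the exponential of the linear score, each weight can be read as the change in log-odds for a one-unit change in its feature.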

Decision tree: This is a classification technique used in the prediction of breast cancer by mapping out the treatment options for each patient. A decision tree consists of nodes and leaves and is used in data mining because following different paths produces different outputs. Its non-terminal nodes each represent a test on the selected data. Classification starts at the root node and passes all the way down to a leaf (terminal) node (Guptha, Kumar, & Sharma, 2011). In this way we can separate the patients who are suffering from breast cancer from those who are not.

Kumar, Ramachandra, and Nagamani (2013) explain decision trees in a similar way, stating that they are an important technique for classifying data and obtaining accurate results. The branch chosen at each node depends on the result of its test, and a decision is made on reaching the end node (Guptha, Kumar, & Sharma, 2011). With the help of a decision tree, data collected from patients can be used to identify groups of people who are likely to face serious problems and suffer from breast cancer (Kharya, 2012).
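The walk from the root node to a leaf can be sketched as follows. The tree here is written by hand, and its feature names and thresholds are hypothetical; a real tree would be learned from patient data by an algorithm such as C4.5.

```python
# Each non-terminal node tests one feature; each leaf carries a class label.
tree = {
    "feature": "cell_size", "threshold": 3,
    "left": {"label": "benign"},
    "right": {
        "feature": "clump_thickness", "threshold": 6,
        "left": {"label": "benign"},
        "right": {"label": "malignant"},
    },
}

def classify(node, sample):
    """Start at the root and follow the branch chosen by each test until a leaf."""
    while "label" not in node:
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]

print(classify(tree, {"cell_size": 5, "clump_thickness": 8}))
```

Different samples follow different paths through the same tree, which is how one structure yields different outputs.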

In a detailed analysis of patient data with a decision tree, the predictive model took 0.06 seconds to build and classified 665 instances correctly and 34 incorrectly, an accuracy of 95.13% that could support proper prediction and diagnosis of breast cancer (Hiba, Moatassime, & Noel, 2016).

Naïve Bayes: Naïve Bayes is another data mining classification technique used to predict and diagnose cancer. It is based on Bayes' theorem, which is used to build a statistical predictive model (Kumar, Ramachandra, & Nagamani, 2013).

This method uses the relationship between attributes and classes to derive conditional probabilities (Kumar, Ramachandra, & Nagamani, 2013). A conditional probability is the probability that one event occurs given that another event has occurred. The method has been used to analyze and predict the survivability of breast cancer patients with the help of the WEKA tool (Kharya, 2012). WEKA is a data mining tool used to analyze the data of breast cancer patients, visualize it for a clearer understanding, and make the best possible predictions.
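How conditional probabilities between attributes and classes turn into a prediction can be shown with a tiny categorical Naïve Bayes sketch. The training rows are invented for illustration, and the add-one (Laplace) smoothing is an assumption of this example rather than something stated in the cited papers. It is written in plain Python and is not WEKA's implementation.

```python
from collections import Counter, defaultdict

# Invented toy data: (lump_present, skin_dimpling) -> class.
training = [
    (("yes", "yes"), "malignant"),
    (("yes", "no"), "malignant"),
    (("no", "no"), "benign"),
    (("yes", "no"), "benign"),
    (("no", "no"), "benign"),
]

def train(samples):
    """Count how often each class and each attribute value per class occurs."""
    class_counts = Counter(label for _, label in samples)
    value_counts = defaultdict(Counter)  # (label, attribute index) -> value counts
    for features, label in samples:
        for i, value in enumerate(features):
            value_counts[(label, i)][value] += 1
    return class_counts, value_counts

def posterior(features, class_counts, value_counts):
    """Bayes' theorem with the naive assumption that attributes are independent."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for i, value in enumerate(features):
            # P(value | class), add-one smoothing, two possible values per attribute
            score *= (value_counts[(label, i)][value] + 1) / (count + 2)
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}

class_counts, value_counts = train(training)
result = posterior(("yes", "yes"), class_counts, value_counts)
print(result)
```

The class with the largest posterior probability is taken as the prediction.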

Using the data of patients who are suffering from breast cancer, we need to build an efficient model to understand the performance of the different classifiers. The Naïve Bayes model takes 0.05 seconds to build and classifies 671 instances correctly and 28 incorrectly, an accuracy of 95.99% (Hiba, Moatassime, & Noel, 2016).

Support Vector Machine: This is one of the most efficient data mining classification techniques used to predict and diagnose cancer. It is a supervised learning model that analyzes data and recognizes patterns for use in classification. It finds a linear hyperplane that separates the points of different classes in a multidimensional space (Kumar, Ramachandra, & Nagamani, 2013; Guptha, Kumar, & Sharma, 2011).
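Once a separating hyperplane is known, classifying a point only requires checking which side of the hyperplane it falls on. The sketch below assumes a hyperplane of the form w·x + b = 0 with hand-picked, hypothetical values for w and b. It does not perform the training step that an SVM uses to find them.

```python
import math

def decision_value(w, b, x):
    """Signed score w·x + b; its sign tells which side of the hyperplane x lies on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    """Assign class +1 or -1 according to the side of the hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1

def margin_width(w):
    """Distance between the two margin boundaries w·x + b = +1 and w·x + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical hyperplane in a two-dimensional feature space.
w, b = [1.0, 1.0], -3.0

print(classify(w, b, [1.0, 1.0]), classify(w, b, [3.0, 2.0]), margin_width(w))
```

Training an SVM amounts to choosing w and b so that this margin is as wide as possible while keeping classification errors small.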

The main purpose of SVM in predicting and diagnosing cancer is to minimize classification errors when finding the hyperplane between the two classes (Senturk & Kara, 2014). The objective of the SVM model is to identify the breast cancer patients for whom chemotherapy extends survival time (Kharya, 2012). Patients are classified into three prognostic groups: Good, Intermediate, and Poor. Patients in the Good group should not receive chemotherapy, those in the Intermediate group should receive chemotherapy based on survival curve analysis, and those in the Poor group must receive chemotherapy (Kharya, 2012).

To understand more about the performance of the classifiers, we need to build a model. The SVM model takes 0.07 seconds to build and classifies 678 instances correctly and 21 incorrectly, with a reported accuracy of 97.13% (Hiba, Moatassime, & Noel, 2016).
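Accuracy is the share of correctly classified instances, so the figures quoted for the decision tree, Naïve Bayes, and SVM models can be recomputed from their instance counts. The short calculation below uses only the counts given in the text; the function and variable names are illustrative.

```python
# Correct and incorrect instance counts as quoted in the text
# (Hiba, Moatassime, & Noel, 2016).
results = {
    "Decision tree": (665, 34),
    "Naive Bayes": (671, 28),
    "SVM": (678, 21),
}

def accuracy(correct, incorrect):
    """Share of correctly classified instances, as a percentage."""
    return 100 * correct / (correct + incorrect)

for name, (correct, incorrect) in results.items():
    print(f"{name}: {accuracy(correct, incorrect):.2f}%")
```

Computed this way, the decision tree and Naïve Bayes counts agree with the reported 95.13% and 95.99% to within rounding. The SVM counts, 678 of 699, come to roughly 97.00%, so the reported 97.13% may have been derived from slightly different counts in the source.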

From all the techniques above, it is concluded that the Support Vector Machine (SVM) ranks as one of the most efficient data mining techniques, predicting and diagnosing breast cancer with 97.13% accuracy given suitable analysis and classification, whereas Neural Networks rank very low, with 86.5% accuracy.


DATABASE: A database is a collection of data, schemas, tables, and other elements. In these studies, SEER is the database used to store breast cancer patients' data, which is collected from various hospitals.

SEER: SEER is a breast cancer database consisting of 433,272 records and 72 variables (Kharya, 2012). It collects data on breast cancer patients, including the death ratio, lifetime risk, and prevalence of cancer, and provides estimates of the deaths that occur every year. The new version of SEER adds three extra fields: VSR (vital status record), COD (cause of death), and STR (survival time record) (Kharya, 2012). Including these records in the database makes it easier to store and retrieve the data when needed (Guptha, Kumar, & Sharma, 2011).

DATA MINING TOOLS:

There are two important tools, Treatment Assistant and WEKA, with two different purposes. Treatment Assistant analyzes the data mining algorithms and suggests the most efficient technique, whereas WEKA is used not only to analyze the data but also to present it visually for a clearer understanding, which could help in predicting and diagnosing breast cancer.

Treatment Assistant: This is a software program built with the Java NetBeans interface. It uses data mining algorithms such as decision tree, multilayer perceptron, and many more (Cakir & Demirel, 2010). It analyzes the algorithms and suggests the most suitable methods for better results for every attribute. The tool also suggests treatments for patients according to the results (Cakir & Demirel, 2010).

WEKA: WEKA (Waikato Environment for Knowledge Analysis) was developed at the University of Waikato in New Zealand (Shrivastava, Sant, & Aharwal, 2013). It is an open-source data mining tool written in Java and is used for visualization, analysis, and prediction. With this tool, the cancer dataset is passed through specific clustering stages: a clustering algorithm, hierarchical clustering, and finally k-means clustering (Joshi, Doshi, & Patel, 2014). The WEKA toolkit can calculate different metrics on completion of k-fold cross validation (Bellaachia & Guren, 2006). WEKA allows every user to compare different machine learning algorithms and methods on new datasets (Devi, 2011). It reduces the difficulty of working with huge amounts of raw data and can run on any platform.
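The idea behind k-fold cross validation can be sketched in a few lines: the dataset is split into k parts, and each part is held out once for testing while the rest is used for training. The code below is plain Python, not WEKA's implementation. The choice of k = 10 and the dataset size of 699 are assumptions made for the example.

```python
def k_fold_indices(n_samples, k):
    """Split the indices 0..n_samples-1 into k (train, test) pairs of near-equal size."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, test))
        start += size
    return folds

folds = k_fold_indices(n_samples=699, k=10)
for number, (train, test) in enumerate(folds, start=1):
    print(f"fold {number}: {len(train)} training instances, {len(test)} test instances")
```

A classifier would be trained and scored once per fold, and the per-fold metrics averaged to give the final estimate. In practice the instances are usually shuffled before splitting.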

DISCUSSION AND IMPLEMENTATION:

In this paper, I observed that decision trees, support vector machines, and neural networks are among the efficient data mining techniques suggested by different researchers in numerous studies and used to predict breast cancer. WEKA is a valuable tool for the analysis and visualization phases, the most important part of the development cycle, and could be very helpful in predicting breast cancer effectively. Another tool, Treatment Assistant, also produced satisfactory results in the analysis and visualization of the important data. This article is clearer and easier to follow than previous ones because it reports the performance of each individual data mining technique. However, a few of the techniques, such as neural networks, have limitations. Future research should address these limitations and analyze neural networks further so as to raise their accuracy above 95%, the level needed for them to be an efficient data mining technique.

References

Bellaachia, A., & Guren, E. (2006). Predicting Breast Cancer Using Data Mining Techniques. The Age
(Melbourne, vic), 1-4.

Cakir, A., & Demirel, B. (2010). A Software Tool for Determination of Breast Cancer Treatment Methods
Using Data Mining Approach. Springer Science + Business Media, 1503-1511.

Devi.S, G. (2011). Breast Cancer Prediction System Using Feature Selection and Data Mining Methods.
International Journal of Advanced Research in Computer Science, 2(1), 81-87.

Guptha, S., Kumar, D., & Sharma, A. (2011). Data Mining Classification Techniques Applied for Breast
Cancer Diagnosis and Prognosis. Indian Journal of Computer Science and Engineering, 2(2), 188-
195.

Hiba Asri, Moatassime, H. A., & Noel, T. (2016). Using Machine Learning Algorithm for Breast Cancer Risk
Prediction and Diagnosis. The 6th International Symposium on Frontiers in Ambient and Mobile
Systems, 83(2016), 1064-1069.

Joshi, J., Doshi, R., & Patel, J. (2014). Diagnosis of Breast Cancer Using Clustering Data Mining Approach.
International Journal of Computer Applications, 101(10), 13-17.

Kharya, S. (2012). Using Data Mining Techniques for Diagnosis and Prognosis of Breast Cancer Disease.
International Journal of Computer Science, Engineering and Information Technology, 2(2), 55-66.

Kumar, G. R., Ramachandra, D., & Nagamani, K. (2013). An Effective Prediction of Breast Cancer Data
Mining Techniques. International Journal of Innovation in Engineering and Technology, 2(4), 139-
144.

Nakte, J., & Himmatramaka, V. (2016). Breast Cancer Prediction Using Data Mining Techniques.
International Journal on Recent and Innovation Trends in Computing and Communication, 4(11),
55-60.

Senturk, Z. K., & Kara, R. (2014). Breast Cancer Diagnosis via Data Mining: Performance Analysis of Seven
Different Algorithms. Computer Science & Engineering: An International Journal, 4(1), 35-46.

Shrivastava, S. S., Sant, A., & Aharwal, R. P. (2013). An Overview on Data Mining Approach on Breast
Cancer Data. International Journal of Advanced Computer Research, 3(1), 256-262.
