Sample Draft MCP-PJ

INTRODUCTION:
Breast cancer is one of the deadliest forms of cancer and is very common in women. At present, a woman
has roughly a 10% chance of developing breast cancer in her lifetime (Kharya, 2012). It is a leading cause
of death among women, including in well-developed countries, and the death rate can be reduced by
detecting the cancer in its initial stages (Kharya, 2012). Overall, 7.9 million women died worldwide in
2007, and some statistics project that the figure will reach 12 million in 2030 (Devi, 2011). Data mining
clustering algorithms are used in the diagnosis of cancer patients (Joshi, Doshi, & Patel, 2014). By
supporting detection in the initial stages, current data mining tools have increased patients' hope
(Devi, 2011). Tools such as Treatment Assistant visualize the data to give a clear understanding of the
problem (Cakir & Demirel, 2010).
Previous research on predicting breast cancer with data mining tools has used several efficient
techniques, including Logistic Regression, Support Vector Machines, Decision Trees, Neural Networks, and
Naïve Bayes. A comparison of these techniques found that Support Vector Machines are among the most
efficient, with 97.13% accuracy in predicting and diagnosing breast cancer, assessed in terms of
specificity, precision, accuracy, and sensitivity (Hiba, Moatassime, & Noel, 2016). There are many studies
on predicting breast cancer, but they are not integrated. My aim is therefore to bring those studies
together and examine them in one study. Doing so gives a clear picture of how each data mining technique
performs in predicting and diagnosing breast cancer, and allows me to suggest the technique with the
highest accuracy. Many women are losing their lives to breast cancer. Recognizing cancer from its
symptoms takes time, which is a serious problem because the process is subject to differing assumptions
and to volatile effects. These issues motivated the authors to study a life-threatening disease like
breast cancer, and they proposed methods such as genetic search and greedy stepwise to obtain a clear
overview of the breast cancer dataset with data mining techniques that could help in the early diagnosis
of breast cancer (Devi, 2011).
The purpose of this article is to bring together and organize the existing separate studies on predicting
breast cancer. Decision trees can produce the best outcomes, with 93.62% accuracy on both the benchmark
dataset and the SEER dataset (Kharya, 2012). Apart from this, the Bayesian network is also a well-known
technique for predicting breast cancer, with good outcomes and minimal error (Kharya, 2012). The research
question is: to what extent can data mining tools and methods make the diagnosis and prediction of breast
cancer effective?
Breast Cancer and its symptoms:
Breast cancer develops in breast tissue when cells grow out of control. These cells form a tumor, known
as a malignant tumor, that grows from the cells present in the breast (Guptha, Kumar, & Sharma, 2011).
Breast cancer mainly arises in the ducts that carry milk to the nipples.
The symptoms of breast cancer include fluid coming from the nipples, yellow skin, a swollen or warm
breast, dimpling of the skin, shortness of breath, a change in breast shape, and many more. Many women
are losing their lives to breast cancer: overall, 7.9 million women died worldwide in 2007, and some
statistics project that the figure will reach 12 million in 2030 (Devi, 2011).
Most researchers have stated that it is very difficult to detect the symptoms of cancer at an early
stage, which in turn makes it difficult to draw conclusions for initial treatment. However, researchers
have identified some of the risk factors that increase the likelihood of breast cancer developing in
women (Guptha, Kumar, & Sharma, 2011; Kharya, 2012). Because breast cancer is difficult to detect at a
very early stage, proper effort and effective data mining techniques are needed to predict it. Some of
the risk factors are avoidable, but others are not.
Based on the symptoms observed, several efficient data mining techniques are used. These techniques
support proper analysis, which is useful for predicting and diagnosing breast cancer to a satisfactory
level.
DATA MINING AND KNOWLEDGE DISCOVERY PROCESS.
Data mining is the process of discovering and extracting patterns from huge data sets (Joshi, Doshi, &
Patel, 2014). Shelly et al. (2011) define data mining as the extraction of data patterns, which is
equivalent to knowledge discovery. Knowledge discovery in databases (KDD) involves several stages for
extracting the required data from a huge amount of data. The stages are data cleaning (removes noise and
invalid data), data integration (combines the data from multiple sources into a single repository), data
selection (collects the relevant data for analysis), data transformation (transforms the data into an
understandable format), data mining (extracts the patterns), pattern evaluation (evaluates the data
patterns), and knowledge representation (presents the data visually to the user).
Figure 1: Knowledge Discovery and Database model (Shrivastava, Sant, & Aharwal, 2013).
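The KDD stages listed above can be sketched as a chain of small functions. This is only an illustrative
sketch in plain Python: the records, the field names (radius, label), and the "pattern" that is mined
(mean radius per class) are assumptions made for the example and do not come from any of the cited
studies.

```python
from statistics import mean

# Hypothetical records from two sources; None marks a missing (invalid) value.
source_a = [{"id": 1, "radius": 12.0, "label": "benign"},
            {"id": 2, "radius": None, "label": "malignant"}]
source_b = [{"id": 3, "radius": 20.5, "label": "malignant"},
            {"id": 4, "radius": 11.0, "label": "benign"}]

def clean(rows):
    """Data cleaning: drop rows that contain missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: combine several sources into one repository."""
    return [row for source in sources for row in source]

def select(rows, fields):
    """Data selection: keep only the fields relevant to the analysis."""
    return [{f: r[f] for f in fields} for r in rows]

def transform(rows):
    """Data transformation: encode the class label as 0 (benign) or 1 (malignant)."""
    return [{"radius": r["radius"], "malignant": int(r["label"] == "malignant")}
            for r in rows]

def mine(rows):
    """Data mining: extract a simple pattern, the mean radius per class."""
    groups = {}
    for r in rows:
        groups.setdefault(r["malignant"], []).append(r["radius"])
    return {cls: mean(values) for cls, values in groups.items()}

def evaluate(pattern):
    """Pattern evaluation: check whether the classes differ on the mined value."""
    return pattern[1] > pattern[0]

rows = integrate(clean(source_a), clean(source_b))
rows = transform(select(rows, ["radius", "label"]))
pattern = mine(rows)

# Knowledge representation: present the result to the user.
print("mean radius per class:", pattern, "| classes differ:", evaluate(pattern))
```

Each function corresponds to one stage, so a stage can be replaced (for example, imputing missing values
instead of dropping rows) without touching the others.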
The data may be collected from various sources or by whatever methods the researchers follow. Cakir and
Demirel (2010) explained how they collected data from numerous sources. They stated that, in the first
stage, they load the appropriate data into a database. Even when web log data is downloaded from various
web servers, data still needs to be gathered from other sources. Data mining applications can visualize
and analyze typical data, which can then be processed by data mining techniques (Guptha, Kumar, & Sharma,
2011). Deeper analysis is needed to understand the format of the data so that quality issues are avoided,
and visualization is needed to understand the data more clearly.
The data mining process is defined in five stages. The first stage is problem definition: to choose the
correct tools, we need to be clear about the goals, and the tools for the model are chosen based on those
goals. The second stage is exploration, which covers data collection and the methods of storing the data
(Nakte & Himmatramaka, 2016). The data must be stored securely so that analysis can continue on it.
Figure 2: Data Mining Process Representation (Nakte & Himmatramaka, 2016).

The third stage is data preparation, in which the data is cleaned and transformed so that missing data is
handled before further analysis (Nakte & Himmatramaka, 2016). The fourth stage is modeling, in which data
mining algorithms such as decision trees and neural networks are applied to analyze the data (Nakte &
Himmatramaka, 2016). The final stage is evaluation and deployment, which takes the output of the modeling
stage as input and analyzes it to produce key conclusions (Nakte & Himmatramaka, 2016).
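As a small illustration of the data preparation stage, the sketch below fills in missing values and
rescales a single attribute. Mean imputation and min-max scaling are assumptions chosen for the example;
the cited study does not specify which cleaning or transformation methods it used, and the values are
made up.

```python
from statistics import mean

def impute_mean(values):
    """Replace missing entries (None) with the mean of the known entries."""
    known = [v for v in values if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in values]

def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

raw = [2.0, None, 4.0, 6.0]      # hypothetical attribute with one missing value
filled = impute_mean(raw)        # the missing value becomes the mean of 2, 4, 6
scaled = min_max_scale(filled)   # smallest value maps to 0, largest to 1
print(filled, scaled)
```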
CLASSIFICATION METHODS:
This paper explores some of the important data mining classification methods used to diagnose and predict
breast cancer. The five classification methods are Decision Tree, Neural Network, Naïve Bayes, Support
Vector Machine, and Logistic Regression, which are among the most efficient data mining techniques.
Neural Network: This is one of the interesting and efficient data mining classification methods used to
predict and diagnose breast cancer. Neural networks are excellent at dealing with complex non-linear
functions (Kumar, Ramachandra, & Nagamani, 2013). For example, the relation between two variables may be
polynomial rather than linear. Several researchers explain the working of a neural network as follows: an
artificial neuron has inputs and an output, computes a function such as the sum of all its inputs, and
produces an output only if that sum exceeds a threshold value (Kumar, Ramachandra, & Nagamani, 2013;
Guptha, Kumar, & Sharma, 2011; Kharya, 2012).
Furthermore, the researchers stated that the output of one neuron becomes the input to another neuron
within the network (Kharya, 2012; Kumar, Ramachandra, & Nagamani, 2013). Different results are therefore
obtained depending on which neurons supply the inputs. Kumar et al. (2013) discussed the output and
accuracy of a two-layer neural network; in contrast, Senturk and Kara (2014) explained that a multilayer
perceptron is also used and that it can be built both by hand and by algorithms. It can be monitored, and
changes can also be made during processing.
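The threshold behaviour of a neuron and the chaining of neurons described above can be sketched in a few
lines. The weights and thresholds below are hand-picked assumptions for illustration, not values learned
from breast cancer data; they are chosen so that the network computes XOR, a simple non-linear function
that a single threshold neuron cannot represent.

```python
def neuron(inputs, weights, threshold):
    """Return 1 only if the weighted sum of the inputs exceeds the threshold."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total > threshold else 0

def tiny_network(features):
    """Two hidden neurons feed one output neuron (hand-picked weights)."""
    h1 = neuron(features, [0.6, 0.6], threshold=0.5)   # fires if at least one input is on
    h2 = neuron(features, [0.6, 0.6], threshold=1.0)   # fires only if both inputs are on
    # The outputs of the hidden neurons become the inputs of the output neuron.
    return neuron([h1, h2], [1.0, -1.0], threshold=0.5)

for features in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(features, "->", tiny_network(features))
```

In practice the weights are not set by hand but learned from data, for example with back-propagation.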
Using a feed-forward neural network model with the back-propagation learning algorithm, the network takes
the breast cancer data from the database as input. The average success ratio is 81.24%; with 699
instances, the accuracy of the results is 86.5% (Kharya, 2012).
Logistic Regression: Predicting breast cancer is a very challenging task. Logistic regression is a
statistical approach to modeling binary or multi-class dependent variables (Kumar, Ramachandra, &
Nagamani, 2013). It has been used to study the data in the SEER database, a public breast cancer database
consisting of 433,272 records and 72 variables (Kharya, 2012). The main idea is not only to reduce errors
but also to be cost-efficient. Instead of predicting an estimated value of the outcome, the model
predicts the odds of its occurrence (Kumar, Ramachandra, & Nagamani, 2013). The binary outcome is coded
as “0” or “1” to represent false or true. This model achieved a classification accuracy of 89.02%, with a
specificity of 87.8% and a sensitivity of 90.01%.
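The idea of predicting odds rather than a raw value can be made concrete with a short sketch. The
coefficients, the two input features, and the 0.5 decision cut-off below are assumptions for
illustration; they are not fitted to SEER or any other dataset.

```python
import math

def log_odds(features, weights, bias):
    """Linear score; in logistic regression this is the log of the odds."""
    return bias + sum(w * x for w, x in zip(weights, features))

def probability(features, weights, bias):
    """Logistic (sigmoid) function maps the log-odds to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-log_odds(features, weights, bias)))

def odds(p):
    """Odds of the event: probability of occurring over probability of not occurring."""
    return p / (1.0 - p)

def predict(features, weights, bias, cutoff=0.5):
    """Binary outcome: 1 (true) if the probability reaches the cut-off, else 0 (false)."""
    return 1 if probability(features, weights, bias) >= cutoff else 0

# Hypothetical coefficients and patient features.
weights, bias = [0.9, -0.3], -1.0
patient = [2.0, 1.0]
p = probability(patient, weights, bias)
print(f"probability={p:.3f} odds={odds(p):.3f} class={predict(patient, weights, bias)}")
```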
Decision tree: This is one of the classification techniques used in the prediction of breast cancer, by
mapping out the treatment options for each patient. A decision tree consists of nodes and leaves and is
used in data mining because following different paths produces different outputs. Each non-terminal
(internal) node represents a test on the selected data. Classification starts at the root node and
proceeds all the way to a leaf (terminal) node (Guptha, Kumar, & Sharma, 2011). In this way we can
distinguish the patients who are suffering from breast cancer from those who are not.
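The root-to-leaf traversal described above can be sketched as follows. The tree is written by hand, and
its feature names and thresholds (clump_thickness, cell_size) are hypothetical; a real tree would be
learned from data by an algorithm rather than specified manually.

```python
# A hand-written tree: internal nodes hold a test, leaves hold a class label.
tree = {
    "feature": "clump_thickness", "threshold": 5,
    "left": {"label": "benign"},
    "right": {
        "feature": "cell_size", "threshold": 3,
        "left": {"label": "benign"},
        "right": {"label": "malignant"},
    },
}

def classify(node, sample):
    """Start at the root and follow one branch per test until a leaf is reached."""
    while "label" not in node:
        go_left = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]

print(classify(tree, {"clump_thickness": 8, "cell_size": 7}))
```

Each sample follows exactly one path, so the sequence of tests along that path also serves as an
explanation of the prediction.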
Kumar, Ramachandra, and Nagamani (2013) explained the concept of the decision tree in a similar
way, stating that decision trees are an important method for classifying data to obtain accurate results.
A decision is made on reaching the end node, and the choice of a specific branch depends on the result of
each test (Guptha, Kumar, & Sharma, 2011). With the help of a decision tree, data collected from patients
can be used to find groups of people who are likely to face serious problems and to suffer from breast
cancer (Kharya, 2012).
In a detailed analysis of patient data with a decision tree, the time to build the predictive
model was 0.06 seconds; the model correctly classified 665 instances and misclassified 34, giving 95.13%
accuracy, which could help in the proper prediction and diagnosis of breast cancer (Hiba, Moatassime, &
Noel, 2016).
Naïve Bayes: Naïve Bayes is another data mining classification technique used to predict and diagnose
cancer. It is based on Bayes' theorem, which is used to build a statistical predictive model (Kumar,
Ramachandra, & Nagamani, 2013).
This method uses the relationship between attributes and classes to derive conditional probabilities
(Kumar, Ramachandra, & Nagamani, 2013). A conditional probability is the probability that one event takes
place given that another event has occurred. The method is used to analyze and predict the survivability
rate of breast cancer patients with the help of the WEKA tool (Kharya, 2012). WEKA is one of the data
mining tools used for analyzing the data of breast cancer patients, visualizing that data to get a clear
understanding of it, and making the best possible predictions.
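How conditional probabilities lead to a class prediction can be sketched with a very small Naïve Bayes
classifier. The training records, the two yes/no attributes, and the use of Laplace (add-one) smoothing
are assumptions made for this example; it does not reproduce the WEKA implementation.

```python
from collections import Counter, defaultdict

# Hypothetical training data: (lump_present, family_history) -> class.
train = [
    (("yes", "yes"), "malignant"),
    (("yes", "no"), "malignant"),
    (("no", "yes"), "benign"),
    (("no", "no"), "benign"),
    (("yes", "no"), "benign"),
]

def fit(data):
    """Count how often each class and each attribute value within a class occurs."""
    class_counts = Counter(label for _, label in data)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    for features, label in data:
        for i, value in enumerate(features):
            value_counts[(i, label)][value] += 1
    return class_counts, value_counts

def posterior(features, class_counts, value_counts):
    """Bayes' theorem with the naive assumption that attributes are independent."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for i, value in enumerate(features):
            # P(value | class) with add-one smoothing; 2 = possible values per attribute.
            score *= (value_counts[(i, label)][value] + 1) / (count + 2)
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}

class_counts, value_counts = fit(train)
post = posterior(("yes", "yes"), class_counts, value_counts)
print(post, "->", max(post, key=post.get))
```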
Using the data of patients who are suffering from breast cancer, we need to build a model to understand
the performance of different classifiers. The Naïve Bayes model took 0.05 seconds to build; it correctly
classified 671 instances and misclassified 28, resulting in 95.99% accuracy (Hiba, Moatassime, & Noel,
2016).
Support Vector Machine: This is one of the most efficient data mining classification techniques used to
predict and diagnose cancer. It is a supervised learning model that helps in analyzing data and
recognizing the patterns used in classification. Researchers describe it as finding a linear hyperplane
that separates the points of different classes in a multidimensional space (Kumar, Ramachandra, &
Nagamani, 2013; Guptha, Kumar, & Sharma, 2011).
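The role of the separating hyperplane can be illustrated with a sketch of the decision rule only. The
weight vector w and offset b below are hypothetical, fixed by hand rather than obtained by training an
SVM; training would choose them so that the margin between the two classes is as wide as possible.

```python
import math

def decision_value(x, w, b):
    """Signed score w.x + b; its sign tells which side of the hyperplane x lies on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(x, w, b):
    """Assign class +1 or -1 according to the side of the hyperplane."""
    return 1 if decision_value(x, w, b) >= 0 else -1

def distance_to_hyperplane(x, w, b):
    """Geometric distance from x to the hyperplane w.x + b = 0."""
    return abs(decision_value(x, w, b)) / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical hyperplane in two dimensions: 3*x1 + 4*x2 - 10 = 0.
w, b = [3.0, 4.0], -10.0
for point in ([4.0, 2.0], [1.0, 1.0]):
    print(point, predict(point, w, b), distance_to_hyperplane(point, w, b))
```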
Because it is used to predict and diagnose cancer, its main purpose needs to be understood: SVM minimizes
classification errors when finding the hyperplane between the two classes (Senturk & Kara, 2014). The
objective of the SVM model is to identify the breast cancer patients for whom chemotherapy extends
survival time (Kharya, 2012). The SVM classifies patients into three prognostic groups: Good,
Intermediate, and Poor. Patients in the Good group should not receive chemotherapy, those in the
Intermediate group should receive chemotherapy based on survival curve analysis, and those in the Poor
group must receive chemotherapy (Kharya, 2012).
To understand the performance of the classifiers, we need to build a model. The SVM model took 0.07
seconds to build; it correctly classified 678 instances and misclassified 21, with a reported accuracy of
97.13% (Hiba, Moatassime, & Noel, 2016).
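The accuracy, sensitivity, specificity, and precision figures quoted for the classifiers are all derived
from counts of correct and incorrect predictions. A minimal sketch of those definitions follows; the
counts passed in are hypothetical and are not taken from any of the cited studies.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute standard metrics from confusion-matrix counts.

    tp: malignant cases predicted malignant   fn: malignant cases predicted benign
    tn: benign cases predicted benign         fp: benign cases predicted malignant
    """
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,      # share of all predictions that are correct
        "sensitivity": tp / (tp + fn),      # share of malignant cases that are detected
        "specificity": tn / (tn + fp),      # share of benign cases correctly cleared
        "precision": tp / (tp + fp),        # share of malignant predictions that are right
    }

# Hypothetical counts for 100 test instances.
metrics = classification_metrics(tp=40, tn=50, fp=5, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

Reporting sensitivity and specificity alongside accuracy matters in diagnosis, because a missed malignant
case and a false alarm carry very different costs.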
From all the above techniques, it is concluded that Support Vector Machines (SVM) rank as one of the most
efficient data mining techniques, with 97.13% accuracy in predicting and diagnosing breast cancer given
suitable analysis and classification, whereas Neural Networks ranked very low in predicting and
diagnosing the cancer, with 86.5% accuracy.

DATABASE: A database is a collection of data, schemas, tables, and other elements. Here, SEER is the
database used to store breast cancer patients’ data. The data is collected from various hospitals and
stored in the SEER database.
SEER: SEER is a breast cancer database consisting of 433,272 records and 72 variables (Kharya, 2012). The
database collects data on breast cancer patients, including the death ratio, lifetime risk, and
prevalence of cancer, and provides estimates of the deaths that occur every year. A newer version of SEER
uses three extra fields: VSR (vital status record), COD (cause of death), and STR (survival time record)
(Kharya, 2012). Including these records in the database makes it easier to store and retrieve the data
when needed (Guptha, Kumar, & Sharma, 2011).
DATA MINING TOOLS:
There are two important tools, Treatment Assistant and WEKA, with two different purposes. Treatment
Assistant analyzes data mining algorithms and suggests the most efficient technique, whereas WEKA is used
not only to analyze but also to visually present the data for a clear understanding, which could help in
predicting and diagnosing breast cancer.
Treatment Assistant: It is a software program built with the Java NetBeans interface. It uses data mining
algorithms such as decision tree, multilayer perceptron, and many more (Cakir & Demirel, 2010). It
analyzes the algorithms and suggests the most suitable methods for better results for every attribute.
The tool also suggests treatments for patients according to the results (Cakir & Demirel, 2010).
WEKA: WEKA (Waikato Environment for Knowledge Analysis) was developed at the University of Waikato in New
Zealand (Shrivastava, Sant, & Aharwal, 2013). It is an open-source data mining tool written in Java, used
for visualization, analysis, and prediction. With this tool, the cancer dataset is passed through
specific clustering stages: a clustering algorithm, hierarchical clustering, and finally k-means
clustering (Joshi, Doshi, & Patel, 2014). The WEKA toolkit can calculate different metrics on completion
of k-fold cross-validation (Bellaachia & Guren, 2006). WEKA allows every user to compare different
machine learning algorithms and methods on new datasets (Devi, 2011). WEKA can reduce the difficulty of
working with huge amounts of raw data and can run on any platform.
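The k-fold cross-validation mentioned above splits the data into k parts and uses each part once for
testing while the remaining parts are used for training. The sketch below shows only the splitting step
in plain Python; it is not WEKA code, and it omits the shuffling or stratification that a toolkit would
normally apply.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so that sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
for train, test in folds:
    print("train:", train, "test:", test)
```

A metric such as accuracy is computed on each test fold and then averaged, so every instance contributes
to the estimate exactly once.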
DISCUSSION AND IMPLEMENTATION:
In this paper, I observed that decision trees, support vector machines, and neural networks are some of
the efficient data mining techniques suggested by different researchers in numerous studies and used to
predict breast cancer. WEKA is a valuable tool for the analysis and visualization phases, the most
important part of the development cycle, and could be very helpful in predicting breast cancer
effectively. Another tool, Treatment Assistant, also produced satisfactory results in the analysis and
visualization of the important data. This article is clearer and more understandable than previous
studies because it reports the performance of each of the efficient data mining techniques individually.
However, a few of the data mining techniques, such as neural networks, have certain limitations. Future
research must be conducted to overcome these limitations; neural networks in particular need further
analysis to raise their accuracy above 95% in order to be an efficient data mining technique.
References
Bellaachia, A., & Guren, E. (2006). Predicting Breast Cancer Using Data Mining Techniques. The Age
(Melbourne, vic), 1-4.
Cakir, A., & Demirel, B. (2010). A Software Tool for Determination of Breast Cancer Treatment Methods
Using Data Mining Approach. Springer Science + Business Media, 1503-1511.
Devi.S, G. (2011). Breast Cancer Prediction System Using Feature Selection and Data Mining Methods.
International Journal of Advanced Research in Computer Science, 2(1), 81-87.
Guptha, S., Kumar, D., & Sharma, A. (2011). Data Mining Classification Techniques Applied for Breast
Cancer Diagnosis and Prognosis. Indian Journal of Computer Science and Engineering, 2(2), 188-
195.
Hiba Asri, Moatassime, H. A., & Noel, T. (2016). Using Machine Learning Algorithm for Breast Cancer Risk
Prediction and Diagnosis. The 6th International Symposium on Frontiers in Ambient and Mobile
Systems, 83(2016), 1064-1069.
Joshi, J., Doshi, R., & Patel, J. (2014). Diagnosis of Breast Cancer Using Clustering Data Mining Approach.
International Journal of Computer Applications, 101(10), 13-17.
Kharya, S. (2012). Using Data Mining Techniques for Diagnosis and Prognosis of Breast Cancer Disease.
International Journal of Computer Science, Engineering and Information Technology, 2(2), 55-66.
Kumar, G. R., Ramachandra, D., & Nagamani, K. (2013). An Effective Prediction of Breast Cancer Data
Mining Techniques. International Journal of Innovation in Engineering and Technology, 2(4), 139-
144.
Nakte, J., & Himmatramaka, V. (2016). Breast Cancer Prediction Using Data Mining Techniques.
International Journal on Recent and Innovation Trends in Computing and Communication, 4(11),
55-60.
Senturk, Z. K., & Kara, R. (2014). Breast Cancer Diagnosis via Data Mining: Performance Analysis of Seven
Different Algorithms. Computer Science & Engineering: An International Journal, 4(1), 35-46.
Shrivastava, S. S., Sant, A., & Aharwal, R. P. (2013). An Overview on Data Mining Approach on Breast
Cancer Data. International Journal of Advanced Computer Research, 3(1), 256-262.
