GROUP PROJECT
LECTURER: DR. IZWAN NIZAL MOHD SHAHARANEE
PROJECT: BREAST CANCER DATASET
DUE DATE: 28 MAY 2014
PREPARED BY:
211245 LEE PEI PEI
211330 SOH GUAN CHEN
211650 CHONG SIOW HUI
212072 LIM KOK SIANG
Contents
CHAPTER 1: INTRODUCTION ... 1
1.1 Background of the Problem ... 1
1.2 Motivation for the Reported Work ... 1
1.3 Define the Problem ... 2
1.4 Aims and Objectives ... 2
1.5 Significance of the Work ... 2
CHAPTER 2: LITERATURE REVIEW ... 3
CHAPTER 3: METHODOLOGY ... 4
3.1 Knowledge Discovery in Database (KDD) ... 4
3.1.1 Selection ... 4
3.1.2 Pre-processing ... 4
3.1.3 Transformation ... 5
3.1.4 Data Mining ... 6
3.1.5 Interpretation and Evaluation ... 6
3.2 Data Description ... 6
3.3 Process of Developing and Comparing the Models ... 8
3.3.1 Data Mining Methodology ... 8
3.3.2 Models Development ... 9
CHAPTER 4: KNOWLEDGE DISCOVERY PROCESS IN SAS ENTERPRISE MINER ... 11
4.1 Data Selection ... 11
4.2 Pre-processing ... 13
4.3 Transformation ... 14
4.4 Data Mining ... 15
4.4.1 Logistic Regression ... 15
4.4.2 Neural Network ... 17
4.4.3 Decision Tree ... 19
4.5 Interpretation and Evaluation ... 22
CHAPTER 5: RESULT AND DISCUSSION ... 26
CHAPTER 6: CONCLUSION ... 27
CHAPTER 7: REFERENCE ... 28
CHAPTER 8: APPENDICES ... 29
CHAPTER 1: INTRODUCTION

1.1 Background of the Problem

According to Wikipedia (2014), breast cancer is a type of cancer originating from breast tissue, most commonly from the inner lining of the milk ducts or the lobules that supply the ducts with milk. Breast cancer occurs in humans and other mammals. The majority of human cases occur in women, while a smaller number of cases occur in men. There are some noticeable signs and symptoms: the first noticeable symptom of breast cancer is typically a lump that feels different from the rest of the breast tissue. Around 80% of women only discover that they have breast cancer after feeling such a lump. Breast cancer can be classified as benign or malignant, and we cannot know which a tumour is before a diagnostic test is carried out. There are several criteria to be observed: uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin and normal nucleoli. By observing these criteria, doctors or scientists are able to make a decision according to the patient's diagnostic test results. An unhealthy lifestyle carries a higher risk of breast cancer: smoking, consuming oily food and alcohol, lack of exercise and constantly working under stress are examples of an unhealthy daily routine. Genetics play only a minor role in most cases, meaning that breast cancer is attributed less to genetics than to an unhealthy lifestyle.

1.2 Motivation for the Reported Work

By building and understanding the three models and selecting the best one, we gain a better understanding of the differences among them. Through this project, we are able to differentiate the models and apply the appropriate model to a given scenario.
Also, with the aid of software such as
SAS Enterprise Miner, we are able to develop more organised and systematic models that can be understood by everyone. The real breast cancer dataset enables us to look through the classes of breast cancer and classify observations based on their characteristics.

1.3 Define the Problem

In our group project, we are given the breast cancer dataset, which consists of 11 variables and 699 observations. The 11 variables are Sample code number, Clump Thickness, Uniformity of Cell Size, Uniformity of Cell Shape, Marginal Adhesion, Single Epithelial Cell Size, Bare Nuclei, Bland Chromatin, Normal Nucleoli, Mitoses and Class. The dataset is also found to contain missing values, so we are required to replace them. With the given dataset, we are required to develop three models and decide which model is the best at predicting or classifying the target variable: the more accurate the model, the better. Every model has its own advantages and disadvantages, strengths and weaknesses, and its own characteristics in dealing with different situations. So, we have to classify the classes of breast cancer and come out with a good and accurate model.

1.4 Aims and Objectives

Our aim is to develop three models for the breast cancer dataset and decide which of the three is the best. One objective is to determine the three models that are most suitable and relevant for the breast cancer dataset. Another is to identify the best model for the dataset.

1.5 Significance of the Work

The project is significant for researchers, scientists, doctors and others in related areas, so that they can determine the classes of breast cancer. The
best developed model can help them quickly detect the category of breast cancer in patients. In addition, quicker detection of breast cancer will have a positive impact on society as well as on the patients.

CHAPTER 2: LITERATURE REVIEW

Two relevant studies have previously used the breast cancer dataset. Wolberg and Mangasarian (1990) applied multisurface pattern separation, a mathematical method for distinguishing the elements of two pattern sets, where each element is comprised of various scalar observations. In their research, they used the diagnosis of breast cytology to demonstrate the applicability of this method to medical diagnosis and decision making. Only 369 samples were used as training data in their study, and classification results were collected over a single trial. The results showed that two pairs of parallel planes were consistent with 50% of the data and misclassified 6.5% of the samples, giving 93.5% accuracy on the remaining 50% of the dataset; three pairs of parallel planes were consistent with 67% of the data and misclassified 4.1% of the samples, giving 95.9% accuracy on the remaining 33% of the dataset. Wolberg and Mangasarian also showed that the multisurface method of pattern separation is more powerful than other methods for breast cytology diagnosis because it utilises all of the available diagnostic information. According to Zhang (1992), only 369 instances of the dataset were used in his research. He applied four instance-based learning algorithms, with classification results averaged over 10 trials. The best result was obtained with the one-nearest-neighbour algorithm, which achieved 93.7% accuracy; the training data comprised 200 instances (54%) and the test data 169 instances (46%).
He was also interested in using only typical instances, which achieved 92.2% accuracy while storing only 23.1 instances on average. The training dataset used in this study was again 200 instances (54%), with 169 instances (46%) for the test dataset.
CHAPTER 3: METHODOLOGY

3.1 Knowledge Discovery in Database (KDD)

3.1.1 Selection

Data selection aims to acquire a dataset of the most appropriate size for the KDD process. Sampling methods can be used when the data is too big; there are two types, probability sampling and non-probability sampling. Three approaches can be used to determine the sample size. The first is the central limit theorem, under which the sample size n must be greater than 30 (n > 30); if n equals the size of the population, then the standard error is 0.
The second approach is based on the confidence interval and the accepted error.
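As a sketch of this second approach, the standard sample-size formula for estimating a proportion is n = z²p(1-p)/e², where z is the z-value for the chosen confidence level and e is the accepted error. The figures below are illustrative, not taken from the report:

```python
import math

def sample_size(p=0.5, z=1.96, e=0.05):
    """Minimum n to estimate a proportion p with confidence z (1.96 ~ 95%)
    and accepted error e; p = 0.5 is the conservative worst case."""
    return math.ceil(z**2 * p * (1 - p) / e**2)

n = sample_size()   # 95% confidence, 5% accepted error -> 385 respondents
```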
The third is subject to data availability. The dataset for our project is the Wisconsin Breast Cancer Database (January 8, 1991), obtained from the UCI Machine Learning Repository (1992).

3.1.2 Pre-processing

Pre-processing ensures that the data is clean. Certain data mining algorithms require pre-processing for better performance; for example, neural networks are incapable of performing well on string data types. Real-world data usually contains missing values, noisy data and inconsistent values, and the following methods address them.

Data cleaning handles incomplete, noisy and inconsistent data. Incomplete or missing data is due to improper data collection. To handle missing data, we can substitute the mean value, estimate the probable value using regression, use a constant value such as null, or ignore the missing record. Noisy data is random error or variance in the data, caused for example by corrupted data transmission or technological limitations; while entering data into software such as SPSS or SAS, we may also key in wrong values, which likewise produces noise. To solve this problem, we can use the binning method or remove outliers. Inconsistent data means the data contains replicated or redundant records, and the remedy is to remove the redundant or replicated data.

Data integration deals with data that comes from different sources with different naming standards, which causes inconsistencies and redundancies. There are several ways to handle this problem: consolidate the different sources into one repository (using metadata), or apply correlation analysis (measuring the strength of the relationship between different attributes).

Data reduction is the transformation of numerical or alphabetical digital information, derived empirically or experimentally, into a corrected, ordered and simplified form. It increases efficiency by reducing a huge dataset into a smaller representation. Several techniques can be used, such as data cube aggregation, dimension reduction, data compression and discretisation.

3.1.3 Transformation

The transformation process, also known as data normalisation, basically re-scales the data into a suitable range. This is important because it can increase processing speed and reduce memory allocation. There are several transformation methods.

Z-score normalisation is useful when the extreme values are unknown or outliers dominate the extreme values; each value is re-scaled by subtracting the mean and dividing by the standard deviation.
Min-max normalisation is a linear transformation of the original input into a newly specified range, typically [0, 1].
Decimal scaling divides each value by 10^n, where n is the number of digits of the maximum absolute value.
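The three normalisation methods can be sketched in a few lines of Python; numpy is used for convenience and the sample values are illustrative:

```python
import numpy as np

x = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Z-score: subtract the mean and divide by the standard deviation,
# giving values centred on 0 with unit variance
z_score = (x - x.mean()) / x.std()

# Min-max: linear transformation into a newly specified range, here [0, 1]
min_max = (x - x.min()) / (x.max() - x.min())

# Decimal scaling: divide by 10**n, n = digits of the maximum absolute value
n = len(str(int(np.abs(x).max())))   # 1000 -> 4 digits
scaled = x / 10**n                   # all values fall inside (-1, 1)
```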
3.1.4 Data Mining

Data mining is the use of algorithms to extract information and patterns as part of the KDD process. This step applies algorithms to the transformed data to generate the desired results, and it is the heart of the KDD process, where unknown patterns are revealed. Examples of algorithms: regression (classification, prediction), neural networks (prediction, classification, clustering), the Apriori algorithm (association rules), k-means and k-nearest neighbours (clustering), decision trees (classification) and instance-based learning (classification).

3.1.5 Interpretation and Evaluation

Some data mining output is in a format that is not human-understandable, and interpretation is needed for better understanding. So, we convert the output into an easily understood medium (graphs, mathematical models, tables, etc.). Visualisation methods include graphical (charts, graphs), geometric (box plots), icon-based (figures, icons), pixel-based (coloured pixels), hierarchical (trees) and hybrid (a combination of any of these).

3.2 Data Description

The dataset for our project is the Wisconsin Breast Cancer Database (January 8, 1991), obtained from the UCI Machine Learning Repository (1992). The sources of our dataset are as below:

1) Dr. William H. Wolberg (physician), University of Wisconsin Hospitals, Madison, Wisconsin, USA
2) Donor: Olvi Mangasarian (mangasarian@cs.wisc.edu), received by David W. Aha (aha@cs.jhu.edu), 15 July 1992

There are a total of 699 instances in the database and 11 attributes, including the class attribute. The attribute names and their domains are shown below.

1. Sample code number: id number
2. Clump Thickness: 1 - 10
3. Uniformity of Cell Size: 1 - 10
4. Uniformity of Cell Shape: 1 - 10
5. Marginal Adhesion: 1 - 10
6. Single Epithelial Cell Size: 1 - 10
7. Bare Nuclei: 1 - 10
8. Bland Chromatin: 1 - 10
9. Normal Nucleoli: 1 - 10
10. Mitoses: 1 - 10
11. Class: 2 for benign, 4 for malignant
The 11 variables are Sample code number, Clump Thickness, Uniformity of Cell Size, Uniformity of Cell Shape, Marginal Adhesion, Single Epithelial Cell Size, Bare Nuclei, Bland Chromatin, Normal Nucleoli, Mitoses and Class. The attribute Class will be used as the target attribute. There are 16 missing values in the attribute Bare Nuclei; they will be replaced using the Replacement node. Before analysing the dataset, we replace the values in the attribute Class with their respective class labels: the value 2 is replaced with Benign and the value 4 with Malignant. Refer to Appendix 1 for the replaced dataset.
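This recoding of the numeric class codes into labels can be sketched with pandas; the three rows below are a hypothetical fragment, not the actual dataset:

```python
import pandas as pd

# Hypothetical fragment; column names follow the report's attribute list
df = pd.DataFrame({
    "Sample code number": [1000025, 1002945, 1015425],
    "Bare Nuclei": [1, 10, 2],
    "Class": [2, 4, 2],
})

# 2 -> Benign, 4 -> Malignant, as described above
df["Class"] = df["Class"].map({2: "Benign", 4: "Malignant"})
```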
The dataset is in xls format in MS Excel and will be imported into SAS Enterprise Miner for model development and comparison.

3.3 Process of Developing and Comparing the Models

3.3.1 Data Mining Methodology

There are two types of data mining methodology: hypothesis testing and knowledge discovery. Hypothesis testing is a top-down approach that attempts to substantiate or disprove a preconceived idea. Knowledge discovery, on the other hand, is a bottom-up approach that starts with the data and tries to find something that is unknown. In our project, we use the directed knowledge discovery method, where sources of pre-classified data are identified. The five steps of the knowledge discovery process shown in Table 3.1 will be developed and used by us.

Table 3.1: Knowledge Discovery Process
Data Selection: The selected dataset is the breast cancer dataset with 11 variables and 699 observations.
Pre-processing: The dataset contains 16 missing values, which will be replaced so that the data is clean and of high quality.
Transformation: Data from different sources will be transformed into a common format for processing.
Data Mining: Develop three models and apply algorithms to the transformed data to generate the desired results.
Interpretation and Evaluation: The results are interpreted and presented in a proper, visual manner.

The data mining task consists of predictive and descriptive modelling. Predictive modelling makes predictions about values of data using known results found from different data, performing inference based on the current data to make predictions. Descriptive modelling, on the other hand, identifies patterns or relationships in the data; it explores the properties of the data examined rather than predicting new properties, and it usually requires a domain expert. Thus, in our project we decided to use predictive modelling tools for our dataset. Under predictive modelling there are four types of models: classification, regression, time series analysis and prediction. By using predictive modelling tools, we can make predictions and inferences based on the available breast cancer dataset.

3.3.2 Models Development

Classification is chosen for its accuracy, speed, robustness, scalability and interpretability. Classification is accurate in that it can correctly predict new class labels, and it computes results quickly. It can make correct predictions even given noisy or missing data, it can construct the classifier efficiently given a large amount of data, and it gives good understanding and insight into the results. Moreover, classification techniques are most suitable for predicting datasets with binary or nominal categories; they are less effective for ordinal categories, since they do not consider the implicit order among the categories. Since our target variable is nominal, classification is the most suitable choice. The three models chosen to be developed under classification are logistic regression, neural network and decision tree.

3.3.2.1 Logistic Regression

Logistic regression is a nonlinear regression technique for problems having a binary outcome. The fitted regression equation limits the values of the output attribute to class values between 0 and 1, which allows the output to represent a probability of class membership. The target is a discrete (binary or ordinal) variable, while the input variables can have any measurement level.
The predicted values are the probability of a particular level of the target variable at the given values of the input variables.
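The way a logistic regression bounds its output between 0 and 1 can be sketched as follows; the coefficients and inputs here are made up for illustration, not the fitted values from our model:

```python
import math

def logistic(t):
    """Logistic (sigmoid) function: maps any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

# Hypothetical coefficients: intercept plus two inputs
b0, b1, b2 = -8.0, 0.6, 0.9
x1, x2 = 7, 6                          # e.g. two cytology scores on a 1-10 scale

p = logistic(b0 + b1 * x1 + b2 * x2)   # probability of class membership
```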
3.3.2.2 Neural Network

A neural network offers a mathematical model that attempts to mimic the human brain. Knowledge is represented as a layered set of interconnected processors, where each node has weighted connections to several nodes in adjacent layers. Individual nodes take the input received from connected nodes and use the weights together with a simple function to compute output values.

3.3.2.3 Decision Tree

A decision tree is a structure that can be used to divide a large collection of records into successively smaller sets of records by applying a sequence of simple decision rules. The algorithm used to construct a decision tree is referred to as recursive partitioning. The target variable is usually categorical, and the decision tree is used either to calculate the probability that a given record belongs to each category or to classify the record by assigning it to the most likely class. A decision tree has three types of nodes: the root node, the top (or left-most) node, with no incoming edges and zero or more outgoing edges; child or internal nodes, descendant nodes with exactly one incoming edge and two or more outgoing edges; and leaf nodes, terminal nodes with exactly one incoming edge and no outgoing edges. Each leaf node is assigned a class label. The rules or branches are the unique paths (edges) with sets of conditions (attributes) that divide the observations into smaller subsets.
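The node computation described for the neural network (inputs, weights, and a simple function) can be sketched as a single forward pass; all the weights below are arbitrary illustrative numbers:

```python
import numpy as np

# Three inputs feeding a hidden layer of three neurons, then one output node
x = np.array([0.2, 0.7, 0.1])

W1 = np.array([[ 0.5, -0.3,  0.8],
               [ 0.1,  0.9, -0.4],
               [-0.6,  0.2,  0.3]])    # one column of weights per hidden neuron
b1 = np.array([0.0, 0.1, -0.1])

# Each node: weighted sum of its inputs passed through a simple function (tanh)
hidden = np.tanh(x @ W1 + b1)

w2, b2 = np.array([0.4, -0.2, 0.7]), 0.05
output = np.tanh(hidden @ w2 + b2)     # final output value of the network
```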
CHAPTER 4: KNOWLEDGE DISCOVERY PROCESS IN SAS ENTERPRISE MINER

4.1 Data Selection

To begin, select Solution > Analysis > Enterprise Miner, and the SAS Enterprise Miner window will open. After that, click File > New > Project to create a new project named BreastCancer. After naming the project BreastCancer, click Create and rename the untitled diagram as Project.
The breast cancer dataset is imported from MS Excel into SAS Enterprise Miner and stored in EMDATA so that the data is kept in a permanent SAS library. Then, the Input Data Source node is added to the workspace so that the breast cancer dataset can be selected. The Input Data Source node represents the data source that we choose for a mining analysis and provides details (metadata) about the variables in the data source that we want to use. After dragging in the Input Data Source node, we click Open and select the breast cancer dataset named EMDATA.CANCER as the source data, which consists of a 699-observation metadata sample.

Data:
The data consists of 11 variables: 1 class variable (CLASS), an id variable (SAMPLE CODE NUMBER) and 9 interval variables (CLUMP THICKNESS, UNIFORMITY OF CELL SIZE, UNIFORMITY OF CELL SHAPE, MARGINAL ADHESION, SINGLE EPITHELIAL CELL SIZE, BARE NUCLEI, BLAND CHROMATIN, NORMAL NUCLEOLI and MITOSES). There are no missing values in any of the variables except BARE NUCLEI, which has 2% missing data.
Then, we click Variables and set the model roles. We change the model role for CLASS from input to target. The model role for each variable is shown below.

SAMPLE CODE NUMBER: id
CLUMP THICKNESS: input
UNIFORMITY OF CELL SIZE: input
UNIFORMITY OF CELL SHAPE: input
MARGINAL ADHESION: input
SINGLE EPITHELIAL CELL SIZE: input
BARE NUCLEI: input
BLAND CHROMATIN: input
NORMAL NUCLEOLI: input
MITOSES: input
CLASS: target

Variables:
Interval Variables:
Class Variables:
4.2 Pre-processing

We drag a Data Partition node and connect it to the Input Data Source node. This node partitions the breast cancer input data into training, validation and test sets. The training dataset is used for preliminary model fitting. The validation dataset is used to monitor and tune the free model parameters during estimation and is also used for model assessment. The test dataset is an additional holdout dataset that we can use for model assessment.
Then, we right-click the node and click Open. We set 70% for training, 0% for validation and 30% for test. Thus, the model is constructed from the 70% training data, and the remaining 30% test data is used for model evaluation: training data builds the model, while test data validates it.

Partition:
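The 70/30 split can be sketched as a random permutation of the row indices (numpy is used for illustration; in the report the split is performed by the Data Partition node):

```python
import numpy as np

n = 699                          # observations in the breast cancer dataset
rng = np.random.default_rng(1)   # fixed seed so the split is reproducible
idx = rng.permutation(n)         # shuffle the row indices

split = int(n * 0.70)            # 70% training, 30% test, no validation set
train_idx, test_idx = idx[:split], idx[split:]
```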
To develop the models, it is necessary to handle and replace the missing values. A Replacement node is dragged in and connected to the Data Partition node. We use the Replacement node to generate score code to process unknown levels when scoring, and to interactively specify replacement values for class and interval levels. In some cases we might want to reassign specified non-missing values before performing imputation calculations for the missing values.
The missing values are replaced using the Replacement node. The imputation method for the interval variables is the mean, whereas the imputation method for the class variables is the count (most frequent value). The variable CLASS is not replaced, as it is the target variable.

Interval Variables:
Class Variables:
After we run the node, we can see that the missing values of the BARE NUCLEI variable are replaced in the observations with the value 3.4886. The table below shows part of the dataset after the missing values have been replaced.
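The two imputation rules (mean for interval variables, most frequent level for class variables) can be sketched with pandas on a hypothetical fragment; the column "Category" is invented for illustration, and in the actual run the mean of the observed BARE NUCLEI values came out as 3.4886:

```python
import numpy as np
import pandas as pd

# Hypothetical fragment with missing values in both kinds of variable
df = pd.DataFrame({
    "Bare Nuclei": [1.0, 10.0, np.nan, 3.0],             # interval variable
    "Category": ["Benign", "Malignant", None, "Benign"], # class-level variable
})

# Interval variable: replace missing values with the mean of the observed ones
df["Bare Nuclei"] = df["Bare Nuclei"].fillna(df["Bare Nuclei"].mean())

# Class variable: replace missing values with the most frequent level (count)
df["Category"] = df["Category"].fillna(df["Category"].mode()[0])
```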
4.3 Transformation

We then drag a Transform Variables node and connect it to the Replacement node. The function of the Transform Variables node is to create new variables, or variables that are transformations of existing variables in the data. Transformations are useful when we want to improve the fit of a model to the data. The Transform Variables node also enables us to create interaction variables. Sometimes input data is more informative on a scale other than the one on which it was originally collected. For example, variable transformations can be used to stabilise variance, remove nonlinearity, improve additivity and counter non-normality. Therefore, for many models, transformations of the input data (either dependent or independent variables) can lead to a better model fit. These transformations can be functions of a single variable or of more than one variable.
In our project, we use the Transform Variables node to make variables better suited to the logistic regression model and the neural network.

4.4 Data Mining

4.4.1 Logistic Regression

The Regression node is dragged in and connected to the Transform Variables node. The function of the Regression node is to fit both linear and logistic regression models to the data. We can use continuous, ordinal and binary target variables, and both continuous and discrete input variables. The node supports the stepwise, forward and backward selection methods. In this project, we use the Stepwise method with the Profit/Loss criterion.
Variables:
Model Options:
Selection Method:
After that, we run the node and the results appear. We click on Statistics to obtain the results below.

Statistics:
From the results above, we can see that the Misclassification Rate for Training is 0.0286, while for Test it is 0.0524. Both rates are small, which indicates that our logistic regression model performs well. Then, we can obtain our logistic regression equation from the Estimates table.

Estimates:

Table:
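The misclassification rate reported here is simply the fraction of observations whose predicted class differs from the actual one, which can be sketched directly (the labels below are made up):

```python
def misclassification_rate(actual, predicted):
    """Fraction of observations whose predicted class differs from the actual."""
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)

actual    = ["Benign", "Benign", "Malignant", "Malignant", "Benign"]
predicted = ["Benign", "Malignant", "Malignant", "Malignant", "Benign"]

rate = misclassification_rate(actual, predicted)   # 1 wrong out of 5 -> 0.2
```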
The confusion matrix above shows that 64.42% of all observations are correctly classified as BENIGN and 32.72% are correctly classified as MALIGNANT; the remaining observations (roughly 3% in total) are misclassified, with some BENIGN cases classified as MALIGNANT and some MALIGNANT cases classified as BENIGN.

4.4.2 Neural Network

We drag a Neural Network node and connect it to the Transform Variables node. The Neural Network node is used to construct, train and validate multilayer, feed-forward neural networks. By default, the Neural Network node automatically constructs a network that has one hidden layer consisting of three neurons. In general, each input is fully connected to the first hidden layer, each hidden layer is fully connected to the next hidden layer, and the last hidden layer is fully connected to the output. The Neural Network node supports many variations of this general form.
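The node's default architecture (one hidden layer of three neurons) can be approximated in scikit-learn; the data below is a random stand-in, not the breast cancer dataset, so the numbers will not match the report:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Random stand-in: 10 interval inputs on a 1-10 scale, binary target
rng = np.random.default_rng(0)
X = rng.integers(1, 11, size=(200, 10)).astype(float)
y = (X[:, 5] > 5).astype(int)          # pretend one input drives the class

# One hidden layer with three neurons, mirroring the node's default
clf = MLPClassifier(hidden_layer_sizes=(3,), max_iter=3000, random_state=1)
clf.fit(X, y)

train_error = 1.0 - clf.score(X, y)    # misclassification rate on training data
```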
In our project, we also select the Profit/Loss model selection criterion.

Variables:
General:
After that, we run the node and the results appear.

Basic:
Table:
From the results, we can see that the Misclassification Rate for Training is 0.0307 and for Test is 0.0238. We can say that this model is good, as the error on the Test set is even smaller than on the Training set.
Weights:
In the Weights option, we are able to see the weights for all the variables. As we can see from the table above, we have 9 variables with their respective weights. The highest weight is for variable 1 (BARE NUCLEI), with a value of 0.2096, and the lowest is for variable 8 (UNIFORMITY OF CELL SHAPE), with a value of -0.0014. We can say that variable 1 contributes the most to the model, as its weight is the highest.

4.4.3 Decision Tree

We drag a Tree node and connect it to the Replacement node. The Tree node is used to fit decision tree models to the data. The implementation includes features found in a variety of popular decision tree algorithms such as CHAID, CART and C4.5. The node supports both automatic and interactive training.
When we run the Decision Tree node in automatic mode, it automatically ranks the input variables based on the strength of their contribution to the tree. This ranking can be used to select variables for use in subsequent modelling. We can override any automatic step with the option to define a splitting rule and prune explicit nodes or sub-trees. Interactive training enables us to explore and evaluate a large set of trees as we develop them.
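This automatic ranking of inputs can be approximated with scikit-learn's decision tree, whose feature_importances_ plays a similar role; the data is a random stand-in in which the first column drives the class, loosely echoing the report's first split on UNIFORMITY OF CELL SIZE at 2.5:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Random stand-in inputs on a 1-10 scale; the class is driven by column 0
rng = np.random.default_rng(0)
X = rng.integers(1, 11, size=(300, 3)).astype(float)
y = (X[:, 0] >= 2.5).astype(int)

# Recursive partitioning, capped at five leaves as in the report's tree
tree = DecisionTreeClassifier(max_leaf_nodes=5, random_state=1).fit(X, y)

ranking = tree.feature_importances_   # contribution of each input to the splits
```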
Variables:
Basic:
Then, we will run the node and the results appear. All:
We can see from the diagram above that only 5 leaves are selected for the Training dataset and the Misclassification Rate is 0.0286. This shows that the model is good, as the error on the Training dataset is small.
Summary:
The summary above is the confusion matrix: 64% of the Benign observations are correctly classified and 33% of the Malignant observations are correctly classified, while only 1% are incorrectly classified. We can also view the decision tree results by clicking View > Tree.
From the tree above, we can see that the five leaf nodes represent the class labels, all correctly classified. With UNIFORMITY OF CELL SIZE less than 2.5 and BARE NUCLEI less than 5.5, an observation is classified as class BENIGN. For UNIFORMITY OF CELL SIZE of 2.5 and above, if BARE NUCLEI is 3.7443 or above, the observation is classified as class MALIGNANT. On the other hand, if BARE NUCLEI is less than 3.7443, an observation with UNIFORMITY OF CELL SIZE less than 4.5 is classified as class BENIGN; otherwise, with UNIFORMITY OF CELL SIZE of 4.5 and above, it is classified as class MALIGNANT. Next, we can see the competing splits for the decision tree by right-clicking and selecting View competing splits.
From the table below, we can say that the UNIFORMITY OF CELL SIZE variable is used for the first split, with the highest logworth of 78.522, while the other variables, such as UNIFORMITY OF CELL SHAPE, BARE NUCLEI, BLAND CHROMATIN and SINGLE EPITHELIAL CELL SIZE, are the competing splits for the first split.
4.5 Interpretation and Evaluation

An Assessment node is dragged in and connected to the Regression node, the Neural Network node and the Tree node. The Assessment node provides a common framework for comparing models and predictions from any of the modelling nodes (Regression, Tree, Neural Network and User Defined Model nodes).
After that, the node is run and the results appear.
Models:
The table above shows that for the Decision Tree model the Misclassification Rate is 0.0286 on the Training dataset and 0.0667 on the Test dataset, while for the Neural Network model it is 0.0307 on Training and 0.0238 on Test. Both models have very small errors, but the Neural Network model is better than the Decision Tree model, as its misclassification rate on Test is smaller. Next, we can see the lift chart for each model by highlighting the model we want and clicking Draw Lift Chart.
The lift charts for the models are shown below.

Decision Tree:
From the lift chart for the Decision Tree model, from the 10th to the 20th percentile the cumulative % response is 96.454%. At the 30th percentile, the next observation with the highest predicted probability is a non-response, so the cumulative response drops to 96.431%.

Neural Network:
From the lift chart for the Neural Network model, from the 10th to the 20th percentile the cumulative % response is 100.000%. At the 30th percentile, the next observation with the highest predicted probability is a non-response, so the cumulative response drops to 99.318%.

Logistic Regression:
From the lift chart for the Logistic Regression model, from the 10th to the 20th percentile the cumulative % response is 100.000%. At the 30th percentile, the next observation with the highest predicted probability is a non-response, so the cumulative response drops to 98.637%. Thus, all three models can be used. In addition, an Insight node can be added and connected to the Assessment node with all three models to see the results for the breast cancer dataset. The Insight node enables us to open a SAS/INSIGHT session; SAS/INSIGHT software is an interactive tool for data exploration and analysis.
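The cumulative % response read off these lift charts can be computed by sorting observations by predicted probability and taking the running response rate at each percentile depth; a small sketch with made-up responses and scores:

```python
import numpy as np

def cumulative_response(y_true, scores, percentiles=(10, 20, 30)):
    """Sort by predicted probability (descending) and report the cumulative
    % response at each percentile depth, as in a lift chart."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    ranked = np.asarray(y_true)[order]
    return {p: 100.0 * ranked[: int(len(ranked) * p / 100)].mean()
            for p in percentiles}

# Made-up data: the first non-response appears within the top 30% of scores,
# so the cumulative % response drops at that depth
y = [1, 1, 0, 1, 1, 1, 1, 0, 0, 0]
s = [0.99, 0.95, 0.90, 0.85, 0.80, 0.70, 0.65, 0.60, 0.40, 0.20]

resp = cumulative_response(y, s)
```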
The table below shows part of the results from the SAS/INSIGHT session. The last column, Class, shows the Class predicted for every observation by SAS Enterprise Miner. We can compare the predicted Class with the Class in our original dataset to see the differences between them.
CHAPTER 5: RESULT AND DISCUSSION

The misclassification rates on the training and test datasets for the three models are shown in the table below.

Model | Misclassification Rate (Training) | Misclassification Rate (Test)
Regression | 0.0286 | 0.0524
Neural Network | 0.0307 | 0.0238
Decision Tree | 0.0286 | 0.0667
From the above results, we can say that the Neural Network model should be used for the breast cancer dataset, as its misclassification rate on the Test dataset is the smallest, with a value of 0.0238, compared to the other two models. The Decision Tree model is the worst of the three, as its misclassification rate on the Test dataset is the highest, with a value of 0.0667.
CHAPTER 6: CONCLUSION

It is important for professionals and specialists in all related areas to classify the classes of breast cancer correctly. The models used in this project cannot be guaranteed to apply perfectly to all breast cancer patients, yet they can serve as a guideline for knowing and understanding the classes of breast cancer more quickly. Based on our results, we conclude that the Neural Network model is the best at classifying the Breast Cancer dataset, as its misclassification rate is the lowest. The models will be developed and revised over time as the number of variables contributing to the data increases.
CHAPTER 7: REFERENCE

UCI Machine Learning Repository. (1992). Breast Cancer Wisconsin (Original) Data Set. Retrieved 21 May 2014 from https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29
Wikipedia. (2014). Breast Cancer. Retrieved 21 May 2014 from http://en.wikipedia.org/wiki/Breast_cancer
Wolberg, W. H., & Mangasarian, O. L. (1990). Multisurface method of pattern separation for medical diagnosis applied to breast cytology. Proceedings of the National Academy of Sciences, 87, 9193-9196.
Zhang, J. (1992). Selecting typical instances in instance-based learning. In Proceedings of the Ninth International Machine Learning Conference (pp. 470--479). Aberdeen, Scotland: Morgan Kaufmann.