
The Adoptive Comparative Approach to Migrate from kNN to Decision Tree on Medical Data


Gayam Suhasini
M.Tech Student
Department of Information Technology
VR Siddhartha Engineering College
Vijayawada, India
suhasinigayam@gmail.com

Vemuri Sindhura
Assistant Professor
Department of Information Technology
VR Siddhartha Engineering College
Vijayawada, India
vemurisindhura2233@gmail.com

Sandeep Y
Assistant Professor
Department of Information Technology
VR Siddhartha Engineering College
Vijayawada, India
Sandeep.yelisetti@gmail.com

Abstract: Analysis of medical data is quite challenging from the perspective of the incremental growth of attributes and various parameters. Once the data is grouped, there is a particular limit to ending up with the data as tuples or attributes in the data sets. The work is on different types of attributes such as cholesterol and BP. In kNN classification the medical attributes are classified into different classes, and all these classes fall under different combinational and conditional ranges with respect to medical complications. kNN finds the nearest neighbours of a new category of values, classifies them and produces the result in the form of classes; this result depends on the value of K and on the Euclidean distance of the new tuples to their neighbours. For the decision tree we have considered a normal and an enhanced decision tree; the main difference is that the enhanced tree decreases the tree hierarchy compared with the normal tree. In the normal decision tree and the enhanced decision tree we consider the same attributes to build the tree and predict the risk factor of heart disease, and the enhanced decision tree produces more accurate results than the normal decision tree. kNN classification is based on attribute ranges for presentation, but in this case exhibiting the desired result is costly in terms of the time complexity of classification, so our work migrates in an adoptive manner to decision trees at low and high levels of exhibiting the desired output. [1][3]

(Keywords: KNN, NAE, decision tree, classification, Euclidean distance)

I. INTRODUCTION

The main aim of this work is to achieve feasible analysis of huge medical data by calculating the frequency of combinational attributes with respect to age, disease and gender. The biggest reason is to push anonymous diseases into research. In large data, spotting the desired disease data is too complicated, so we use kNN with a Euclidean distance mechanism and a decision tree with normal and enhanced models with respect to the given attributes. In kNN the Euclidean distance is generated for all tuples, a rank is provided for the new classification of the new set, and the result is generated based on the k value as the nearest neighbours. The comparison between the two approaches mentioned above is made with respect to the analysis of the time complexity of classification. The main classification is done on huge and dynamic patch data, so it is performed with the kNN approach and with instant decision trees built using the NAE approach (normal and enhanced decision tree structures). The best of the three machine learning algorithms is identified based on the accuracy measurements.
II. PROPOSED WORK

The proposed system, as shown in Fig. 1, consists of three stages. First, raw data is collected from the laboratory information system: diagnosis, procedures, pharmacy, procedure notes and nurse records. The raw data is then filtered, and the attributes required to perform classification are retrieved.
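As an illustration of the filtering stage, the minimal sketch below loads raw patient records and keeps only the attributes used later for classification. The file name patients_raw.csv and the column names (aadhaar_id, age, gender, cholesterol, bp, overweight, smoking, alcohol, chest_pain, risk) are assumptions made for this example, not part of the original system.

import pandas as pd

# Assumed raw export from the laboratory information system.
RAW_FILE = "patients_raw.csv"

# Attributes assumed to be needed by the classifiers described below.
REQUIRED_ATTRIBUTES = [
    "aadhaar_id", "age", "gender", "cholesterol", "bp",
    "overweight", "smoking", "alcohol", "chest_pain", "risk",
]

def load_filtered_records(path=RAW_FILE):
    """Read raw records and retain only the attributes needed for classification."""
    raw = pd.read_csv(path)
    # Keep only the required attributes and drop incomplete rows.
    return raw[REQUIRED_ATTRIBUTES].dropna()

if __name__ == "__main__":
    records = load_filtered_records()
    print(records.head())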
Machine learning algorithms are applied on the collected medical data to predict heart disease. The kNN algorithm, the decision tree and the enhanced decision tree are the techniques used to classify the medical data. In the kNN algorithm, classifiers are generated and then the Euclidean distance is calculated to find the class of the new input medical records; the k value is taken to generate the number of classes related to the new records.
In the decision tree, the data set is read to retrieve the age limit used to build the root node, and cholesterol and overweight ranges are considered to build the further-level nodes. The count of all the records that have a high risk factor of heart disease is displayed along with the Aadhaar IDs of the patients. In the enhanced decision tree, the tuples are read, gender is taken as the root node, and pairs of attributes are considered to build the next-level nodes, such as overweight, cholesterol and chest pain for female patients, while smoking, alcohol and cholesterol are considered as leaf attributes to predict the risk factor of heart disease. The final count of all the patients who have a high risk factor is displayed with respect to their Aadhaar numbers. The best technique among the three is found out based on the accuracy measurements.
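To make the prediction step of the enhanced tree concrete, the sketch below encodes the levels described above as simple rules: gender at the root, overweight, cholesterol and chest pain checked for female patients, and smoking, alcohol and cholesterol checked at the leaves (assumed here to apply to the remaining patients). The thresholds, field names and branch assignment are illustrative assumptions, not the values used in the paper.

def high_risk(patient):
    """Illustrative rule following the enhanced-tree levels described above."""
    if patient["gender"] == "female":
        # Next-level and leaf attributes described for female patients.
        return (patient["overweight"] and patient["cholesterol"] > 240
                and patient["chest_pain"])
    # Leaf attributes, assumed here to apply to the remaining patients.
    return patient["smoking"] and patient["alcohol"] and patient["cholesterol"] > 240

def count_high_risk(patients):
    """Return the count and the Aadhaar numbers of patients predicted at high risk."""
    flagged = [p["aadhaar_id"] for p in patients if high_risk(p)]
    return len(flagged), flagged

# Example usage with toy patient records (Aadhaar values are placeholders).
patients = [
    {"aadhaar_id": "XXXX-1", "gender": "female", "overweight": True,
     "cholesterol": 260, "chest_pain": True, "smoking": False, "alcohol": False},
    {"aadhaar_id": "XXXX-2", "gender": "male", "overweight": False,
     "cholesterol": 250, "chest_pain": False, "smoking": True, "alcohol": True},
]
print(count_high_risk(patients))  # (2, ['XXXX-1', 'XXXX-2'])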

Fig 1: Architecture of proposed system


III. DATA MINING TECHNIQUES

Data mining is used to make intellect and use of information. Knowledge discovery in data is the non-trivial process of identifying usable, unique, theoretically valuable and comprehensible strategies in data. Data mining consists of more than the gathering and handling of data; it also contains exploration and prediction. Data mining includes several algorithms such as decision tree, kNN, Naive Bayesian, support vector machines and the Apriori algorithm. All these algorithms perform well in medical diagnosis, healthcare systems and medicine. In this paper the decision tree, enhanced decision tree and kNN algorithms are used to perform analysis on medical data to predict class objects.

3.1 KNN

Medical data is a dynamically growing source: datasets of information from various hospitals in the form of tuples as patient records. When these data sets are mined, the hidden information in them is a big resource bundle for medical research and presentation. All the data contains different patterns and close sets of relations, which can result in better diagnosis. The main tough task is the discovery and classification of these patterns and relations, which is often exposed in IT. Research in medical diagnosis has moved to finding out heart diseases, lung diseases and various thyroid problems based on the data collected on affected patients. However, there is a limitation of domain-specific systems, which can investigate only the diseases restricted to their relevant operations. In retrospect, the calibre of the k-nearest neighbour (kNN) classifier totally depends on the distance metric used to recognize the k nearest neighbours of the query points. The standard Euclidean distance is commonly used in regular data analysis [10]. The research uses a huge store of information so that a diagnosis based entirely on historical data can be made. The goal is to compute the probability of occurrence of a particular ailment by using a new algorithm. This kNN algorithm vastly increases the accuracy of diagnosis. This approach can be used to upgrade automated diagnosis, which merges the diagnosis of multiple diseases showing similar attributes and symptoms [2]. Taking cholesterol (c) and BP (b) as the attributes, the distance between a new record (subscript n) and a stored patient record (subscript p) is

Distance = \sqrt{(c_n - c_p)^2 + (b_n - b_p)^2}    (1)

Algorithm:

Fig 2: pseudo code for Euclidean Distance

Fig 3: kNN pseudo code
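As a minimal illustration of the kNN classification described above, the sketch below computes the Euclidean distance of equation (1) over the cholesterol and BP attributes and assigns the majority class among the k nearest neighbours. The record layout and the function names are assumptions made for this example, not the paper's implementation.

import math
from collections import Counter

def distance(new_rec, patient_rec):
    """Euclidean distance of equation (1) over cholesterol (c) and BP (b)."""
    return math.sqrt((new_rec["cholesterol"] - patient_rec["cholesterol"]) ** 2
                     + (new_rec["bp"] - patient_rec["bp"]) ** 2)

def knn_predict(training_records, new_record, k=3):
    """Classify new_record by majority vote among its k nearest training records."""
    ranked = sorted(training_records, key=lambda rec: distance(new_record, rec))
    votes = Counter(rec["risk"] for rec in ranked[:k])
    return votes.most_common(1)[0][0]

# Example usage with toy records (values are illustrative only).
training = [
    {"cholesterol": 240, "bp": 150, "risk": "high"},
    {"cholesterol": 180, "bp": 120, "risk": "low"},
    {"cholesterol": 260, "bp": 160, "risk": "high"},
    {"cholesterol": 170, "bp": 110, "risk": "low"},
]
print(knn_predict(training, {"cholesterol": 230, "bp": 145}, k=3))  # -> "high"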

3.2 DECISION TREE

Decision trees are a widely researched solution for classification tasks. For many realistic and practical tasks the tree generated by the algorithms is not comprehensible to the end user, owing to the large data size and the complexity of the relational data with respect to its attributes. Many tree approaches have been worked out to produce simpler and more comprehensible trees with good classification accuracy, but simplification of the tree is usually of secondary concern relative to accuracy, and no practical attempt has been made to survey the literature from the angle of simplification. A framework organizes the approaches to tree simplification, and the approaches are concluded within this framework. The aim of this part of the research is to provide researchers with a fine overview of tree simplification approaches and insight into their combinational capabilities [1][2][3][7].

Our aim is to predict the risk factor of heart disease using the two trees, the decision tree and the enhanced decision tree [9]. To build the decision tree we consider C4.5, which takes numerous data as input. Split information is calculated to find the splitting of the nodes. With S the data set, A an attribute and S_1, ..., S_n the partition of S induced by the values of A, the split information is

SplitInformation(S, A) = - \sum_{i=1}^{n} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}    (2)

Algorithm:

Fig 4: pseudo code for decision tree
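As a minimal sketch of equation (2), assuming the records are Python dicts with a categorical attribute (for example an age_range column derived from the age limit mentioned above), the function below computes the split information that C4.5 uses to penalize attributes splitting the data into many small partitions. The attribute and record names are illustrative assumptions.

import math
from collections import Counter

def split_information(records, attribute):
    """SplitInformation(S, A) = -sum_i (|S_i|/|S|) * log2(|S_i|/|S|), equation (2)."""
    total = len(records)
    counts = Counter(rec[attribute] for rec in records)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Example usage with an assumed categorical attribute.
records = [
    {"age_range": "40-50", "risk": "low"},
    {"age_range": "40-50", "risk": "high"},
    {"age_range": "50-60", "risk": "high"},
    {"age_range": "60+",   "risk": "high"},
]
print(round(split_information(records, "age_range"), 3))  # 1.5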


3.3 ENHANCED DECISION TREE
The enhanced decision tree algorithm finds a locally best solution for each decision node. To break out of local optima and to find the global solution we choose a pair of attributes simultaneously, not one attribute. The enhanced decision tree method, in choosing attributes, considers the information gain of choosing a pair of attributes concurrently instead of choosing only one attribute. Therefore, to improve the possibility of reaching the global optimum solution, considering an optimum pair of attributes is better than a single attribute. The enhanced decision tree considers the pair of attributes to construct the leaf-level nodes and so reduce the number of levels.

S is the data set and A is the set of attributes; the equation below calculates the information gain for a pair of attributes (A_i, A_j) in A. Information gain is used to find the splitting for the pair of attributes or for a single attribute.

InformationGain(S, A_i, A_j) = Entropy(S) - \sum_{x \in values(A_i)} \sum_{u \in values(A_j)} \frac{|S_{x,u}|}{|S|} Entropy(S_{x,u})    (3)

Algorithm:

Fig 5: pseudo code for enhanced decision tree
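A minimal sketch of equation (3), assuming categorical attributes stored in Python dicts with the class label in a risk field: the pair information gain is the entropy of the whole set minus the weighted entropy of the subsets S_{x,u} formed by each combination of values of the two attributes. The function and field names are illustrative assumptions.

import math
from collections import Counter, defaultdict

def entropy(records, target="risk"):
    """Entropy(S) over the class label (assumed here to be the 'risk' field)."""
    total = len(records)
    counts = Counter(rec[target] for rec in records)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def pair_information_gain(records, attr_i, attr_j, target="risk"):
    """InformationGain(S, A_i, A_j) of equation (3): split S by value pairs (x, u)."""
    total = len(records)
    subsets = defaultdict(list)
    for rec in records:
        subsets[(rec[attr_i], rec[attr_j])].append(rec)
    weighted = sum(len(sub) / total * entropy(sub, target) for sub in subsets.values())
    return entropy(records, target) - weighted

# Example usage with toy categorical records.
records = [
    {"gender": "F", "overweight": "yes", "risk": "high"},
    {"gender": "F", "overweight": "no",  "risk": "low"},
    {"gender": "M", "overweight": "yes", "risk": "high"},
    {"gender": "M", "overweight": "no",  "risk": "high"},
]
print(round(pair_information_gain(records, "gender", "overweight"), 3))  # 0.811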

IV. ANALYSIS

4.1 PRECISION
Precision refers to the closeness of two or more measurements to each other. Precision (P) is defined as the number of true positives (Tp) over the number of true positives added to the number of false positives (Fp) [3].

P = \frac{T_p}{T_p + F_p}    (4)

4.2 RECALL
Recall is the measure of the truly relevant results returned. Recall (R) is defined as the number of true positives (Tp) over the number of true positives plus the number of false negatives (Fn) [3].

R = \frac{T_p}{T_p + F_n}    (5)

4.3 F-MEASUREMENT
The F-measure is a measure of a test's accuracy. It considers the precision p and the recall r of the test to compute the score: p is the number of correct positive results divided by the number of all positive results, and r is the number of correct positive results divided by the number of positive results that should have been returned. The F1 score is interpreted as a weighted average of the precision and recall, where the F1 score reaches its best value at 1 and its worst at 0 [3].

F = 2 \cdot \frac{precision \cdot recall}{precision + recall}    (6)

Precision, recall and F-measurement are the accuracy measurements performed on the medical data to find the most accurate algorithm among the decision tree, the enhanced decision tree and kNN.

V. RESULTS AND OBSERVATIONS

Fig 6: Result of kNN


The kNN algorithm predicts the class of the new attributes that are nearest to the input attributes, and the elapsed time is shown.

Fig 7: Normal decision tree

Fig 8: Risk factor


Figs. 7 and 8 show the results of the normal decision tree, which predicts the risk factor of heart disease together with the count of records and the elapsed time.

Fig 9: Enhanced decision Tree

Fig 10: Risk factor


Figs. 9 and 10 show the result of the enhanced decision tree, which predicts the risk factor by considering more than two attributes at the leaf level and produces the count of records that have the risk factor.

Fig 11: comparison graph


The above graph shows the values of precision, recall and elapsed time for the machine-learning-based kNN algorithm, the decision tree and the enhanced decision tree. Considering all of these values, the enhanced decision tree produces more accurate results than the other two techniques on the medical data.
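As a hedged sketch of how such a comparison could be computed, the snippet below derives precision, recall and F-measure (equations (4)-(6)) and the elapsed time for one classifier from its predictions. The classifier output and label values used here are placeholders, not the paper's actual experiment.

import time

def precision_recall_f1(y_true, y_pred, positive="high"):
    """Compute P, R and F of equations (4)-(6) for the positive (high-risk) class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure

# Example usage: placeholder ground truth and classifier predictions.
y_true = ["high", "low", "high", "high", "low"]
start = time.perf_counter()
y_pred = ["high", "low", "low", "high", "high"]  # stand-in for a classifier's output
elapsed = time.perf_counter() - start
print(precision_recall_f1(y_true, y_pred), f"elapsed={elapsed:.6f}s")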
VI. FUTURE ENHANCEMENTS
This work can be extended to SaaS on the W3C for Web 4.0 architectures. With general stubs and skeletons on both sides, the algorithmic part can be immersed based on extending both the data sets and the migration of approaches. An SOA-based migration of this work is easily adoptive for further processing of the data, such as pre-processing and fast analysis of various models and attributed data. Pushing these kinds of datasets to the cloud and onto the W3C is always sensitive, but for security and analysis purposes the stubs and skeletons maintain security while the data is processed at the mining and classification levels. For pattern discovery this work can be extended from the kNN level without further processing of decision trees, which is adoptive in the current work for various purposes.

REFERENCES
[1] Jiawei, H. (2006). Data Mining: Concepts and Techniques. Morgan Kaufmann.
[2] Quinlan, J. R. (2014). C4.5: Programs for Machine Learning. Elsevier.
[3] Karthikeyan, T., Thangaraju, P. (2013). Analysis of Classification Algorithms Applied to Hepatitis Patients. International Journal of Computer Applications (0975-8887), Vol. 62, No. 15.
[4] Suknovic, M., Delibasic, B., et al. (2012). Reusable components in decision tree induction algorithms. Computational Statistics, Vol. 27, 127-148.
[5] Ruggieri, S. (2002). Efficient C4.5 [classification algorithm]. IEEE Transactions on Knowledge and Data Engineering, Vol. 14, No. 2, 438-444.
[6] Cios, K. J., Liu, N. (1992). A machine learning method for generation of a neural network architecture: A continuous ID3 algorithm. IEEE Transactions on Neural Networks, Vol. 3, No. 3, 280-291.
[7] Gladwin, C. H. (1989). Ethnographic Decision Tree Modeling, Vol. 19. Sage.
[8] Teach, R., Shortliffe, E. (1981). An analysis of physician attitudes regarding computer-based clinical consultation systems. Computers and Biomedical Research, Vol. 14, 542-558.
[9] Turkoglu, I., Arslan, A., Ilkay, E. (2002). An expert system for diagnosis of the heart valve diseases. Expert Systems with Applications, Vol. 23, No. 3, 229-236.
[10] Witten, I. H., Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed. Elsevier.
[11] Herron, P. (2004). Machine Learning for Medical Decision Support: Evaluating Diagnostic Performance of Machine Learning Classification Algorithms. INLS 110, Data Mining.
