
The Adoptive Comparative Approach to Migrate from kNN to Decision Tree on Medical Data


Gayam Suhasini
M.Tech Student
Department of Information Technology
VR Siddhartha Engineering College
Vijayawada, India
suhasinigayam@gmail.com

Vemuri Sindhura
Assistant Professor
Department of Information Technology
VR Siddhartha Engineering College
Vijayawada, India
vemurisindhura2233@gmail.com

Sandeep Y
Assistant Professor
Department of Information Technology
VR Siddhartha Engineering College
Vijayawada, India
Sandeep.yelisetti@gmail.com

Abstract: Analysis of medical data is quite challenging from the perspective of the incremental growth of attributes and various parameters. Once the data is grouped, there is a particular limit to ending up with the data as tuples or attributes in the data sets. The work is on different types of attributes such as cholesterol and BP. In kNN classification the medical attributes are classified into different classes, and all these classes fall under different combinational and conditional ranges with respect to medical complications. kNN finds the nearest neighbours of a new category of values, classifies them and produces the result in the form of classes; this result depends on the value of K and on the Euclidean distance of the new tuples to their neighbours. For the decision tree we have considered a normal and an enhanced decision tree; the main difference is that the enhanced tree decreases the tree hierarchy compared with the normal tree. In the normal decision tree and the enhanced decision tree we consider the same attributes to build the tree and predict the risk factor of heart disease, and the enhanced decision tree produces more accurate results than the normal decision tree. kNN classification is based on attribute ranges for presentation, but in this case exhibiting the desired result is costly in terms of the time complexity of classification, so our work migrates in an adoptive manner to decision trees at low and high levels of exhibiting the desired output. [1][3]

(Keywords: KNN, NAE, decision tree, classification, Euclidean distance)

I. INTRODUCTION

The main aim of this work is to achieve feasible analysis of huge medical data by calculating the frequency of combinational attributes with respect to age, disease and gender. The biggest reason is to push anonymous diseases into research. In large data, spotting the desired disease data is too complicated, so we use kNN with a Euclidean distance mechanism and a decision tree with normal and enhanced models with respect to the given attributes. In kNN the Euclidean distance is generated for all tuples, a rank is provided for the new classification of the new set, and the result is generated based on the k value as the nearest neighbours. The comparison between the two approaches mentioned above is made with respect to the analysis of the time complexity of classification. The main classification is done on huge and dynamic patch data, so it is performed with the kNN approach and with instant decision trees built using the NAE approach (normal and enhanced decision tree structures). The best of the three machine learning algorithms is identified based on the accuracy measurements.
II. PROPOSED WORK

The proposed system, as shown in Fig. 1, consists of three stages. First, raw data is collected from the laboratory information system: diagnosis, procedures, pharmacy, procedure notes and nurse records. The raw data is then filtered, and the attributes required to perform classification are retrieved.
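As an illustration of the filtering stage, the minimal sketch below loads raw patient records and keeps only the attributes used later for classification. The file name patients_raw.csv and the column names (aadhaar_id, age, gender, cholesterol, bp, overweight, smoking, alcohol, chest_pain, risk) are assumptions made for this example, not part of the original system.

import pandas as pd

# Assumed raw export from the laboratory information system.
RAW_FILE = "patients_raw.csv"

# Attributes assumed to be needed by the classifiers described below.
REQUIRED_ATTRIBUTES = [
    "aadhaar_id", "age", "gender", "cholesterol", "bp",
    "overweight", "smoking", "alcohol", "chest_pain", "risk",
]

def load_filtered_records(path=RAW_FILE):
    """Read raw records and retain only the attributes needed for classification."""
    raw = pd.read_csv(path)
    # Keep only the required attributes and drop incomplete rows.
    return raw[REQUIRED_ATTRIBUTES].dropna()

if __name__ == "__main__":
    records = load_filtered_records()
    print(records.head())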
Machine learning algorithms are applied on the collected medical data to predict heart disease. The kNN algorithm, the decision tree and the enhanced decision tree are the techniques used to classify the medical data. In the kNN algorithm, classifiers are generated and then the Euclidean distance is calculated to find the class of the new input medical records; the k value is taken to generate the number of classes related to the new records.
In the decision tree, the data set is read to retrieve the age limit used to build the root node, and cholesterol and overweight ranges are considered to build the further-level nodes. The count of all the records that have a high risk factor of heart disease is displayed along with the Aadhaar IDs of the patients. In the enhanced decision tree, the tuples are read, gender is taken as the root node, and pairs of attributes are considered to build the next-level nodes, such as overweight, cholesterol and chest pain for female patients, while smoking, alcohol and cholesterol are considered as leaf attributes to predict the risk factor of heart disease. The final count of all the patients who have a high risk factor is displayed with respect to their Aadhaar numbers. The best technique among the three is found out based on the accuracy measurements.
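To make the prediction step of the enhanced tree concrete, the sketch below encodes the levels described above as simple rules: gender at the root, overweight, cholesterol and chest pain checked for female patients, and smoking, alcohol and cholesterol checked at the leaves (assumed here to apply to the remaining patients). The thresholds, field names and branch assignment are illustrative assumptions, not the values used in the paper.

def high_risk(patient):
    """Illustrative rule following the enhanced-tree levels described above."""
    if patient["gender"] == "female":
        # Next-level and leaf attributes described for female patients.
        return (patient["overweight"] and patient["cholesterol"] > 240
                and patient["chest_pain"])
    # Leaf attributes, assumed here to apply to the remaining patients.
    return patient["smoking"] and patient["alcohol"] and patient["cholesterol"] > 240

def count_high_risk(patients):
    """Return the count and the Aadhaar numbers of patients predicted at high risk."""
    flagged = [p["aadhaar_id"] for p in patients if high_risk(p)]
    return len(flagged), flagged

# Example usage with toy patient records (Aadhaar values are placeholders).
patients = [
    {"aadhaar_id": "XXXX-1", "gender": "female", "overweight": True,
     "cholesterol": 260, "chest_pain": True, "smoking": False, "alcohol": False},
    {"aadhaar_id": "XXXX-2", "gender": "male", "overweight": False,
     "cholesterol": 250, "chest_pain": False, "smoking": True, "alcohol": True},
]
print(count_high_risk(patients))  # (2, ['XXXX-1', 'XXXX-2'])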

Fig 1: Architecture of proposed system


III. DATA MINING TECHNIQUES

Data mining is used to make intellect and use of information. Knowledge discovery in data is the non-trivial process of identifying usable, unique, theoretically valuable and comprehensible strategies in data. Data mining consists of more than the gathering and handling of data; it also contains exploration and prediction. Data mining includes several algorithms such as decision tree, kNN, Naive Bayesian, support vector machines and the Apriori algorithm. All these algorithms perform well in medical diagnosis, healthcare systems and medicine. In this paper the decision tree, enhanced decision tree and kNN algorithms are used to perform analysis on medical data to predict class objects.

3.1 KNN

Medical data is a dynamically growing source: datasets of information from various hospitals in the form of tuples as patient records. When these data sets are mined, the hidden information in them is a big resource bundle for medical research and presentation. All the data contains different patterns and close sets of relations, which can result in better diagnosis. The main tough task is the discovery and classification of these patterns and relations, which is often exposed in IT. Research in medical diagnosis has moved to finding out heart diseases, lung diseases and various thyroid problems based on the data collected on affected patients. However, there is a limitation of domain-specific systems, which can investigate only the diseases restricted to their relevant operations. In retrospect, the calibre of the k-nearest neighbour (kNN) classifier totally depends on the distance metric used to recognize the k nearest neighbours of the query points. The standard Euclidean distance is commonly used in regular data analysis [10]. The research uses a huge store of information so that a diagnosis based entirely on historical data can be made. The goal is to compute the probability of occurrence of a particular ailment by using a new algorithm. This kNN algorithm vastly increases the accuracy of diagnosis. This approach can be used to upgrade automated diagnosis, which merges the diagnosis of multiple diseases showing similar attributes and symptoms [2]. Taking cholesterol (c) and BP (b) as the attributes, the distance between a new record (subscript n) and a stored patient record (subscript p) is

Distance = \sqrt{(c_n - c_p)^2 + (b_n - b_p)^2}    (1)

Algorithm:

Fig 2: pseudo code for Euclidean Distance

Fig 3: kNN pseudo code
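As a minimal illustration of the kNN classification described above, the sketch below computes the Euclidean distance of equation (1) over the cholesterol and BP attributes and assigns the majority class among the k nearest neighbours. The record layout and the function names are assumptions made for this example, not the paper's implementation.

import math
from collections import Counter

def distance(new_rec, patient_rec):
    """Euclidean distance of equation (1) over cholesterol (c) and BP (b)."""
    return math.sqrt((new_rec["cholesterol"] - patient_rec["cholesterol"]) ** 2
                     + (new_rec["bp"] - patient_rec["bp"]) ** 2)

def knn_predict(training_records, new_record, k=3):
    """Classify new_record by majority vote among its k nearest training records."""
    ranked = sorted(training_records, key=lambda rec: distance(new_record, rec))
    votes = Counter(rec["risk"] for rec in ranked[:k])
    return votes.most_common(1)[0][0]

# Example usage with toy records (values are illustrative only).
training = [
    {"cholesterol": 240, "bp": 150, "risk": "high"},
    {"cholesterol": 180, "bp": 120, "risk": "low"},
    {"cholesterol": 260, "bp": 160, "risk": "high"},
    {"cholesterol": 170, "bp": 110, "risk": "low"},
]
print(knn_predict(training, {"cholesterol": 230, "bp": 145}, k=3))  # -> "high"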

3.2 DECISION TREE

Decision trees are a widely researched solution for classification tasks. For many realistic and practical tasks the tree generated by the algorithms is not comprehensible to the end user, owing to the large data size and the complexity of the relational data with respect to its attributes. Many tree approaches have been worked out to produce simpler and more comprehensible trees with good classification accuracy, but simplification of the tree is usually of secondary concern relative to accuracy, and no practical attempt has been made to survey the literature from the angle of simplification. A framework organizes the approaches to tree simplification, and the approaches are concluded within this framework. The aim of this part of the research is to provide researchers with a fine overview of tree simplification approaches and insight into their combinational capabilities [1][2][3][7].

Our aim is to predict the risk factor of heart disease using the two trees, the decision tree and the enhanced decision tree [9]. To build the decision tree we consider C4.5, which takes numerous data as input. Split information is calculated to find the splitting of the nodes. With S the data set, A an attribute and S_1, ..., S_n the partition of S induced by the values of A, the split information is

SplitInformation(S, A) = - \sum_{i=1}^{n} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}    (2)

Algorithm:

Fig 4: pseudo code for decision tree
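As a minimal sketch of equation (2), assuming the records are Python dicts with a categorical attribute (for example an age_range column derived from the age limit mentioned above), the function below computes the split information that C4.5 uses to penalize attributes splitting the data into many small partitions. The attribute and record names are illustrative assumptions.

import math
from collections import Counter

def split_information(records, attribute):
    """SplitInformation(S, A) = -sum_i (|S_i|/|S|) * log2(|S_i|/|S|), equation (2)."""
    total = len(records)
    counts = Counter(rec[attribute] for rec in records)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Example usage with an assumed categorical attribute.
records = [
    {"age_range": "40-50", "risk": "low"},
    {"age_range": "40-50", "risk": "high"},
    {"age_range": "50-60", "risk": "high"},
    {"age_range": "60+",   "risk": "high"},
]
print(round(split_information(records, "age_range"), 3))  # 1.5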


3.3 ENHANCED DECISION TREE
The enhanced decision tree algorithm finds a locally best solution for each decision node. To break out of local optima and to find the global solution we choose a pair of attributes simultaneously, not one attribute. The enhanced decision tree method, in choosing attributes, considers the information gain of choosing a pair of attributes concurrently instead of choosing only one attribute. Therefore, to improve the possibility of reaching the global optimum solution, considering an optimum pair of attributes is better than a single attribute. The enhanced decision tree considers the pair of attributes to construct the leaf-level nodes and so reduce the number of levels.

S is the data set and A is the set of attributes; the equation below calculates the information gain for a pair of attributes (A_i, A_j) in A. Information gain is used to find the splitting for the pair of attributes or for a single attribute.

InformationGain(S, A_i, A_j) = Entropy(S) - \sum_{x \in values(A_i)} \sum_{u \in values(A_j)} \frac{|S_{x,u}|}{|S|} Entropy(S_{x,u})    (3)

Algorithm:

Fig 5: pseudo code for enhanced decision tree
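A minimal sketch of equation (3), assuming categorical attributes stored in Python dicts with the class label in a risk field: the pair information gain is the entropy of the whole set minus the weighted entropy of the subsets S_{x,u} formed by each combination of values of the two attributes. The function and field names are illustrative assumptions.

import math
from collections import Counter, defaultdict

def entropy(records, target="risk"):
    """Entropy(S) over the class label (assumed here to be the 'risk' field)."""
    total = len(records)
    counts = Counter(rec[target] for rec in records)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def pair_information_gain(records, attr_i, attr_j, target="risk"):
    """InformationGain(S, A_i, A_j) of equation (3): split S by value pairs (x, u)."""
    total = len(records)
    subsets = defaultdict(list)
    for rec in records:
        subsets[(rec[attr_i], rec[attr_j])].append(rec)
    weighted = sum(len(sub) / total * entropy(sub, target) for sub in subsets.values())
    return entropy(records, target) - weighted

# Example usage with toy categorical records.
records = [
    {"gender": "F", "overweight": "yes", "risk": "high"},
    {"gender": "F", "overweight": "no",  "risk": "low"},
    {"gender": "M", "overweight": "yes", "risk": "high"},
    {"gender": "M", "overweight": "no",  "risk": "high"},
]
print(round(pair_information_gain(records, "gender", "overweight"), 3))  # 0.811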

IV. ANALYSIS

4.1 PRECISION
Precision refers to the closeness of two or more measurements to each other. Precision (P) is defined as the number of true positives (Tp) over the number of true positives added to the number of false positives (Fp) [3].

P = \frac{T_p}{T_p + F_p}    (4)

4.2 RECALL
Recall is the measure of the truly relevant results returned. Recall (R) is defined as the number of true positives (Tp) over the number of true positives plus the number of false negatives (Fn) [3].

R = \frac{T_p}{T_p + F_n}    (5)

4.3 F-MEASUREMENT
The F-measure is a measure of a test's accuracy. It considers the precision p and the recall r of the test to compute the score: p is the number of correct positive results divided by the number of all positive results, and r is the number of correct positive results divided by the number of positive results that should have been returned. The F1 score is interpreted as a weighted average of the precision and recall, where the F1 score reaches its best value at 1 and its worst at 0 [3].

F = 2 \cdot \frac{precision \cdot recall}{precision + recall}    (6)

Precision, recall and F-measurement are the accuracy measurements performed on the medical data to find the most accurate algorithm among the decision tree, the enhanced decision tree and kNN.

V. RESULTS AND OBSERVATIONS

Fig 6: Result of kNN


The kNN algorithm predicts the class of the new attributes that are nearest to the input attributes, and the elapsed time is shown.

Fig 7: Normal decision tree

Fig 8: Risk factor


Figs. 7 and 8 show the results of the normal decision tree, which predicts the risk factor of heart disease together with the count of records and the elapsed time.

Fig 9: Enhanced decision Tree

Fig 10: Risk factor


Figs. 9 and 10 show the result of the enhanced decision tree, which predicts the risk factor by considering more than two attributes at the leaf level and produces the count of records that have the risk factor.

Fig 11: comparison graph


The above graph shows the values of precision, recall and elapsed time for the machine-learning-based kNN algorithm, the decision tree and the enhanced decision tree. Considering all of these values, the enhanced decision tree produces more accurate results than the other two techniques on the medical data.
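As a hedged sketch of how such a comparison could be computed, the snippet below derives precision, recall and F-measure (equations (4)-(6)) and the elapsed time for one classifier from its predictions. The classifier output and label values used here are placeholders, not the paper's actual experiment.

import time

def precision_recall_f1(y_true, y_pred, positive="high"):
    """Compute P, R and F of equations (4)-(6) for the positive (high-risk) class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure

# Example usage: placeholder ground truth and classifier predictions.
y_true = ["high", "low", "high", "high", "low"]
start = time.perf_counter()
y_pred = ["high", "low", "low", "high", "high"]  # stand-in for a classifier's output
elapsed = time.perf_counter() - start
print(precision_recall_f1(y_true, y_pred), f"elapsed={elapsed:.6f}s")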
VI. FUTURE ENHANCEMENTS
This work can be extended to SaaS on the W3C for Web 4.0 architectures. With general stubs and skeletons on both sides, the algorithmic part can be immersed based on extending both the data sets and the migration of approaches. An SOA-based migration of this work is easily adoptive for further processing of the data, such as pre-processing and fast analysis of various models and attributed data. Pushing these kinds of datasets to the cloud and onto the W3C is always sensitive, but for security and analysis purposes the stubs and skeletons maintain security while the data is processed at the mining and classification levels. For pattern discovery this work can be extended from the kNN level without further processing of decision trees, which is adoptive in the current work for various purposes.

REFERENCES
[1] Jiawei, H. (2006). Data Mining: Concepts and Techniques. Morgan Kaufmann.
[2] Quinlan, J. R. (2014). C4.5: Programs for Machine Learning. Elsevier.
[3] Karthikeyan, T., Thangaraju, P. (2013). Analysis of Classification Algorithms Applied to Hepatitis Patients. International Journal of Computer Applications (0975-8887), Vol. 62, No. 15.
[4] Suknovic, M., Delibasic, B., et al. (2012). Reusable components in decision tree induction algorithms. Computational Statistics, Vol. 27, 127-148.
[5] Ruggieri, S. (2002). Efficient C4.5 [classification algorithm]. IEEE Transactions on Knowledge and Data Engineering, Vol. 14, No. 2, 438-444.
[6] Cios, K. J., Liu, N. (1992). A machine learning method for generation of a neural network architecture: A continuous ID3 algorithm. IEEE Transactions on Neural Networks, Vol. 3, No. 3, 280-291.
[7] Gladwin, C. H. (1989). Ethnographic Decision Tree Modeling, Vol. 19. Sage.
[8] Teach, R., Shortliffe, E. (1981). An analysis of physician attitudes regarding computer-based clinical consultation systems. Computers and Biomedical Research, Vol. 14, 542-558.
[9] Turkoglu, I., Arslan, A., Ilkay, E. (2002). An expert system for diagnosis of the heart valve diseases. Expert Systems with Applications, Vol. 23, No. 3, 229-236.
[10] Witten, I. H., Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed. Elsevier.
[11] Herron, P. (2004). Machine Learning for Medical Decision Support: Evaluating Diagnostic Performance of Machine Learning Classification Algorithms. INLS 110, Data Mining.
