
Incremental Radius Approach for Classification

Mr. Pratik Aher1, Prof. Purva Raut2


1 Student, IT Department, D. J. Sanghvi College of Engineering, Vile Parle, Mumbai, India
2 Assistant Professor, IT Department, D. J. Sanghvi College of Engineering, Vile Parle, Mumbai, India
Abstract:

Classification is a data mining technique used to categorise the given data and, in turn, to find a model that groups elements with common properties together. The most popular and simple K-Nearest Neighbour (KNN) algorithm classifies samples based on the class of their nearest neighbours. This paper presents a modified version of the KNN algorithm that is used for classification between two groups. The method uses an incremental radius approach: instead of evaluating the Euclidean distance between the point of reference and all the surrounding points, a particular radius is selected and a decision is evaluated. Experiments are performed on the UCI Breast Cancer Wisconsin (Diagnostic) data set to ensure effective and appropriate classification.

Keywords: KNN, Classification, Radius, Weighted Classification.

1. Introduction

Data Mining is more than just storing and analysing data. Classification is one of the vital tasks of data mining; it leads to information generation by finding various patterns and their correlations. Classification helps in discovering hidden data models which are otherwise not visible to the data analyst, and it is an important aspect that helps us in choosing and making decisions between two or more groups. The KNN algorithm is an effective way to implement classification, but it also has a few disadvantages:

1. In most cases K, the number of nearest neighbours, needs to be specified. This is very difficult for a data set where one cannot simply guess how many of the data points will fit one cluster.

2. It is a brute-force approach and takes all the values into consideration. In real-world data sets there can be data points which are of no significance or which are wrongly recorded.

3. In cases where the region of evaluation is specified by giving a specific radius, it still checks all points, which is not required.

4. For a large number of points, the Euclidean distance to all points has to be evaluated, even though a decision can be made by analysing just a certain number of points.

In this paper, a modified version of the KNN algorithm is proposed for classification. The method uses an incremental radius approach: instead of evaluating the Euclidean distance between the point of reference and all the surrounding points, a particular radius is selected and a decision is evaluated; after that, the radius is incremented and the decision is evaluated again.

This paper is sectioned as follows. Section 2 discusses existing classification techniques. Section 3 focuses on the proposed method and details the algorithm. Section 4 presents the experimental results and findings, and Section 5 concludes the work.

2. Related Work / KNN and its variations

In [2] the traditional KNN algorithm is explained. It follows these steps in order to classify the data set into different classes (a code sketch is given after the list):

1. Find the distance between the point to be classified and all of the points in the dataset.

2. Rank the points in increasing order of distance from the point to be classified and store them in a list.

3. From this list, take the majority vote on the class labels and evaluate the result.
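As a minimal illustration, the three steps above can be sketched in Python as follows (this is our own sketch, not code taken from [2]; two-dimensional points and string class labels are assumed):

import math
from collections import Counter

def knn_classify(query, dataset, k):
    # `dataset` is a list of ((x, y), label) pairs; `query` is an (x, y) point.
    # Step 1: distance from the query point to every point in the dataset.
    distances = [(math.dist(query, point), label) for point, label in dataset]
    # Step 2: rank the points in increasing order of distance.
    distances.sort(key=lambda pair: pair[0])
    # Step 3: majority vote over the class labels of the k nearest points.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

For the coloured-point example that follows, knn_classify(blue_point, points, k=3) would return 'r' whenever the three nearest neighbours are red.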
To illustrate how the data is classified, take the example of coloured points: consider a blue point that is to be classified as either a black point or a red point using the KNN algorithm.

Step 1: Plot all the data points on a graph.

Fig 1: Plotted data points

Step 2: Calculate the distance between that point and all points in the dataset.

Fig 2: Calculating distances from all data points

Step 3: Store the distances in increasing order in a list of their class labels. Here, we can observe the distances approximately as [r,r,r,b,b,b].

Step 4: Let us consider the value of k to be 3. We see that the majority vote of the first three points in the list is r, so we classify the point as belonging to the r group.

Fig 3: After classification

2.1 Types of KNN classifiers:

A. Density based classifier [3]:

In [3] the Structural Density of data points is discussed. In some cases calculating only the neighbors is not efficient. Structural Density is the concentration of data points around the volume of a point's neighborhood. Density is evaluated over a particular region, and the density of the entire dataset is also evaluated. We then search for a value of 'r' such that the mean of the individual densities is equal to the average density calculated earlier.

B. Weighted KNN classifier [3]:

In this type of classifier, each feature is given a weight based on how useful it is in finding out the class of the dataset. To accomplish this, the authors use the concept of Index of Discernibility. They assume a hypersphere around every element of the dataset having a fixed radius, which is the average distance between that element and the rest of the elements of its class. The elements of the same class which fall inside the hypersphere are then examined and counted, and the discernibility of that element is calculated by dividing the number of these elements belonging to the class by the total number of elements in the dataset. The Index of Discernibility of the whole dataset is calculated as the number of elements having discernibility higher than 0.5 divided by the total number of elements. Its process flow is as follows (a sketch of the discernibility computation is given after the list):

1. Each of the features of the dataset is evaluated using IDs.
2. Weights are obtained by normalizing the IDs.
3. The weights are applied to the dataset.
4. The KNN classifier is applied on the weighted dataset.
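As a rough sketch of the per-element discernibility described above (this is our own reading of [3]; the function names and the use of NumPy arrays are our assumptions):

import numpy as np

def discernibility(i, X, y):
    # Radius: average distance from element i to the other elements of its class.
    point, label = X[i], y[i]
    same = X[y == label]
    d = np.linalg.norm(same - point, axis=1)
    radius = d[d > 0].mean()                      # skip the element itself
    # Count same-class elements inside the hypersphere, excluding i itself,
    # and divide by the total number of elements in the dataset.
    inside = np.linalg.norm(X - point, axis=1) <= radius
    inside_same = np.sum(inside & (y == label)) - 1
    return inside_same / len(X)

def dataset_discernibility(X, y):
    # Index of Discernibility of the whole dataset:
    # fraction of elements whose individual discernibility exceeds 0.5.
    ids = np.array([discernibility(i, X, y) for i in range(len(X))])
    return float(np.mean(ids > 0.5))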


C. Variable knn [3]:

The value of k, the number of neighbors to be considered for classification, greatly influences the result. So in this method, the optimum value of k is calculated with the help of the Degree of Certainty (DC): the value of K that maximizes DC for the classification of various neighborhoods is selected. After calculating the best possible k, an unknown feature is classified by finding its nearest neighbor and assuming the k value from that optimum; the KNN classifier is then applied using that k value.

D. Class based knn [3]:

Class based knn is used in cases where there are too few elements in the dataset to classify a particular element. While considering every element, the value of k is selected by the classifier so as to maximize the degree of certainty, and these k neighbors are considered for classification. Finally, the class with the lowest value of harmonic distances from that element to all k elements is selected for classification.

2.2 Impact of K On Classification:

Let us see how the value of k influences the outcome of the algorithm [5]. The following are the different boundaries separating the two classes for different values of K: the boundaries become smoother with an increasing value of k, and as k approaches infinity the boundary becomes smoother still. The training error rate and the validation error rate are two parameters we need to assess for different values of K. The following is the curve for the training error rate with varying values of K [5]:

Fig: Training error rate with varying K

As you can see, the error rate at K=1 is always zero for the training sample. This is because the closest point to any training data point is itself, so the prediction is always accurate with K=1. If the validation error curve had been similar, our choice of K would have been 1. The following is the validation error curve with varying values of K [5]:

Fig 4: Validation error rate with varying K

At K=1, we were overfitting the boundaries. Hence, the error rate initially decreases and reaches a minimum; after the minimum point, it then increases with increasing K. To get the optimal value of K, you can segregate training and validation sets from the initial dataset and plot the validation error curve to get the optimal value of K. This value of K should be used for all predictions. A sketch of this procedure is given below.
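A minimal sketch of that procedure, using scikit-learn as one possible toolkit (the sources do not prescribe a library):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
# Segregate training and validation sets from the initial dataset.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

errors = {}
for k in range(1, 26):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    errors[k] = 1.0 - model.score(X_val, y_val)   # validation error rate at this K

best_k = min(errors, key=errors.get)   # K at the minimum of the validation curve
print("optimal K:", best_k)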
3. Proposed Method

In this method, we evaluate the distance from the origin to the reference point, and that distance is used as the base for the incremental radius approach. This method has been applied to a real-world repository and its accuracy is found to be comparable with the traditional approach. It is most effective when there are a large number of points in the data set, which is the complete opposite of the traditional approach.

SIGNIFICANCE OF RADIUS:

The radius is important as it gives us the range of operation: if we can find our desired result with equal accuracy, it is computationally and logically preferable to consider only a small portion of the dataset. Moreover, searching over a smaller radius also eliminates the possibility of outliers influencing the decision.

ALGORITHM:

The steps are as follows; a code sketch is given after the list.

1. Find the distance between the point to be classified and the origin (0,0) as origin_dist.

2. Create an array which stores what percentage of the total distance is to be considered for evaluation (usually 10%, 20%, 50%, as a=[0.1,0.2,0.5]).

3. Calculate the radius (r) to be considered for evaluation as the product of the array element and the total distance (r=a[i]*origin_dist).

4. Calculate the Euclidean distance from the point to be classified to every point inside the radius and store it in an array.

5. Sort the array obtained, find the most common element, and store it in another array called vote_result.

6. Repeat steps 3 to 5 for the given values of a.

7. Finally, find the most common value in the array vote_result and obtain the result.
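The steps above can be sketched as follows. This is our own illustrative reading, not the authors' released code; in particular, we read "most common element" in step 5 as the majority class label among the points that fall inside the radius:

import math
from collections import Counter

def incremental_radius_classify(query, dataset, a=(0.1, 0.2, 0.5)):
    # `dataset` is a list of (point, label) pairs; `query` is a coordinate tuple.
    # Step 1: distance from the origin to the point to be classified.
    origin = (0.0,) * len(query)        # the origin (0,0), generalized to d dimensions
    origin_dist = math.dist(origin, query)
    vote_result = []
    for fraction in a:                  # Steps 2 and 6: percentages of the total distance
        r = fraction * origin_dist      # Step 3: radius for this pass
        # Step 4: class labels of every point that lies inside the radius.
        inside = [label for point, label in dataset
                  if math.dist(query, point) <= r]
        if inside:
            # Step 5: most common class within this radius.
            vote_result.append(Counter(inside).most_common(1)[0][0])
    # Step 7: most common decision across all radii.
    return Counter(vote_result).most_common(1)[0][0] if vote_result else None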
PROCESS:

We consider a set of scattered data points divided into two distinct classes based on colour. We introduce a new data point which is to be classified as belonging to either one of the two classes.

BEFORE CLASSIFICATION:

Fig: The purple point is to be classified into one of the two groups, red or blue

The process during classification is shown here:

Fig: The algorithm applied to the data points for classification

The algorithm is successfully implemented and the result is evaluated. The result obtained from the algorithm after classification is shown here:

AFTER CLASSIFICATION:

Fig: Final result
4. RESULTS:

DATA PREPARATION:

We have taken the Breast Cancer Wisconsin (Diagnostic) Data Set for testing our algorithm. We have removed the 'id' column from the dataset as it is not required and does not influence the result.

We divide the dataset into two parts of 70% and 30%, where 70% of the data is used for training and 30% is used for testing. Each of the testing and training sets contains two groups based on the value of the features in the dataset. We then find the result by comparing the value obtained from the respective algorithm with the value in the test set. Accuracy is then calculated by dividing the number of correct values obtained by the total number of values tested. A sketch of this set-up is given at the end of this section.

DATASET: UCI Breast Cancer Wisconsin (Diagnostic) Data Set.

Both algorithms, traditional KNN and the modified KNN, were operated on the same data set and the results were found as follows:

ACCURACY:

When the result was evaluated on the dataset using traditional KNN, accuracy was found to be:

Accuracy: 95.172%.

When the result was evaluated on the dataset using the proposed method, accuracy was found to be:

Accuracy: 94.866%.

COMPUTATION TIME:

When the result was evaluated on the dataset using the traditional KNN method, computation time was found to be:

TIME: 8.127048446000003 seconds.

When the result was evaluated on the dataset using the proposed method, computation time was found to be:

TIME: 7.972885221999945 seconds.

ACCURACY VS K IN TRADITIONAL METHOD:

The value of accuracy for different values of k is calculated and plotted on a simple line graph for the standard KNN algorithm. We can observe here that as we increase the value of k, the accuracy goes on decreasing.

ACCURACY VS RADIUS IN PROPOSED METHOD:

The value of accuracy is calculated for different values of the radius, and we can observe that accuracy goes on increasing as we increase the value of the radius for the proposed method.
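For reference, the evaluation set-up described under DATA PREPARATION can be sketched as follows, reusing the incremental_radius_classify sketch from Section 3 (scikit-learn's load_breast_cancer is a copy of the same UCI Diagnostic data set and already omits the 'id' column; the 70/30 split and the accuracy formula follow the text):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
# 70% of the data for training, 30% for testing, as described above.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

train_set = [(tuple(row), label) for row, label in zip(X_train, y_train)]

# Accuracy = number of correct predictions / total number of values tested.
correct = sum(incremental_radius_classify(tuple(row), train_set) == label
              for row, label in zip(X_test, y_test))
print("Accuracy: %.3f%%" % (100.0 * correct / len(y_test)))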
5. CONCLUSIONS AND DISCUSSION:

In this paper, we have introduced a variation of the traditional KNN algorithm which involves classification with an incrementing radius. We observed that the traditional algorithm was too dependent on k, the number of neighbours to be considered; through our approach, we eliminated the need for choosing k. We also concluded through our testing on a real-world dataset that we can acquire comparable accuracy with our approach, and we achieved better computation time through the incremental radius approach. We observed that in the case of large datasets the traditional KNN algorithm is not feasible while the proposed method works efficiently: as the volume of data grows, the traditional approach puts a heavy load on the system because it has to evaluate the result based on all the points in the dataset.
6. REFERENCES:
[1] M. Akhil Jabbar, B. L. Deekshatulu, Priti Chandra. Classification of Heart Disease Using K-Nearest Neighbor and Genetic Algorithm. International Conference on Computational Intelligence: Modeling Techniques and Applications (CIMTA), pp. 1-3, 2013.

[2] Natasha Latysheva. Implementing Your Own k-Nearest Neighbor Algorithm Using Python. Retrieved from: http://www.kdnuggets.com/2016/01/implementing-your-own-knn-using-python.html (2016).

[3] Zacharias Voulgaris, George D. Magoulas. Extensions of the k Nearest Neighbour Methods for Classification Problems. Proc. of the 26th IASTED International Conference on Artificial Intelligence and Applications, ACTA, pp. 3-4 (2008).

[4] StatSoft, Inc. (2013). Electronic Statistics Textbook. Tulsa, OK: StatSoft. WEB: http://www.statsoft.com/Textbook/k-Nearest-Neighbors.

[5] Tavish Srivastava. Introduction to k-nearest neighbors: Simplified. Retrieved from: https://www.analyticsvidhya.com/blog/2014/10/introduction-k-neighbours-algorithm-clustering/ (2014).

[6] N. Suguna, Dr. K. Thanushkodi. An Improved k-Nearest Neighbor Classification Using Genetic Algorithm. IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 4, No. 2, July 2010.
