
Research of Intrusion Detection Based on an Improved K-means Algorithm

Shenghui Wang
Information Technology Center
China Nuclear Power Technology Research Institute
Shenzhen, China
E-mail: wangshenghui@cgnpc.com.cn
Abstract: Traditional machine learning methods for intrusion
detection can only detect known attacks, since they classify
data based on what they have learned. New attacks are unknown
and are difficult to detect because they have not been
learned. In this paper, we present an improved k-means
clustering-based intrusion detection method, which trains on
unlabeled data in order to detect new attacks. Experiments on
the KDD Cup 1999 data set show an improvement in detection
rate, a decrease in false positive rate, and the ability to
detect unknown intrusions.
Keywords: intrusion detection, k-means, clustering,
unsupervised anomaly detection
I. INTRODUCTION
With the rapid development of network technology and
the quick expansion of network applications, traditional
static defenses such as firewalls and access control can
hardly satisfy the needs of network security. As a proactive
information security measure, intrusion detection can
effectively cover the shortcomings of traditional security
measures. In recent years, various data mining methods have
been applied to intrusion detection [1]. There are two major
paradigms for training data mining-based intrusion detection
systems: misuse detection and anomaly detection. Misuse
detection requires training on labeled data; the training
process is time-consuming and costly, and wrong labels lead
to a lower detection rate. Anomaly detection approaches can
detect new types of attacks, and unsupervised anomaly
detection can train on unlabeled data and detect unknown
attacks [2] [3].
Clustering, as an unsupervised learning method, can be
applied to intrusion detection to classify unlabeled data and
detect new intrusions. K-means is a typical clustering
algorithm. It partitions a data set into k clusters through
the following steps [4].
Step1 (Initialization): Randomly choose k instances
from the data set and make them initial cluster centers of the
clustering space.
Step2 (Assignment): Assign each instance to its closest
center.
Step3 (Updating): Replace each center with the mean of
its members.
Step4 (Iteration): Repeat Steps 2 and 3 until there is no
more updating.
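Steps 1 to 4 can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the experiments; the function names, the `seed` and `max_iter` parameters, and the choice to leave an empty cluster's center unchanged are assumptions made for the example.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(data, k, seed=0, max_iter=100):
    """Plain k-means following Steps 1-4; returns (centers, assignments)."""
    rng = random.Random(seed)
    # Step 1: pick k distinct instances as the initial centers.
    centers = [list(p) for p in rng.sample(data, k)]
    assignments = None
    for _ in range(max_iter):
        # Step 2: assign each instance to its closest center.
        new_assignments = [
            min(range(k), key=lambda c: euclidean(p, centers[c])) for p in data
        ]
        # Step 4: stop once the assignments no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: replace each center with the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(data, assignments) if a == c]
            if members:  # assumption: an empty cluster keeps its old center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assignments
```

For example, `kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)` groups the first two points together and the last two together. Because Step 1 is random, a different seed can give a different result on less well-separated data.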
II. METHODOLOGY
In this section, we first discuss the improved k-means
algorithm and then describe the model of intrusion
detection based on clustering.
A. Improved k-means Algorithm
K-means has two shortcomings: it is sensitive to the
initial centers, and it depends on the number of clusters k.
Fig. 1 shows the clustering effect under different initial
centers.
Figure 1. Clustering diagram under different initial centers.
To address the first shortcoming of k-means, we propose
an approach for choosing the initial centers. The basic idea
is to choose initial centers that are as dispersed as
possible, so as to avoid bad clusterings. The specific steps
of the algorithm are described below.
Step1: Get the center member c of the whole data set D;
Step2: Choose the first center member $m_1$, which is the
member farthest from c, that is,
$d(m_1, c) = \max_{x \in D} d(x, c)$;
Step3: Choose the second center member $m_2$, which is
the member farthest from $m_1$, that is,
$d(m_1, m_2) = \max_{x \in D} d(m_1, x)$;
Step4: Choose $m_i$, which satisfies the formula
$\sum_{j=1}^{i-1} d(m_j, m_i) = \max_{x \in D} \sum_{j=1}^{i-1} d(m_j, x)$.
If the distance between $m_i$ and any of
$m_1, m_2, \ldots, m_{i-1}$ is less than the given parameter
r, then give up this member and compute the sum of distances
between the selected center members and each remaining
member of data set D. Finally, we choose the member with the
maximum sum of distances as $m_i$;
2011 Second International Conference on Innovations in Bio-inspired Computing and Applications
978-0-7695-4606-3/11 $26.00 2011 IEEE
DOI 10.1109/IBICA.2011.72
Step5: When i = k, the iteration ends.
The improved k-means algorithm therefore changes only the
first (initialization) step of the k-means algorithm.
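The initialization steps above can be sketched as follows. This is one possible reading rather than the author's code: the center member c is assumed to be the mean of D, and the rejection rule with parameter r is implemented by filtering out every candidate that lies closer than r to an already chosen center. The function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def choose_initial_centers(data, k, r):
    """Pick k spread-out initial centers from data (a list of numeric tuples)."""
    # Step 1: the center c of the whole data set (assumed here to be the mean).
    c = [sum(col) / len(data) for col in zip(*data)]
    # Step 2: the first center is the member farthest from c.
    centers = [max(data, key=lambda x: euclidean(x, c))]
    # Steps 3-5: repeatedly add the member with the largest summed distance
    # to the centers chosen so far, skipping candidates closer than r to any
    # chosen center. With one center chosen this is "farthest from m_1".
    while len(centers) < k:
        candidates = [
            x for x in data
            if all(euclidean(x, m) >= r for m in centers)
        ]
        if not candidates:
            raise ValueError("no candidate is at least r away from every center")
        centers.append(
            max(candidates, key=lambda x: sum(euclidean(x, m) for m in centers))
        )
    return centers
```

The returned members can replace the random choice in the first step of k-means; the assignment and updating steps stay unchanged.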
B. Model of intrusion detection based-on clustering
In an actual network environment, normal connections far
outnumber abnormal connections. Intrusion detection based on
clustering rests on two assumptions. The first assumption is
that the number of normal connections vastly exceeds the
number of attacks. The second assumption is that the
intrusions themselves are qualitatively different from the
normal connections. The basic idea is that since the
intrusions are both rare and different from normal traffic,
they appear as outliers in the data and can therefore be
detected [5]. The basic process of intrusion detection based
on clustering consists of data selection and filtering, data
preprocessing, clustering, cluster labeling, and data
detection. The model of intrusion detection based on
clustering is shown in Fig. 2.
Figure 2. The model of intrusion detection based-on clustering
1) Data Selection and Filtering
The data set used is the KDD Cup 1999 Data [6], which
includes a wide variety of intrusions simulated in a network
environment. The attacks in the data set are divided into
four groups: Denial of Service, Remote to Local, User to
Root, and Probing. Each connection record in the data set has
41 features. During selection and filtering, we need to
ensure that normal data in the training set indeed far
outnumber attacks.
For the experiment, we chose 20,500 records, comprising
20,000 normal records and 500 attack records, from the
kddcup.data_10_percent data set. Normal data account for
97.56% and attacks for 2.44%, which satisfies the
assumption.
2) Data Preprocessing
In the KDD Cup 1999 data set, different features are on
different scales. To reduce the effect of this problem, we
use range normalization, defined by (1) and (2). Here
$r_{ij}$ denotes the value of feature j in record i, and n is
the number of records.
$r_j = \max(r_{ij}) - \min(r_{ij}), \quad 1 \le i \le n$    (1)
$r'_{ij} = (r_{ij} - \min(r_{ij})) / r_j, \quad i = 1, 2, \ldots, n$    (2)
Range normalization guarantees that each numeric feature
value lies between 0 and 1 and eliminates the effect of
different scales among different features.
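A minimal sketch of range normalization as given in (1) and (2), assuming all features are numeric and the records are stored as rows of a list. Mapping a constant column (range 0) to 0.0 is an assumption of this example; the paper does not say how such columns are handled.

```python
def range_normalize(rows):
    """Scale every numeric column of rows (equal-length lists) to [0, 1]."""
    columns = list(zip(*rows))
    mins = [min(col) for col in columns]
    # r_j in (1): the range of feature j over all records.
    ranges = [max(col) - lo for col, lo in zip(columns, mins)]
    normalized = [
        [
            # r'_ij in (2); a constant column is mapped to 0.0 (assumption).
            (value - lo) / rng if rng != 0 else 0.0
            for value, lo, rng in zip(row, mins, ranges)
        ]
        for row in rows
    ]
    return normalized, mins, ranges
```

Returning `mins` and `ranges` alongside the normalized rows makes it possible to apply the same scaling to other records later.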
3) Clustering
A cluster is represented by the mean of its members, so
the distance between a data object and a cluster is the
distance between the object and the mean of the cluster's
members. In this paper, the distance is the Euclidean
distance, shown in (3).
$d(r_i, r_j) = \left( \sum_{k=1}^{m} |r_{ik} - r_{jk}|^2 \right)^{1/2}$    (3)
After the number of clusters k is given and the initial
centers are found, we compute the distance between each data
object and each cluster and put the object into the nearest
cluster, repeating until no data are left in the training
data set.
4) Labeling Clusters
After clustering, we need to label the clusters to decide
which are normal and which are abnormal. Unsupervised
anomaly detection assumes that the number of normal
connections vastly outnumbers the number of intrusions. When
labeling clusters, we use a threshold parameter that
represents the percentage of anomalies in the data set. If
the number of members of a cluster is less than the
threshold times N, the cluster is labeled anomalous;
otherwise it is labeled normal. The process ends when all
clusters have been labeled.
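The labeling rule can be sketched as below. The names `label_clusters` and `threshold` are hypothetical, and N is assumed here to be the total number of training records; the threshold is passed as a fraction, so 0.5% becomes 0.005.

```python
from collections import Counter


def label_clusters(assignments, threshold):
    """Map each cluster id to 'anomaly' or 'normal' by its share of the data.

    assignments: cluster id of every training record.
    threshold: assumed fraction of anomalies, e.g. 0.005 for 0.5%.
    """
    n = len(assignments)  # N, assumed to be the number of training records
    sizes = Counter(assignments)
    return {
        cluster: "anomaly" if size < threshold * n else "normal"
        for cluster, size in sizes.items()
    }
```

With 1,000 records and a threshold of 0.005, any cluster holding fewer than 5 records is labeled as an anomaly.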
5) Data Detection
After the clusters are labeled, the model built from the
training data set is complete. Next, we use test data to test
the model. Before detection, every test record must be
preprocessed. We then compute the distance between each test
record and each cluster; the record belongs to the cluster it
is nearest to. If that cluster is labeled normal, the test
record is a normal connection. If the cluster is labeled
abnormal, the record is attack data.
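A minimal sketch of the detection step, assuming the cluster centers and their labels are held in two parallel lists and that the test record has already been preprocessed. The function name and the data layout are assumptions of this example.

```python
import math


def detect(record, centers, labels):
    """Return the label of the cluster whose center is nearest to record.

    centers: list of cluster means; labels: one 'normal'/'anomaly' string
    per center. record must be preprocessed before it is passed in.
    """
    def distance(a, b):
        # Euclidean distance between two equal-length numeric sequences.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    nearest = min(range(len(centers)), key=lambda c: distance(record, centers[c]))
    return labels[nearest]
```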
III. EVALUATION
A. Performance measures
To evaluate the performance of the intrusion detection
system, we use two major indicators: Detection Rate (DR) and
False Positive Rate (FPR). The detection rate is the number
of attacks detected by the system divided by the number of
attacks in the data set. The false positive rate is the
number of normal connections misclassified as attacks
divided by the number of normal connections in the data set.
In addition, the ROC (Receiver Operating Characteristic)
curve [7] can be used to show the relation between detection
rate and false positive rate under different values of the
labeling threshold.
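The two indicators follow directly from their definitions. The sketch below assumes the ground truth and the predictions are given as sequences of booleans with True meaning "attack", and that the data contain at least one attack and one normal connection; the function name is hypothetical.

```python
def detection_metrics(y_true, y_pred):
    """Return (detection rate, false positive rate).

    y_true, y_pred: equal-length sequences of booleans, True meaning attack.
    Assumes at least one attack and one normal connection are present.
    """
    attacks = sum(1 for t in y_true if t)
    normals = len(y_true) - attacks
    # Attacks that the system flagged as attacks.
    detected = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    # Normal connections that the system flagged as attacks.
    false_alarms = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    return detected / attacks, false_alarms / normals
```

Evaluating this pair once per threshold value gives the points of an ROC plot.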
B. Results
During the experiment, we found that the clustering is
better when k equals eleven, so we chose k = 11. After
clustering, we labeled the clusters using different values
of the labeling threshold: 0.2%, 0.4%, 0.5%, 1%, and 2%.
The ROC in Fig. 3 shows that the improved k-means
algorithm performs better than the k-means algorithm under
the different threshold values. In addition, both algorithms
achieve a good balance between detection rate and false
positive rate when the threshold equals 0.5%, so in the
following experiments we set the threshold to 0.5%. We chose
12 groups of test data, of which 6 groups include known
attacks and the other 6 include unknown attacks, to test the
k-means algorithm and the improved k-means algorithm. The
results are shown in Fig. 4, Fig. 5, Fig. 6, and Fig. 7.
Figure 6. Detection rate comparison on unknown attack diagram
Figure 3. ROC
From the comparison of detection rate and false positive
rate between the improved k-means algorithm and the k-means
algorithm in Fig. 4 and Fig. 5, we can conclude that
intrusion detection based on improved k-means performs
better. From the detection rate and false positive rate in
Fig. 6 and Fig. 7, even though the detection rate is not high
and the false positive rate is somewhat high, we can still
conclude that intrusion detection based on clustering is
indeed able to detect unknown attacks.
Figure 7. False positive rate comparison on unknown attack diagram
IV. CONCLUSIONS
In this paper, we proposed an improved k-means
algorithm that overcomes the sensitivity of k-means to
initial centers. We applied both k-means and the improved
algorithm to an intrusion detection model and compared their
performance. The experiments showed that the intrusion
detection method based on the improved k-means algorithm has
a higher detection rate and a lower false positive rate.
Moreover, as unsupervised anomaly detection methods, both
can detect unknown or new attacks.
References
[1] Lee W, Stolfo S J. Data Mining Approaches for Intrusion
Detection[C], In Proceedings of the 7th USENIX Security
Symposium, San Antonio, 1998, pp. 79-94.
Figure 4. Detection rate comparison on known attack diagram
[2] Hyun Oh Sang, Suk lee Won. Anomaly Intrusion Detection Based on
Dynamic Clustering Update[J], Advances in Knowledge Discovery
and Data Mining, 2007, 42(6), pp. 737-744.
[3] Xiangyang Li. Clustering and Classification Algorithm for Computer
Intrusion Detection[D], Arizona State University, 2001.
[4] Yu Guan, Ali A. Ghorbani, Nabil Belacel. Y-means: A Clustering
Method for Intrusion Detection[C], Canadian Conference on
Electrical and Computer Engineering, 2003.
[5] Portnoy L, Eskin E and Stolfo S. Intrusion Detection with Unlabeled
Data using Clustering[C], INFOCOM 2001 Twentieth Annual Joint
Conference of the IEEE Computer and Communications Societies,
2001, pp. 878-886.
[6] KDD Cup 1999 Data[EB/OL].
http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html,
1999.
Figure 5. False positive rate comparison on known attack diagram
[7] Hanley J A, McNeil B J. The Meaning and Use of the Area under a
Receiver Operating Characteristic (ROC) Curve[J], Radiology, 1982,
143(1), pp. 29-36.
