Shenghui Wang
Information Technology Center
China Nuclear Power Technology Research Institute
Shenzhen, China
E-mail: wangshenghui@cgnpc.com.cn
Abstract- Traditional machine learning methods for intrusion
detection can detect only known attacks, since these methods
classify data according to what they have learned; new,
unknown attacks are difficult to detect because they were
never seen during training. In this paper, we present an
improved k-means clustering-based intrusion detection
method, which trains on unlabeled data in order to detect new
attacks. Experiments on the KDD Cup 1999 data set show an
improvement in detection rate, a decrease in false positive
rate, and the ability to detect unknown intrusions.
Keywords- intrusion detection, k-means, clustering,
unsupervised anomaly detection
I. INTRODUCTION
With the rapid development of network technology and
the quick expansion of network applications, traditional
static defenses such as firewalls and access control find it
very difficult to satisfy the needs of network security. As a
proactive information security measure, intrusion detection
can effectively cover the shortcomings of traditional security
measures. In recent years, various data mining methods have
been applied to intrusion detection [1]. There are two major
paradigms for training data mining-based intrusion detection
systems: misuse detection and anomaly detection. Misuse
detection requires training over labeled data; the labeling
process is time-consuming and costly, and mislabeled data
lead to a decrease in detection rate. Anomaly detection
approaches can detect new types of attacks, and unsupervised
anomaly detection can train on unlabeled data and detect
unknown attacks [2][3].
Clustering, as an unsupervised learning technique, can be
applied to intrusion detection, classifying unlabeled data and
detecting new intrusions. K-means is a typical clustering
algorithm. It partitions a set of data into k clusters through
the following steps [4].
Step1 (Initialization): Randomly choose k instances
from the data set and make them initial cluster centers of the
clustering space.
Step2 (Assignment): Assign each instance to its closest
center.
Step3 (Updating): Replace each center with the mean of
its members.
Step4 (Iteration): Repeat Steps 2 and 3 until there is no
more updating.
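As an illustration (not code from the paper), the four steps above can be sketched in Python with NumPy; the function name, parameters, and stopping criterion are our own choices:

```python
import numpy as np

def kmeans(data, k, max_iter=100, seed=0):
    """Standard k-means following the four steps above."""
    rng = np.random.default_rng(seed)
    # Step 1 (Initialization): randomly choose k instances as initial centers.
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2 (Assignment): assign each instance to its closest center.
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3 (Updating): replace each center with the mean of its members.
        new_centers = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 4 (Iteration): repeat until there is no more updating.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

On well-separated data the loop typically converges in a few iterations, though the result still depends on the random initialization, which motivates the improvement in Section II.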
II. METHODOLOGY
In this section we first discuss the improved k-means
algorithm and then describe the model of intrusion detection
based on clustering.
A. Improved k-means Algorithm
K-means has two shortcomings: it is sensitive to the
choice of initial centers, and it depends on the number of
clusters k. Fig. 1 shows the clustering effect under different
initial centers.
Figure 1. Clustering diagram under different initial centers.
To address the first shortcoming of k-means, we propose
an approach for choosing the initial centers. The basic idea is
to choose initial centers that are as spread out as possible, to
avoid a bad clustering. The specific steps of the algorithm are
described below.
Step1: Compute the center c of the whole data set D;
Step2: Choose the first center m_1 as the member farthest
from c, that is, d(m_1, c) = max_{x in D} d(x, c);
Step3: Choose the second center m_2 as the member
farthest from m_1, that is, d(m_1, m_2) = max_{x in D} d(m_1, x);
Step4: For each remaining candidate, if its distance to any
selected center m_i is less than the given parameter r, then
give up this member; otherwise compute the sum of distances
between the candidate and the selected centers. Choose the
candidate with the maximum sum of distances as the next
center;
2011 Second International Conference on Innovations in Bio-inspired Computing and Applications
978-0-7695-4606-3/11 $26.00 2011 IEEE
DOI 10.1109/IBICA.2011.72
Step5: When i = k, the iteration is over.
The improved k-means algorithm thus modifies only the
first step (initialization) of the standard k-means algorithm.
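The initialization steps above can be sketched as follows; this is our reading of Steps 1-5, and the function name, the tie-breaking behavior, and the handling of excluded candidates are assumptions:

```python
import numpy as np

def choose_initial_centers(data, k, r):
    """Decentralized initial centers per Steps 1-5; r is the minimum
    allowed distance between a candidate and the chosen centers."""
    # Step 1: center c of the whole data set D.
    c = data.mean(axis=0)
    # Step 2: first center m_1 is the member farthest from c.
    centers = [data[np.linalg.norm(data - c, axis=1).argmax()]]
    # Step 3: second center m_2 is the member farthest from m_1.
    centers.append(data[np.linalg.norm(data - centers[0], axis=1).argmax()])
    # Step 4: pick remaining centers until Step 5's condition i == k holds.
    while len(centers) < k:
        sel = np.array(centers)
        # distance from every candidate to each selected center
        d = np.linalg.norm(data[:, None, :] - sel[None, :, :], axis=2)
        sums = d.sum(axis=1)
        # give up candidates closer than r to any selected center
        sums[d.min(axis=1) < r] = -np.inf
        # next center: candidate with the maximum sum of distances
        centers.append(data[sums.argmax()])
    return np.array(centers)
```

The resulting centers can then seed the standard assignment/update loop of k-means in place of the random initialization.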
B. Model of Intrusion Detection Based on Clustering
In an actual network environment, normal connections
overwhelmingly outnumber abnormal ones. Intrusion
detection based on clustering rests on two assumptions. The
first is that the number of normal connections vastly exceeds
the number of attacks. The second is that the intrusions
themselves are qualitatively different from normal
connections. The basic idea is that, since intrusions are both
different from normal traffic and rare, they appear as outliers
in the data and can be detected [5]. The basic process of
intrusion detection based on clustering consists of data
selection and filtering, data preprocessing, clustering,
labeling clusters, and detection. The model is shown in
Fig. 2.
Figure 2. The model of intrusion detection based on clustering
1) Data Selection and Filtering
The dataset used was the KDD Cup 1999 data [6], which
includes a wide variety of intrusions simulated in a network
environment. The attacks in the dataset are divided into four
groups: Denial of Service, Remote to Local, User to Root,
and Probing. Each connection record in the dataset has 41
features. During selection and filtering, we need to ensure
that the percentage of normal data in the training set is
indeed much larger than that of attacks.
For the experiment, we chose 20,500 records, comprising
20,000 normal records and 500 attack records, from the
kddcup.data_10_percent data set. Normal data thus account
for 97.56% and attacks for 2.44%, which satisfies the first
assumption.
2) Data Preprocessing
In the KDD Cup 1999 data set, different features are on
different scales. To solve this problem, we use range
normalization, defined in (1) and (2):

r_j = max_i(r_ij) - min_i(r_ij),  1 <= i <= n   (1)

r'_ij = (r_ij - min_i(r_ij)) / r_j   (2)

Range normalization guarantees that each numeric feature
value lies between 0 and 1 and eliminates the effect of the
different scales among the features.
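A minimal sketch of range normalization as in (1) and (2), assuming a numeric feature matrix with records as rows; the guard against constant features (where the range r_j is zero) is our addition:

```python
import numpy as np

def range_normalize(R):
    """Range (min-max) normalization per feature, as in (1) and (2)."""
    col_min = R.min(axis=0)
    col_range = R.max(axis=0) - col_min   # (1): r_j = max - min per feature
    col_range[col_range == 0] = 1.0       # guard: constant feature, avoid /0
    return (R - col_min) / col_range      # (2): r'_ij in [0, 1]
```

Note that the minimum and maximum must be taken from the training set and reused at detection time, so that new records are scaled consistently.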
3) Clustering
A cluster is represented by the mean of its members, so
the distance between a data object and a cluster can be
represented by the distance between the object and the mean
of the cluster's members. In this paper, the distance is the
Euclidean distance, shown in (3):

d(r_i, r_j) = (sum_{k=1}^{N} |r_ik - r_jk|^2)^{1/2}   (3)
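A small sketch of the distance computation described above; `distance_to_cluster`, which represents a cluster by the mean of its members as stated in the text, is our own naming:

```python
import numpy as np

def euclidean(r_i, r_j):
    """Distance in (3): d(r_i, r_j) = (sum_k |r_ik - r_jk|^2)^(1/2)."""
    r_i, r_j = np.asarray(r_i, float), np.asarray(r_j, float)
    return float(np.sqrt(np.sum((r_i - r_j) ** 2)))

def distance_to_cluster(x, members):
    """Distance between a data object and a cluster: distance between
    the object and the mean of the cluster's members."""
    return euclidean(x, np.mean(np.asarray(members, float), axis=0))
```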