
1st Int'l Conf. on Recent Advances in Information Technology | RAIT-2012 |


978-1-4577-0697-4/12/$26.00 ©2012 IEEE

Hari Om
Department of Computer Science and Engineering,
Indian School of Mines,
Dhanbad, India
hari.om.cse@ismdhanbad.ac.in

Aritra Kundu
Department of Computer Science and Engineering,
Indian School of Mines,
Dhanbad, India
aritra32.kundu@gmail.com

Abstract: In this paper, we propose a hybrid intrusion detection system that combines k-Means clustering with two classifiers, K-nearest neighbor and Naïve Bayes, for anomaly detection. It first selects features using an entropy based feature selection algorithm, which keeps the important attributes and removes the redundant ones. The next step is a clustering phase using k-Means, followed by classification. The system is evaluated on the KDD-99 data set from the KDD99 (Knowledge Discovery and Data Mining) intrusion detection contest, which is used worldwide for evaluating the performance of intrusion detection systems. The system can detect intrusions and further classify them into four categories: Denial of Service (DoS), User to Root (U2R), Remote to Local (R2L), and probe. The main goal is to reduce the false alarm rate of the IDS.

Keywords: Clustering, Classification, k-Means, Naïve Bayes, detection rate, false alarm rate, intrusion detection, KDD Cup 99 Data set.
I. INTRODUCTION

In recent years, network based services and network based attacks have grown significantly [1][2]. Network based attacks can be considered a kind of intrusion. An intrusion can be defined as "any set of actions that attempt to compromise the integrity, confidentiality or availability of a resource". Intrusion detection systems are employed to control intrusions. The three important characteristics of an intrusion detection system are accuracy, extensibility and adaptability. Attacks generally change their types, so the detection rules must be updated to notice new attacks. Several techniques such as data mining, statistics, and genetic algorithms have been used for intrusion detection. Most recently, data mining techniques have been used to mine the normal pattern from the audit data. Two data mining techniques used for anomaly detection are association rules and frequent episodes. Association rules find correlations between features or attributes, and frequent episode techniques are effective for detecting occurrences of sequential patterns in a sequence of events. Intrusion detection can be broadly classified into misuse based and anomaly based. In misuse detection, a set of signatures is stored in a database; the system tries to match incoming traffic against the stored attack patterns, and if there is a match, the attack is detected. In anomaly detection, any action that significantly deviates from the normal behavior is considered an intrusion. The system searches for malicious activities by comparing the network traffic to the normal usage pattern learned from the training data. This approach can detect novel and unseen attacks, but suffers from a high rate of false alarms.

The main purpose of intrusion detection is to detect future attacks, which has led to incremental learning techniques. A fixed intrusion detection model cannot adapt to changes in the network behavior pattern. So, in order to detect new attacks and continually adapt to new network behavior, we propose a hybrid intrusion detection system that is composed of an incremental misuse detection system and an anomaly detection system. This system combines the merits of misuse and anomaly detection. Our goal is not only to obtain a high detection rate (DR) on malicious activities but also to reduce the false positive rate (FPR) on normal computer usage in network traffic. The rest of the paper is organized as follows. Section II discusses the related works and Section III provides theoretical background. In Section IV, the proposed work is discussed. The experimental work is discussed in Section V, and Section VI concludes the paper.
II. RELATED WORKS
Hybrid intrusion detection systems comprise misuse detection and anomaly detection systems, so they can detect both known and unknown intrusions. Some such intrusion detection systems are mentioned below. Audit Data Analysis and Mining (ADAM) [3] uses association rules for detecting intrusions [1]; Next-Generation Intrusion Detection Expert
A Hybrid System for Reducing the False Alarm
Rate of Anomaly Intrusion Detection System
System (NIDES) [4] consists of rule-based misuse detection and anomaly detection; the Random Forest based intrusion detection system [14] uses an ensemble of classification trees for misuse detection and uses proximities to find anomalous intrusions, similar to ADAM [3]; and the Feedback Learning Intrusion Prevention System (FLIPS) [5] uses a hybrid approach for intrusion prevention. The core of our proposed work is an anomaly-based classifier.
III. THEORETICAL BACKGROUND
In this section we discuss the basic ways in which an intrusion detection system can be built. As mentioned above, there are two main classes of intrusion detection: misuse based and anomaly based. Their different combinations, which can be called hybrid systems, are discussed below.

A. Hybrid System Architecture
There are three ways to combine misuse and anomaly detection. The first uses anomaly detection to find malicious activities and then uses signature (misuse) detection to identify attacks among them. Connections that match attack patterns are labeled as attacks, those that match false alarm patterns are labeled as normal, and the others are labeled as unknown attacks. We have used this approach to reduce the false positive rate of the anomaly based part. The second uses misuse and anomaly detection in parallel: both components flag malicious activities independently, and a correlation component then combines their outputs. The third uses misuse detection followed by anomaly detection to detect attacks in real time.
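
The first (anomaly first, then misuse) combination can be sketched as follows. This is only an illustrative sketch: the function name, the dictionary-based connection records, and the toy predicates are hypothetical and are not taken from the proposed system.

```python
from typing import Callable, Iterable, List


def label_connections(
    connections: Iterable[dict],
    is_anomalous: Callable[[dict], bool],
    attack_signatures: List[Callable[[dict], bool]],
    false_alarm_patterns: List[Callable[[dict], bool]],
) -> List[str]:
    """Anomaly stage first, then signature matching on the flagged traffic."""
    labels = []
    for conn in connections:
        if not is_anomalous(conn):
            # The anomaly stage did not flag this connection at all.
            labels.append("normal")
        elif any(sig(conn) for sig in attack_signatures):
            # Flagged and matches a known attack pattern.
            labels.append("attack")
        elif any(pat(conn) for pat in false_alarm_patterns):
            # Flagged, but matches a known false alarm pattern.
            labels.append("normal")
        else:
            # Flagged and matches nothing known.
            labels.append("unknown attack")
    return labels


# Toy records and rules, invented purely for illustration.
connections = [
    {"service": "http", "count": 3},
    {"service": "http", "count": 500},
    {"service": "backup", "count": 300},
    {"service": "telnet", "count": 250},
]
labels = label_connections(
    connections,
    is_anomalous=lambda c: c["count"] > 100,
    attack_signatures=[lambda c: c["service"] == "http" and c["count"] > 400],
    false_alarm_patterns=[lambda c: c["service"] == "backup"],
)
print(labels)
```

The second stage only ever sees connections that the anomaly stage flagged, which is how the false alarm patterns can lower the false positive rate without affecting unflagged traffic.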
B. Network Profiling
Since the number of attacks is increasing, an IDS should be updated with the signatures of new attacks. Network profiling helps to define new signatures. Network profiling raises some problems, e.g. grouping the attacks coming from the network by type. Such problems can be solved by techniques such as classification and clustering.
IV. PROPOSED SYSTEM ARCHITECTURE
In our proposed system, we use k-means clustering and the K-Nearest Neighbor (K-NN) algorithm [7]. We first apply the k-means algorithm to the given data set, with the number of clusters set to five, to split the data records into one normal cluster and four anomalous clusters. The anomalous clusters are U2R, R2L, PROBE, and DoS. The records are labeled with the cluster indices. Then, we divide the data set into two parts: one part is used for training and the other for evaluation. In the training phase, the K-NN classifier is trained with the labeled records. Finally, we apply the remaining unlabeled records to the K-NN classifier, which classifies each of them into the normal or the anomalous clusters. The work consists of feature selection, clustering and hybrid classification, which are described below, followed by the proposed algorithm.
A. Module 1: Feature Selection Algorithm
We use an entropy based feature selection method for selecting the attributes and removing the redundant ones. The algorithm [8] consists of two parts. The first part removes irrelevant features that have poor prediction ability for the target class. It calculates the mutual information between each feature and the class, and ranks the features in descending order of their degree of association with the target class. Features whose information measure equals zero are then removed. The second part removes the redundant features that are inter-correlated with one or more other features.
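
The first part of this module (ranking by mutual information and dropping zero-information features) can be sketched as below. It is a minimal sketch assuming discrete feature values; the function names, the toy columns, and the small tolerance used in place of an exact comparison with zero are assumptions for illustration, not part of the algorithm in [8].

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        p_joint = c / n
        p_indep = (count_x[x] / n) * (count_y[y] / n)
        mi += p_joint * math.log2(p_joint / p_indep)
    return mi


def rank_relevant_features(columns, target, tol=1e-12):
    """Rank features by MU(T, f_i), descending, and drop those at zero.

    `tol` is an assumed tolerance that guards against rounding noise.
    """
    scores = {name: mutual_information(vals, target)
              for name, vals in columns.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, score) for name, score in ranked if score > tol]


# Toy data, invented for illustration (not KDD99 records).
target = ["normal", "normal", "attack", "attack"]
columns = {
    "flag": ["SF", "SF", "S0", "S0"],              # fully determines the class
    "land": ["0", "0", "0", "0"],                  # constant, carries no information
    "protocol_type": ["tcp", "udp", "tcp", "udp"], # independent of the class
}
print(rank_relevant_features(columns, target))
```

On this toy data only `flag` survives, because the other two columns share no information with the class label.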

B. Module 2: Clustering
Clustering is the division of data into groups of similar objects. Each group, or cluster, contains objects that are similar to each other but dissimilar to the objects in other groups. The greater the difference between groups, the better the clustering. Clustering is a form of unsupervised learning because the class labels are not known. Membership of a data point in a cluster is decided from a group of measurements and observations. Some clustering algorithms are k-Means [6], agglomerative hierarchical clustering, and DBSCAN [7]. We use k-means clustering in our work.
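
For readers unfamiliar with k-means, a minimal sketch of the standard Lloyd iteration is given below. It uses random initial centroids and Euclidean distance; this is an assumption for illustration and differs from the entropy based seeding used later in the proposed algorithm. The function name and the toy points are hypothetical.

```python
import math
import random


def kmeans(points, k, max_iterations=100, seed=0):
    """Plain Lloyd's k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Two well separated toy groups, invented for illustration.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
assignment, centroids = kmeans(points, k=2)
print(assignment)
```

The cluster indices returned in `assignment` play the role of the labels that are appended to the connection records before classification.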
C. Module 3: Hybrid Classification
This module assigns class labels to the objects. It is first trained with records and their class labels in the training phase. The data set is divided into a search domain and new samples. The module builds a classification model from the search domain and decides the class of each given object using one of the following methods: k-nearest neighbor [9], Naïve Bayes [6][9], decision tree [6], and Support Vector Machine [5].
D. Proposed Algorithm
Input: Dataset D, a sample X, normal cluster N, anomalous cluster A
Output: X is abnormal or normal
Algorithm Hybrid
a. Remove irrelevant features as follows.
   i. Input the original data set D, which includes features X and target class T.
   ii. For each feature f_i:
   iii. Calculate the mutual information MU(T, f_i).
   iv. Sort MU(T, f_i) in descending order.
   v. Put each f_i whose MU(T, f_i) > 0 into the relevant feature set Rxy.
   vi. Remove the remaining features.
b. Input the relevant feature set Rxy.
   i. For each feature f_j:
   ii. Calculate the pairwise mutual information MU(f_i, f_j).
   iii. Select those features having MU(f_i, f_j) > T, a predefined threshold, and put them into set B; MUxx = MU(f_i, f_j).
c. Calculate the following from the autocorrelation coefficients:
   i. their means x and y
   ii. their ratio W = Rxx/Ryy
   iii. R = W Ryy - Rxx
d. Select each f_j from set B whose R > 0 into the final set F.
e. Apply the K-Means algorithm to cluster the data.
f. Compute the pairwise entropy E(RE_i, RE_j) for all records in the sample, find the minimum entropy between each record and the other records, and store it in p_i, i.e., p_i = min(E(RE_i, RE_j)).
g. Form a sequence P by ordering the records in descending order and save them in q_i.
h. Select the first k points from q_i and form k cluster centroids by calling KMeans(q_i, k).
i. Apply K-means to the rest of the records in the data set and put the remaining connection records into the corresponding clusters; the number of clusters is taken as 5.
j. Obtain the cluster indexes, append them to the connection records, and update a separate copy of the data set file.
k. Take a part of the connection records in the modified data set table, apply those records to the hybrid classification algorithm, and build the training normal data set D.
l. Take a part of the data set, D_j.
m. For each record x in D_j in the test data do
   i. If x is present in the database (of signatures) then
         x is anomalous
      Else
         Find the scores dist(x, y) for all x, y in D_j, where y is the other record or point.
   ii. Arrange the distances in ascending order.
   iii. Find the first k shortest distances and pick the corresponding k nearest neighbours.
   iv. If (voting(x, N) > voting(x, A))
         x is normal
      Else If (voting(x, N) < voting(x, A))
         x is abnormal
      Else
         Calculate the class conditional and prior probabilities for the Naïve Bayes classifier.
n. Calculate the posterior probability
\[ P(k_j \mid x) = \frac{P(x \mid k_j)\, P(k_j)}{P(x)} \]
x will belong to cluster k_j if P(k_j | x) is maximum over all j = 1, 2, 3, ..., n.
Here n = 5 and
\[ P(x \mid k_j) = P(k_j) \prod_{i=1}^{n} P(x_i \mid k_j), \]
Prior probabilities
\[ P(k_j) = \frac{\sum_{i=1}^{n} t_i k_j}{\sum_{i=1}^{n} t_i}, \]
Conditional probabilities
\[ P(x_i \mid k_j) = \frac{\sum_{i=1}^{n} x_i k_j}{\sum_{i=1}^{n} t_i k_j}, \]
dist(x, y) is the Euclidean distance, and k_1, k_2, k_3, k_4, k_5 are the clusters or classes Normal, DoS, Probe, R2L and U2R, respectively.
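
The classification steps m and n can be sketched as follows: a signature lookup, then a k-nearest-neighbour vote, with Naïve Bayes used only when the vote is tied. This is a simplified two-class sketch under stated assumptions: features are small integer codes (already discretized), the Naïve Bayes estimates use Laplace smoothing with parameter `alpha`, and all names and toy records are hypothetical rather than taken from the paper.

```python
import math
from collections import Counter


def naive_bayes_predict(train, x, alpha=1.0):
    """Categorical Naive Bayes with Laplace smoothing (an assumed choice)."""
    class_counts = Counter(label for _, label in train)
    best_label, best_score = None, -math.inf
    for label, n_c in class_counts.items():
        score = math.log(n_c / len(train))  # log prior
        for i, value in enumerate(x):
            distinct = {feats[i] for feats, _ in train}
            matches = sum(1 for feats, l in train
                          if l == label and feats[i] == value)
            score += math.log((matches + alpha) / (n_c + alpha * len(distinct)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def hybrid_classify(train, x, k, signatures=frozenset()):
    """Signature check, then k-NN vote, then Naive Bayes on a tie."""
    if tuple(x) in signatures:
        return "attack"
    # sorted() is stable, so equal distances keep training order.
    neighbours = sorted(train, key=lambda rec: math.dist(rec[0], x))[:k]
    votes = Counter(label for _, label in neighbours)
    if votes["normal"] > votes["attack"]:
        return "normal"
    if votes["attack"] > votes["normal"]:
        return "attack"
    return naive_bayes_predict(train, x)


# Toy labeled records, invented for illustration.
train = [
    ((0, 0), "normal"), ((0, 1), "normal"), ((1, 1), "normal"),
    ((3, 1), "attack"), ((3, 2), "attack"), ((2, 2), "attack"),
]
print(hybrid_classify(train, (0, 0), k=3))  # clear majority of normal neighbours
print(hybrid_classify(train, (2, 1), k=2))  # tied vote, decided by Naive Bayes
```

With an even `k` the vote can tie, which is the case the Naïve Bayes stage is meant to resolve; with an odd `k` and two classes it is never reached.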
V. EXPERIMENTAL EVALUATION
In this section we discuss the simulation results of the proposed work for different types of attacks. The data set used for simulation is the KDD Cup 99 data set.
A. Intrusion Dataset
To simulate the presented ideas, we use the 1998 DARPA Intrusion Detection Evaluation program data provided by MIT Lincoln Labs [10]. The raw TCP dump data has been processed into about five million connection records. The data set contains 24 attack types. All these attacks fall into four main categories: DoS, U2R, R2L, and Probe, as follows.
Normal connections are generated by capturing daily behavior such as downloading files or visiting web pages.
Denial of Service (DoS): The attacker makes some computing resources too busy or memory resources too full to handle legitimate requests, or denies legitimate users access to a machine. DoS attacks are classified based on the services that the attacker makes unavailable to the users, e.g. apache2, land, mail, back, etc.
Remote to Local (R2L): The attacker, who does not have an account on a remote machine, sends packets to that machine over a network and exploits some vulnerability to gain local access as a user of that machine. Examples include sendmail and xlock.
User to Root (U2R): The attacker starts out with access as a normal user on the system and exploits vulnerabilities to gain root access to the system.
Probing: The attacker scans a network of computers to collect information or to find known vulnerabilities. An attacker with a map of the machines and services that are available on the network can use this information to look for exploits.
B. Results and Analysis
We have used the KDD99 cup data set [10][11][12] for training and testing [1][2]. The 1998 DARPA intrusion detection evaluation program was set up by MIT Lincoln Lab to acquire raw TCP/IP dump data [10][12] for a LAN in order to compare the performance of various intrusion detection methods [1][2]. In the KDD-99 data set, each record consists of a set of features, each of which is either discrete or continuous. The qualitative values are labels without an order, which may be symbolic or numeric; e.g. the value of the feature protocol type is one of the symbols {icmp, tcp, udp}, and the numeric value of the feature logged in is 0 or 1 to represent whether the user has successfully logged in or not. For the quantitative attributes, the data are characterized by numeric values within a finite interval; an example is the duration. Since the feature selection is applicable only to discrete attributes, the continuous features need to be converted to discrete ones prior to the feature selection analysis. To evaluate the performance of this method on the KDD99 data set, we first apply the entropy based feature selection algorithm, and then the K-means clustering algorithm on the selected features. After that, we classify the obtained data into normal or anomalous clusters by using the hybrid classifier.
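
The conversion of a continuous feature to a discrete one is not specified further in the text. One common option, shown here purely as an assumed illustration, is equal-width binning; the function name, the number of bins, and the toy duration values are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Map each numeric value to a bin index in 0..n_bins-1 (equal-width bins)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature falls entirely into one bin.
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Toy duration values in seconds, invented for illustration.
durations = [0, 2, 5, 9, 10]
print(equal_width_bins(durations, n_bins=5))
```

After such a step, a continuous feature such as the duration can be passed to the mutual information calculation in the same way as a symbolic feature such as the protocol type.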
We have applied 10-fold cross validation on the data set and used the detection rate (DR), false positive rate (FPR), and overall classification rate (CR) to evaluate the performance of the intrusion detection task. True positive (TP), true negative (TN), false positive (FP), and false negative (FN) are defined as follows.
True positive (TP): number of malicious records that are correctly classified as intrusions.
True negative (TN): number of legitimate records that are not classified as intrusions.
False positive (FP): number of legitimate records that are incorrectly classified as attacks.
False negative (FN): number of malicious records that are incorrectly classified as legitimate activities.
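Detection rate (DR):
\[ DR = \frac{TP}{TP + FN} \]
False positive rate (FPR):
\[ FPR = \frac{FP}{TN + FP} \]
Overall classification rate (CR):
\[ CR = \frac{TP + TN}{TP + TN + FP + FN} \]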
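
The three measures can be computed directly from the four confusion matrix counts, as in the short sketch below. The counts used in the example are hypothetical round numbers chosen for easy checking; they are not results from this paper.

```python
def detection_metrics(tp, tn, fp, fn):
    """Return (DR, FPR, CR) from confusion matrix counts."""
    dr = tp / (tp + fn)                    # share of attacks that are detected
    fpr = fp / (tn + fp)                   # share of normal records flagged as attacks
    cr = (tp + tn) / (tp + tn + fp + fn)   # share of all records classified correctly
    return dr, fpr, cr


# Hypothetical counts for illustration only.
dr, fpr, cr = detection_metrics(tp=90, tn=95, fp=5, fn=10)
print(f"DR={dr:.3f} FPR={fpr:.3f} CR={cr:.3f}")
```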


TABLE I: ATTACK CLASSES IN KDD99 DATA SET
Four Main Attack Classes | 22 Attack Classes
Denial of Service (DoS)  | neptune, teardrop, back, land, pod, smurf
Remote to User (R2L)     | ftp_write, warezclient, warezmaster, guess_passwd, imap, multihop, phf, spy
User to Root (U2R)       | loadmodule, buffer_overflow, perl, rootkit
Probing                  | satan, ipsweep, nmap, portsweep

TABLE II: NUMBER OF EXAMPLES USED IN TRAINING AND TESTING DATA TAKEN FROM KDD99 DATA SET
Attack Types      | Training Examples | Sample percentage (%)
Normal            | 56833             | 21.4123
Probe             | 3015              | 1.1359
User to Root      | 120               | 0.04521
Remote to User    | 3185              | 1.199
Denial of Service | 202269            | 76.20656
Total examples    | 265422            | 100

Attack Types      | Testing Examples  | Sample percentage (%)
Normal            | 31052             | 22.3
Probe             | 3904              | 2.80367
User to Root      | 86                | 0.061
Remote to User    | 4300              | 3.088
Denial of Service | 99904             | 71.7464
Total examples    | 139246            | 100

TABLE III: CLASSIFICATION RESULT FOR K-MEANS


TABLE IV: RESULT FOR KMEANS
                            | Predicted Normal | Predicted Intrusions (Attacks)
Actual Normal               | 12347            | 852
Actual Intrusions (Attacks) | 1584             | 83709

TABLE V: RESULT FOR KMEANS+KNN



TABLE VI: RESULT FOR KMEANS+KNN CLASSIFIER USING NORMAL AND ATTACK CLASS
Actual               | Predicted Normal | Predicted Intrusions (Attacks)
Normal               | 14761            | 635
Intrusions (Attacks) | 1249             | 88346

TABLE VII: RESULT FOR KMEANS+KNN+NAÏVE BAYES


TABLE VIII: RESULT FOR KMEANS+KNN+NAÏVE BAYES CLASSIFIER USING NORMAL AND ATTACK CLASS
Actual               | Predicted Normal | Predicted Intrusions (Attacks)
Normal               | 18954            | 352
Intrusions (Attacks) | 794              | 94778

TABLE IX: DR, FPR AND ACCURACY
Method Used | Detection rate | False positive rate | Accuracy
1           | 0.93544        | 0.01857             | 0.97526
2           | 0.9587555      | 0.01394             | 0.982055
3           | 0.981867       | 0.00830             | 0.990024

Here, Method 1: kMeans clustering; Method 2: kMeans clustering and kNN; Method 3: kMeans clustering, kNN and Naïve Bayes classifier.
Table I shows the attack classes in the KDD Cup 99 data set, and Table II shows the number of examples used in training and testing. The attacks can be divided into four major categories: DoS, U2R, R2L, and Probe. For each of the three approaches, the first table (Tables III, V and VII) shows the classification result and the second table (Tables IV, VI and VIII) shows the confusion matrix constructed from it. The detection rate, false positive rate, and accuracy are calculated from the confusion matrix tables using the given formulas, and the results are given in Table IX. From Table IX, we can see that there is a clear increase in detection rate and accuracy and a decrease in false alarm rate. In Method 2, the detection rate is 95.87%, which has increased from 93.54% in Method 1; the false alarm rate decreases from 1.857% to 1.394%, and the accuracy increases to 98.20%. In Method 3, which is a combination of kMeans, kNN and the Naïve Bayes classifier, the detection rate reaches 98.18% and the false positive rate decreases from 1.394% to 0.830%. This shows that our proposed approach performs better than conventional kMeans alone and than kMeans with kNN.
VI. CONCLUSIONS
In this paper, we have proposed a hybrid intrusion detection system that combines the merits of anomaly and misuse detection. Anomaly detection has a very high false alarm rate. In order to reduce it, we have applied the k-Means algorithm for clustering followed by a hybrid classifier that combines the k-Nearest Neighbor and Naïve Bayes classifiers for detecting intrusions. The disadvantage of the existing methods is that real-life data sets show very little difference between normal and anomalous data. The differences are sometimes so small that the classification algorithms misclassify some records. We have overcome this problem by using some kind of fuzzy based algorithms.
REFERENCES
[1] James P. Anderson, "Computer security threat monitoring and surveillance," Technical Report 98-17, James P. Anderson Co., Fort Washington, Pennsylvania, USA, April 1980.
[2] D. E. Denning, "An intrusion detection model," IEEE Transactions on Software Engineering, SE-13(2), 1987, pp. 222-232.
[3] Daniel Barbará, Julia Couto, Sushil Jajodia, Leonard Popyack and Ningning Wu, "ADAM: Detecting intrusion by data mining," IEEE Workshop on Information Assurance and Security, West Point, New York, June 5-6, pp. 11-16, 2001.
[4] Debra Anderson, Thane Frivold, and Alfonso Valdes, "Next-generation Intrusion Detection Expert System (NIDES): A Summary," Computer Science Laboratory, SRI-CSL-95-07, May 1995.
[5] Te-Shun Chou and Tsung-Nan Chou, "Hybrid Classified Systems for Intrusion Detection," Seventh Annual Communications Networks and Services Research Conference, pp. 286-291, 2009.
[6] N.B. Amor, S. Benferhat, and Z. Elouedi, "Naïve Bayes vs. decision trees in intrusion detection systems," Proc. of 2004 ACM Symposium on Applied Computing, 2004, pp. 420-424.
[7] Yihua Liao and V. Rao Vemuri, "Using K-nearest Neighbor Classifier for Intrusion Detection," Department of Computer Science, University of California.
[8] T. S. Chou, K. K. Yen, and J. Luo, "Network Intrusion Detection Design Using Feature Selection of Soft Computing Paradigms," World Academy of Science, Engineering and Technology, 47, pp. 529-541, 2008.
[9] Z. Muda, W. Yassin, M.N. Sulaiman and N.I. Udzir, "A K-Means and Naive Bayes Learning Approach for Better Intrusion Detection," Information Technology Journal, 10, pp. 648-655, 2011.
[10] MIT Lincoln Labs, 1999 ACM Conference on Knowledge Discovery and Data Mining (KDD) Cup dataset, http://www.acm.org/sigs/sigkdd/kddcup/index.php?section=1999
[11] The KDD Archive, KDD99 cup dataset, 1999. http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
[12] M. Tavallaee, E. Bagheri, W. Lu, and A. A. Ghorbani, "A detailed analysis of the KDD CUP 99 Data Set," Proc. of IEEE Symposium on Computational Intelligence for Security and Defense Applications (CISDA'09), pp. 1-6, 2009.
[13] S. Mukkamala, G. Janoski, and A.H. Sung, "Intrusion detection using neural networks and support vector machines," Proc. of the IEEE International Joint Conference on Neural Networks, 2002, pp. 1702-1707.
[14] J. Zhang and M. Zulkernine, "A Hybrid Network Intrusion Detection Technique Using Random Forests," Proc. of IEEE First International Conference on Availability, Reliability and Security (ARES'06), p. 8, 2006.
[15] D. Md. Farid, N. Harbi, S. Ahmmed, Md. Z. Rahman, and C. M. Rahman, "Mining Network Data for Intrusion Detection through Naïve Bayesian with Clustering," World Academy of Science, Engineering and Technology, 66, pp. 341-345, 2010.
