
Available ONLINE www.vsrdjournals.com


VSRD-IJCSIT, Vol. 2 (6), 2012, 502-510


____________________________
1,2,3 Research Scholar, 1,3 Department of Computer Science & Engineering, 2 Department of Electronics
& Communication Engineering, 1,2,3 Lovely Professional University, Jalandhar, Punjab, INDIA.
*Correspondence: narayanesh.1984@gmail.com
RESEARCH ARTICLE
Intrusion Detection System Using Fuzzy C-Means Clustering with Unsupervised
Learning via EM Algorithms
1 Esh Narayan*, 2 Pankaj Singh and 3 Gaurav Kumar Tak
ABSTRACT
Networks today face many intrusions, and detecting intrusive activity is a goal of any security policy.
Among the unsupervised machine learning techniques applied to intrusion detection datasets, clustering is one of
the most effective data mining techniques for intrusion detection. The k-means clustering algorithm is widely
used for intrusion detection because it gives efficient results. However, k-means clustering sometimes fails to
give the best result when the dataset is noisy. To address this problem, we propose a new algorithm for
cluster-to-class assignment based on the fuzzy c-means clustering algorithm. According to our experimental
results, the proposed algorithm has low error rates for low class instances. The proposed algorithm is called
expectation maximization fuzzy c-means clustering (EMFCM). Feature reduction techniques are applied to the
KDD Cup 1999 dataset. MATLAB is used for the data analysis: it is used to train and test on the dataset and to
measure the efficiency.
Keywords : Feature Selection, K-means Clustering, C-means Clustering, EM Algorithms, KDD Cup 99
Dataset, MATLAB, Unsupervised Learning.
1. INTRODUCTION
A security mechanism is an important part of system design; its goal is to prevent unauthorized access. A
simple firewall can no longer provide enough security. Enforcing the security policy is the main goal of a
security mechanism today, so systems try to detect intrusion attempts so that action may be taken to repair
the damage. This field of research is called intrusion detection [2]. An Intrusion Detection System (IDS) is
commonly software that automates the intrusion detection process and detects possible intrusions [10]. Some
kinds of intrusion are:
Esh Narayan et al / VSRD International Journal of CS & IT Vol. 2 (6), 2012

- Attempted break-ins, which are detected by atypical behavior profiles or violations of security
constraints.
- Leakage, which is detected by atypical use of system resources.
IDSs may be classified into two main categories based on the source of the audit information used by each
IDS: (a) host-based IDSs and (b) network-based IDSs [3]. A host-based IDS (HIDS) can only monitor the
individual host systems on which its agents are installed; it does not monitor the entire network [8]. A
honeypot is an example of a HIDS. Network-based intrusion detectors (NIDSs) insert themselves into the network
just like any other device. A filter is usually applied to determine which traffic will be discarded and which
will be passed on to an attack recognition module. This helps to filter out known non-malicious traffic [9].
Snort is an example of a NIDS. A network computing environment increases complexity, so the main functional
requirement is that IDSs continuously monitor and report interactions. Various techniques are used for
detecting intrusions, but there are two main types of detection technique [6]: (1) anomaly detection and
(2) misuse detection. Many types of attack on a network can be seen by an IDS, namely Probe, DoS, R2L and
U2R [3].
2. RELATED WORK
Bharti, Shukla & Jain (2010) argue that clustering is among the best techniques for intrusion detection. Among
clustering algorithms, k-means is widely used for intrusion detection because it gives efficient results.
However, k-means clustering sometimes fails to give the best result because of the class dominance problem and
the no-class problem [1]. An intrusion detection system is an effective approach to dealing with network
problems using various neural network classifiers [3]. Sapna S. Kaushik and Dr. Prof. P.R. Deshmukh (2011) state
that network-based intrusion detection is among the best methods in IDS. An IDS can be a piece of installed
software or a physical appliance. The traffic classes are normal, Probe, U2R, DoS and R2L attacks [6]. Attacks
are generated randomly using a random function. The type of attack generated is classified as a Probe, R2L,
U2R or DoS attack [7]. Chen, Jianlin, Dan and Chen Li (2011) note that fuzzy clustering analysis is currently
among the most popular research topics. Fuzzy clustering is one of the most mature and most widely used
theories, although the classical algorithms have some drawbacks [9]. Aizhong and Mi Linpeng (2010) focus on
pattern recognition, selecting the best classifier for network intrusion detection with a clustering-based
selection method. Multiple clusters are selected for a test sample. The purpose of selecting multiple
classifiers is pattern recognition [6].
Ajit Singh (2005): Expectation-Maximization (EM) is a technique used in point estimation. Given a set of
observable variables X and unknown (latent) variables Z, we want to estimate the parameters θ in a model [10].
Sometimes the M-step is a constrained maximization, which means that there are constraints on valid
solutions that are not encoded in the function itself. An example of a constrained optimization is to maximize.
Clustering is defined as the method of arranging a set of objects into classes of similar objects (objects
having the same behavior). Two requirements apply: (1) documents within a cluster should be similar;
(2) documents from different clusters should be dissimilar. This yields similar or dissimilar clusters of
documents.
3. FUZZY C-MEANS CLUSTERING
Cluster analysis identifies such groupings (or clusters) in an unsupervised manner; an unsupervised approach
divides a set of objects into homogeneous groups. Many clustering algorithms are scattered across
publications in very diversified areas such as pattern recognition, artificial intelligence, information
technology, image processing, biology, psychology, and marketing [2]. Clustering algorithms can be classified
into two main categories: hard clustering algorithms and fuzzy clustering algorithms. Unlike hard clustering
algorithms, which require that each data point of the data set belong to one and only one cluster, fuzzy
clustering algorithms allow a data point to belong to two or more clusters with different probabilities. There
is also a huge number of published works related to fuzzy clustering [9]. In hard clustering, each document
belongs to exactly one cluster; that is, we make a hard partition of the data set Z into subsets A_i:
$$\bigcup_{i=1}^{c} A_i = Z \quad \text{and} \quad A_i \cap A_j = \emptyset \;\; \text{for all } i \neq j \qquad (3.1)$$
Also, none of the sets $A_i$ may be empty.
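As a small illustration of the hard-partition conditions (the subsets cover Z, do not overlap, and are all non-empty), the following Python sketch checks whether a family of subsets is a hard partition. The function name `is_hard_partition` and the example sets are illustrative assumptions and do not come from the original experiments.

```python
def is_hard_partition(Z, subsets):
    """Return True if the subsets are non-empty, pairwise disjoint and cover Z."""
    # No subset A_i may be empty.
    if any(len(A) == 0 for A in subsets):
        return False
    union = set().union(*subsets)
    total = sum(len(A) for A in subsets)
    # The union must equal Z; equal sizes mean no element is counted twice,
    # which is the same as saying that no two subsets overlap.
    return union == set(Z) and total == len(union)


Z = {1, 2, 3, 4, 5}
valid = is_hard_partition(Z, [{1, 2}, {3, 4, 5}])          # proper partition
overlapping = is_hard_partition(Z, [{1, 2, 3}, {3, 4, 5}])  # 3 appears twice
with_empty = is_hard_partition(Z, [{1, 2}, set(), {3, 4, 5}])
incomplete = is_hard_partition(Z, [{1, 2}, {3, 4}])         # 5 is missing
```

Only the first family satisfies all three conditions; a fuzzy partition relaxes the disjointness condition by giving every point a graded membership in each cluster.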
FCM employs fuzzy partitioning, so that a data point can belong to all groups with different membership
grades between 0 and 1. Fuzzy c-means clustering (FCM), also known as Fuzzy ISODATA, is a clustering
technique that differs from hard k-means, which employs hard partitioning. FCM is an iterative algorithm.
The aim of FCM is to find cluster centers (centroids) that minimize a dissimilarity function. The algorithm
works by assigning a membership to each data point for each cluster center on the basis of the distance
between the cluster center and the data point [11][13]. Fuzzy clustering is also called soft clustering. In
fuzzy clustering we make a fuzzy partition of the data set, using a membership function to partition it.
To accommodate fuzzy partitioning, the membership matrix (U) is randomly initialized
according to Equation 3.2:
$$\sum_{i=1}^{c} u_{ij} = 1, \quad \forall j = 1, \dots, n \qquad (3.2)$$
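A minimal Python sketch of this random initialisation is given below. It assumes uniform random weights that are normalised per data point so that the constraint in Equation 3.2 holds; the helper name `init_membership` and the `seed` parameter are hypothetical and only make the example reproducible.

```python
import random


def init_membership(c, n, seed=None):
    """Return a c-by-n membership matrix whose columns each sum to 1."""
    rng = random.Random(seed)
    U = [[0.0] * n for _ in range(c)]
    for j in range(n):
        # Draw c positive random weights for data point j, then normalise
        # them so that the memberships of point j add up to 1 (Equation 3.2).
        weights = [rng.random() + 1e-12 for _ in range(c)]
        total = sum(weights)
        for i in range(c):
            U[i][j] = weights[i] / total
    return U


U = init_membership(c=2, n=5, seed=0)
column_sums = [sum(U[i][j] for i in range(2)) for j in range(5)]
```

Each entry of `U` lies between 0 and 1, and every entry of `column_sums` equals 1 up to floating-point rounding.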
This function is called the membership function, and its value lies between 0 and 1. The centroid ($c_i$) is
found with the help of the membership matrix ($u_{ij}$):
$$c_i = \frac{\sum_{j=1}^{n} u_{ij}^{m} x_j}{\sum_{j=1}^{n} u_{ij}^{m}} \qquad (3.3)$$
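The centroid formula in Equation 3.3 is a weighted mean of the data points, with weights $u_{ij}^m$. The Python sketch below illustrates it; the function name `centroids`, the default `m = 2.0` and the toy data are assumptions made for the example.

```python
def centroids(U, X, m=2.0):
    """Equation 3.3: c_i = sum_j u_ij^m x_j / sum_j u_ij^m."""
    dim = len(X[0])
    result = []
    for row in U:                      # one row of U per cluster i
        w = [u ** m for u in row]      # weights u_ij^m for this cluster
        denom = sum(w)                 # assumed non-zero (no empty cluster)
        result.append([sum(w[j] * X[j][k] for j in range(len(X))) / denom
                       for k in range(dim)])
    return result


X = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]]
# Crisp memberships: the first two points belong to cluster 0, the third to cluster 1.
U = [[1.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
C = centroids(U, X)
```

With crisp (0 or 1) memberships the formula reduces to the ordinary cluster mean, so `C` is `[[1.0, 0.0], [10.0, 10.0]]`; with graded memberships every point pulls on every centroid in proportion to its weight.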
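The dissimilarity function used in FCM is given by Equation 3.4:
$$J(U, c_1, c_2, \dots, c_c) = \sum_{i=1}^{c} J_i = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} d_{ij}^{2} \qquad (3.4)$$
where $u_{ij}$ is between 0 and 1, $m > 1$ is the weighting exponent, and $d_{ij}$ is the Euclidean distance
between the i-th centroid ($c_i$) and the j-th data point ($x_j$):
$$d_{ij} = \lVert x_j - c_i \rVert = \sqrt{(x_j - c_i)^{\top} (x_j - c_i)} \qquad (3.4.1)$$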
$$u_{ij} = \frac{1}{\sum_{k=1}^{c} \left( \frac{d_{ij}}{d_{kj}} \right)^{2/(m-1)}} \qquad (3.4.2)$$
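As a worked illustration of the membership update in Equation 3.4.2, take assumed values c = 2, m = 2, d_1j = 1 and d_2j = 3 (these numbers are chosen for the example and are not from the experiments). The exponent 2/(m-1) is then 2, and:

```latex
u_{1j} = \frac{1}{\left(\frac{1}{1}\right)^{2} + \left(\frac{1}{3}\right)^{2}}
       = \frac{1}{1 + \frac{1}{9}} = 0.9,
\qquad
u_{2j} = \frac{1}{\left(\frac{3}{1}\right)^{2} + \left(\frac{3}{3}\right)^{2}}
       = \frac{1}{9 + 1} = 0.1
```

The two memberships sum to 1, as Equation 3.2 requires, and the closer centroid receives the larger membership.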
If $\lVert U^{(k+1)} - U^{(k)} \rVert < \varepsilon$ (3.5)
then STOP; otherwise return to step 2.
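Putting Equations 3.2 to 3.5 together, one FCM run can be sketched in Python as follows. This is a minimal illustration and not the MATLAB code used in the experiments: the function name `fcm`, the defaults `m = 2.0`, `eps = 1e-5` and `max_iter = 100`, the use of the largest absolute change as the matrix norm in Equation 3.5, and the toy data are all assumptions.

```python
import math
import random


def fcm(X, c, m=2.0, eps=1e-5, max_iter=100, seed=0):
    """Fuzzy c-means on a list of points X; returns (centroids, U, J)."""
    rng = random.Random(seed)
    n, dim = len(X), len(X[0])
    # Random membership matrix with columns summing to 1 (Equation 3.2).
    U = [[rng.random() + 1e-12 for _ in range(n)] for _ in range(c)]
    for j in range(n):
        s = sum(U[i][j] for i in range(c))
        for i in range(c):
            U[i][j] /= s
    p = 2.0 / (m - 1.0)
    for _ in range(max_iter):
        # Centroids from the current memberships (Equation 3.3).
        C = []
        for i in range(c):
            w = [U[i][j] ** m for j in range(n)]
            C.append([sum(w[j] * X[j][k] for j in range(n)) / sum(w)
                      for k in range(dim)])
        # Euclidean distances d_ij (Equation 3.4.1), floored to avoid 0/0.
        D = [[max(math.dist(C[i], X[j]), 1e-12) for j in range(n)]
             for i in range(c)]
        # Membership update (Equation 3.4.2).
        U_new = [[1.0 / sum((D[i][j] / D[k][j]) ** p for k in range(c))
                  for j in range(n)] for i in range(c)]
        # Stopping rule (Equation 3.5), here with the largest absolute change.
        change = max(abs(U_new[i][j] - U[i][j])
                     for i in range(c) for j in range(n))
        U = U_new
        if change < eps:
            break
    # Dissimilarity function (Equation 3.4), evaluated with the final
    # memberships and the distances to the last computed centroids.
    J = sum(U[i][j] ** m * D[i][j] ** 2 for i in range(c) for j in range(n))
    return C, U, J


X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
     [9.0, 9.0], [9.0, 10.0], [10.0, 9.0]]
C, U, J = fcm(X, c=2)
# Index of the cluster with the largest membership for each point.
labels = [max(range(2), key=lambda i: U[i][j]) for j in range(len(X))]
```

On this toy data the first three points receive their largest membership in one cluster and the last three in the other, while every column of `U` still sums to 1.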
4. EM ALGORITHMS
The most common algorithm uses an iterative refinement technique; it is also referred to as Lloyd's algorithm,
particularly in the computer science community. The EM (expectation maximization) algorithm is very useful in
statistical models and gives good results in clustering. Given an initial set of means
$m_1^{(1)}, \dots, m_k^{(1)}$, the algorithm proceeds by alternating between two steps [9].
4.1. Assignment Step
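In this step, each observation is assigned to the cluster with the closest mean; this corresponds to
partitioning the observations according to the Voronoi diagram generated by the means [8].
$$S_i^{(l)} = \{\, x_p : \lVert x_p - m_i^{(l)} \rVert \le \lVert x_p - m_j^{(l)} \rVert \;\; \forall\, 1 \le j \le k \,\} \qquad (4.1)$$
where each $x_p$ goes into exactly one $S_i^{(l)}$, even if it could go into two of them.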
4.2. Update Step
Calculate the new means to be the centroids of the observations in each cluster.
$$m_i^{(l+1)} = \frac{1}{\lvert S_i^{(l)} \rvert} \sum_{x_j \in S_i^{(l)}} x_j \qquad (4.2)$$
The "assignment" step (4.1) is also referred to as the expectation step and the "update" step (4.2) as the
maximization step, making this algorithm a variant of the generalized expectation-maximization algorithm.
Equation (4.2) gives the new mean of each cluster. The algorithm is deemed to have converged when the
assignments no longer change. The result depends on the initial means; random partition is a commonly used
initialization method.
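The alternation between the assignment step (4.1) and the update step (4.2) can be sketched in Python as below. The function names `assign`, `update` and `lloyd`, the rule of keeping the old mean for an empty cluster, and the toy data are assumptions made for illustration; the source does not say how empty clusters are handled.

```python
import math


def assign(X, means):
    """Equation 4.1: put each point in the cluster of its closest mean."""
    clusters = [[] for _ in means]
    for x in X:
        # min() returns the first index on ties, so each point lands in
        # exactly one cluster even when two means are equally close.
        i = min(range(len(means)), key=lambda k: math.dist(x, means[k]))
        clusters[i].append(x)
    return clusters


def update(clusters, old_means):
    """Equation 4.2: each new mean is the centroid of its cluster."""
    new_means = []
    for S, old in zip(clusters, old_means):
        if not S:                      # assumed rule: keep the old mean
            new_means.append(list(old))
        else:
            new_means.append([sum(col) / len(S) for col in zip(*S)])
    return new_means


def lloyd(X, means, max_iter=100):
    """Alternate the two steps until the means stop moving."""
    for _ in range(max_iter):
        clusters = assign(X, means)
        new_means = update(clusters, means)
        if new_means == means:         # unchanged means give unchanged assignments
            break
        means = new_means
    return means, clusters


X = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]
means, clusters = lloyd(X, [[0.0, 0.0], [10.0, 10.0]])
```

Starting from the two initial means shown, the loop stops on its second pass with `means` equal to `[[0.0, 1.0], [10.0, 11.0]]`.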
5. PROPOSED MODEL
This research concerns network security: the aim is to improve the performance of intrusion detection
systems using fuzzy c-means clustering (FCM) with EM algorithms. We first improve the performance of fuzzy
c-means clustering using EM algorithms; the proposed new algorithm is expectation maximization fuzzy
c-means clustering (EMFCM).
Fig. 1 shows the proposed model.

Fig. 1: Proposed Model of My Research Paper
6. PROPOSED ALGORITHMS
The proposed algorithm is called expectation maximization fuzzy c-means clustering (EMFCM). The proposal is
to enhance the performance of c-means clustering using EM algorithms. The proposed algorithm combines the
clustering technique with the EM algorithm, which provides a sufficient result for cluster analysis by
computing a fixed centroid and a correct threshold value. The proposed algorithm has the following
steps.
Step 1. Initialise the membership matrix (U) randomly according to Equation 7.1.
$$\sum_{i=1}^{c} u_{ij} = 1, \quad \forall j = 1, \dots, n \qquad (7.1)$$
This equation constrains the membership matrix: the memberships of each data point sum to 1.
Step 2. Calculate the centroids ($c_i$) using Equation 7.2.
$$c_i = \frac{\sum_{j=1}^{n} u_{ij}^{m} x_j}{\sum_{j=1}^{n} u_{ij}^{m}} \qquad (7.2)$$
The centroid is the main point of the cluster analysis; in clustering, the value of $c_i$ depends on the
membership matrix and on the data points $x_j$.
Step 3. Use the dissimilarity function in Equation 7.3 to calculate the dissimilarities between the centroids
and the data points, check the threshold value in Equation 7.4, and stop if the threshold is met.
$$J(U, c_1, c_2, \dots, c_c) = \sum_{i=1}^{c} J_i = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} d_{ij}^{2} \qquad (7.3)$$
In this step the threshold is checked using the membership matrix and the Euclidean distance $d_{ij}$ between
the i-th centroid ($c_i$) and the j-th data point.
If $\lVert U^{(k+1)} - U^{(k)} \rVert < \varepsilon$ (7.4)
That is, the membership matrix is compared with the next membership matrix, which is obtained from the
dissimilarities between the centroids and the data points, against the threshold value.
Step 4. If the threshold is not met, find the new mean ($m_i$) using the EM algorithm, subject to the
constraints in Equation 7.5. The new mean provides the correct threshold value for the dissimilarity function.
The algorithm is deemed to have converged when the assignments no longer change. The result depends on the
initial means; random partition is a commonly used initialization method.
$$m_i^{(l+1)} = \frac{1}{\lvert S_i^{(l)} \rvert} \sum_{x_j \in S_i^{(l)}} x_j \qquad (7.5)$$
Step 5. Assign each observation to the cluster with the closest mean according to Equation 7.6.
$$S_i^{(l)} = \{\, x_p : \lVert x_p - m_i^{(l)} \rVert \le \lVert x_p - m_j^{(l)} \rVert \;\; \forall\, 1 \le j \le k \,\} \qquad (7.6)$$
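The five steps leave some details open, in particular how the means from Equation 7.5 feed back into the membership matrix. The Python sketch below shows one possible reading and should not be taken as the authors' implementation: in each pass it computes fuzzy centroids (Step 2), hard-assigns the points to the nearest centroid (Step 5), replaces each centroid by the mean of its set (Step 4), recomputes the memberships from those means, and tests the threshold (Step 3). The function names, the defaults, the coupling between the means and the memberships, and the use of the largest absolute change as the norm are all assumptions.

```python
import math
import random


def memberships(X, centres, m):
    """Fuzzy memberships from distances to the given centres (floored at 1e-12)."""
    p = 2.0 / (m - 1.0)
    c, n = len(centres), len(X)
    D = [[max(math.dist(ci, x), 1e-12) for x in X] for ci in centres]
    return [[1.0 / sum((D[i][j] / D[k][j]) ** p for k in range(c))
             for j in range(n)] for i in range(c)]


def emfcm_sketch(X, c, m=2.0, eps=1e-5, max_iter=100, seed=0):
    rng = random.Random(seed)
    n, dim = len(X), len(X[0])
    # Step 1: random membership matrix, columns normalised to sum to 1 (7.1).
    U = [[rng.random() + 1e-12 for _ in range(n)] for _ in range(c)]
    for j in range(n):
        s = sum(U[i][j] for i in range(c))
        for i in range(c):
            U[i][j] /= s
    for _ in range(max_iter):
        # Step 2: fuzzy centroids from the current memberships (7.2).
        C = []
        for i in range(c):
            w = [U[i][j] ** m for j in range(n)]
            C.append([sum(w[j] * X[j][k] for j in range(n)) / sum(w)
                      for k in range(dim)])
        # Step 5: hard-assign every point to its closest centroid (7.6).
        S = [[] for _ in range(c)]
        for x in X:
            S[min(range(c), key=lambda i: math.dist(x, C[i]))].append(x)
        # Step 4: mean of each non-empty set (7.5); an empty set keeps its centroid.
        M = [[sum(col) / len(S[i]) for col in zip(*S[i])] if S[i] else C[i]
             for i in range(c)]
        # Assumption: the next membership matrix is derived from these means.
        U_new = memberships(X, M, m)
        # Step 3: threshold test on the change of the membership matrix (7.4).
        change = max(abs(U_new[i][j] - U[i][j])
                     for i in range(c) for j in range(n))
        U = U_new
        if change < eps:
            break
    return M, U


X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
     [9.0, 9.0], [9.0, 10.0], [10.0, 9.0]]
M, U = emfcm_sketch(X, c=2)
```

Under this reading the loop stops as soon as the hard assignments stabilise, because identical sets give identical means and therefore an unchanged membership matrix. Other couplings between the fuzzy and hard steps are equally compatible with the description.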
7. DISCUSSIONS AND RESULT
MATLAB is used for the data analysis: it is used to train and test on the dataset and to measure the
efficiency. The cluster analysis uses the mouse data set, from which 2000 data points were taken. In Fig. 2,
the 2000 data points are grouped into two clusters, cluster 1 and cluster 2, and the centroids are found using
fuzzy c-means clustering. The figure below shows the clusters and centroids; each centroid lies at some
distance from the other data points.

Fig. 2: Clusters in Fuzzy c-means clustering
The centroids are obtained iteratively, and the total sum of distances is calculated for each run. The
iteration counts and total sums of distances are as follows.
11 iterations, total sum of distances = 3028.41
10 iterations, total sum of distances = 3028.41
8 iterations, total sum of distances = 3028.41
8 iterations, total sum of distances = 3028.41
6 iterations, total sum of distances = 3028.41
In Fig. 3 the clusters are divided into smaller parts because fuzzy c-means clustering is applied together
with EM algorithms, which gives better performance than the FCM algorithm alone. The proposed EMFCM algorithm
works iteratively; the objective function value (obj. fcn) at each iteration count is as follows.
Iteration count = 1, obj. fcn = 499.205829
Iteration count = 2, obj. fcn = 413.664260
Iteration count = 3, obj. fcn = 413.609835
Iteration count = 4, obj. fcn = 413.600794
Iteration count = 5, obj. fcn = 413.599276
Iteration count = 6, obj. fcn = 413.599020
Iteration count = 7, obj. fcn = 413.598976
Iteration count = 8, obj. fcn = 413.598968
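The convergence behaviour in these values can be checked with simple arithmetic. The Python sketch below computes the decrease of the objective function between consecutive iterations from the eight values listed above; the tolerance `tol = 1e-5` is an assumed stopping value chosen for illustration and is not stated in the text.

```python
# Objective function values copied from the eight iterations listed above.
obj = [499.205829, 413.664260, 413.609835, 413.600794,
       413.599276, 413.599020, 413.598976, 413.598968]

# Decrease of the objective function between consecutive iterations.
improvements = [prev - cur for prev, cur in zip(obj, obj[1:])]

for k, delta in enumerate(improvements, start=2):
    print(f"iteration {k}: decrease = {delta:.6f}")

# Assumed tolerance: report the first iteration whose decrease falls below it.
tol = 1e-5
stop_at = next(k for k, d in enumerate(improvements, start=2) if d < tol)
```

Almost all of the reduction (about 85.54) happens between the first and second iterations; the decrease then shrinks at every step and first drops below the assumed tolerance at iteration 8, which is where the listing ends.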

Fig. 3: Clustering Shown by the EMFCM Algorithm
The interpolation technique gives a result for every iteration count; these values represent the objective
function (obj. fcn) at each iteration count. This technique is called the method of interpolation.
Figure 4 shows that the error rate decreases steadily; the error values are listed below.
Error_ number = 57
Error_ number = 13
Error_ number = 0

Fig. 4: Comparisons between Error Probabilities of FCM
The proposed algorithm thus reaches an error number of 0, so we can say that the proposed EMFCM algorithm
performs better than the traditional algorithms.
8. CONCLUSION AND FUTURE WORK
Clustering currently gives strong performance across many research proposals, and in some cases enhancing the
clustering technique gives still better performance. Clustering performs well in unsupervised learning. The
proposed algorithm, expectation maximization fuzzy c-means clustering (EMFCM), provides a better result than
fuzzy c-means clustering by avoiding the looping problem and saving time. The EMFCM clustering algorithm
converges fast, in a few iterations, regardless of the initial number of clusters. We also demonstrate that
the quality of our algorithm is the same as that of the FCM algorithm. We check the performance on these
parameters.
- The sum of distances between clusters 1 and 2 and all data points is 3028.41.
- Every iteration count has a different objective function value, and iteration count 1 has the maximum
value (Iteration count = 1, obj. fcn = 499.205829), which indicates the data points that have a bad effect.
- Using the five steps, the EMFCM algorithm saves time and avoids looping. The error rates also decrease,
so the performance of the proposed algorithm increases.
In future, other clustering methods and classifiers can be used for model generation to improve the detection
rate of intrusion detection systems, and multiple classifiers can be used to improve their performance. The
EMFCM algorithm is also helpful for increasing the performance of machine learning, data mining approaches,
image processing, network security, etc.
9. ACKNOWLEDGEMENTS
I wish to express my sincere and profound gratitude to Mr. Gaurav Kumar Tak, Asst. Professor, under whose
supervision and guidance this investigation has been carried out. Without his guidance and constant
supervision, it would not have been possible for me to complete this research paper successfully.
10. REFERENCES
[1] Kusum Bharti, Sanyam Shukla & Shweta Jain, "Intrusion Detection using unsupervised learning,"
International Journal on Computer Science and Engineering, Vol. 02, No. 05, 2010, 1865-1870.
[2] S. Devaraju, Dr. S. Ramakrishnan, "Performance Analysis of Intrusion detection system using various neural
network classifiers," Anna University, Chennai, June 2011.
[3] Sapna S. Kaushik, Dr. Prof. P.R. Deshmukh, "Detection of Attacks in an Intrusion Detection System,"
Amravati, India, 2011.
[4] KDD Cup 1999 Intrusion Detection data. http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html, 2010.
[5] Albayrak & Fatih Amasyal, "Fuzzy C-means clustering on medical diagnostic systems," Technical
University, Computer Engineering Department, 34349, Istanbul, Turkey, 2006.
[6] Vladimir Golovko, Pavel Kachurka, Leanid Vaitsekhovich, "Neural Network Ensembles for Intrusion
Detection," Brest State Technical University, 2007.
[7] Peng Shanguo; Wang Xiwu; Zhong Qigen, "The study of EM algorithm based on forward sampling,"
Electronics, Communications and Control (ICECC), 2011 International Conference on, pp. 4597-4600,
9-11 Sept. 2011. doi: 10.1109/ICECC.2011.6067693
[8] Fisher, D.; Ling Xu; Carnes, J.R.; Reich, Y.; Fenves, J.; Chen, J.; Shiavi, R.; Biswas, G.; Weinberg, J.,
"Applying AI clustering to engineering tasks," IEEE Expert, vol. 8, no. 6, pp. 51-60, Dec. 1993. doi:
10.1109/64.248353
[9] Maria Colmenares & Olaf Wolkenhauer, "An Introduction into Fuzzy Clustering,"
http://www.csc.umist.ac.uk/computing/clustering.htm, July 1998, last update 03 July 2000.
[10] http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/cmeans.html
[11] www.ics.uci.edu/pub/ml-repos/machine-learning-database/, 2001.
[12] Neal, Radford; Hinton, Geoffrey (1999). "A view of the EM algorithm that justifies incremental, sparse,
and other variants." In Michael I. Jordan (ed.), Learning in Graphical Models (Cambridge, MA: MIT Press):
355-368.
[13] Jain, A. K. and Dubes, R. C., Algorithms for Clustering Data (Englewood Cliffs, NJ: Prentice Hall), 1999.
[14] L.I. Kuncheva, "Clustering-and-selection model for classifier combination," Proceedings of Knowledge-Based
Intelligent Engineering Systems and Allied Technologies, Brighton, U.K., 2000, pp. 185-188.
