I. INTRODUCTION
978-1-4799-2572-8/14/$31.00 © 2014 IEEE
II. RELATED WORK
III. MOTIVATION
The conventional datasets contain inconsistencies which
degrades the performance and increase the computational
time. This motivate us to preprocess the datasets and forms an
efficient version which helps researchers to create robust and
cost effective model for intrusion detection.
Algorithm: Remove_duplicate (source_file, dest_file)
Begin
    set count := 0
    size := length(source_file)
    set k := 1
    Repeat for i = 1 to size
        set flag := 0
        move source_file(i) into dest_file(k)
        str := dest_file(k)
        increment k
        set max := length(source_file)
        Repeat for j = i to max
            compare str with source_file(j)
            If match
                remove record source_file(j)
                increment count
                decrement max, j
                flag := 1
            End if
        End for
        If flag
            decrement i
            size := length(source_file)
        End if
    End for
    remove source_file
    rename dest_file as source_file
    print "Number of duplicate records:", count
End
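The deduplication above can be sketched in Python; this is our illustrative sketch, not the paper's implementation. A hash set replaces the nested rescans of source_file but yields the same unique records and the same duplicate count:

```python
def remove_duplicates(records):
    """Single-pass duplicate removal over a list of record strings.

    Equivalent in effect to the nested-scan algorithm above: keeps the
    first occurrence of each record and counts the duplicates dropped.
    """
    seen = set()
    unique = []
    for rec in records:
        if rec in seen:
            continue  # duplicate record: skip it (it is counted below)
        seen.add(rec)
        unique.append(rec)
    duplicate_count = len(records) - len(unique)
    return unique, duplicate_count


# Example: three connection records, one duplicated
recs = ["0,tcp,http,SF,normal", "0,tcp,http,SF,normal", "0,icmp,ecr_i,SF,smurf"]
deduped, n_dups = remove_duplicates(recs)
```

The hash-set version runs in linear time, whereas the pseudocode's rescan of the remaining file for every record is quadratic; on KDD Full (about 4.9 million records) that difference is substantial.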
To fill in missing values, we wrote a procedure named search_fill, which takes the class type of a record and looks it up in class_values. The class_values are computed as the per-column means of each attack class: we first partition the dataset by class category and then store the mean of each column in class_values.
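The class-mean imputation described above can be sketched as follows; the function names and the use of None as the missing marker are our assumptions, since the paper gives no code:

```python
from collections import defaultdict


def class_mean_values(rows, class_idx, missing=None):
    """Build the class_values table: per-class, per-column means,
    ignoring missing entries."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for row in rows:
        label = row[class_idx]
        for j, v in enumerate(row):
            if j == class_idx or v is missing:
                continue
            sums[label][j] += v
            counts[label][j] += 1
    return {label: {j: sums[label][j] / counts[label][j]
                    for j in sums[label]}
            for label in sums}


def search_fill(rows, class_idx, missing=None):
    """Replace each missing value with the mean of its column within
    the record's own attack class (the search_fill idea above)."""
    class_values = class_mean_values(rows, class_idx, missing)
    for row in rows:
        for j, v in enumerate(row):
            if v is missing:
                row[j] = class_values[row[class_idx]][j]
    return rows
```

Imputing with the mean of the record's own class, rather than the global column mean, preserves the per-class statistics that classifiers later rely on.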
The intrusion detection datasets contain high-dimensional data, so it is essential to select only the relevant features. This requires several statistical processing steps, such as feature selection, dimensionality reduction, normalization, and data sub-setting.
Feature selection is an important data processing step. In KDD Cup99, NSL-KDD, and GureKDD the attribute num_outbound_cmds contains only zeros, so we remove it from the dataset before subsequent analysis. Taking 40 of the 42 attributes, we apply PCA and generate scree plots (eigenvalues against attribute number) and the variance matrix, as shown in Figure 2 and Table 5.
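The eigenvalues behind such a scree plot can be computed with plain NumPy; this is a generic PCA sketch (our own, under the assumption that the data matrix holds the 40 numeric attributes as columns), not the paper's code:

```python
import numpy as np


def scree_eigenvalues(X):
    """Return PCA eigenvalues in descending order for a
    samples-by-features matrix X. Plotting them against the
    component index gives the scree plot."""
    Xc = X - X.mean(axis=0)            # center each attribute
    cov = np.cov(Xc, rowvar=False)     # attribute covariance matrix
    eigvals = np.linalg.eigvalsh(cov)  # symmetric matrix -> real, ascending
    return eigvals[::-1]               # largest first


def explained_variance(eigvals, k):
    """Fraction of total variance captured by the top k components."""
    return eigvals[:k].sum() / eigvals.sum()
```

An "elbow" in the scree plot, or a cumulative explained-variance threshold, is the usual criterion for how many components to retain.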
After feature selection and dimensionality reduction, the dataset must be normalized. The attributes are scaled to the range [0, 1] using Equation 1, where x_i denotes the value of the feature, min(x) its minimum value, and max(x) its maximum value. The data available for subsequent analysis are thus real numbers between 0 and 1.
One disadvantage of min-max normalization is that the lowest and highest values remain fixed at 0 and 1. This causes problems in the Naïve Bayes classifier (attributes 15, 20, and 21 have negative variance). To avoid this, we scale the data to the range [-1, 1] using Equation 2, or apply a third normalization process, z-score normalization (Equation 3).
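The three normalizations (Equations 1-3) can be sketched with NumPy; the function names are ours:

```python
import numpy as np


def minmax_01(x):
    """Equation 1: scale a feature vector to the range [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())


def minmax_pm1(x):
    """Equation 2: scale a feature vector to the range [-1, 1]."""
    return 2 * (x - x.min()) / (x.max() - x.min()) - 1


def zscore(x):
    """Equation 3: z-score normalization (zero mean, unit std dev)."""
    return (x - x.mean()) / x.std()
```

Note that min-max variants pin the extremes to the interval endpoints, while the z-score maps values onto an unbounded scale centered at zero, which avoids the fixed 0/1 endpoints discussed above.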
Algorithm: Remove redundancy samples
X'_i = (X_i - X_min) / (X_max - X_min)            (1)    [scales X_i into 0 to 1]

X'_i = 2 (X_i - X_min) / (X_max - X_min) - 1      (2)    [scales X_i into -1 to 1]

X'_i = (X_i - X̄) / S                              (3)    [z-score normalization]

where X_i is each data point, X̄ is the average of all sample data points, and S is the standard deviation.
Table 1: Details of attack categories in the KDD Cup99 Full and 10% datasets.

| Class           | KDDCup99 Full | After removing duplicates | % reduction | KDDCup99 10% | After removing duplicates | % reduction | Category |
|-----------------|---------------|---------------------------|-------------|--------------|---------------------------|-------------|----------|
| Normal          | 972781        | 812814                    | 16.44       | 97278        | 87832                     | 9.71        | NORMAL   |
| Back            | 2203          | 968                       | 56.06       | 2203         | 968                       | 54.88       | DOS      |
| Pod             | 264           | 206                       | 21.97       | 264          | 206                       | 21.97       | DOS      |
| Land            | 21            | 19                        | 9.52        | 21           | 19                        | 9.52        | DOS      |
| Smurf           | 2807886       | 3007                      | 99.89       | 280790       | 641                       | 99.77       | DOS      |
| Teardrop        | 979           | 918                       | 6.23        | 979          | 918                       | 6.23        | DOS      |
| Neptune         | 1072017       | 242149                    | 77.41       | 107201       | 51820                     | 51.66       | DOS      |
| Nmap            | 2316          | 1554                      | 32.90       | 231          | 158                       | 31.60       | PROBE    |
| Satan           | 15892         | 5019                      | 68.42       | 1589         | 906                       | 42.86       | PROBE    |
| Ipsweep         | 12481         | 3723                      | 70.17       | 1247         | 651                       | 47.79       | PROBE    |
| Portsweep       | 10413         | 3564                      | 65.77       | 1040         | 416                       | 60.00       | PROBE    |
| Phf             | 4             | 4                         | 0.00        | 4            | 4                         | 0.00        | R2L      |
| Guess_pwd       | 53            | 53                        | 0.00        | 53           | 53                        | 0.00        | R2L      |
| Ftp_write       | 8             | 8                         | 0.00        | 8            | 8                         | 0.00        | R2L      |
| Imap            | 12            | 12                        | 0.00        | 12           | 12                        | 0.00        | R2L      |
| Spy             | 2             | 2                         | 0.00        | 2            | 2                         | 0.00        | R2L      |
| Multihop        | 7             | 7                         | 0.00        | 7            | 7                         | 0.00        | R2L      |
| Warezclient     | 1020          | 893                       | 12.45       | 1020         | 893                       | 12.45       | R2L      |
| Warezmaster     | 20            | 20                        | 0.00        | 20           | 20                        | 0.00        | R2L      |
| Buffer_Overflow | 30            | 30                        | 0.00        | 30           | 30                        | 0.00        | U2R      |
| Loadmodule      | 9             | 9                         | 0.00        | 9            | 9                         | 0.00        | U2R      |
| Perl            | 3             | 3                         | 0.00        | 3            | 3                         | 0.00        | U2R      |
| Rootkit         | 10            | 10                        | 0.00        | 10           | 10                        | 0.00        | U2R      |
| Total           | 4898431       | 1074992                   | 78.05       | 494021       | 145586                    | 70.53       |          |
Table 2: Attack distribution in the KDD Cup datasets (KDD Full, KDD 10%, and KDD Corrected).

| Dataset                                  | DoS     | U2R | R2L   | Probe | Normal | Total   |
|------------------------------------------|---------|-----|-------|-------|--------|---------|
| KDD Full                                 | 3883370 | 52  | 1126  | 41102 | 972781 | 4898431 |
| KDD Full after removing duplicates       | 247267  | 52  | 999   | 13860 | 812814 | 1074992 |
| KDD 10%                                  | 391458  | 52  | 1126  | 4107  | 97278  | 494021  |
| KDD 10% after removing duplicates        | 54598   | 52  | 999   | 2133  | 87832  | 145586  |
| KDD Corrected                            | 229269  | 70  | 16172 | 4925  | 60593  | 311029  |
| KDD Corrected after removing duplicates  | 22984   | 70  | 2898  | 3426  | 47913  | 77291   |
Table 3: Experiments and their results on the three datasets (time in seconds).

| Dataset           | Samples | Normalized | SVM time | SVM acc | Decision Tree time | Decision Tree acc | KNN time | KNN acc | K-Means time | K-Means acc |
|-------------------|---------|------------|----------|---------|--------------------|-------------------|----------|---------|--------------|-------------|
| NSLKDD 20% Train  | 25192   | NO         | 76.59    | 99.99   | 4.06               | 99.91             | 43.23    | 99.99   | 5.9          | 53.19       |
| NSLKDD 20% Train  | 25192   | YES        | 4.45     | 98.122  | 4.7                | 99.91             | 37.69    | 99.99   | 4.15         | 88.31       |
| KDDCup corrected  | 77291   | NO         | 192.2    | 99.12   | 89                 | 99.236            | 799.42   | 99.9    | 57.56        | 50.12       |
| KDDCup corrected  | 77291   | YES        | 165.8    | 99.23   | 72.48              | 99.81             | 799.42   | 99.9    | 65.32        | 90.33       |
| GureKDD           | 160904  | NO         | 725.6    | 99.12   | 2100               | 99.02             | 3002.2   | 99.97   | 780          | 85.77       |
| GureKDD           | 160904  | YES        | 123.2    | 99.08   | 1171               | 99.96             | 2712.3   | 99.99   | 711.2        | 82.61       |
Table 4: Attack categories and total samples present in the NSL-KDD and GureKDD datasets.

NSL-KDD:

| Category        | Train Full | Train 20% | KDD_Test+ instances |
|-----------------|------------|-----------|---------------------|
| Normal          | 67343      | 13449     | 9711                |
| Apache2         | 0          | 0         | 737                 |
| Back            | 956        | 196       | 359                 |
| BufferOverflow  | 30         | 6         | 20                  |
| Ftp_write       | 8          | 1         | 3                   |
| Guess_pwd       | 53         | 10        | 1231                |
| HttpTunnel      | 0          | 0         | 133                 |
| Imap            | 11         | 5         | 1                   |
| Ipsweep         | 3599       | 710       | 141                 |
| Land            | 18         | 1         | 7                   |
| Loadmodule      | 9          | 1         | 2                   |
| MailBomb        | 0          | 0         | 293                 |
| Mscan           | 0          | 0         | 996                 |
| Multihop        | 7          | 2         | 18                  |
| Named           | 0          | 0         | 17                  |
| Neptune         | 41214      | 8282      | 4657                |
| Nmap            | 1493       | 301       | 73                  |
| Perl            | 3          | 0         | 2                   |
| Phf             | 4          | 2         | 2                   |
| Pod             | 201        | 38        | 41                  |
| Portsweep       | 2931       | 587       | 157                 |
| ProcessTable    | 0          | 0         | 685                 |
| Ps              | 0          | 0         | 15                  |
| Rootkit         | 10         | 4         | 13                  |
| Saint           | 0          | 0         | 319                 |
| Satan           | 3633       | 691       | 735                 |
| SendMail        | 0          | 0         | 14                  |
| Smurf           | 2646       | 529       | 665                 |
| SnmpGetAttack   | 0          | 0         | 178                 |
| SnmpGuess       | 0          | 0         | 331                 |
| Spy             | 2          | 1         | 0                   |
| SqlAttack       | 0          | 0         | 2                   |
| Teardrop        | 892        | 188       | 12                  |
| UdpStorm        | 0          | 0         | 2                   |
| Warezclient     | 890        | 181       | 0                   |
| Warezmaster     | 20         | 7         | 944                 |
| Worm            | 0          | 0         | 2                   |
| Xlock           | 0          | 0         | 9                   |
| Xsnoop          | 0          | 0         | 4                   |
| Xterm           | 0          | 0         | 13                  |
| Total           | 125973     | 25192     | 22544               |

GureKDD:

| Category     | Original | After removing duplicates | % reduction |
|--------------|----------|---------------------------|-------------|
| Normal       | 174873   | 157048                    | 10.19       |
| Anomaly      | 9        | 9                         | 0.00        |
| Dict         | 880      | 878                       | 0.23        |
| Dict_simple  | 2        | 1                         | 50.00       |
| Eject        | 12       | 11                        | 8.33        |
| Eject_fail   | 2        | 1                         | 50.00       |
| Ffb          | 11       | 10                        | 9.09        |
| Ffb_clear    | 2        | 1                         | 50.00       |
| Format       | 7        | 6                         | 14.29       |
| Formatfail   | 2        | 1                         | 50.00       |
| Format_clear | 2        | 1                         | 50.00       |
| Ftp_write    | 9        | 8                         | 11.11       |
| Guest        | 51       | 50                        | 1.96        |
| Imap         | 8        | 7                         | 12.50       |
| Land         | 36       | 17                        | 52.78       |
| Load_clear   | 2        | 1                         | 50.00       |
| Loadmodule   | 2        | 1                         | 50.00       |
| Multihop     | 9        | 6                         | 33.33       |
| Perl_clear   | 2        | 1                         | 50.00       |
| Perl_magic   | 5        | 4                         | 20.00       |
| Phf          | 6        | 5                         | 16.67       |
| Rootkit      | 30       | 29                        | 3.33        |
| Spy          | 3        | 2                         | 33.33       |
| Sys_log      | 5        | 3                         | 40.00       |
| Teardrop     | 1086     | 1083                      | 0.28        |
| Warezclient  | 1750     | 1692                      | 3.31        |
| Warezmaster  | 20       | 19                        | 5.00        |
| Total        | 178835   | 160904                    | 10.03       |