I. INTRODUCTION
978-1-4799-2572-8/14/$31.00 © 2014 IEEE
II. RELATED WORK
III. MOTIVATION
The conventional datasets contain inconsistencies which
degrades the performance and increase the computational
time. This motivate us to preprocess the datasets and forms an
efficient version which helps researchers to create robust and
cost effective model for intrusion detection.
Algorithm: Remove_duplicate (source_file, dest_file)
Begin
    set count := 0
    size := length(source_file)
    set k := 1
    Repeat for i = 1 to size
        set flag := 0
        move source_file(i) into dest_file(k)
        str := dest_file(k)
        increment k
        set max := length(source_file)
        Repeat for j = i to max
            compare str with source_file(j)
            If match
                remove record source_file(j)
                increment count
                decrement max, j
                flag := 1
            End if
        End for
        If flag
            decrement i
            size := length(source_file)
        End if
    End for
    remove source_file
    rename dest_file as source_file
    print "Number of duplicate records:", count
End
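The deduplication above can be sketched in Python; this is our illustrative sketch, not the paper's implementation. A hash set replaces the nested rescans of source_file but yields the same unique records and the same duplicate count:

```python
def remove_duplicates(records):
    """Single-pass duplicate removal over a list of record strings.

    Equivalent in effect to the nested-scan algorithm above: keeps the
    first occurrence of each record and counts the duplicates dropped.
    """
    seen = set()
    unique = []
    for rec in records:
        if rec in seen:
            continue  # duplicate record: skip it (it is counted below)
        seen.add(rec)
        unique.append(rec)
    duplicate_count = len(records) - len(unique)
    return unique, duplicate_count


# Example: three connection records, one duplicated
recs = ["0,tcp,http,SF,normal", "0,tcp,http,SF,normal", "0,icmp,ecr_i,SF,smurf"]
deduped, n_dups = remove_duplicates(recs)
```

The hash-set version runs in linear time, whereas the pseudocode's rescan of the remaining file for every record is quadratic; on KDD Full (about 4.9 million records) that difference is substantial.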
To fill in missing values, we wrote a procedure named search_fill, which takes the class type of a record and looks it up in class_values. The class_values are computed as the per-column means of each attack class: we first partition the dataset by class category and then store the mean of each column in class_values.
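The class-mean imputation described above can be sketched as follows; the function names and the use of None as the missing marker are our assumptions, since the paper gives no code:

```python
from collections import defaultdict


def class_mean_values(rows, class_idx, missing=None):
    """Build the class_values table: per-class, per-column means,
    ignoring missing entries."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for row in rows:
        label = row[class_idx]
        for j, v in enumerate(row):
            if j == class_idx or v is missing:
                continue
            sums[label][j] += v
            counts[label][j] += 1
    return {label: {j: sums[label][j] / counts[label][j]
                    for j in sums[label]}
            for label in sums}


def search_fill(rows, class_idx, missing=None):
    """Replace each missing value with the mean of its column within
    the record's own attack class (the search_fill idea above)."""
    class_values = class_mean_values(rows, class_idx, missing)
    for row in rows:
        for j, v in enumerate(row):
            if v is missing:
                row[j] = class_values[row[class_idx]][j]
    return rows
```

Imputing with the mean of the record's own class, rather than the global column mean, preserves the per-class statistics that classifiers later rely on.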
The intrusion detection datasets contain high-dimensional data, so it is essential to select only the relevant features. This requires several statistical processing steps, such as feature selection, dimensionality reduction, normalization, and data sub-setting.
Feature selection is an important data processing step. In KDD Cup99, NSL-KDD, and GureKDD the attribute num_outbound_cmds contains only zeros, so we remove it from the dataset before subsequent analysis. Taking 40 of the 42 attributes, we apply PCA and generate scree plots (eigenvalues against attribute number) and the variance matrix, as shown in Figure 2 and Table 5.
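The eigenvalues behind such a scree plot can be computed with plain NumPy; this is a generic PCA sketch (our own, under the assumption that the data matrix holds the 40 numeric attributes as columns), not the paper's code:

```python
import numpy as np


def scree_eigenvalues(X):
    """Return PCA eigenvalues in descending order for a
    samples-by-features matrix X. Plotting them against the
    component index gives the scree plot."""
    Xc = X - X.mean(axis=0)            # center each attribute
    cov = np.cov(Xc, rowvar=False)     # attribute covariance matrix
    eigvals = np.linalg.eigvalsh(cov)  # symmetric matrix -> real, ascending
    return eigvals[::-1]               # largest first


def explained_variance(eigvals, k):
    """Fraction of total variance captured by the top k components."""
    return eigvals[:k].sum() / eigvals.sum()
```

An "elbow" in the scree plot, or a cumulative explained-variance threshold, is the usual criterion for how many components to retain.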
After feature selection and dimensionality reduction, the dataset must be normalized. The attributes are scaled to the range [0, 1] using Equation 1, where x_i denotes the value of the feature, min(x) its minimum value, and max(x) its maximum value. The data available for subsequent analysis are thus real numbers between 0 and 1.
One disadvantage of min-max normalization is that the lowest and highest values remain fixed at 0 and 1. This causes problems in the Naïve Bayes classifier (attributes 15, 20, and 21 have negative variance). To avoid this, we scale the data to the range [-1, 1] using Equation 2, or apply a third normalization process, z-score normalization (Equation 3).
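The three normalizations (Equations 1-3) can be sketched with NumPy; the function names are ours:

```python
import numpy as np


def minmax_01(x):
    """Equation 1: scale a feature vector to the range [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())


def minmax_pm1(x):
    """Equation 2: scale a feature vector to the range [-1, 1]."""
    return 2 * (x - x.min()) / (x.max() - x.min()) - 1


def zscore(x):
    """Equation 3: z-score normalization (zero mean, unit std dev)."""
    return (x - x.mean()) / x.std()
```

Note that min-max variants pin the extremes to the interval endpoints, while the z-score maps values onto an unbounded scale centered at zero, which avoids the fixed 0/1 endpoints discussed above.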
Algorithm: Remove redundancy samples
X'_i = (X_i - X_min) / (X_max - X_min)            (1)    [scales X_i into 0 to 1]

X'_i = 2 (X_i - X_min) / (X_max - X_min) - 1      (2)    [scales X_i into -1 to 1]

X'_i = (X_i - X̄) / S                              (3)    [z-score normalization]

where X_i is each data point, X̄ is the average of all sample data points, and S is the standard deviation.
Table 1: Details of attack categories in the KDD Cup99 Full and 10% datasets.

| Class           | KDDCup99 Full | After removing duplicates | % reduction | KDDCup99 10% | After removing duplicates | % reduction | Category |
|-----------------|---------------|---------------------------|-------------|--------------|---------------------------|-------------|----------|
| Normal          | 972781        | 812814                    | 16.44       | 97278        | 87832                     | 9.71        | NORMAL   |
| Back            | 2203          | 968                       | 56.06       | 2203         | 968                       | 54.88       | DOS      |
| Pod             | 264           | 206                       | 21.97       | 264          | 206                       | 21.97       | DOS      |
| Land            | 21            | 19                        | 9.52        | 21           | 19                        | 9.52        | DOS      |
| Smurf           | 2807886       | 3007                      | 99.89       | 280790       | 641                       | 99.77       | DOS      |
| Teardrop        | 979           | 918                       | 6.23        | 979          | 918                       | 6.23        | DOS      |
| Neptune         | 1072017       | 242149                    | 77.41       | 107201       | 51820                     | 51.66       | DOS      |
| Nmap            | 2316          | 1554                      | 32.90       | 231          | 158                       | 31.60       | PROBE    |
| Satan           | 15892         | 5019                      | 68.42       | 1589         | 906                       | 42.86       | PROBE    |
| Ipsweep         | 12481         | 3723                      | 70.17       | 1247         | 651                       | 47.79       | PROBE    |
| Portsweep       | 10413         | 3564                      | 65.77       | 1040         | 416                       | 60.00       | PROBE    |
| Phf             | 4             | 4                         | 0.00        | 4            | 4                         | 0.00        | R2L      |
| Guess_pwd       | 53            | 53                        | 0.00        | 53           | 53                        | 0.00        | R2L      |
| Ftp_write       | 8             | 8                         | 0.00        | 8            | 8                         | 0.00        | R2L      |
| Imap            | 12            | 12                        | 0.00        | 12           | 12                        | 0.00        | R2L      |
| Spy             | 2             | 2                         | 0.00        | 2            | 2                         | 0.00        | R2L      |
| Multihop        | 7             | 7                         | 0.00        | 7            | 7                         | 0.00        | R2L      |
| Warezclient     | 1020          | 893                       | 12.45       | 1020         | 893                       | 12.45       | R2L      |
| Warezmaster     | 20            | 20                        | 0.00        | 20           | 20                        | 0.00        | R2L      |
| Buffer_Overflow | 30            | 30                        | 0.00        | 30           | 30                        | 0.00        | U2R      |
| Loadmodule      | 9             | 9                         | 0.00        | 9            | 9                         | 0.00        | U2R      |
| Perl            | 3             | 3                         | 0.00        | 3            | 3                         | 0.00        | U2R      |
| Rootkit         | 10            | 10                        | 0.00        | 10           | 10                        | 0.00        | U2R      |
| Total           | 4898431       | 1074992                   | 78.05       | 494021       | 145586                    | 70.53       |          |
Table 2: Attack distribution in the KDD Cup datasets (KDD Full, KDD 10%, and KDD Corrected).

| Dataset                                  | DoS     | U2R | R2L   | Probe | Normal | Total   |
|------------------------------------------|---------|-----|-------|-------|--------|---------|
| KDD Full                                 | 3883370 | 52  | 1126  | 41102 | 972781 | 4898431 |
| KDD Full after removing duplicates       | 247267  | 52  | 999   | 13860 | 812814 | 1074992 |
| KDD 10%                                  | 391458  | 52  | 1126  | 4107  | 97278  | 494021  |
| KDD 10% after removing duplicates        | 54598   | 52  | 999   | 2133  | 87832  | 145586  |
| KDD Corrected                            | 229269  | 70  | 16172 | 4925  | 60593  | 311029  |
| KDD Corrected after removing duplicates  | 22984   | 70  | 2898  | 3426  | 47913  | 77291   |
Table 3: Experiments and their results on the three datasets (time in seconds).

| Dataset           | Samples | Normalized | SVM time | SVM acc | Decision Tree time | Decision Tree acc | KNN time | KNN acc | K-Means time | K-Means acc |
|-------------------|---------|------------|----------|---------|--------------------|-------------------|----------|---------|--------------|-------------|
| NSLKDD 20% Train  | 25192   | NO         | 76.59    | 99.99   | 4.06               | 99.91             | 43.23    | 99.99   | 5.9          | 53.19       |
| NSLKDD 20% Train  | 25192   | YES        | 4.45     | 98.122  | 4.7                | 99.91             | 37.69    | 99.99   | 4.15         | 88.31       |
| KDDCup corrected  | 77291   | NO         | 192.2    | 99.12   | 89                 | 99.236            | 799.42   | 99.9    | 57.56        | 50.12       |
| KDDCup corrected  | 77291   | YES        | 165.8    | 99.23   | 72.48              | 99.81             | 799.42   | 99.9    | 65.32        | 90.33       |
| GureKDD           | 160904  | NO         | 725.6    | 99.12   | 2100               | 99.02             | 3002.2   | 99.97   | 780          | 85.77       |
| GureKDD           | 160904  | YES        | 123.2    | 99.08   | 1171               | 99.96             | 2712.3   | 99.99   | 711.2        | 82.61       |
Table 4: Attack categories and total samples present in the NSL-KDD and GureKDD datasets.

NSL-KDD:

| Category        | Train Full | Train 20% | KDD_Test+ instances |
|-----------------|------------|-----------|---------------------|
| Normal          | 67343      | 13449     | 9711                |
| Apache2         | 0          | 0         | 737                 |
| Back            | 956        | 196       | 359                 |
| BufferOverflow  | 30         | 6         | 20                  |
| Ftp_write       | 8          | 1         | 3                   |
| Guess_pwd       | 53         | 10        | 1231                |
| HttpTunnel      | 0          | 0         | 133                 |
| Imap            | 11         | 5         | 1                   |
| Ipsweep         | 3599       | 710       | 141                 |
| Land            | 18         | 1         | 7                   |
| Loadmodule      | 9          | 1         | 2                   |
| MailBomb        | 0          | 0         | 293                 |
| Mscan           | 0          | 0         | 996                 |
| Multihop        | 7          | 2         | 18                  |
| Named           | 0          | 0         | 17                  |
| Neptune         | 41214      | 8282      | 4657                |
| Nmap            | 1493       | 301       | 73                  |
| Perl            | 3          | 0         | 2                   |
| Phf             | 4          | 2         | 2                   |
| Pod             | 201        | 38        | 41                  |
| Portsweep       | 2931       | 587       | 157                 |
| ProcessTable    | 0          | 0         | 685                 |
| Ps              | 0          | 0         | 15                  |
| Rootkit         | 10         | 4         | 13                  |
| Saint           | 0          | 0         | 319                 |
| Satan           | 3633       | 691       | 735                 |
| SendMail        | 0          | 0         | 14                  |
| Smurf           | 2646       | 529       | 665                 |
| SnmpGetAttack   | 0          | 0         | 178                 |
| SnmpGuess       | 0          | 0         | 331                 |
| Spy             | 2          | 1         | 0                   |
| SqlAttack       | 0          | 0         | 2                   |
| Teardrop        | 892        | 188       | 12                  |
| UdpStorm        | 0          | 0         | 2                   |
| Warezclient     | 890        | 181       | 0                   |
| Warezmaster     | 20         | 7         | 944                 |
| Worm            | 0          | 0         | 2                   |
| Xlock           | 0          | 0         | 9                   |
| Xsnoop          | 0          | 0         | 4                   |
| Xterm           | 0          | 0         | 13                  |
| Total           | 125973     | 25192     | 22544               |

GureKDD:

| Category     | Original | After removing duplicates | % reduction |
|--------------|----------|---------------------------|-------------|
| Normal       | 174873   | 157048                    | 10.19       |
| Anomaly      | 9        | 9                         | 0.00        |
| Dict         | 880      | 878                       | 0.23        |
| Dict_simple  | 2        | 1                         | 50.00       |
| Eject        | 12       | 11                        | 8.33        |
| Eject_fail   | 2        | 1                         | 50.00       |
| Ffb          | 11       | 10                        | 9.09        |
| Ffb_clear    | 2        | 1                         | 50.00       |
| Format       | 7        | 6                         | 14.29       |
| Formatfail   | 2        | 1                         | 50.00       |
| Format_clear | 2        | 1                         | 50.00       |
| Ftp_write    | 9        | 8                         | 11.11       |
| Guest        | 51       | 50                        | 1.96        |
| Imap         | 8        | 7                         | 12.50       |
| Land         | 36       | 17                        | 52.78       |
| Load_clear   | 2        | 1                         | 50.00       |
| Loadmodule   | 2        | 1                         | 50.00       |
| Multihop     | 9        | 6                         | 33.33       |
| Perl_clear   | 2        | 1                         | 50.00       |
| Perl_magic   | 5        | 4                         | 20.00       |
| Phf          | 6        | 5                         | 16.67       |
| Rootkit      | 30       | 29                        | 3.33        |
| Spy          | 3        | 2                         | 33.33       |
| Sys_log      | 5        | 3                         | 40.00       |
| Teardrop     | 1086     | 1083                      | 0.28        |
| Warezclient  | 1750     | 1692                      | 3.31        |
| Warezmaster  | 20       | 19                        | 5.00        |
| Total        | 178835   | 160904                    | 10.03       |