
Intrusion Detection Using Backpropagation and GA Weight Extraction and Its Comparative Studies

Under the esteemed guidance of

P. Srinivasu, Associate Professor, Department of Computer Science.

M.Harshavardhan D.Ramesh J.V.Phanindra Kumar K.Jagadish Kumar N.N.V.Adithya

690752064 690752022 690752040 690752045 690752066

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING, ANIL NEERUKONDA INSTITUTE OF TECHNOLOGY AND SCIENCES, SANGIVALASA, VISAKHAPATNAM

ABSTRACT: For the last decade it has been commonplace to evaluate machine learning techniques for network-based intrusion detection on the KDD Cup 99 data set, which has served well to demonstrate that machine learning can be useful in intrusion detection. Backpropagation is applied to the data to calculate the weights of the nodes in an Artificial Neural Network (ANN). In this project, we propose an alternate approach that uses GA weight extraction to evolve the weights of the ANN. The weights thus obtained are used in the testing phase, where each record is checked to determine whether it is normal or attacked. Testing is performed using both techniques and the corresponding accuracy rates are calculated.

INPUT: The input is the standard Knowledge Discovery dataset (KDD Cup 99) provided by the Lincoln Laboratory at MIT, based on the binary TCP dump data from the DARPA evaluation. Since one cannot know the intention (benign or malicious) of every connection on a real-world network, the data was generated artificially using a closed network, some proprietary network traffic generators, and hand-injected attacks. Millions of connection statistics were collected and generated to form the training and test data in the classifier-learning context; each connection is well defined as normal or as one of several types of attacks over TCP, UDP, ICMP, etc. The KDD training dataset consists of approximately 4,900,000 single-connection vectors, each of which contains 41 features and is labeled as either normal or an attack, with exactly one specific attack type. The simulated attacks fall into one of the following four categories:
1) Denial of Service Attack (DoS)
2) User to Root Attack (U2R)
3) Remote to Local Attack (R2L)
4) Probing Attack
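As a concrete starting point, the records and their normal/attack labels can be read as below. This is a minimal sketch, assuming the 10% KDD file is available locally as a headerless CSV named kddcup.data_10_percent; the filename and the binary label mapping are our choices, not specified in the paper:

```python
# Minimal sketch: load the KDD Cup 99 file and map each label to
# normal (0) / attack (1). Assumes a headerless CSV with 41 features
# plus one label per row; the path is an assumption.
import csv

def load_kdd(path="kddcup.data_10_percent"):
    records, labels = [], []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if len(row) != 42:              # 41 features + 1 label
                continue                    # skip malformed lines
            records.append(row[:41])
            # Labels look like "normal." or a specific attack name
            labels.append(0 if row[41].startswith("normal") else 1)
    return records, labels

records, labels = load_kdd()
print(len(records), "connections,", sum(labels), "attacks")
```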

BACK PROPAGATION ALGORITHM: The backpropagation neural network is essentially a network of simple processing elements working together to produce a complex output. These elements, or nodes, are arranged into different layers: input, middle (hidden), and output. The output from a backpropagation neural network is computed using a procedure known as the forward pass.

* The input layer propagates a particular input vector's components to each node in the middle layer.
* Middle-layer nodes compute output values, which become inputs to the nodes of the output layer.
* The output-layer nodes compute the network output for the particular input vector.

The forward pass produces an output vector for a given input vector based on the current state of the network weights. Since the network weights are initialized to random values, it is unlikely that reasonable outputs will result before training. The weights are adjusted to reduce the error by propagating the output error backward through the network. This process is where the backpropagation neural network gets its name and is known as the backward pass:

* Attribute values in the dataset are given as inputs to the input layer of the neural network.
* The input to the hidden layer of the neural network is the weighted sum of the attributes obtained from the input layer.
* GA weight extraction is applied to the nodes in the hidden layer; this linear function is used to calculate the weights of the intermediate nodes.
* A genetic algorithm is used to process the weighted sum of the hidden-layer outputs.
* The computed output is compared with the expected output and the error is calculated.
* The error value is backpropagated and the weights are adjusted in each neuron.
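The forward pass can be made concrete with a short sketch. This is a minimal illustration assuming a sigmoid activation and a single hidden layer; the names W1, b1, W2, b2 and the array shapes are our conventions, not the paper's:

```python
# Forward-pass sketch for an l-m-n network with sigmoid activation.
# W1 is m x l, b1 has length m, W2 is n x m, b2 has length n.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    hidden = sigmoid(W1 @ x + b1)         # weighted sum -> hidden layer
    output = sigmoid(W2 @ hidden + b2)    # network output for this input
    return output, hidden
```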

The training set is repeatedly presented to the network and the weight values are adjusted until the overall error is below a predetermined tolerance. Since the delta rule follows the path of steepest descent along the error surface, local minima can impede training. The momentum term compensates for this problem to some degree (see the update rule below).
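For reference, a standard formulation of the delta rule with momentum is given below; the symbols η (learning rate) and α (momentum coefficient) are conventional and are not defined in the paper itself:

$$\Delta w_{ij}(t) = -\eta \, \frac{\partial E}{\partial w_{ij}} + \alpha \, \Delta w_{ij}(t-1)$$

The momentum term carries a fraction of the previous weight update forward, which helps the search roll through shallow local minima on the error surface.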

The training procedure is summarized below (a runnable sketch follows the steps):

1. Initialize the synaptic weights and bias randomly.
2. Attribute values in the dataset are given as inputs at the input layer of the neural network.
3. The weighted sum of the attribute values is calculated and given as input to the hidden layer of the neural network, along with the bias.
4. At each node in the hidden layer, the output is calculated by applying a linear function such as GA weight extraction.
5. The weighted sum of the outputs of the final hidden layer is processed with genetic algorithms.
6. Compare with the expected output for the given training data.
7. Calculate the error.
8. Backpropagate the error value and adjust the synaptic weights and the bias values at each neuron.
9. If the error is less than the stipulated tolerance, or the training data is exhausted, stop training; otherwise, repeat from step 2.
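A compact sketch of this loop, reusing the forward() helper from the earlier snippet; the learning rate eta, the tolerance tol, and the sigmoid-derivative gradient expressions are standard backpropagation choices rather than values taken from the paper:

```python
# Training-loop sketch mirroring the flowchart above. forward() is
# the helper from the earlier snippet; eta and tol are illustrative.
import numpy as np

def train(X, Y, W1, b1, W2, b2, eta=0.1, tol=1e-3, max_epochs=100):
    for _ in range(max_epochs):
        total_error = 0.0
        for x, y in zip(X, Y):
            out, hidden = forward(x, W1, b1, W2, b2)
            err = y - out                               # expected - actual
            total_error += float(err @ err)
            # Backward pass: derivative of sigmoid is out * (1 - out)
            delta_out = err * out * (1 - out)
            delta_hid = (W2.T @ delta_out) * hidden * (1 - hidden)
            W2 += eta * np.outer(delta_out, hidden)     # adjust weights
            b2 += eta * delta_out                       # and biases
            W1 += eta * np.outer(delta_hid, x)
            b1 += eta * delta_hid
        if total_error < tol:                           # stopping condition
            break
    return W1, b1, W2, b2
```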

ARCHITECTURE OF THE PROJECT:

GA-WEIGHT EXTRACTION: Genetic algorithms provide an approach to learning that is based loosely on simulated evolution. Hypotheses are often described by bit strings whose interpretation depends on the application, though hypotheses may also be described by symbolic expressions or even computer programs. The search for an appropriate hypothesis begins with a population, or collection, of initial hypotheses. Members of the current population give rise to the next-generation population by means of random mutation and crossover, which are patterned after processes in biological evolution. At each step, the hypotheses in the current population are evaluated relative to a given measure of fitness, with the most fit hypotheses selected probabilistically as seeds for producing the next generation. When a GA is used, three factors have an impact on the effectiveness of the algorithm:
1) the representation of individuals,
2) the selection of the fitness function, and
3) the values of the GA parameters.
The determination of the values of these parameters depends on the application. In this work we have represented the individuals such that each individual chromosome contains a number of genes representing the weights of the neural network, as shown in Fig. 1.

Fig 1: Chromosome representation of the weights (gene 1 … gene N, each gene holding the real-coded value of weight 1 … weight N).

Each gene is a real-coded value of the corresponding weight. The number of genes in a chromosome depends on the number of neurons in the neural network. Assume that the NN configuration is l-m-n (l input neurons, m hidden neurons, and n output neurons); then the number of weights to be determined is (l+n)*m [8]. If we further assume that each gene is of length d, then a string S of decimal values representing the (l+n)*m weights, and therefore having string length L = (l+n)*m*d, is randomly generated. We have to extract the weights from one such string so that they can be used by the NN.

Algorithm: GA-Weight Extract
Input: Number of generations, normalized training dataset
Output: Optimized weights to be used by the NN
Method:
Step 1. Start
Step 2. Generation Gen ← 1
Step 3. Generate the initial population P_gen of real-coded chromosomes
Step 4. While the termination condition is not met, begin
Step 5. Generate fitness values
Step 6. Generate the mating pool by considering the best-fit chromosomes
Step 7. Generate the offspring by applying the crossover operator on the mating-pool chromosomes
Step 8. Gen ← Gen + 1
Step 9. Let the newly generated population be P_gen
Step 10. Go to Step 5; end
Step 11. Extract weights from P_gen to be used by the NN
Step 12. Stop
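A minimal sketch of this algorithm is given below, treating each chromosome as a flat vector of (l+n)*m real-coded genes and using the negative training error of the decoded network as the fitness. The population size, crossover scheme, mutation scale, and the bias-free error function are illustrative assumptions, not the paper's parameters:

```python
# Sketch of GA-Weight Extract: decode chromosomes into NN weights,
# score them by training error, and evolve with selection + crossover.
# Parameter values and the bias-free network are our assumptions.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decode(chrom, l, m, n):
    # First l*m genes -> input-to-hidden weights, rest -> hidden-to-output
    return chrom[: l * m].reshape(m, l), chrom[l * m :].reshape(n, m)

def network_error(weights, X, Y):
    W1, W2 = weights                      # biases omitted for brevity
    preds = np.array([sigmoid(W2 @ sigmoid(W1 @ x)) for x in X])
    return float(np.mean((preds - Y) ** 2))

def ga_weight_extract(X, Y, l, m, n, pop_size=50, generations=100, seed=0):
    rng = np.random.default_rng(seed)
    pop = rng.uniform(-1, 1, size=(pop_size, (l + n) * m))   # Step 3
    for _ in range(generations):                             # Steps 4-10
        fit = np.array([-network_error(decode(c, l, m, n), X, Y) for c in pop])
        best = pop[np.argsort(fit)[-pop_size // 2:]]         # mating pool
        children = []
        for _ in range(pop_size - len(best)):
            p1, p2 = best[rng.integers(len(best), size=2)]
            cut = rng.integers(1, pop.shape[1])              # one-point crossover
            child = np.concatenate([p1[:cut], p2[cut:]])
            child += rng.normal(0, 0.05, size=child.shape)   # small mutation
            children.append(child)
        pop = np.vstack([best, children])
    fit = np.array([-network_error(decode(c, l, m, n), X, Y) for c in pop])
    return decode(pop[np.argmax(fit)], l, m, n)              # Step 11
```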

FEATURE SELECTION: 12-feature set:

| Name of the Feature    | Description                                                                             | Attribute type |
|------------------------|-----------------------------------------------------------------------------------------|----------------|
| Service                | Destination service                                                                     | Discrete       |
| Source bytes           | Bytes sent from source to destination                                                   | Discrete       |
| Destination bytes      | Bytes sent from destination to source                                                   | Discrete       |
| Logged in              | 1 if successfully logged in; 0 otherwise                                                | Discrete       |
| Count                  | No. of connections to the same host as the current connection in the past 2 seconds    | Discrete       |
| Srv count              | No. of connections to the same service as the current connection in the past 2 seconds | Discrete       |
| Serror rate            | No. of connections that have SYN errors                                                 | Continuous     |
| Srv rerror rate        | No. of connections that have REJ errors                                                 | Continuous     |
| Srv diff host rate     | No. of connections to different hosts                                                   | Continuous     |
| Dst host count         | Count of connections having the same destination host                                   | Discrete       |
| Dst host srv count     | Count of connections having the same destination host and using the same service       | Discrete       |
| Dst host diff srv rate | No. of different services on the current host                                           | Continuous     |
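As a sketch, these 12 features can be projected out of each 41-field record by position. The index positions below follow the standard KDD Cup 99 feature ordering; they are our assumption and should be verified against the actual file:

```python
# Sketch: project a 41-field KDD record down to the 12 selected
# features. Indices assume the standard KDD Cup 99 column order.
SELECTED = {
    "service": 2, "src_bytes": 4, "dst_bytes": 5, "logged_in": 11,
    "count": 22, "srv_count": 23, "serror_rate": 24,
    "srv_rerror_rate": 27, "srv_diff_host_rate": 30,
    "dst_host_count": 31, "dst_host_srv_count": 32,
    "dst_host_diff_srv_rate": 34,
}

def select_features(record):
    """Return the 12 selected fields from one 41-field record."""
    return [record[i] for i in SELECTED.values()]
```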

INPUT DATASET: Here we use the 10% KDD data set, which contains nearly 300,000 records. This data set is partitioned into ten parts, each containing 10% of the normal records and 10% of the attack records. 10-fold cross-validation is applied over the data set, preparing ten datasets numbered 1 to 10.
Total no. of records = 283,472
Attacked records = 237,811
Normal records = 45,661

10-fold cross-validation

| S.no | Training datasets   | Testing dataset |
|------|---------------------|-----------------|
| 1    | 1,2,3,4,5,6,7,8,9   | 10              |
| 2    | 2,3,4,5,6,7,8,9,10  | 1               |
| 3    | 1,3,4,5,6,7,8,9,10  | 2               |
| 4    | 1,2,4,5,6,7,8,9,10  | 3               |
| 5    | 1,2,3,5,6,7,8,9,10  | 4               |
| 6    | 1,2,3,4,6,7,8,9,10  | 5               |
| 7    | 1,2,3,4,5,7,8,9,10  | 6               |
| 8    | 1,2,3,4,5,6,8,9,10  | 7               |
| 9    | 1,2,3,4,5,6,7,9,10  | 8               |
| 10   | 1,2,3,4,5,6,7,8,10  | 9               |
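The stratified ten-way split and the train/test rotation above can be sketched as follows; the shuffle seed and the interleaved slicing are our choices for keeping each part at roughly 10% of the normal and 10% of the attack records:

```python
# Sketch: split records into ten stratified parts (each ~10% of the
# normal and ~10% of the attack records), then rotate train/test
# folds as in the table above.
import random

def ten_fold_parts(records, labels, seed=0):
    rng = random.Random(seed)
    normal = [r for r, y in zip(records, labels) if y == 0]
    attack = [r for r, y in zip(records, labels) if y == 1]
    rng.shuffle(normal)
    rng.shuffle(attack)
    # Part i takes every 10th normal and every 10th attack record
    return [normal[i::10] + attack[i::10] for i in range(10)]

def folds(parts):
    for i in range(10):
        test = parts[i]
        train = [r for j, p in enumerate(parts) if j != i for r in p]
        yield train, test
```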

REFERENCES:
1) D. Venkatesan, K. Kannan, and R. Saravanan, "A genetic algorithm-based artificial neural network model for the optimization of machining processes."
2) Te-Shun Chou, "Ensemble Fuzzy Belief Intrusion Detection Design," Florida International University, tchou001@fiu.edu.
3) Vegard Engen, "Machine Learning for Network Based Intrusion Detection."
