
Genetic Algorithms

Evolutionary Computing
• Evolutionary computing produces high-quality partial solutions to problems through natural selection and survival of the fittest
– Compare to natural biological systems that adapt and learn over time
Genetic Algorithm Example
• Find the maximum value of the function f(x) = −x² + 15x
– Represent the problem using chromosomes built from four genes:

Integer | Binary code    Integer | Binary code    Integer | Binary code
   1    |    0001           6    |    0110          11    |    1011
   2    |    0010           7    |    0111          12    |    1100
   3    |    0011           8    |    1000          13    |    1101
   4    |    0100           9    |    1001          14    |    1110
   5    |    0101          10    |    1010          15    |    1111
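A minimal sketch of this encoding and of the fitness evaluation in Python (function names are illustrative, not from the slides):

def encode(x, n_bits=4):
    """Encode an integer as an n-bit binary chromosome (one gene per bit)."""
    return format(x, f"0{n_bits}b")

def decode(chromosome):
    """Decode a binary chromosome back to an integer."""
    return int(chromosome, 2)

def fitness(chromosome):
    """The fitness function is simply the objective f(x) = -x^2 + 15x."""
    x = decode(chromosome)
    return -x**2 + 15*x

print(encode(12), decode("1100"), fitness("1100"))   # 1100 12 36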
Genetic Algorithm Example
• Initial random population of size N = 6:

[Figure: plot of f(x) = −x² + 15x for 0 ≤ x ≤ 15, with the six initial chromosomes marked on the curve. (a) Chromosome initial locations.]
Genetic Algorithm Example
• Determine the fitness of each chromosome; the fitness function here is simply the original function f(x) = −x² + 15x:

Chromosome | Chromosome | Decoded | Chromosome | Fitness
label      | string     | integer | fitness    | ratio, %
X1         | 1100       | 12      | 36         | 16.5
X2         | 0100       | 4       | 44         | 20.2
X3         | 0001       | 1       | 14         | 6.4
X4         | 1110       | 14      | 14         | 6.4
X5         | 0111       | 7       | 56         | 25.7
X6         | 1001       | 9       | 54         | 24.8
           |            |         | 218        | 100.0
Genetic Algorithm Example
• Use fitness ratios to determine which chromosomes are selected for crossover and mutation operations:

[Figure: roulette wheel divided into slices proportional to fitness ratio: X1: 16.5%, X2: 20.2%, X3: 6.4%, X4: 6.4%, X5: 25.7%, X6: 24.8% (cumulative boundaries at 16.5, 36.7, 43.1, 49.5, 75.2, and 100).]
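A sketch of this roulette-wheel selection in Python, reusing the fitness function sketched earlier (names are illustrative):

import random

def roulette_select(population):
    """Pick one chromosome with probability equal to its fitness ratio."""
    total = sum(fitness(c) for c in population)
    spin = random.uniform(0, total)
    running = 0.0
    for c in population:
        running += fitness(c)
        if running >= spin:
            return c
    return population[-1]   # guard against floating-point round-off

population = ["1100", "0100", "0001", "1110", "0111", "1001"]  # X1..X6 above
parent = roulette_select(population)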
Genetic Algorithms – Step 1
• Represent the problem domain as a chromosome of fixed length
– Use a fixed number of genes to represent a solution
– Use individual bits or characters for efficient memory use and speed

1 0 1 1 0 1 0 0 0 0 0 1 0 1 0 1

– e.g. the Traveling Salesman Problem (TSP): http://www.lalena.com/AI/Tsp/
Genetic Algorithms – Step 2
• Define a fitness function f(x) to measure the quality of individual chromosomes
• The fitness function determines
– which chromosomes carry over to the next generation
– which chromosomes are crossed over with one another
– which chromosomes are individually mutated
Genetic Algorithms – Step 3
• Establish our genetic algorithm parameters:
– Choose the size of the population, N
– Set the crossover probability, pc
– Set the mutation probability, pm

• Randomly generate an initial population of chromosomes: x1, x2, ..., xN

[Figure: N random bit-string chromosomes.]
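A sketch of this step in Python; the parameter values are illustrative placeholders, not prescribed by the slides:

import random

N = 6         # population size
PC = 0.7      # crossover probability (illustrative)
PM = 0.01     # mutation probability (illustrative)
N_BITS = 4    # genes per chromosome

def random_chromosome(n_bits=N_BITS):
    """Generate one random bit-string chromosome."""
    return "".join(random.choice("01") for _ in range(n_bits))

population = [random_chromosome() for _ in range(N)]   # x1, x2, ..., xN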
Genetic Algorithms – Step 4
• Calculate the fitness of each individual chromosome using f(x):
– f(x1), f(x2), ..., f(xN)
• Order the population based on fitness values
Genetic Algorithms – Step 5
• Using pc, select pairs of chromosomes for crossover
• Using pm, select chromosomes for mutation
• Chromosomes are selected based on their fitness values using a roulette wheel approach:

[Figure: the roulette wheel shown earlier, with slices X1: 16.5%, X2: 20.2%, X3: 6.4%, X4: 6.4%, X5: 25.7%, X6: 24.8%.]
Genetic Algorithms – Step 6
• Create a pair of offspring chromosomes by applying a crossover operation (the vertical bar marks the randomly chosen crossover point; each offspring takes the head of one parent and the tail of the other):

X6i: 1 0 | 0 1  ×  X2i: 0 1 | 0 0  →  X6'i: 1 0 0 0,  X2'i: 0 1 0 1
X1i: 1 | 1 0 0  ×  X5i: 0 | 1 1 1  →  X1'i: 1 1 1 1,  X5'i: 0 1 0 0
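A sketch of single-point crossover in Python, matching the diagrams above:

import random

def crossover(parent_a, parent_b):
    """Swap the tails of two parents at a random crossover point."""
    point = random.randint(1, len(parent_a) - 1)
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])

# e.g. crossing "1001" and "0100" at point 2 yields ("1000", "0101"), as above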
Genetic Algorithms – Step 6
• Mutate an offspring chromosome by applying a mutation operation, which flips a randomly selected gene:

X6'i: 1 0 0 0  (unchanged)
X2'i: 0 1 0 1  (unchanged)
X1'i: 1 1 1 1  →  X1''i: 1 0 1 1  (second gene flipped)
X5'i: 0 1 0 0  (unchanged)
X2i:  0 1 0 0  →  X2''i: 0 1 1 0  (third gene flipped)
X5i:  0 1 1 1  (unchanged)
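A sketch of bit-flip mutation in Python; each gene flips independently with probability pm:

import random

def mutate(chromosome, pm=0.01):
    """Flip each gene (bit) independently with probability pm."""
    return "".join(
        ("1" if bit == "0" else "0") if random.random() < pm else bit
        for bit in chromosome
    )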
Genetic Algorithms – Steps 7 & 8
• Step 7:
– Place all generated offspring chromosomes in a new population
• Step 8:
– Go back to Step 5 until the size of the new population is equal to the size of the initial population, N
Genetic Algorithms – Steps 9 & 10

• Step 9:
– Replace the initial population with the new population
• Step 10:
– Go back to Step 4 and repeat the process until termination criteria are satisfied
– Typically repeat this process for 50–5000+ generations

Generation i:
X1i: 1 1 0 0 (f = 36)   X2i: 0 1 0 0 (f = 44)   X3i: 0 0 0 1 (f = 14)
X4i: 1 1 1 0 (f = 14)   X5i: 0 1 1 1 (f = 56)   X6i: 1 0 0 1 (f = 54)

Crossover:
X6i: 1 0 | 0 1  ×  X2i: 0 1 | 0 0  →  X6'i: 1 0 0 0,  X2'i: 0 1 0 1
X1i: 1 | 1 0 0  ×  X5i: 0 | 1 1 1  →  X1'i: 1 1 1 1,  X5'i: 0 1 0 0

Mutation:
X1'i: 1 1 1 1  →  X1''i: 1 0 1 1       X2i: 0 1 0 0  →  X2''i: 0 1 1 0

Generation (i + 1):
X1i+1: 1 0 0 0 (f = 56)   X2i+1: 0 1 0 1 (f = 50)   X3i+1: 1 0 1 1 (f = 44)
X4i+1: 0 1 0 0 (f = 44)   X5i+1: 0 1 1 0 (f = 54)   X6i+1: 0 1 1 1 (f = 56)
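Putting the ten steps together, a compact sketch of the whole loop in Python, building on the helper functions sketched in the earlier steps (parameter defaults are illustrative):

import random

def run_ga(n=6, n_bits=4, pc=0.7, pm=0.01, generations=50):
    """Evolve a population (Steps 3-10) and return the fittest chromosome found."""
    population = [random_chromosome(n_bits) for _ in range(n)]    # Step 3
    for _ in range(generations):                                  # Steps 4-10
        new_population = []
        while len(new_population) < n:                            # Step 8
            a = roulette_select(population)                       # Step 5
            b = roulette_select(population)
            if random.random() < pc:                              # Step 6: crossover
                a, b = crossover(a, b)
            new_population += [mutate(a, pm), mutate(b, pm)]      # Step 6: mutation
        population = new_population[:n]                           # Steps 7 and 9
    return max(population, key=fitness)

print(run_ga())   # typically converges near the optimum x = 7 or 8 (f = 56)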
Genetic Algorithms
• Advantages of genetic algorithms:
– Often outperform “brute force” approaches by randomly jumping around the search space
– Ideal for problem domains in which near-optimal (as opposed to exact) solutions are adequate
• Disadvantages of genetic algorithms:
– Might not find any satisfactory partial solutions
– Tuning can be a challenge
Neural Network
Introduction

• What are Neural Networks?
– Neural networks are a new method of programming computers.
– They are exceptionally good at performing pattern recognition and other tasks that are very difficult to program using conventional techniques.
– Programs that employ neural nets are also capable of learning on their own and adapting to changing conditions.
Introduction

• What are Neural Networks?
– On average, neural networks have higher computational rates than conventional computers.
– Neural networks learn by example.
– Neural networks mimic the way the human brain works.
Background
• An Artificial Neural Network (ANN) is an information processing paradigm inspired by biological nervous systems, such as the human brain’s information processing mechanism.
• The key element of this paradigm is the novel structure of the information processing system. It is composed of a large number of highly interconnected processing elements (neurons) working in unison to solve specific problems. NNs, like people, learn by example.
• An NN is configured for a specific application, such as pattern recognition or data classification, through a learning process. Learning in biological systems involves adjustments to the synaptic connections that exist between the neurons. This is true of NNs as well.
How the Human Brain learns

• In the human brain, a typical neuron collects signals from others through a host of fine structures called dendrites.
• The neuron sends out spikes of electrical activity through a long, thin strand known as an axon, which splits into thousands of branches.
• At the end of each branch, a structure called a synapse converts the activity from the axon into electrical effects that inhibit or excite activity in the connected neurons.
A Neuron Model
• When a neuron receives excitatory input that is sufficiently large compared with its inhibitory input, it sends a spike of electrical activity down its axon. Learning occurs by changing the effectiveness of the synapses so that the influence of one neuron on another changes.
• We construct artificial neural networks by first trying to deduce the essential features of neurons and their interconnections.
• We then typically program a computer to simulate these features.
A Simple Neuron

• An artificial neuron is a device with many inputs and one output.
• The neuron has two modes of operation:
– the training mode, and
– the using mode.
A Simple Neuron (Cont.)

• In the training mode, the neuron can be trained to fire (or not) for particular input patterns.
• In the using mode, when a taught input pattern is detected at the input, its associated output becomes the current output. If the input pattern does not belong in the taught list of input patterns, the firing rule is used to determine whether to fire or not.
• The firing rule is an important concept in neural networks and accounts for their high flexibility. A firing rule determines how one calculates whether a neuron should fire for any input pattern. It relates to all the input patterns, not only the ones on which the node was previously trained.
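As a concrete illustration, here is a minimal sketch in Python of a neuron with a threshold firing rule; the weights, threshold, and Hamming-distance rule below are illustrative assumptions, not something specified in the slides:

def fires(inputs, weights, threshold):
    """A neuron fires (outputs 1) when its weighted input sum reaches the threshold."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= threshold else 0

def firing_rule(inputs, taught_fire, taught_no_fire):
    """Hamming-distance firing rule: fire if the input is nearer to a taught
    firing pattern than to a taught non-firing pattern (0 otherwise)."""
    dist = lambda p: sum(a != b for a, b in zip(inputs, p))
    return 1 if min(map(dist, taught_fire)) < min(map(dist, taught_no_fire)) else 0

print(fires([1, 0, 1], [0.5, 0.3, 0.4], 0.8))            # 1, since 0.9 >= 0.8
print(firing_rule([1, 1, 0], [[1, 1, 1]], [[0, 0, 0]]))  # 1: closer to the firing pattern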
Pattern Recognition

• An important application of neural networks is pattern recognition. Pattern recognition can be implemented by using a feed-forward neural network that has been trained accordingly.
• During training, the network is trained to associate outputs with input patterns.
• When the network is used, it identifies the input pattern and tries to output the associated output pattern.
• The power of neural networks comes to life when a pattern that has no output associated with it is given as an input.
• In this case, the network gives the output that corresponds to a taught input pattern that is least different from the given pattern.
Pattern Recognition (cont.)

• Suppose a network is trained to recognize the patterns T and H. The associated output patterns are all black and all white, respectively.

[Figure: the T and H input patterns with their associated all-black and all-white outputs.]
Pattern Recognition (cont.)

Since the input pattern looks more like a ‘T’, when the network classifies it, it sees the input closely resembling ‘T’ and outputs the pattern that represents a ‘T’.
Pattern Recognition (cont.)

The input pattern here closely resembles ‘H’ with a slight difference. The network in this case classifies it as an ‘H’ and outputs the pattern representing an ‘H’.
Pattern Recognition (cont.)

• Here the top row is 2 errors away from a ‘T’ and 3 errors away from an ‘H’, so the top output is black.
• The middle row is 1 error away from both ‘T’ and ‘H’, so the output is random.
• The bottom row is 1 error away from ‘T’ and 2 errors away from ‘H’, so the output is black.
• Since the input resembles a ‘T’ more than an ‘H’, the output of the network is in favor of a ‘T’.
Different types of Neural Networks

• Feed-forward networks
– Feed-forward NNs allow signals to travel one way only: from input to output. There is no feedback (loops); i.e., the output of any layer does not affect that same layer.
– Feed-forward NNs tend to be straightforward networks that associate inputs with outputs. They are extensively used in pattern recognition.
– This type of organization is also referred to as bottom-up or top-down.
Continued

• Feedback networks
– Feedback networks can have signals traveling in both directions by introducing loops in the network.
– Feedback networks are dynamic; their ‘state’ is changing continuously until they reach an equilibrium point.
– They remain at the equilibrium point until the input changes and a new equilibrium needs to be found.
– Feedback architectures are also referred to as interactive or recurrent, although the latter term is often used to denote feedback connections in single-layer organizations.
Backprop algorithm
• The Backprop algorithm searches for weight values that minimize the total error of the network over the set of training examples (training set).
• Backprop consists of the repeated application of the following two passes:
– Forward pass: in this step the network is activated on one example and the error of (each neuron of) the output layer is computed.
– Backward pass: in this step the network error is used for updating the weights. Starting at the output layer, the error is propagated backwards through the network, layer by layer. This is done by recursively computing the local gradient of each neuron.
Back Propagation
• Back-propagation training algorithm:

[Figure: forward step — network activation; backward step — error propagation.]

• Backprop adjusts the weights of the NN in order to minimize the network total mean squared error.
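A compact sketch of the two passes for a tiny 2-input, 2-hidden-unit, 1-output sigmoid network in Python; the network size, learning rate, and omission of bias terms are simplifying assumptions:

import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

random.seed(0)
w_h = [[random.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(2)]  # hidden weights
w_o = [random.uniform(-0.5, 0.5) for _ in range(2)]                      # output weights
lr = 0.5                                                                 # learning rate

def train_step(x, target):
    """One forward pass and one backward pass on a single example."""
    global w_o
    # Forward pass: activate the network and compute the output.
    h = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_h]
    o = sigmoid(sum(w * hi for w, hi in zip(w_o, h)))
    # Backward pass: propagate the error back, layer by layer, via the
    # local gradient of each neuron, then update the weights.
    err_o = o * (1 - o) * (target - o)
    err_h = [hi * (1 - hi) * err_o * w for hi, w in zip(h, w_o)]
    w_o = [w + lr * err_o * hi for w, hi in zip(w_o, h)]
    for j in range(2):
        w_h[j] = [w + lr * err_h[j] * xi for w, xi in zip(w_h[j], x)]
    return (target - o) ** 2   # squared error, for monitoring convergence

for _ in range(100):
    train_step([1.0, 0.0], 1.0)   # repeated passes drive the error down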
Neural Network in Use
Since neural networks are best at identifying patterns or trends in data, they are well suited for prediction or forecasting needs, including:
– sales forecasting
– industrial process control
– customer research
– data validation
– risk management

ANNs are also used in the following specific paradigms: recognition of speakers in communications; diagnosis of hepatitis; undersea mine detection; texture analysis; three-dimensional object recognition; hand-written word recognition; and facial recognition.
Advantages of Neural Networks

• They have the ability to learn by example
• They are more fault tolerant
• They are more suited for real-time operation due to their high ‘computational’ rates
Disadvantages of Neural Networks

• The individual relations between the input variables and the output variables are not developed by engineering judgment, so the model tends to be a black box or input/output table without an analytical basis.
• The sample size has to be large.
• Training requires a lot of trial and error, so it can be time consuming.
6.6 Classification by Backpropagation

Initialize the weights: The weights in the network are initialized to small random numbers (e.g., ranging from −1.0 to 1.0, or −0.5 to 0.5). Each unit has a bias associated with it, as explained below. The biases are similarly initialized to small random numbers.
Each training tuple, X, is processed by the following steps.
Propagate the inputs forward: First, the training tuple is fed to the input layer of the network. The inputs pass through the input units, unchanged. That is, for an input unit j, its output O_j is equal to its input value I_j. Next, the net input and output of each unit in the hidden and output layers are computed. The net input to a unit in the hidden or output layers is computed as a linear combination of its inputs. To help illustrate this point, a hidden layer or output layer unit is shown in Figure 6.17. Each such unit has a number of inputs to it that are, in fact, the outputs of the units connected to it in the previous layer. Each connection has a weight. To compute the net input to the unit, each input connected to the unit is multiplied by its corresponding weight, and this is summed. Given a unit j in a hidden or output layer, the net input, I_j, to unit j is

I_j = ∑_i w_ij O_i + θ_j,    (6.24)

where w_ij is the weight of the connection from unit i in the previous layer to unit j; O_i is the output of unit i from the previous layer; and θ_j is the bias of the unit. The bias acts as a threshold in that it serves to vary the activity of the unit.
Each unit in the hidden and output layers takes its net input and then applies an activation function to it, as illustrated in Figure 6.17. The function symbolizes the activation

[Figure 6.17: A hidden or output layer unit j. The inputs to unit j, labeled y1, y2, ..., yn, are outputs from the previous layer. These are multiplied by their corresponding weights w_1j, ..., w_nj to form a weighted sum, which is added to the bias associated with unit j. A nonlinear activation function f is applied to the net input. If unit j were in the first hidden layer, these inputs would correspond to the input tuple (x1, x2, ..., xn).]

of the neuron represented by the unit. The logistic, or sigmoid, function is used. Given the net input I_j to unit j, then O_j, the output of unit j, is computed as

O_j = 1 / (1 + e^(−I_j)).    (6.25)

This function is also referred to as a squashing function, because it maps a large input domain onto the smaller range of 0 to 1. The logistic function is nonlinear and differentiable, allowing the backpropagation algorithm to model classification problems that are linearly inseparable.
We compute the output values, O_j, for each hidden layer, up to and including the output layer, which gives the network’s prediction. In practice, it is a good idea to cache (i.e., save) the intermediate output values at each unit as they are required again later, when backpropagating the error. This trick can substantially reduce the amount of computation required.
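A sketch of equations (6.24) and (6.25) in Python; the layer data structure is an illustrative choice, and outputs are cached exactly as the text recommends:

import math

def forward_pass(x, layers):
    """Compute and cache the outputs of every layer, input layer first.
    layers: one list per layer of (weights, bias) pairs, one pair per unit."""
    outputs = [list(x)]          # input units pass their values through unchanged
    for layer in layers:
        prev = outputs[-1]
        outputs.append([
            # net input I_j, Eq. (6.24), squashed by the sigmoid, Eq. (6.25)
            1.0 / (1.0 + math.exp(-(sum(w * o for w, o in zip(weights, prev)) + bias)))
            for weights, bias in layer
        ])
    return outputs

# With the values of Example 6.9 (later in this section), the output is ~0.474:
net = [[([0.2, 0.4, -0.5], -0.4), ([-0.3, 0.1, 0.2], 0.2)],  # hidden units 4 and 5
       [([-0.3, -0.2], 0.1)]]                                # output unit 6
print(forward_pass([1, 0, 1], net)[-1])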
Backpropagate the error: The error is propagated backward by updating the weights and biases to reflect the error of the network’s prediction. For a unit j in the output layer, the error Err_j is computed by

Err_j = O_j (1 − O_j)(T_j − O_j),    (6.26)

where O_j is the actual output of unit j, and T_j is the known target value of the given training tuple. Note that O_j (1 − O_j) is the derivative of the logistic function.
To compute the error of a hidden layer unit j, the weighted sum of the errors of the units connected to unit j in the next layer is considered. The error of a hidden layer unit j is

Err_j = O_j (1 − O_j) ∑_k Err_k w_jk,    (6.27)

where w_jk is the weight of the connection from unit j to a unit k in the next higher layer, and Err_k is the error of unit k.
The weights and biases are updated to reflect the propagated errors. Weights are updated by the following equations, where ∆w_ij is the change in weight w_ij:

∆w_ij = (l) Err_j O_i    (6.28)

w_ij = w_ij + ∆w_ij    (6.29)

“What is the ‘l’ in Equation (6.28)?” The variable l is the learning rate, a constant typically having a value between 0.0 and 1.0. Backpropagation learns using a method of gradient descent to search for a set of weights that fits the training data so as to minimize the mean squared distance between the network’s class prediction and the known target value of the tuples.⁸ The learning rate helps avoid getting stuck at a local minimum

⁸ A method of gradient descent was also used for training Bayesian belief networks, as described in Section 6.4.4.

in decision space (i.e., where the weights appear to converge, but are not the optimum solution) and encourages finding the global minimum. If the learning rate is too small, then learning will occur at a very slow pace. If the learning rate is too large, then oscillation between inadequate solutions may occur. A rule of thumb is to set the learning rate to 1/t, where t is the number of iterations through the training set so far.
Biases are updated by the following equations, where ∆θ_j is the change in bias θ_j:

∆θ_j = (l) Err_j    (6.30)

θ_j = θ_j + ∆θ_j    (6.31)

Note that here we are updating the weights and biases after the presentation of each tuple. This is referred to as case updating. Alternatively, the weight and bias increments could be accumulated in variables, so that the weights and biases are updated after all of the tuples in the training set have been presented. This latter strategy is called epoch updating, where one iteration through the training set is an epoch. In theory, the mathematical derivation of backpropagation employs epoch updating, yet in practice, case updating is more common because it tends to yield more accurate results.
Terminating condition: Training stops when
– all ∆w_ij in the previous epoch were so small as to be below some specified threshold, or
– the percentage of tuples misclassified in the previous epoch is below some threshold, or
– a prespecified number of epochs has expired.

In practice, several hundreds of thousands of epochs may be required before the weights will converge.

“How efficient is backpropagation?” The computational efficiency depends on the time spent training the network. Given |D| tuples and w weights, each epoch requires O(|D| × w) time. However, in the worst-case scenario, the number of epochs can be exponential in n, the number of inputs. In practice, the time required for the networks to converge is highly variable. A number of techniques exist that help speed up the training time. For example, a technique known as simulated annealing can be used, which also ensures convergence to a global optimum.

Example 6.9 Sample calculations for learning by the backpropagation algorithm. Figure 6.18 shows
a multilayer feed-forward neural network. Let the learning rate be 0.9. The initial weight
and bias values of the network are given in Table 6.3, along with the first training tuple,
X = (1, 0, 1), whose class label is 1.
This example shows the calculations for backpropagation, given the first training
tuple, X. The tuple is fed into the network, and the net input and output of each unit
are computed. These values are shown in Table 6.4. The error of each unit is computed

[Figure 6.18: An example of a multilayer feed-forward neural network. Inputs x1, x2, x3 feed input units 1, 2, 3; weights w14, w15, w24, w25, w34, w35 connect them to hidden units 4 and 5; weights w46 and w56 connect the hidden units to output unit 6.]

Table 6.3 Initial input, weight, and bias values.

x1  x2  x3  w14  w15   w24  w25  w34   w35  w46   w56   θ4    θ5   θ6
1   0   1   0.2  −0.3  0.4  0.1  −0.5  0.2  −0.3  −0.2  −0.4  0.2  0.1

Table 6.4 The net input and output calculations.

Unit j | Net input, I_j                               | Output, O_j
4      | 0.2 + 0 − 0.5 − 0.4 = −0.7                   | 1/(1 + e^0.7) = 0.332
5      | −0.3 + 0 + 0.2 + 0.2 = 0.1                   | 1/(1 + e^−0.1) = 0.525
6      | (−0.3)(0.332) − (0.2)(0.525) + 0.1 = −0.105  | 1/(1 + e^0.105) = 0.474

and propagated backward. The error values are shown in Table 6.5. The weight and bias
updates are shown in Table 6.6.
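For concreteness, the example's calculations can be reproduced with a short Python script (a sketch; names follow Figure 6.18 and Table 6.3, and the results match Tables 6.4 through 6.6 up to rounding):

import math

sig = lambda t: 1 / (1 + math.exp(-t))
x1, x2, x3, T, l = 1, 0, 1, 1, 0.9            # training tuple X, class label, learning rate
w14, w15, w24, w25 = 0.2, -0.3, 0.4, 0.1      # initial weights and biases (Table 6.3)
w34, w35, w46, w56 = -0.5, 0.2, -0.3, -0.2
th4, th5, th6 = -0.4, 0.2, 0.1

# Forward pass: net inputs and outputs (Table 6.4)
O4 = sig(w14*x1 + w24*x2 + w34*x3 + th4)      # sig(-0.7)   = 0.332
O5 = sig(w15*x1 + w25*x2 + w35*x3 + th5)      # sig(0.1)    = 0.525
O6 = sig(w46*O4 + w56*O5 + th6)               # sig(-0.105) = 0.474

# Backward pass: errors (Table 6.5), Eqs. (6.26) and (6.27)
Err6 = O6 * (1 - O6) * (T - O6)               #  0.1311
Err5 = O5 * (1 - O5) * Err6 * w56             # -0.0065
Err4 = O4 * (1 - O4) * Err6 * w46             # -0.0087

# A few of the updates from Table 6.6, Eqs. (6.28) through (6.31)
w46 += l * Err6 * O4                          # -0.261
w14 += l * Err4 * x1                          #  0.192
th6 += l * Err6                               #  0.218
print(round(O6, 3), round(w46, 3), round(th6, 3))   # 0.474 -0.261 0.218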

Several variations and alternatives to the backpropagation algorithm have been proposed for classification in neural networks. These may involve the dynamic adjustment of the network topology and of the learning rate or other parameters, or the use of different error functions.

6.6.4 Inside the Black Box: Backpropagation and Interpretability

“Neural networks are like a black box. How can I ‘understand’ what the backpropagation network has learned?” A major disadvantage of neural networks lies in their knowledge

Table 6.5 Calculation of the error at each node.

Unit j | Err_j
6      | (0.474)(1 − 0.474)(1 − 0.474) = 0.1311
5      | (0.525)(1 − 0.525)(0.1311)(−0.2) = −0.0065
4      | (0.332)(1 − 0.332)(0.1311)(−0.3) = −0.0087

Table 6.6 Calculations for weight and bias updating.

Weight or bias | New value
w46 | −0.3 + (0.9)(0.1311)(0.332) = −0.261
w56 | −0.2 + (0.9)(0.1311)(0.525) = −0.138
w14 | 0.2 + (0.9)(−0.0087)(1) = 0.192
w15 | −0.3 + (0.9)(−0.0065)(1) = −0.306
w24 | 0.4 + (0.9)(−0.0087)(0) = 0.4
w25 | 0.1 + (0.9)(−0.0065)(0) = 0.1
w34 | −0.5 + (0.9)(−0.0087)(1) = −0.508
w35 | 0.2 + (0.9)(−0.0065)(1) = 0.194
θ6  | 0.1 + (0.9)(0.1311) = 0.218
θ5  | 0.2 + (0.9)(−0.0065) = 0.194
θ4  | −0.4 + (0.9)(−0.0087) = −0.408

representation. Acquired knowledge in the form of a network of units connected by weighted links is difficult for humans to interpret. This factor has motivated research in extracting the knowledge embedded in trained neural networks and in representing that knowledge symbolically. Methods include extracting rules from networks and sensitivity analysis.
Various algorithms for the extraction of rules have been proposed. The methods typically impose restrictions regarding procedures used in training the given neural network, the network topology, and the discretization of input values.
Fully connected networks are difficult to articulate. Hence, often the first step toward
extracting rules from neural networks is network pruning. This consists of simplifying
the network structure by removing weighted links that have the least effect on the trained
network. For example, a weighted link may be deleted if such removal does not result in
a decrease in the classification accuracy of the network.
Once the trained network has been pruned, some approaches will then perform
link, unit, or activation value clustering. In one method, for example, clustering is
used to find the set of common activation values for each hidden unit in a given
trained two-layer neural network (Figure 6.19). The combinations of these activation
values for each hidden unit are analyzed. Rules are derived relating combinations of
activation values with corresponding output unit values. Similarly, the sets of input
