Advances in Fuzzy Systems — Applications and Theory Vol. 14

Automatic Generation of
Neural Network Architecture
Using Evolutionary Computation

E. Vonk
L. C. Jain
R. P. Johnson

World Scientific
ADVANCES IN FUZZY SYSTEMS — APPLICATIONS AND THEORY
Forthcoming volumes:
Vol. 9: Fuzzy Topology
(Y. M. Liu and M. K. Luo)
Vol. 13: Fuzzy and Uncertain Object-Oriented Databases: Concepts and Models
(Ed. R. de Caluwe)
Advances in Fuzzy Systems — Applications and Theory Vol. 14
Automatic Generation of
Neural Network Architecture
Using Evolutionary Computation

E. Vonk
Vrije Univ. Amsterdam

L. C. Jain
Univ. South Australia

R. P. Johnson
Australian Defence Sci. & Tech. Organ.

World Scientific
Singapore • New Jersey • London • Hong Kong
Published by
World Scientific Publishing Co. Pte. Ltd.
P O Box 128, Farrer Road, Singapore 912805
USA office: Suite 1B, 1060 Main Street, River Edge, NJ 07661
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
For photocopying of material in this volume, please pay a copying fee through the Copyright
Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to
photocopy is not required from the publisher.
Preface
The book will prove useful for application engineers, scientists, researchers and
senior undergraduate/first-year graduate students in Computer, Electrical, Electronic,
Manufacturing, Mechatronics and Mechanical Engineering, and related disciplines.
Thanks are due to Berend Jan van der Zwaag and Pieter Grimmerink for their
excellent help in the preparation of the manuscript. We are grateful to Professor
Sanchez of the University of Marseille, France, and Professor Karr of the University
of Alabama for reviewing the manuscript. Thanks are also due to Mr Chiang Yew Kee
for his excellent editorial assistance.
This work was supported by the Australian Defence Science and Technology
Organisation (contract number 340479).
E. Vonk
L. C. Jain
R. P. Johnson
Contents
PREFACE v
1. INTRODUCTION 1
3. EVOLUTIONARY COMPUTATION 17
6. IMPLEMENTING GAs 79
6.1 GA performance 79
6.2 Fitness function 81
6.3 Coding 82
6.3.1 Binary coding 83
6.3.2 Real-valued coding 84
6.3.3 Symbolic coding 85
6.3.4 Non-homogeneous coding 85
INDEX 181
1. Introduction
It is often overlooked that the performance of a neural network on a certain problem
depends in the first place on the network architecture used and only in the second
place on the actual knowledge representation (i.e. values of the weights) within that
specific architecture. It can be said that the performance of a neural network depends
on three factors: the problem for which the network is going to be used or rather how
this is measured, the network structure and the set of weights. The performance of a
network is typically measured by the cumulative error of the neural network on some
test data with known target outputs, but can include computational speed and
complexity as well. This performance can be defined by an abstract quality function
Q = Q(T, S, W)
where:
Q = the type of quality function
T = the testing data (i.e. the target input/output data set)
S = the structure or architecture of the network
W = the set of weights
Q = (1/F) · Σ_{i=1}^{F} |O_i − T_i|
where:
F = the number of test patterns or facts
O = the neural network output vector
T = the target output vector
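As a hedged sketch of this measure (the function and variable names are ours, and the text leaves the exact error norm open, so a per-pattern absolute error is assumed):

```python
def quality(outputs, targets):
    """Cumulative error of a network over F test patterns, averaged per pattern.

    outputs and targets are lists of equal-length output vectors, one per test
    pattern; the absolute-error norm is an assumption, not taken from the text.
    """
    F = len(outputs)
    total = 0.0
    for o_vec, t_vec in zip(outputs, targets):
        # Error contribution of one test pattern
        total += sum(abs(o - t) for o, t in zip(o_vec, t_vec))
    return total / F
```

A lower value of this quality measure indicates better agreement with the target outputs on the test data.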
The automatic generation of a neural network structure is a useful concept, as, in many
applications, the optimal structure is not known a priori. Construction-deconstruction
algorithms can be used as an approach but they have several drawbacks. They are
usually restricted to a certain subset of network topologies and as with all hill climbing
methods they often get stuck at local optima and may therefore not reach the optimal
solution. These limitations can be overcome using evolutionary computation as an
approach to the generation of neural network structures. In order to optimise the
quality function Q the algorithm used must at least be able to change the structure S as
well as the set of weights W. In the case of feedforward neural networks, optimising
the structure S alone may be sufficient since the existing learning algorithms are such
that, given a neural network structure S, in many cases the optimal set of weights W
can quite easily be found. In many applications the test data set, T, will be stable.
However the network may have to operate in a dynamic environment where the task
description or at least the testing of the network on the task may change over time. In
such a case the algorithm must ideally be able to adapt T as well.
Artificial neural networks are viable and very important computational models for a
wide variety of problems. These include pattern classification, speech synthesis and
recognition, function approximation, image compression, associative memory,
clustering, forecasting and prediction, combinatorial optimisation, and non-linear
system modeling and control. The networks are 'neural' in the sense that they have
been inspired by neuroscience, the study of the human brain and nervous system. The
artificial neurons used are thought to be very simple models of their biological
counterpart. However, this does not mean that they are faithful models of biological
neural or cognitive phenomena, which are of a much more complex nature. In fact, the
majority of the neural networks presently used are more closely related to traditional
mathematical and/or statistical models, such as non-parametric pattern classifiers,
non-linear filters and statistical regression models, than to neurobiological models.
Still, the technology of neural networks attempts to mimic nature's approach to solving
certain complex problems that are impossible to solve with the more traditional
techniques.
2. Artificial Neural Networks

2.1 Introduction
This section introduces some of the concepts of neural networks. The basic
components of neural networks are discussed and some of the more common forms of
neural networks are considered.
The study of neural networks was originally undertaken in order to understand the
behaviour and structure of the biological neuron. It was soon realised how inadequate
the artificial neuron models were in comparison with the biological neuron, and as a
result some researchers in artificial neural networks decided that the name of neuron
was inappropriate and used other terms such as node rather than neuron. The use of
the term neuron is now so deeply entrenched that its continued general use seems
assured.
Another point which is sometimes confusing is that different writers use a different
numbering nomenclature for multi-layered neural networks. Some workers do not
count the input layer as one of the layers on the basis that this layer often serves only
for the input data and no processing of data occurs in it. Processing however does
occur within the input layer in some forms of artificial neural network. For the sake of
consistency we include the input layer as one of the layers when numbering the layers
of neurons.
The first stage is a process where the inputs x_0, x_1, ..., x_n, multiplied by their respective
weights w_0, w_1, ..., w_n, are summed by the neuron. The input vector x_0, x_1, ..., x_n may be
denoted by X and the weight vector w_0, w_1, ..., w_n by W. Weight w_0 forms the neuron's
threshold (the corresponding input x_0 is fixed at 1). The resulting summation process may be shown as:
y = x_0 · w_0 + x_1 · w_1 + x_2 · w_2 + ... + x_n · w_n = X · W
The weight vector W contains the weights connecting the various parts of the network.
The memory of the neural network is stored in the values of the weights. The term
weight is used in neural network terminology and is a means of expressing the strength
of the connection between any two neurons in the neural network.
During the training phase of a neural network the values of the weights are continuously
modified by the training process until some previously agreed criteria are met.
Different types of network use different methods of making the necessary adjustments.
Output = f(y)
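The two-stage neuron above can be sketched as follows; the names and example values are illustrative only, with x[0] fixed at 1 so that w[0] plays the role of the threshold:

```python
def neuron_output(x, w, activation):
    """First stage: y = x0*w0 + x1*w1 + ... + xn*wn = X . W
    (x[0] is fixed at 1, so w[0] acts as the neuron's threshold).
    Second stage: apply the activation function f to y."""
    y = sum(xi * wi for xi, wi in zip(x, w))
    return activation(y)

# A step activation with one common boundary convention at zero:
step = lambda y: 1 if y >= 0 else -1
print(neuron_output([1, 0.5, -0.5], [0.2, 1.0, 1.0], step))  # 1
```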
• Step function. The function shown in Figure 2.3a is known as the step function.
The output from this function is limited to one of two values, depending on
whether the input signal is greater or less than zero. Usually the output value
would be one for signal values greater than zero and minus one for signal values
less than zero. That is:

Output = +1 if y > 0; Output = −1 if y < 0
• Linear function. The function shown in Figure 2.3b is the only linear function in
the group of four functions shown and it has application in some specific network
nodes where dynamic range is not a consideration. The effect of this function is to
multiply by a constant factor. That is:
Output = K • y
• Ramp function. The effect of the ramp function, shown in Figure 2.3c, is to behave
as a linear function between the upper and lower limits and once these limits are
reached to behave as a step function. Another attraction is that the function may be
simply defined:
• Sigmoid function. The sigmoid function is an 'S' shaped curve, as shown in Figure
2.3d. A number of mathematical expressions may be used to define an 'S' shaped
curve, but the most commonly used form is given by the expression:
f(y) = 1 / (1 + e^(−y))
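The four activation functions of Figure 2.3 might be implemented as below; the constant K, the ramp limits and the boundary conventions are our assumptions, since the figures are not reproduced here:

```python
import math

def step(y):
    """Figure 2.3a: +1 above zero, -1 below (the value at zero is a convention)."""
    return 1.0 if y >= 0 else -1.0

def linear(y, K=1.0):
    """Figure 2.3b: multiply the input by a constant factor K."""
    return K * y

def ramp(y, lower=-1.0, upper=1.0):
    """Figure 2.3c: linear between the limits, saturating outside them."""
    return max(lower, min(upper, y))

def sigmoid(y):
    """Figure 2.3d: f(y) = 1 / (1 + e^(-y))."""
    return 1.0 / (1.0 + math.exp(-y))
```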
2.1.6 Learning
Before a neural network can be used it is necessary to subject it to some form of training,
during which the values of the weights in the network are adjusted to reflect
the characteristics of the input data. The learning process is one of developing a
mapping between the output data and the input data. When the network is adequately
trained, it will retrieve the correct output when a set of input data is presented to it. A
valuable property often claimed for neural networks is that of generalisation, whereby
a trained neural network is able to provide a correct matching of output data to a set of
previously unseen input data.
As indicated in Table 2.1 there are two forms of learning: supervised learning and
unsupervised learning.
• Supervised Learning
In this form of learning, a target value is included as part of each fact within the
training data. In this instance a fact incorporates all of the input data for the particular
event and the required output expected from the network for this fact. The target value
is the output value corresponding to a particular fact.
During the training process, the set of training data facts is repeatedly applied to the
network until the difference between the output results and the target values is within
the desired tolerance. When the neural network meets the error criteria on the training
facts, the previously unseen test data set of facts is applied to the neural network to
test the generalisation performance of the network.
• Unsupervised Learning
Unlike supervised learning there is no target value in this form of training. Instead, the
set of data which contains the facts is repeatedly applied to the network until a stable
network output is obtained. It has been suggested that this form of training is more
similar to learning in the biological neuron, as in the biological situation there is not
normally a target value.
Hebbian (enhance successful connections):
    Δw_ij = η · f(w_i · X) · x_j
    (η = learning rate; w = weight vector; X = input vector)

Perceptron (binary response, no action if no error):
    Δw_ij = η · (t_i − sgn(w_i · X)) · x_j
    (t = target vector; η = learning rate)

Delta:
    Δw_ji = η · δ_pj · a_pi, with
    A: δ_pj = f′(S_j) · (t_pj − a_pj)  (output layer error)
    B: δ_pj = f′(S_j) · Σ_k δ_pk · w_kj  (hidden layer error)
    (S_j = weighted sum of inputs to neuron j)

Least Mean Square (Widrow-Hoff):
    Δw_i = η · (t_i − w_i · X) · x_j
    (η, t, X and w are as above)

Outstar (Grossberg):
    Δw_ji = η · (t_j − w_ji)

Winner Takes All (nearby neurons modify in a similar fashion):
    A: Δw_ij = η · (x_j − w_ij)  (when in the near neighbourhood)
    B: Δw_ij = 0  (when not in the near neighbourhood)
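As an illustration of one row of the table, the perceptron rule can be sketched as follows; the learning rate value and the sgn convention at zero are our assumptions:

```python
def perceptron_update(w, x, t, eta=0.1):
    """One application of the perceptron rule of Table 2.1:
    delta_w_j = eta * (t - sgn(w . x)) * x_j, so no change is made
    when the binary response already matches the target."""
    sgn = lambda v: 1 if v >= 0 else -1          # sign convention at zero assumed
    output = sgn(sum(wi * xi for wi, xi in zip(w, x)))
    return [wi + eta * (t - output) * xi for wi, xi in zip(w, x)]
```

Applied repeatedly over a linearly separable training set, this rule converges to a separating weight vector; when the error is zero the update vanishes, matching the "no action if no error" comment in the table.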
The multiple layer perceptron network will be described in further detail, since its
regular feedforward structure lends itself well to an investigation of genetic algorithm
techniques for network design.
The mean squared error, often called training error or network error, between the
actual output and the desired output is defined as follows:
E = (1/2) · Σ_k (t_k − y_k)²    (2.1)
where
t_k = target output of the kth neuron in the output layer
y_k = actual output of the kth neuron in the output layer
The weight change is set proportional to the negative derivative of the error with
respect to each weight:
Δw_jk = −ε · ∂E/∂w_jk    (2.2)
Δw_jk(t+1) = −ε · ∂E/∂w_jk(t+1) + μ · Δw_jk(t)    (2.3)
where ε is the learning rate and μ is the momentum factor.
The back-propagation algorithm, despite its simplicity and popularity, has several
drawbacks. It is slow and typically needs thousands of iterations to train a network to
solve a simple problem. The algorithm performance is also dependent on the initial
weights, and the values of μ and ε.
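A minimal sketch of the momentum update of equation (2.3), assuming the gradient ∂E/∂w is already available; the values chosen for ε and μ are illustrative only:

```python
def momentum_update(gradient, previous_delta, eps=0.5, mu=0.9):
    """Equation (2.3): delta_w(t+1) = -eps * dE/dw(t+1) + mu * delta_w(t)."""
    return -eps * gradient + mu * previous_delta

# Repeated identical gradients show the momentum term accumulating:
delta = 0.0
for gradient in [0.4, 0.4, 0.4]:
    delta = momentum_update(gradient, delta)
print(round(delta, 3))  # -0.542
```

The momentum term carries part of the previous weight change forward, which smooths the trajectory and can speed convergence across flat regions of the error surface.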
2.3 Conclusion
Artificial neural networks are viable and important computational models for a wide
variety of problems. It is a common practice to use trial and error to find a suitable
neural network architecture for a given problem. This trial and error method is time
consuming, and may not generate an optimum neural network structure. The learning
process whereby the network encodes information from the training process is also of
great importance in neural network performance and generalisation.
3. Evolutionary Computation

There is some confusion about the grouping and naming of the various kinds of
evolutionary computations. In this report the distinction is made between three kinds
of evolutionary computations: Genetic Algorithms (GAs), Genetic Programming and
Evolutionary Algorithms. The latter can be divided into Evolution Strategies and
Evolutionary Programming.
l is referred to as the chromosome length. Commonly all alphabets are the same: A =
A_1 = A_2 = ... = A_l, and in the case of binary genes: A = {0,1}.
The definitions of the basic terms in a genetic algorithm are given below:
The aim of the genetic algorithm is to find an individual with a maximum fitness by
means of a stochastic global search of the solution space.
f(x) = −1 if x < 0
f(x) = +1 if x ≥ 0
The aim is to get the network to perform the XOR function, which is described by the
input-output mapping as described by Table 3.1.
The task is to find the set of weights such that the neural network performs the XOR
function. Figure 3.1 shows the neural network structure.
A chromosome or genotype consists of all the weights of the network, including the
bias weights. One gene of a chromosome represents a single weight-value. For the
demonstration, a simple genetic algorithm is used with binary valued chromosomes.
Thus the alphabet is {0,1}. The alphabet size or cardinality, k, therefore is two. During
the evaluation of the chromosome, a gene in the chromosome that has a value of 0 will
be translated into a weight-value of -1. The weights are numbered from 1 to 9 in
Figure 3.1 which reflects the order in which they are represented in the chromosome.
The chromosome length, /, is 9.
An example of an ordered set of weights such that the network correctly performs its
task is:
x = (1 1 0 0 0 1 1 1 0)
The phenotype of this individual can be seen as the actual neural network structure
with the values of the weights given by the set {+1, +1, −1, −1, −1, +1, +1, +1, −1}. Such a
phenotype is a potential solution of the XOR problem; in fact this phenotype is an
optimal solution to the problem.
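The genotype-to-phenotype mapping used in this example can be sketched directly; the helper name decode is ours:

```python
def decode(chromosome):
    """Map each binary gene to a weight value: '0' -> -1 and '1' -> +1,
    in the order the weights are numbered in Figure 3.1."""
    return [1 if gene == '1' else -1 for gene in chromosome]

print(decode('110001110'))  # [1, 1, -1, -1, -1, 1, 1, 1, -1]
```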
3.1 Genetic Algorithms (GAs)
The fitness function should reflect the individual's performance on the actual problem.
The standard genetic algorithm searches for the maximum fitness and a low
performance error should be reflected in a good performance and therefore a high
fitness value.
where E(x) is the cumulative performance error on the training set and E_max is the
maximum value E(x) can obtain. E(x) is given by:

E(x) = Σ_{p=1}^{4} (t_p − o_p)²

where the sum runs over the four training patterns. The maximum performance error,
E_max, would in this case be 4 · 2² = 16, and thus the fitness value of the chromosome
in question is: f(x) = 16 − 0 = 16.
The reproduction (or selection) operator that is most commonly used is the Roulette
wheel method, where members of a population are extracted using a probabilistic
Monte Carlo procedure based on their average fitness. For example, a chromosome
with a fitness of 20% of the total fitness of a population will, on average, make up
20% of the intermediate generation. Apart from the Roulette wheel method many
other selection schemes are possible. An overview is presented in a later chapter.
The heuristics of GA are mainly based on reproduction and on the crossover operator,
and only on a very small scale on the mutation operator. The crossover operator
exchanges parts of the chromosomes (strings) of two randomly chosen members in the
intermediate population and the newly created chromosomes are placed into the new
population. Sometimes instead of two, only one newly created chromosome is put into
the new population; the other one is discarded. The mutation operator works only on a
single chromosome and randomly alters some part of the representation string. Both
operators (and sometimes more) are applied with a certain probability. Figure 3.2
shows the flowchart of the standard genetic algorithm.
The stopping criterion is usually set to that point in time when an individual that gives
an adequate solution to the problem has been found or simply when a set number of
generations has been run. It can also be set equal to the point where the population has
converged to a single solution. A gene is said to have converged when 95% of the
population of chromosomes share the same value of that gene. Ideally, the GA will
converge to the optimal solution; sometimes however a premature convergence to a
sub-optimal solution is observed.
Many variations of the above algorithm are possible and will be discussed in some
detail in later chapters.
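The flowchart of Figure 3.2 can be sketched as a generational loop; all operator implementations below are illustrative stand-ins (shown on a 'count the ones' toy problem), not taken from the text:

```python
import random

def run_ga(fitness, init_pop, select, crossover, mutate,
           max_generations=100, target_fitness=None):
    """Skeleton of the standard (batch) genetic algorithm of Figure 3.2:
    evaluate, check the stopping criterion, then select, cross over and mutate."""
    population = init_pop()
    for _ in range(max_generations):
        scores = [fitness(x) for x in population]
        if target_fitness is not None and max(scores) >= target_fitness:
            break  # stopping criterion: an adequate individual has been found
        new_population = []
        while len(new_population) < len(population):
            a = select(population, scores)
            b = select(population, scores)
            new_population.append(mutate(crossover(a, b)))
        population = new_population
    return max(population, key=fitness)

# Illustrative operators for the toy problem:
def tournament(pop, scores):
    i, j = random.randrange(len(pop)), random.randrange(len(pop))
    return pop[i] if scores[i] >= scores[j] else pop[j]

def one_point(a, b):
    cut = random.randint(1, len(a) - 1)
    return a[:cut] + b[cut:]

def point_mutation(x, pm=0.02):
    return [g ^ (random.random() < pm) for g in x]

random.seed(42)
best = run_ga(fitness=sum,
              init_pop=lambda: [[random.randint(0, 1) for _ in range(8)]
                                for _ in range(10)],
              select=tournament, crossover=one_point, mutate=point_mutation,
              max_generations=200, target_fitness=8)
```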
1 1 1 1 1 0 0 1 1
0 1 1 0 1 0 1 0 1
0 0 1 1 0 1 0 1 1
1 1 1 0 0 1 1 0 1
0 1 1 1 1 1 1 0 1
In the general case a gene a, will be initialised from a set of values corresponding to
the alphabet of the gene, A,. For example real-valued genes can be initialised in a
certain range using a normal distribution.
No.   Chromosome           Outputs             E(x)   f(x)
3     0 0 1 1 0 1 0 1 1    {+1, −1, +1, +1}     4      12
4     1 1 1 0 0 1 1 0 1                         4      12
5     0 1 1 1 1 1 1 0 1    {+1, +1, −1, +1}     4      12
In most genetic algorithm systems this assessment is by far the most time consuming
activity, so care must be taken in implementing it.
Since the stopping criterion is not satisfied, a new generation is created from the
present one. First, the intermediate population is made by means of the reproduction
operator.
p_select(x) = f(x) / Σ_y f(y)
The Roulette wheel operator is best visualised by imagining a wheel where each
chromosome occupies an area that is sized according to its relative fitness:
Selection of the chromosomes can now be seen as spinning the roulette wheel. When
the wheel stops a fixed marker determines which chromosome will be selected. To
make the intermediate population the Roulette wheel is simply spun 5 times.
Since the expected number of times that chromosome x will be selected is given by
E_select(x) = N · p_select(x), where N is the population size, this can be expressed as:

E_select(x) = N · f(x) / Σ_y f(y)
Table 3.3 gives the statistics of the current population. The last column shows the
actual number of times the chromosome is chosen. The intermediate population
therefore consists of two copies of chromosome 4, and one copy each of chromosomes 2, 3
and 5.
In practice this intermediate population or mating pool is normally not actually
formed. Instead the reproduction operator is used to select parents that will be subject
to the crossover operator. The reproduction operator is therefore more accurately
referred to as the selection operator.
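A sketch of the Roulette wheel (fitness-proportionate) selection operator; the cumulative-sum implementation below is one common choice, not necessarily the author's:

```python
import random

def roulette_select(population, fitnesses):
    """One spin of the Roulette wheel: chromosome i is chosen with
    probability f_i / (sum of all fitnesses)."""
    total = sum(fitnesses)
    spin = random.uniform(0.0, total)        # where the fixed marker lands
    running = 0.0
    for chromosome, f in zip(population, fitnesses):
        running += f                         # each chromosome's slice of the wheel
        if spin <= running:
            return chromosome
    return population[-1]                    # guard against floating-point round-off
```

Spinning the wheel N times yields the intermediate population; a chromosome holding 20% of the total fitness is expected to fill about 20% of it.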
The crossover operators used most often are based on 1-point or 2-point crossover,
depending on the number of crossover-points selected in a chromosome. In general n-point
crossover is possible with n < l. The different versions of the crossover operator
are illustrated by applying them to the XOR weight optimisation example.
• 1-Point Crossover
After selecting the parents the crossover-site within the chromosome is randomly
selected and the substrings about the crossover-site are swapped between the two
parents. The crossover site is randomly chosen from {1, ..., l − 1}, l being the length of the
chromosomes. This process is illustrated in Table 3.4.
28 Chapter 3. Evolutionary Computation
When crossover is chosen to produce 2 offspring and the population size is uneven
(e.g. N = 5 as in the example), the last crossover operation can only result in one
offspring. An option is to randomly discard the second offspring.
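1-point crossover on the string chromosomes of the example can be sketched as follows; the optional fixed site parameter is ours, added for reproducibility:

```python
import random

def one_point_crossover(parent1, parent2, site=None):
    """Swap the substrings about a crossover site chosen from {1, ..., l-1}."""
    if site is None:
        site = random.randint(1, len(parent1) - 1)
    child1 = parent1[:site] + parent2[site:]
    child2 = parent2[:site] + parent1[site:]
    return child1, child2

# Crossing two chromosomes from the example population at site 4:
print(one_point_crossover('111110011', '011010101', site=4))
# ('111110101', '011010011')
```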
• 2-Point Crossover
Table 3.5 gives an example of using 2-point crossover, where two crossover-points are
randomly selected and the substring between these two points is swapped.
Usually 2-point crossover is implemented so that the two crossover sites are chosen at
random and independent from each other. When the second crossover site lies to the
left of the first crossover site, the chromosome string is treated as being circular where
the endpoint and starting point are connected. Table 3.6 shows an example of this
situation.
• Uniform Crossover
Another version of the crossover operator is the so called uniform crossover. Instead
of using a predefined number of crossover points, the number is chosen
• Perform Mutation
After the new offspring is formed, mutation is performed on the selected
chromosomes. Mutation is usually implemented as follows: each gene in every
chromosome may undergo mutation with a probability of pm, where pm is the mutation-
rate. The mutation-rate is usually set to a low value such as 0.001.
In our example, since genes are bits, mutation normally just inverts the value of the
gene. In the more general case mutation re-initialises the value of a gene with a
random value taken from the initial distribution or alphabet. In the case of a binary
coded chromosome re-initialising the value of a gene will result in a 50% chance of
inverting it. The effect of the binary 'inversion-mutation' will on average be the same as
the binary re-initialising mutation with half the mutation rate. The 'inversion-mutation'
will be used here. The expected number of genes altered by the mutation operator, E_m, is:
E_m = p_m · l · N
Here pm is set to 0.01. Since the total number of genes in our population, l*N, is 5 * 9
= 45, a total of 45 * 0.01 = 0.45 genes are altered on average per generation.
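The bit-inversion mutation described above might be implemented as below; the default value of pm is illustrative:

```python
import random

def mutate(chromosome, pm=0.001):
    """Bit-inversion mutation: every gene flips independently with probability pm."""
    genes = []
    for gene in chromosome:
        if random.random() < pm:
            genes.append('0' if gene == '1' else '1')   # invert the bit
        else:
            genes.append(gene)                          # leave the gene unchanged
    return ''.join(genes)
```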
• Finished Loop
The algorithm now returns to the second step where each chromosome is evaluated
and the stopping criterion is checked. The process continues until the stopping
criterion is met.
performed on the (binary) strings or genotypes. The other space is the evaluation
space or phenotype space where the actual problem-structures or phenotypes are
evaluated on their ability to perform the task and where their fitness is calculated. An
interpretation or mapping function is necessary between the two. This is visualised in
Figure 3.4. The problem space constitutes all the potential solutions relating to the
problem. The evaluation space is in general a subset of the problem space and is
dependent on the representation used. Naturally, the evaluation space should be
'chosen' in such a way that it includes the optimal structure.
p = g(x,EP)
f(p, PS) → ℝ
For example when the network performs the XOR function, it can be described as:
One of the hidden neurons performs the OR function, the other the NAND, but it does
not matter which hidden neuron performs which. In our example both chromosomes (1
1 0, 0 0 1, 1 1 0) and (0 0 1, 1 1 0, 1 1 0 ) perform the XOR function correctly. The
first one by means of AND(OR,NAND) and the second one by AND(NAND,OR).
However, since the functioning of their hidden neurons is swapped with respect to one
another, standard crossover is not expected to yield a useful offspring. This is because
the standard crossover (and mutation) operator does not use any topological
information available in the phenotype. The two individuals suffer from 'competing
conventions'. Instead of one there are two optimal solutions to the problem. This
problem increases when more than two hidden neurons are used, and is thought to be a
main source of poor GA performance on such problems.
members of the current population [23]. These members are chosen based on their
fitness. One or more new chromosomes are then merged into the current population
taking the place of a 'doomed' chromosome. This 'doomed' chromosome is usually
chosen based on its inverse fitness. For a single generation step, this process is
repeated until the number of removed chromosomes equals the number of members in
the population. This approach is called a Steady State Genetic Algorithm as opposed
to the standard or Batch Genetic Algorithm. It requires much less memory storage as
only one population instead of two needs to be stored. A certain notion of age can be
built into the system where for a certain number of iterations these newly made
members can not be reselected to create a new offspring.
The first and most commonly used genetic algorithm software package based on a
Steady State Genetic Algorithm is 'Genitor' developed by Darrell Whitley.
Genitor uses a linear ranking selection method (see section 6.4) and unconditional
ranked replacement. Of the two offspring made by crossover only one is allowed to
enter the population, replacing the doomed individual; the other offspring is
discarded. This type of Steady State Genetic Algorithm is also referred to as a
'Genitor-type' genetic algorithm.
3.1.7 Elitism
Elitism is an optional characteristic of a genetic algorithm. When used, it makes sure
that the fittest chromosome of a population is passed on to the next generation
unchanged; it can never be replaced by another chromosome. Without elitism this
chromosome may be lost. Extended forms of elitism are also possible where the best
m chromosomes of the population are retained. Simple elitism is the case where m = 1.
In effect elitism means that the number of offspring that are generated each generation
is reduced from N to N-m replacing the worst N-m individuals in the population. A
Steady State Genetic Algorithm with ranked unconditional replacement (Genitor-type)
can be seen as a GA using extended elitism with m = N − 1.
• Niched GAs
Niched genetic algorithms are used to preserve information across a diverse
population. The simple standard GA loses information by quite rapidly converging to
a single solution. Niched GAs however, try to maintain several sub-populations of
individuals relating to different fit solutions. They are especially useful in finding a set
of mutually supportive solutions to a problem and have been successfully used in
solving multimodal functions. They can offer a solution to the competing conventions
problem (section 3.1.4). A niche is defined as a region in the fitness landscape with a
high fitness. A niched GA tries to 'fill' each niche with a set of chromosomes in
proportion to the quality of the niche.
There are a number of mechanisms available to achieve niching. The most frequently
used is fitness sharing. Here the normal or unshared fitness of an individual is
degraded depending on the presence of nearby individuals. The distance metric often
used in binary coding is the Hamming distance between the genotypes
(chromosomes). However a distance metric in the evaluation space relating to the
phenotypes of the individuals can also be used. Fitness sharing spreads the population
out over the niches where each niche is filled according to its height. Other niching
methods include restrictive mating schemes where in general only similar
chromosomes are allowed to reproduce.
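Fitness sharing with a Hamming-distance metric can be sketched as follows; the triangular sharing function and the niche radius sigma are common choices, assumed here rather than taken from the text:

```python
def hamming(a, b):
    """Number of positions at which two equal-length chromosomes differ."""
    return sum(x != y for x, y in zip(a, b))

def shared_fitness(i, population, fitnesses, sigma=3.0):
    """Fitness degraded by the niche count:
    f'(i) = f(i) / sum_j sh(d(i, j)), with sh(d) = max(0, 1 - d/sigma)."""
    niche_count = sum(max(0.0, 1.0 - hamming(population[i], other) / sigma)
                      for other in population)
    return fitnesses[i] / niche_count

# Two identical individuals share one niche, so each keeps half its raw fitness:
print(shared_fitness(0, ['111', '111'], [10.0, 10.0]))  # 5.0
```

Crowded niches are thus penalised, which spreads the population over the peaks of the fitness landscape in proportion to their height.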
• Meta-Level GA
In a meta-level GA, GAs are contained within other GAs. For the simplest case of a
two level GA, the top level GA calls upon the bottom level GA during evaluation.
This bottom level GA can be used to optimise some sub-problem of the overall
problem. A two level GA has been used where one GA is used to control the
parameters (mutation rate etc.) of the other GA.
3.2 Genetic Programming (GP)

• Representation
In GP the chromosomes are made up of a set of functions and terminals connected to
each other by a hierarchical tree structure. The endpoints or leaves of the
chromosomal tree are defined by the terminals, all the other points are functions.
Typically the set of functions (denoted by 'F') includes arithmetic operations, logical
operations and problem specific operators. The terminal set (denoted by 'T') is made
up of the data inputs to the system and the numerical constants. Functions can
generally have other functions as well as terminals as their arguments and must
therefore be well-defined to handle any input combination. The number of arguments
a function has must be defined beforehand. GP incorporates 'variable selection'; it is
not needed to set a priori which data-inputs are going to be used. These are selected
on the run, which can be a useful concept when it is not known in advance exactly
which data-inputs are needed in order to solve the problem. Figure 3.6 shows an
example of a very simple chromosome, made up of the functions AND and OR and
the data terminals D0 and D1. The function set could for this example be: F = {AND,
OR, XOR} and the terminal set simply: T = {D0, D1}.
• Evaluation
During evaluation of the chromosome the data inputs D0 and D1 are assigned actual
input values. The output of the chromosome (= program) is then calculated as the
value of the top-most function-point in the tree, the root, and is used in the fitness
function as a measure of the performance of the individual on the problem. In this
example of Boolean functions, the tree representation used by GP is much more
natural than a string representation used by a GA.
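A hedged sketch of how such a tree chromosome could be represented and evaluated bottom-up; the nested-tuple encoding and the example tree are ours, loosely modelled on Figure 3.6:

```python
# A chromosome is a nested tuple: (function_name, child, child, ...) or a terminal name.
FUNCTIONS = {
    'AND': lambda a, b: a and b,
    'OR':  lambda a, b: a or b,
    'XOR': lambda a, b: a != b,
}

def evaluate(node, inputs):
    """Compute the tree bottom-up; the root's value is the program output."""
    if isinstance(node, str):           # terminal: a data input such as 'D0'
        return inputs[node]
    name, *children = node              # function point with its argument subtrees
    return FUNCTIONS[name](*(evaluate(c, inputs) for c in children))

tree = ('OR', ('AND', 'D0', 'D1'), 'D1')    # a simple AND/OR chromosome
print(evaluate(tree, {'D0': True, 'D1': False}))  # False
```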
Mutation re-initialises a randomly chosen point (= gene) in the tree. In general this can
be a function or a terminal. An example is shown below where a function-point is
chosen to undergo mutation.
The names PROGN, DEFUN etc are labels used in the actual representation of the
chromosome in the GP system. An ADF is defined by its name (e.g. ADF0), by the list
of its dummy arguments (ARG0 and ARG1) and by the actual function as defined in
the body. This function is just another tree structured program like the one in Figure
3.6 as is the result producing branch. When ADFs are present, the function set is
extended with the ADFs. In the example above, the function set would now be: F =
{AND, OR, XOR, ADF0, ADF1}, with ADF0 being a function taking two arguments
and ADF1 taking three.
As an illustration Figure 3.10 shows an example of the body of ADF0 and of the result
producing branch or main program of a chromosome.
When the chromosome is evaluated the result producing branch is computed, where
the body of ADF0 is called upon when the function ADF0 is encountered. The ADF
body is then instantiated with the appropriate arguments from the main program, which
can be other functions or terminals, and its output is returned. This evaluation always
takes place 'bottom-up': the outputs of the functions are fed from the bottom of the
tree towards the top (or root).
Figure 3.10 Example of the body of ADF0 (left) and of the result-producing branch
or main program of a chromosome
The genetic operators work on both branches. The idea is that GP will dynamically
evolve functions that are useful to the problem (ADFs) as well as a main program that
calls upon these functions. A parallel can be drawn here to the field of neural networks
where a certain part of the network performs a function that can be seen as a subtask
for the complete problem. The difference is that its position within the neural network
is fixed, and that it is of no use to the network if the network needs this same function
somewhere else but with different inputs.
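A minimal sketch of this evaluation scheme (illustrative Python, not the book's LISP-style representation; the nested-tuple encoding and the particular ADF body are assumptions):

```python
# Sketch: a chromosome with one ADF and a result-producing branch. When ADF0
# is encountered in the main program, its body is evaluated with the computed
# arguments bound to the dummy arguments ARG0 and ARG1.

FUNCS = {"AND": lambda a, b: a and b,
         "OR":  lambda a, b: a or b,
         "XOR": lambda a, b: a != b}

def evaluate(node, env, adfs):
    if isinstance(node, str):                     # terminal or dummy argument
        return env[node]
    name, children = node
    args = [evaluate(c, env, adfs) for c in children]
    if name in adfs:                              # call on the ADF body
        params, body = adfs[name]
        return evaluate(body, dict(zip(params, args)), adfs)
    return FUNCS[name](*args)

# ADF0(ARG0, ARG1) = XOR(ARG0, ARG1); main program = AND(ADF0(D0, D1), D1)
adfs = {"ADF0": (("ARG0", "ARG1"), ("XOR", ["ARG0", "ARG1"]))}
main = ("AND", [("ADF0", ["D0", "D1"]), "D1"])
print(evaluate(main, {"D0": True, "D1": False}, adfs))   # -> False
```

Unlike the fixed subnetwork of a neural network, the same ADF body can be called from several places in the main program with different arguments.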
( μ/ρ + λ )-ES or ( μ/ρ , λ )-ES
During the course of a generation the μ parents initially create λ offspring by means of
mutation and sometimes recombination. Then the intermediate population consisting
of parents and offspring is reduced to the original size by means of a 'selection'
process which simply retains the best μ individuals and discards the rest. The '+' and ','
denote the selection method used. In a ( μ/ρ , λ )-ES the parents can not be selected
as members of the next generation, while in a ( μ/ρ + λ )-ES system they can. The
integer ρ, also called the mixing number, denotes how many parents mix their genes
during the creation of offspring. In the case ρ = 2, two parents mix their genes by
means of a crossover operator to produce offspring (typically one). The offspring are
then mutated. In the absence of crossover (mutation only) ρ = 1. The first systems
developed were ( 1 + 1 )-ESs, where a single parent produces a single offspring that
replaces it if it is better and is discarded otherwise. Multimembered ESs were developed
later, including the addition of crossover operators. There is no selective pressure in a
multimembered ES; every individual has an equal chance of producing offspring.
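The generational loop described above can be sketched as follows (a toy Python illustration with ρ = 1, mutation only; the sum-of-squares fitness and all parameter values are assumptions):

```python
# Sketch of (mu + lambda) and (mu, lambda) selection: mu parents create
# lambda offspring by mutation, then the best mu individuals are retained.
import random

def es_generation(parents, lam, fitness, plus=True, step=0.1):
    """With plus=True, parents stay eligible ('+' selection); with
    plus=False, only the offspring compete (',' selection)."""
    mu = len(parents)
    offspring = []
    for _ in range(lam):
        parent = random.choice(parents)      # uniform choice: no selective
        offspring.append(                    # pressure among the parents
            [x + random.gauss(0.0, step) for x in parent])
    pool = parents + offspring if plus else offspring
    return sorted(pool, key=fitness)[:mu]    # retain the best mu

random.seed(1)
fitness = lambda ind: sum(x * x for x in ind)   # minimise: optimum at origin
pop = [[random.uniform(-5, 5) for _ in range(3)] for _ in range(5)]
for _ in range(200):
    pop = es_generation(pop, lam=20, fitness=fitness)
print(round(fitness(pop[0]), 4))                # best fitness, close to 0
```

Note that a (μ, λ)-ES requires λ ≥ μ, since the next generation is drawn from the offspring alone.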
An important feature of ESs is that the range of mutations, the stepsize, is not fixed but
inherited. It is unique to an individual and generally different for each gene. An
individual is represented by the pair of vectors v:
v = ( x , σ )
x denotes a point in the search space consisting of l genes and σ is a vector of the
same length consisting of standard deviations, one for each gene. Mutation creates a
new offspring x' from x by adding to it a Gaussian number with mean 0 and standard
deviation σ:
x' = x + N( 0 , σ )
Although not present in the earliest models, σ is normally adapted during the mutation
process as well. A commonly used method is:
σ' = σ · e^N( 0 , Δσ )
where Δσ is a system parameter. A commonly used crossover operator creates a single
offspring ( x , σ ) from two parents ( x¹ , σ¹ ) and ( x² , σ² ) by randomly mixing their genes
(as in uniform crossover in GAs) and their step sizes. Mutation is performed after this to
complete the process.
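The self-adaptive mutation and uniform recombination described above can be sketched as (illustrative Python; the value Δσ = 0.2 and the example parents are assumptions):

```python
# Sketch: ES reproduction on individuals v = (x, sigma).
import math, random

def mutate(x, sigma, delta_sigma=0.2):
    """Rescale each inherited step size by a lognormal factor,
    sigma' = sigma * exp(N(0, delta_sigma)), then add N(0, sigma')
    noise to the corresponding gene."""
    new_sigma = [s * math.exp(random.gauss(0.0, delta_sigma)) for s in sigma]
    new_x = [xi + random.gauss(0.0, s) for xi, s in zip(x, new_sigma)]
    return new_x, new_sigma

def recombine(parent1, parent2):
    """Uniform recombination of both the genes and the step sizes."""
    (x1, s1), (x2, s2) = parent1, parent2
    picks = [random.random() < 0.5 for _ in x1]
    x = [a if p else b for p, a, b in zip(picks, x1, x2)]
    s = [a if p else b for p, a, b in zip(picks, s1, s2)]
    return x, s

random.seed(0)
child = recombine(([1.0, 2.0], [0.1, 0.1]), ([3.0, 4.0], [0.2, 0.2]))
print(mutate(*child))      # mutation completes the reproduction step
```

Because each gene carries its own inherited standard deviation, the search distribution itself adapts as the population evolves.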
Many extensions and alterations have been made on the basic ES scheme described
here. It is interesting to note that although the fields of GAs and ESs vary in a number
of ways, quite a few ideas are being taken from one field and implemented in the
other. Examples are the introduction of the crossover operator in ES and real-valued
instead of binary encoding with 'creeping' or additive mutations in GAs. Also the idea
of adaptive parameters, especially the mutation rate, has received a lot of attention in
the GA community lately.
4. The Biological Background
Since practically all ideas and certainly most of the nomenclature in the field of
evolutionary computation are taken from its biological counterpart, a brief
introduction of genetics [42],[58] is presented in this chapter together with an
overview of the main concepts of Darwinian evolutionary theory. First, the genetic
structures as observed in nature are described. Second, the actual process of
reproduction and the occurrence of mutations is then dealt with. Third, the process of
natural evolution is described in section 4.4 in terms of present-day evolutionary
theories. Fourth, the link is made between this biological background and the field of
evolutionary computation (focused on genetic algorithms).
A chromosome consists of two strands called chromatids joined together at one point
by a centromere. Chemically the genetic information in a chromosome is carried by
the nucleic acids DNA and RNA.
All cells in an organism are identical in their chromosomal content. There is thought
to be some switching mechanism which, together with the position of a cell in the
organism, determines which genes become operative and which do not. This in turn
determines the specialisation of a cell; i.e. if a cell operates as a liver cell or a skin
cell.
• Epistasis
The way a gene is expressed in the phenotype or whether it is expressed at all often
depends on the presence or absence of another gene. When there is such an interaction
between genes in the expression of the genotype, it is called epistasis. The most
common form of epistasis is the masking effect. This means that a gene acts as a mask
for one or more other genes. When the masking gene is present in the chromosome it
completely 'turns off' this set of genes; i.e. these genes are not expressed in the
phenotype. In the absence of the masking gene they are.
4.2 Reproduction
In organisms there are two reproductive methods by which cells divide to form new
cells. The first kind is mitosis, where the parent cell simply divides itself into two cells
identical to the parent. This is the main method by which organisms produce new cells
in order to grow larger. It is also part of asexual reproduction as used by simple
organisms. The second one is meiosis, or 'reduction division', and is used for sexual
reproduction. Meiosis produces four cells from one parent cell. In sexual reproduction
special reproductive cells called gametes are used.
When two organisms perform sexual reproduction, each of them produces gametes
(the sperm of the male and the egg of the female) by means of meiosis. Normal cells
in an organism carry pairs of chromosomes of each type and are said to be diploid.
The two chromosomes in a pair are called homologous chromosomes. A gamete
carries only one such set of chromosomes and is said to be haploid. Thus a haploid
cell contains half the number of chromosomes of a diploid cell. Also, in a gamete,
instead of two genes at every locus there is only one gene. Each of the two genes of a
locus of a cell before meiosis has a chance of 50% of ending up at the locus of the
gamete; this process is called segregation. This is Mendel's First Law, which says that
characteristics of organisms are carried in pairs and only one of each pair can be
carried by a gamete, each having equal chance of ending up in the gamete. The second
stage of sexual reproduction is fertilisation where the gametes of the male and female
unite to form one new cell called a zygote, restoring the original count of
chromosomes and again having two genes at each locus.
Another phenomenon found in reproduction is gene linkage. It is found that during the
formation of gametes, alleles associated with the same chromosome remain together in
the offspring. For example alleles such as a1b1 or a2b2 may be linked in the offspring,
forming a linkage group.
4.3 Mutations
Apart from the normal processes described above, comparatively rare events called
mutations can occur. A mutation is a change in a chromosome which may result in a
change in the characteristic of a cell or an organism. A mutated individual is called a
mutant. Most often mutations are harmful to the cell or organism resulting in disease
or even death. When they are beneficial however, they have great effect, providing a
basis for variation between and within a species. This ensures that species can adapt
to changing environments. Mutations can be divided into chromosome mutations and
gene mutations.
• Recombination
Another form of chromosome mutation that also occurs during meiosis is
recombination. During meiosis the homologous chromosomes are intimately
intertwined and various types of mixing of chromosomes can occur when they wrap
around each other. This type of general or homologous recombination is also known
as crossover.
Points of attachment in a chromosome are called chiasmata and define the points
where a chromosome might break and rejoin with the homologous chromosome next
to it. A single crossover involves the swapping of the parts of two chromosomes at a
single chiasma. Double or triple crossover can occur when chromosome parts are
swapped at more than one place. The probability that two different linked alleles cross
over together (i.e. end up in the same offspring) is a function of how close they are
together on the chromosome. The closer they are together, the higher the frequency.
A B C | D E F       A B C d e f
a b c | d e f   =>  a b c D E F

A B A B | C D       A B A B A B C D
A B | A B C D   =>  A B C D
Unlike the above cases certain recombinations are non-reciprocal and only one of the
offspring is changed by crossover while the other remains unaffected. This is referred
to as gene conversion.
A B C | D E F       A B C d e f
a b c | d e f   =>  a b c d e f
Still other forms of recombination are possible, often resulting in more subtle changes
in the DNA structure of the chromosomes.
• Inversion
Inversion occurs when a chromosome section breaks off and the broken part turns and
rejoins the rest of the chromosome resulting in a reverse order of the genes in that
section.
• Deletion
Deletion is the phenomenon where a chromosome section breaks off and is omitted
from the chromosome altogether. The two loose ends of the chromosome then join up
resulting in a shorter chromosome.
• Translocation
When crossover occurs between two non-homologous chromosomes, this is called
translocation. This phenomenon is also known as non-homologous recombination.
• Polyploidy
Occasionally, because of an erroneous meiosis, a diploid gamete is produced instead
of a normal haploid one. When this gamete is united with a normal haploid gamete
during fertilisation, the resulting zygote will have three sets of chromosomes instead
of two and is called triploid. If two of those abnormal diploid gametes unite, the result
is a tetraploid zygote. This phenomenon is called polyploidy and although rare in
animals, it is quite frequently found amongst plants and can actually be beneficial for
the organism.
a single common ancestor. In everyday use the term evolution is often confused with a
specific evolutionary theory such as the one proposed by Darwin, which tries to
explain how evolution actually works.
While the kind of evolution described by Darwin normally takes place over very long
time spans and observations of it are based on fossil records, evolution can and has
been directly observed within a span of only several years. For this reason the
distinction between microevolution and macroevolution is often made. While some
biologists feel the mechanisms of both are different, most simply treat macroevolution
as a long cumulative series of microevolutions.
Evolutionary mechanisms can basically be grouped into two categories: those that
increase genetic variation and those that decrease it. The mechanisms that increase
variation are the mutations occurring during reproduction as described in the last
section as well as a concept called gene flow. Gene flow simply means that new
genetic information is introduced into the population by migration from another
population. It occurs when two more or less related species from different populations
mate. The mechanisms decreasing genetic variation are natural selection and genetic
drift, and these are now described in more detail.
• Natural Selection
In Darwinian evolutionary theory natural selection is seen as the creative force of
evolution. When supplied with genetic variation it makes sure that sexually
reproducing species can adapt to changing environments. In the course of evolution
natural selection preserves the favourable part of the variation within a species. It
often does this by letting the fittest individuals of a species produce the most offspring
for the next generation. It provides a selective pressure that favours the fitter
individuals of a population. The theory of natural selection is therefore often referred
to as the survival of the fittest. This term is misleading for a number of reasons.
Reproduction ('survival') of the organism itself is not the driving force of natural
selection. The driving force is the contribution of the organism's alleles to the next
generation's gene pool. Natural selection favours selfish behaviour but does so more
at the level of genes than at the level of organisms. For example it can be beneficial
for an organism to help other organisms reproduce that are closely related to it, i.e.
share many of the same alleles, sometimes even sacrificing its own chances of
reproduction or even its own life. For this reason fitness is often split into two
components: direct fitness, which is a measure of how many alleles the organism can
enter into the next generation's gene pool by reproduction of itself, and indirect
fitness, which measures how many alleles identical to its own but belonging to other
organisms it helps enter the gene pool. Natural selection works in such a way as to
increase the combination: the inclusive fitness.
Another point against the term "survival of the fittest" is that survival is only one
component of selection. Another one, often even more dominant, is sexual selection.
In many species males have to compete against each other for mates. This competition
can be physical or it can be ruled by female choice. In the latter case organisms evolve
traits, 'status symbols', which are favoured by females for sexual selection. In some
species where very few males monopolise all females, many males live to
reproductive age but very few of them ever mate. While they perhaps do not differ in
their ability to survive, they do differ in their ability to attract mates. The fitness of an
organism is therefore not just a measure of its physical abilities, it is often much more
a measure of its sexual attractiveness.
For natural selection to be a creative force, the genetic variation must be random and
its effect relatively small. This is the case in present evolutionary theories. A
fundamental concept of Darwinism often not understood is that evolution has no
direction and that there really is no sense of progress where certain organisms are
'better' than others. Organisms just become better adapted to their environments. The
changes made may in fact prove harmful when the environment changes. A related
popular notion is that natural selection favours organisms with a high level of
complexity, resulting in an 'evolutionary ladder' from simple one-celled organisms to
the ultimate creation: man. In fact by far the most successful species in the past and
present are the simplest of them all: bacteria, whose existence is incidentally crucial to
our own. From evolutionary theory it should be concluded that the evolution of
mankind is nothing more than a lucky outcome of thousands of linked events and by
no means inevitable.
• Genetic Drift
Even without a selective pressure contributed by the mechanism of natural selection,
there is a mechanism at work that decreases genetic variation. If it were the case that
each organism had an equal chance of producing offspring (i.e. no selective pressure)
and there was no mechanism for introducing variation, the frequency of alleles would
decrease by means of genetic drift. Genetic drift is simply the binomial sampling error
of the gene pool. The organisms that reproduce increase the frequency of their alleles
over the population. In the next generation the frequency of these alleles is expected to
increase even more simply because there is a larger chance that an organism
possessing them is chosen to reproduce. Without mechanisms to introduce variation,
the effect of genetic drift (with or without natural selection) would ultimately be a
complete lack of genetic variation in the gene pool.
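The binomial sampling error behind genetic drift can be illustrated with a small simulation (a sketch; the population size and generation count are arbitrary assumptions):

```python
# Sketch: allele frequency under pure drift (no selection, no mutation).
# Each generation the 2N gene copies are a binomial sample of the current
# gene pool, so the frequency wanders until the allele is fixed or lost.
import random

def drift(pop_size=50, freq=0.5, generations=2000, rng=random):
    for _ in range(generations):
        copies = sum(rng.random() < freq for _ in range(2 * pop_size))
        freq = copies / (2 * pop_size)
        if freq in (0.0, 1.0):            # allele fixed or lost
            break
    return freq

random.seed(3)
print([drift() for _ in range(5)])        # runs typically end at 0.0 or 1.0
```

Smaller populations drift faster, since the relative sampling error per generation is larger.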
• Preadaptation
One of the main difficulties for evolutionary theories is to explain how complex
structures in organisms evolved from scratch. For example it can be very beneficial
for an organism to have an eye, but since evolution works in small steps, how
beneficial can it be to have say 5% of an eye? This is usually explained by the concept
of preadaptation. Preadaptation states that a structure in an organism can change its
function radically while its form remains approximately the same; i.e. functional
change in structural continuity. In the first steps towards the evolution of an eye, the
structure serves a different purpose than vision. This purpose has to be beneficial to
the organism for it to be rewarded by natural selection.
• Niching
Biological systems use a restrictive mating scheme to encourage the formation of
species: speciation. Only organisms in the same niche can mate with each other. A
group of organisms within a species is called a population of organisms. When a
population differs to a certain extent from the rest of the species it forms its own niche
and can ultimately form a new species.
that an organism learned in its lifetime was not physically passed on to future
generations [58]. The opposite view was Lamarckian inheritance where learned
characteristics are passed on to offspring. Lamarckism is not widely accepted.
• Optimisation
Natural selection does not necessarily have the effect of producing optimal structures
or behaviours. For one thing it acts on the organism as a whole, not on specific traits.
There is only one fitness measure (the inclusive fitness) that is influenced by many
factors. Many species are stuck in so called local optima simply because the transition
from this local optimum to a global optimum (assuming there actually is one) is very
unlikely. This transition would normally involve having to pass through less adaptive
states. Natural selection does not cater to this; the only way the species can reach a
state with a higher fitness is by a lucky variation (mutation) or combinations of these.
Since environments are generally non-stationary, even being in a very fit state does
not mean the species will continue to thrive in the future. In fact when a species has
specialised itself to function perfectly in a certain environment, it is likely to find
difficulties in adapting if this environment happens to change. Natural selection has no
mechanism that provides future planning. It is a purely local mechanism.
To abstract from the special Darwinian theories described in this section the concept
of a minimal Darwin Machine is introduced [7]. A Darwin Machine, being a system
ruled by a Darwinian process, must have the six essential properties listed in Table 4.1.
For each property the corresponding occurrence in Darwinian evolutionary theory is
given.
Table 4.1 The requirements for a minimal Darwin Machine illustrated by their
occurrence in Darwinian evolutionary theory.
• Genetic Representation
The string representation of chromosomes in GAs is comparable to the ones found in
real life. However, nearly all evolutionary computation algorithms so far have been
limited to haploid chromosomes, where each locus can only contain one gene. While
in biological systems this is true for gametes during reproduction, normal cells always
have a pair of genes at each locus. This feature allows organisms to adapt
more quickly to changing environments and is especially useful if the organism is
required to switch between two environment states. Also a population of organisms
that have diploid chromosomes can contain a much larger genetic variability than
organisms with haploid chromosomes.
Lately the representation used in GAs tends to be more problem specific and no longer
limited to the classic genetic string. Genetic programming of course with its tree
structured chromosomes uses a representation quite different from the one found in
nature.
• Selection
In nature, adaptation is performed using natural selection instead of the selection
method used in most evolutionary computation systems. The main difference is that in
natural selection there is no such thing as a superimposed fitness measure. Not just the
organisms but also the fitness measure evolves. EC systems where this is implemented
are said to exhibit open-ended evolution. The majority of EC systems however work as
function optimisers and therefore necessarily have a fixed fitness function. This is
probably the main difference between natural evolution and EC. Most biologists
would argue that the idea of optimisation in itself is not found in nature and it is for
this reason possibly quite dangerous to blindly copy ideas from natural evolution into
the field of EC.
While it seems to be true that Lamarckian learning does not actually occur in
biological systems, it can prove beneficial for evolutionary computation. It not only
changes the fitness (by means of local search) but it also changes the genetic
representation of the individual so that the learned information can be passed on to
future generations. In EC systems where there is no way for the fitness function to
evolve, Lamarckian learning provides the only mechanism for passing on learned
information. Since almost all EC systems are used as optimisation algorithms the
fitness function will indeed be fixed.
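A minimal sketch of Lamarckian learning in an EC setting (illustrative Python on a toy maximise-the-ones problem; the hill-climbing local search is an assumed stand-in for any local search procedure):

```python
# Sketch: Lamarckian learning. After local search improves a solution, the
# improvement is written back into the genotype, so the learned information
# is inherited by any offspring of this individual.
import random

def local_search(bits, fitness, steps=10, rng=random):
    """Simple hill climbing: try single bit flips, keep non-worsening ones."""
    best = bits[:]
    for _ in range(steps):
        trial = best[:]
        trial[rng.randrange(len(trial))] ^= 1     # flip one bit
        if fitness(trial) >= fitness(best):
            best = trial
    return best

fitness = sum                                     # count of ones (toy problem)
random.seed(4)
genome = [random.randint(0, 1) for _ in range(20)]
before = fitness(genome)
genome[:] = local_search(genome, fitness)         # Lamarckian step: the learned
                                                  # improvement overwrites the
                                                  # genotype itself
print(before, "->", fitness(genome))              # fitness never decreases
```

Under Baldwinian learning, by contrast, only the fitness value would be updated and the genotype would be left unchanged.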
• Epistasis
Tackling epistasis is one of the main problems in GAs. Present day GAs usually fail
when the level of epistasis is high. As most theoretical work in GAs is concerned with
problems of low epistasis, more work needs to be done to understand the behaviour of
GAs on problems with high epistasis. By contrast, biological systems perform well even
with a very high level of epistasis. In fact the level of epistasis found even in simple
organisms is so high that some biologists reject the reductionist approach that gave rise
to the idea of genes as a useful tool for studying genetics. A better understanding of
biological systems concerning epistasis is expected to be of value in research on GAs.
Table 4.2 The requirements for a minimal Darwin Machine and their occurrence in
evolutionary computation.
There seem to be two main reasons why, in general, evolutionary computation does
not qualify as a Darwin Machine. First, there is no equivalent to the struggle for a
place in a limited territory or workspace in EC. An individual feels no effect of the
way other individuals perform on the problem, and there is no notion of some kind of
'resource' (e.g. territory, food, or even mates) that is limited in any sense. Second, there
is no influence of the environment reflected in the fitness value. The only thing that is
reflected in the individual's fitness is its own performance on the problem.
An exception to this general picture of EC can be found in the work by Nolfi and
Parisi [55] where the GA system evolves artificial organisms represented by
ecological neural networks that compete with each other in a limited two dimensional
world in the quest for food. A changing environment is modelled by varying the food
resources over time. Experiments are performed where the fitness function itself is left
to evolve, resulting in observed forms of preadaptation to changing environments.
This system does meet all the requirements for a Darwin Machine even though the
concept of a Darwin Machine was really set up to compare natural processes rather
than artificial ones. The GA system in this approach can therefore be said to belong to
the field of artificial life, where complex, nature-like behaviour is generated from
interacting artificial organisms operating with relatively simple rule-based systems.
• Overview
Table 4.3 gives a brief overview of the main differences between the evolutionary
theory of biological systems and the operation of most present day GAs.
Table 4.3 A brief overview of the main differences between Darwinian evolutionary
theory and most present day GAs
As stated before, the main overall difference between the two systems lies in their
goals (or rather the lack of one in Darwinian evolutionary theory). While in
evolutionary computation the goal almost always is the optimisation of some kind of
fixed problem, this does not necessarily seem to be the case for biological
evolutionary systems. Still, the success of evolutionary computation as a function
optimiser, as reported on a wide variety of problems and in some parts supported by
theoretical foundations, indicates that many features of Darwinism lend themselves
very well to this purpose.
5. Mathematical Foundations of
Genetic Algorithms
Several approaches can and have been taken as a first step to form a basic theory of
genetic algorithms (and genetic programming), each one providing some useful
insights into their functioning. Still, a fundamental theory incorporating all aspects of
genetic algorithms is a long way off.
One of the first and most frequently referenced foundational works in this field is by
Holland [30], who examined the case of a binary coded fixed length genetic algorithm
and introduced a mathematical foundation known as the Schema Theorem. Goldberg
[19] extended this idea with the notion of the so called Building Block Hypothesis.
The reproduction or selection operator makes sure that the search is biased in the
direction of chromosomes with high fitness values if maximising or low fitness values
if minimising. Chromosomes that have above average fitness have more chance to
survive and to reproduce than others. Chromosomes with very low fitness will die off.
Low fitness chromosomes are needed in the population though, because they can
contain information that can be useful or even crucial to the formation of the optimal
chromosome.
The crossover operator ensures that partial information contained in one chromosome
can reach other chromosomes in the population. This mixing of information leads to
the formation of optimal chromosomes. In order for the genetic algorithm to perform
well the crossover operator should be such that a high correlation exists between the
fitness of the parent chromosomes and the fitness distribution of their offspring.
In order to fully explore the search space, diversity of the population is crucial. The
purpose of the mutation operator is to maintain enough diversity in the population to
overcome local optima and eventually reach the global optimum. A mutation-rate that
is too high can destroy useful information in highly fit chromosomes and slow down
the search.
• Evolvability
Since even in a pure random search there is a chance that the offspring chromosomes
are fitter than the parents, a genetic algorithm should have a better average
performance than a random search. In order to do so, the effect of the genetic
operators should be such that there exists a high correlation between the fitness of the
parents and the fitness distribution of the offspring. When this is the case the fitness
distribution of the offspring can on average be expected to be better than the one
belonging to the parents. This correlation property is called the evolvability of a
genetic algorithm [2] and serves as a local performance measure. The global
performance measure then is simply the ability of the genetic algorithm to produce
fitter offspring over the course of one or more generations. This global performance
depends on the maintenance of the evolvability of the population as the search is
guided to the global optimum. Using the Schema Theorem, this evolvability can be
expressed by the Building Block hypothesis.
The search space, Ω, is the complete set of possible strings. In the case of a fixed-length
chromosomal string where each gene can take on a value in the alphabet A, the
size of the search space is:
size(Ω) = k^l
where:
k = alphabet size
l = chromosome length
Returning to the example of the last chapter, the search space has a size of 2^9 = 512. In
other words: there are 512 different chromosomes in the search space.
H1 = * 1 0 * 0 1 0 0 1
Strings matching this pattern are said to belong to schema H1, but they belong to many
other schemata as well. The search space can thus be defined as a schema of length l
with a 'don't care' symbol at every position; in our case: Ω = * * * * * * * * *. It also
follows that the number of possible schemata is (k+1)^l; in our case: (2+1)^9 = 19683.
In order to present further discussion about schemata, the following properties need to
be defined. The order of a schema, o(H), is the number of fixed positions in H. The
order of H1 for example is o(H1) = 7. The defining length, δ(H), of a schema is the
distance between the first and the last fixed positions of the schema. In the example
schema: δ(H1) = 9 − 2 = 7.
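These two schema properties are straightforward to compute (a sketch; '*' is the don't-care symbol and positions are counted from 1, as in the text):

```python
# Sketch: the order o(H) and defining length delta(H) of a schema string.

def order(schema):
    """o(H): the number of fixed (non-'*') positions in H."""
    return sum(c != '*' for c in schema)

def defining_length(schema):
    """delta(H): distance between the first and last fixed positions
    (0 for a schema with no fixed positions)."""
    fixed = [i + 1 for i, c in enumerate(schema) if c != '*']
    return fixed[-1] - fixed[0] if fixed else 0

H1 = "*10*01001"
print(order(H1), defining_length(H1))   # -> 7 7
```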
The number of strings in the population belonging to schema H, m(H), is given by:
m(H) = Σ_{x∈H} m(x)
Using schemata the effects of the genetic operators on the fitness distribution of the
population can be seen.
The average fitness of all strings in the population representing schema H is defined
as:
f(H) = Σ_{x∈H} f(x)·m(x) / m(H)
f(H) is also called the average payoff function of schema H, and the fitness of a string
x in the population is f(x). Using standard roulette-wheel reproduction the
expected number of times string x is selected is given by E(x) = f(x) / f̄. It can be
seen that the expected number of strings belonging to schema H in the next population is
given by:
m(H, t+1) = m(H, t) · f(H) / f̄
where f̄ is the average fitness of all the strings in the population at time t.
From the above equation it can be seen that schemata with above-average fitness
values will reproduce in increasing numbers in the next generation, while schemata
with below-average fitness values will eventually die off. When f(H) / f̄ is relatively
constant, the equation can be approximated by a linear difference equation of the form
m(H, t+1) = a·m(H, t). The solution is then given by:
m(H, t) = m(H, 0) · a^t
With a being approximated by f(H) / f̄, it can be seen that strings belonging to above-average
schemata are expected to grow exponentially while strings belonging to
below-average schemata are expected to decay exponentially.
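The expected growth m(H, t+1) = m(H, t)·f(H)/f̄ can be computed directly for a toy population (a sketch; the example strings and the count-the-ones fitness are assumptions):

```python
# Sketch: expected number of schema members after one round of
# roulette-wheel reproduction, m(H, t+1) = m(H, t) * f(H) / f_bar.

def matches(schema, string):
    """True if the string belongs to schema H ('*' is don't-care)."""
    return all(s == '*' or s == c for s, c in zip(schema, string))

def schema_growth(population, fitness, schema):
    f_bar = sum(fitness(x) for x in population) / len(population)
    members = [x for x in population if matches(schema, x)]
    f_H = sum(fitness(x) for x in members) / len(members)
    return len(members) * f_H / f_bar       # expected m(H, t+1)

pop = ["110001001", "010101001", "000000000", "111111111"]
print(schema_growth(pop, lambda s: s.count('1'), "*1*******"))   # -> 4.0
```

Here the schema's average fitness (17/3) is just below the growth needed to double, so its three members are expected to become four.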
The effect of the crossover operator on a schema depends on its defining length, and
can best be illustrated by an example. Consider a string x that is selected for 1-point
crossover, and consider two representative schemata H1 and H2 within that string:
x  = 0 1 0 0 0 1 | 0 0 1
H1 = * 1 * * * * | * * 1
H2 = * * 0 0 * * | * * *
The crossover point shown above is randomly chosen to be 6. Unless the string with
which x mates has the same gene values at positions 2 and 9, a possibility that will be
ignored for now, schema H1 will not survive. Schema H2 however does survive the
crossover operator and at least one of the offspring will belong to H2. Due to its longer
defining length it is clear that schema H1 has less chance of surviving crossover than
does H2. Only a crossover with its crossover site at position 3 will destroy schema H2,
while only a crossover at position 1 preserves schema H1. The defining lengths for the
schemata are: δ(H1) = 9 − 2 = 7 and δ(H2) = 4 − 3 = 1.
Generally speaking a schema survives 1-point crossover if the crossover site falls
outside its defining length. Assuming the crossover site is chosen randomly from [1, ...,
l−1], the probability of survival for a schema H is:
p_s = 1 − δ(H) / (l − 1)
When crossover itself is applied with probability p_c, the following expression gives a
lower bound on the survival probability of schema H due to the crossover operator:
p_s ≥ 1 − p_c · δ(H) / (l − 1)
This result can be extended for the 2-point crossover operator. Assuming the two
crossover sites are chosen independent from each other and assuming they are not
equal to each other (otherwise there would be no crossover), the survival probability is
given by:
    p_s ≥ 1 − p_c · [1 − (1 − δ(H)/(l−1)) · (1 − δ(H)/(l−2))]
Uniform crossover diminishes the survival probability of schemata. Since every gene
in the chromosome has a 50% chance of survival, the lower bound on the survival
probability is:
    p_s ≥ 1 − p_c · (1 − (0.5)^o(H))

Finally, consider the mutation operator. A schema H survives mutation only if none of
its o(H) fixed positions is mutated, so with mutation rate p_m the survival probability
is:

    p_s = (1 − p_m)^o(H)
For small values of p_m, as is usually the case, this may be approximated by
1 − o(H)·p_m.
Combining the effects of reproduction, crossover and mutation, a lower bound on the
expected number of strings belonging to schema H in the next generation is:

    m(H, t+1) ≥ m(H, t) · (f(H)/f̄) · [1 − p_c·δ(H)/(l−1) − o(H)·p_m]
From the above equation the Schema Theorem may now be stated:
The Schema Theorem: Using reproduction, crossover and mutation in the standard
genetic algorithm, short, low-order, above-average schemata receive exponentially
increasing trials in subsequent generations. [19]
Finally, it can be shown that the number of schemata which are effectively processed
in each generation is of the order N³, with N the population size. This property of GAs
which helps explain its performance on many optimisation problems is known as
implicit parallelism.
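The practical content of the theorem is easy to see numerically. The following Python sketch evaluates the lower bound above for two schemata of equal (above-average) fitness but different defining lengths; the function name and the sample numbers are illustrative, not from the text:

```python
# Lower bound from the Schema Theorem for reproduction, 1-point crossover
# and mutation: m(H,t+1) >= m(H,t) * f(H)/f_avg * (1 - p_c*d/(l-1) - o*p_m)
def schema_bound(m_H, f_H, f_avg, delta_H, o_H, l, p_c, p_m):
    survival = 1.0 - p_c * delta_H / (l - 1) - o_H * p_m
    return m_H * (f_H / f_avg) * survival

# Two schemata of equal above-average fitness on strings of length 10:
short = schema_bound(10, 1.5, 1.0, delta_H=1, o_H=2, l=10, p_c=0.7, p_m=0.01)
long_ = schema_bound(10, 1.5, 1.0, delta_H=7, o_H=2, l=10, p_c=0.7, p_m=0.01)
print(short, long_)   # the short schema keeps a much larger expected count
```

The short, low-order schema retains almost all of its reproductive gain, while the long schema of identical fitness loses most of it to crossover disruption.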
The Building Block Hypothesis (BBH): The partial information contained in the
building blocks is combined in a GA to form globally optimal strings. [19]
The genetic algorithm can now be seen to work in such a way that the building blocks
are sampled, recombined and resampled to form strings of higher fitness which
ultimately should arrive at the global optimal string. The building block hypothesis is
in fact a way to express the evolvability of the standard genetic algorithm. It states that
a genetic algorithm tries to find low-order schemata that have the best average payoff
in each hyperplane partition of the search space and that it combines these to form a
more complete solution.
68 Chapter 5. Mathematical Foundations of Genetic Algorithms
Although the building block hypothesis (BBH) has been shown to work in many
applications, there are GAs for which it does not [23]. The problem-coding
combination in such GAs is generally referred to as being GA-deceptive, meaning that
the GA search is deceived or misled in finding the global optimum. In GA-deceptive
problems there is no regularity in the function-coding combination that may be
exploited by the recombination of short length schemata. Building blocks cannot form.
GA-deceptiveness is a theoretical concept derived from the analysis of schemata
payoff functions. In contrast GA-hardness is a practical concept expressing the actual
performance of the GA; i.e. how easy is it for the standard GA to converge to the
global optimum? It is important to note that GA-deceptiveness does not necessarily
entail GA-hardness. Some problems classified as being GA-deceptive in fact turn out
to be quite easily solved by the standard GA.
As an example where the standard genetic algorithm should have no problem finding
the optimum solution, consider the following GA-non-deceptive problem. Suppose the
optimum solution is the string 000 ... 0 (the string length is undefined). Furthermore
for the average schema fitnesses the following holds:
    f(0**...*) > f(1**...*)
    f(00*...*) > f(01*...*)
    f(00*...*) > f(10*...*)
    f(00*...*) > f(11*...*)
etc.
In other words, for all schemata of a certain length (a hyperplane partition), those with
all 0's in their fixed position are preferred. According to the building block hypothesis
this problem should not deceive the GA and it should easily converge to the global
optimum.
Now consider the same problem, i.e. the optimum is 000 ... 0, but now schemata that
have 1's are preferred for every hyperplane partition. Thus:

    f(1**...*) > f(0**...*)
    f(11*...*) > f(01*...*)
    f(11*...*) > f(10*...*)
    f(11*...*) > f(00*...*)
    etc.
This is a GA-deceptive problem according to the BBH, since the coding regularity
occurs in the non-preferred schemata, and the standard GA should have great
difficulty in finding the optimum. The problem now is to design the genetic coding in
such a way that the problem is not GA-deceptive so that building blocks can form and
the BBH will hold. This is in general very hard to do. One apparent requirement of the
coding in order for building blocks to form is that related genes should be close
together on the chromosome. When they are close together they can form a building
block and guide the GA search to better individuals. According to the BBH therefore
the ordering of the genes in the chromosome can play an important part in the GA
performance.
Using this viewpoint, the genetic algorithm can now be seen as moving between points
across different hyperplanes in search of the optimal point in the search space.
The Walsh-schema transform provides a way to analyse the average fitnesses of
schemata. The Walsh transform of a function f(x) over bitstrings of length l is defined
as:

    w_j = (1/2^l) · Σ_{x=0}^{2^l−1} f(x)·ψ_j(x)

where:
    ψ_j(x) = the Walsh function
    w_j = the Walsh coefficient relating to j
    j = a binary string of length l
The summation is over all 2^l 'integer' values of x; i.e. x = 000, 001, 010, 011 etc. for
the case l = 3. The function f(x) is transformed into a set of coefficients w_j, one for
each possible bitstring j. The total number of such bitstrings and therefore the number
of Walsh coefficients is 2^l. The Walsh coefficients are sometimes also called partition
coefficients. The Walsh function ψ_j(x) is given by:
    ψ_j(x) = Π_{i=1}^{l} (−1)^(x_i·j_i)

where:
    x_i = the value of the i-th bit of x
    j_i = the value of the i-th bit of j
The Walsh function will have a value of 1 if x and j share a 1 in an even number of
positions and a value of −1 otherwise. The inverse Walsh transform is:
    f(x) = Σ_{j=0}^{2^l−1} w_j·ψ_j(x)
5.2 The Schema Theorem and the Building Block Hypothesis 71
So for the case of a three bit string: f(x) = ±w_000 ± w_001 ± w_010 ± ... ± w_111.
The Walsh-schema transform is now:
    f(H) = Σ_{j ∈ J(H)} w_j·ψ_j(β(H))
where:
    J(H) = a set generator of schema H
    β(H) = an operator that maps H to a binary string
The set generator J(H) generates a set of binary vectors from a schema H. This set is
defined by:
    J_i(H_i) = 0, if H_i = *
               *, if H_i = 0, 1

So for example J(***) = 000 = {000} and J(**1) = 00* = {000, 001}. β(H) is defined
by:

    β_i(H_i) = 0, if H_i = 0, *
               1, if H_i = 1
Using these definitions the average payoff function f(H) of a schema can be
transformed into Walsh coefficients. For example, for H = 0**: β(H) = 000,
J(H) = {000, 100}, and therefore f(0**) = w_000·ψ_000(000) + w_100·ψ_100(000) =
w_000 + w_100.
The values of the Walsh coefficients can be obtained from the problem dependent
values of the schema payoff functions f{H) by simple back-substitution. Insight into
whether a problem may be difficult to solve for a GA may be gained from observing
the Walsh coefficients. For example for a problem to be GA-deceptive, conditions
such as the following may need to hold:
    f(**1) < f(**0)
    f(*1*) < f(*0*)
    f(1**) < f(0**)
etc.
This can be translated into the following relations concerning Walsh coefficients:
    w_001 < 0
    w_010 < 0
    w_100 < 0
These relations can easily be checked once the Walsh coefficients are determined. The
Walsh-Schema transform provides an analysis into the deceptiveness of a problem.
Furthermore, contributions in schema fitnesses due to epistatic interactions between
certain bit positions can be investigated [19]. A disadvantage of the Walsh-Schema
transform is the excessive amount of computation needed in the analysis. The
non-uniform Walsh-schema transform is much better in this sense and provides a
dynamic analysis of problems for which the computation needed in a normal Walsh-
schema transform would be impractical.
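The transform pair described above can be verified numerically. A minimal Python sketch (function names are illustrative) computes the Walsh coefficients of an arbitrary function over 3-bit strings and checks that the inverse transform recovers it:

```python
def psi(j, x, l):
    # Walsh function: product over bit positions of (-1)^(x_i * j_i)
    return (-1) ** sum(((x >> i) & 1) * ((j >> i) & 1) for i in range(l))

def walsh_coefficients(f, l):
    # w_j = (1 / 2^l) * sum_x f(x) * psi_j(x)
    n = 2 ** l
    return [sum(f(x) * psi(j, x, l) for x in range(n)) / n for j in range(n)]

l = 3
f = lambda x: float(x * x)        # an arbitrary test function on 3-bit strings
w = walsh_coefficients(f, l)

# Inverse transform: f(x) = sum_j w_j * psi_j(x) recovers f exactly
for x in range(2 ** l):
    assert abs(sum(w[j] * psi(j, x, l) for j in range(2 ** l)) - f(x)) < 1e-9
```

The signs of coefficients such as w_001, w_010 and w_100 can then be read off directly, as in the deception conditions above.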
A special point of criticism of the BBH is that it is based on a static analysis of the
payoff functions of schemata, while a dynamic view would be needed to properly
explain the working of a GA. According to Grefenstette [23], this means in fact that
the BBH is false and fails in practice due to the following factors:
• the population size is always limited and there is a large variance within
schemata
This means that even in the initial random population the GA cannot estimate the true
average schema fitnesses. To illustrate this, consider the following problem which
according to the BBH is GA-easy or at least non-deceptive.
    f(x) = x^−2   if x > 0
    f(x) = 2048   if x = 0
It can be seen that for any schema H which contains the optimum string 000...0 (i.e. all
the schemata with only 0's in their fixed positions), the average schema fitness, f(H) >
2, since the sum of all its payoff functions is at least 2048 (due to the optimum string)
and the number of strings contained in H is at most 2^10 = 1024. Also for any schema
H that does not contain 000...0, the average payoff function f(H) < 1, since the sum of
all its payoff functions is at most 1024·1 = 1024. Therefore schemata containing only
0's in their fixed positions are always preferred over others and the problem is GA-
non-deceptive.
However a standard GA will find it extremely difficult to find the optimum string
000...0. Unless it was already part of the initial population or because it was
introduced by a very lucky crossover or mutation, the string will very likely not be
found. Intelligent sampling of hyperplane partitions will not lead to the discovery of
the optimal string as predicted by the BBH. This is because the variance in the best
schemata is extremely high due to this problem being a 'needle-in-the-haystack'
search. The GA cannot accurately estimate the true average payoff functions of these
schemata.
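The schema averages behind this argument are easy to reproduce. A short Python sketch, assuming the payoff function f(x) = x^−2 for x > 0 and f(0) = 2048 on 10-bit strings (names are illustrative), compares the order-1 schemata 0********* and 1*********:

```python
def f(x):
    # 'Needle in a haystack': large payoff at the optimum, tiny elsewhere
    return 2048.0 if x == 0 else x ** -2

def schema_average(fixed_bit, l=10):
    # Average payoff of the order-1 schema whose leftmost bit is fixed
    members = [x for x in range(2 ** l) if (x >> (l - 1)) == fixed_bit]
    return sum(f(x) for x in members) / len(members)

print(schema_average(0))   # > 2: this schema contains the optimum 000...0
print(schema_average(1))   # < 1: every member has payoff well below 1
```

The averages favour the all-zeros side exactly as the BBH predicts, yet almost all of the 0*********'s average comes from the single needle string, which is why the sampled estimate in a finite population is so unreliable.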
Grefenstette states that while the Schema Theorem as presented by Holland refers to
the average payoff of schemata according to the current sample in the population, the
BBH ignores this crucial feature and should therefore not be used as a fundamental
theorem for GAs. The classification of problems being GA-hard or GA-easy using the
BBH is certainly not always true as shown above.
In order to account for the effects of the choice of representation and genetic operators
the notion of a transmission function is used. A transmission function is the
probability distribution of the offspring chromosomes from every possible mating. For
the case of two parents, as used in normal crossover, the transmission function is
represented as T(i ← j,k), where i is the label for the offspring and j,k are the labels
for the parents. T(i ← j,k) represents the probability that an offspring of type i is
produced by parental types j and k resulting from the application of the genetic
operators on the representation.
The performance of a genetic algorithm is now determined by the relation between the
transmission function and the fitness function. Price's theorem is used to analyse the
dynamic behaviour of the fitness distribution over one generation. This is shown to
depend on the covariance between parent and offspring fitness distributions and a so
called 'search-bias' which indicates how much better the effect of a genetic operator
on the current population is than pure random search.
Using the search-bias a quantitative notion can be given to the idea that the
transmission function should find a balance between exploring the search space and
exploiting the current population. It is still very hard to actually use the theorem in
practice in order to analyse or optimise a genetic algorithm, but enhanced ease of use
may be expected in future.
Even in the case of small scale models very useful insights can be gained such as the
concept of genetic drift and the effect of preferential selection on the population.
In [22] a genetic algorithm was modelled that had binary chromosomes of length one;
i.e. a 'single-locus genome'. In other words, the only individuals possible are '0' and
'1'. A population of size N now gives a total of (N + 1) possible states, whereby the
location of a chromosome in the population is of no concern. For example a
population of size two has states '00', '01', and '11'. State i is referred to as the state
with exactly i ones and (N − i) zeros. The operation of the genetic algorithm is now
defined by an (N + 1) × (N + 1) transition matrix P[i,j] that maps the current state i
to the next state j. The probability of a transition from state i to state j is given by one
entry in the matrix: p(i,j). Figure 5.1 visualises these terms for a simple single-locus
genetic algorithm with population size 2. In general with a chromosome length l, the
number of possible binary chromosomes is 2^l and the number of states is
(N + 2^l − 1)! / (N!·(2^l − 1)!). For any realistic genetic algorithm the transition
matrix becomes of unmanageable size.
When simple Roulette wheel selection is the only genetic operator in the system (i.e.
no mutation and crossover), the transition matrix can be generated quite easily. It can
be used to examine the influence of selection pressure alone on the system. f_1 is
defined as the fitness of an individual '1', and f_0 the fitness of a '0'. The probability
of choosing a '1' for the next population is simply p_1 = f_1 / Σf. When the number of
ones in the current population is given by i, p_1 can be expressed as:

    p_1 = i·f_1 / (i·f_1 + (N−i)·f_0) = i·r / (i·r + (N−i))

with r = f_1/f_0 the fitness ratio.
[Figure 5.1: states and transitions of the single-locus GA with population size 2,
with selection probabilities p_1 = i·r/(i·r + (N−i)) and p_0 = (N−i)/(i·r + (N−i)).]
The probability of transition from a state with i ones to a state with j ones is now:

    P(i,j) = C(N,j)·(p_1)^j·(p_0)^(N−j)
           = C(N,j)·(i·r/(i·r + (N−i)))^j·((N−i)/(i·r + (N−i)))^(N−j)

where C(N,j) denotes the binomial coefficient.
This equation defines the complete (N + 1) × (N + 1) matrix P[i,j]. With r = 1 both
'1' and '0' individuals have an equal fitness value. There is no preference for a state
(i.e. a population) with all-ones or a state with all-zeros, and the equation reduces to
the one for pure genetic drift. This genetic drift causes the simple GA to always
converge to a uniform population, in this case to a state with all-ones or one with all-
zeros; i.e. i ∈ {0, N}. These two states are absorbing states, meaning that once the
system is in such a state it will always stay there. In other words, the transition
probability from such a state to itself is one (p(i,i) = 1) and zero to all other states
(p(i,j) = 0, i ≠ j). Absorption time is defined as the expected number of generations until
the genetic algorithm finds itself in one of the absorbing states. Absorption time
depends linearly on the population size, N [22].
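The transition matrix and its absorbing states can be constructed directly from the formula for P(i,j). A small Python sketch, assuming r = f_1/f_0 as above (names are illustrative):

```python
from math import comb

def transition_matrix(N, r):
    # P[i][j]: probability that a population with i ones yields one with j ones
    # under Roulette wheel selection only; r = f_1 / f_0 is the fitness ratio.
    P = [[0.0] * (N + 1) for _ in range(N + 1)]
    for i in range(N + 1):
        p1 = i * r / (i * r + (N - i))      # probability of selecting a '1'
        for j in range(N + 1):
            P[i][j] = comb(N, j) * p1 ** j * (1.0 - p1) ** (N - j)
    return P

P = transition_matrix(N=2, r=1.0)           # r = 1: pure genetic drift
print(P[0][0], P[2][2])                     # absorbing states: both 1.0
```

Every row sums to one, and the all-zeros and all-ones states map to themselves with probability one, exactly as described above.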
In [31] the above model is extended to include niched genetic algorithms based on
fitness sharing (see section 3.1.8). It is reported that in the case of 'perfect sharing',
where the niches do not overlap, the effect of fitness sharing ('niching force') balances
exactly the effect of selection/drift. The niching force is a stabilising one in that it tries
to spread the population out evenly over the niches, as opposed to the effect of
selection/drift. When overlapping niches are examined, as is often the case, it is found
that the niching force dominates for small overlaps but for larger overlaps its influence
decreased. As could be expected the absorption time is significantly larger than for the
simple GA without niching and grows exponentially with the population size.
In [12] the transient behaviour of GAs with a finite population size is modelled using
Markov chains. The concept of the state transition probability matrix P is extended to
a k-step transition matrix P^k. Using P^k, analysis is done on the expected transient
behaviour of simple GAs. Questions like: "What is the probability that the GA
population will contain a copy of the optimum at generation k?" can be answered using
this approach using relatively simple function optimisation problems. Also, expected
waiting time analysis can be performed to answer questions like: "For how many
generations does the GA have to run on average before first encountering the
optimum?" The effects of crossover, mutation and fitness scaling can be seen in the
expected waiting time analysis and useful insights are gained. Future work in this
approach needs to concentrate on scaling up to problems of more realistic size. Also
visualisation techniques to display for example transition matrices are expected to be
of much help in gaining insight into the operation of GAs.
6. Implementing GAs
Although highly problem dependent, some general remarks can be made on what the
options are concerning coding (representation) and genetic operators for a GA system
and how they will affect the performance. A brief overview of the most commonly
used GA settings is given. Aside from coding the most crucial part of the set-up of a
GA is the fitness or evaluation function. General remarks concerning the fitness
function are given as well, but first some general comments are made about the
performance of a GA.
6.1 GA Performance
This section describes some of the main problems found while implementing genetic
algorithms.
• Premature Convergence
A common problem in GAs is premature convergence of the population to a local
optimum. This happens when a super-fit individual, representing a sub-optimal
solution, is chosen to reproduce many times and takes over the population in a few
generations. After this, the only way the GA can overcome this local optimum is by
the (re)introduction of new genetic material by means of mutation. This process is
then just a slow random search.
• Epistasis
Problems that are difficult to solve for a GA can generally be classified as problems
with high epistasis. The level of epistasis in a certain problem-coding combination
reflects the dependence of gene expression in the phenotype on the values of other
genes. With epistasis, a specific variation in a gene produces a change in the fitness of
the chromosome depending on the values of other genes. The level of gene interaction
measures the extent to which the contribution to the fitness of a single gene depends
on the values of other genes in the chromosome. In the absence of epistasis a
particular change in a gene always produces the same change in the fitness of the
chromosome. As an example of such a problem consider the case of a binary string
where the fitness is simply equal to the number of ones in the chromosome:
    f(x) = Σ_{i=1}^{l} a_i ,    x = (a_1, ..., a_l),  a_i ∈ {0,1}
There is no interaction between the genes at all. The fitness function is a composite of
the contributions of each gene.
A medium level of gene interaction can be defined where the problem is such that a
particular change in a gene always produces a change in the fitness function of the
same sign (or zero). In this case, the change in fitness depends on the values of other
genes. An example of such a problem is one where using binary coding the fitness is
one if all genes are one, and zero otherwise:

    f(x) = Π_{i=1}^{l} a_i
The two obvious ways to tackle problems that have high epistasis are: design a coding
such that the problem becomes one with low or no epistasis, or: design the genetic
operators (crossover and mutation) so that the GA will have no problem with epistasis.
In effect this means that some prior knowledge about the objective function has to be
built into the GA system. Although it can be shown that in theory any high epistasis
problem can be reduced to one with low or even no epistasis, in practice this is very
hard to do. The effort needed to accomplish this may be greater than is needed to
actually solve the problem.
• Genetic Hill-Climbing
While in theory a genetic algorithm performs a global search of the solution space, in
some implementations the search is not as global as most theory would suggest. For
example in a Steady State GA with ranked replacement (a 'Genitor-type' GA) and a
relatively small population size, the search is often centred around the single fittest
individual. This is due to the very high selective force on above average individuals in
this type of GA. The Genitor-type GA 'pushes' very hard, often becoming stuck in
local optima. The performance of this GA with relatively small population sizes is
sometimes found to be largely independent of the population size, see e.g. [64] where
good solutions to a neural network weight optimisation problem were found even with
a population of size 5. Instead of intelligent hyperplane sampling, the GA basically
performs a local search around the fittest individual. It is said to perform genetic hill-
climbing. This does not necessarily mean that the algorithm does not function well; it
may still outperform conventional hill-climbing algorithms. However, the foundational
work of GAs will have to be extended to include the phenomenon of genetic hill-
climbing. GAs working as a genetic hill-climber are commonly found to require a
relatively high level of mutation for a good performance. The GA is in a way always
in a state of premature convergence and strong mutation is simply needed to make a
transition to another better state.
meaningless, they may contain information that is crucial for the production of highly
fit meaningful chromosomes.
Some researchers say that the fitness function should be smooth and regular and
chromosomes that are close in the representation space should have similar fitness
values. If this were the case however, a simple hillclimbing algorithm could always
find the optimum and a genetic algorithm would not be needed. In practice the fitness
function will typically contain many local optima making it hard for a hillclimbing
algorithm to find the global optimum. However a fitness function may be constructed
that minimises the effect of local optima and thereby enhances the performance of a
GA.
6.3 Coding
In the early days of the field of genetic algorithms researchers practically always used
a binary coding scheme following the example of Holland. Since then many variations
have been used such as real-valued and symbolic coding.
A simple analysis of the Schema Theorem seems to suggest that alphabets of low
cardinality (small alphabets) yield the highest rate of schema processing: the amount
of implicit parallelism is the highest. This is because the number of schemata available
for genetic processing is the highest when low cardinality alphabets are used. Since
genetic algorithms are run on computers, all genetic information is ultimately stored as
bits. Supposing we have an alphabet of cardinality k, then each position in the
chromosome can represent k+1 schemata. Since each position represents log₂k bits,
the number of schemata per bit of information, n_s, is:

    n_s = (k + 1)^(1/log₂k)
Therefore it is easy to prove [20] that chromosomes coded with smaller alphabets
represent a larger number of schemata than ones with larger alphabets. Because of this
apparent higher rate of schema processing of small alphabets traditionally only binary-
coded chromosomes were used. However, GAs using codings with high cardinality
alphabets (large alphabets) such as real-valued codings have been shown to work well
in certain applications. Goldberg [20] reconciles this with the Schema Theorem as
described in a following section.
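The schema-counting argument can be checked with a few lines of Python (a quick sketch; the function name is illustrative):

```python
from math import log2

def schemata_per_bit(k):
    # n_s = (k + 1) ** (1 / log2(k)): schemata represented per bit of storage
    return (k + 1) ** (1.0 / log2(k))

for k in (2, 4, 8, 16):
    print(k, round(schemata_per_bit(k), 3))
# The binary alphabet (k = 2) gives the maximum: 3 schemata per bit.
```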
Intuitively, the coding should be such that the representation and the problem space
are close together; i.e. a 'natural' representation of the problem. This also allows the
incorporation of knowledge about the problem domain into the GA system in the form
of special genetic operators. The GA can then be made more problem specific and
achieve better performance. One concept of particular importance is that points that
are close together in the problem space should also be close together in the
representation space.
However the choice of which coding to use is usually not straightforward. Quite often
the problem to be solved is integer or real-valued and genes can be chosen to be
binary or real-valued. An example of a binary coding for a real-valued problem is the
following. Suppose the problem is to optimise a function f(x_1, x_2, x_3) that takes
real-valued arguments: x_1, x_2, x_3 ∈ [0,1]. A chromosome has to represent these
three arguments. Each argument can for example be represented by 8 bits, making the
chromosome a binary-valued string of length 24. In 'standard' binary coding the
substring 00000000 will correspond to the value 0.0, 00000001 to 1/256 = 0.0039, ...,
and 11111111 to 1.0. An example of such a binary-coded chromosome is the string
000000001111111110000000.
Some authors use the term gene for a substring of 8 bits representing a single real-
valued argument. This is not very appropriate since a gene in GA-theory is considered
to be an unchangeable piece of data. In this example the standard GA will work on the
bitstring without any knowledge of the actual representation within it. A gene simply
corresponds to a single bit.
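The decoding step described in this example can be sketched in Python as follows. The mapping used here sends 00000000 to 0.0 and 11111111 to exactly 1.0, i.e. steps of 1/255 ≈ 0.0039; the function name is illustrative:

```python
def decode(chromosome, n_args=3, bits=8):
    # Split the bitstring into n_args substrings of 'bits' bits each and map
    # every substring linearly onto [0, 1] (all-zeros -> 0.0, all-ones -> 1.0).
    assert len(chromosome) == n_args * bits
    return [int(chromosome[k * bits:(k + 1) * bits], 2) / (2 ** bits - 1)
            for k in range(n_args)]

print(decode("000000001111111110000000"))   # [0.0, 1.0, ~0.502]
```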
As can be seen in this example the effect on the phenotype of changing a single bit in
the chromosome depends on its position within the string. When the right-most bit of
an 8 bit substring is changed the effect is very small, but when the left-most bit is
changed the effect is relatively quite large.
In this coding, the Hamming distance (the number of different bit positions) between
two individuals does not reflect the distance between the two in the problem space.
This makes it very difficult for the mutation operator to make small changes in the
values represented in the chromosome and is the reason why Gray coding is often used
instead of 'normal' binary coding to code real values. In Gray coding adjacent real
values (or integer values) differ from each other in only one bit position. Going
through the real values represented by a Gray coding from low to high only requires
flipping one bit at a time. GAs using Gray coding are often found to perform better
than ones with standard binary coding when solving real-valued problems.
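Binary/Gray conversion is a pair of short bit tricks; the sketch below (function names illustrative) also checks the defining property that adjacent integers differ in exactly one Gray bit:

```python
def binary_to_gray(b):
    # Gray code: XOR the number with itself shifted right by one
    return b ^ (b >> 1)

def gray_to_binary(g):
    # Inverse: XOR together all right-shifts of the Gray value
    b = 0
    while g:
        b ^= g
        g >>= 1
    return b

# Adjacent integers differ in exactly one bit of their Gray representation:
for v in range(255):
    diff = binary_to_gray(v) ^ binary_to_gray(v + 1)
    assert diff != 0 and diff & (diff - 1) == 0     # exactly one bit set

assert all(gray_to_binary(binary_to_gray(v)) == v for v in range(256))
```

Under Gray coding a single mutation can therefore always move a represented value to an adjacent one, which is not true of standard binary coding.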
In [20] Goldberg gives an explanation for the success and failure of real-valued
codings based on the Schema Theorem by suggesting that the GA breaks the original
coding into 'virtual alphabets' of higher cardinality. The real-valued GA still has a
high rate of implicit parallelism.
In symbolic coding the genes take their values from an alphabet of symbols:

    x = (a_1, ..., a_l),  a_i ∈ {a, b, c, ...}
Often the genes are simply implemented as unsigned integer values taken from a
certain range. The main characteristic of symbolic coding is that there is no measure
of distance between two symbols. For example symbols that are 'adjacent' in the
alphabet are not considered to be closer to each other than any other two symbols.
Homogeneous coding is the special case where A_i = A_j for all i,j. Alternatively, a
chromosome could consist of a part that is binary coded (A = {0,1}) and a part that
uses real-valued coding (A = ℝ):

    x = (a_1, ..., a_m, a_{m+1}, ..., a_l),  a_i ∈ {0,1} for 1 ≤ i ≤ m,  a_i ∈ ℝ for m < i ≤ l
This type of coding poses extra constraints on the genetic operators. The normal
crossover operator can still be applied even when a substring is swapped that contains
more than one coding because it leaves the genes intact. The mutation operator has to
be changed in that it re-initialises a gene depending on the coding of that gene since
the alphabets of the different codings will not be the same.
One way to overcome the problems mentioned in section 6.1 is to use a fitness
remapping scheme. Before individuals are selected using proportionate reproduction,
their raw fitnesses are remapped to new values. Two of such techniques are described
below.
• Fitness Scaling
Instead of using the actual fitness values in the selection mechanism, the fitness values
of all individuals are scaled to a certain range. This is commonly done by first
subtracting a fixed value from the fitnesses. These fitnesses are then divided by their
average value to produce the adjusted fitnesses. When fitness scaling is used the
amount of relative selective pressure on an individual can be controlled. Very fit
individuals no longer produce excessive numbers of offspring. There is a price to be
paid however. When there is a single super-fit (or super-unfit) individual in the
population, fitness scaling leads to overcompression. When just one individual has a
fitness many times higher than any other, fitness scaling will result in flattening out the
fitness distribution of the rest of the individuals. They will obtain near identical fitness
values, and the difference in selective pressures between them will almost be lost.
Performance suffers if there are extremely valued individuals.
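A minimal sketch of the scaling scheme just described, in Python (the clipping of negative shifted values to zero is an assumption here, as is the function name):

```python
def scale_fitnesses(raw, offset):
    # Subtract a fixed offset (clipping at zero, an assumption), then divide
    # by the average so the scaled fitnesses have mean 1.
    shifted = [max(f - offset, 0.0) for f in raw]
    mean = sum(shifted) / len(shifted)
    return [s / mean for s in shifted]

# One super-fit individual flattens out the rest ('overcompression'):
print(scale_fitnesses([100.0, 2.0, 1.5, 1.0], offset=1.0))
```

With the single value 100.0 present, the remaining three individuals end up with near-identical scaled fitnesses, which is the overcompression effect described above.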
• Fitness ranking
The dominance of extreme individuals may be overcome by fitness ranking. Here
individuals are ranked based on their fitness and then the new reproductive fitness
values are given to them based solely on their rank, usually using a linear function
(linear ranking). Similar to fitness scaling, fitness ranking ensures that the ratio of
maximum to average fitness is fixed. However it also spreads out the remapped
fitnesses evenly over the interval. The problem of overcompression is gone. It no
longer matters whether the fittest individual is extremely fit or only just fitter than its
nearest competitor. By means of the ranking function the selective pressure of
individuals relative to each other can be controlled. Non-linear ranking may also be
used where the ranking function is such that the remapped fitness of an individual is
for example an exponential function of its rank (exponential ranking).
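Linear ranking can be sketched as follows; the parameterisation with a selection pressure s ∈ [1, 2] is one common (Baker-style) formulation and is an assumption here, as are the names:

```python
def linear_ranking(raw, s=1.5):
    # Remap fitness by rank alone: worst gets 2 - s, best gets s (1 <= s <= 2)
    n = len(raw)
    order = sorted(range(n), key=lambda i: raw[i])      # worst ... best
    remapped = [0.0] * n
    for rank, i in enumerate(order):
        remapped[i] = (2.0 - s) + 2.0 * (s - 1.0) * rank / (n - 1)
    return remapped

# Only the ordering matters, not the raw magnitudes:
print(linear_ranking([1000.0, 3.0, 2.0, 1.0]))
print(linear_ranking([4.0, 3.0, 2.0, 1.0]))             # identical result
```

An extreme raw fitness of 1000.0 receives the same remapped value as a raw fitness of 4.0, so overcompression cannot occur.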
individual is very high, where the growth rate is defined as the proportion the
individual takes up in the mating pool relative to the proportion it takes up in the
current population. A normal GA can also gain high growth rates when an appropriate
selection scheme is used (such as non-linear ranking), and it is suggested that it should
then show similar behaviour to the Genitor-type GA.
In the standard GA, crossover and mutation operators work with fixed rates. Lately
much work is being done on adaptive rates for these operators. Concepts being
investigated include coding the values of the rates in the chromosomes to let the GA
find optimum values or using a diversity measure of the population to control the
rates. For example when the diversity is very low, new genetic information can be
introduced by setting the mutation operator temporarily to a high value. Yet another
idea comes from the field of simulated annealing where the rates of the operators are
controlled using a 'cooling scheme'.
6.5.1 Crossover
The standard one-point, two-point and uniform crossover operators were described in
section 3.1.3 for binary-valued chromosomes. They can be used in the same form for
any coding. These crossover operators swap entire genes or series of genes between
individuals and therefore can never change the value of a gene.
6.5.2 Mutation
The normal mutation operator as described in section 3.1.3 for binary coding is easily
extended to any representation. With probability pm it will re-initialise the value of a
gene. The set of possible gene values will usually be the same as that used for
initialising; i.e. the alphabet.
When the chromosomes consist of real-valued genes, another form of the mutation
operator may be used. Instead of re-initialising the value of the gene, a small randomly
selected value (usually Gaussian) is added to it. This version of the mutation operator
is called creeping mutation. It can be seen as a local search mechanism within the GA
and can operate simultaneously with the normal mutation operator. When creeping
mutation is the only mutation operator used in the GA, it is empirically found that the
mutation rate should be much higher than usual and mutation rates of up to 0.1 have
been used. With creeping mutation, gene values can be obtained that lie outside the
range of the initial population. If the genes are restricted to a certain range (a_i ∈
[min, max]) then the creeping mutation operator can simply be altered so that it cannot
take a gene value beyond that range.
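A sketch of creeping mutation with the range restriction just described; the rate, spread and bounds below are illustrative choices, not values taken from the text:

```python
import random

def creeping_mutation(genes, pm=0.1, sigma=0.05, bounds=(-1.0, 1.0), rng=None):
    """With probability pm per gene, add a small Gaussian value to it,
    clipping the result so it can never leave [min, max]."""
    rng = rng or random.Random()
    lo, hi = bounds
    return [min(hi, max(lo, a + rng.gauss(0.0, sigma))) if rng.random() < pm else a
            for a in genes]
```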
6.5.3 Inversion
While not part of the standard GA toolkit, inversion is often added to a GA system to
operate alongside crossover and mutation. It is called upon in a similar way as
crossover and mutation and operates on a single chromosome. A chromosome is
inverted with probability pi. Inversion randomly picks two points in the chromosome
and inverts the order of the substring between these points. The 'meaning' of the
chromosome remains the same however. The only thing that is changed is the order of
the coding. Inversion requires that genes carry labels; for example a gene a_i^j
represents the j-th gene having a value a_i. The order in which the genes appear in the
chromosome does not have to match the labels. Inversion is illustrated below: the
chromosome (a_1^1 a_2^2 a_3^3 a_4^4 a_5^5) may, after inverting the substring between
the second and the fourth gene, become (a_1^1 a_4^4 a_3^3 a_2^2 a_5^5).
Both chromosomes before and after inversion code exactly the same information and
represent the same phenotype; only the order of the genes is changed. The order of the
genes can play an important factor in the GA performance. The building block
hypothesis for example requires that related genes should be close together on the
chromosome in order for building blocks to form. Inversion is an operator that
changes the order of the genes and can therefore improve GA performance in some
circumstances.
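The effect of inversion on labelled genes can be sketched as follows; representing a gene as a (label, value) pair is an illustrative choice:

```python
def invert(chromosome, i, j):
    """Reverse the order of the genes between positions i and j."""
    return chromosome[:i] + list(reversed(chromosome[i:j])) + chromosome[j:]

def decode(chromosome):
    """The phenotype depends only on the label -> value mapping,
    not on the order in which the labelled genes appear."""
    return dict(chromosome)
```

Inverting a substring changes the coding order, and hence which genes sit close together for crossover, while decode() returns the same phenotype before and after.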
7. Hybridisation of Evolutionary
Computation and Neural Networks
Evolutionary Computation (EC) can be used in the field of neural networks in several
ways. It can be used:
• to optimise the weights of a neural network
• to analyse a neural network
• to generate the architecture of a neural network
• to generate both the architecture and the weights of a neural network
Each of these approaches is briefly described below; also see [62]. The last two of
these are dealt with together as many remarks concern both and the distinction is often
subtle.
The members of the population are the weights of the network which are coded as
strings. When real valued weights are used, they are often coded into a binary string
using a binary or a Gray coding mechanism, although real-valued coding is also
possible. The fitness measure is normally calculated as the performance error of the
network on the training data. The genetic algorithm can then be classified as a
supervised learning algorithm.
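A common way to code a real-valued weight into a binary string is to quantise it and apply a Gray code, so that adjacent weight values differ in only one bit. A sketch of the idea; the 8-bit resolution and the [-1, 1] weight range are assumptions:

```python
def to_gray(n):
    return n ^ (n >> 1)

def from_gray(g):
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n

def encode_weight(w, bits=8, lo=-1.0, hi=1.0):
    """Quantise a weight in [lo, hi] to an integer level and Gray-code it."""
    levels = (1 << bits) - 1
    return to_gray(round((w - lo) / (hi - lo) * levels))

def decode_weight(g, bits=8, lo=-1.0, hi=1.0):
    """Invert the Gray code and map the integer level back to [lo, hi]."""
    levels = (1 << bits) - 1
    return lo + from_gray(g) / levels * (hi - lo)
```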
In [44] a GA is used to evolve ecological neural networks that can adapt to their
changing environment. This is achieved by letting the fitness function, which in this
case is specific to each individual, co-evolve with the weights of the network. A
special feature of this research is that there is no reinforcement for 'good' behaviour
of the network; the network just tries to model or adapt to the world in which it lives.
The system can be classified as open-ended evolution.
De Garis [18] uses a method which is based on fully self-connected neural network
modules. It is shown that using this approach a network can be taught a task even
though the time-dependent input varies so fast that the network never settles down.
The system does not use a crossover operator (it could therefore be called
evolutionary programming) and is used to teach a pair of sticks to walk.
In [28] and [52], a genetic algorithm is used on a fixed three layer feedforward
network to find the optimal mapping from the input to the hidden layer (i.e. the set of
optimal hidden targets). In the evaluation phase, the weights from the hidden to the
output layer are learned using a simple supervised gradient descent rule. The search
space is not the weight space but the hidden target space. It is suggested in [28] that
the hidden target space might have more optima than the weight space and that finding
the optimum will therefore be easier.
In [49], [50] and [64], instead of binary coded chromosomes, chromosomes with real-
valued genes are used. Satisfactory results are reported using a Genitor type Steady
State Genetic Algorithm with a relatively small population size of 50. Instead of the
normal mutation operator, creeping mutation is used where a small random value is
added to the gene. In [49] several special genetic operators are investigated such as a
crossover operator that swaps groups of weights corresponding to a neuron. This
specific operator did not show any obvious improvement. Experiments with
decreasing population size strongly suggest that genetic hill-climbing (see section
6.1.1) is the main search mechanism in these implementations. Even with a population
size as small as 5, good results were obtained. The genetic algorithm is said to
outperform back propagation on certain problems that require a neural network with
over 300 connections.
7.2 Evolutionary Computing to Analyse a NN
As the chromosomes usually do not contain information concerning the weights of the
network, these have to be set to an initial (random) value. After a network is trained it
is evaluated on its performance, which is reflected in the fitness measure. The
performance measure can simply be the overall error on the training data, but often
reflects other properties such as network size as well. Instead of testing the network on
the training data it can be tested on (real) test data as well. In a real-life application
however the actual test data will not be available until the network is used for its task
(otherwise it might as well be included in the training data). The training set can be
divided into two parts though, one part serving as training data for the training module
and the other part used in the evaluation phase as a test of the generalisation
performance of the network.
However the EC system as in Figure 7.1 in theory can act as a real-time system where
networks are evaluated on real data taken from the environment and the end-user
simply uses the best individual found so far. The EC system is run continuously and
when a better individual is found, then the one currently in use is replaced. This is an
evolutionary adaptive system operating in a non-stationary environment, pictured in
Figure 7.2. It will not be of any use in a stationary environment because once the
optimum network is found the EC system becomes redundant.
7.3 Evolutionary Computing to Optimise a NN Architecture and its Weights
The network as used by the end-user, of course, operates in the same non-stationary
environment. In practice such a system would be very hard to implement. The EC
system for example must have some way of testing the networks it generates on the
task they are going to be used for by the end-user in order to determine their fitness
values. Thus a model of the task dictated by the end-user has to be built into the EC
system. Furthermore special care has to be taken in implementing the specific EC
system to avoid premature convergence. Also the information that is fed into the EC
system from the environment has to be chosen carefully since, if it is only based on the
present situation, useful information that was observed in the past may be lost. As
described in section 4.4.1 there is evidence suggesting a very similar process as
pictured in Figure 7.2 might be at work in the human brain. The brain makes a model
of the outside world and ideas are generated and tested according to a Darwinian
process resulting in the fittest one actually being used.
The performance of a neural network usually depends on the values of the initial
weights. Therefore the networks should be trained several times using different
random initial weights each time and the results be averaged, in order to get a good
performance measure. This can cause the approach to become very slow, see e.g. [4].
In some applications the generation of the network architecture is done simultaneously
with the learning of the weights. The chromosomes not only code the architecture of
the network but they also code the values of the weights.
There are several ways to encode a neural network architecture as a chromosome that
can be used by an evolutionary computation algorithm. These methods can be divided
into the following approaches:
• direct encoding
• parametrised encoding
• grammar encoding
These methods as well as their applications are described below in more detail.
An alternative approach is to use a genetic algorithm where the topology and weights
are encoded as variable-length binary strings [46]. In [11] a structured GA is used that
simultaneously optimises the neural network topology and the values of the weights. A
two-level genetic structure represented by a single binary string is used where one
level defines the connectivity (topology) and the other the values of the weights and
biases. It was found that although the algorithm worked well on small problems like
XOR, it could not scale up properly to bigger real-world problems.
In [5] feedforward neural networks are generated with a GA, using a direct encoding
scheme where every gene in a chromosome represents a connection between two
neurons. The problem of competing conventions is tackled here by introducing
connection specific distance coefficients in the genetic material. For each functional
mapping or phenotype, the structural mapping or genotype with the shortest amount of
connection lengths is preferred. This approach is also known as 'restrictive mating' [1]
and is one of the niching methods described in section 3.1.8. In this way, some of the
Jacob and Rehder [32] use a grammar-based genetic system, where topology creation,
neuron functionality creation (e.g. activation functions) and weight creation are split
up into three different modules, each using a separate GA. The modules are linked to
each other by passing on a fitness measure. The grammar used is such that a neural
network topology is represented as a string consisting of all the existing paths from
input to output neurons. This is not a grammar encoding method such as the ones
described in section 7.3.3 as no grammar rewriting or production rules are encoded.
In [26] Happel and Murre report an approach where modular neural networks are
generated using a direct encoding scheme. The system implements modularity, where
modularity is meant as the grouping of certain neurons in the network into a module.
When such a module of neurons is connected to another module, all the neurons in the
two modules are connected to each other. An advantage of using modular neural
networks is that the weight space of the network is reduced. This has a positive effect
on both the generalisation capability and the time needed to learn the network. The
networks used are made up of so called CALM modules and are used for unsupervised
categorisation.
Kitano [38], [39] uses a GA-based matrix grammar approach where chromosomes
code grammar rewriting rules that can be used to build the connectivity matrix. A rule
in this grammar rewrites a single character into a 2x2 matrix. After a certain number
Gruau [24], [25] uses a graph grammar system called Cellular Encoding. The graph
grammar rules work directly on the neurons (called cells) and their connections and
include various kinds of cell divisions and connection pruning rules. The grammar
rules are coded in a tree structure, and a genetic programming system is used. The
values of the binary weights and of neuron bias values can be coded into the
chromosomes as well. For some problems the generated boolean networks are further
trained using a cross between back propagation and the so-called bucket brigade
algorithm. The approach can generate networks that are highly modular, where
modularity is defined as follows: consider a network N1 that includes, at many
different places, a copy of the same subnetwork N2. An encoding scheme is modular if
the code of N1 includes the code of N2 only a single time. Experiments show that the
system can be used to generate modular boolean neural networks of large size. This
approach is therefore especially useful when the problem to be solved shows a great
deal of modularity in the repetitive use of functional groups.
Boers and Kuiper [4] use a graph grammar system based on a class of fractals called
L-systems. The chromosomes used in the genetic algorithm code the production rules
of this grammar. The system generates modular feedforward neural networks where
modularity is evident in the grouping together of neurons in a module (see also the
description of Happel and Murre's work in section 7.3.1). The networks generated are
again trained using a back propagation algorithm. Drawbacks include the need for a
repair mechanism because of possible faulty strings and the extremely long
converging times. The method does not scale up well to larger problems.
In [8] and [55] a quite different approach is presented. Neural networks are viewed as
physical objects in a two-dimensional space and are represented by a single cell ([8])
and various parameters concerning the growth process of the cell and a set of rules for
cell reproduction. The translation from genotype to phenotype is a complex one where
the final network is generated from the single starting cell by means of axonal growth
and branching as well as cell division and migration. The neural network 'grows' out
of the starting cell(s). This interpretation function comes a lot closer to the
developmental process found in nature. Successive phases of functional differentiation
and specialisation can be observed in the development. Mutations are introduced in
the development and it is observed that changes in the phenotype due to these
mutations depend largely on what stage in the development they occur. The neural
networks are used to model organisms living in a two-dimensional world in which
they can move in the search for food and water.
8. Using Genetic Programming to
Generate Neural Networks
In this chapter, we discuss the use of a genetic programming algorithm using a direct
encoding scheme; also see [63]. This work is mainly based on [40] and [41], where a
LISP program was used to implement the algorithm and good results were reported
when GP was applied to generate a neural network that could perform the one-bit
adder task. A
complete neural network, i.e. its topology as well as its weights, is coded as a tree
structure and is optimised in the algorithm.
A public domain genetic programming system called GPC++, version 0.40, has been
used [17]. This software package was written in C++ by Adam P. Fraser, University of
Salford, UK, and several alterations were made to use it for the application to neural
network design. The GPC++ system uses Steady State Genetic Programming (SSGP)
as discussed in section 3.2. The probability of crossover, pc, is always 1.0; the new
population is constructed using the crossover operator, after which mutation is
performed. The crossover operator swaps randomly-picked branches between two
parents, but creates only one offspring for each pair. There is no notion of age in the
SSGP system, which means that after a new member is made, it can be chosen
immediately afterwards to create a new offspring.
8.1 Set-up
The technique applied in [40] and [41] was used, where a neural network is
represented by a connected tree structure of functions and terminals. Both the
topology and the values of the weights are defined within this structure, and no
distinction is made between the learning of the network-topology and its weights.
The terminal set is made up of the data inputs to the network (D), and a random
floating-point constant atom (R). This atom is the source of all the numerical constants
in the network and these constants are used to represent the values of the weights. So:

T = {D, R}
The neural networks generated by this algorithm are of the feed-forward kind. The
terminal set T for a two-input neural network is for example T = {D0, D1, R}.
After some initial experimentation it was found that, for the problems under
investigation, the system performed much better if the arithmetic functions were not
used. So:
F = {P, W}
The values of the weights are represented by a single random constant atom and their
values can only be changed by a one-point crossover or mutation performed on this
constant atom.
The graphical representation and the corresponding neural network are shown in
Figure 8.1. The condensation of the W-P tree initially drawn from the chromosome
into a fully connected feedforward network is illustrated in two stages.
• the root of the genetic tree must be a "list" function (L) of all the outputs of the
network
• the function below a list function must be the Processing (P) function
• the function below a P function must be the Weight (W) function
• below a W function, one of the functions/terminals must be chosen from the set
{P, D}; the other one must be R
These creation rules make sure that the created tree represents a viable neural
network. The root of the tree is a list function of all its outputs while the leaves are
either a data signal (D) or a numerical constant (R). This tree can then be translated
into a neural network structure as in Figure 8.1.
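The creation rules can be made concrete with a small interpreter for such trees. This is an illustrative sketch, not the GPC++ implementation; nodes are nested tuples, and the assumption that the P function is a threshold unit firing when its weighted input sum reaches 1.0 is ours:

```python
def evaluate(node, inputs):
    """Evaluate a genetic tree built from the creation rules:
    ('L', out1, ...) root list of outputs, ('P', w1, w2) processing
    neuron, ('W', value, source) weighted connection, ('D', i) data input."""
    kind = node[0]
    if kind == 'L':                                  # list of network outputs
        return [evaluate(child, inputs) for child in node[1:]]
    if kind == 'P':                                  # neuron: threshold assumed at 1.0
        total = sum(evaluate(child, inputs) for child in node[1:])
        return 1.0 if total >= 1.0 else 0.0
    if kind == 'W':                                  # weight times its P or D source
        return node[1] * evaluate(node[2], inputs)
    if kind == 'D':                                  # data input terminal
        return inputs[node[1]]
    raise ValueError('unknown node: %r' % kind)
```

For example, the tree ('L', ('P', ('W', 1.0, ('D', 0)), ('W', 1.0, ('D', 1)))) behaves, under this assumed threshold, as a single neuron computing the OR of two binary inputs.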
So, for example, a branch whose root (the crossover point) is a P function can never
be swapped with a branch whose root is a W function. Were this allowed, the creation
rules described above would be violated and the genetic tree could no longer be
translated into a neural network. In [41], P functions and D terminals are treated as
being of different types, which means a branch whose root is a P function can never be
replaced by a D terminal and vice versa.
Figure 8.2 A simple '2-2-2' feedforward neural network (left). The GPNN system
needs two separate sub-trees to represent this network (right).
Figure 8.3 Representation of the '2-2-2' network in GPNN with two ADFs.
The two ADFs have P functions as their roots and have two arguments each: ARGO
and ARG1. In the example these arguments are instantiated with the data inputs DO
and Dl but instead of data inputs, the output value of some P function or even another
ADF function can also be used. The problem with a representation of this kind is that
if every sub-network that is called upon more than once is represented by an ADF, the
number of these ADFs can become very large. This number normally needs to be set
by the user a-priori. Another problem is that the number of arguments of each ADF,
just as for every other function, needs to be specified in advance. However, extensions
to the standard GP system have recently been made by Koza allowing the system to
automatically build ADFs when it needs them.
The performance error E(x) of a network is the cumulative squared error over all N
training patterns and M outputs:

E(x) = Σ_{i=1}^{N} Σ_{j=1}^{M} (T_ij − O_ij)²

Since a lower error must correspond to a higher fitness, the fitness of a chromosome x
is then calculated as:

f(x) = E_max − E(x)
The maximum performance error, Emax, is a constant value equal to the maximum
error possible, so that a network that has the worst performance possible on a given
training set (maximum error) will have a fitness equal to zero. When a threshold
function is used as the neurons' processing function, only output values of '0' or '1'
are possible. The range of fitness values is then very limited and it is impossible to
distinguish between many networks. In order to increase this range the output neuron
could be chosen to have a continuous sigmoid processing function.
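The fitness calculation can be sketched as follows; this is an illustration only, and passing E_max in as a parameter rather than deriving it from the training set is our simplification:

```python
def fitness(outputs, targets, e_max):
    """f(x) = Emax - E(x): the cumulative squared error over all training
    patterns and outputs, remapped so that lower error means higher fitness
    and the worst possible network scores exactly zero."""
    error = sum((t - o) ** 2
                for out_row, tgt_row in zip(outputs, targets)
                for o, t in zip(out_row, tgt_row))
    return e_max - error
```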
In using a supervised learning scheme, there are many other ways to implement the
fitness function of a neural network. Instead of the sum of the square errors, for
example, we could use the sum of the absolute errors or the sum of the exponential
absolute errors. Another definition of the fitness could be the number of correctly
classified facts in a training set. The fitness function could also reflect the size (=
structural complexity) and the generalisation capabilities of the network. For example
smaller networks having the same performance on the training set as bigger networks
would be preferred, as they generally have better generalisation capabilities. The
generalisation capability of a network could be added to the fitness function by
performing a test on test data that lies outside the training data. These suggestions are
not implemented here.
Parameter               Setting
ADFs                    0
creation depth          6
crossover depth         17
elitism                 on
N (population size)     500
pc (crossover rate)     1.0
pm (mutation rate)      0.1
selection mechanism     tournament (tournament size = 5)
8.6 Experiments with Genetic Programming for Neural Networks
No Automatically Defined Functions (ADFs) were used, as they did not seem
necessary for such a simple task.
Several runs were performed on this problem with solutions evolving between
generation 1 and generation 5. Figure 8.4 shows a solution that was found in a
particular run in generation 5. All solutions found had a number of neurons ranging
from 3 to 5. When the roulette wheel reproduction mechanism was used instead of the
tournament mechanism, the convergence to a solution took on average 2 generations
longer.
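The tournament mechanism used here (tournament size 5) can be sketched as follows; a minimal illustration, not the GPC++ code:

```python
import random

def tournament_select(population, fitnesses, size=5, rng=None):
    """Draw `size` distinct individuals at random and return the fittest;
    larger tournaments mean stronger selective pressure."""
    rng = rng or random.Random()
    contestants = rng.sample(range(len(population)), size)
    best = max(contestants, key=lambda i: fitnesses[i])
    return population[best]
```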
Figure 8.4 A generated neural network that performs the XOR problem
The GPNN system was extended with a bias input to every neuron by means of an
extra random constant (in the range [-4,4]) added to every P function. The effect of
this on the XOR problem was a somewhat slower convergence. The reason might be
that the search space is increased, while for a solution to this simple problem bias-
inputs are not needed. It should be noted that the GPNN system with this specific set
up cannot generate the 'minimal XOR network'. This network is pictured in Figure
8.5.
Figure 8.5 The minimal XOR network. This is the neural network with the lowest
complexity (number of connections) that can perform the XOR problem
GPNN cannot generate this network simply because the P functions are only allowed
to have two arguments (inputs), while for this particular network the output neuron has
three inputs. The GPNN settings can of course be changed so that the function set F
contains two P functions: one with two arguments, P1(arg1, arg2), and one with three,
P2(arg1, arg2, arg3). The minimal XOR network can then be represented by a
chromosome using these two functions.
Input 1   Input 2   Output 1   Output 2
   0         0          0          0
   0         1          0          1
   1         0          0          1
   1         1          1          0
In effect this means that the first output has to solve the AND function on the two
inputs, and the second output the XOR function.
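The truth table above is simply the half-adder decomposition, which can be checked directly (sketch):

```python
def one_bit_adder(a, b):
    """One-bit (half) adder: the first output is the carry (AND of the
    inputs), the second is the sum (XOR of the inputs)."""
    return a & b, a ^ b
```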
The same characteristics as used in the XOR problem were used. A solution to the
problem was found on all 10 runs between generation 3 and generation 8. One of them
is shown in Figure 8.6. The convergence is much faster than in [41], where a solution
was only found after 35 generations, also using a population of 500.
Figure 8.6 A generated neural network that performs the one-bit adder problem
As can be seen from the figure, the neural network found is indeed made up of an
AND and an XOR function. On average the generated neural networks had more than
just 5 neurons and the largest effective network had 20.
The results were poor. When the same settings as in the above experiments were used,
roughly half of the training set was classified correctly. Automatically Defined
Functions (ADFs) were introduced taking two, three and four arguments respectively,
but no improvements were observed. The function set was also extended with
processing functions P3 and P4 taking three and four arguments respectively. Again the
performance was still very poor.
Although GPNN was not able to find a solution to this problem, it should be noted
that GP has been found to be a very good classifier on the intertwined spirals problem.
In [40] a GP system gave a very good performance on this problem using the
following set-up: The terminal set, T, was made up of the two data inputs DO and Dl
and the usual real-valued terminal R:
T = {D0, D1, R}
The function set consisted of the arithmetic functions +, -, *, %, the functions SIN and
COS, and the function IFLTE (If Less Than or Equal to). The IFLTE function takes 4
arguments (branches) and is defined as: if (arg1 <= arg2) then return arg3, else return
arg4. So the function set F is:

F = {+, -, *, %, SIN, COS, IFLTE}
No creation or crossover rules are needed and the fitness function is simply the
classification error on the intertwined spirals data set. This GP configuration gave
very good results on the intertwined spirals classification task and a 100% correct
classification on the data set is reported.
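The less familiar primitives can be sketched as plain functions. Note that '%' in GP function sets conventionally denotes *protected* division; the exact value returned on a zero divisor (1.0 here) is an assumption, as the set-up in [40] is not quoted on that detail:

```python
def iflte(a, b, c, d):
    """IFLTE: if a <= b, return the third argument, else the fourth."""
    return c if a <= b else d

def protected_div(a, b):
    """The '%' of GP function sets: ordinary division, except that a zero
    divisor returns a safe constant instead of raising an error."""
    return a / b if b != 0 else 1.0
```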
• There are severe restrictions on the network topologies generated: only tree
structured networks are possible.
• The learning of the topology and weights is done simultaneously within the same
algorithm. This has the drawback that a neural network with a perfectly good
topology might have a very poor performance and will therefore be thrown out of
the population just because of the value of its weights.
8.7 Discussion of Genetic Programming for Neural Networks
It is believed that the main reason why the GPNN approach fails to scale up to larger
size problems lies in the restrictions mentioned and the very large chromosome size
needed. An approach which overcomes some of the limitations of GPNN is discussed
in Chapter 10.
9. Using a GA to Optimise the Weights
of a Neural Network
This chapter describes experiments using a genetic algorithm for weight optimisation
of a feedforward neural network. When genetic algorithms are used to optimise the
structure of feedforward neural networks a separate learning algorithm such as back
propagation is often used to train the weights (see Figure 7.1). In the weight
optimisation module a separate genetic algorithm can be used instead of back
propagation, making the system a meta-level GA. The performance of a GA as a
neural network weight optimiser is investigated here. For certain problems (see e.g.
[50], [64]) genetic algorithms have proven to be comparable to or even better than
back propagation. In section 7.1 an overview was presented on the research in this
area. The best results were noted when a Steady State Genitor-type GA was used with
a real-valued coding of the weights. We have used a normal GA with 'non-
overlapping populations' and an altered replacement mechanism so that it can act as a
Steady State Genetic Algorithm. Since the main characteristic of a Genitor-type GA is
thought to be its extreme selective pressure or 'pushing force' of above average
individuals, its performance can be approximated by a normal GA with the
appropriate selection mechanism. The effect of the selective pressure on the GA
performance as a weight optimiser is also investigated here.
First a brief description of the GA software is given. After this the set-up of the GA is
discussed and experiments are presented where the GA weight optimiser is compared
to the standard back propagation algorithm. Finally the results are discussed.
9.1 Description of the GA Software
The SUGAL software mutation operator was changed so that a single gene is subject
to mutation with probability pm (and not a chromosome as was the case). This
probabilistic implementation of the mutation operator, where every gene has to
undergo a 'test' to determine whether or not it should be mutated, makes the program
quite slow.
A second change was made concerning the selection of the pair of candidates. In
SUGAL it was possible for a single individual to be chosen both as the father and as
the mother. In such a case the offspring are simply exact copies of the parent no matter
what kind of crossover takes place. This has the effect of lowering the effective
crossover rate and in populations with one superfit individual it may easily lead to
premature convergence. The code was changed so that the father and mother
chromosome could not be one and the same.
9.2 Set-up
In this section the set-up of the GA is described for the implementation of neural
network, weight optimisation.
• Coding
The coding is chosen to be real-valued. A single chromosome represents all the
weights in the neural network (including the bias weights), where a single real-valued
gene corresponds to one weight-value. The nodes in the network are numbered from
'0' starting at the bias-unit, then the input units, the hidden neurons and finally the
output neurons. Even though the input units and the bias unit are not really neurons at
all, they will be referred to as such (as is common practice). The network architecture
is not restricted to a classic fully connected layer-model. However, the hidden neurons
are numbered in such a way that neurons with a higher index are 'higher' up in the
hierarchy of the network; i.e. neurons can only have outgoing connections to neurons
with a higher index. Figure 9.2 illustrates this. The indices of the weights represent the
order in which they appear in the chromosome. Incoming weights to a certain neuron
are grouped together in the chromosome representation.
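The numbering and grouping convention can be sketched as follows. This illustrates the ordering idea only (it is not the SUGAL code); the `connections` mapping from each neuron index to its list of incoming source indices is our representation:

```python
import math

def forward(chromosome, connections, n_inputs, inputs):
    """Decode a real-valued chromosome into weights and run one forward
    pass.  Node 0 is the bias unit, nodes 1..n_inputs are the inputs, and
    the incoming weights of each neuron are grouped together in the
    chromosome, in increasing order of neuron index."""
    n_nodes = max(connections) + 1
    act = [0.0] * n_nodes
    act[0] = 1.0                                   # bias unit always outputs 1
    act[1:1 + n_inputs] = inputs
    k = 0
    for node in sorted(connections):               # lower indices feed higher ones
        s = sum(chromosome[k + j] * act[src]
                for j, src in enumerate(connections[node]))
        k += len(connections[node])
        act[node] = 1.0 / (1.0 + math.exp(-s))     # standard sigmoid on [0,1]
    return act
```

Because every neuron may only receive connections from lower-indexed nodes, a single left-to-right pass over the sorted indices suffices.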
• Initialisation
The initialisation of the weights is important when a GA is used to train a neural
network. Because the standard crossover operator for real-valued chromosomes leaves
the gene values intact, it can never introduce new values of weights and the available
genetic information is dictated by the initial values and by the mutation operator used.
When the standard mutation operator is used to simply replace genes by a new random
value in a certain range, this range and the range of initial gene values dictates the
boundaries of possible values the genes can ever obtain. The range of initialisation in
the GA weight optimiser therefore usually plays a more important role than in a hill-
climbing algorithm like back propagation. The initial values of the genes can be
chosen to be uniformly distributed within a certain range or normally distributed with
a certain mean and standard deviation.
• Evaluation
The evaluation phase involves initialising the neural network with the set of weights
contained in the chromosome. The fitness value, f(x), is then simply the cumulative
squared error of the network on the training set where the outputs are compared to the
target output patterns:
f(x) = Σ_{i=1}^{N} Σ_{j=1}^{M} (T_ij − O_ij)²

The training is supervised. All the neurons in the network perform a weighted sum of
their inputs and produce as output the standard sigmoid function on [0,1] of this
weighted sum. So O_ij(x) ∈ [0,1]. Commonly target outputs will have a value of
either 0 or 1: T_ij ∈ {0,1}.
• Stopping Criterion
In this implementation the stopping criterion is chosen to be the occurrence of a
chromosome in the current population corresponding to a neural network that
correctly classifies the complete training set within a certain tolerance. All outputs of
the network must be within this tolerance of their target values for the criterion to be
satisfied:

|T_ij − O_ij(x)| ≤ tolerance, for all patterns i and all outputs j

The default tolerance is set to 0.4, where it is assumed that all target outputs have a
value of either 0 or 1. The chosen network does not necessarily have to be the network
with the smallest error on the training set (the fittest chromosome), rather, it is the first
encountered within the stopping criterion. An alternative stopping criterion could be
when a network is found that has a fitness below a certain error value, but we have not
used this approach.
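A minimal sketch of this criterion, assuming the outputs and targets are given as flat arrays and the function name is illustrative:

```python
import numpy as np

def classifies_correctly(outputs, targets, tolerance=0.4):
    """Stopping criterion: every network output must lie within
    `tolerance` of its binary target on every training pattern."""
    diff = np.abs(np.asarray(outputs, dtype=float) - np.asarray(targets, dtype=float))
    return bool(np.all(diff < tolerance))
```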
• Selection
Before selection is performed the fitness values of the individuals are normalised (or
remapped) using a normalisation method. Normalisation is implemented in SUGAL
by optionally altering the fitness values using some function (such as ranking), and
then normalising all fitnesses so that the total of the fitness values of the population
equals 1. Normalisation methods include inversion, where the fitness values are
inverted so that lower fitnesses take higher values and high fitnesses take low values,
linear ranking, where the fitness value becomes a linear function of the rank of the
chromosome, and geometric ranking, where the fitness is a geometric function of the
rank.
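The linear ranking variant can be sketched as follows. The `bias` parameter and the exact rank-to-weight mapping are assumptions (mirroring the 'reverse linear ranking with bias' setting used later in this chapter); SUGAL's internal remapping function is not reproduced here:

```python
def normalise_linear_ranking(errors, bias=10.0):
    """Remap raw error values (lower is better) to selection
    probabilities by linear ranking: the best individual receives
    `bias` times the weight of the worst, and the weights are
    normalised to sum to 1. Assumes at least two individuals."""
    n = len(errors)
    # rank 0 = worst (highest error) ... rank n-1 = best (lowest error)
    order = sorted(range(n), key=lambda i: errors[i], reverse=True)
    weights = [0.0] * n
    for rank, i in enumerate(order):
        weights[i] = 1.0 + rank * (bias - 1.0) / (n - 1)
    total = sum(weights)
    return [w / total for w in weights]
```

Because only ranks matter, the absolute scale of the error values has no influence on selection pressure.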
• Crossover
The standard one-point, two-point and uniform crossover operators are available.
Since entire genes are swapped, these operators can never change a gene value (a
weight). An exception is the linear crossover operator (see section 6.5.1), which was
implemented as a special option as follows: the first offspring x3 receives the average
values of all the genes of its parents, i.e. x3 = (x1 + x2)/2. The other offspring is generated as
x4 = (3x1 − x2)/2.
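A sketch of this operator on real-valued chromosomes (the function name is an assumption):

```python
def linear_crossover(x1, x2):
    """Linear crossover on real-valued chromosomes: the first offspring
    is the gene-wise average of the parents, x3 = (x1 + x2)/2; the
    second is extrapolated past parent 1, x4 = (3*x1 - x2)/2, so that,
    unlike one-point or uniform crossover, new weight values can arise."""
    x3 = [(a + b) / 2.0 for a, b in zip(x1, x2)]
    x4 = [(3.0 * a - b) / 2.0 for a, b in zip(x1, x2)]
    return x3, x4
```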
• Mutation
As stated above, the standard mutation operators in SUGAL were changed. There are
two mutation operators available. Normal, 'uniform', mutation re-initialises the gene
with a random value. This new random value can be taken from a uniform distribution
within a certain range or from a normal distribution with a given mean and standard
deviation. Creeping, 'Gaussian', mutation is such that a normally distributed value
with a certain standard deviation is added to the current value of the gene. The
SUGAL code was extended so that both mutations could operate at the same time,
each with its own mutation rate.
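Both operators can be sketched as follows; the per-gene application of the mutation rate and the default ranges are assumptions for illustration:

```python
import random

def uniform_mutation(genes, rate, low=-5.0, high=5.0):
    """'Normal' (uniform) mutation: re-initialise a gene with a fresh
    random value from a fixed range (illustrative default bounds)."""
    return [random.uniform(low, high) if random.random() < rate else g
            for g in genes]

def creeping_mutation(genes, rate, sigma=1.0):
    """Creeping ('Gaussian') mutation: add N(0, sigma) noise to the
    current gene value, so the search creeps around existing weights."""
    return [g + random.gauss(0.0, sigma) if random.random() < rate else g
            for g in genes]
```

Because both functions share the same signature, running them in sequence with separate rates reproduces the extended SUGAL behaviour described above.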
• Replacement
All the replacement mechanisms as described in section 3.1.5 are available, i.e.
ranked/unranked and conditional/unconditional replacement. In unranked
replacement, the 'doomed' individuals are chosen randomly. With ranked replacement
the doomed are the least fit individuals of the population. When the replacement is
unconditional, candidates always replace the doomed individuals. In conditional
replacement the doomed individual is only replaced if its replacement is fitter.
SUGAL offers the ability to set the number of candidates that are generated during
each generation to any number Nc. Ranked unconditional replacement then becomes
an extended form of elitism where the worst Nc individuals of a population are
replaced each generation. When Nc is set to 1 (or 2) the GA is transformed into a
Steady State Genetic Algorithm. The SUGAL settings resulting in a Genitor-type GA
therefore are: Nc = 1, normalisation method = linear ranking, replacement mechanism
= ranked unconditional replacement.
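The four replacement combinations can be sketched as below, assuming higher fitness is better (i.e. after inversion or ranking normalisation) and that the candidate count N_c equals the length of the candidate list; the function name and in-place update are illustrative:

```python
import random

def replace(population, fitnesses, candidates, cand_fitnesses,
            ranked=True, conditional=False):
    """One generation of SUGAL-style replacement (sketch). Ranked: the
    least-fit individuals are doomed; unranked: doomed chosen at random.
    Conditional: a doomed individual is only replaced if the candidate
    is fitter; unconditional: candidates always replace the doomed."""
    n_c = len(candidates)
    if ranked:
        doomed = sorted(range(len(population)), key=lambda i: fitnesses[i])[:n_c]
    else:
        doomed = random.sample(range(len(population)), n_c)
    for slot, (cand, cf) in zip(doomed, zip(candidates, cand_fitnesses)):
        if not conditional or cf > fitnesses[slot]:
            population[slot] = cand
            fitnesses[slot] = cf
    return population, fitnesses
```

With `ranked=True`, `conditional=False` and N_c = 1 this gives the Genitor-style steady-state behaviour described above.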
9.3 Experiments
This section is concerned with experiments that were performed with the GA system
described above. Several neural network weight optimisation problems were tried and
the results were compared with the standard back propagation learning algorithm. All
neural networks used here have all their hidden and output neurons connected to a bias
unit that has a constant output of 1.
9.3.1 Data-sets
• 4 to 4 Encoder Problem
The 4 to 4 encoder problem is a simple one to one mapping of all of the 16 possible 4
bit binary inputs to the outputs. The target output values are identical to the input
pattern for each training pattern. Table 9.1 shows the 4 to 4 encoder training data.
A '4-4-4' fully connected feedforward neural network was used for which the
backpropagation algorithm had no problems learning the data. The corresponding
chromosome length is: l = 4*4 + 4*4 + 4 + 4 = 40.
In the standard GA without elitism where all newly made individuals for the next
generation are evaluated, the number of evaluations per generation simply equals the
population size. In the GA used here this does not hold in general. Some individuals
pass on from one generation to the next unchanged and are not evaluated. For this
reason, during each GA run the number of evaluations needed to find the solution is
recorded. When comparing the GA and BP algorithms on a certain problem the
number of GA evaluations, or the number of passes through the neural network, will
simply be called iterations. Thus:

iterations = number of evaluations = number of passes through the network
9.3.3 Results
It is difficult to visualise the operation of a GA. In this section graphs are presented
that show the fitness of the best individual in the population versus the number of
generations. In contrast to a hill-climbing algorithm like BP this graphical
representation does not give much insight into the actual search of the GA.
SUGAL offers a measure of the diversity of the population at the end of each
generation. The diversity measure for a real-valued coding as is used here is the mean
of the standard deviations of each gene across the entire population. So:
$$D = \frac{1}{l}\sum_{i=1}^{l}\sigma_i$$

where: D = the diversity of the population
l = the chromosome length
σ_i = the standard deviation of gene i across the population
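A sketch of this diversity measure for a population stored as an N × l array of real-valued genes:

```python
import numpy as np

def diversity(population):
    """Population diversity for a real-valued coding: the mean of the
    per-gene standard deviations across the whole population."""
    genes = np.asarray(population, dtype=float)   # shape (N, l)
    return float(np.mean(np.std(genes, axis=0)))
```

A diversity of 0 means the population has converged to a single point in weight space.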
The settings used for this particular run are given in Table 9.2.
Parameter                      Setting
crossover type                 two point
elitism                        on
fitness normalisation          reverse linear ranking with bias = 10.0
initialisation of population   Normal distribution: N(0,5)
l                              40
mutation type                  creeping with N(0,1) distribution
N                              50
p_c                            0.8
p_m                            0.1
re-evaluation                  off
replacement mechanism          ranked unconditional
selection mechanism            Roulette
A Normal (or Gaussian) distribution is characterised as N(μ,σ), with μ = the mean and
σ = the standard deviation of the distribution. At generation 64 an individual was
found that correctly performed the 4 to 4 encoder problem subject to the required
tolerance on the target output values of 0.4. A total of 3195 evaluations (passes
through the network) was needed to find this solution. Over the course of ten runs with
the same settings the average number of evaluations needed to find a solution was
about 3500, corresponding to an average of 87 generations.
As can be seen from Figure 9.3 the diversity of the population remains fairly high
throughout the run. It drops from its initial value of about 5.0 to 1.0 at generation 40
and stays there due to the relatively high mutation rate.
Because the population size is so small, the effects of selective pressure and genetic
drift in the population are rather large. The population is in general very quickly
dominated by a single superfit individual. To get an idea of the effect of genetic drift
alone, a run was performed without any selective pressure and with the genetic
operators crossover and mutation turned off (p_c = p_m = 0). The GA selects individuals
without preference and copies them into the next generation. The population
converged to a single individual in just 7 generations.
• Effect of Mutation
The effect of the mutation operator was investigated to some extent. Runs were
performed with the normal mutation operator instead of the creeping one. It was
generally observed that this resulted in a poorer performance on the problem. When
both mutation operators were used at the same time, the results were about the same as
the situation where only the creeping mutation was used. The mutation operator
clearly plays a very important part in GA weight optimisation and the GA
performance depends greatly on the settings used.
As was also reported in [64] the GA converges to a solution even with a population
size as small as 5. Not all the runs converged, but the ones that did (about 80%)
needed on average about half the number of iterations (2000) as those in the case of a
population size of 50. For a 'normal' GA, convergence to a solution would normally
not be found with such a small population size since there simply is not enough
genetic diversity to maintain a proper search by means of intelligent hyperplane
sampling (formation of building blocks). As was also mentioned in [64] the fact that
the GA converges to a solution even with such a small population size strongly
suggests that the search is mainly performed by genetic hill-climbing (see section
6.1.1). Solutions were even found with a population size of 2, although in this case
about 50% of the runs did not converge.
Despite the fact that BP has some problems in finding the global optimum for this
problem again it drastically outperforms the GA system in convergence time.
9.4 Discussion
The GA system has not been found to perform well on the task of feedforward neural
network weight optimisation. It is drastically outperformed by back propagation on
the problems investigated. This might however partially be caused by the nature of the
problems. Problems for which the BP algorithm has no difficulty in finding the
optimum are typically problems with a low level of epistasis resulting in a 'simple'
error landscape. Back propagation will not get 'trapped' in local minima for these
problems and it is not surprising that a hill-climbing algorithm such as BP will
outperform a global search algorithm like GA. Problems which do pose severe
convergence problems for back propagation may be better suited for the genetic
algorithm. It is reported in [50],[64] that a GA system very similar to the one
implemented here does outperform back propagation on some large size tasks that are
very difficult for BP.
Several facts seem to indicate that the genetic algorithm in this set-up does not
perform a global search through the weight space by means of intelligent hyperplane
sampling. Instead, the search seems to be focused around a single individual and
better solutions are generated by genetic hill-climbing. Reasons why the GA seems to
work better as a genetic hill-climber on weight optimisation problems very likely
include the competing conventions problem, caused by multiple chromosomal
representations coding identically functioning networks. By focusing the search
around a single individual this problem is avoided. Another reason why a global
search may not work very well is simply the extremely large size of the search space
for bigger sized problems.
Future work will need to be done in optimising the GA set-up for neural network
weight optimisation, possibly extending the set of genetic operators with ones that are
more problem specific. This can present an alternative in tackling the competing
conventions problems. Some good results have been reported in literature where
genetic operators were used that use some kind of gradient information of the error
landscape. Since competing conventions seem to be such a major problem for weight
optimisation with a standard GA, better results may be expected when niching
techniques such as restrictive mating are used, although this has not been investigated.
Since BP is very good at fine tuning potential solutions and the standard GA can
perform a global search in the problem space a hybridisation of the two seems natural.
When both the structure and the weights of the network are coded in the chromosome,
the resulting system is best described by a Structured Genetic Algorithm (sGA). In
[11], where a direct representation was used, good results were reported using an sGA
on small problems such as the XOR or small decoder networks but it was found the
method did not scale up well to bigger problems. Instead of using a direct encoding
method, better results were expected using a grammar encoding and we investigated a
method based on Kitano's matrix grammar encoding in this context. Results are
compared to the matrix grammar system without weight encoding and to a system
implementing direct encoding to represent the structure of a neural network.
132 Chapter 10. Using a GA with Grammar Encoding to Generate Neural Networks
connectivity of the network, the bottom level the values of the weights and biases. The
network connectivity is represented by the connectivity matrix with that part of the
chromosome treated as a binary string.
In most of the GA approaches where both network structure and weights are subject to
genetic operations, the chromosome representing both parts is thought of as a long
binary string and is subject to the genetic operators of the algorithm. As far as the
genetic operators are concerned there is no distinction between the structural and the
parametric (weight) part. This distinction is only made when the chromosome is
translated into the actual neural network.
It is possible to make a distinction between the structural and the weight part of the
chromosome. When different codings (i.e. binary and real-valued) are used for the two
parts, this distinction must be made and non-homogeneous chromosomes are needed.
In [11] the best performance of the sGA algorithm was observed when the weights and
biases were coded as real-valued genes, as opposed to the binary coded structural part
of the chromosome. Genetic operations like crossover and mutation can now be
thought of as being either structural or parametric changes to the network depending
on what part of the chromosome they operate on. When the changes are structural, the
resulting offspring can inherit the set of weights from its parent. This process is called
'weight transmission' and will be described in the next section. A 'structural'
crossover is illustrated in Figure 10.1.
Figure 10.1 Abstract visualisation of structural crossover. The offspring inherit the
set of weights from their parents
weight W_ij. This process is called 'reduced weight transmission' and it was found that
the optimum value of F depended very much on the problem. A training module was
used to learn the weights and it was found that the reduced weight transmission
mechanism speeded up learning by more than an order of magnitude compared to
starting with random weights. The idea is that, with weight transmission, the training
of the networks will generally start off from a better point in the weight space when
compared to starting at a random point, and that less training is required during
evaluation of the networks.
There are several ways to implement weight transmission. For example, the weights of
the offspring network can be set to a fraction of the corresponding weights of one or
both parents. The parent networks could first be checked to see which weights in the
complete weight set are actually in use. When the offspring network uses a connection
that is also in use by one or both of its parents, a fraction of the corresponding
weight(s) can be 'transmitted'. The problem is then to initialise weights that are not in
use by either of the parents. We choose to use a system where two parents produce
two offspring, and each offspring inherits a fraction F of a particular weight of one of
its parents. Normally, F is set to a default of 1, so the entire weight value is
transferred. For each weight of the offspring there is an equal chance of inheriting the
weight value from either parent. So after weight transmission an offspring will on
average have inherited 50% of its weights from parent 1 and 50% from parent 2. Other
options include allowing offspring to inherit all weights from a single parent or to let
the offspring's weights be an average of the weights of its parents. These options are
not investigated here. When no crossover is performed on the pair of candidates the
offspring are identical to the parents, including the set of weights.
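The chosen scheme can be sketched as follows; the coin flip per weight and the fraction F follow the description above, while the function name is an assumption:

```python
import random

def transmit_weights(parent1_w, parent2_w, fraction=1.0):
    """Weight transmission sketch: for each weight position, each
    offspring inherits a fraction F of the value from one of the two
    parents, chosen with equal probability; the other offspring gets
    the fraction of the other parent's value. With F = 1 the whole
    weight value is transferred."""
    child1, child2 = [], []
    for w1, w2 in zip(parent1_w, parent2_w):
        if random.random() < 0.5:
            child1.append(fraction * w1)
            child2.append(fraction * w2)
        else:
            child1.append(fraction * w2)
            child2.append(fraction * w1)
    return child1, child2
```

On average each offspring therefore inherits about 50% of its weights from each parent, as stated above.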
When using a grammar encoding instead of a direct encoding method as above, the
reduced weight transmission concept may not work as well. This is because in a
grammar encoding scheme there is no one-to-one correspondence between the
structural and the weight space. The part of the chromosome representing the network
structure, coded by grammar rules, is in general much shorter than the part
representing the network's weights. When two network parent structures are involved
in reproduction there is no guarantee that the resulting offspring will use weights that
were used by both or any of its parents.
When a network is evaluated, the weight training starts from a point in the weight
space determined by the set of weights of one of its parents. Assuming the network
structure of the offspring is similar to the structures of the parents from which it has
inherited the set of weights, the learning will in general start off from a much better
point when compared to starting off with random weights. When the network structure
of the offspring is not similar to its parent's however, weight transmission may not be
very useful. In this case the set of weights received from the parents might not be
better than just starting off with random weights. Consequently such a network will in
general need more training to reach an optimal set of weights compared to a network
that starts off training much nearer to the optimum in weight space. This probably
presents the main difficulty in the weight transmission scheme. Assuming that the
amount of training a network receives is limited, weight transmission strongly favours
networks that are structurally close to their parents. It may therefore result more in a
local search through the structure space than a global one. If the optimal network
structure can be found by such a local search, this may not necessarily be a bad thing. In
order to test every network fairly without favouring some, the concept of weight
transmission will have to be abandoned or the amount of training that a network
receives will have to depend on the position in weight space that the training starts
from. The latter will be almost impossible to realise since the distance between the
starting point and the optimum in weight space will in general not be known. Another
option is of course to give every network so much training that the optimum can
always be found. Weight transmission will then no longer be needed. But the purpose
of weight transmission was to bring down the amount of training in the first place.
When only those weights that are used by the network are encoded in the
chromosome, difficulty arises when structural crossover or mutation is performed. For
example, when an offspring produced by crossover uses a connection that was not
used by any of its parents, it cannot obtain the corresponding weight value from the
parents. Instead this newly created weight has to be initialised with a random value.
Performing reproduction on the weight space itself is not likely to be a viable option
here since in general there is no one-to-one correspondence between the weight strings
of two different chromosomes. When this method is chosen a variable length GA has
to be used.
Kitano uses a constant and a variable part within the chromosome. The constant part
does not change and consists of the final rewriting rules. It would seem that there is no
point in coding these into the chromosome and in our implementation these 16 final
rewriting rules are set in the system and are the same for every chromosome. The LHS
of these constant rules is a character from the set 'a' to 'p'. The RHS is one of the 16
possible 2×2 matrices consisting of 0's and 1's. Thus the final, constant, rewriting
rules are:
The higher level grammar rewriting rules are coded in the chromosome and are
subject to the genetic operators. The starting character is always 'Start', to make sure
the initial rewriting step can always be performed. The other positions of the
chromosome are characters in the range 'A' to 'p'. A set of 5 characters defines a
rewriting rule, starting from the initial rewriting rule where the LHS is always the
starting symbol 'Start'. By placing no restrictions on the rewriting rules, many rules
may be developed that rewrite the same character, or for some characters no rules may
be developed at all. Furthermore, many developed rules may never be used. Kitano
normally uses chromosomes with a length of 100, which means a number of 100 / 5 =
20 generated rewriting rules. Examples of developed rewriting rules are:
Start → [A b c A],  A → [a a c b],  a → [a A b b]
At the end of the M matrix rewriting cycles the connectivity matrix is formed out of
the acquired string. The size of this matrix, and therefore the maximum size of the
network (the number of neurons), is predetermined to be 2^M × 2^M. The connectivity
matrix consists of '1's and '0's: a '1' denotes a connection, a '0' no connection.
When, after the rewriting cycles, a position in the matrix is still a 'non-differentiated'
cell (i.e. neither a '1' nor a '0'), Kitano considers it dead and it is therefore set
equal to '0'. In the connectivity matrix the first n rows and columns correspond to the
input nodes and the last m rows and columns to the output neurons.
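The rewriting process can be sketched as follows, with an illustrative rule set; Kitano's actual data structures will differ:

```python
def develop_matrix(rules, M):
    """Expand the starting symbol through M rewriting cycles into a
    2^M x 2^M connectivity matrix. `rules` maps a character to its
    2x2 right-hand side; a character with no rule (a 'non-differentiated'
    cell) is treated as dead and expands to all '0's."""
    matrix = [['Start']]
    for _ in range(M):
        size = 2 * len(matrix)
        new = [[None] * size for _ in range(size)]
        for r, row in enumerate(matrix):
            for c, ch in enumerate(row):
                # Replace each cell by the 2x2 sub-matrix its rule defines.
                sub = rules.get(ch, [['0', '0'], ['0', '0']])
                for dr in (0, 1):
                    for dc in (0, 1):
                        new[2 * r + dr][2 * c + dc] = sub[dr][dc]
        matrix = new
    return matrix
```

With M = 3 this yields the 8 × 8 matrix used for the XOR example later in the chapter.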
The final network may need to be pruned because it is possible that nodes are created
that have no outgoing or no incoming link. Pruning is a repair mechanism and is one
of the possibilities to handle constraints in a GA. Other options include the
punishment of individuals that violate constraints (i.e. give them a poor fitness value)
or choosing the chromosomal representation in such a way that chromosomes that
violate constraints simply do not occur. Pruning may of course be combined with
punishment, so that the GA will prefer networks that need the least pruning.
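A sketch of such a pruning repair step, assuming the connectivity matrix is stored row-to-column (source to destination) and that input/output neuron counts are known:

```python
def prune(conn, n_in, n_out):
    """Remove the connections of hidden neurons that lack either an
    incoming or an outgoing link, repeating until stable, since removing
    one neuron's connections can orphan another. `conn[i][j] == 1`
    means a link from neuron i to neuron j."""
    n = len(conn)
    changed = True
    while changed:
        changed = False
        for h in range(n_in, n - n_out):            # hidden neurons only
            has_in = any(conn[i][h] for i in range(n))
            has_out = any(conn[h][j] for j in range(n))
            if not (has_in and has_out):            # one side missing: useless
                for i in range(n):
                    if conn[i][h]:
                        conn[i][h] = 0
                        changed = True
                    if conn[h][i]:
                        conn[h][i] = 0
                        changed = True
    return conn
```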
10.3 The Modified Matrix Grammar 137
This method can be seen as a multiple level substitutional compressor (also known as
dictionary-based compressors), where the compressed string and the dictionary used
are generated using a GA and in the evaluation stage it is decompressed into a
connectivity matrix. The basic idea behind a substitutional compressor is to replace an
occurrence of a particular phrase or group of characters in a piece of data with a
reference to a previous occurrence of that phrase. In this context the starting character
'Start' could be seen as the compressed string and the rewriting rules as a
hierarchically ordered dictionary.
The structure of the chromosomes in this system is not the same as in a Structured
Genetic Algorithm (sGA). In an sGA a set of lower level genes is unique to one higher
level gene. In the approach described above, this is not the case.
Some problems also present in Kitano's system still remain. Many rules may never be
used and several characters could have identical rewriting rules, essentially making all
but one of them unnecessary. An idea could be to code rewriting rules in the
chromosome only when they are used, and leave them out otherwise. The problem
then is that when a character is referred to but has no rewriting rule, a (random) one
has to be made. Furthermore the restriction on matrix sizes of 2^M × 2^M still applies.
• Coding
This section describes how the characters are coded in the chromosome. Kitano uses
binary coding; e.g. 'a' = 0001, 'b' = 0010 etc. Depending on the GA software used, it
might be preferable to simply code the characters as symbols and we use this
representation. In effect it means that the crossover operator can only work on a group
of characters and thus will leave the characters themselves intact. Only mutation can
change the value of a character by re-initialising it with a random value. Using binary
coding the genetic operators can operate within the representation substring of a
character.
In implementing the system described above, we need as many alphabets as there are
rewriting cycles. Since there is a rewriting rule for every character of the alphabet, the
chromosome length defines the size of the alphabets or vice versa. For example a
connectivity matrix of size 8 requires 3 rewriting steps and therefore 3 alphabets, A1,
A2 and A3:
A1 = {A, ...}
A2 = {a, ..., p}
A3 = {1, 0}
The starting symbol, 'Start', can be seen as the alphabet A0. A rewriting step at level
1 rewrites the starting symbol 'Start' into a 2×2 matrix consisting of characters of
alphabet A1. In general a rewriting step at level i consists of rewriting a character of
alphabet A_i into a 2×2 matrix of characters of alphabet A_{i+1}. The characters of
alphabet A_i will be denoted by S_1^i, ..., S_{k_i}^i, where k_i is the cardinality of the alphabet.
So in the example above S_1^1 corresponds to 'A', S_3^2 corresponds to 'c', k_2 = 16, k_3 = 2,
and so on.
The last two alphabets are predefined and the same for any size matrix. Since we code
a rule for every character in an alphabet, the left-hand side (LHS) of every rule is
predefined and there is no need to code these in the chromosome. Thus a rewriting
rule is represented by its RHS consisting of 4 characters.
• Example
This is a simple example based on [38] of a chromosome representing a neural
network that can perform the XOR task. The network has two inputs and one output.
The system uses three rewriting cycles (M = 3) so that the size of the connectivity
matrix is: 2^3 × 2^3 = 8 × 8. The alphabets used are:
A1 = {A, B, C, D}
A2 = {a, ..., p}
A3 = {1, 0}
This particular configuration can be described by k_1 = 4, since the alphabets A2 and A3
are pre-defined. An example of a chromosome representing an XOR network is:
The fixed rewriting rules are not part of the chromosome, but are embedded in the
system and are identical to the ones used in Kitano's grammar; i.e.:
This chromosome is translated into the neuron connectivity matrix by means of the
following rewriting cycles:
Figure 10.2 An example of the rewriting cycles for a simple XOR network
All hidden and output neurons have a connection from the bias neuron, which has a
constant activation value of 1. These connections are not represented in the
connectivity matrix. One entry in the connectivity matrix needed to be pruned. In the
connectivity matrix a connection between neurons 5 and 7 is encoded, but since
neither of these neurons are connected to any other neuron the connection is useless.
As can be seen above, a rewriting rule for the character 'C' is coded in the
chromosome but it is not actually used. Because only feedforward neural networks are
wanted that have no connections to input and no connections from output neurons,
only the highlighted upper-right part of the connectivity matrix is used. Therefore the
method is not very 'clean': some information that is contained in the chromosomes is
never used (i.e. the lower-left part of the matrix). The corresponding neural network is
shown in Figure 10.3.
Figure 10.3 The neural network constructed from the matrix grammar
matrix is made up of four. If it must be possible for no two sub-matrices to be
identical (e.g. A1 = {A, B, C, D} or larger), the chromosome will
become very large for larger connectivity matrices. The cardinality of every alphabet
must then be equal to (or larger than) the number of corresponding sub-matrices in the
complete matrix. For example a 16x16 matrix would require the following (minimal)
configuration: k_1 = 4 (the number of 8×8 matrices) and k_2 = 16 (the number of 4×4
matrices). The resulting chromosome would have a length l = 4 + 4*4 + 4*16 = 84. In
the case of a 32×32 matrix, the corresponding chromosome length would be: l = 4 +
4*4 + 4*16 + 4*64 = 340. In general, the chromosome length needed for a system with
M rewriting cycles (i.e. a matrix of size 2^M × 2^M) is:

$$l = \sum_{n=1}^{M-1} 4^n$$
Thus for larger size matrices the chromosomes will become large indeed. Still, this
chromosome length is smaller than that needed in the direct encoding scheme to code
the same size matrix. In the direct encoding scheme there exists a one-to-one
correspondence between the genes in the chromosome and connections in the neural
network. Each gene codes one connection. This scheme will be described in section
10.5.
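The two chromosome lengths can be compared with a small sketch; the direct-encoding count assumes one gene per cell of the full connectivity matrix, and both function names are illustrative:

```python
def grammar_chromosome_length(M):
    """Chromosome length for a 2^M x 2^M connectivity matrix when every
    sub-matrix may be unique: l = sum of 4^n for n = 1 .. M-1."""
    return sum(4 ** n for n in range(1, M))

def direct_chromosome_length(M):
    """For comparison: a direct encoding with one gene per cell of the
    full 2^M x 2^M matrix (an assumption for illustration)."""
    return (2 ** M) * (2 ** M)
```

Even for moderate M the grammar encoding is already the shorter of the two (84 vs. 256 genes for a 16 × 16 matrix), and the gap widens as M grows.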
The principal goal is to limit the alphabet sizes so that the chromosomes will be of
manageable length. The resulting connectivity matrix is then made up of sub-matrices
that cannot all be unique, with some sub-matrices being used more than once. This
does of course place restrictions on the neural network structures that can be
generated, since they must have some form of regularity in their connectivity matrices.
The severity of these restrictions depends on the alphabet sizes used. Many problems
can be very adequately solved however using neural networks with a high level of
regularity. The classic case of a fully connected feedforward network contains for
example a very high level of regularity in the connectivity matrix, as is illustrated in
Figure 10.4 which shows the 16x16 connectivity matrix of a fully connected '4-4-3'
neural network. In this particular case the 5th, 6th, 7th and 8th columns and rows in the
matrix correspond to the four hidden neurons. Overall, significant reductions in
chromosome complexity may be obtained if relatively regular and repeating structures
are acceptable for the evolved neural networks.
• Competing Conventions
In a connectivity matrix as in Figure 10.4 a hidden neuron can be represented by any
one of the rows/columns 5 to 13. The first four and the last three rows/columns are
reserved for the input and output neurons. This means for example that the fully
connected '4-4-3' network can be represented by $\binom{9}{4} = 126$ different connectivity
matrices.
In fact the competing conventions problem is even worse than the above analysis
indicates. Using the matrix grammar scheme, many different chromosomes can be
used to represent one and the same connectivity matrix. This is illustrated below
where two different chromosomes code the same 8x8 connectivity matrix pictured in
Figure 10.2 that corresponds to the XOR network.
The reason for this type of competing convention is that a position within the
chromosome does not correspond to a fixed position in the connectivity matrix.
• Representation
The matrix grammar representation scheme relies not just on two but on three
different spaces: the representation space (the chromosomes), the evaluation space
Figure 10.6 The three spaces used in the matrix grammar representation scheme.
Two examples of competing conventions are shown
As stated above, depending on the alphabet sizes used there will be a restriction on the
network structures generated. The system will in general not be able to generate any
arbitrary feedforward neural network structure. The connectivity matrix must contain
some kind of regularity, so, the evaluation space is only part of the complete problem
space, where the problem space is defined as the complete set of feedforward
networks subject to the appropriate number of inputs and outputs.
10.4 Combining Structured GAs with the Matrix Grammar 145
The first three entries in the parametric part of the chromosome correspond to the
incoming connections of neuron 3 (i.e. column 3 with the bias weight added). The
next four entries correspond to neuron 4, etc. The 6th entry corresponds to the
connection from neuron 3 to neuron 4. Since this connection is absent in the neural
network, the corresponding weight value is simply ignored (as are most of the weight
values contained in the chromosome). There are 26 (upper-right half of the matrix of
Figure 10.2) + 6 (bias weights of all neurons) = 32 weight values represented in the
parametric part of the chromosome while only 9 of these are in use by the network.
A separate back propagation training module is used to evaluate the neural networks.
After training, the parametric part of the chromosome is updated with the trained
weights. The parametric changes that result from running the training module are
carried onto future generations by means of weight transmission. No parametric
changes are performed within the GA itself. Only the structural part of the
chromosome is subject to the genetic operations. The working of this system is
illustrated in Figure 10.8.
Figure 10.8 The sGA system with a separate Back Propagation training module
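The evaluate-train-writeback cycle described above can be sketched as follows. The chromosome layout and the stand-in training function are illustrative assumptions, not the book's implementation:

```python
import random

def evaluate(chromosome, train_fn):
    """Evaluate one sGA individual: train the decoded network starting
    from the weights stored in the parametric part, then write the
    trained weights back so they are carried to offspring (weight
    transmission).  Only the structural part is ever touched by the
    genetic operators."""
    trained, error = train_fn(chromosome["structure"], chromosome["weights"])
    chromosome["weights"] = trained
    return error

def fake_train(structure, weights):
    # Stand-in for the back-propagation module: it merely shrinks the
    # weights a little so the write-back is visible.
    trained = [w * 0.9 for w in weights]
    return trained, sum(abs(w) for w in trained)

random.seed(0)
ind = {"structure": [1, 0, 1], "weights": [random.uniform(-1, 1) for _ in range(4)]}
before = list(ind["weights"])
err = evaluate(ind, fake_train)
assert ind["weights"] != before  # parametric part updated after training
```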
Since only feedforward neural networks are generated in the present application, the
number of possible connections, or the maximum complexity Cmax, when a total of N
neurons are used is:
Cmax = (N^2 - N)/2 - (Nin^2 - Nin)/2 - (Nout^2 - Nout)/2
where:
N = total number of neurons
Nin = number of inputs
Nout = number of output neurons
The last two terms in Cmax indicate that input neurons are not allowed to have
incoming connections and that there are no outgoing connections from output neurons.
This means that in the connectivity matrix the columns corresponding to input neurons
do not have any entries (or the entries are simply discarded) and the rows
corresponding to the output neurons are empty.
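The formula can be sanity-checked in code; both example values below come from the text itself (the 4-input, 3-output, 32-neuron direct encoding needs a chromosome of length 487, and the upper-right half of the 8x8 XOR matrix with 2 inputs and 1 output has 27 entries). Bias weights are not counted here:

```python
def c_max(n, n_in, n_out):
    """Maximum number of feedforward connections (bias weights excluded):
    the upper-triangular count minus the forbidden incoming links of the
    input neurons and outgoing links of the output neurons."""
    tri = lambda m: (m * m - m) // 2  # entries in an upper-triangular half
    return tri(n) - tri(n_in) - tri(n_out)

assert c_max(32, 4, 3) == 487   # the direct-encoding example in the text
assert c_max(8, 2, 1) == 27     # upper-right half of the 8x8 XOR matrix
```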
10.4.2 Evaluation
When a chromosome is evaluated into a neural network the top level of the
chromosome needs to be translated into the connectivity matrix. This matrix is then
pruned so that there are no hidden neurons without incoming or outgoing connections.
After this step the matrix is transformed into a neural network which, after the training
phase (backpropagation), is then tested on a set of training patterns. The network uses
the values of the weights of the bottom level of the chromosome that correspond to the
connections used. The fitness value will reflect the error on this training set and can
optionally include a measure of the network's complexity. The amount of pruning that
was necessary can also be reflected in the fitness as a negative measure. In [66],
networks that were less complex (i.e. had fewer connections) were preferred over more
complex ones when both achieved comparable performance on the training data. In this
way, minimal-complexity neural networks can evolve. After training, the
parametric part of the chromosome is updated with the trained values of the weights.
Back propagation training is performed for a set number of cycles, rather than the
more customary process of stopping at a required error level, since convergence
cannot be assumed for all networks. The optimal number will depend on the problem
and on the training set used. The back propagation module is a standard one using the
normal gradient descent weight updating rule with a momentum term. Default values
for the learning rate and the momentum term are 0.1 and 0.9 respectively. The module
uses a 'per-pattern' weight update mechanism, meaning that the weights are updated
after every presentation of a training pattern (and not after a presentation of the
complete training set).
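The per-pattern update rule with momentum can be written out as below, using the text's default learning rate of 0.1 and momentum of 0.9; the list-based layout is an illustrative choice:

```python
def update_weights(weights, grads, velocity, lr=0.1, momentum=0.9):
    """One 'per-pattern' step: plain gradient descent plus a momentum
    term, applied after every single training pattern rather than once
    per presentation of the complete training set."""
    new_w, new_v = [], []
    for w, g, v in zip(weights, grads, velocity):
        step = momentum * v - lr * g
        new_v.append(step)
        new_w.append(w + step)
    return new_w, new_v

w, v = [0.5, -0.5], [0.0, 0.0]
w, v = update_weights(w, [1.0, -1.0], v)   # first pattern: pure gradient step
assert abs(w[0] - 0.4) < 1e-12
w, v = update_weights(w, [1.0, -1.0], v)   # second pattern: momentum kicks in
assert abs(w[0] - 0.21) < 1e-9             # 0.4 + (0.9 * -0.1 - 0.1)
```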
10.5 Direct Encoding 149
where: E(x) = the cumulative squared error of the individual on the training set
Noul = the number of output neurons
C(x) = the number of connections in the network (including the bias
weights)
Cmax = the maximum complexity for the specific configuration
P(x) = the number of connections that had to be pruned
The genetic algorithm works in such a way that the fitness measure is minimised
instead of maximised. The relative weights of the complexity and pruning terms can be
set by α and β. The optimal values are problem dependent and possibly quite hard to
find. Optionally, α and/or β can be set to 0 so that the corresponding term(s) do not
have any influence on the fitness.
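The fitness formula itself is not reproduced in this excerpt; a plausible minimised fitness built from the terms defined above might look as follows, where the normalisation of the error by Nout and of the complexity and pruning terms by Cmax is an assumption rather than the book's exact expression:

```python
def fitness(error, n_out, c_x, c_max_val, p_x, alpha=1.0, beta=0.0):
    """Hypothetical minimised fitness: normalised training error plus
    weighted complexity and pruning penalties.  The term names E(x),
    Nout, C(x), Cmax, P(x), alpha and beta follow the text; the
    normalisations are assumed."""
    return error / n_out + alpha * c_x / c_max_val + beta * p_x / c_max_val

# Setting alpha and beta to 0 leaves only the error term:
assert fitness(3.0, 3, 20, 100, 5, alpha=0.0, beta=0.0) == 1.0
assert fitness(0.0, 3, 20, 100, 0, alpha=1.0, beta=0.0) == 0.2
```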
The direct encoding scheme differs from the matrix grammar scheme described above
in the representation of the network structure. The parametric part of the chromosome
that codes the weights is identical. Direct encoding is implemented as a bit-string that
directly represents the neural network with a one-to-one correspondence between the
genes and the connections of the network. The bit-string is simply the upper-right half
of the connectivity matrix that defines the neural network structure. As with the matrix
grammar scheme, the size of the matrix must be set a priori; thus the maximum
number of neurons is pre-defined. In contrast to the matrix grammar scheme, however,
the direct encoding scheme is not restricted to matrices of size 2^N x 2^N: any matrix
size can be used.
As an example of how the structural part of the chromosome is translated into a neural
network, the same XOR network of the last section is considered. The matrix size
150 Chapter 10. Using a GA with Grammar Encoding to Generate Neural Networks
used is the same: M = 8. The upper-right half of the connectivity matrix (the
highlighted part of Figure 10.2) consists of 27 bits. The structural part of the chromosome
contains these bits 'row-wise'. The network structure is now represented by the
following bitstring:
The matrix is then translated into the same XOR neural network structure as in Figure
10.3.
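Reading the upper-right half of a connectivity matrix row-wise into the structural bitstring can be sketched as follows; a small 4x4 matrix is used here for illustration:

```python
def to_bitstring(matrix):
    """Direct encoding: flatten the upper-right half of the connectivity
    matrix row-wise into the structural part of the chromosome."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]

m = [[0, 1, 0, 1],
     [0, 0, 1, 0],
     [0, 0, 0, 1],
     [0, 0, 0, 0]]
bits = to_bitstring(m)
assert len(bits) == 4 * 3 // 2          # n(n-1)/2 genes for an n x n matrix
assert bits == [1, 0, 1, 1, 0, 1]       # rows read left to right, top down
```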
Advantages of using direct encoding over the matrix grammar method are that the
matrix size is not restricted to values of 2^N. Furthermore the method is cleaner in that
only the upper-right half of the connectivity matrix is coded in the chromosome. A
disadvantage is that the chromosome length increases rapidly with network size and
that it offers no way to code certain regularities in the network structure. Since the
chromosome length equals the number of maximum feedforward connections given
the maximum number of neurons (i.e. the matrix size) and the number of in- and
outputs, this is given by Cmax (see last section). For example a 4-input, 3-output neural
network with a maximum total of 32 neurons requires a chromosome length of 487
with the direct encoding scheme.
The direct encoding scheme also suffers from competing conventions since a single
neural network structure may be represented by various connectivity matrices. In
contrast to the matrix grammar scheme, however, there is a one-to-one correspondence
between chromosomes and connectivity matrices: unlike the matrix grammar scheme,
it is not the case that one connectivity matrix can be represented by different
chromosomes.
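This competing-convention effect can be illustrated in a few lines: renumbering the hidden neurons of a network yields a different connectivity matrix, and hence a different chromosome, for one and the same structure. The 5-neuron example network is hypothetical:

```python
def relabel(matrix, perm):
    """Renumber the neurons of a connectivity matrix according to perm;
    the result describes exactly the same network."""
    n = len(matrix)
    out = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            out[perm[i]][perm[j]] = matrix[i][j]
    return out

# 2 inputs (0, 1), 2 hidden neurons (2, 3), 1 output (4); swapping the
# hidden pair keeps the matrix feedforward but changes the chromosome.
m = [[0, 0, 1, 1, 0],
     [0, 0, 1, 0, 0],
     [0, 0, 0, 0, 1],
     [0, 0, 0, 0, 1],
     [0, 0, 0, 0, 0]]
swap = [0, 1, 3, 2, 4]
m2 = relabel(m, swap)
assert m2 != m                  # two competing chromosomes...
assert relabel(m2, swap) == m   # ...for the same network structure
```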
10.6 Network Pruning and Reduction
Theoretical analysis (such as the Schema Theorem) suggests that for good
performance of the GA, functionally close genes should be close together on the
chromosome so that they are not easily disrupted by crossover. In the direct encoding
scheme this can be taken to mean that connections belonging to the same neuron
should be close together on the chromosome. Since the connections are coded in the
chromosome row by row this is true for the outgoing connections of a neuron. The
incoming connections of a neuron however can be located very far apart. This is
caused by the mapping of the two-dimensional network structure onto a one-
dimensional linear chromosome. Part of the information concerning the position of
connections in the network is lost. A remedy could be to use the two-dimensional
positional information in the genetic operators (crossover, mutation). In effect a
chromosome could be treated as a direct two-dimensional representation of the
connectivity matrix, and crossover could for example be implemented as swapping
parts of rows (incoming connections) and parts of columns (outgoing connections) or
even areas in the matrix (functional groups of neurons). These approaches have not
yet been attempted, but are suggested as directions for future research.
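One such untried two-dimensional operator could be sketched as a crossover that swaps a whole row of the connectivity matrix, i.e. one neuron's row of connections, between two parents. This is a hypothetical illustration of the suggested direction, not an implementation from the book:

```python
import random

def row_crossover(parent_a, parent_b, rng):
    """Two-dimensional crossover sketch: the children exchange one
    randomly chosen row of the connectivity matrix, keeping a neuron's
    row of connections together instead of cutting through it."""
    n = len(parent_a)
    r = rng.randrange(n)
    child_a = [row[:] for row in parent_a]
    child_b = [row[:] for row in parent_b]
    child_a[r], child_b[r] = child_b[r][:], child_a[r][:]
    return child_a, child_b

rng = random.Random(1)
a = [[0, 1, 1], [0, 0, 1], [0, 0, 0]]
b = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
ca, cb = row_crossover(a, b, rng)
# Crossover only redistributes rows; none are lost or invented:
assert sorted(map(tuple, ca + cb)) == sorted(map(tuple, a + b))
```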
The network reduction stage is not implemented in the GA systems, although it would
somewhat reduce the computational cost required in the learning stage of the network.
Networks can be penalised for the amount of pruning necessary (the number of links
that needed to be removed) by setting the parameter β in the fitness function to an
appropriate value. The potential need for network reduction is in a way penalised by a
greater complexity term (regulated by α) in the fitness function. Assuming that both
networks on the right have a (near) equal error on the training set, the network after
pruning is preferred to the one before pruning with a setting of α > 0.
Figure 10.9 Network pruning (first step) and reduction (second step). In the current
set-up only network pruning is performed in the evaluation phase just before training
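The pruning step (the first step in Figure 10.9) can be sketched as repeatedly clearing hidden neurons that lack either incoming or outgoing connections; the matrix layout, with inputs first and outputs last, is an illustrative assumption:

```python
def prune(matrix, n_in, n_out):
    """Zero out hidden neurons that have no incoming or no outgoing
    connections, repeating until the matrix is stable (removing one
    neuron can strand another)."""
    n = len(matrix)
    hidden = range(n_in, n - n_out)   # inputs first, outputs last
    changed = True
    while changed:
        changed = False
        for h in hidden:
            incoming = any(matrix[i][h] for i in range(n))
            outgoing = any(matrix[h][j] for j in range(n))
            alive = incoming or outgoing
            if alive and not (incoming and outgoing):
                for i in range(n):
                    matrix[i][h] = matrix[h][i] = 0
                changed = True
    return matrix

# Hidden neuron 2 has an outgoing link but no incoming one: pruned.
m = [[0, 0, 0, 1], [0, 0, 0, 1], [0, 0, 0, 1], [0, 0, 0, 0]]
prune(m, 2, 1)
assert m[2][3] == 0     # stranded hidden neuron removed
assert m[0][3] == 1     # direct input-output links untouched
```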
10.7 Experiments
In our preliminary experiments we have implemented an sGA system as described
above, where not only the structure but also the weights of the network are coded in
the chromosomes. The GA software used was SUGAL, v1.0; see section 9.1 for a
description. The changes made to the mutation operators also apply here; i.e. a gene is
mutated with probability pm.
The matrix grammar approach is compared to the direct encoding scheme described in
the last section. The same data-sets that were introduced in section 9.3.1 are used.
10.7.1 Set-up
The structural part of the chromosomes uses symbolic coding and the weights are
coded using real-valued coding. The symbolic coding uses integer-valued genes taken
from the alphabet {0, ..., k-1}, where k is the alphabet size or cardinality. The coding
of the structural part is, in general, non-homogeneous in that it consists of different
parts each having its own alphabet size. These parts correspond to the rewriting rules
for one or more rewriting cycles. In the present implementation only homogeneous
chromosomes were used because implementing non-homogeneous chromosomes in
the GA software used is quite difficult. The alphabet size was normally set to k = 16.
When for example some part of the chromosome has an alphabet size of 4, the
character set is reduced from 16 to 4 during evaluation by the following rules. If the
gene lies in the range 1, ..., 4 the corresponding character will have a value of 1; if it lies
in the range 5, ..., 8 the character will have a value of 2, etc.
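With 0-based genes, this band-reduction rule amounts to a single integer division; the function below is a 0-based variant of the rule as stated in the text:

```python
def reduce_gene(gene, k_full=16, k_part=4):
    """Map a gene from the full alphabet {0..k_full-1} onto the smaller
    alphabet {0..k_part-1} by splitting the range into equal bands
    (a 0-based variant of the reduction rule described in the text)."""
    return gene * k_part // k_full

# Genes 0..3 map to 0, 4..7 to 1, and so on up to 12..15 -> 3:
assert [reduce_gene(g) for g in (0, 3, 4, 7, 15)] == [0, 0, 1, 1, 3]
```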
10.7.2 Results
Figure 10.10 The 'minimal XOR' neural network. This is the lowest-complexity
network structure that is able to solve the XOR problem
The number of back propagation cycles was initially set to 500. The pruning term in
the fitness function was 'turned off': β = 0.0. Experiments were performed with
several settings of the complexity measure α. It was observed that even for quite small
values of α, the GA system converged to a network that had no connections at all (i.e.
C(x) = 0). Such a network can still achieve a reasonable fitness value simply because
its complexity is so low. Since these networks have nothing to offer in the GA search,
the fitness function was changed. If a network has no connections at all its (raw)
fitness, f(x), is simply set to an extremely high value resulting in a normalised fitness
of effectively zero. The same could be done for networks that fall below a certain
level of complexity. In the case of the XOR problem all networks with a complexity
below seven could be eliminated this way. Since, in general, the minimal complexity
required to solve a problem is not known in advance this approach could not be
universally applied. However, in many cases it might be reasonable to assume that the
networks generated should have outgoing links from all inputs and incoming links to
all output neurons. The minimum level of complexity can then be set equal to the
number of inputs plus the number of outputs times two (incoming connections plus
connections to the bias unit). This approach was applied in our testing where the
minimal complexity was set to:

Cmin = Nin + 2 Nout
Of course this still does not guarantee that all the inputs and outputs are used, but it
avoids the generation of useless, very-low-complexity neural networks.
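The minimal-complexity floor and the rejection of degenerate networks can be sketched together like this; the sentinel value used for rejected networks is an illustrative choice:

```python
HUGE = 1e12   # raw fitness that normalises to effectively zero

def c_min(n_in, n_out):
    """Minimum complexity: one outgoing link per input plus, per output
    neuron, one incoming link and one bias-unit link."""
    return n_in + 2 * n_out

def guarded_fitness(raw_fitness, complexity, n_in, n_out):
    """Reject networks below the minimum complexity so they cannot
    win on low complexity alone."""
    return HUGE if complexity < c_min(n_in, n_out) else raw_fitness

assert c_min(2, 1) == 4                        # the XOR case from the text
assert guarded_fitness(0.5, 0, 2, 1) == HUGE   # connectionless net rejected
assert guarded_fitness(0.5, 9, 2, 1) == 0.5    # normal net unaffected
```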
It was observed that the performance of the system on the XOR problem depends very
much on the number of back propagation cycles. If this number is too low the neural
networks are not properly tested on their task. With a setting of a = 1.0 and for
example a number of BP cycles of 100, the GA always converges to a neural network
with only five connections (= Cmin + 1). This network uses one hidden neuron which
has one connection to an input and one to the bias unit. It has a poor performance on
the training set: E(x) ~ 1 and two of the four training patterns are misclassified. With
the number of BP cycles set to 500 however, the GA finds the minimum XOR network
as pictured in Figure 10.10 on average in as little as 3 generations. The specific GA
settings are given in Table 10.1.
Table 10.1 GA matrix grammar settings for NN optimisation concerning the XOR
problem with matrix size 8x8
Parameter                 Setting
α                         1.0
β                         0.0
BP cycles                 500
coding                    symbolic
crossover type            two point
elitism                   on
fitness normalisation     reverse linear ranking with bias = 10.0
l                         20
mutation type             normal
N                         50
Pc                        0.8
Pm                        0.005
re-evaluation             off
replacement mechanism     unconditional
selection mechanism       roulette
No further experiments were performed on the XOR problem because it is very hard
to make any comparative statements from simulations on such a relatively simple
problem.
{'a', ..., 'p'} and k4 = 2: {'0', '1'}. Since the (fixed) rewriting rules for the last cycle are
not coded in the chromosome, the chromosome consists of 4 + 4*4 + 16*4 = 84
symbols.
The matrix grammar and the direct encoding approach are compared on the neural
network optimisation problem concerning the iris data. Both systems used a maximum
number of neurons of 16 and were run for 200 generations. The GA settings are
shown in Table 10.2. The settings for both systems are identical except of course for
the chromosomal representation used.
Table 10.2 GA settings for NN optimisation using Iris data with matrix size 16x16
Figure 10.11 shows the fittest individuals at the end of a typical run for both systems.
Both networks shown have hidden neurons that have one incoming link only (neuron
11 in the top network and neuron 12 in the bottom); they can be removed and the links
can be reorganised using network reduction (not shown) resulting in networks with a
complexity of 18 and 19 respectively.
Figure 10.11 The best individuals of a run for the Iris data, matrix grammar vs direct
with the corresponding fitness, f(x), error on the iris training set, E(x), and complexity,
C(x). The values of the weights are not shown. Both networks misclassified one
training pattern. Every neuron is connected to the bias unit (not shown)
Both systems showed very similar behaviour on this problem. Convergence curves of
the best individual in the population vs generation were nearly identical. The best
individuals as shown above are just examples of one particular run of the GAs. The
particular run resulting in the top neural network is seen in Figure 10.12. To get an
idea of the computational requirements needed, the run took about 20 minutes on a
Sun-Sparc 4 workstation.
Figure 10.12 The GA run with matrix grammar system resulting in the top network
of Figure 10.11
In Figure 10.13 the same GA run is shown but this time the complexity, C(x), and the
error on the training set, E{x), of the fittest individual are shown as well as the mean
complexity and error in the population. The complexity and error of the fittest
individual do, of course, not have to respond to the lowest complexity and error values
found in the population.
It is interesting to note that the level of complexity of the fittest individual no longer
changes after about 70 generations. The search seems to be stuck at a certain network
structure and the only change in fitness is due to the further learning of the weights.
With the GA settings chosen, every individual of the population is re-evaluated at the
start of the generation. If the fittest individual remains the fittest over several
generations, due to elitism its structure will not be changed and at the start of every
generation its weights will be further refined in training resulting (in general) in a
further decrease in error and therefore a better fitness. Without re-evaluation however
very similar results were obtained. After the initial phase the only change in the fitness
of the fittest individual was caused by a decrease in error. Apparently a number of
individuals in the population share the same network structure and, when evaluated,
one of them will replace the fittest. The search still seems to be mainly centred around
a single network structure.
Figure 10.13 The same GA run showing the complexity C(x) and error E(x) of the
fittest individual as well as the mean complexity and error of the population
The neural networks found on this problem using the particular GA settings had a low
level of complexity and generally used quite a few direct connections from input to
output neurons. The networks had on average a complexity of 20. When tested on the
iris test data the neural networks performed well. For example the top network in
Figure 10.11 produced an error of 5.46 on the test data with 4 patterns misclassified
(tolerance = 0.4). So 95% of the test set was correctly classified. Further training of
the network using back propagation did not improve on this (nor on the performance
on the training set itself). This performance is similar to a fully connected '4-4-3'
neural network (complexity = 35) that has been trained for 1000 cycles and produced
a 100% correct classification of the training set with an error of 0.003. This trained
network produces an error of 5.89 on the test data with 3 patterns misclassified.
So despite the relatively low level of complexity and the relatively poor performance
on the training set, the generalisation capabilities of the neural networks found with
the specific settings of the GA system are good. It is interesting to note that both the
networks of Figure 10.11 could not correctly classify all the patterns in the training
set. No matter how long they were trained using back propagation, one pattern
remained misclassified. It seems that the structure of the networks simply does not
allow for a 100% correct classification of the training data. For example in the case of
the top network, it could very well be that in order for the pattern in question to be
correctly classified connection(s) from the first input neuron are needed.
Several runs were performed for α = 5.0, 20.0 and 50.0. It was found that the setting α
= 5.0 produced results similar to those observed with α = 10.0 but the networks had a
somewhat higher level of complexity: on average C(x) = 25. With α = 20.0 the
average complexity was around 15. Most networks (80%) had error levels of around
3.0 but some runs produced networks with much higher errors (such as E(x) = 20) that
misclassified up to 20 training facts. Most of these networks had a very low level of
complexity. This effect was even stronger for the case α = 50.0. The average
complexity of the best individuals was 10 with some networks having as few as 4
connections. Some runs produced networks with a complexity of 11 using only one
hidden neuron that had a very similar performance on the training set as those pictured
in Figure 10.11 (E(x) ~ 3, 1 pattern misclassified).
Figure 10.14 An example of a GA run with the same settings as in Figure 10.13 but
with α = 0.0
Using a 32x32 matrix, the resulting networks had a much higher level of complexity.
On average after a run of 200 generations with the same settings as before the matrix
grammar GA system generated individuals that had a complexity of about 100 as
opposed to around 20 with the 16x16 matrix. After 500 generations this complexity
had dropped to a value of around 50. Figure 10.15 shows two GA runs with this
configuration, one with the matrix grammar scheme and one with the direct encoding
scheme.
Figure 10.15 Examples of two GA runs on the iris flower data with a matrix size of
32x32. One run is done with the matrix grammar scheme, the other one with the direct
encoding scheme. Apart from the chromosome lengths the same settings as in Table
10.2 were used
The direct encoding scheme on average generated networks with a somewhat higher
level of complexity: C(x) ~ 70 after 500 generations. The resulting connectivity
matrices needed large amounts of pruning when translated into neural network
structures for both systems. On average something like 100 entries in the matrices
needed to be removed.
Experiments were then performed with the pruning term in the fitness function 'turned
on'. A value of β = 0.01 was used. Both systems gave very similar results to the
'non-pruning' results, but the amount of final pruning was somewhat reduced to around 70.
Both systems have difficulty in minimising both the network complexity and the
amount of pruning that is needed for a matrix of this size. The matrix grammar scheme
is, however, able to generate less complex networks than the direct encoding scheme.
The networks found after 200 generations had a very poor performance on the training
set. The average cumulative error was around 30, with about 8 training facts
misclassified. The level of complexity of these networks was very low: on average
about 10 connections were used and without exception the networks did not use any
hidden neurons. The networks consisted purely of straight connections between inputs
and output neurons. A typical GA run for this configuration is shown in Figure 10.16.
The results can be explained by the lack of training that a network receives in the
evaluation phase. The number of back propagation training cycles per evaluation is
only one. This number is in fact misleading since every individual is re-evaluated at
the start of a generation and, with weight transmission, this re-evaluation has the same
effect as training for two back propagation cycles. Without weight transmission it
simply re-trains the network and re-evaluation does not really serve any purpose. To
obtain an accurate estimate of the network's performance on the training set,
evaluation should consist of several back propagation runs, with each one starting with
different random weights. The average error can then be used as a more accurate
estimate of the performance. Clearly, without weight transmission, one back
propagation cycle is not enough to properly test the network structures on their given
task. The GA system does well in minimising the neural network structure but does so
by severely limiting the network's performance on the training set. The balance
between complexity and performance on the training data lies strongly in favour of
reduced complexity with this configuration.
164 Chapter 10. Using a GA with Grammar Encoding to Generate Neural Networks
Figure 10.16 An example of a GA run with the same settings as in Figure 10.13 but
without weight transmission. When evaluated the networks are instantiated with
random weights
As a comparison some runs were performed without weight transmission but with a
much higher number of back propagation cycles per evaluation. This number was set
to 50 and typical GA runs took several hours of CPU time. Figure 10.17 shows one
such run. The networks generated had an average complexity of about 11 with one
hidden neuron. While the performance on the training set was practically identical to
the networks found with weight transmission, the resulting complexity was somewhat
lower.
While it seems that with this configuration less complex networks are found when
compared to the situation with weight transmission, at near-identical training
errors, the computational time required is far larger. The effect of weight transmission
is such that far fewer back propagation cycles are needed in order to find a 'suitable'
network structure. No exhaustive comparison was performed to determine the full
effect of chosen parameters. For example, the GA system with weight transmission
might improve if a somewhat larger number of BP cycles were used, and the GA
system without weight transmission might not need as many as 50 BP cycles for an
identical convergence.
Figure 10.17 An example of a GA run with the same settings as in Figure 10.16 but
with a much larger number of back propagation cycles: BP cycles = 50
Neither scheme was able to decrease the complexity of the networks generated during
the course of a run: the complexity of the best network of the final population is
almost identical to that of the best network in the initial population. These
complexities differ between the two schemes, with the best neural network generated
using the matrix grammar scheme having a complexity of around 730 while the direct
encoding scheme generates networks with a complexity of around 900. The same was
found in general and not just in these particular runs.
Figure 10.18 Two GA runs on the radar classification data with a matrix of size
64x64, one with the matrix grammar scheme, the other one with direct encoding
The matrix grammar scheme generates networks with a better performance on the data
set, with an error of about 40 with about 40 outputs misclassified (of the total
12*240=2880 outputs of the whole training set). In contrast the direct encoding
scheme generated networks with an average error of about 120 with over 100 outputs
misclassified. These results could be described as encouraging. The error rates
remained fairly high and complexity was still excessive when compared to manually
tuned networks. However, the matrix grammar scheme in particular generated solution
networks that performed moderately well. Further work is required to identify GA
settings that can produce better solutions to this problem and to other more complex
problems.
10.8 Discussion
Two ideas have been combined in this chapter for the optimisation of feedforward
neural network structures and their weights. A structured genetic algorithm has been
developed where both the network structure and the set of weights are coded in the
chromosomes. This set of weights is passed on to the offspring by means of weight
transmission. A matrix grammar scheme has been implemented to represent neural
network structures in a concise manner. It encodes (forced) regularities in the
connectivity matrix defining the network structure. A direct encoding scheme where
each entry in the (upper-right half of the) connectivity matrix is represented by one
gene, has also been implemented and results of the two systems are compared.
On the XOR problem the 'minimal XOR network' was generated by all systems. It
was found, however, that even on a relatively simple problem such as this, the settings
of the GA system, such as the number of back propagation cycles, have a major influence on
the performance of the system. With the iris data set, low complexity neural networks
were found that were still able to perform quite well on the training and test set.
Differences were only observed between the performances of the matrix grammar
encoding scheme and the direct encoding scheme when a larger size 32 x 32
connectivity matrix was used. The matrix grammar scheme was able to generate less
complex (i.e. 'better') networks. Results with the larger sized radar classification data
set were less positive. Neither the direct encoding nor the matrix grammar scheme
were able to decrease the complexity of the networks over the course of a GA run,
although moderately low error rates were obtained. The matrix grammar scheme also
produced networks with a lower complexity but this complexity was still very high.
The task can be learned very well with a fully connected neural network with about
half the complexity of the generated networks.
It was observed that weight transmission ensures that far fewer training cycles are
needed in the evaluation phase of the generated neural networks, compared to
randomising the network weights at each generation. Weight transmission does not,
however, seem to be fair to all network structures generated, since networks with a
structure similar to their parents' will be at an advantage. With or without weight
transmission, the number of training cycles used is a critical parameter and the optimal
number depends strongly on the problem. In general a small training set will require
more training cycles than a large training set, but the amount of training necessary will
depend on the difficulty of the problem as well.
The matrix grammar scheme that was implemented here provides a way to encode
neural network structures in a concise manner by encoding (and forcing) regularities
in the corresponding connectivity matrix. One of the main drawbacks of the matrix
grammar approach is the restriction to matrices of size 2N x 2N which in turn specifies
the maximum number of neurons in the network. The direct encoding scheme does not
suffer from this restriction but it still requires a maximum number of neurons to be set.
For large problems, the chromosome length increases drastically using direct encoding
while it remains within reasonable length using the matrix grammar scheme. It was
found that the maximum number of neurons strongly influences the complexity of the
networks generated and both the matrix grammar scheme and especially the direct
encoding scheme have difficulty in generating 'small' networks within a large
connectivity matrix.
The fitness function was composed of a term measuring the network's performance on
the training set and one reflecting the network's complexity. The idea was to generate
low complexity networks that still perform well on the problem. It was observed that
the value of α, the measure of complexity in the fitness function, had a significant
influence on the neural networks generated. If set to a small value, the complexity of
the networks is very large; if set to a large value, very small networks that perform
quite poorly on the data are generated. It is difficult to find the optimal trade-off
between the error on the training set and the complexity of the network.
Future work might include investigating the effect of letting not just the structure but
also the weights be subject to the genetic operators. One of the main issues is the chromosomal
representation of the neural network structures. It has been shown that the matrix
grammar scheme is able to generate somewhat less complex neural networks as
compared to the direct encoding scheme. There are several drawbacks to the matrix
grammar scheme however and there may well be other representation schemes such as
graph grammars that give better results.
11. Conclusions and Future Directions
Evolutionary computation has proven to be a very useful optimisation tool in many
applications. Determining its efficacy as an optimisation algorithm for feedforward
neural network architectures and/or weights was the goal of this research.
Similar to the work in [41], it has been shown that the genetic programming paradigm
can be used in a direct encoding scheme, coding both the neural network architecture
and its weights, to generate neural networks that can perform the XOR and the
one-bit adder tasks. It was found, however, that the GP system does not scale up well to
larger real-world applications. This is mainly due to the rapidly growing chromosome
sizes for larger problems and the restrictions of this approach as described in section
8.7. The main restriction is that only tree-structured network architectures can be
generated, and many problems may be very hard or even impossible to solve using a tree-
structured neural network. Genetic programming does provide certain advantages over
standard genetic algorithms: the size of the chromosomal representation is not
fixed, and it provides a way to efficiently code functional subgroups that may be
called upon more than once. A graph grammar encoding scheme has been successfully
used in GP to represent Boolean neural network architectures [24], and a similar
scheme may prove to be an efficient and concise way to code feedforward neural
networks in general.
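As a rough illustration of the tree-structured representation discussed above (the node encoding and the hand-built XOR tree below are our own sketch, not the implementation of [41]), a GP-style tree network can be evaluated recursively, with each internal node a threshold neuron and each leaf an input terminal:

```python
def step(x, theta):
    """Threshold activation: fire iff the weighted input sum reaches theta."""
    return 1 if x >= theta else 0

def evaluate(node, inputs):
    """node is either ('in', index) for an input terminal, or
    ('neuron', theta, [(weight, child), ...]) for a threshold neuron.
    This encoding is a hypothetical sketch of a GP tree network."""
    if node[0] == 'in':
        return inputs[node[1]]
    _, theta, children = node
    total = sum(w * evaluate(child, inputs) for w, child in children)
    return step(total, theta)

# Hand-built XOR tree: output fires on x0 + x1 - 2*AND(x0, x1).
and_node = ('neuron', 1.5, [(1.0, ('in', 0)), (1.0, ('in', 1))])
xor_tree = ('neuron', 0.5, [(1.0, ('in', 0)), (1.0, ('in', 1)), (-2.0, and_node)])

for a in (0, 1):
    for b in (0, 1):
        print(a, b, evaluate(xor_tree, [a, b]))
```

Note that input terminals may appear in several subtrees, which is the only form of "sharing" a strict tree allows; arbitrary hidden-unit sharing is not expressible, which is the restriction described above.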
A GA-based matrix grammar encoding scheme was implemented and combined with
the idea of structured genetic algorithms, where both the network architecture and
the set of weights are coded in the chromosomes. Weight values are passed on to
offspring networks by means of weight transmission. A direct encoding scheme was
also implemented, in which feedforward neural networks are directly represented by a
connectivity matrix. Using this direct encoding scheme, larger networks require
excessively large chromosomes, which generally degrades the GA's performance as a
network optimiser. The matrix grammar encoding scheme encodes regularities in the
connectivity matrix, resulting in much shorter chromosomes.
Drawbacks of both representation schemes include the need to specify a maximum
number of neurons in advance, which in turn fixes the size of the connectivity matrix.
The matrix grammar scheme imposes an even more severe restriction on the matrix
size, since only matrices of size 2^N × 2^N (N = 1, 2, 3, ...) are allowed. The matrix
grammar scheme is also not very 'clean', in that it may code a lot of information that is
never used, and there are numerous ways to represent one and the same network
structure, resulting in a competing conventions problem similar to the one in neural
network weight optimisation. Still, good results were obtained on a neural network
optimisation problem of medium size: both the matrix grammar and the direct encoding
scheme were able to generate low-complexity neural networks that performed well on
training and on test data. With increasing network size the matrix grammar scheme gave
somewhat better results. Both schemes were unable to generate low-complexity
networks on a large real-world classification problem, although investigations of this
problem were constrained by the excessive computational effort required and the time
available. Weight transmission reduces the amount of training needed to evaluate the
networks, but it does impose a restriction on the networks generated, in that neural
networks that bear a close structural resemblance to their parents will be favoured.
One of the main issues in neural network architecture optimisation is the chromosomal
representation of the neural network structure. A direct encoding scheme, such as
Genetic Programming for Neural Networks, is commonly found to scale poorly.
Grammar encoding provides an alternative: the general idea is to exploit some form of
repetition or modularity in the network structure so that a representation of manageable
length is achieved. Kitano's matrix grammar scheme and the matrix grammar scheme
implemented here code repeated patterns in the connectivity matrix. Gruau's cellular
encoding scheme codes cell divisions and connectivity mutations and uses repeated
subnetworks that perform a certain function. The latter has only been used for binary
networks, but a similar scheme may well provide an efficient way to code neural
network structures in general.
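A simplified sketch of Kitano-style matrix grammar rewriting (the symbol names and the tiny rule set are invented for illustration): each symbol expands into a 2x2 block, which is why only connectivity matrices of size 2^N × 2^N can arise:

```python
# Each non-terminal rewrites into a 2x2 block of symbols; terminals
# ('0'/'1') simply copy themselves into their block. Rule set is a
# made-up example, not one from the experiments.
RULES = {
    'S': [['A', 'B'], ['C', 'D']],
    'A': [['1', '0'], ['0', '1']],
    'B': [['0', '0'], ['0', '1']],
    'C': [['0', '0'], ['0', '0']],
    'D': [['1', '0'], ['1', '1']],
}

def expand(matrix):
    """One rewriting step: replace every symbol by its 2x2 block,
    doubling the matrix size."""
    out = []
    for row in matrix:
        top, bottom = [], []
        for sym in row:
            block = RULES.get(sym, [[sym, sym], [sym, sym]])
            top.extend(block[0])
            bottom.extend(block[1])
        out.append(top)
        out.append(bottom)
    return out

m = [['S']]
for _ in range(2):              # two steps: 1x1 -> 2x2 -> 4x4
    m = expand(m)
print(len(m), 'x', len(m[0]))   # 4 x 4
```

Because every step doubles each dimension, N rewriting steps from the start symbol can only ever produce a 2^N × 2^N matrix, which is the size restriction discussed above.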
Finally, some comments are in order on the topic of evolutionary computation itself. So
far, very little research has been performed on the generalisation capabilities of
evolutionary computation optimisation systems, that is, the performance of a solution
on data outside the 'training set', where the training set is the data set used to evaluate
the individuals on their task. Problems similar to those in neural network learning
algorithms apply: when to stop the evolutionary computation algorithm, how to choose
the training set, and how to avoid overfitting the training data.
In general, more foundational work is needed in the field of evolutionary computation.
The lack of a proper mathematical foundation results in a trial-and-error search for the
optimal parameters without any formal guidelines. Further investigation of methods for
the convergence analysis of GAs, using for example Markov chains, seems likely to
yield a significant payoff. Techniques for visualisation in evolutionary computation
may also prove very beneficial to the field, since in general the internal workings of the
algorithms remain hidden from the user. With such techniques it might even be possible
for the user to intervene in the search and adjust certain parameters on the fly.
References and Further Reading
[1] Alba, E., Aldana, J.F., and Troya, J.M., "Genetic Algorithms as Heuristics for
Optimizing ANN Design", International Conference on Artificial Neural Nets
and Genetic Algorithms (ANNGA93), Innsbruck, Austria, pp. 683-689, 1993.
[3] Angeline, P.J., Saunders, G. M. and Pollack, J.M., "An Evolutionary Algorithm
that Constructs Recurrent Neural Networks", IEEE Transactions on Neural
Networks, vol. 5, no. 1, 1994.
[4] Boers, E.J.W. and Kuiper, H., "Biological Metaphors and the Design of Modular
Artificial Neural Networks", Technical Report, Departments of Computer
Science and Experimental and Theoretical Psychology, Leiden University, The
Netherlands, 1992.
[8] Cangelosi, A., Parisi, D., and Nolfi, S., "Cell Division and Migration in a
'Genotype' for Neural Networks", Network: computation in neural systems, in
press.
[12] De Jong, K.A., Spears, W.M. and Gordon, D.F., "Using Markov Chains to
Analyze GAFOs", Foundations of GAs Workshop, (ftp.aic.navy.mil/pub/spears/
foga94), 1994.
[13] Eberhart, R.C., "The Role of Genetic Algorithms in Neural Network Query-
Based Learning and Explanation Facilities", IEEE International Workshop on
Combinations of Genetic Algorithms and Neural Networks (COGANN-92),
Baltimore, pp. 169-183, 1992.
[16] Fogel, D.B. and Fogel, L.J. (Guest editors), Special Issue on Evolutionary
Computation, IEEE Trans. on Neural Networks, Vol. 5, No. 1, January 1994.
[17] Fraser, A.P., "Genetic Programming in C++, A Manual for GPC++", Technical
Report 040, University of Salford, Cybernetics Research Institute, 1994.
[21] Goldberg, D.E. and Deb, K., "A Comparative Analysis of Selection Schemes
Used in Genetic Algorithms", in: Foundations of Genetic Algorithms, edited by
Rawlins, G.J.E., Morgan Kaufmann Publishers, pp. 69-93, 1991.
[22] Goldberg, D.E. and Segrest, P., "Finite Markov Chain Analysis of Genetic
Algorithms", Proceedings of the Second International Conference on Genetic
Algorithms (ICGA-87), pp. 1-8, 1987.
[24] Gruau, F., "Genetic Synthesis of Boolean Neural Networks with a Cell Rewriting
Developmental Process", IEEE International Workshop on Combinations of
Genetic Algorithms and Neural Networks (COGANN-92), Baltimore, pp. 55-74,
1992.
[26] Happel, B.L.M. and Murre, J.M.J., "Design and Evolution of Modular Neural
Network Architectures", Neural Networks, vol. 7, no. 6/7, pp. 985-1004, 1994.
[27] Harp, S.A. and Samad, T., "Genetic Synthesis of Neural Network Architecture",
Handbook of Genetic Algorithms, edited by Davis, L., Van Nostrand Reinhold,
pp. 202-221, 1991.
[28] Hassoun, M.H., Fundamentals of Artificial Neural Networks, MIT Press, 1995.
[31] Horn, J., "Finite Markov Chain Analysis of Genetic Algorithms with Niching",
Proceedings of the Fifth International Conference on Genetic Algorithms, San
Mateo, CA, pp. 110-117, 1993.
[33] Jain, L.C., "Hybrid Intelligent Techniques in Teaching and Research", IEEE
AES, Vol. 10, No. 3, March 1995, pp.14-18.
[34] Jain, L.C. (Guest Editor), "Intelligent Systems: Design and Applications", Part 2,
Journal of Network and Computer Applications, Academic Press, England, Vol.
19, Issue 2, April 1996.
[35] Jain, L.C. (Guest Editor), "Intelligent Systems: Design and Applications", Part 1,
Journal of Network and Computer Applications, Academic Press, England, Vol.
19, Issue 1, January 1996.
[36] Jain, L.C. (Editor), Electronic Technology Directions Towards 2000, ETD2000,
IEEE Computer Society Press, USA (Edited Conference Proceedings), Volume
1,2, May 1995.
[37] Kinnear, K.E. Jr., "Evolving a Sort: Lessons in Genetic Programming", IEEE
International Conference on Neural Networks, vol.2, pp. 881-888, 1993.
[38] Kitano, H., "Designing Neural Networks Using Genetic Algorithms with Graph
Generation System", Complex Systems, vol. 4, pp. 461-476, 1990.
[41] Koza, J.R. and Rice, J.P., "Genetic Generation of both the Weights and
Architecture for a Neural Network", IEEE International Joint Conference on
Neural Networks, 1991.
[42] Lewin, B., Genes IV, Oxford University Press and Cell Press, 1990.
[43] Lohmann, R., "Structure Evolution in Neural Systems", Dynamic, Genetic, and
Chaotic Programming, edited by B. Soucek and the IRIS Group, John Wiley &
Sons, Chapter 15, pp. 395-411, 1992.
[44] Lund, H.H. and Parisi, D., "Simulations with an Evolvable Fitness Formula",
Technical Report PCIA-1-94, C.N.R., Rome, 1994.
[46] Maniezzo, V., "Genetic Evolution of the Topology and Weight Distribution of
Neural Networks", IEEE Transactions on Neural Networks, Vol. 5, No. 1,
January 1994.
[47] McDonnell, J.R. and Waagen, D., "Evolving Neural Network Connectivity",
IEEE International Conference on Neural Networks, San Francisco, 1993.
[50] Montana, D.J. and Davis, L., "Training Feedforward Neural Networks Using
Genetic Algorithms", Proceedings of the International Conference on Artificial
Intelligence, pp. 762-767, 1989.
[51] Muhlenbein, H., Schomisch, M. and Born, J., "The Parallel Genetic Algorithm
as Function Optimizer", Parallel Computing, Vol. 17, pp. 619-632, 1991.
[53] Narasimhan, V.L. and Jain, L.C. (Editors), The Proceedings of the Australian
and New Zealand Conference on Intelligent Information Systems, IEEE Press,
1996.
[54] Nix, A.E. and Vose, M.D., "Modelling Genetic Algorithms with Markov
Chains", Annals of Mathematics and Artificial Intelligence, vol. 5, pp. 79-88, 1992.
[55] Nolfi, S. and Parisi, D., "Growing Neural Networks", Proceedings of Artificial
Life III, Santa Fe, New Mexico, 1992.
[57] Schiffmann, W., Joost, M. and Werner, R., "Application of Genetic Algorithms
to the Construction of Topologies for Multilayer Perceptrons", International
Conference on Artificial Neural Nets and Genetic Algorithms (ANNGA93),
Innsbruck, Austria, pp. 676-682, 1993.
[58] Singer, M. and Berg, P., Genes & Genomes, A Changing Perspective, University
Science Books, Blackwell Scientific Publications, 1991.
[59] Soucek, B. and the IRIS Group, Dynamic, Genetic and Chaotic Programming,
John Wiley & Sons Inc., 1992.
[60] Van Rooij, A.J.F., Jain, L.C. and Johnson, R.P., "Neural Network Training
Using Genetic Algorithms", Guidance, Control and Fuzing Technology
International Meeting, 2nd TTCP, WTP-7, DSTO, Salisbury, Australia, 10-12
April, 1996.
[61] Vonk, E., Jain, L.C. and Johnson, R., "Using Genetic Algorithms with Grammar
Encoding to Generate Neural Networks", IEEE International Conference on
Neural Networks, Perth, December, 1995.
[62] Vonk, E., Jain, L.C., Veelenturf, L.P.J. and Hibbs, R., "Integrating Evolutionary
Computation with Neural Networks", Electronic Technology Directions to the
Year 2000, IEEE Computer Society Press, pp. 135-141, 1995.
[63] Vonk, E., Jain, L.C., Veelenturf, L.P.J. and Johnson, R., "Automatic Generation
of a Neural Network Architecture Using Evolutionary Computation", Electronic
Technology Directions to the Year 2000, IEEE Computer Society Press, pp. 142-
147, 1995.
[64] Whitley, D., Starkweather, T. and Bogart, C., "Genetic Algorithms and Neural
Networks: Optimizing Connections and Connectivity", Parallel Computing, vol.
14, pp. 347-361, 1990.
[66] Zhang, B. and Muhlenbein, H., "Evolving Optimal Neural Networks Using
Genetic Algorithms with Occam's Razor", Complex Systems, vol. 7, no. 3, 1993.
Index

Activation Functions 6
Artificial Neural Network 3
Artificial Neuron 4
Automatically Defined Functions 105
Back Propagation 122
Binary Coding 83
Biological Background 43
Building Block Hypothesis 67
Chromosome Mutations 46
Coding 82
Creation Rules 104
Crossover Rules 64, 88, 104
Direct Encoding 96, 149
Dual Representation 29
Elitism 34
Evolutionary Algorithms 40
Evolutionary Computation 17, 54, 91, 93
Extensions Of Genetic Algorithm 34
Fitness Function 81, 106
Foundations Of Genetic Algorithms 00
GA Software 114
Gene Mutations 48
Generation 24
Genetic Algorithms 17
Genetic Operators 148
Genetic Programming 35, 101
Genetic Structures 43
Genetically Programmed Neural Network 102
Grammar Encoding 98
Hybridisation Of Evolutionary Computation 91
Implementing GA's 79
Intertwined Spirals 111
Inversion 89
Kitano's Matrix Grammar 135
Learning Rules 9, 11
Markov Chain Analysis 75
One-Bit Adder 110
Operation Of Genetic Algorithms 60
Optimisation Problem 19
Optimisation of Weights 114
Parallel Genetic Algorithms 33
Parametrised Encoding 98
Price's Theorem 74
Proportionate Reproduction 86
Real-Valued Coding 84
Tournament Selection 87
Types Of Neural Networks 9, 13
Walsh-Schema Transform 69
Weight Representation 135
Weight Transmission 132
XOR 108
Advances in Fuzzy Systems — Applications and Theory Vol. 14
ISBN 981-02-3106-7