Delft, Netherlands
September 2003
Table of Contents
Preface
1 Introduction
Preface
This report is the final document on the thesis that I have done within the framework of the Master of
Science program at the faculty of Civil Engineering and Geosciences at Delft University of Technology.
This thesis was executed in cooperation with the Civil Engineering Informatics group and the
Hydrology and Ecology section of the department of Water Management at the subfaculty of Civil
Engineering.
The reason for this cooperation was that the thesis subject is a combination of a technique from the
field of informatics (Artificial Neural Networks) and a concept from the field of hydrology (Rainfall-
Runoff modelling). Artificial Neural Network models were examined, developed and tested as Rainfall-
Runoff models in order to test their ability to model the transformation from rainfall to runoff in a
hydrological catchment.
I would like to thank the following people who aided me during my investigation. From the Civil
Engineering Informatics group: prof. dr. ir. Peter van der Veer for his suggestion of the thesis subject
and dr. ir. Josef Cser for his inspired support. And from the section of hydrology: ing. Tom Rientjes for
his skilled and enthusiastic guidance and suggestions, and Fabrizio Fenicia, M. Sc. for providing me
with the data from the Alzette-Pfaffenthal catchment.
N.J. de Vos
Dordrecht, September 2003
Summary
Hydrologic engineering design and management require information about runoff from a
hydrologic catchment. In order to predict runoff, the transformation of rainfall on a
catchment to runoff from it must be modelled. One approach to this modelling issue is to use
empirical Rainfall-Runoff (R-R) models. Empirical models simulate catchment behaviour by
parameterisation of the relations that the model extracts from sample input and output data.
Artificial Neural Networks (ANNs) are models that use dense interconnection of simple computational
elements, known as neurons, in combination with so-called training algorithms to make their structure
(and therefore their response) adapt to information that is presented to them. ANNs have analogies
with biological neural networks, such as nervous systems.
ANNs are among the most sophisticated empirical models available and have proven to be
especially good at modelling complex systems. Their ability to extract relations between inputs and
outputs of a process, without the physics being explicitly provided to them, theoretically suits the
problem of relating rainfall to runoff well, since it is a highly nonlinear and complex problem.
The goal of this investigation was to prove that ANN models are capable of accurately modelling the
relationships between rainfall and runoff in a catchment. It is for this reason that ANN techniques
were tested as R-R models on a data set from the Alzette-Pfaffenthal catchment in Luxemburg.
An existing software tool in the Matlab environment was selected for design and testing of ANNs on
the data set. A special algorithm (the Cascade-Correlation algorithm) was programmed and
incorporated in this tool. This algorithm was expected to ease the trial-and-error efforts for finding an
optimal network structure.
The ANN type that was used in this investigation is the so-called static multilayer feedforward
network. ANNs were used either as pure cause-and-effect models (i.e. previous rainfall, groundwater
and evapotranspiration data input and future runoff output) or as a combination of this approach and
a time series model approach (i.e. also including previous runoff data as input).
The main conclusion that can be drawn from this investigation is that ANNs are indeed capable of
modelling R-R relationships. The ANNs that were developed were able to approximate the discharge
time series of a test data set with satisfactory accuracy. The information contents of the variables
included in the data set complemented each other without significant overlap. Rainfall
information could be related by the ANN to rapid runoff processes, groundwater information was
related to delayed flow processes and evapotranspiration was used to discern the summer and winter
seasons.
Two minor drawbacks were identified: inaccuracies as a result of the fact that the time resolution of
the data is lower than the time scale of the dominant runoff processes in the catchment, and a time
lag in the ANN model predictions due to the static ANN approach.
The Cascade-Correlation (CasCor) algorithm did not perform as well as hoped. The framework of this
algorithm, however, can be used to embed a more sophisticated training algorithm, since the
embedded training algorithm is the main drawback of the current implementation.
1 Introduction
Artificial Neural Networks (ANNs) are networks of simple computational elements that are able to
adapt to an information environment. This adaptation is realised by adjustment of the internal
network connections through applying a certain algorithm. Thus, ANNs are able to uncover and
approximate relationships that are contained in the data that is presented to the network.
ANN applications are becoming more and more popular since the resurgence of these techniques in
the last part of the 1980s. Since the early 1990s, ANNs have been successfully used in hydrology-
related areas, one of which is Rainfall-Runoff (R-R) modelling [after Govindaraju, 2000]. The
application of ANNs as an alternative modelling tool in this field, however, is still in its nascent stages.
The reason for modelling the relation between precipitation on a catchment and the runoff from it is
that runoff information is needed for hydrologic engineering design and management purposes
[Govindaraju, 2000]. However, as Tokar and Johnson [1999] state, the relationship between rainfall
and runoff is one of the most complex hydrologic phenomena to comprehend. This is due to the
tremendous spatial and temporal variability of watershed characteristics and precipitation patterns,
and the number of variables involved in the modelling of the physical processes.
The highly non-linear and complex nature of R-R relations is a reason for empiricism being an
important approach to R-R modelling. Empirical R-R models simulate catchment behaviour by
transforming input to output based on certain parameter values, which are determined by a
calibration process. A calibration algorithm is often used to determine the optimal parameter values
that, based on input data samples, produce an output that resembles a target data sample as closely
as possible.
Another R-R modelling approach, which opposes empirical modelling, is physically based modelling.
This approach is based on the idea of recreating the fundamental laws and characteristics of the real
world as closely as possible. Physically based modelling requires large amounts of data, since spatially
distributed data is used, and is characterised by long calculation times.
Certain ANN types can be used as typical examples of empirical modelling. Such ANNs can be seen as
so-called black boxes, into which a rainfall time series is fed and from which a discharge time series is
produced. The network is able to intelligently change its internal parameters so that the target
output signal is approximated. This way the relationships between the input and output variables are
parameterised in the model structure and the ANN can make an output prediction based on new
input.
ANNs have proven to be especially good at modelling complex and non-linear systems. Other
important merits of these techniques are the short development time of ANN models, their flexibility
and the fact that no great expertise in a certain field is needed in order to be able to apply ANN
techniques in this field.
The main objective of this investigation is to prove that ANNs can be successfully used as R-R models.
It is for this reason that various ANNs are developed and tested on a data set from the Alzette-
Pfaffenthal catchment (Luxemburg). In order to be able to develop such ANN models, a firm
understanding of ANN fundamentals and information about past applications of ANNs in R-R modelling
was needed. It was for this reason that literature studies on both subjects have been performed. The
ANN model development was done in a Matlab environment, for which an ANN design tool was
modified to fit the demands of this investigation.
The time limit of this thesis imposes several limitations on the scope of this investigation. This
investigation only focuses on one ANN type: the so-called static multilayer feedforward network type.
Another obvious limitation is that only one catchment data set is examined.
Chapter 2 results from a literature survey on the topic of ANNs. ANNs are introduced by presenting
their basic theoretical framework, discussing some specific capabilities that will be used in this
investigation, and mentioning common merits and drawbacks of their application. The findings of
another literature survey, on ANNs in the hydrological field of Rainfall-Runoff (R-R) modelling, are
presented in Chapter 3. This chapter starts with a short introduction on the mechanisms that
transform precipitation into discharge from a catchment and the most common way of modelling this
transformation. The position of ANNs in this modelling field is explained, after which several data and
design aspects for ANN R-R modelling are examined.
What is presented in Chapter 4 relates to the ANN software that was used in this investigation. A
Matlab-tool was modified, mainly in order to incorporate a special ANN algorithm (Cascade
Correlation). Chapter 4 discusses the implementation of this addition and other modifications of the
software tool.
Chapter 5 presents the application of ANN techniques on a data set from the Alzette-Pfaffenthal
catchment (Luxemburg). Various data and design aspects that arose are discussed in detail.
Furthermore, the performance of 24 ANN R-R models is presented. The chapter concludes with a
discussion of the best models that were found and highlights several aspects of their performance
using some additional tests.
The conclusions of this investigation are presented in the sixth and final chapter, as well as several
recommendations that the author would like to make.
2 Artificial Neural Networks
The conspectus offered by this chapter is by no means complete; it mainly focuses on the basic
principles of ANNs and on those techniques and types of ANNs that are capable of mapping relations.
As a result, many types of ANNs and ANN techniques are disregarded. For a more complete overview
the reader is referred to the works of Hecht-Nielsen [1990], Zurada [1992] and Haykin [1998].
From a mathematical point of view, ANNs can be called universal approximators, because they are
often able to uncover and approximate relationships in different types of data. Even though an
underlying process may be complex, an ANN can approximate it closely, provided that sufficient and
appropriate data about the process is available to which the model can adapt.
[1] Hecht-Nielsen uses the term neural network in his definition. The author, however, will use the name
Artificial Neural Network. The latter term is nowadays more broadly employed because that way a clear
distinction is made between biological and artificial neural networks.
The similarity between the nervous system and ANNs becomes clearer when comparing the
description of biological neurons above with the description of the ANN framework in 2.2.
unsuccessful attempts to develop techniques that could solve problems on a larger scale were other
reasons for the severely diminished amount of research in the field of neurocomputing.
In the mid-1980s, interest in ANNs increased significantly, thanks to J.J. Hopfield, who became
the leading force in the revitalisation of neural computing. During the following years, many of the
former limitations of ANNs were overcome. The improvements on existing ANN techniques in
combination with the increase in computational resources led to successful application of ANNs for
many problems. One of the most groundbreaking rediscoveries was that of backpropagation
techniques (which were conceived by Rosenblatt) by McClelland and Rumelhart in 1986. These
developments led to an explosive growth of the field of ANNs. The number of conferences, books,
journals and publications has expanded quickly since this new era.
ANNs are typically used for modelling complex relations in situations where insufficient knowledge of
the system under investigation is available for the use of conventional models, or if development of a
conventional model is too expensive in terms of time and money. ANNs have been applied in various
fields where this situation is encountered. Some examples of fields of work that show the broad
possibilities of ANNs are: process control (e.g. robotics, speech recognition), economy (e.g. currency
price prediction) and the military (e.g. sonar, radar and image signal processing).
In spite of this broad range of applications, it is safe to say that the field is still in a relatively early
stage of development.
Some of the relations between these components are visualised in Figure 2.2. This figure depicts a
schematisation of two artificial neurons and the transformations that take place between input and
output.
Let us assume a set of processing elements (neurons); at each point in time, each neuron u_i has
an activation value, denoted in the diagram as a_i(t); this activation value is passed through a function
f_i to produce an output value o_i(t). This output value can be seen as passing through a set of
unidirectional connections to other neurons in the system. Associated with each connection is
a real number, usually called the weight of the connection and designated w_ij, which determines the
amount of effect that the first neuron has on the second. All of the inputs must then be combined by
some operator (usually addition), after which the combined inputs to a neuron, along with its current
activation value, determine its new activation value via a function F_i. Finally, the weights of these
systems can undergo modification as a function of experience. This is the way the system can adapt
its behaviour, aiming for a better performance.
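The neuron transformations just described can be sketched in a few lines of Python (an illustrative sketch only; the function name `neuron_output`, the choice of addition as the combination operator and tanh as the example function are assumptions of the example, not prescriptions of the framework):

```python
import math

def neuron_output(inputs, weights, f=math.tanh):
    """One artificial neuron: the outputs o_j of connected neurons arrive
    over weighted connections w_ij, are combined by addition, and the
    result is passed through a function f to produce an output value."""
    net = sum(w * x for w, x in zip(weights, inputs))  # usual operator: addition
    return f(net)

# Two incoming connections with weights 0.5 and -0.25:
out = neuron_output([1.0, 2.0], [0.5, -0.25])  # net = 0.0, so tanh gives 0.0
```

Training would then consist of modifying the `weights` list as a function of experience, which is exactly the adaptation mechanism described above.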
[2] The term neuron will be used from here on when referring to artificial neurons. The use of this more concise
term is justified by the fact that within the context of Artificial Neural Networks a reference to neurons obviously
bears reference to artificial neurons.
Figure 2.2 - Schematic representation of two artificial neurons and their internal
processes [after Rumelhart, Hinton and McClelland, 1986]
Characteristics and examples of the above mentioned components of ANNs will be presented in the
following subsections in more detail. The basic structure of these sections is also based on the work of
Rumelhart, Hinton and McClelland [1986].
[3] In some works the input units are referred to as input neurons within an input layer. Since these units serve
no purpose but to pass information to the network (without the transformation of data performed by regular
neurons), the author will label them input units and will disregard the whole of these units as a network layer.
This output function is often either the identity function f(x) = x (so that the current activation value
is passed on to other neurons), or some sort of threshold function (so that a neuron has no effect on
other neurons unless its activation exceeds a certain value).
The set of current output values is represented by a vector o(t).
N.B.
The output function is related to what is often called the bias of a neuron. A situation where the
output function is equal to the identity function is referred to as a situation where no bias for the
neuron is used. A bias of 0.5 basically means that a threshold function is used for the output function,
such that the signal is only passed through the neuron if its input value exceeds 0.5.
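The two output functions mentioned in this note can be sketched as follows (a hypothetical illustration; one common variant of the threshold function is assumed, in which the activation itself is passed on once it exceeds the bias):

```python
def identity(x):
    # No bias: the current activation value is passed on unchanged.
    return x

def threshold(x, bias=0.5):
    # With a bias of 0.5, the signal is only passed through the neuron
    # if its input value exceeds 0.5; otherwise the neuron has no
    # effect on other neurons.
    return x if x > bias else 0.0

identity(0.3)    # -> 0.3
threshold(0.3)   # -> 0.0 (below the 0.5 bias, no effect)
threshold(0.8)   # -> 0.8 (exceeds the bias, passed through)
```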
It is often convenient to use a matrix W for expressing all weights in the system, as the figure
below shows.
W = \begin{pmatrix}
w_{11} & w_{12} & \cdots & w_{1n} \\
w_{21} & w_{22} & \cdots & w_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
w_{N1} & w_{N2} & \cdots & w_{Nn}
\end{pmatrix}
Figure 2.4 - Illustration of network weights and the accompanying weight matrix [after Hecht-
Nielsen, 1990].
Sometimes a more complex pattern of connectivity is required. A given neuron may receive inputs of
different kinds whose effects are separately summated. In such cases it is convenient to have
separate connectivity matrices for each kind of connection.
Connections between neurons are often classified by their direction in the network architecture:
- Feedforward connections are connections between neurons in consecutive layers. They
are directed from input to output.
- Lateral connections are connections between neurons in the same layer.
- Recurrent connections are connections to a neuron in a previous layer. They are directed
from output to input.
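The role of the weight matrix W in a feedforward connection pattern can be sketched as follows (an illustrative example with made-up weights; identity output functions are assumed, so only the weighted summation is shown):

```python
def layer_output(W, o_prev):
    """Feedforward step: the combined input of neuron i is the weighted
    sum of the previous layer's output values, using row i of the
    weight matrix W (entry w_ij weights the connection from neuron j)."""
    return [sum(w * o for w, o in zip(row, o_prev)) for row in W]

W = [[0.2, -0.5, 1.0],
     [0.7,  0.1, 0.0]]          # 2 neurons receiving from 3 neurons

layer_output(W, [1.0, 2.0, 3.0])   # combined inputs, approx. [2.2, 0.9]
```

A separate matrix of the same shape would be used for each kind of connection in the more complex connectivity patterns mentioned above.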
where σ is a parameter that defines the width of the Gaussian curve, as illustrated below.
2.2.8 Learning
Based on sample data that is presented to it during a training stage, an ANN will attempt to learn the
relations that are contained within the sample data by adjusting its internal parameters (i.e. the
weights of the connections in the network and the neuron biases). This means that the relations that
need to be approximated are parameterised in the ANN structure.
The way a network is trained is a basic property of an ANN; the values of several neuron properties
and the manner in which the neurons of an ANN are structured are closely related to the chosen
algorithm. The algorithm that is used to optimise these weights and biases is called the training
algorithm or learning algorithm.
Training algorithms can be classified broadly into those comprising supervised learning and
unsupervised learning.
- Supervised learning works by presenting the ANN with input data and the desired correct
output results. This is done by an external teacher, hence the name of this method.
The network generates an estimate, based on the given input, and then compares its
output with the desired results. This information is used to help guide the ANN to a good
solution. Some learning methods do not present the actual desired value of the output to
the network, but rather give an indication of the correctness of the estimate. [after Dhar
and Stein, 1997]
N.B.
These learning methods have a clear relation with the process of calibration, which is used
in many conventional modelling techniques. This becomes clear when comparing the
above with what Rientjes and Boekelman [2001], for example, state: a procedure of
adjusting model parameter values is necessary to match model output with measured
data for the selected period and situation entered to the model. This process of
(re)adjustment and (re)calculation is termed calibration and deals with finding the
optimal set of model parameters.
- ANNs being trained using an unsupervised learning paradigm are only presented with the
input data but not the desired results. The network clusters the training records based on
similarities that it abstracts from the input data. The network is not being supervised with
respect to what it is supposed to find and it is up to the network to discover possible
relationships from the input data and based on this make certain predictions of an output.
[after Dhar and Stein, 1997]
Supervised and unsupervised learning can be further divided into different classes, as shown in Table
2.1 and Table 2.2. Performance learning is the best-known category of supervised
learning, as competitive learning is of unsupervised learning.
Table 2.1 - Classes of supervised learning

Performance learning:
- Backpropagation
- Methods based on statistical optimisation algorithms:
  o Conjugate gradient algorithms
  o (Quasi-)Newton algorithms
  o (Reduced) Levenberg-Marquardt algorithm
- Cascade-Correlation algorithm

Coincidence learning:
- Hebbian learning

Table 2.2 - Classes of unsupervised learning

Competitive learning:
- Kohonen learning
- Adaptive Resonance Theory (ART)

Filter learning:
- Grossberg learning
Only performance learning algorithms will be discussed in the following section since these are the
only algorithms used throughout this investigation.
[4] The use of biases is not very common. Training of an ANN often only comes down to updating the network
weights. From this point on, the author will ignore biases in the discussion about the training process.
[5] The name performance function is somewhat deceptive, since it basically is a function that expresses the
value of the residual errors of the ANN. Since the function is minimized during ANN training, the term error
function is preferable.
Equation (1.9) is based on the error expression called Mean Square Error (MSE). The MSE error
measurement scheme is often used, because it has certain advantages. Firstly, it ensures that large
errors receive much greater attention than small errors, which is usually what is desired. Secondly, the
MSE takes into account the frequency of occurrence of particular inputs. The MSE is best used if
errors are near normally distributed. Other residual error measures can be more appropriate if, for
instance, evaluating errors that are not normally distributed or when examining specific aspects of a
process that require a different error measure. Examples of alternative error measures are the mean
absolute error (e.g. used if approximating the mean of a certain process is somewhat more important
than approximating the process in its complete range, i.e. including minima and maxima) and variants
of the MSE, such as the Root Mean Squared Error (RMSE). Consult 3.8.1 for the equations of these
errors.
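The error measures mentioned above can be sketched as follows (a minimal illustration with made-up data; the equations in 3.8.1 remain the authoritative definitions):

```python
import math

def mse(t, y):
    """Mean Square Error: large errors receive much greater attention."""
    return sum((ti - yi) ** 2 for ti, yi in zip(t, y)) / len(t)

def mae(t, y):
    """Mean absolute error: every residual weighs in proportionally."""
    return sum(abs(ti - yi) for ti, yi in zip(t, y)) / len(t)

def rmse(t, y):
    """Root Mean Squared Error: the MSE expressed in the data's units."""
    return math.sqrt(mse(t, y))

t = [1.0, 2.0, 3.0]   # targets
y = [1.0, 2.0, 5.0]   # outputs with one large residual of 2
mse(t, y)   # -> 4/3: the single large error dominates
mae(t, y)   # -> 2/3: the same residual counts only linearly
```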
Because y is a function of the weights in W, the error function E also becomes a function of the weights
of the network being evaluated. For each combination of weights a different residual error arises.
These errors can be visualized by plotting them in an extra dimension in addition to the dimensions of
the weight space of the network. For example: assume a network with two weights, w1 and w2. The
two-dimensional weight space can be expanded with a third dimension in which the residual error E
for each combination of the weights w1 and w2 is expressed. The result can be plotted as a three-
dimensional surface (as is done in Figure 2.12). The points on this error surface are specified by three
coordinates: the value of w1, the value of w2 and the value of the error E for this combination of w1
and w2.
The goal for learning algorithms is to find the lowest point on this surface, meaning the weight
vector where the residual error is minimal. We can visualize the effect of a good algorithm as a ball
rolling towards a minimum on the surface (see Figure 2.12).
Note that the shape of the error surface depends on the error function used.
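The error surface can be illustrated with a toy example (a hypothetical one-neuron linear model and a single training sample, chosen only to make the surface easy to evaluate):

```python
def E(w1, w2, x1=1.0, x2=2.0, t=3.0):
    """Residual (squared) error for one combination of the weights w1 and
    w2, for a toy linear neuron y = w1*x1 + w2*x2 and one sample."""
    y = w1 * x1 + w2 * x2
    return (t - y) ** 2

# Sample the error surface above the two-dimensional weight space:
surface = [[E(w1 / 2, w2 / 2) for w1 in range(-4, 5)] for w2 in range(-4, 5)]

E(1.0, 1.0)   # -> 0.0: this weight vector lies on the valley floor
E(0.0, 0.0)   # -> 9.0: a high point on the surface
```

A training algorithm would start from some point on this surface and roll downhill towards a minimum, as in Figure 2.12.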
Figure 2.12 - Example of an error surface above a two-dimensional weight space. A good
training algorithm can be thought of as a ball rolling towards a minimum. [after Dhar
and Stein, 1997]
The starting point, from which a training algorithm tries to find a minimum, is determined by the initial
values of the weights in the network at the start of the training. These weights are often set at small
random values (see 3.7.1).
Performance learning algorithms can update the ANN weights right after processing each training
sample. Another possibility is updating the network weights only after processing the entire training
data set and making the accompanying calculations. This update is commonly formed as an average
of the corrections for each individual training sample. This method is called batch training or batch
updating. Past applications have proven this method to be more suitable if a more sophisticated
algorithm is used.
If batch learning is used, the error function that has to be minimized has the form

E = \sum_{q=1}^{p} \sum_{h=1}^{n} ( t_{qh} - y_{qh} )^2    (1.10)

where n is the number of output neurons and p the number of training patterns.
Batch updating introduces a filtering effect to the training of an ANN, which in some cases can be
beneficial. This approach, however, requires more memory and adds extra computational complexity.
In general, the performance of a batch-updating algorithm is very case-dependent. A good
compromise between step-by-step updating and batch updating is to accumulate the changes over
several, but not all, training pairs before the weights are updated.
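The difference between per-sample updating and batch updating can be sketched on a one-weight linear neuron (an illustrative toy; the helper names, data and learning rate are assumptions of the example):

```python
samples = [(1.0, 2.0), (2.0, 4.0)]   # (input, target) pairs generated with w = 2

def per_sample_updates(samples, w, lr=0.1):
    """Update the weight right after processing each training sample."""
    for x, t in samples:
        grad = -2 * (t - w * x) * x    # dE/dw for E = (t - w*x)**2
        w -= lr * grad
    return w

def batch_update(samples, w, lr=0.1):
    """One update formed as the average of the corrections for each
    individual training sample (batch updating)."""
    grads = [-2 * (t - w * x) * x for x, t in samples]
    return w - lr * sum(grads) / len(grads)

per_sample_updates(samples, 0.0)   # two small steps, approx. 1.68
batch_update(samples, 0.0)         # one averaged (filtered) step, 1.0
```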
N.B.
All learning algorithms attempt to find the optimal set of internal network parameters, i.e. the global
minimum of the error function. However, there may be more than one global minimum of this function,
so that more than one parameter set exists that approximates the training data optimally. Besides global
minima, error functions often feature multiple local minima. It is important for an ANN researcher to
realize that it is very difficult to tell with certainty whether a trained network has reached a local
minimum or a global minimum.
The following sections provide more details about various performance learning algorithms. The step-
by-step descriptions of these algorithms can be found in Appendix B.
Standard backpropagation
The best-known algorithm for training ANNs is the backpropagation algorithm. It essentially searches
for minima on the error surface by applying a steepest-descent gradient technique. The algorithm is
linearly convergent. The backpropagation architecture described here and in the accompanying
appendices is the basic, classical version, but many variants of this basic form exist.
Basically, each input pattern of the training data set is passed through a feedforward network from
the input units to the output layer. The network output is compared with the desired target output,
and an error is computed based on an error function. This error is propagated backward through the
network to each neuron, and correspondingly the connection weights are adjusted.
Backpropagation is a first-order method based on the steepest gradient descent, with the direction
vector being set equal to the negative of the gradient vector. Consequently, the solution often follows
a zigzag path while trying to reach a minimum error position, which may slow down the training
process. It is also possible for the training process to be trapped in a local minimum. [after
Govindaraju, 2000]
See Appendix A for the derivation of the backpropagation algorithm and Appendix B for a step-by-step
description of the backpropagation algorithm.
N.B.
One parameter used with (backpropagation) learning deserves special attention: the so-called learning
rate. The learning rate can be altered to increase the chance of avoiding the training process being
trapped in local minima instead of global minima. Many learning paradigms make use of a learning
rate factor. If the learning rate is set too high, the learning rule can jump over an optimal solution, but
too small a learning rate can result in a learning procedure that evolves too gradually. The learning
rate is an interesting parameter for ANN training. Some learning methods use a variable learning rate
in order to improve their performance.
Appendix B provides more mathematical detail about the learning rate. The parameter can be found
in several other weight updating formulas besides the backpropagation algorithm.
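The effect of the learning rate can be illustrated on the simplest possible error function, E(w) = w^2 (a hypothetical example, not taken from the thesis software):

```python
def gradient_descent(w, lr, steps):
    """Steepest descent on E(w) = w**2, whose gradient is 2*w."""
    for _ in range(steps):
        w -= lr * 2 * w   # weight update: a step against the gradient
    return w

gradient_descent(1.0, 0.1, 50)    # well chosen: converges towards w = 0
gradient_descent(1.0, 1.5, 5)     # too high: jumps over the minimum, diverges
gradient_descent(1.0, 0.001, 50)  # too small: evolves too gradually
```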
Conjugate gradient algorithms
The conjugate gradient method, unlike standard backpropagation, does not proceed along the
direction of the error gradient, but in a direction orthogonal to the one in the previous step. This
prevents future steps from influencing the minimization achieved during the current step. It can be
proven that minimization by the conjugate gradient algorithm is quadratically convergent.
(Quasi-)Newton algorithms
According to Newton's method, the set of optimal weights that minimizes the error function can be
found by applying:

w(k+1) = w(k) - H_k^{-1} g_k    (1.11)
where H_k is the Hessian matrix (second derivatives) of the performance index at the current values
of the weights and biases, and g_k is the corresponding gradient vector:

H_k = \nabla^2 E(w) \big|_{w = w(k)} =
\begin{pmatrix}
\frac{\partial^2 E(w)}{\partial w_1^2} & \frac{\partial^2 E(w)}{\partial w_1 \partial w_2} & \cdots & \frac{\partial^2 E(w)}{\partial w_1 \partial w_N} \\
\frac{\partial^2 E(w)}{\partial w_2 \partial w_1} & \frac{\partial^2 E(w)}{\partial w_2^2} & \cdots & \frac{\partial^2 E(w)}{\partial w_2 \partial w_N} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 E(w)}{\partial w_N \partial w_1} & \frac{\partial^2 E(w)}{\partial w_N \partial w_2} & \cdots & \frac{\partial^2 E(w)}{\partial w_N^2}
\end{pmatrix}    (1.12)
Newton's method can (theoretically) converge faster than conjugate gradient methods. Unfortunately,
the complex nature of the Hessian matrix can make it resource-intensive to compute.
Quasi-Newton methods offer a solution to this problem with lower computational requirements: they
update an approximate Hessian matrix at each iteration of the algorithm, thereby speeding up
computations during the learning process. [after Govindaraju, 2000]
Levenberg-Marquardt algorithm
Like other quasi-Newton methods, the Levenberg-Marquardt algorithm was designed to approach
second-order training speed without having to compute the Hessian matrix. If the performance
function has the form of a sum of squares, then the Hessian matrix can be approximated as

H = J^T J    (1.14)

and the gradient can be computed as

g = J^T e    (1.15)

where e is a vector of network errors and J is the Jacobian matrix:

J = \begin{pmatrix}
\frac{\partial e_1}{\partial w_1} & \frac{\partial e_1}{\partial w_2} & \cdots & \frac{\partial e_1}{\partial w_N} \\
\frac{\partial e_2}{\partial w_1} & \frac{\partial e_2}{\partial w_2} & \cdots & \frac{\partial e_2}{\partial w_N} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial e_P}{\partial w_1} & \frac{\partial e_P}{\partial w_2} & \cdots & \frac{\partial e_P}{\partial w_N}
\end{pmatrix}    (1.16)
The Jacobian matrix contains first derivatives of the network errors with respect to the weights and
biases. The Jacobian matrix is less complex to solve than the Hessian matrix.
One problem with this method is that it requires the inversion of the matrix H = J^T J, which may be ill-
conditioned or even singular. This problem can be easily resolved by the following modification:

H = J^T J + \mu I    (1.17)

where \mu is a scalar that is adjusted during training. This method represents a transition between the
steepest descent method and Newton's method. It makes an attempt at combining the strong points of
both methods (fast initial convergence and fast/accurate convergence near an error minimum,
respectively) into one algorithm.
Quickprop algorithm
The Quickprop algorithm, developed by Fahlman [1988], is a well-known modification of
backpropagation. It is a second-order method based on Newtons method. The weight update
procedure depends on two approximations: first, that small changes in one weight have relatively little
effect on the error gradient observed at other weights; second, that the error function with respect to
each weight is locally quadratic. Quickprop tries to jump to the minimum point of the quadratic
function (parabola). This new point will probably not be the precise minimum, but as a single step in
an iterative process the algorithm seems to work very well, according to Fahlman and Lebiere [1991].
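Quickprop's parabolic jump can be sketched on an error function that is exactly quadratic, where the step lands on the minimum in a single jump (a hypothetical illustration of the update rule; the function name is an assumption):

```python
def quickprop_step(w, w_prev, grad):
    """Quickprop update for one weight: treat E as locally quadratic in w,
    fit a parabola through the last two gradient observations, and jump
    to the minimum point of that parabola."""
    g_now = grad(w)
    g_prev = grad(w_prev)
    dw_prev = w - w_prev
    return w + g_now / (g_prev - g_now) * dw_prev

# E(w) = (w - 3)**2 is exactly quadratic, so one jump reaches the minimum:
grad = lambda w: 2 * (w - 3)       # dE/dw
quickprop_step(1.0, 0.0, grad)     # -> 3.0
```

On a real error surface the quadratic assumption only holds locally, so the new point is usually not the precise minimum, but repeated steps converge quickly, as Fahlman and Lebiere observed.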
Cascade-Correlation algorithm
Fahlman and Lebiere developed the Cascade-Correlation algorithm in 1990. The Cascade-Correlation
algorithm is a so-called meta-algorithm or constructive algorithm. The algorithm not only trains the
network by minimizing the network error by adjusting internal parameters (much like any other
training algorithm) but it also attempts to find an optimal network architecture by adding neurons to
the network.
A training cycle is divided into two phases. First, the output neurons are trained to minimize the
total output error. Then a new neuron (a so-called candidate neuron) is inserted and connected to
every output neuron and all neurons in the preceding layer (in effect, adding a new layer to the
network). The candidate neuron is trained to correlate with the output error. The addition of new
candidate neurons is continued until maximum correlation between the hidden neurons and error is
attained.
Instead of training the network to maximize the correlation between the output of the neurons and
the output error, one can also choose to train to minimize the output error of the ANN. This variant of
Cascade Correlation is mostly used in function approximation applications.
Chapter 2
Figure 2.13 - General structure for function mapping ANNs: an input vector x (n×1) is mapped by f to
an output vector y (m×1) [after Ham and Kostanic, 2001].
The problem addressed by ANNs with mapping capabilities is the approximate implementation of a
bounded mapping or function f : A ⊂ ℝⁿ → ℝᵐ, from a bounded subset A of n-dimensional
Euclidean space to a bounded subset f[A] of m-dimensional Euclidean space, by means of training
on examples (x1, t1), (x2, t2), ..., (xk, tk) of the mapping's action, where tk = f(xk) [after Hecht-
Nielsen, 1990]. Mapping networks can also handle the case where noise is added to the examples of
the function being approximated.
The approximation accuracy of a mapping ANN is measured by comparing its output y for a
certain input signal x with the target values t from the data set.
Hecht-Nielsen [1990] states that the manner in which mapping networks approximate functions can
be thought of as a generalization of statistical regression analysis. A simple linear regression model,
for example, is based on an estimated linear functional form, from which variations occur by different
slope and intercept parameters, which are determined using the construction data set. The assumed
function form and its variations in an ANN model are less well defined:
- Regression analysis techniques require the researcher to choose the form of a function to be
fitted to data, while ANN techniques do not.
- ANNs have many more free internal parameters (each trainable weight) than corresponding
statistical models (as a result, they are tolerant of redundancy).
What is important to realize is that in both cases the form of the function f will not be revealed
explicitly. The function form is implicitly represented in the slope and intercept parameters in the case
of linear regression analysis and in the network's internal parameters in the case of ANNs.
There are several types of ANNs that can be designated as mapping networks. The author, however,
will follow the strict definition of mapping networks presented above. This results in an exclusion, for
example, of the so-called linear associator networks (which can be seen as simplified mapping
networks) and the so-called self-organizing maps (which can be seen as unsupervised learning
variants of standard mapping networks).
The following subsections will focus only on the most commonly used function mapping ANNs:
standard feedforward networks, radial basis function networks and temporal networks6.
6 Other ANNs that exhibit mapping capabilities exist (e.g. the counterpropagation network [Hecht-Nielsen,
1990]), but have been disregarded here because they are seldom used.
Artificial Neural Networks
learning algorithm. All other ANN architecture parameters (number of neurons in each layer, activation
function, use of a neuron bias, et cetera) may vary.
Multilayer perceptrons
Feedforward networks with one or more hidden layers are often referred to in the literature as
multilayer perceptrons (MLPs). This name suggests that these networks consist of perceptrons (named
after the Perceptron neurocomputer developed in the 1950s, discussed in 2.1.3).
The classic perceptron is a neuron that is able to separate two classes based on certain attributes of
the neuron input. Combining more than one perceptron results in a network that is able to make more
complex classifications. This ability to classify is partially based on the use of a hard limiter activation
function (see 2.2.7). The activation function of neurons in feedforward networks, however, is not
limited to hard limiter functions; sigmoid or linear functions (see 2.2.7) are often used too, and
there are other differences between perceptrons and other types of neurons. The name MLP is
therefore, strictly speaking, incorrect for multilayer feedforward networks consisting of regular
neurons rather than perceptrons (which are neurons with specific properties).
To avoid misunderstandings, the author will not use the term MLP for a standard feedforward
network with one or more hidden layers (unless, of course, its neurons function like the classic
form of the perceptron).
Backpropagation networks
Feedforward networks are sometimes referred to with a name that is derived from the employed
training algorithm. The most common learning rule is the backpropagation algorithm. An ANN that
uses this learning algorithm is consequently referred to as a backpropagation network (BPN).
One must bear in mind, however, that different types of ANNs (other than feedforward networks)
can also be trained using the backpropagation algorithm. These networks should never be referred to
as backpropagation networks, for the sake of clarity. For the same reason, the author will not
use a term such as backpropagation network in this report, but will refer to such an ANN by its
proper name: a backpropagation-trained feedforward network.
Radial basis function networks
The hidden layer consists of a number of neurons and internal parameter vectors called centres,
which can be considered the weight vectors of the hidden neurons. A neuron (and thus a centre) is
added to the network for each training sample presented to the network.
The input for each neuron in this layer is equal to the Euclidean distance between an input vector
and its weight vector (centre), multiplied by the neuron bias. The transfer function of the radial basis
neurons typically has a Gaussian shape (see 2.2.7). This means that if the vector distance between
input and centre decreases, the neuron's output increases (with a maximum of 1). In contrast, radial
basis neurons with weight vectors that are quite different from the input vector have outputs near
zero. These small outputs only have a negligible effect on the linear output neurons.
If a neuron has an output of 1 the weight values between the hidden and output layer are passed
to the linear output neurons. In fact, if only one radial basis neuron had an output of 1, and all others
had outputs of 0 (or very close to 0), the output of the linear output layer would be the weights
between the active neuron and the output layer. This would, however, be an extreme case. Typically
several neurons are always firing, to varying degrees.
Summarising, an RBF network determines the likeness between an input vector and the network's
centres. It consequently produces an output based on a combination of activated neurons (i.e. centres
that show a likeness) and the weights between these hidden neurons and the output layer.
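The forward pass summarised above can be sketched as follows. This is an illustrative simplification; the centre positions, biases and output weights are arbitrary example values.

```python
import numpy as np

def rbf_forward(x, centres, biases, w_out):
    """Forward pass of a simple RBF network.

    Hidden input = Euclidean distance to each centre, scaled by the neuron bias;
    the Gaussian transfer exp(-d^2) peaks at 1 when the input matches a centre.
    The linear output layer then mixes the weights of the active hidden neurons."""
    d = np.linalg.norm(centres - x, axis=1) * biases   # distance times bias, per neuron
    a = np.exp(-d ** 2)                                # Gaussian activations in (0, 1]
    return a @ w_out                                   # linear output layer

# If the input coincides with one centre and the others lie far away,
# the output is (approximately) that neuron's row of output weights:
centres = np.array([[0.0, 0.0], [5.0, 5.0]])
biases = np.array([1.0, 1.0])
w_out = np.array([[2.0], [7.0]])
y = rbf_forward(np.array([0.0, 0.0]), centres, biases, w_out)
```

This reproduces the extreme case described in the text: one neuron fires with output 1, the other is effectively silent, and the network output equals the active neuron's output weight.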
The primary difference between the RBF network and backpropagation lies in the nature of the
nonlinearities associated with hidden neurons. The nonlinearity in backpropagation is implemented by
a fixed function such as a sigmoid. The RBF method, on the other hand, bases its nonlinearities on the
data in the training set [after Govindaraju, 2000]. The original RBF method requires that there be as
many RBF centres (neurons) as training data points, which is rarely practical, since the number of
data points is usually very large [after Chen et al., 1991]. A solution to this problem is to monitor the
total network error while presenting training data (adding neurons), and to stop this procedure when
the error no longer decreases significantly.
RBF networks are generally capable of reaching the same performance as feedforward networks
while learning faster. On the downside, more data is required to reach the same accuracy as
feedforward networks. According to Chen, Cowan and Grant [1991], RBF network performance
critically depends on the centres derived from the training data. In practice, these training
data are often chosen as a subset of the total data that suitably samples the input domain.
Temporal ANNs
Figure 2.14 - A classification of ANN models with respect to time integration [modified after Chappelier
and Grumbach, 1994]. The pages that are referred to are the pages on which these temporal ANN
examples are discussed.
With respect to the integration of the time dimension into ANN models, the first option is not to
introduce it at all but to leave time outside the ANN model (which is consequently named a static
network). Models that incorporate this method are called tapped delay line models. This method
comes down to inputting a window of the input series to a network, i.e. P(t), P(t−1), ..., P(t−m),
where P(t) represents one of the inputs at time t and m the memory length. The total number of input
neurons increases with the length of the memory used. Presenting an ANN with a tapped delay line basically
means that the temporal pattern is converted to a spatial pattern, which can then be learned by a
static network.
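The conversion of a temporal pattern into a spatial one can be sketched as follows. This is a minimal illustration; the function name is hypothetical.

```python
import numpy as np

def tapped_delay_line(series, m):
    """Convert a time series into spatial input patterns P(t), P(t-1), ..., P(t-m).

    Each row of the result is one window; a static network then sees the
    temporal pattern as m+1 ordinary input neurons."""
    return np.array([series[t - m:t + 1][::-1]    # newest value first
                     for t in range(m, len(series))])

windows = tapped_delay_line([1, 2, 3, 4, 5], m=2)
# each row holds [P(t), P(t-1), P(t-2)]
```

Note how the number of columns, and thus of input neurons, grows with the memory length m, exactly as stated above.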
This method can also be combined with one of the dynamic network types that are discussed
below. This is typically the case if predicting multiple time steps ahead, which is discussed from page
23 on.
The introduction of the time dimension in a neural model by incorporating it in the ANN architecture
(which means the ANN becomes a dynamic network) can be made at several levels. First of all, time
can be used as an index of network states. The preceding state of neurons is preserved and
reintroduced at the following step at any point in the network. Order is the only property of time used
when working with these sequences. Chappelier and Grumbach [1994] call this an implicit
representation of time in the model. This method basically means that the neurons of a layer within
an ANN can be connected to neurons of the preceding layer, the succeeding layer and the layer itself.
These types of models are referred to as context models or partially recurrent models.
Note that the weight updating for a context model is not local, in the sense that updating of a
single weight requires the manipulation of the entire weight matrix, which in turn increases the
computational effort and time.
A step further in the introduction of the time dimension in an ANN is to represent it explicitly at the
level of the network, i.e. by introducing some delays of propagation (time weights) on the connections
and/or by introducing memories at the level of the neuron itself. These models are referred to as fully
recurrent models. Algorithms to train these dynamic models are significantly more complex in terms of
time and storage requirements.
In the case of time implementation at the network level, ANNs use the combination of an array to
represent the connection strength between two neurons of consecutive layers (instead of a single
weight value), and internal delays. Elements of the array are the weights for present and previous
inputs to the neuron. Such an array is called a Finite Impulse Response (FIR).
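A single FIR connection can be sketched as follows. This is an illustrative simplification, assuming the tap weights are ordered from the present input backwards in time; the names are hypothetical.

```python
import numpy as np

def fir_connection(inputs, taps):
    """Output of a single FIR connection at the latest time step.

    `inputs` holds the present and previous input values (newest last);
    `taps` is the weight array that replaces a single scalar weight."""
    recent = inputs[-len(taps):][::-1]    # newest first, one value per tap
    return float(np.dot(taps, recent))    # weighted sum over present and past inputs

# With taps [0.5, 0.3, 0.2] the connection mixes x(t), x(t-1) and x(t-2):
y = fir_connection([1.0, 2.0, 4.0], taps=np.array([0.5, 0.3, 0.2]))
```

A FIR connection is thus a small window-of-time input attached to one connection, which is why FIR networks are closely related to the tapped delay line approach.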
The final option mentioned in the classification diagram above is time at the neuron level. This
method requires a continuous approach, which will not be discussed here.
Because of the recurrent connections in dynamic networks, variations of the regular training
algorithms must be used when training a dynamic network. Two well-known examples of dynamic
learning algorithms are the Backpropagation Through Time (BPTT) algorithm [Rumelhart et al., 1986]
and the Real-Time Recurrent Learning (RTRL) algorithm [Williams and Zipser, 1989].
Within the structure of the neuron, the past values of the input are established by way of
the time delays shown in Figure 2.15 (for p < m). The total number of weights required for
the single neuron is (p + 1) · n.
Figure 2.15 - Basic TDNN neuron with n connections from input units and p delays on
each input signal (k is the discrete-time index) [after Ham and Kostanic, 2001].
The single-neuron model can be extended to a multilayer structure. The typical structure of
the TDNN is a layered architecture with only delays at the input of the network, but it is
possible to incorporate delays between the layers.
ANNs using FIRs can be seen as closely related to static ANNs using a time window (TDNNs),
since a FIR is basically a window-of-time input to a neuron. The difference is that DTLFNNs
provide a more general model for time representation because FIRs are distributed through
the entire network.
Figure 2.17 - The SRN neural architecture (where z-1 is a unit time delay)
[after Ham and Kostanic, 2001]
The context units in Figure 2.17 replicate the hidden-layer output signals at the previous time
step, that is x′(k). The purpose of these context units is to deal with input pattern
dissonance. The feedback provided by these units basically establishes a context for the
current input x ( k ) . This can provide a mechanism within the network to discriminate
between patterns occurring at different times that are essentially identical.
The weights of the context units remain fixed. The other network weights, however, can be
adjusted using the backpropagation algorithm with momentum (see Appendix B for details).
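One time step of such a network can be sketched as follows. This is a minimal Elman-type illustration: the hidden outputs are fed back as the context for the next step, and the weight matrices are random example values, not a trained network.

```python
import numpy as np

def srn_step(x, context, W_in, W_ctx, W_out):
    """One time step of a simple recurrent (Elman-type) network.

    The context units feed back the previous hidden output x'(k), giving the
    network a memory of earlier inputs alongside the current input x(k)."""
    hidden = np.tanh(W_in @ x + W_ctx @ context)   # hidden layer with feedback
    return W_out @ hidden, hidden                  # hidden output becomes next context

# Identical inputs at different times can now produce different outputs:
rng = np.random.default_rng(0)
W_in, W_ctx, W_out = rng.normal(size=(3, 2)), rng.normal(size=(3, 3)), rng.normal(size=(1, 3))
ctx = np.zeros(3)
y1, ctx = srn_step(np.array([1.0, 0.0]), ctx, W_in, W_ctx, W_out)
y2, ctx = srn_step(np.array([1.0, 0.0]), ctx, W_in, W_ctx, W_out)  # same input, new context
```

The two outputs differ even though the input pattern is identical, which is precisely the discrimination mechanism the context units provide.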
forecasts (Figure 2.18). This method has proven useful for local modelling approaches,
discussed in 3.2.3, but if a global modelling approach is taken this method can be plagued
by the accumulation of errors [after Boné and Crucianu, 2002].
Figure 2.18 - The recursive multi-step method. New estimated outputs are shifted
through the input vector and old inputs are discarded. All neural networks are identical.
[after Duhoux et al., 2002]
2. Chaining ANNs;
One can also chain several ANNs to make a multi-step ahead prediction (Figure 2.19). For
a time horizon of p, a first network learns to predict at t+1, then a second network is
trained to predict at t+2 by using the prediction provided by the first network as a
supplementary input. This procedure is repeated until the desired time horizon p is
reached. [after Boné and Crucianu, 2002]
Figure 2.19 - Chains of ANNs: beginning with a classical one-step ahead predictor, the
outputs are inserted in a next one-step ahead predictor, by adding the one-step ahead
prediction to the input vector of the subsequent predictor. [after Duhoux et al., 2002]
output layer, each of which represents one time step to be forecasted (Figure 2.20). There
can be as many as p output neurons. Training is done by using an algorithm that punishes
the predictor for accumulating errors in multi-step ahead prediction (e.g. the
Backpropagation Through Time algorithm).
This method can provide good results, especially if it is assisted by some form of
implementation of time into the network architecture (e.g. recurrent connections or FIRs).
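The recursive multi-step method (method 1 above) can be sketched as follows. This is a toy illustration; the model callable stands in for any trained one-step-ahead predictor.

```python
def recursive_forecast(model, history, steps):
    """Recursive multi-step prediction: each one-step forecast is shifted into
    the input window and the oldest input is discarded. The same network is
    reused at every step, which is why errors can accumulate over the horizon."""
    window = list(history)
    forecasts = []
    for _ in range(steps):
        y = model(window)              # one-step-ahead prediction
        forecasts.append(y)
        window = window[1:] + [y]      # shift the forecast into the input vector
    return forecasts

# With a toy 'model' that simply sums its window, the mechanism is easy to follow:
preds = recursive_forecast(lambda w: sum(w), [1, 2], steps=3)
```

Each forecast is built on earlier forecasts rather than on observations, which illustrates how an early error propagates through all later predictions.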
Zealand, Burn and Simonovic [1999] claim that ANNs have the following beneficial model
characteristics:
+ They infer solutions from data without prior knowledge of the regularities in the data; they
extract the regularities empirically. This means that when ANN techniques are used in a
certain field of work, relatively little specific knowledge of that field is demanded for the
development of that model because of the empirical nature of ANNs. This demand is certainly
higher when developing models using conventional modelling techniques.
+ These networks learn the similarities among patterns directly from examples of them. ANNs
can modify their behaviour in response to the environment (i.e. shown a set of inputs with
corresponding desired outputs, they self-adjust to produce consistent responses).
+ ANNs can generalize from previous examples to new ones. Generalization is useful because
real-world data are noisy, distorted, and often incomplete.
+ ANNs are also very good at the abstraction of essential characteristics from inputs containing
irrelevant data.
+ They are non-linear, that is, they can solve some complex problems more accurately than
linear techniques do.
+ Because ANNs contain many identical, independent operations that can be executed
simultaneously, they are often quite fast.
As mentioned earlier, ANNs belong to the family of parallel distributed processing systems,
which are known to be faster than conventional models. This is of course dependent on the
efficiency of the ANN.
ANNs have several drawbacks for some applications too [modified after Zealand, Burn and Simonovic,
1999]:
- ANNs may fail to produce a satisfactory solution, perhaps because there is no learnable
function or because the data set is insufficient in size or quality.
- The optimal training data set, the optimum network architecture, and other ANN design
parameters cannot be known beforehand. A good ANN model generally has to be found using
a trial-and-error process.
- ANNs are not very good extrapolators. Deterioration of network performance when predicting
values that are outside the range of the training data is generally inevitable. Pre-processing
data (discussed in 3.5.2) can help reduce this performance drop.
- ANNs cannot cope with major changes in the system because they are trained (calibrated) on
a historical data set and it is assumed that the relationship learned will be applicable in the
future. If there were major changes in the system, the neural network would have to be
adjusted to the new process.
- It is impossible to tell beforehand which internal network parameter set (i.e. collection of
network weights) is the optimal set for a problem. Training algorithms often do a good job of
finding a parameter set that performs well, but this is not always the case, e.g. when coping
with a very complex error surface for a problem. In addition to this problem, it is also very
difficult to tell whether a training algorithm has found a local or a global minimum.
Another problem is that for different periods in time or for different dominating processes
described in the training set, there will likely be sets of parameters that give a good fit to the
test data for each one of these situations and other sets giving good fits by a mixture of all
the periods or processes [Beven, 2001]. The different optima may then lie in very different
parts of the parameter space, which complicates the choice of the optimal ANN.
The lack of explainability of ANN model results is one of the primary reasons for the sceptical
attitude towards the application of ANN techniques in certain fields. The lack of physical concepts and
relations leads many scientists to view ANNs with suspicion. For ANNs to gain wider
acceptability, it is increasingly important that they have some explanation capability after training has
been completed. Most ANN applications have been unable to explain, in a comprehensible and
meaningful way, the basic process by which they arrive at a decision. [Govindaraju, 2000]
Table 2.3 - Review of ANN performance on various aspects [modified after Dhar & Stein, 1997].
2.4.2 Overtraining
An often encountered problem when applying ANN techniques is called overtraining. Overtraining
effects typically result from a combination of three (often complementary) causes:
1. Using an ANN architecture that is too complex for the relations that are to be modelled;
2. Overly repetitive training of an ANN;
3. Training an ANN using an inappropriate training data set.
A large (and therefore complex) ANN architecture, as opposed to the relatively simple information to
which the ANN model adapts, is an example of a poor ratio between the number of model parameters
and the information content of the data. As a result, the chance of overparameterisation
increases.
Points 2 and 3 are reasons for overtraining, because too much similar information is presented to an
ANN. The ANN model adapts its internal parameters to this information, resulting in a rigid model that
succeeds in approximating the relations presented in the training data, but fails to approximate the
relations in other data sets with slightly different data values.
Basically, the network adjusts its internal parameters based on not only the essential relations
associated with the empirical data, but also unwanted effects in the data. This can result in a model
with poor predictive capability. These unwanted effects could be associated with either measurement
noise or any other features of the data associated with additional relations or phenomena that are not
of any interest when designing a model. [Ham and Kostanic, 2001]
Because the network picks up and starts to model little clues that are particular to specific
input/output patterns in the training data, the network error decreases and the performance improves
during the training stage. In essence, the network comes up with an approximation that exactly fits
the training data, even the noise in it [Dhar and Stein, 1997]. As a result of overtraining, the
generalisation capability of the network decreases.
Figure 2.21 shows an example of an overtrained ANN. If the goal of the network would be to
approximate the training data (i.e. approximate the crosses in the figure), the ANN model would be
performing outstandingly. However, the goal of the ANN is not just to approximate the training data,
but to mimic the underlying process. The crosses in the figure represent measurements of a stochastic
time-dependent output variable that is to be estimated. Since the training data are but a finite
sample of this stochastic variable's data set (which is theoretically infinite), the crosses represent only
one realisation of the stochastic variable that is to be estimated.
In this case, the process output, which can be described by a time-dependent stochastic output
variable, is assumed to be known. This output is a result of certain values of input variables and
describes the evolution of a process in time (i.e. it is a time series); in this case it is a sine function.
This implies that if the means of an infinite number of realisations of the stochastic output variable,
given the same values of the input variables, were plotted, the result would look like a time series with
a periodic mean, namely the dashed line in Figure 2.21. This is the line that actually has to be
approximated by the ANN model. The only clues the model gets for completing this task are the
training data (the values of the input variables and the accompanying crosses in the figure below).
Since the ANN model generally has no information on other realisations of the input and output
variables, the result is a rigid model that only responds adequately to values that are very similar to
training data values.
Figure 2.21 - An overtrained network tends to follow the training examples it has been
presented and therefore loses its ability to generalize (approximate the sine function).
[after Demuth and Beale, 1998]
A potential solution for the overtraining problem is to keep a second set of data (labelled training test
data or cross-training data) separate and to use it for periodically checking the network's
approximation of this set against its approximation of the training set. The best point to stop training
is when the network does a fairly good job on both data sets (pointed out in Figure 2.22).
The reason why this method will result in a model with a better performance is that instead of
relying on only one realisation of the stochastic output variable (just the training data), the model can
now adapt to two realisations. If the ANN model does a good job on both data sets, this means that
the model approximates the mean of those two realisations. Therefore, the model approximation is
theoretically closer to the true mean of the stochastic output variable (the sine function) than when
approximating using one realisation.
Making use of a second or third cross-training data set would (theoretically) improve
an ANN model's generalisation capacity even further. This approach, however, is often discouraged because of its
large data demand.
Figure 2.22 - Choosing the appropriate number of training cycles [after Hecht-
Nielsen, 1990]
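The stopping criterion depicted in Figure 2.22 can be sketched as follows. This is a simplified illustration; the update and test_error hooks are hypothetical stand-ins for one training epoch and for the error measurement on the separate training test set.

```python
def train_with_early_stopping(update, test_error, max_epochs=100, patience=5):
    """Cross-training sketch: train while periodically checking the error on a
    separate training test set, and stop once that error has not improved for
    `patience` consecutive epochs. Returns the best (stopping) epoch."""
    best, best_epoch, waited = float("inf"), 0, 0
    for epoch in range(max_epochs):
        update()                          # one training epoch (hypothetical hook)
        err = test_error()                # error on the separate test set
        if err < best:
            best, best_epoch, waited = err, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break                     # test-set error no longer improves
    return best_epoch

# Toy run: the test error falls until epoch 10, then rises (overtraining sets in)
errs = [abs(e - 10) + 1 for e in range(30)]       # 11, 10, ..., 1, 2, 3, ...
state = {"epoch": -1}
stop = train_with_early_stopping(lambda: state.update(epoch=state["epoch"] + 1),
                                 lambda: errs[state["epoch"]],
                                 patience=3)
```

The patience parameter is a design choice: it prevents training from stopping at the first chance fluctuation of the test-set error.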
Another possible way of preventing overtraining is called regularization. This method involves
modifying the error function of performance learning algorithms. For example, if the MSE is used as
error function, generalization can be improved by adding a term that consists of the mean of the sum
of squares of the network weights and biases:
MSEREG = γ · MSE + (1 − γ) · MSW (1.19)
where
MSW = (1/n) · Σ_{j=1}^n w_j² (1.20)
Using this performance function will cause the network to have smaller weights and biases, and this
will force the network response to be smoother and less likely to overtrain. [after Demuth and Beale,
1998]
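Equation (1.19) can be evaluated directly, as the following small illustration shows. The performance ratio γ is written here as gamma, and the error and weight vectors are arbitrary example values.

```python
import numpy as np

def msereg(errors, weights, gamma=0.9):
    """Regularised performance function MSEREG = gamma*MSE + (1-gamma)*MSW.

    MSW is the mean of the squared network weights (and biases); penalising it
    pushes the network towards smaller weights and a smoother, less
    overtraining-prone response. gamma is the performance ratio."""
    mse = np.mean(np.square(errors))     # mean squared network error
    msw = np.mean(np.square(weights))    # mean squared weights, eq. (1.20)
    return gamma * mse + (1.0 - gamma) * msw

val = msereg(errors=np.array([1.0, -1.0]), weights=np.array([2.0, 0.0]), gamma=0.5)
```

Two networks with equal MSE then differ in MSEREG whenever one of them uses larger weights, which is exactly the smoothing pressure described above.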
One final important remark can be made about this discussion on overtraining: the output of the
process (i.e. the ideal time series) will, in practice, often be unknown. It is therefore impossible to
conclude overtraining from an excessively accurate approximation of the training data alone.
Assuming that an ANN model shows good training results, but fails to achieve high accuracy on
other data sets, how can an ANN model developer know whether his/her model is overtrained, or the
model is just plain wrong? Unfortunately, this question cannot be answered with certainty because of
the low transparency of ANN model behaviour.
Nevertheless, as the theory on cross-training shows, this drawback does not devalue the
significance of keeping a separate training test set. Even if overtraining is not expected, applying
cross-training is a wise choice, for it will reduce the risk of it occurring.
2.4.3 Underfitting
Underfitting is another effect, closely related to overtraining, that occurs as a result of improperly
training an ANN. If network training is stopped before the error on the training data and the cross-
training data is minimal (e.g. before the stopping point that is depicted in Figure 2.22), the network
does not optimally approximate the relations in this data. A common cause of underfitting is that a
modeller stops the training too early, for instance by setting a maximum number of training epochs
that is too low, or a training error goal that is too high. Also, a short data set should be used several
times in the training phase so that an ANN has enough epochs to learn the relations in the data.
Practically speaking, there is a minor underfitting effect in most, if not all, trained ANNs. The
reason for this is that a learning algorithm is often unable to reach the global minimum of a complex
error function. And even if this global minimum is reached, its coordinates (i.e. weight values)
probably differ from those of the minimum of the error function over the training and the
cross-training data, let alone over the training, the cross-training and the validation data.
ANN Design for Rainfall-Runoff Modelling
Figure 3.1 - Schematic representation of the hydrological cycle (highlighting the processes on and
under the land surface). The dark blue and light blue areas and lines indicate an increase in
surface water level and groundwater level due to precipitation.
The driving force behind the hydrological cycle (shown in Figure 3.1) is solar radiation. Most water on
earth can be found in seas and oceans. When this water evaporates, it is stored in the atmosphere. As
Chapter 3
a result of various circumstances, this water vapour can condense, form clouds and eventually
become precipitation.
Precipitation can fall directly on oceans or seas, or on rivers that transport it to them. A
number of possibilities exist for water that falls on land: water can be intercepted by vegetation and
evaporate, water can flow over the land surface towards a water course (or evaporate before it has
reached it) and water can fall on the land surface and infiltrate in the soil (or evaporate before it has
infiltrated).
Infiltration brings water into the unsaturated zone. Infiltrated water can be absorbed by vegetation,
which brings the water back into the atmosphere through transpiration. When the water content of
this soil reaches a maximum, infiltrated water percolates deeper into the soil, where it reaches the
subsurface water table.
The soil beneath the water table is saturated with water, hence its name: saturated zone. Water
from the saturated zone that contributes to catchment runoff is part of groundwater runoff. The
process of groundwater flowing back into water courses is called seepage.
A network of water courses guides the water towards a catchment outlet. In describing the relation
between rainfall and runoff, the runoff response of a catchment due to a rainfall event is often
expressed by an observed hydrograph in the channel system at the catchment outlet that must be
interpreted as an integrated response function of all upstream flow processes. [Rientjes, 2003]
The response to a rainfall event shown in a hydrograph consists of three distinguishable sections
(see Figure 3.2).
1. A rising limb (BC) - discharge by very rapid surface runoff processes
2. A falling limb (CD) - discharge by rapid subsurface processes
3. A recession limb (DE) - discharge by groundwater processes
Figure 3.2 - Hydrograph of a rainfall event: discharge (in m³/sec) against time (in days). Points A to E
mark the hydrograph sections; the storm flow component lies above the base flow component.
The area under a hydrograph can be divided into two parts. Some of the discharge presented in the
hydrograph would have occurred even without the rainfall event. This is represented by the lower
area in the hydrograph, typically referred to as base flow. Base flow mainly consists of the total of
delayed flow processes (e.g. groundwater flows). The upper part represents the so-called storm flow
component. This flow consists of all rapid flow processes that contribute to the catchment runoff. The
separation between storm flow and base flow is artificial, but it is often thought of as depicted by the
dashed line in Figure 3.2.
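The separation depicted by the dashed line can be expressed as a simple computation, sketched below. This is an illustrative sketch only; in practice the base flow series would come from an artificial separation method, and the values here are arbitrary.

```python
def separate_flows(discharge, baseflow):
    """Split a discharge series into its storm flow component, given an
    (artificial) base flow estimate such as the dashed line in Figure 3.2.
    Storm flow is whatever part of the discharge exceeds the base flow."""
    return [max(q - b, 0.0) for q, b in zip(discharge, baseflow)]

# Discharge and an assumed base flow estimate, both in m3/sec:
storm = separate_flows([20.0, 80.0, 100.0, 40.0], [20.0, 22.0, 24.0, 26.0])
```

Summing the storm flow series (times the time step) would give the storm flow volume, i.e. the upper area of the hydrograph.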
Figure 3.3 - Schematic representation of cross-sectional hill slope flow [Rientjes and
Boekelman, 2001]
Surface runoff
Surface runoff is that part of the runoff, which travels over the ground surface and
through channels to reach the catchment outlet. [Chow et al., 1988]
Below is a list of the flow processes that make up surface runoff [modified after Rientjes and
Boekelman, 2001]:
Overland flow is flow of water over the land surface by means of a thin water layer (i.e. sheet
flow), or as converged flow into small rills (i.e. rill flow). Most overland flow is in the form of
rill flow, but hydrologists often find it easier to model overland flow as sheet flow. There are
two types of overland flow: Horton overland flow and saturation overland flow.
1. Horton flow is generated by the infiltration excess mechanism (shown in Figure 3.4).
Horton [1933] referred to a maximum limiting rate at which a soil in a given condition
can absorb surface water input (f in Figure 3.4). Under the condition that the
rainfall rate (P) exceeds this limiting rate (ultimately the saturated hydraulic
conductivity of the top soil) and that the rainfall duration is longer than the ponding
time of small depressions at the land surface, water will flow downslope as an
irregular sheet or converge into rills of overland flow. This flow is known as Horton
overland flow or concentrated overland
flow. The aforementioned depression storage does not contribute to overland flow: it
either evaporates or infiltrates later. The amount of water stored on the hillside in the
process of flowing downslope (the light blue area in the figure below) is called the
surface detention.
Horton overland flow is mostly encountered in areas where rainfall intensities are high
and soil hydraulic conductivities are low, while the hydraulic resistance to
overland flow is small (e.g. bare slopes or slopes covered only by thin vegetation) [Rientjes,
2003]. Paved urban areas offer the most obvious occurrence of this mechanism.
2. Another form of overland flow, namely saturation overland flow, is caused by the
saturation excess mechanism. This flow is generated as the soil becomes saturated due
to the rise of the water table to the land surface or by the development of a saturated
zone due to the lateral and vertical percolation of infiltration water above an impeding
horizon [Dunne, 1983].
This phenomenon is typically encountered at the bottoms of hillslopes (which are
often areas around streams and channels), especially if the storage capacity is small
due to the presence of a shallow subsurface. This flow process can also occur as a
result of the rise of the water table under perched flow conditions (i.e. a combination of
the processes shown in Figure 3.5 and Figure 3.6).
Figure 3.5 - Saturation overland flow due to the rise of the perennial water
table [after Beven, 2001]
Note the difference between the two overland flow generating mechanisms: in the case of the
infiltration excess mechanism the subsoil becomes saturated by infiltrated water from the land
surface (saturation from above), while in the case of saturation excess mechanism the subsoil
becomes saturated due to a rise of the water table (saturation from below).
Stream flow is defined as the flow of water in streams due to the downward concentration of
rill flow discharges in small streams.
Channel flow occurs when water reaches the natural or artificial catchment drainage system.
Water is transported through main channels, in which runoff contributions from the various
runoff processes are collected and routed.
ANN Design for Rainfall-Runoff Modelling
Subsurface runoff
Subsurface runoff is that part of precipitation, which infiltrates the surface soil and
flows laterally through the upper soil horizons towards streams as ephemeral,
shallow, perched groundwater above the main groundwater level. [Chow et al.,
1988]
Below is a list of the flow processes that make up subsurface runoff (also called interflow) [modified
after Rientjes and Boekelman, 2001]:
Unsaturated subsurface flow is generated by infiltration of water into the subsurface. It takes
place in flow conditions that are subject to Darcy's law, where flow is thus governed by
hydraulic pressure gradients and soil characteristics. Since the variations of soil moisture
content are much larger in the vertical direction than in the horizontal direction,
unsaturated subsurface flow is predominantly vertical.
Runoff contributions due to unsaturated subsurface flow are very small and generally of no
significance for the total catchment runoff.
Perched subsurface flow (Figure 3.6) occurs in perched (saturated) subsurface conditions
where water flows in lateral directions and where water flow is subject to (lateral) hydraulic
head gradients. Perched subsurface flow is generated if the saturated hydraulic conductivity
of a given subsurface layer is significantly lower than that of the overlying soil layer. As a result of
the difference in conductivity, the vertical movement of infiltrated water is
obstructed and the infiltrated water is drained laterally through the overlying, more permeable
layer.
Runoff contributions due to perched subsurface flow can be significant.
Macro pore flow is characterised as a non-Darcian subsurface flow process in voids, natural
pipes and cracks in the soil structure. Macro pores can be caused by drought, animal life,
rooting of vegetation or by physical and chemical geological processes. Water flow is not
controlled by hydraulic pressure gradients, but occurs at atmospheric pressure. Macro pore
flow that is not discharged as subsurface runoff will recharge the unsaturated zone of the
groundwater system.
Macro pore flow travels through cracks, voids and pipes in the subsoil, and therefore has a
much shorter response time than flow through a continuous soil matrix where Darcian
conditions determine the flow process. Bypassing great parts of the unsaturated soil profile,
this macro pore flow can cause a groundwater system to quickly become recharged after a
rainfall event. The same mechanism can, in addition, contribute to the generation of perched
subsurface flow. [Rientjes, 2003]
Groundwater runoff
Groundwater runoff is that part of the runoff due to deep percolation of the
infiltrated water, which has passed into the ground, has become groundwater, and
has been discharged into the stream. [Chow et al., 1988]
Groundwater runoff is the flow of water in the saturated zone. It is generated by the percolation of
infiltrated water that causes the rise of the water table. Below are descriptions of the two flow
components into which groundwater flow can be separated, as presented by Rientjes [2003].
Rapid groundwater flow is that part of groundwater flow that is discharged in the upper part
of the initially unsaturated subsurface domain.
Delayed groundwater flow is discharged groundwater in the lower part of the saturated
subsurface, which was already saturated prior to the rainfall event.
N.B.
The variable source area concept (mentioned in Figure 3.7) is illustrated in Figure 3.8. This concept
states that the size and location of the areas that contribute to runoff are variable. The reason for this
is that the mechanisms of runoff generation depend on ground surface properties, geomorphological
position and geology, and the spatial variability in these attributes. This results in differences in the runoff
contributed from different locations, or in only part of the surface area of a watershed contributing to
runoff. The source area that contributes to runoff can also vary at within-storm and at
seasonal time scales. [Tarboton, 2001]
Figure 3.8 - Variable source area concept [after Chow et al., 1988]. The small
arrows in the hydrographs show how the streamflow increases as the variable
source extends into swamps, shallow soils and ephemeral channels. The
process reverses as streamflow declines.
The goal of this section is to explain the classification of the many types of R-R models into physically
based, conceptual and empirical models.⁷
If a model represents a system with specified regions of space (i.e. the system is partitioned into spatial
units of equal or non-equal size), it is called a distributed model (see figure below). Physically based
models often use two-, and sometimes three-dimensional distributed data.
Because of this distributed-data approach, the data demand for these models is typically very large.
For a numerical model of this kind, the model data must include not only the values of the properties
of physiography, geology and/or meteorology at all spatial units in the system, but also the location of
⁷ This section focuses on continuous stream flow models only. There is also a group of single-event models,
mostly used when simulating extreme rainfall events. These models are often simpler than continuous
models, because they merely consider the extreme events in a continuous process.
the model boundary and the types and values of the mathematical boundary conditions [after Rientjes
and Boekelman, 1998].
This type of model is also referred to as a white box model (as opposed to black box models, cf.
3.2.3). A well-known example of a physically based R-R model is the SHE (Système Hydrologique
Européen) model, depicted in Figure 3.10.
The approach for taking spatial distribution of variables and parameters in a catchment into account
differs between conceptual models. Some, but not many, of these models use distributed modelling in
the same way physically based models do, others use the lumped method used by empirical models.
A compromise between the two can also be used: semi-distributed modelling divides the catchment
area into spatial units that share one or more important characteristics. For example, the
area of a catchment can be divided into smaller subcatchments, or into areas that have approximately the
same travel time to the outlet point of the catchment.
Conceptual models are the most frequently used model types in R-R modelling. Another name for
these models is grey box models, because they are a transition between physically based (white box)
and empirical (black box) models. Well-known examples of conceptual R-R modelling are storage
models such as cascade models and time-storage models such as the Sacramento model.
Dibike and Solomatine [2000] declare that physically based models and conceptual models are of
greater importance in the understanding of hydrological processes, but there are many practical
situations where the main concern is with making accurate predictions at specific locations. In such
situations it is preferred to implement a simple black box model to identify a direct mapping between
the inputs and outputs without detailed consideration of the internal structure of the physical process.
On the downside, empirical models have certain drawbacks concerning their applicability. Because the
parameters of a black box model (e.g. the regression coefficients) are based on an analysis of
historical data from a certain catchment, the model becomes catchment-dependent. The time period
over which the model remains valid and accurate has to be considered critically as well. For example, if
changes in climate or catchment characteristics (e.g. land use) cause a model to perform poorly, it has to be
recalibrated and validated using data from the new situation.
The spatial distribution of the input variables and parameters in the model area is not taken into
account by empirical models. These models are therefore called lumped models: they represent the
system as a whole and treat a model input, e.g. rainfall in the catchment, as a single spatial
input. [after Rientjes and Boekelman, 1998]
A well-known example of empirical R-R modelling is the Multiple Linear Regression (MLR) model.
ANNs are also typical examples of black box models.
A special form of black box R-R models are models that make predictions based merely on analysis of
historical time series of the variable that is to be predicted (e.g. runoff). Only the latest values of this
variable are used for prediction; how many values are used depends on the memory length of the
model. Since time series models are easy to develop, they are often used in preliminary
analyses.
A fundamental difference between these types of black box models is that time series models make
predictions based only on the latest values of the variable, whereas regular black box models base their
predictions on the complete time series. Time series models are therefore labelled local models,
as opposed to the global approach of other black box models.
Typical examples of time series models are:
ARMAX (auto-regressive moving average with exogenous inputs),
Box-Jenkins method.
Since ANNs are black box models, they can also serve as time series models (i.e. both model input
and model output are based on catchment output). This investigation will examine the application of
ANNs as cause-and-effect models for R-R relations (i.e. the model input relates to catchment input
and the model output to catchment output), their application as time series models for discharge, as well
as a combination of the global and local techniques.
ANNs are typical examples of empirical models. The ability to extract relations between inputs and
outputs of a process, without the physics being explicitly provided to them, suits the problem of
relating rainfall to runoff well, since it is a highly nonlinear and complex problem. This modelling
approach has many features in common with other modelling approaches in hydrology: the process of
model selection can be considered equivalent to the determination of appropriate network
architecture, and model calibration and validation is analogous to network training, cross training and
testing [Govindaraju, 2000].
ANNs are considered one of the most advanced black box modelling techniques and are therefore
nowadays frequently applied in R-R modelling. It was, however, not until the first half of the 1990s
that the earliest experiments using ANNs in R-R hydrology were carried out [French et al., 1992; Halff
et al., 1993; Hjemfelt and Wang, 1993; Hsu et al., 1993; Smith and Eli, 1995].
Govindaraju [2000] states that research activities on ANNs in R-R modelling can be broadly classified
into two categories:
The first category comprises studies in which ANNs were trained and tested using data
generated by existing models. The goal of these studies is to prove that ANNs are capable of replicating model
behaviour. These studies may be viewed as providing a proof-of-concept analysis for ANNs.
Most ANN-based studies fall into the second category, the ones that have used observed R-R
data. In such instances, comparisons with conceptual or other empirical models have often
been provided.
Most studies report that ANNs have resulted in superior performance compared to traditional
empirical techniques. However, some of the previously discussed drawbacks of ANNs (see 2.4), such
as extrapolation problems or problems with defining a training data set, are still often encountered.
One issue that especially bothers hydrologists is the limited transparency of ANNs. Most ANN
applications have been unable to explain in a comprehensible, meaningful way the basic process by
which networks arrive at a decision. In other words: an ANN is not at all able to reveal the physics of
the processes it models. This limitation of ANNs is even more obvious in comparison to physically
based R-R modelling approaches.
Although the development effort for ANNs as R-R models is small relative to physically based R-R
models, one must take care not to underestimate the difficulty of building such a model. ANN model
design in the field of R-R modelling is subject to many (ANN-specific and hydrology-specific) difficulties,
some of which are discussed in detail in the following sections (3.4 - 3.8).
make better choices regarding the input variables for proper mapping. This will, on the one hand, help
in avoiding loss of information (e.g. if key input variables are omitted), and, on the other hand,
prevent unnecessary inputs from being fed to the model, which can diminish network
performance.
Numerous applications have proven the usefulness of a trial-and-error procedure in determining
whether an ANN can extract information from a variable. Such an analysis can be used to determine
the relative importance of a variable, so that input variables that do not have a significant effect on
the performance of an ANN can be trimmed from the network input vector, resulting in a more
compact network.
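Such a trial-and-error procedure could be organised as in the sketch below (plain Python; `train_and_score` is a hypothetical callback that trains an ANN on the given input variables and returns a validation error, and the greedy dropping strategy shown is only one of several possibilities, not the procedure used in this thesis):

```python
def prune_inputs(candidates, train_and_score, tolerance=0.0):
    """Greedily drop input variables whose removal does not worsen the score.

    candidates: list of input variable names.
    train_and_score: hypothetical function that trains an ANN on the given
    variables and returns a validation error (lower is better).
    """
    baseline = train_and_score(candidates)
    kept = list(candidates)
    for var in candidates:
        reduced = [v for v in kept if v != var]
        if not reduced:
            break
        if train_and_score(reduced) <= baseline + tolerance:
            kept = reduced  # dropping var did not hurt performance
    return kept
```

Each candidate variable is tentatively removed; if the validation error does not increase, the variable is trimmed from the input vector, yielding a more compact network.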
When dealing with the hydrological situation where rainfall is the driving force behind runoff
generation (another possibility is snowmelt), rainfall input seems the most logical variable to present
to an ANN. There are several possible ways of presenting rainfall data, such as:
Rainfall intensity;
The amount of rainfall per time unit is the most common way of expressing rainfall
information.
Variables that are closely related to the effect of rainfall on runoff are:
Evaporation;
Effective rainfall is the rainfall minus the evaporation. Effective rainfall should be a better
indicator of the real-world input of water into the catchment than just rainfall, but evaporation
is often not easily determined; it involves a variety of hydrological processes and the
heterogeneity of rainfall intensities, soil characteristics and antecedent conditions [Beven,
2001]. Evaporation data are a good addition to precipitation data, because the information
content of these variables complement each other, resulting in a more accurate
representation of catchment input than precipitation alone.
Temperature data (see below) are often used instead of evaporation data since temperature
is a good indicator of evaporation and, moreover, because its data availability is much higher
than that of evaporation.
Wind direction.
The direction of the wind is often equal to the direction in which the rainfall develops. The
shape of the hydrograph can be very dependent on this direction. For instance, a rainstorm
travelling from the catchment outlet towards the opposite catchment border can result in
a relatively flat and long hydrograph, whereas a rainstorm travelling over the catchment in the
opposite direction can result in a short hydrograph with a high peak.
Wind information can, for example, be presented to the model by categorizing wind
directions into classes and assigning values to these classes: 0 = wind direction equal to the
governing flow direction of the catchment, 1 = wind direction lateral to the flow direction, and
2 = wind direction opposite to the flow direction.
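This categorization could be sketched as follows (plain Python; the 45° and 135° class boundaries are an illustrative assumption, not taken from the text):

```python
def wind_direction_class(angle_to_flow_deg):
    """Classify the angle (0-180 degrees) between the wind direction and the
    catchment's governing flow direction into the three classes above."""
    if angle_to_flow_deg < 45:
        return 0   # wind roughly equal to the flow direction
    if angle_to_flow_deg <= 135:
        return 1   # wind lateral to the flow direction
    return 2       # wind roughly opposite to the flow direction
```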
Instead of rainfall, the origin of runoff water can lie in snowmelt (especially during spring, when
the temperature rises and accumulated snow melts). If snowmelt is a significant driving force in a
catchment, the following variables can be presented to an ANN:
Snow depth;
Cumulative precipitation over the winter period;
Winter temperature index.
The winter temperature index represents the mean temperature over the winter period and
therefore gives information about the accumulation of snow during this period.
The amount of water in the upper layers of the catchment soil is a good indicator of the hydrological
state of a catchment (see 3.1). The following variable can therefore be helpful when predicting
runoff:
Groundwater levels;
The groundwater level in the catchment soil indicates the amount of water that is currently
stored in the catchment. This information can be useful for an ANN model in two ways:
1. Determining the effect of a rainfall event;
A rainfall event on a dry catchment (e.g. at the end of the summer) will result in less
discharge than a rainfall event on a catchment with high groundwater levels (e.g. at
the end of the winter).
2. Determining the amount of base flow from a catchment.
As explained in 3.1, the groundwater flow processes determine the base flow from a
catchment. Groundwater values can be indicators of the magnitude of these
groundwater flows.
Another variable that may aid an ANN when relating rainfall to runoff is:
Seasonal information;
Providing an ANN model with seasonal information can help the network in differentiating the
hydrological seasons. The most common way of providing seasonal information is by inputting
it indirectly through a variable which contains this information. Examples of such variables are
temperature and evaporation.
The difficulty in choosing proper input variables, however, lies not only in selecting a set of variables
specific to the situation, but also in selecting variables that complement each other without
overlapping one another. Overlap in information content (i.e. redundancy in input data) results in
complex networks, thereby increasing the possibility of overtraining and decreasing the
chances that training algorithms find an optimal weight matrix.
The question is whether the training data sufficiently represent the operational phase.
The following two aspects should be considered:
1. The statistics of an ideal data set should be equal to those of the input variables
in the operational phase of the model. It is easy to realise that ANN performance
will decrease when it is presented with data with a different mean than it has been
trained on. This goes for all measures of location (e.g. mean), spread (e.g.
range, variance) and asymmetry (e.g. coefficient of skewness).
A sufficient range of the training data is especially imperative. ANNs have
proven to be poor extrapolators. Therefore, an ANN R-R model will probably not
be able to accurately predict extreme runoffs in the wet season if it has only been
trained using data from the dry season.
made a prediction, the trend can be added to the predicted data, thereby relieving
the ANN model of the task of modelling the trend.
Some trends such as seasonal variations can be accounted for by the non-linear
mapping capabilities of an ANN, under the condition that information about this
trend is presented to the network. The most common way of dealing with
seasonal variation is to present a time series that implicitly contains seasonal
information (e.g. evaporation or temperature).
One way of scaling data is amplitude scaling: the data is scaled so that its minimum and maximum
values lie between two suitable values (most often between 0 and 1 or between −1 and 1). For
example, the input or output variables can be divided by the maximum value present in the pattern,
thereby linearly scaling the data to a range of 0 to 1.
According to Smith [1993], amplitude scaling to a range smaller than 0 to 1 (e.g. 0.05 to 0.95, 0.1 to 0.9 or 0.2
to 0.8) can be used to avoid the problem of output signal saturation that is
sometimes encountered in ANN applications. Scaling to a range of 0 to 1 implies the assumption
that the training data contains the full range of possible outcomes, which is often
not at all true.
This scaling method can be written as:
( X u fact min)
X n = FMIN + ( FMAX FMIN ) (2.2)
fact max fact min
where Xu and Xn represent the variable to be scaled down and its scaled down value respectively,
FMIN and FMAX represent the minimum and maximum of the scaling range and fact min and fact
max are the minimum and maximum value in the X vector.
Applications in hydrology may also benefit from asymmetrical scaling. Since overestimation of
discharge values is by far more likely than underestimation, amplitude scaling to a range of e.g. 0.05
to 0.8 may result in better approximations of hydrographs.
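As a sketch of how equation (2.2) might be implemented (plain Python; the function name and default range are illustrative assumptions):

```python
def amplitude_scale(values, fmin=0.05, fmax=0.95):
    """Linearly scale a sequence into the range [fmin, fmax] (equation 2.2)."""
    lo, hi = min(values), max(values)  # fact_min and fact_max in the text
    return [fmin + (fmax - fmin) * (v - lo) / (hi - lo) for v in values]
```

An asymmetrical range, such as the 0.05 to 0.8 mentioned above, simply follows from choosing different `fmin` and `fmax` values.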
Another common way of amplitude scaling, often applied in hydrology, is log-scaling. This
scaling method can be described by the following equation:

$$X_n = \ln(X_u) \qquad (2.3)$$
Other examples of scaling processes are mean centering and variance scaling.
Assuming that the input patterns are arranged in columns in a matrix A, and that the target vectors
are arranged in columns in a matrix C, the mean centering process involves computing a mean value
for each row of A and C (i.e. there are as many means as there are input and output neurons). The
mean is subsequently subtracted from each element in the particular row for all rows in both A and C.
Variance scaling involves computing the standard deviations for each row in A and C. The associated
standard deviation is then divided into each element in the particular row for all rows in both A and C.
[after Ham and Kostanic, 2001]
Mean centering and variance scaling can be applied together or separately. Mean centering can be
important if the data contains biases and variance scaling if the training data are measured with
different units. For both mean centering and variance scaling, however, the rule is: if A is scaled, then
so should C be.
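A minimal sketch of both operations on a row-wise data matrix (plain Python; rows correspond to the input or output variables, as in A and C above, and the sample standard deviation with an n − 1 denominator is an assumption):

```python
def mean_center(rows):
    """Subtract each row's mean from that row's elements."""
    return [[v - sum(row) / len(row) for v in row] for row in rows]

def variance_scale(rows):
    """Divide each row's elements by that row's standard deviation."""
    scaled = []
    for row in rows:
        mean = sum(row) / len(row)
        sd = (sum((v - mean) ** 2 for v in row) / (len(row) - 1)) ** 0.5
        scaled.append([v / sd for v in row])
    return scaled
```

Following the rule above, whichever of the two operations is applied to A should also be applied to C.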
A uniformly or normally distributed randomization function is often used to set the initial weight
values. These random initial weights are commonly small.
An example of a more advanced technique is the Nguyen-Widrow initialization method. This method
generates initial weight and bias values for a layer, so that the active regions of the layer's neurons
will be distributed approximately evenly over the input space. [after Demuth and Beale, 1998]
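A basic version of the first approach can be sketched as follows (plain Python; the default weight range of ±0.5 is an illustrative assumption, not a value from the text):

```python
import random

def init_weights(n_inputs, n_neurons, scale=0.5):
    """Small random initial weights, uniformly distributed in [-scale, scale],
    for a layer of n_neurons neurons with n_inputs inputs each."""
    return [[random.uniform(-scale, scale) for _ in range(n_inputs)]
            for _ in range(n_neurons)]
```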
Graphical methods
The following graphical performance criteria, as proposed by the World Meteorological Organisation
(WMO) in 1975, are suited for the error evaluation procedure of a R-R model:
A linear scale plot of the simulated and observed hydrograph for both the calibration and the
validation periods;
Double mass plots of the simulated and observed flows for the validation period;
A scatter plot of the simulated versus observed flows for the verification period.
The following performance measures are numerical expressions of what can also be concluded from a
visual evaluation of the hydrograph.
Other (non-graphical) performance measures, originating from the field of statistics, are presented
below.
$$\mathrm{MSE} = \frac{\sum_{k=1}^{K} \left( Q_k - \hat{Q}_k \right)^2}{K} \qquad (2.4)$$

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{k=1}^{K} \left( Q_k - \hat{Q}_k \right)^2}{K}} \qquad (2.5)$$

$$\mathrm{MAE} = \frac{\sum_{k=1}^{K} \left| Q_k - \hat{Q}_k \right|}{K} \qquad (2.6)$$

In the above three equations, k is the dummy time variable for runoff, K is the number of data
elements in the period for which the computations are to be made, and $Q_k$ and $\hat{Q}_k$ are the observed
and the computed runoffs at the kth time interval respectively.
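These three error measures translate directly into code; the sketch below is illustrative (plain Python, with the observed and computed runoff series as equal-length sequences):

```python
def mse(observed, computed):
    """Mean squared error, equation (2.4)."""
    return sum((q - qc) ** 2 for q, qc in zip(observed, computed)) / len(observed)

def rmse(observed, computed):
    """Root mean squared error, equation (2.5)."""
    return mse(observed, computed) ** 0.5

def mae(observed, computed):
    """Mean absolute error, equation (2.6)."""
    return sum(abs(q - qc) for q, qc in zip(observed, computed)) / len(observed)
```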
Se/Sy
This statistic is the ratio of the standard error of estimate (Se) to the standard deviation (Sy).

$$S_e = \sqrt{\frac{1}{v}\sum_{k=1}^{K} \left( Q_k - \hat{Q}_k \right)^2} \qquad (2.7)$$

Se is the unbiased standard error of estimate, v is the number of degrees of freedom, equal to
the number of observations in the training set minus the number of network weights, and $Q_k$ and $\hat{Q}_k$
are the observed and predicted values of the output respectively.
The standard deviation (Sy) is calculated using the following equation:

$$S_y = \sqrt{\frac{\sum_{k=1}^{K} \left( Q_k - \bar{Q} \right)^2}{K - 1}} \qquad (2.8)$$

Se represents the unexplained variance and is usually compared with the standard deviation of the
observed values of the dependent variable (Sy). The ratio of Se to Sy, called the noise-to-signal ratio,
indicates the degree to which noise hides the information [after Gupta and Sorooshian, 1985]. If Se is
significantly smaller than Sy, the model can provide accurate predictions of y. If Se is nearly equal to or
larger than Sy, the model predictions will not be accurate. [Tokar and Johnson, 1999]
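As an illustration, the noise-to-signal ratio of equations (2.7) and (2.8) might be computed as follows (plain Python; the function name and interface are illustrative assumptions):

```python
def noise_to_signal(observed, predicted, n_weights):
    """Ratio Se/Sy from equations (2.7) and (2.8).

    n_weights is the number of network weights; the degrees of freedom
    are v = K - n_weights, with K the number of observations."""
    K = len(observed)
    v = K - n_weights
    se = (sum((q - p) ** 2 for q, p in zip(observed, predicted)) / v) ** 0.5
    mean = sum(observed) / K
    sy = (sum((q - mean) ** 2 for q in observed) / (K - 1)) ** 0.5
    return se / sy
```

A ratio well below 1 then indicates that the model explains most of the variance in the observations.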
$$R^2 = \frac{F_o - F}{F_o} = 1 - \frac{F}{F_o} \qquad (2.9)$$

where $F_o$ is the initial variance of the discharges about their mean, given by

$$F_o = \sum_{k=1}^{K} \left( Q_k - \bar{Q} \right)^2 \qquad (2.10)$$

and F is the residual model variance, i.e. the sum of the squares of the differences between the
observed discharges and the model estimates:

$$F = \sum_{k=1}^{K} \left( Q_k - \hat{Q}_k \right)^2 \qquad (2.11)$$

In these equations, k is the dummy time variable for runoff, K is the number of data elements in the
period for which the computations are to be made, $Q_k$ and $\hat{Q}_k$ are the observed and the computed
runoffs at the kth time interval respectively, and $\bar{Q}$ is the mean value of the runoff for the calibration
period.
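Equations (2.9) to (2.11) can be combined in a short sketch (plain Python; illustrative only):

```python
def nash_sutcliffe(observed, computed):
    """R2 (Nash-Sutcliffe) coefficient, equations (2.9)-(2.11)."""
    mean = sum(observed) / len(observed)
    f_o = sum((q - mean) ** 2 for q in observed)                   # (2.10)
    f = sum((q - qc) ** 2 for q, qc in zip(observed, computed))    # (2.11)
    return 1.0 - f / f_o                                           # (2.9)
```

A value of 1 indicates a perfect fit, while values near or below 0 indicate that the model performs no better than simply predicting the mean of the observations.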
Many performance measures that are based on statistical theories have the following drawbacks:
Peak magnitudes may be predicted perfectly, but timing errors in the prediction can cause the
residuals to be large (see Figure 3.11).
The residuals at successive time steps may be autocorrelated in time (see the first peak in Figure
3.11). Simple methods using summation of squared errors are based on statistical theories in
which predictions are considered independent and of constant variance. This is often not the
case when using hydrological models.
Instead of relying blindly on performance measures, a good visual evaluation of the hydrograph is
obviously imperative. On the other hand, complex hydrograph evaluations require a good performance
measure. Because no performance measure is ideal, a set of different measures is often used. Ideally,
the features of the measures chosen should complement each other without overlapping one another.
The measures that are used should provide useful insights into a model's behaviour in different
situations (e.g. the RMSE for peak flows, the MAE for low flows, Nash-Sutcliffe for overall
performance). Other measures penalise models that have excessive numbers of parameters (e.g. AIC
and BIC). Using more than one performance measure also allows comparisons with other studies
(there being no universally accepted measure of ANN skill) [after Dawson et al., 2002].
The following questions, mentioned in the chapter introduction on page 31, encapsulate the most
important aspects of ANN R-R modelling:
What information is to be provided to the model and in what form?
What is the ideal ANN type, ANN architecture and training algorithm?
What is the best way to evaluate model performance?
During the literature study on ANN R-R modelling (on which this chapter is mainly based), insights
that can help answer these questions were acquired, mainly from previous examinations by other
investigators.
The available data will be closely investigated before being applied to an ANN model, since important
information about the R-R relationships in a catchment can be gathered from them. Additionally,
errors in the data will have to be fixed and missing data filled in.
Trial-and-error procedures will have to be followed to determine the importance of (combinations
of) the various variables as ANN inputs. These variables can be time series like, for instance,
precipitation and discharge but also new variables that are derived from them, such as the rainfall
index (see 3.4.2 on input variables) or the natural logarithm of the discharge (see 3.5.2 on pre-
processing and post-processing of data).
The choice of ANN type discussed in 3.6.1 is limited to the possibilities of the software in this
investigation (see Chapter 4). The optimal values for ANN design parameters (such as training
algorithm, activation function and number of hidden neurons) will generally have to be found using
trial-and-error procedures.
A meta-algorithm or constructive algorithm will also be tested to examine the capabilities of these
types of algorithms in determining an optimal ANN architecture. The most common algorithm was
chosen: Cascade-Correlation (CasCor). See subsection 2.2.8 for a brief description of the meta-
algorithm and Appendix B for a more detailed definition of the algorithm.
The evaluation of model performance will comprise a combination of graphical interpretations and
performance measures. The most important criterion will be simply the visual interpretation of a linear
scale plot of the target values and the model approximations over the validation period. The
performance measures that will be used are the RMSE (a good overall performance indicator that
punishes a model for not approximating peaks) and the R² or Nash-Sutcliffe coefficient (a good
overall performance indicator that offers the opportunity of universal model comparison). The fourth
method that will be used is a scatter plot of the simulated values versus the target values.
Some questions, which were raised during the review of ANN R-R modelling, will be further examined
during this investigation:
Are the extrapolation capacities of an ANN model as poor as reported in other investigations?
Groundwater data can be a possible indicator for slow catchment runoff response (base flow)
and rainfall for fast runoff response (surface runoff). Is an ANN model capable of extracting
these relations from the available data? And do these variables complement each other in
terms of information content about catchment runoff behaviour, or do they introduce a
degree of redundancy if they are both used as model inputs?
Is the amount of available training data sufficient for an ANN model to learn the R-R
relationships in the catchment?
What are the advantages and disadvantages of using an ANN model purely as a global model
or as a time series model (see 3.2 on empirical modelling)? Is there possibly a good
compromise between the two approaches?
Chapter 4
Figure 4.1 - Screenshot of the original CT5960 ANN Tool (version 1).
⁸ M-files are ASCII text files that contain lines of Matlab programming language. Their file extension is .M,
hence the name.
Modification of an ANN Design Tool in Matlab
This so-called CT5960 ANN Tool was chosen to serve as a basis for a customized Matlab tool. The
main reason for using a custom tool was that this allowed the author to make use of the Cascade-
Correlation (CasCor) algorithm in Matlab. This algorithm, discussed in 2.2.8 and Appendix B, offers
several advantages and disadvantages over traditional learning algorithms that the author wished to
explore by making comparisons between CasCor networks and ANNs based on traditional learning
techniques.
A custom tool was necessary since the CasCor algorithm is not included in the latest version of the
Neural Network Toolbox (Toolbox version 4.0, Matlab version 6) and is therefore also not included in
the standard ANN design tool (NNTool) offered by the Neural Network Toolbox for Matlab. Embedding
of the CasCor algorithm in this NNTool was also not considered an option since it is not available as
open-source software and therefore cannot be modified.
The original CT5960 ANN Tool (from here on referred to as version 1, as opposed to the new,
modified version: version 2) offers the possibility to construct, train and test static feedforward multi-
layer ANNs.
ANN architecture
Since the hydrological problems for which the tool was designed are not very complex, the number of hidden layers has been limited to two. The user can choose between one, two or no hidden layers and can freely choose the number of neurons of which any hidden layer consists.
Four types of transfer functions for each layer can be chosen: two sigmoid functions, a purely linear
function and a saturating linear function (see 2.2.7).
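As an illustration of these architecture choices, a minimal sketch of such a feedforward network follows, in Python/NumPy rather than Matlab. The four transfer-function names follow the Neural Network Toolbox conventions (tansig, logsig, purelin, satlin), but the helper functions themselves are illustrative assumptions, not the tool's code.

```python
import numpy as np

# The tool's four transfer-function options: two sigmoids, a purely linear
# function and a saturating linear function.
TRANSFER = {
    "tansig":  lambda x: np.tanh(x),
    "logsig":  lambda x: 1.0 / (1.0 + np.exp(-x)),
    "purelin": lambda x: x,
    "satlin":  lambda x: np.clip(x, 0.0, 1.0),
}

def init_network(layer_sizes, rng=None):
    """Random weights and biases for a feedforward net.
    layer_sizes, e.g. [3, 5, 1]: 3 inputs, one hidden layer of 5 neurons,
    1 output; with no hidden layer it would be e.g. [3, 1]."""
    if rng is None:
        rng = np.random.default_rng(0)
    return [(rng.standard_normal((n_out, n_in)), rng.standard_normal(n_out))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(network, transfer_names, x):
    """Propagate an input vector x through the layers."""
    a = np.asarray(x, dtype=float)
    for (W, b), name in zip(network, transfer_names):
        a = TRANSFER[name](W @ a + b)
    return a
```

A two-layer network in the sense used in this report would then be, for example, `forward(init_network([3, 5, 1]), ["tansig", "purelin"], x)`.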
Figure 4.2 - Screenshot of the new CT5960 ANN Tool (version 2).
Various changes
Other changes in the tool include:
- The CT5960 ANN Tool now performs several checks while a user goes through the procedure of constructing an ANN. This way, the number of general error messages has been reduced. Some parts of the GUI become disabled whenever a user selection invalidates certain design parameters or when a certain feature cannot be used yet at a certain point in the procedure. Other times the user is shown pop-up message boxes that give information about, for example, limitations of the tool.
- The ANN-specific technical nomenclature used in the GUI of the tool has been changed to correspond with the nomenclature used in this report.
- The GUI design has been updated. In spite of additional buttons and pop-up menus, the tool's screen size has been reduced. Version 1 of the tool also needed to be initialized after start-up (this was done by pressing the Initialize button depicted in Figure 4.1). This initialization procedure is now automatically run when the tool is started.
Implementation method
The main additional feature offered by version 2 of the tool is the possibility to construct a Cascade-Correlation (CasCor) network. This algorithm is not included in the Neural Network Toolbox. The two
possibilities for implementing this algorithm into the CT5960 ANN Tool (and their advantages and
disadvantages) were:
1. Creating the customized learning algorithm and the accompanying network architecture in Matlab's Neural Network Toolbox format. According to Demuth and Beale [1998], the object-oriented representation of ANNs in the Neural Network Toolbox allows various architectures to be defined and various algorithms to be assigned to those architectures.
+ All other algorithm and network types in the CT5960 ANN Tool were implemented in
the Neural Network Toolbox format. This congruence would probably make it less
complex to embed the algorithm in the M-files of version 1 of the tool.
+ The Neural Network Toolbox standard offers several built-in algorithms, functions and training parameters to be applied to an ANN. By implementing the CasCor algorithm in Matlab's standard format for ANNs, these built-in features can be used freely in combination with the algorithm and the accompanying network.
- The author found it impossible to determine a priori whether the format used by the Neural Network Toolbox offered enough freedom to implement the CasCor algorithm, especially regarding the Toolbox's ability to handle algorithms that intervene in the network architecture. No way was found to resolve this uncertainty: previous implementations of the CasCor algorithm in Matlab were not found during the literature survey, nor did the Matlab Help section offer any conclusive information on this.
2. Programming a separate M-file with a custom implementation of the algorithm and network
architecture.
+ Complete freedom in the implementation of the algorithm (in terms of data structures, algorithm input and output, training algorithm variations, et cetera). This freedom can be especially important when examining variations of the standard algorithm and when additional features have to be built into the algorithm.
- Several algorithms, functions and training parameters would have to be programmed, because the built-in Matlab equivalents of these features are not compatible with a custom implementation of a CasCor ANN. The most complex of these features would undeniably be the training algorithm with which the CasCor network updates its weights.
The uncertainty about the capabilities of the Neural Network Toolbox format was a great drawback of the first method. Moreover, the flexibility offered by programming a custom implementation seemed very beneficial, because future additions and modifications of the algorithm seemed likely to occur: the author intended to test several variations of the CasCor algorithm. As a result of the apparent importance of the disadvantage of the first method and the advantage of the second, there was an inclination towards the second method.
The final decision was made after the author encountered a free software package (Classification
Toolbox for Matlab) offered by the Faculty of Electrical Engineering of Technion, Israel Institute of
Technology [Stork and Yom-Tov, 2002]. This toolbox contained an M-file, presumably containing an
implementation of the CasCor algorithm that was not based on the Neural Network Toolbox format.
After this discovery the choice was made to program a custom implementation of the CasCor algorithm in an M-file, using the contents of the Classification Toolbox M-file as a framework.
Appendix C contains the original M-file from the Classification Toolbox.
Figure 4.3 - The Cascade-Correlation architecture, initial state and after adding two hidden units. The vertical lines sum all incoming activation. Boxed connections are frozen, X connections are trained repeatedly. The +1 represents a bias input to the network (see footnote 9). [after Fahlman and Lebiere, 1991]
To every one of the network connections a weight is assigned to express the importance of the
connection. The weight matrix of this network structure therefore is as follows:
9 The bias in this CasCor network is different from the traditional bias discussed earlier (see 2.2.1). The bias in the CasCor network is an input bias (a constant input), whereas the traditional bias functions as a threshold value for the output of a neuron.
The number of rows in the weight matrix is equal to Ni + 1 + Nh (input units + bias + hidden neurons) and the number of columns to Nh + No (hidden neurons + output units).
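This storage scheme can be sketched as follows (Python/NumPy for illustration; the function name and the zeroing of entries that do not correspond to a valid cascade connection are illustrative assumptions, not the tool's code).

```python
import numpy as np

def cascor_weight_matrix(n_in, n_hidden, n_out, rng=None):
    """Weight matrix in the layout described above: one row per source unit
    (inputs + bias + hidden neurons) and one column per destination unit
    (hidden neurons + output units)."""
    if rng is None:
        rng = np.random.default_rng(1)
    n_rows = n_in + 1 + n_hidden      # input units + bias + hidden neurons
    n_cols = n_hidden + n_out         # hidden neurons + output units
    W = rng.standard_normal((n_rows, n_cols))
    # Hidden neuron j (column j) may only receive from the inputs, the bias
    # and hidden neurons installed before it, so later source rows are zeroed.
    for j in range(n_hidden):
        W[n_in + 1 + j:, j] = 0.0
    return W
```

For two inputs, two hidden neurons and one output this gives a 5 x 3 matrix, matching the Ni + 1 + Nh by Nh + No dimensions stated above.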
The network structure as programmed in the Classification Toolbox M-file describes a network in
which all neurons are connected to all preceding neurons, but not in the way Fahlman [1991]
described. This inaccurate form of the CasCor algorithm can be depicted as:
Between hidden neurons there is no connection weight with which the connection value is multiplied. (The weight matrix therefore has a different form than that of the original CasCor algorithm.) However, there is an operation between the two neurons. This operation, depicted by the blue line, is a subtraction of the preceding neuron's output value. In the case of more than two hidden neurons, all preceding neurons' output values are subtracted. The usefulness of this operation (instead of the original multiplication with a connection weight) is questionable.
The M-file from the Classification Toolbox was used as a framework for a custom implementation of
the CasCor algorithm. This approach saved time because (despite the flaws of the core of the CasCor
algorithm) the M-file structure could stay largely the same. Various functions, procedures and
variables could be copied directly from this framework version to the customized version. One minor
drawback of the Classification Toolbox implementation of the CasCor algorithm was that it was limited
to only one output neuron (see Figure 4.4). This shortcoming is yet to be resolved.
The diagram below shows what is programmed in the author's version of the CasCor algorithm M-file, in the form of a Program Structure Diagram (PSD).
The function F is a subroutine for calculating the output of the CasCor network:
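The subroutine itself is given as a diagram in the original. A hedged sketch of such a cascade forward pass, using the weight-matrix layout described earlier, is given below in Python (the hyperbolic tangent for the hidden neurons and the linear output units are assumptions for illustration, not the M-file's exact code).

```python
import numpy as np

def cascor_output(W, x, n_in, n_hidden, n_out, transfer=np.tanh):
    """Forward pass through a cascade network whose weights are stored in an
    (n_in + 1 + n_hidden) x (n_hidden + n_out) matrix: each hidden neuron sees
    the inputs, the +1 bias input and every previously installed hidden
    neuron; the (linear) output units see all of these sources."""
    sources = list(np.asarray(x, dtype=float)) + [1.0]   # inputs plus bias
    for j in range(n_hidden):
        net = sum(s * W[i, j] for i, s in enumerate(sources))
        sources.append(transfer(net))                    # install neuron j
    outputs = []
    for k in range(n_out):
        col = n_hidden + k
        outputs.append(sum(s * W[i, col] for i, s in enumerate(sources)))
    return np.array(outputs)
```

Note how the list of sources grows as hidden neurons are installed, which is exactly the cascading structure of Figure 4.3.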
Figure 4.7 - CasCor network with two input units (Ni=2) and two hidden
neurons (Nh=2).
The improvements over the backpropagation algorithm without variable learning rate and momentum
were minor. It was for this reason that a new training algorithm was embedded in the CasCor
algorithm M-file. The choice of which training algorithm to implement depended on two factors: first,
the performance of the training algorithm; and second, the amount of work required for programming
the algorithm.
The algorithm that was chosen for implementation was the Quickprop algorithm (see 2.2.8 for a
short description of the algorithm and Appendix B for details). This algorithm seemed relatively easy
to implement and is known as a significant improvement over standard backpropagation.
The algorithm that was constructed is a modification of the traditional Quickprop algorithm and is
based on the article in which Fahlman [1988] introduced the algorithm and on a slight modification of
it by Veitch and Holmes [1990].
network output ($P_k$ and $\bar{P}_k$), a variable that expresses the loss of generalisation ($GL$) and a variable that expresses the loss of goodness on a data set ($VL$). These variables are defined by:

$P_k(t) = 1000 \left( \dfrac{\sum_{t' \in \{t-k+1,\dots,t\}} E_{train}(t')}{k \cdot \min_{t' \in \{t-k+1,\dots,t\}} \left( E_{train}(t') \right)} - 1 \right)$   (3.2)

$\bar{P}_k(t) = 10 \left( \dfrac{\max_{t' \in \{t-k+1,\dots,t\}} \left( G_{train}(t') \right)}{G_{train}(t)} - 1 \right)$   (3.3)

$GL(t) = 100 \left( \dfrac{E_{cross}(t)}{E_{cross,optimal}(t)} - 1 \right)$   (3.4)
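Two of these measures can be sketched directly in code (Python for illustration; the interpretation of $E_{cross,optimal}$ as the lowest cross-training error observed so far is an assumption, and the function names are illustrative).

```python
import numpy as np

def generalisation_loss(e_cross, e_cross_optimal):
    """GL = 100 * (E_cross / E_cross,optimal - 1): the percentage by which the
    current cross-training error exceeds the best value observed so far."""
    return 100.0 * (e_cross / e_cross_optimal - 1.0)

def training_progress(e_train_history, k):
    """P_k = 1000 * (sum of the last k training errors /
    (k * minimum of the last k training errors) - 1): how much the training
    error still fluctuates within a window of k epochs."""
    window = np.asarray(e_train_history[-k:], dtype=float)
    return 1000.0 * (window.sum() / (k * window.min()) - 1.0)
```

When the training error is flat over the window, the progress measure drops to zero, while a rising cross-training error drives the generalisation loss up; together they signal when training should stop.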
These three training algorithms are all non-constructive algorithms. Therefore, an appropriate network architecture had to be chosen for these algorithms to train on. Based on former experiences and rules of thumb, the following network architecture was used:
- Two-layer ANNs (one hidden layer);
- Five hidden neurons;
- Hyperbolic tangent activation functions in the hidden layer, a linear activation function in the output neuron.
All algorithms were trained using their standard training parameters, as defined in Matlab. The CasCor
algorithm was trained using a learning rate of 2.
The data set was split up as follows: 50% training data, 30% cross-training data and 20%
validation data.
In one test the ANNs were used as time-series models. The natural logarithm of the discharge (ln(Q)) is predicted using its three previous time steps. The goal of the other tests was to approximate the relationship between two correlated variables. The most obvious variables were chosen: precipitation and discharge. The last three values of the precipitation were used to predict the discharge at the following time step. Table 4.1 shows the results for the best of 5 runs of each algorithm.
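The data preparation described above (lagged inputs predicting the next value, and the 50/30/20 split) can be sketched as follows; Python for illustration, and the rounding of the split boundaries is an assumption.

```python
import numpy as np

def lagged_matrix(series, n_lags):
    """Input/target pairs for a time-series model: the n_lags previous values
    of the series predict the value one step ahead."""
    series = np.asarray(series, dtype=float)
    inputs = np.array([series[i:i + n_lags]
                       for i in range(len(series) - n_lags)])
    targets = series[n_lags:]
    return inputs, targets

def split_data(n, f_train=0.5, f_cross=0.3):
    """Index slices for the 50% training / 30% cross-training /
    20% validation split used above."""
    i_train = int(n * f_train)
    i_cross = i_train + int(n * f_cross)
    return slice(0, i_train), slice(i_train, i_cross), slice(i_cross, n)
```

For the time-series test, `lagged_matrix(np.log(q), 3)` would give the three previous ln(Q) values as inputs and the next ln(Q) value as target.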
Table 4.1 - Comparison of the CasCor algorithm with three other training algorithms.

                                GDx     CGb     L-M     CasCor
Time series            RMSE     0.512   0.339   0.325   0.329
                       R^2 (%)  61.0    83.5    83.4    84.1
Correlated variables   RMSE     4819    4864    4825    4872
                       R^2 (%)  17.8    18.3    22.9    19.0
These tests seem to indicate that the current implementation of the Cascade-Correlation algorithm is functioning as it is supposed to. No errors were encountered during these tests and its performance keeps pace with that of the other training algorithms.
Chapter 5 will provide more details about the algorithm's performance than this short review. A sensitivity analysis on several algorithm parameters is presented in 5.4.2. Some minor performance-related modifications of the algorithm are finally discussed in 5.5.
Chapter 5
5 Application to Alzette-Pfaffenthal
Catchment
Data from a part of the Alzette catchment in Luxembourg has been utilized for developing and testing various ANN R-R models. A short description of the catchment is given in 5.1, after which some data
processing aspects are explained in 5.2. Section 5.3 presents a hydrological analysis of the data.
The process of ANN design is elaborated in 5.4. This section concludes with a review of 32 ANN R-R
models. Discussion of these models and some additional tests can be found in the fifth and final
section of this chapter.
The main criterion for model performance is the RMSE. The Nash-Sutcliffe coefficient (R^2) is the second most important. The graphical interpretation of the linear plot of the targets versus the model simulations, however, can always overrule these measures. Scatter plots of the targets versus the simulations are also sometimes presented, but these are unlikely to be a reason for the rejection of a model.
The results of ANN performance tests that are presented in this chapter are often the best results of a number of tests. Sometimes these test runs are separately mentioned in a table, but often what is presented is the best-performing and most representative ANN found after about three to five test runs.
Several specific abbreviations and notations are used in this chapter to be able to concisely present
test setups and test results. Refer to the Notation section at the end of this report for an explanation
of these notation methods.
[Scatter plot: groundwater level at Dumontshaff (horizontal axis) versus groundwater level at Fentange (vertical axis).]
[Scatter plot with groundwater level at Fentange on the horizontal axis.]
The only problem that remained was the occurrence of synchronous gaps in the two data sets. These hiatuses have been filled by linear interpolation between the last known value before each gap and the first known value after it. The resulting time series are shown in Figure 5.6 and Figure 5.7.
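This gap-filling step can be sketched as follows (Python/NumPy for illustration; `np.interp` performs exactly this last-known-to-next-known linear interpolation when evaluated at the gap positions).

```python
import numpy as np

def fill_gaps(series):
    """Replace NaN gaps by linear interpolation between the last known value
    before each gap and the first known value after it."""
    series = np.asarray(series, dtype=float)
    known = ~np.isnan(series)
    positions = np.arange(len(series))
    filled = series.copy()
    filled[~known] = np.interp(positions[~known],
                               positions[known], series[known])
    return filled
```

A two-step gap between groundwater levels 1 and 4, for example, is filled with the values 2 and 3.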
Figure 5.6 - Groundwater level at location Fentange. The red line is the original time series, the blue line shows simulated values (using the polynomial equation and the linear interpolation process).
Figure 5.7 - Groundwater level at location Dumontshaff. The red line is the original time series, the blue line shows simulated values.
Figure 5.8 - Probability function of discharge data.
Figure 5.9 - Probability function of the natural logarithm of discharge data.
Four tests were done to determine whether using the natural logarithm of the discharge as output is
useful. These tests were done with the Levenberg-Marquardt and the CasCor algorithm. The first two
tests use only rainfall data as input, the latter two also use a groundwater time series.
N.B.
The results using lnQ have been post-processed in order to make the performance measures comparable. Undoing the natural logarithm transformation is realised using the following equation:

$Q = e^{\ln Q}$   (4.2)
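In code this post-processing step amounts to exponentiating the lnQ model output before scoring (Python for illustration; the function name is illustrative).

```python
import numpy as np

def comparable_rmse(ln_q_predictions, q_targets):
    """Undo the natural-logarithm transformation before computing the RMSE,
    so that scores of lnQ models can be compared with those of Q models."""
    q_predictions = np.exp(np.asarray(ln_q_predictions, dtype=float))
    q_targets = np.asarray(q_targets, dtype=float)
    return np.sqrt(np.mean((q_targets - q_predictions) ** 2))
```

Without this step the lnQ models would report errors on the logarithmic scale, which are orders of magnitude smaller and not comparable.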
                 L-M     CasCor
1   RMSE         4687    5068
    R^2 (%)      23.8    15.7
2   RMSE         5294    5389
    R^2 (%)      0.3     -0.7
3   RMSE         3550    3788
    R^2 (%)      59.6    42.6
4   RMSE         3694    3853
    R^2 (%)      42.4    36.5
5   RMSE         3392    3645
    R^2 (%)      67.7    46.5
6   RMSE         3303    3750
    R^2 (%)      59.5    41.6

Test setups: 1: P at -2 -1 0, Q at +1; 2: P at -2 -1 0, lnQ at +1; 3: P and GwF at -2 -1 0, Q at +1; 4: P and GwF at -2 -1 0, lnQ at +1; 5: P, ETP and GwF at -4 to 0, Q at +1; 6: P, ETP and GwF at -4 to 0, lnQ at +1. Settings for tests 1-4: L-M with 4 hidden neurons (tansig), CasCor with LR = 2; for tests 5 and 6: L-M with 8 hidden neurons.
[Figure 5.10 and Figure 5.11 - Target values and network predictions of Q over the test set; the annotated run shows RMSE 3392.0 and R^2 67.7%.]
Using Q instead of lnQ produces much better results if the prediction is based purely on rainfall input.
This is also the case if groundwater data is added as an ANN model input. In the latter case, however,
the test results using lnQ show a relative increase in performance compared to test results where Q is
used.
The reason for this can be found when examining the probability distribution plots of Q, lnQ, P and
GwF. The task of finding relationships between data can be made less difficult for an ANN by using
data that have probability distributions that show similarities. The distributions of P and Q show more
similarities than those of P and lnQ, which is why the results of test 1 are better than those of test 2.
The groundwater time series, however, is more easily related to the lnQ time series than to the Q time
series (cf. tests 2 and 4). The reason for this lies in the fact that there is more similarity between the
distributions of lnQ and GwF than between the distributions of Q and GwF. The same effect is
noticeable when adding ETP as an input. In that case, the model using lnQ as output even
outperforms the one using Q in terms of the RMSE (cf. test 5 and 6). The results of the latter two
tests have been plotted in Figure 5.10 and Figure 5.11. The model using lnQ has a lower RMSE, but
cannot be considered a much better model, since peak discharges are not predicted very well. It
does, however, predict low flows better.
Concluding: the more input variables that do not have the same probability distribution as Q are
used, the more lnQ appears to be a better output variable to use. The point at which lnQ is preferable
is yet unknown. It is for this reason that both output variables (Q and lnQ) will be further tested in the
remainder of this investigation.
Figure 5.12 - Probability function of rainfall data.
Figure 5.13 - Probability function of groundwater data at location Fentange.
Another variable that was created and tested for the same reason was lnETP. This variable contains the natural logarithm of ETP. Tests showed no improvements in the prediction of either Q or lnQ when lnETP was used as an input instead of ETP. The reason for this is that the distribution of ETP (Figure 5.14) is closer to both the distribution of Q (Figure 5.8) and that of lnQ (Figure 5.9) than the distribution of lnETP (Figure 5.15) is. This once again demonstrates the validity of the aforementioned premise about the advantage of using similar probability distributions for input and output variables.
Figure 5.14 - Probability function of ETP.
Figure 5.15 - Probability function of lnETP.
The rainfall variable has not been transformed. The large number of zero values in this time series means that a transformation like the ones above would result in infinite values and a useless probability distribution.
Figure 5.16 shows the daily rainfall time series and Figure 5.17 shows the cumulative rainfall over time. Winter and summer seasons are separated by the dashed red lines. Extreme rainfall events seem to take place mainly in the winter (1996, 1998 and 1999). Other rainfall events, however, seem to be distributed equally over summer and winter periods (the approximately constant derivative of the cumulative precipitation shows this). Fortunately, there is no clear trend in the rainfall time series (an ANN model would have trouble dealing with such a trend, see 3.5.1).
Figure 5.16 - Daily rainfall in mm over time. (Red dotted lines separate the hydrological seasons.)
Figure 5.17 - Cumulative rainfall in mm over time. (Red dotted lines separate the hydrological seasons.)
The discharge values over time have been plotted in Figure 5.18. This figure shows that most of the
catchment discharge takes place during the winter periods.
Figure 5.18 - Daily discharge values in l/s over time. (Red dotted lines separate the hydrological seasons.)
In the figure below, both the rainfall (blue) and runoff (green) have been plotted over a short period
of time. This detail shows that the rainfall peaks and the runoff peaks often coincide. However,
sometimes the runoff response is distributed over the time step of maximum rainfall and the
subsequent time step. The response of the catchment in the form of runoff due to rapid runoff
processes takes place within a day (and likely within just a few hours).
Since the data time interval is one day for all variables, it can be concluded that the timescale of
the available data is somewhat large in comparison with the catchment response. As a result of this,
the timing of the runoff peak prediction will be less accurate.
[Figure 5.19 - Precipitation (blue) and runoff (green) over time steps 780 to 850.]
A double-mass curve of rainfall and runoff (Figure 5.20) plots the cumulative rainfall versus the cumulative runoff. The periodic increase and decrease of the slope of the blue line results from what has been observed above: the discharge is high in the winter and low in the summer, while the rainfall is approximately constant.
Figure 5.20 - Double-mass curve of rainfall and discharge. (The red line is simply given as a straight-line reference.)
The reason for this behaviour lies in the combined effects of two phenomena:
Seasonal variation in evaporation:
Figure 5.21 - Evapotranspiration over time. (Red dotted lines separate the hydrological seasons.)
Figure 5.22 - Groundwater level at location Fentange over time. (Red dotted lines separate the hydrological seasons.)
Concluding, it can be stated that the hydrological regime in the Alzette-Pfaffenthal catchment is defined by rainfall and evaporation. The low net precipitation (precipitation minus evapotranspiration) in the summer period means that the infiltration excess mechanism does not occur, so that water can infiltrate and groundwater is replenished. During the wintertime this stored water quickly runs off as a result of the saturation excess mechanism. The high net precipitation during the rest of the winter causes the infiltration excess mechanism to occur, which is why the groundwater level stays low and runoff is high in this period.
Rainfall
The cross-correlation between the rainfall and runoff time series was examined in order to be able to
determine the effect of previous rainfall values on current discharge values. Figure 5.23 shows a plot
of this cross-correlation expressed as a standardized coefficient.
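Such a standardized cross-correlation coefficient can be computed as a Pearson correlation between the lagged rainfall series and the runoff series; the Python sketch below illustrates this (the exact standardization used in the thesis is not shown in this excerpt, so this formulation is an assumption).

```python
import numpy as np

def standardized_cross_correlation(rain, runoff, lag):
    """Pearson correlation between rainfall `lag` steps back and runoff,
    i.e. the influence of rainfall `lag` days ago on today's runoff."""
    rain = np.asarray(rain, dtype=float)
    runoff = np.asarray(runoff, dtype=float)
    if lag > 0:
        rain, runoff = rain[:-lag], runoff[lag:]
    return np.corrcoef(rain, runoff)[0, 1]
```

Evaluating this for lags 1 to 25 reproduces the kind of curve shown in Figure 5.23.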
Figure 5.23 - Cross-correlation between rainfall and runoff time series, expressed by a standardized correlation coefficient.
The correlation between rainfall and runoff quickly decreases as the time lag grows. A time lag of 0 shows a very high correlation, indicating the importance of the rainfall within the same time interval as the discharge. This was also visible in Figure 5.19. Nevertheless, the rainfall information from the current time step alone is unlikely to produce a perfect approximation of the discharge one time step (one day) ahead.
A new variable was created: RI. This variable contains a so-called rainfall index, described in 3.4.2. The memory length for the RI was chosen as 15. The coefficient for each value is set equal to the corresponding cross-correlation coefficient in the figure above, divided by the sum of these coefficients. The rainfall index could be an indicator of delayed flow processes.
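A sketch of such a rainfall index follows (Python for illustration; whether the first weight applies to the most recent or to an earlier time step is not specified in this excerpt, so weighting the most recent step first is an assumption).

```python
import numpy as np

def rainfall_index(rain, cross_corr):
    """Weighted sum of previous rainfall values, where the weights are the
    cross-correlation coefficients normalised so that they sum to one.
    cross_corr[0] is assumed to weight the most recent time step."""
    rain = np.asarray(rain, dtype=float)
    weights = np.asarray(cross_corr, dtype=float)
    weights = weights / weights.sum()
    memory = len(weights)                  # e.g. 15 in the thesis
    return np.array([np.dot(weights, rain[t - memory:t][::-1])
                     for t in range(memory, len(rain) + 1)])
```

Because the weights sum to one, the index stays on the same scale as the rainfall itself, acting as a correlation-weighted moving average of the previous days' rainfall.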
Using the RI as additional input data to the model besides the rainfall time series seems to bring about only small improvements (cf. tests 2 and 5). It can be concluded that this variable is not a very good indicator of delayed flow processes.
Evapotranspiration
In the following tests the best way to provide the ANN with evapotranspiration information was
investigated. A new variable containing the net rainfall (Pnet) was created by subtracting the
evapotranspiration data from the rainfall data.
The best way to present evapotranspiration to the ANN R-R model is to simply use the
evapotranspiration series or the natural logarithm of this series as network input. Pre-processing by
subtracting evapotranspiration from rainfall even deteriorates the model performance. The reason for
this is probably that the evapotranspiration time series also indirectly provides the model with
seasonal information. This information contained in the evapotranspiration data is partially cancelled
out when it is subtracted from the rainfall data.
Groundwater
The influence of the two available groundwater series on the model predictions has also been tested.
The groundwater data seems to be of great value to the ANN model, especially in combination with the rainfall data. The groundwater time series probably is an indicator for delayed runoff processes and therefore complements the rainfall series, which probably is an indicator for rapid runoff processes. This statement will be verified using additional tests in 5.5. A comparison between the results from these tests and the tests using the rainfall index also shows that groundwater is a much better indicator of delayed flow processes than the rainfall index.
The GwF time series carries more information about runoff than the GwD time series. The logical reason for this is that Fentange is located further downstream along the Alzette river than Dumontshaff, and is therefore a better indicator for runoff at the catchment outlet. Using GwD as additional input besides GwF does not seem to help the ANN model (cf. tests 4 and 5). The two groundwater time series probably show a great deal of overlap in their information content. This is in accordance with the fact that much of the GwF data was generated using its correlation with GwD and vice versa.
Discharge
Discharge data is often available in real-world applications of ANN models. Since previous discharge
values are obviously correlated to future discharge data, it seems logical to use them as ANN model
inputs. Figure 5.24 shows the autocorrelation in the discharge time series.
[Figure 5.24 - Autocorrelation of the discharge time series for time lags -25 to -1.]
N.B.
Using previous discharge values as model inputs means that the ANN R-R model can no longer be
classified as a pure cause-and-effect model. It is then partially a time series model. This is an
important distinction, because cause-and-effect models and time series models represent two
completely different approaches in empirical modelling (respectively global versus local empirical
modelling, see 3.2.3).
Some tests were done to determine how many previous discharge time steps are of value to an ANN
model in predicting the discharge at the following time step.
The additional value of a larger number of previous values grows to a maximum. The reason for this stagnation in performance lies mainly in the fact that the autocorrelation decreases as the time interval grows. The reason for deteriorating performance (cf. L-M, tests 2 and 3) lies in the fact that the information content of the additional variables overlaps with that of each other and with that of the previously used variables. Since the inputs used in test 3 contain the same information as those in test 2, there must be an ANN using the inputs from test 3 that is able to produce the same result as test 2. Such a network, however, is hard to find because the redundancy in the input data introduces a degree of overtraining.
The performance of the time series prediction seems satisfactory, but a closer look at the prediction (the result of test 1 is shown in Figure 5.25) reveals an obvious flaw in the model's approximation. Using previous discharge values as inputs results in a prediction that seems shifted in time. Another characteristic of this prediction is that it fails to approximate the peak values (as well as most minimum values).
[Figure 5.25 - Target values and network prediction of Q over the test set for test 1.]
The reason for this time lag problem is explained in Figure 5.26. Suppose the ANN model has only received the T0 value as input variable and has T+1 as target output. The model has to apply a transformation to the T0 values to produce an approximation of T+1. Two different situations can be distinguished:
1. The T0 value is descending. The T0 value generally should be transformed so that the absolute value of the outcome is smaller than the T0 value.
2. The T0 value is ascending. The T0 value generally should be transformed so that the absolute value of the outcome is larger than the T0 value.
The transformation needed in situation 1 contradicts the transformation needed in situation 2. If the ANN is unable to distinguish the two situations, it will choose a compromise: instead of making the output value bigger or smaller than the T0 value, it will keep it at approximately the same value. This causes the prediction of T+1 by the model (T+1p) to be very similar to the T0 line.
But even if the model were able to distinguish the two situations, the time lag effect would still occur: at the first extreme value, the T0 line is descending (situation 1). This situation prescribes that the T0 value should be transformed so that the absolute value of the outcome is smaller than the T0 value. If this is done, the effect is still a lagged extreme value, as shown in Figure 5.26.
The problem of all situations mentioned above is the word generally. The response to these situations
is indeed generally correct. The response is dictated by the ANN weights, which means that these
weights are generally correct and therefore produce the smallest error. This is why the training
algorithm determines the weights as they are. As an inevitable consequence, the time lag effect
occurs.
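The compromise described above can be made concrete with a toy experiment: for a smooth series, the least-squares linear response that maps T0 to T+1 is almost the identity, so the resulting prediction is essentially the input series shifted one step. The Python sketch below uses synthetic data for illustration only, not the thesis data.

```python
import numpy as np

# A smooth, discharge-like synthetic series (illustration only).
steps = np.arange(200)
q = 2.0 + np.sin(0.1 * steps)

t0, t1 = q[:-1], q[1:]                    # input T0 and target T+1 pairs
a = np.dot(t0, t1) / np.dot(t0, t0)       # least-squares scalar response a*T0
prediction = a * t0

# The fitted response is almost the identity, so the "prediction" of T+1 is
# essentially the T0 line itself: exactly the time lag effect described above.
lag_error = np.sqrt(np.mean((t1 - prediction) ** 2))
```

Here the fitted coefficient comes out very close to 1, and the prediction matches the (lagged) T0 line far more closely than the T+1 targets, which is the mechanism behind the shifted hydrographs in Figure 5.25.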
[Figure 5.26 - Sketch of the time lag effect, showing the T0, T+1 and T+1p lines.]
The other problem in time series modelling is the failure to approximate peak values. This is a result of using discharge values from more than one time step back as inputs to the ANN model.
Suppose a network also has the T-1 variable as a model input. Besides the correlation with T0, the
T+1 variable now also has a correlation with T-1. As can be seen in Figure 5.27, the value of T-1 is
often more to the mean value of all lines. The positive correlation between T+1 and T-1 therefore
causes the approximation of T+1 to be more near the mean value of the T+1 line than the maximum
or minimum of the T+1 line. Hence, the ANN model is less able to approximate the peak values the
more the model focuses on variables further back in time. If we force the model to focus on a variable
further back in time by presenting only the T-3 value as input and T+1 as target value, the extreme
values are approximated badly, as can be seen in Figure 5.28.
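The pull toward the mean can be quantified with a similar sketch (Python/NumPy, synthetic data; not the Alzette series): the correlation with the target T+1 decays for inputs further back in time, and for a least-squares fit the spread of the prediction shrinks by exactly that correlation factor, flattening the peaks.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(2000)
q = 2.0 + np.sin(2 * np.pi * t / 50) + 0.1 * rng.standard_normal(2000)

def lag_corr(series, lag):
    """Correlation between the series and itself `lag` steps earlier."""
    return np.corrcoef(series[lag:], series[:-lag])[0, 1]

# Correlation with the target decays as the input lies further back:
r1, r2, r4 = lag_corr(q, 1), lag_corr(q, 2), lag_corr(q, 4)
print(f"T0 vs T+1: {r1:.3f}, T-1 vs T+1: {r2:.3f}, T-3 vs T+1: {r4:.3f}")

# For a least-squares fit, std(prediction) = |corr| * std(target):
# a weaker correlation shrinks the prediction toward the mean, which
# is the flattened-peaks effect of Figure 5.28.
x, y = q[:-4], q[4:]               # only T-3 as input, T+1 as target
w, b = np.polyfit(x, y, 1)
y_hat = w * x + b
print(f"std(target) = {y.std():.3f}, std(prediction) = {y_hat.std():.3f}")
```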
[Figure 5.27 - Time series lines T-1, T0 and T+1, and the prediction T+1p.]
Application to Alzette-Pfaffenthal Catchment
[Figure 5.28 - Target values and network prediction of Q (x 10^4) against time points of the test set, using only T-3 as input (RMSE: 4455; R2: 11.5%).]
This subsection concludes with an examination of a combination of global and local empirical modelling. This method comes down to combining input variables such as rainfall and groundwater (global modelling) with input containing information about the time series itself (local modelling).
The goal of these tests is to find out whether an ANN using rainfall, groundwater and evapotranspiration data as inputs can be made to perform better by adding previous values of the discharge (preferably without introducing the time lag problem mentioned above).
Table 5.7 - Comparative tests of a cause-and-effect model and various combinations of cause-and-effect and time series models.

            CGb     L-M     CasCor
1  RMSE     3403    3445    3415
   R^2 (%)  57.7    66.7    56.3
2  RMSE     3069    3060    3102
   R^2 (%)  72.0    73.0    70.9
3  RMSE     3164    2980    3202
   R^2 (%)  70.9    72.5    70.6
4  RMSE     3091    3007    3054
   R^2 (%)  64.7    74.0    73.5

Settings: CGb, L-M: 12 hidden neurons, tansig (4 hidden neurons for test 4); CasCor: LR=8. All models predict Q at +1. Inputs: 1: P, GwF and ETP at -4 to 0; 2: P, GwF and ETP at -4 to 0, Q at 0; 3: P, GwF and ETP at -4 to 0, Q at -2 -1 0; 4: Q at 0.
Test 3 showed that the ANN is often unable to approximate extreme values due to the addition of Q at time instances -2 and -1. Tests 2 and 3 both showed the time lag problem discussed above. No way of preventing this problem has been found.
Training algorithm
The following table shows the results of tests of several training algorithms. Since the performance of some algorithms varies with the complexity of the ANN architecture, a test architecture was chosen that is representative of the problem under investigation (based on the best test results so far):
Prediction: lnQ at +1
Input: P, ETP and GwF at -8 -6 -4 -2 -1 0
16 hidden neurons, tansig
Table 5.8 - Results of comparative training algorithm tests. The bold-faced values mark
the best result of the six test runs for each training algorithm.
Conclusion:
The Levenberg-Marquardt algorithm is the most consistently well-performing algorithm. Another algorithm that stands out is the BFG algorithm. The performance of the various Conjugate Gradient algorithms is similar and quite good, except for the scaled version (sCG). Despite its high score in the first run, the Backpropagation (GDx) algorithm's performance is not considered satisfactory; the very good performance in run 1 looks like a fluke.
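For reference, the core update of the Levenberg-Marquardt method can be sketched on a toy least-squares problem (Python/NumPy; an illustration of the update rule only, not the implementation inside the CT5960 ANN Tool). The damping factor mu blends Gauss-Newton steps (mu small) with cautious gradient-descent-like steps (mu large), which is what makes the algorithm both fast and robust on network training problems.

```python
import numpy as np

# Toy model: y = exp(a * x); recover a from data generated with a = 0.7.
x = np.linspace(0.0, 1.0, 20)
y_obs = np.exp(0.7 * x)

def residuals(a):
    return y_obs - np.exp(a * x)

def jacobian(a):
    # Derivative of the residuals with respect to a, as a column vector.
    return (-x * np.exp(a * x)).reshape(-1, 1)

a, mu = 0.0, 1e-2          # initial parameter and damping factor
for _ in range(20):
    r, J = residuals(a), jacobian(a)
    # L-M step: solve (J^T J + mu*I) delta = J^T r, then a <- a - delta.
    delta = np.linalg.solve(J.T @ J + mu * np.eye(1), J.T @ r)
    a_new = a - delta.item()
    if np.sum(residuals(a_new) ** 2) < np.sum(r ** 2):
        a, mu = a_new, mu * 0.5    # step accepted: trust Gauss-Newton more
    else:
        mu *= 2.0                  # step rejected: damp more strongly
print(f"estimated a = {a:.4f}")    # converges toward 0.7
```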
Transfer function
Several transfer functions were tested in combination with the following ANN:
Prediction: Q at +1
Input: P, ETP and GwF at -8 -6 -4 -2 -1 0
Training: L-M and BFG
16 hidden neurons
Table 5.9 - Results of comparative transfer function tests.

                  L-M run 1  L-M run 2  L-M run 3  BFG run 1  BFG run 2  BFG run 3
purelin  RMSE     3797       3797       3797       3801       3841       3792
         R^2 (%)  43.0       38.7       39.0       38.7       36.8       42.8
satlins  RMSE     3466       3684       3620       3468       3583       3589
         R^2 (%)  64.7       45.9       52.3       53.9       52.4       50.1
logsig   RMSE     3498       3606       3398       3839       3601       3506
         R^2 (%)  56.9       54.3       66.8       34.4       53.9       58.1
tansig   RMSE     3560       3428       3400       3741       3511       3486
         R^2 (%)  62.1       59.9       60.9       48.9       57.7       59.4
Conclusion:
The symmetrical saturated linear transfer function (satlins) produces surprisingly good results, considering its largely linear nature. As mentioned in 2.2.7, the non-linearity of transfer functions is what is supposed to enable ANNs to map non-linear relations. The hyperbolic tangent and logarithmic sigmoid transfer functions also produce satisfying results, as expected.
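The four functions compared in Table 5.9 can be written out explicitly (a Python/NumPy sketch following the MATLAB-style names used in the text):

```python
import numpy as np

def purelin(x):        # linear: f(x) = x
    return x

def satlins(x):        # symmetric saturated linear: clipped to [-1, 1]
    return np.clip(x, -1.0, 1.0)

def logsig(x):         # logarithmic sigmoid: output in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tansig(x):         # hyperbolic tangent sigmoid: output in (-1, 1)
    return np.tanh(x)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(purelin(x))                  # unchanged
print(satlins(x))                  # endpoints clipped to -1 and 1
print(np.round(logsig(x), 3))
print(np.round(tansig(x), 3))
```

Note that satlins is only piecewise linear: the saturation at -1 and 1 is itself a non-linearity, which may be why it performs far better than the truly linear purelin in the table above.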
Error function
Figure 5.29 shows ANN predictions in case of using respectively the Mean Squared Error (MSE) and
the Mean Absolute Error (MAE) as error functions on which the ANN is trained (see 2.2.8 for an
explanation of the goal of the error function). These predictions were obtained from the best of 10
runs using each of the error measures. The ANN that was used is as follows:
Prediction: Q at +1
Input: P, ETP and GwF at -8 -6 -4 -2 -1 0
Training: L-M
16 hidden neurons, tansig
[Plot of Q (x 10^4) against time points of the test set. MSE: RMSE 3511, R2 61.1%; MAE: RMSE 3737, R2 51.1%.]
Figure 5.29 - Best model performance using the MSE and MAE as error function for ANN
training.
Conclusion:
Theoretically, the MSE should be better at approximating peak values than the MAE, since this error function amplifies large errors. Such large errors most often occur at points where the target time series shows a high peak that the model is unable to follow. This is indeed often the case (the RMSE, which uses the same amplification of errors, is lower). This is the reason for preferring the MSE error function over the MAE, even though the difference between the two error measures is not very large, as can be concluded from the figure above.
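The amplification argument can be made concrete with a small numeric example (Python; the error values are illustrative, not taken from the tests): two error patterns with the same total absolute error get the same MAE, but the one containing a single missed peak gets a much larger MSE.

```python
import numpy as np

def mse(e):
    return np.mean(e ** 2)

def mae(e):
    return np.mean(np.abs(e))

e_even = np.array([1.0, 1.0, 1.0, 1.0])   # error spread evenly over time
e_peak = np.array([0.0, 0.0, 0.0, 4.0])   # same total error, one missed peak

print(mae(e_even), mae(e_peak))   # 1.0 1.0 -> MAE cannot tell them apart
print(mse(e_even), mse(e_peak))   # 1.0 4.0 -> MSE penalises the missed peak
```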
ANN architecture
The following table shows the results of several tests on different ANN architectures. The number of
hidden layers in the CT5960 ANN Tool is limited to two. The network that was used is similar to that
in the previous tests:
Prediction: Q at +1
Input: P, ETP and GwF at -8 -6 -4 -2 -1 0
Training: L-M and BFG
                   L-M run 1  L-M run 2  L-M run 3  BFG run 1  BFG run 2  BFG run 3
2+0     RMSE       3431       3466       3516       4520       3956       3686
        R^2 (%)    61.8       66.7       56.4       7.0        26.1       51.5
4+0     RMSE       3510       3386       3492       3591       3468       3572
        R^2 (%)    60.0       61.2       58.4       50.5       56.9       51.4
8+0     RMSE       3290       3735       3386       3485       3426       3546
        R^2 (%)    69.7       51.0       64.4       57.6       59.6       51.9
16+0    RMSE       3458       3385       3516       3610       3515       3452
        R^2 (%)    56.8       56.8       62.9       48.2       49.6       49.5
32+0    RMSE       3459       3529       3713       3587       3694       3568
        R^2 (%)    58.9       56.8       52.7       46.6       40.2       53.5
64+0    RMSE       3823       4159       3716       3598       3711       3658
        R^2 (%)    49.5       35.5       51.2       46.6       49.8       53.2
8+2     RMSE       3820       3363       3559       3673       3586       3523
        R^2 (%)    34.1       65.3       53.6       56.3       52.2       50.1
8+4     RMSE       3658       3418       3427       3512       3789       3503
        R^2 (%)    52.3       64.9       63.9       56.9       49.8       57.8
8+8     RMSE       3395       3459       3518       3516       3362       3577
        R^2 (%)    62.8       60.6       61.3       58.9       60.2       56.8
8+16    RMSE       3641       3519       3595       4152       3512       3516
        R^2 (%)    76.0       60.7       56.1       19.7       59.0       57.2
8+32    RMSE       3579       3891       3664       3528       3759       3997
        R^2 (%)    53.5       51.8       56.8       46.9       42.8       32.8
Conclusion:
What can be concluded from the first six tests is that network performance does not keep increasing with the number of hidden neurons. At some point, the generalisation capability of the ANN starts to decrease as a result of the overtraining effect. The overtraining effect is due to the large number of parameters in proportion to the information content of the data (as discussed in 2.4.2). These tests confirm the statement by Shamseldin [1997] that in some cases the information-carrying capacity of data does not support more sophisticated models or methods.
The difference in performance between three-layer and two-layer ANNs is very small. Provided that the number of neurons in the second hidden layer is not too small or too large in comparison with the number of neurons in the first hidden layer, a three-layer network may produce marginally better results.
The input and output variables that were used are the same as in many of the tests above:
Prediction: Q at +1
Input: P, ETP and GwF at -8 -6 -4 -2 -1 0
The CasCor algorithm is quite sensitive to the learning rate parameter. Small values seem to result in somewhat higher errors: the algorithm has trouble finding minima on the error surface because its steps are too small. With higher values the algorithm takes steps that are too big, thereby passing over minima. The ideal value of the learning rate seems to depend on the data used, but values of 2 to 4 seem suitable for most situations.
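The step-size behaviour described here is generic to gradient-based training and can be shown on a one-dimensional quadratic error surface (Python sketch; this stands in for the real error surface and is not the CasCor implementation):

```python
# Gradient descent on the error surface E(w) = w^2 (gradient 2w).
def descend(lr, steps=50, w0=1.0):
    w = w0
    for _ in range(steps):
        w -= lr * 2.0 * w
    return abs(w)         # remaining distance to the minimum at w = 0

print(descend(0.01))      # too small: still far from the minimum
print(descend(0.4))       # suitable: converges quickly
print(descend(1.1))       # too large: overshoots the minimum and diverges
```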
ANNs 3 and 4 exhibited large differences in performance between training and cross-training data. The early cross-training stops on these models indicate overtraining effects. The causes of this effect are clear:
ANN 3 has a network architecture that is too complex;
ANN 4 has two input variables that show large information overlap (GwF and GwD).
ANNs 6, 7 and 12 showed the same overtraining effects (to a smaller degree). This was likewise caused by a large number of inputs and the relatively complex network architectures.
The reason that network 15 shows little or no overtraining effects, despite its complex ANN structure, is probably that the ANN easily recognises the input of Q at time 0 as an important indicator for Q at time +1 and devaluates the rest of the ANN inputs and connections.
[Figure 5.30 - Model simulation of Q (x 10^4) against time points of the test set (RMSE: 3289; R2: 77.2%). Figure 5.31 - Model simulation of Q (x 10^4) against time points of the test set (RMSE: 3005; R2: 82.8%).]
From these ANN designs it can be concluded that the ideal memory length for the rainfall and evapotranspiration data is approximately 4 or 5 time steps. The ideal memory length for the groundwater data from location Fentange is a few time steps longer.
Using these memory lengths results in an ANN R-R model with about 15 input variables. The best network architecture that was found for this model has two hidden layers. Between 6 and 8 neurons in the first hidden layer and between 4 and 6 in the second produced the best results. Larger networks show signs of the overtraining effect.
The Levenberg-Marquardt training algorithm is undoubtedly the best available algorithm for this problem. The BFG algorithm sometimes shows good results on complex ANN architectures, but L-M is the most consistently well-performing algorithm.
The effect of the type of transfer function is small. The tansig function was generally chosen as transfer function, because theoretically it is the best function for non-linear applications.
Data resolution
As was shown in 5.3, the Alzette-Pfaffenthal catchment has a response time (the time between the rainfall peak and the discharge peak) that is probably shorter than a day. Because the process scale is smaller than the time resolution of the data, the exact response time is unknown.
This short response time, as opposed to the larger time intervals in the data, causes the information content of the data to be somewhat insufficient. An ANN model that has to predict a discharge based on rainfall information from further back in time than the response time is unlikely to have enough information for a very accurate simulation.
Figure 5.32 shows an approximation by ANN model 9 of the discharge at the current time step (T0), given the data of the input variables at the current time step and a few steps back in time. This model represents the ideal situation in which the time intervals of the data approach zero. It is able to closely approximate the target discharge values. From this it can be concluded that if the time scale of the data were smaller than a day, the approximation by the best ANN models would improve (becoming more like the approximation in the figure below).
[Figure 5.32 - Target values and network prediction of Q (x 10^4) against time points of the test set (RMSE: 1896; R2: 95.2%).]
The time lag effect (discussed in 5.4.1) occurred in all ANN models that had previous discharge values as model input. The error that is caused by this phenomenon is related to the time resolution of the data: the larger the time resolution of the data in proportion to the time scale of the system, the more significant the time lag error will be. The one-day lag in the predictions can be clearly seen in the figures, but it is small enough for the RMSE and Nash-Sutcliffe coefficient to remain quite good.
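For reference, the two performance measures used throughout this chapter can be sketched as follows (Python/NumPy; R^2 denotes the Nash-Sutcliffe coefficient, reported in percent as in the tables, and the sample values are illustrative):

```python
import numpy as np

def rmse(obs, sim):
    return np.sqrt(np.mean((obs - sim) ** 2))

def nash_sutcliffe(obs, sim):
    # 1 minus the ratio of the squared model errors to the variance of
    # the observations around their mean: 100% is a perfect fit, 0%
    # means no better than always predicting the mean discharge.
    return 100.0 * (1.0 - np.sum((obs - sim) ** 2)
                    / np.sum((obs - obs.mean()) ** 2))

obs = np.array([1.0, 2.0, 4.0, 3.0, 2.0])
sim = np.array([1.1, 1.8, 3.5, 3.2, 2.1])
print(f"RMSE = {rmse(obs, sim):.3f}")
print(f"R^2  = {nash_sutcliffe(obs, sim):.1f} %")
```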
[Two plots of Q (x 10^4) against time points of the test set: RMSE 3005, R2 82.8%; and RMSE 3008, R2 74.0%.]
Concluding, it can be stated that the combinations of global and local empirical models that were tested tended to act like local empirical models. The reason for this is that the data that was used allows moderate performance from a global model (RMSE of about 3300) and quite good performance from a local model (RMSE of about 3000).
Figure 5.35 - Scatter plot of predictions and targets (ANN 9). Figure 5.36 - Scatter plot of predictions and targets (ANN 18).
The following two plots show the approximations of ANN models 9 and 18 over the complete time series (i.e. the training data, cross-training data and the validation data). The best approximation of the discharge time series is, naturally, during the training phase (first half of the time series). These plots also show that the peak predictions are too low.
The peak in the validation data set (just before time step 1600) is larger than any peak presented to the model in the training phase. The model was in no case able to extrapolate beyond the range of the training data. This was to be expected, since previous applications have already shown that ANNs are poor extrapolators.
[Plot of Q (x 10^4) over the complete time series, time steps 0 to 2000 (RMSE: 2705; R2: 67.0%).]
[Plot of Q (x 10^4) over the complete time series, time steps 0 to 2000 (RMSE: 2390; R2: 77.6%).]
This inability to approximate peaks could be a result of inappropriate pre-processing and post-processing of the data. For the following test, the linear amplitude scaling was changed from the range -0.9 to 0.9 to the range -0.8 to 0.8.
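The scaling step being varied here can be sketched as follows (Python/NumPy; the helper name make_scaler and the sample values are illustrative, not taken from the CT5960 ANN Tool):

```python
import numpy as np

def make_scaler(train, lo=-0.9, hi=0.9):
    """Linear amplitude scaling of the training range onto [lo, hi]."""
    t_min, t_max = train.min(), train.max()
    def scale(x):
        return lo + (x - t_min) * (hi - lo) / (t_max - t_min)
    def unscale(y):
        return t_min + (y - lo) * (t_max - t_min) / (hi - lo)
    return scale, unscale

q_train = np.array([100.0, 500.0, 2500.0, 4000.0])
scale, unscale = make_scaler(q_train)          # or lo=-0.8, hi=0.8

print(scale(q_train))                  # endpoints map to -0.9 and 0.9
print(unscale(scale(q_train)))         # round trip recovers the data

# A validation peak larger than the training maximum maps beyond 0.9,
# where a saturating output layer can hardly reach -- one reason why
# ANNs extrapolate poorly above the training range.
print(scale(np.array([5000.0])))
```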
[Target values and network prediction of Q (x 10^4) against time points of the test set, using scaling limits of -0.8 to 0.8 (RMSE: 3390; R2: 59.9%).]
As this figure shows, the data processing limits are not the cause of the underestimation of extreme values. Performance actually deteriorates when the smaller scaling limits are used.
[Two plots of target values and network prediction of Q (x 10^4) against time points of the test set: RMSE 4492, R2 32.7%; and RMSE 4265, R2 19.6%.]
The rainfall data clearly helps the model approximate peak runoff values, which shows that an ANN model uses the rainfall data as an indicator for future discharge peaks. The groundwater data, on the other hand, helps the model estimate the magnitude of low discharges. These observations are in accordance with the theory of rainfall-to-runoff transformation discussed in 3.1.
From this it can be concluded that:
the rainfall time series mainly contains information about storm flows;
the groundwater time series mainly contains information about base flows;
the ANN model is able to extract the relations between these two time series and the discharge time series from the data.
[Plot of a multi-step ahead prediction of Q (x 10^4) against time points of the test set (RMSE: 4441; R2: 21.4%).]
The reason for this similarity is that the multi-step ahead prediction barely uses the rainfall data input, due to the low correlation between the discharge and the rainfall at two time steps ago.
The conclusion is that the same factor that makes the one-step ANN predictions not very accurate (too large a time scale of the data compared to the time scale of the catchment response) also makes multi-step ahead predictions very inaccurate.
CasCor comparisons
Prediction: Q at +1
Input: P, ETP and GwF at -4 to 0
Regular training algorithms: 8+4 hidden neurons, tansig
Table 5.16 - Results of regular versus CasCor training algorithm tests.
         CasCor   L-M    BFG    sCG    CGb    GDx
RMSE     3503     3290   3441   3271   3415   3661
R^2 (%)  53.5     67.1   64.7   66.8   59.0   44.4
The CasCor algorithm cannot keep up with the performance of the more sophisticated algorithms like L-M, BFG or sCG. It is clear, however, that the embedded Quickprop algorithm is an improvement over the backpropagation algorithm with momentum and variable learning rate (GDx). The current limiting factor of the CasCor algorithm is most likely its embedded training algorithm. A more sophisticated algorithm like L-M would improve ANN performance because the weights would be trained better.
Split sampling
Some tests were run with ANN model 9 in order to examine the impact of a change in the split sampling of the data. The first test was done using a 70%-10%-20% distribution for the training, cross-training and validation data; the second test used a 30%-50%-20% distribution.
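The two distributions can be reproduced with a simple chronological split of the 1887 daily values (Python/NumPy sketch; the split_sample helper is illustrative):

```python
import numpy as np

def split_sample(series, f_train, f_cross):
    """Chronological split into training, cross-training, validation."""
    n = len(series)
    i1 = int(n * f_train)
    i2 = i1 + int(n * f_cross)
    return series[:i1], series[i1:i2], series[i2:]

q = np.arange(1887)   # stand-in for the 1887 daily discharge values

tr1, cv1, va1 = split_sample(q, 0.70, 0.10)   # first test above
tr2, cv2, va2 = split_sample(q, 0.30, 0.50)   # second test above
print(len(tr1), len(cv1), len(va1))
print(len(tr2), len(cv2), len(va2))
```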
[Target values and network prediction of Q (x 10^4) against time points of the test set (RMSE: 3877; R2: 35.8%).]
Figure 5.43 - ANN model 9 simulation after split sampling the data in 70%-10%-20%.
The model is unable to accurately predict runoff values due to the overtraining effect. This is the
result of too small a cross-training data set.
[Target values and network prediction of Q (x 10^4) against time points of the test set (RMSE: 3740; R2: 44.5%).]
Figure 5.44 - ANN model 9 simulation after split sampling the data in 30%-50%-20%.
The model is unable to accurately predict runoff values because the information content of the small training data set is too low to learn all relationships from it.
Chapter 6
A main aspect of the data on which ANN model performance depends is the length of the time
series. The Alzette-Pfaffenthal catchment time series length (1887 daily values) proved
sufficient for the ANN models to learn the relationships in the data. This was more or less
expected since the data comprises five years that all show the most important characteristics
of an average hydrological year.
The application of ANN models to the Alzette-Pfaffenthal catchment suffered from two main
drawbacks. The first drawback is related to ANN techniques: time lag problems when using
ANNs as time series models. The second problem is data related: an inappropriate time
resolution of the available data in proportion to the time scale of the R-R transformation in the
catchment.
The first problem is an inevitable result of the application of a static ANN as a time series
model. The correlation between the current time step (t=0) and the time step that is to be
predicted (e.g. t=+1) causes the prediction of a variable to be more or less the same as the
current value of that variable. This results in a prediction that looks shifted in time (in this
case, the prediction lags one step behind the target). The significance of this effect is related
to the time resolution of the data, since the time lag is as large as the time intervals of the
data.
The second problem is caused by the discrepancy between the time resolution of the data
and the time scale of the dominant flow processes in the catchment. The time between runoff
generation in the Alzette river and the rainfall event that caused it is often less than a day.
This was concluded from the coinciding peaks in the rainfall and runoff time series. The best
possible indicator for the prediction of discharge one time step ahead is the rainfall at the
current time step. In the case of the Alzette-Pfaffenthal catchment data, the correlation
between rainfall and runoff has decreased significantly over this period, because this
period (one day) is longer than the overall response time of the catchment (less than a day).
In other words: the ANN model finds it hard to see a discharge peak coming, because the
rainfall that causes this peak often has not yet fallen onto the catchment, according to the
data. This also makes the prediction of discharge for multiple time steps ahead inaccurate.
ANN R-R models can be used as pure cause-and-effect models and as time series models. The
cause-and-effect approach (also known as global empirical modelling) means that the input to
the ANN model consists of variables that are correlated to runoff, such as rainfall,
groundwater levels et cetera. The time series approach (also known as local empirical
modelling) uses the latest values of the discharge as model input.
The performance of ANNs that were used as global empirical models is limited by the low
time resolution of the data (the second problem discussed above). Local empirical models are
capable of better results in terms of error measures, but are subject to the time lag problem
(the first problem discussed above). As a result of this better performance, the tested ANNs
that combine global and local modelling (i.e. ANNs using both discharge-correlated variables
and previous discharge values as input) tended to act like local empirical models. The time
lag phenomenon was not tempered by the input of discharge-correlated variables such as
rainfall.
Conclusions and Recommendations
ANN R-R models were able to relate the rainfall and groundwater data to respectively rapid
and delayed flow processes. The information content of the rainfall and groundwater time
series complemented each other nicely.
Pre-processing and post-processing of data in the form of scaling is often necessary for
transfer functions to function properly. Additional processing techniques, however, can also
prove useful. One of the findings of this investigation is that if the probability distributions of
input variables show similarities with the probability distribution of the output variable, an
ANN model can learn the relationships between these variables more easily.
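This finding can be illustrated with the log transform used elsewhere in this investigation (lnQ). On synthetic lognormally distributed "discharge" data (Python/NumPy; not the Alzette series), taking the logarithm removes most of the skewness and yields a nearly symmetric distribution:

```python
import numpy as np

# Synthetic, strongly right-skewed discharge-like variable.
rng = np.random.default_rng(2)
q = np.exp(rng.normal(loc=7.0, scale=0.8, size=5000))

def skewness(x):
    """Sample skewness: third central moment over cubed std deviation."""
    d = x - x.mean()
    return np.mean(d ** 3) / np.mean(d ** 2) ** 1.5

print(f"skewness of Q   : {skewness(q):.2f}")          # strongly positive
print(f"skewness of lnQ : {skewness(np.log(q)):.2f}")  # near zero
```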
The development of an ANN R-R model is not very demanding for a modeller. A few basic
guidelines for ANN design, some insight into catchment behaviour, good data from a
catchment and a number of trial-and-error tests should suffice to build an ANN model.
Interpretation of training, cross-training and validation results, however, requires a firm
understanding of the workings of ANN techniques.
Summarising, the approximation of validation data by ANN models is quite good (see Figure 5.30 and
Figure 5.31 on page 90), despite certain drawbacks. ANNs have proven to be capable of
mapping the relationships between precipitation and runoff. The physics of the hydrological system
are parameterised in the internal structure of the network. The low transparency of such
parameterised relations often leads to discussions on the usefulness of ANNs. ANNs are indeed
generally not very good at revealing the physics of the hydrological system that is modelled. (A
counterexample is the separation of the effects of inputting rainfall data and groundwater data,
which is discussed above.)
On the other hand, providing insights should not be the main goal of ANN application. The focus
should be on the positive aspects of ANNs: easy model development, short computation times and
accurate results.
6.2 Recommendations
A higher time resolution of data in proportion to the system time scale would enhance the
performance of ANNs that are used as global empirical models. The importance of the time lag effect
in local empirical models will also diminish because the lag is as large as the time intervals used in the
data.
A higher spatial resolution could also be beneficial to ANN models. In this investigation, only one
precipitation time series was used, representing the lumped rainfall over the catchment. Using several
time series from spatially distributed measurement stations could be useful in ANN R-R modelling.
The time lag problem can possibly be countered by using dynamic ANNs instead of static networks with a
window-in-time input. Fully or partially recurrent networks (discussed in 2.3.4) could be used for this
dynamic approach. A different software tool would have to be used, since the CT5960 ANN Tool only
supports static ANNs.
The main limiting factor in the performance of the CasCor algorithm seems to be the training
algorithm that is embedded in it, which currently is the Quickprop algorithm. A more sophisticated
algorithm such as the Levenberg-Marquardt algorithm would undoubtedly increase the CasCor
algorithm's capability to find good weight values, and thus produce lower model errors.
The automated stopping criteria that were used in this investigation (developed by Prechelt [1996])
should be tested on more complex data in order to be able to make a conclusive statement about
their performance.
Glossary
Activation level: See state of activation.
Activation function: See transfer function.
ANN architecture: The structure of neurons and layers of an ANN.
Artificial Neural Network (ANN): A network of simple computational elements (known as neurons) that is able to adapt to an information environment by adjustment of its internal connection strengths (weights) by applying a training algorithm.
Backpropagation algorithms: Family of training algorithms, based on a steepest gradient descent training algorithm.
Base flow: The total of delayed water flows from a catchment. Visualised as the lower part of a catchment hydrograph.
Batch training: Training method that updates the ANN weights only after all training data has been presented.
Bias: A threshold function for the output of a neuron. Or: a constant input signal in an ANN (used, for instance, in a CasCor network).
Cascade Correlation (CasCor): A meta-algorithm (or: constructive algorithm) that both trains an ANN and constructs an appropriate ANN architecture.
Conceptual R-R model: An R-R model that makes several assumptions about real-world behaviour and characteristics. Midway between empirical R-R model and physically based R-R model.
Cross-training: Method of preventing overtraining; during the training process a separate cross-training data set is used to check the generalisation capability of the network being trained.
Dynamic ANN: An ANN with the dimension of time implemented in the network structure.
Early-stopping techniques: Methods of preventing overtraining by breaking off training procedures.
Empirical R-R model: An R-R model that models catchment behaviour based purely on sample input and output data from the catchment.
Epoch: A weight update step.
Feedforward ANN: An ANN that only has connections between neurons that are directed from input to output.
Function mapping: See mapping.
Global empirical modelling: Pure cause-and-effect modelling, using differing model input and output variables.
Hidden neurons: Neurons between the input units and the output layer of an ANN.
Hydrograph: Graphical presentation of discharge in a water course.
Input units: Units in an ANN architecture that receive external data.
Internal ANN parameters: Weights and biases in an ANN.
Learning algorithm: See training algorithm.
Learning rate: A training parameter that affects the step size of weight updates.
Local empirical modelling: Pure time series modelling, using previous values of a variable in order to predict a future value.
Mapping (or: function mapping): Approximation of a function. This approximation is represented in the workings of the function model.
Neuron: Simple computational element that transforms one or more inputs to an output.
Overtraining: Training effect that results in an ANN that follows the training data too rigidly and therefore loses its generalisation ability.
Perceptron: A specific type of neuron, named after one of the first neurocomputers.
Performance learning: A training method that is the best-known example of supervised learning. It lets an ANN adjust its weights so that the network output approximates target output values.
Physically based R-R model: An R-R model that represents the physics of a hydrological system as they are.
Quickprop: A training algorithm that is a variant of the backpropagation algorithm.
Radial Basis Function (RBF) ANN: A two-layer feedforward ANN type that has mapping capabilities.
Split sampling: Dividing a data set into separate data sets for training, validation and possibly cross-training.
State of activation (or: activation level): Internal value of a neuron, calculated by combining all its inputs.
Storm flow: The total of rapid water flows from a catchment after a precipitation event. Visualised as the upper part of the peak of a catchment hydrograph. Also see base flow.
Supervised learning: An ANN training method that presents the network with inputs as well as target outputs to which it can adapt. Also see unsupervised learning.
Training: The process of adapting an ANN to sample data. Also see cross-training and validation.
Training algorithm (or: learning algorithm): An algorithm that adjusts the internal parameters of an ANN in order to adapt its output to training data that is presented to the network.
Transfer function: A function in which a neuron's state of activation is entered and that subsequently produces the neuron's output value.
Underfitting: Training effect that results in an ANN that generalises too much, because it has not taken full advantage of the training data.
Unsupervised learning: An ANN training method that presents the network only with input data to which it can adapt. Also see supervised learning.
Validation: The process of testing a trained ANN on a separate data set in order to check its performance.
Weight: A value that represents the strength of the connection between two neurons.
Notation
Variables
ETP Evapotranspiration
GwD Groundwater level at location Dumontshaff
GwF Groundwater level at location Fentange
lnETP Natural logarithm of evapotranspiration, ln(ETP)
lnQ Natural logarithm of Q, ln(Q)
P Rainfall
Pnet Net rainfall, (P minus ETP)
Q Discharge at location Hesperange
RI Rainfall Index
Algorithms
BFG Broyden-Fletcher-Goldfarb-Shanno algorithm
CasCor Cascade-Correlation training algorithm
CGb Powell-Beale variant of the Conjugate Gradient training algorithm
CGf Fletcher-Reeves variant of the Conjugate Gradient training algorithm
CGp Polak-Ribiere variant of the Conjugate Gradient training algorithm
GDx Gradient Descent training algorithm (backpropagation) with momentum and variable
learning rate.
L-M Levenberg-Marquardt training algorithm
sCG Scaled Conjugate Gradient algorithm
Transfer functions
Logsig Logarithmic sigmoid transfer function
Purelin Linear transfer function
Satlins Symmetrical saturated linear transfer function
Tansig Hyperbolic tangent transfer function
Error functions
MAE Mean Absolute Error
MSE Mean Squared Error
RMSE Root Mean Squared Error
Other abbreviations
ANN Artificial Neural Network
FIR Finite Impulse Response
GUI Graphical User Interface
R-R Rainfall-Runoff
List of Figures
Figure 2.1 - A biological neuron .................................................................................................... 4
Figure 2.2 - Schematic representation of two artificial neurons and their internal processes [after
Rumelhart, Hinton and McClelland, 1986] ............................................................................... 6
Figure 2.3 - An example of a three-layer ANN, showing neurons arranged in layers........................... 7
Figure 2.4 - Illustration of network weights and the accompanying weight matrix [after Hecht-Nielsen,
1990]. ................................................................................................................................. 8
Figure 2.5 - Linear activation function. .......................................................................................... 9
Figure 2.6 - Hard limiter activation function. .................................................................................. 9
Figure 2.7 - Saturating linear activation function. ........................................................................... 9
Figure 2.8 - Gaussian activation function for three different values of the wideness parameter......... 10
Figure 2.9 - Binary sigmoid activation function for three different values of the slope parameter. ..... 10
Figure 2.10 - Hyperbolic tangent sigmoid activation function. ........................................................ 11
Figure 2.11 - Example of a two-layer feedforward network. .......................................................... 13
Figure 2.12 - Example of an error surface above a two-dimensional weight space. [after Dhar and
Stein, 1997] ....................................................................................................................... 14
Figure 2.13 - General structure for function mapping ANNs [after Ham and Kostanic, 2001]. ........... 18
Figure 2.14 - A classification of ANN models with respect to time integration [modified after Chappelier
and Grumbach, 1994]. ........................................................................................................ 20
Figure 2.15 - Basic TDNN neuron. [after Ham and Kostanic, 2001]. ............................................... 22
Figure 2.16 - Non-linear neuron filter [after Ham and Kostanic, 2001]............................................ 22
Figure 2.17 - The SRN neural architecture [after Ham and Kostanic, 2001]..................................... 23
Figure 2.18 - The recursive multi-step method. [after Duhoux et al., 2002] .................................... 24
Figure 2.19 - Chains of ANNs. [after Duhoux et al., 2002]............................................................. 24
Figure 2.20 - Direct multi-step method. ....................................................................................... 25
Figure 2.21 - An overtrained network. [after Demuth and Beale, 1998] .......................................... 28
Figure 2.22 - Choosing the appropriate number of training cycles [after Hecht-Nielsen, 1990].......... 29
Figure 3.1 - Schematic representation of the hydrological cycle (highlighting the processes on and
under the land surface). ...................................................................................................... 31
Figure 3.2 - Example hydrograph including a catchment response to a rainfall event. ...................... 32
Figure 3.3 - Schematic representation of cross-sectional hill slope flow [Rientjes and Boekelman,
2001] ................................................................................................................................ 33
Figure 3.4 - Horton overland flow [after Beven, 2001] .................................................................. 34
Figure 3.5 - Saturation overland flow due to the rise of the perennial water table [after Beven, 2001]
......................................................................................................................................... 34
Figure 3.6 - Perched subsurface flow [after Beven, 2001] ............................................................. 35
Figure 3.7 - Diagram of the occurrence of various overland flow and aggregated subsurface storm
flow processes in relation to their major controls [after Dunne and Leopold, 1978]................... 37
Figure 3.8 - Variable source area concept [after Chow et al., 1988]. .............................................. 37
Figure 3.9 - Examples of a lumped, a semi-distributed and a distributed approach. ......................... 38
Figure 3.10 - Schematic representation of the SHE-model. ............................................................ 39
Figure 3.11 - Comparing observed and simulated hydrographs [from Beven, 2001]......................... 50
Figure 4.1 - Screenshot of the original CT5960 ANN Tool (version 1). ............................................ 52
Figure 4.2 - Screenshot of the new CT5960 ANN Tool (version 2). ................................................. 54
Figure 4.3 - The Cascade Correlation architecture, initial state and after adding two hidden units.
[after Fahlman and Lebiere, 1991] ....................................................................................... 57
Figure 4.4 - Inaccurate form of the CasCor algorithm, as programmed in the M-file in the Classification
Toolbox. ............................................................................................................................ 58
Figure 4.5 - Program Structure Diagram of the CasCor M-file. ....................................................... 59
Figure 4.6 - Program Structure Diagram of the subroutine F for determining the CasCor network
output. .............................................................................................................................. 59
Figure 4.7 - CasCor network with two input units (Ni=2) and two hidden neurons (Nh=2)............... 60
Figure 4.8 - Modified Quickprop algorithm; combination of the original algorithm by Fahlman [1988]
and a slight modification by Veitch and Holmes [1990]. ......................................................... 61
Figure 5.1 - Location of Alzette catchment in North West Europe................................................... 65
List of Tables
Table 2.1 - Overview of supervised learning techniques ................................................................ 12
Table 2.2 - Overview of unsupervised learning techniques ............................................................ 12
Table 2.3 - Review of ANN performance on various aspects [modified after Dhar & Stein, 1997]. ..... 27
Table 4.1 - Comparison of CasCor algorithm with three other training algorithms............................ 63
Table 5.1 - Available data from Alzette-Pfaffenthal catchment. ...................................................... 65
Table 5.2 - Comparative tests of Q and lnQ as network outputs..................................................... 69
Table 5.3 - Comparative tests of rainfall inputs. ........................................................................... 78
Table 5.4 - Comparative tests of rainfall and evapotranspiration inputs. ......................................... 78
Table 5.5 - Comparative tests of groundwater inputs.................................................................... 79
Table 5.6 - Comparative tests of discharge inputs and outputs. ..................................................... 80
Table 5.7 - Comparative tests of a cause-and-effect model and various combinations of cause-and-
effect and time series models. ............................................................................................. 83
Table 5.8 - Results of comparative training algorithm tests. .......................................................... 84
Table 5.9 - Results of comparative transfer function tests. ............................................................ 84
Table 5.10 - Results of comparative ANN architecture tests........................................................... 86
Table 5.11 - Results of comparative CasCor parameter tests. ........................................................ 87
Table 5.12 - ANN model descriptions (regular training algorithms)................................................. 87
Table 5.13 - Results of ANN tests (regular training algorithms)...................................................... 88
Table 5.14 - ANN model descriptions (CasCor training algorithm). ................................................. 89
Table 5.15 - Results of ANN tests (CasCor training algorithm). ...................................................... 89
Table 5.16 - Results of regular versus CasCor training algorithm tests............................................ 96
References
Akker, C. van den; Boomgaard, M. E. (1998). Hydrologie, lecture notes CThe3010. Faculty of Civil Engineering and Geosciences - Section of Hydrology and Ecology.

Beven, Keith J. (2001). Rainfall-runoff modelling: the primer. Wiley.

Boné, R.; Crucianu, M. (2002). Multi-step-ahead predictions with neural networks: a review. 9èmes rencontres internationales Approches Connexionnistes en Sciences Économiques et en Gestion, pp. 97-106, RFAI.

Carpenter, W. C.; Barthelemy, J. (1994). Common misconceptions about neural networks as approximators. Journal of Computing in Civil Engineering, 8 (3), pp. 345-358, ASCE.

Chappelier, J.-C.; Grumbach, A. (1994). Time in neural networks. SIGART Bulletin, Vol. 5, No. 3, ACM Press.

Chen, S.; Cowan, C. F. N.; Grant, P. M. (1991). Orthogonal least squares learning algorithm for radial basis function networks. IEEE Transactions on Neural Networks, Vol. 2, Issue 2, pp. 302-309, IEEE Computer Society.

Chow, V. T.; Maidment, D. R.; Mays, L. W. (1988). Applied Hydrology. McGraw-Hill.

Dawson, C. W.; Harpham, C.; Wilby, R. L.; Chen, Y. (2002). Evaluation of artificial neural network techniques for flow forecasting in the River Yangtze, China. Hydrology and Earth System Sciences, 6 (4), pp. 619-626, EGS.

Demuth, Howard; Beale, Mark (1998). Neural Network Toolbox (for use with Matlab) User's Guide, Version 3. The Mathworks Inc.

Dhar, Vasant; Stein, Roger (1997). Seven methods for transforming corporate data into business intelligence. Prentice-Hall.

Dibike, Y. B.; Solomatine, D. P. (2000). River flow forecasting using artificial neural networks. Physics and Chemistry of the Earth (B), Vol. 26, No. 1, pp. 1-7, Elsevier Science B.V.

Duhoux, M.; Suykens, J.; De Moor, B.; Vandewalle, J. (2001). Improved long-term temperature prediction by chaining of neural networks. International Journal of Neural Systems, Vol. 11, No. 1, pp. 1-10, World Scientific Publishing Company.

Dunne, T. (1983). Relation of field studies and modelling in the prediction of storm runoff. Journal of Hydrology, Vol. 65, pp. 25-48, Elsevier Science B.V.

Dunne, T.; Leopold, L. B. (1978). Water in Environmental Planning. W. H. Freeman and Co.

Elshorbagy, Amin; Simonovic, S. P.; Panu, U. S. (2000). Performance evaluation of artificial neural networks for runoff prediction. Journal of Hydrologic Engineering, Vol. 5, No. 4, pp. 424-427, ASCE.

Fahlman, Scott E. (1988). An Empirical Study of Learning Speed in Back-Propagation Networks. School of Computer Science, Carnegie Mellon University.

Fahlman, Scott E.; Lebiere, Christian (1991). The Cascade-Correlation Learning Architecture. School of Computer Science, Carnegie Mellon University.

French, M. N.; Krajewski, W. F.; Cuykendal, R. R. (1992). Rainfall forecasting in space and time using a neural network. Journal of Hydrology, Vol. 137, pp. 1-37, Elsevier Science B.V.

Furundzic, D. (1998). Application example of neural networks for time series analysis: rainfall-runoff modeling. Signal Processing, 64, pp. 383-396, Elsevier Science B.V.

Govindaraju, Rao S. (2000). Artificial neural networks in hydrology I: preliminary concepts. Journal of Hydrologic Engineering, Vol. 5, No. 2, pp. 115-123, ASCE.

Govindaraju, Rao S. (2000). Artificial neural networks in hydrology II: hydrologic applications. Journal of Hydrologic Engineering, Vol. 5, No. 2, pp. 124-137, ASCE.

Gupta, Hoshin Vijai; Sorooshian, Soroosh (1985). The relationship between data and the precision of parameter estimates of hydrologic models. Journal of Hydrology, Vol. 81, pp. 57-77, Elsevier Science B.V.

Halff, A. H.; Halff, H. M.; Azmoodeh, M. (1993). Predicting runoff from rainfall using neural networks. Proceedings of Engineering Hydrology, pp. 760-765, ASCE.

Ham, Fredric H.; Kostanic, Ivica (2001). Principles of neurocomputing for science & engineering. McGraw-Hill Higher Education.

Haykin, Simon (1998). Neural Networks: A Comprehensive Foundation (2nd edition). Prentice Hall.

Hecht-Nielsen, Robert (1990). Neurocomputing. Addison-Wesley.

Hjemfelt, A. T.; Wang, M. (1993). Artificial neural networks as unit hydrograph applications. Proceedings of Engineering Hydrology, pp. 754-759, ASCE.

Hooghart, J. C. et al. (1986). Verklarende hydrologische woordenlijst. Commissie voor Hydrologisch Onderzoek - TNO.

Horton, R. E. (1933). The role of infiltration in the hydrologic cycle. Transactions American Geophysical Union, 14, pp. 446-460.

Hsu, Kuo-lin; Gupta, Hoshin Vijai; Sorooshian, Soroosh (1993). Artificial neural network modeling of the rainfall-runoff process. Water Resources Research, 29 (4), pp. 1185-1194, Department of Hydrology and Water Resources, University of Arizona.

Huckin, T. N.; Olsen, L. A. (1991). Technical Writing and Professional Communication for Nonnative Speakers of English. McGraw-Hill.

Imrie, C. E.; Durucan, S.; Korre, A. (2000). River flow prediction using artificial neural networks: generalisation beyond the calibration range. Journal of Hydrology, Vol. 233, pp. 138-153, Elsevier Science B.V.

Johansson, E. M.; Dowla, F. U.; Goodman, D. M. (1992). Backpropagation learning for multi-layer feed-forward neural networks using the conjugate gradient method. International Journal of Neural Systems, Vol. 2, No. 4, pp. 291-301, World Scientific Publishing Company.

Kachroo, R. K. (1986). HOMS workshop on river flow forecasting. Unpublished internal report, Department of Engineering Hydrology, University of Galway, Ireland.

Kohonen, T. (1988). An introduction to neural computing. Neural Networks, Vol. I, pp. 3-16, Pergamon Press.

Leonard, J. A.; Kramer, M. A.; Ungar, L. H. (1992). Using radial basis functions to approximate a function and its error bounds. IEEE Transactions on Neural Networks, Vol. 3, Issue 4, pp. 624-627, IEEE Computer Society.

Lippmann, R. P. (1987). An introduction to computing with neural nets. IEEE ASSP Magazine, pp. 4-22, IEEE Computer Society.
Websites
Appendix A
Using a steepest-descent gradient approach, the learning rule for a network weight in any one of the network layers is given by

\Delta w_{ji} = -\eta \frac{\partial E}{\partial w_{ji}}   (A.1)

Using the chain rule for partial derivatives, this formula can be rewritten as

\Delta w_{ji} = -\eta \frac{\partial E}{\partial v_j} \frac{\partial v_j}{\partial w_{ji}}   (A.2)
where vj is the activation level of neuron j.
The first partial derivative in (A.2) is different for weights of neurons in hidden layers and neurons in
output layers. For output layers, it can be written as
\frac{\partial E}{\partial v_j^{(s)}} = \frac{\partial}{\partial v_j^{(s)}} \left[ \frac{1}{2} \sum_{h=1}^{n} \left( t_h - f\left( v_h^{(s)} \right) \right)^2 \right] = -\left( t_j - f\left( v_j^{(s)} \right) \right) g\left( v_j^{(s)} \right)   (A.4)

or

\frac{\partial E}{\partial v_j^{(s)}} = -\left( t_j - y_j^{(s)} \right) g\left( v_j^{(s)} \right) = -\delta_j^{(s)}   (A.5)
where g represents the first derivative of the activation function f. The term \delta_j^{(s)} defined in (A.5) is
commonly referred to as the local error.
For neurons in hidden layers, this first partial derivative in (A.2) is more complex since the change in
vj(s) propagates through the output layer of the network and affects all the network outputs.
Expressing this quantity as a function of quantities that are already known and of other, easily
evaluated terms gives

\frac{\partial E}{\partial v_j^{(s)}} = \frac{\partial}{\partial y_j^{(s)}} \left[ \frac{1}{2} \sum_{h=1}^{n} \left( t_h - f\left( \sum_{p=1}^{n} w_{hp}^{(s+1)} y_p^{(s)} \right) \right)^2 \right] \frac{\partial y_j^{(s)}}{\partial v_j^{(s)}}   (A.6)

or

\frac{\partial E}{\partial v_j^{(s)}} = -\sum_{h=1}^{n} \left( t_h - y_h^{(s+1)} \right) g\left( v_h^{(s+1)} \right) w_{hj}^{(s+1)} g\left( v_j^{(s)} \right)   (A.7)

or, written in terms of the local errors of the following layer,

\frac{\partial E}{\partial v_j^{(s)}} = -g\left( v_j^{(s)} \right) \sum_{h=1}^{n} \delta_h^{(s+1)} w_{hj}^{(s+1)} = -\delta_j^{(s)}   (A.8)

so that the weight update becomes

w_{ji}^{(s)}(k+1) = w_{ji}^{(s)}(k) + \eta^{(s)} \delta_j^{(s)} y_i^{(s-1)}   (A.9)
We see that the update equations for the weights in the hidden layer and the output layer have the
same form. The only difference lies in the way the local errors are computed. For the output layer, the
local error is proportional to the difference between the desired output and the actual network output.
By extending the same concept to the outputs of the hidden layers, the local error for a neuron in a
hidden layer can be viewed as being proportional to the difference between the desired output and
actual output of the particular neuron. Of course, during the training process, the desired outputs of
the neuron in the hidden layer are not known, and therefore the local errors need to be recursively
estimated in terms of the error signals of all connected neurons.
Concluding, the network weights are updated according to the following formula:

w_{ji}^{(s)}(k+1) = w_{ji}^{(s)}(k) + \eta^{(s)} \delta_j^{(s)} y_i^{(s-1)}   (A.10)

where

\delta_j^{(s)} = \left( t_j - y_j^{(s)} \right) g\left( v_j^{(s)} \right)   (A.11)

for the output layer, and

\delta_j^{(s)} = g\left( v_j^{(s)} \right) \sum_h \delta_h^{(s+1)} w_{hj}^{(s+1)}   (A.12)

for the hidden layers.
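The update equations (A.9), (A.11) and (A.12) can be illustrated with a minimal numeric sketch. This is not the thesis code; the logistic activation and all numeric values are illustrative assumptions.

```python
import numpy as np

def g(v):
    # derivative of the logistic activation f(v) = 1/(1 + exp(-v))
    y = 1.0 / (1.0 + np.exp(-v))
    return y * (1.0 - y)

# Output neuron: local error per (A.11)
t, y_out, v_out = 1.0, 0.73, 1.0
delta_out = (t - y_out) * g(v_out)

# Hidden neuron feeding that output through weight w_hj: local error per (A.12)
v_hid, w_hj = 0.5, 0.8
delta_hid = g(v_hid) * delta_out * w_hj

# Weight update per (A.9): w <- w + eta * delta * (input from previous layer)
eta, w_ji, y_in = 0.1, 0.4, 0.9
w_ji = w_ji + eta * delta_hid * y_in
```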
Appendix B
Below is a description of the backpropagation algorithm, as described by Ham and Kostanic (2001).
Figure B.1 can prove useful when reading the following.
Step 3. The desired network response is compared with the actual output of the network and
the error can be determined.
The error function that has to be minimized by the backpropagation algorithm has the form

E = \frac{1}{2} \sum_{j=1}^{n} \left( t_j - y_j \right)^2

where t_j and y_j are the desired and actual outputs of the jth output neuron.
Subsequently, the local errors can be computed for each neuron. These local errors
are the result of backpropagation of the output errors back into the network. They are
a function of:
- The errors in the following layers. These are either the network output errors (when calculating local errors in the output layer) or the local errors in the following layer (when calculating local errors in hidden layers and the input layer).
- The derivative of the transfer function in the layer. For this reason, continuous transfer functions are desirable.
The exact formulas are shown in step 4.
w_ji(k+1) and w_ji(k) are the weights between neurons i and j during the (k+1)th and kth
pass, or epoch. A similar equation can be written for the correction of bias values.
The parameter \eta in (B.1) is the so-called learning rate. The learning rate is used to
increase the chance that the training process avoids being trapped in a local minimum
instead of reaching the global minimum. Many learning paradigms make use of a learning rate factor.
If the learning rate is set too high, the learning rule can jump over an optimal solution,
but too small a learning rate can result in a learning procedure that evolves too
gradually.
Step 5. Until the network reaches a predetermined level of accuracy in producing the
adequate response for all the training patterns, continue steps 2 through 4.
N.B.
A well-known variant of this classical form is the backpropagation algorithm with momentum
updating. The idea of the algorithm is to update the weights in a direction that is a linear
combination of the current gradient of the error surface and the one obtained in the previous step of
the training. The only difference with the previously mentioned backpropagation method is the way
the weights are updated:

w_{ji}^{(s)}(k+1) = w_{ji}^{(s)}(k) + \eta \, \delta_j^{(s)}(k) \, y_i^{(s-1)}(k) + \alpha \, \eta \, \delta_j^{(s)}(k-1) \, y_i^{(s-1)}(k-1)   (B.4)

In this equation, \alpha is called the momentum factor. It is typically chosen in the interval (0, 1). The momentum
factor can speed up training in very flat regions of the error surface and help prevent oscillations in
the weights by introducing stabilization in weight changes.
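The momentum rule (B.4) can be sketched as follows: each update is the current gradient term plus a fraction \alpha of the previous update. The scalar weight and gradient terms below are illustrative assumptions, not values from the thesis.

```python
eta, alpha = 0.1, 0.9  # learning rate and momentum factor (alpha in (0, 1))

def momentum_step(w, grad_term, prev_update):
    # grad_term stands in for delta_j * y_i of the current pattern
    update = eta * grad_term + alpha * prev_update
    return w + update, update

w, prev = 0.5, 0.0
w, prev = momentum_step(w, 0.2, prev)  # first step: plain gradient step
w, prev = momentum_step(w, 0.2, prev)  # second step: momentum enlarges it
```

With a constant gradient term the second update is larger than the first, which is exactly the acceleration on flat error-surface regions described above.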
algorithm, discussed in the previous section, before proceeding. Figure B.1 can prove useful when
examining the algorithm below.
Step 2. Propagate the qth training pattern throughout the network, calculating the output of
every neuron.
Step 4. Calculate the desired output value for each of the linear combiner estimates.
Referring to Figure B.1, we see that each of the neurons consists of an adaptive linear
element (commonly referred to as a linear combiner) followed by a sigmoidal
nonlinearity. The linear combiners are depicted by the \Sigma symbol. We can observe
that the output of the non-linear activation function will be the desired response if the
linear combiner produces an appropriate input to the activation function. Therefore,
we can conclude that training the network essentially involves adjusting the weights
so that each of the network's linear combiners produces the right result.
For each of the linear combiner estimates, the desired output value is given by
v_{j,q}^{(s)} = f^{-1}\left( d_{j,q}^{(s)} \right)   (B.7)

where

d_{j,q}^{(s)} = y_{j,q}^{(s)} + \mu \, \delta_{j,q}^{(s)}   (B.8)

is the estimated desired output of the jth neuron in the sth layer for the qth training
pattern. The function f^{-1} is the inverse of the activation function. The parameter \mu is
some positive number, commonly taken in the range from 10 to 400.
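A small sketch of (B.7)-(B.8), assuming the a·tanh(b·v) activation with the constants a = 1.716 and b = 2/3 that appear in the Appendix C listing; the values of \mu, y and \delta are illustrative.

```python
import numpy as np

a, b = 1.716, 2.0 / 3.0   # activation constants from the Appendix C listing

def f(v):
    return a * np.tanh(b * v)

def f_inv(d):
    # inverse of the activation function, as required by (B.7)
    return np.arctanh(d / a) / b

mu = 50.0                     # positive parameter in (B.8), typically 10-400
y_jq, delta_jq = 0.4, 0.002
d_jq = y_jq + mu * delta_jq   # estimated desired output, per (B.8)
v_jq = f_inv(d_jq)            # desired linear-combiner output, per (B.7)
```

Feeding v_jq back through the activation reproduces the estimated desired output d_jq, which is the property the derivation relies on.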
Step 5. Update the estimate of the covariance matrix in each layer and the estimate of the
cross-correlation vector for each neuron.
The conjugate gradient algorithm assumes an explicit knowledge of the covariance
matrices and the cross-correlation vectors. Of course, they are not known in advance
and have to be estimated during the training process. A convenient way to do this is
to update their estimates with each presentation of the input/output training pair.
The covariance matrix of the vector inputs to the sth layer is estimated by

C^{(s)}(k) = b \, C^{(s)}(k-1) + y_q^{(s-1)} y_q^{(s-1)T}   (B.9)

and the cross-correlation vector between the inputs to the sth layer and the desired
outputs of the linear combiner by

p_j^{(s)}(k) = b \, p_j^{(s)}(k-1) + v_j^{(s)} \, y_q^{(s-1)}   (B.10)
Step 6. Update the weight vector for every neuron in the network as follows.
(a) At every neuron calculate the gradient vector of the objective function.
g_j^{(s)}(k) = C^{(s)}(k) \, w_j^{(s)}(k) - p_j^{(s)}(k)   (B.11)

If g_j^{(s)} = 0, do not update the weight vector for the neuron and go to step 7; else
perform the following steps:
N.B.
This is called the restart feature of the algorithm. After a couple of iterations, the
algorithm is restarted by a search in the steepest descent direction. This restart
feature is important for global convergence, because in general one cannot
guarantee that the directions d(k) generated are descent directions.
Else: calculate the conjugate direction vector by adding to the current negative
gradient vector of the objective function a linear combination of the previous
direction vectors:
d_j^{(s)}(k) = -g_j^{(s)}(k) + \beta_j^{(s)} \, d_j^{(s)}(k-1)   (B.13)

where

\beta_j^{(s)} = -\frac{ g_j^{(s)T}(k) \, C^{(s)}(k) \, d_j^{(s)}(k-1) }{ d_j^{(s)T}(k-1) \, C^{(s)}(k) \, d_j^{(s)}(k-1) }   (B.14)
N.B.
The various versions of conjugate gradients are distinguished by the manner in
which the parameter \beta is computed.
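For instance, the Fletcher-Reeves (CGf) and Polak-Ribière (CGp) variants listed in the Notation compute \beta from successive gradient vectors. A sketch, with illustrative gradient values:

```python
import numpy as np

def beta_fletcher_reeves(g_new, g_old):
    # CGf: ratio of squared gradient norms
    return float(g_new @ g_new) / float(g_old @ g_old)

def beta_polak_ribiere(g_new, g_old):
    # CGp: uses the change in the gradient between iterations
    return float(g_new @ (g_new - g_old)) / float(g_old @ g_old)

g_old = np.array([1.0, 0.0])
g_new = np.array([0.5, 0.5])
b_fr = beta_fletcher_reeves(g_new, g_old)
b_pr = beta_polak_ribiere(g_new, g_old)
# the new search direction is then d_new = -g_new + beta * d_old
```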
Step 7. Until the network reaches a predetermined level of accuracy, go back to step 2.
where Hk is the Hessian matrix (second derivatives) of the performance index at the current values of
the weights and biases:
H_k = \nabla^2 E(\mathbf{w}) \Big|_{\mathbf{w}=\mathbf{w}(k)} =
\begin{bmatrix}
\frac{\partial^2 E}{\partial w_1^2} & \frac{\partial^2 E}{\partial w_1 \partial w_2} & \cdots & \frac{\partial^2 E}{\partial w_1 \partial w_N} \\
\frac{\partial^2 E}{\partial w_2 \partial w_1} & \frac{\partial^2 E}{\partial w_2^2} & \cdots & \frac{\partial^2 E}{\partial w_2 \partial w_N} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 E}{\partial w_N \partial w_1} & \frac{\partial^2 E}{\partial w_N \partial w_2} & \cdots & \frac{\partial^2 E}{\partial w_N^2}
\end{bmatrix}_{\mathbf{w}=\mathbf{w}(k)}   (B.18)
Step 1. Initialise the network weights to small random values and choose an initial Hessian
matrix approximation B(0) (e.g. B(0)= I, the identity matrix).
Step 2. Propagate each training pattern throughout the network, calculating the outputs of
every neuron for all input/output pairs.
Step 3. Calculate the elements of the approximate Hessian matrix and the gradients of the
error function for each input/output pair.
The approximate Hessian matrix is calculated using the BFGS method:
B(k+1) = B(k) + \frac{ \mathbf{y}(k)\mathbf{y}(k)^T }{ \mathbf{y}(k)^T \Delta(k) } - \frac{ \left[ B(k)\Delta(k) \right] \left[ B(k)\Delta(k) \right]^T }{ \Delta(k)^T B(k) \Delta(k) }   (B.20)
where

\Delta(k) = \mathbf{w}(k+1) - \mathbf{w}(k)   (B.21)

and

\mathbf{y}(k) = \mathbf{g}(k+1) - \mathbf{g}(k)   (B.22)
Equation (B.19) is used to calculate g, the gradient vector of the error function.
Step 4. Perform the update of the weights after all input/output pairs have been presented.
In this weight update, the approximate Hessian and the gradient vector used are averages
over each input/output pair.
\mathbf{w}(k+1) = \mathbf{w}(k) - B_k^{-1} \mathbf{g}_k   (B.23)
Step 5. Until the network reaches a predetermined level of accuracy, repeat steps 2 to 4.
N.B.
The weight update approach presented here is a batch version of a quasi-Newton backpropagation
algorithm.
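The rank-two update (B.20) can be sketched as follows, with illustrative \Delta and \mathbf{y} vectors; by construction the updated matrix satisfies the secant condition B(k+1)\Delta(k) = \mathbf{y}(k):

```python
import numpy as np

def bfgs_update(B, delta, y):
    # approximate-Hessian update per (B.20)
    Bd = B @ delta
    return (B
            + np.outer(y, y) / float(y @ delta)
            - np.outer(Bd, Bd) / float(delta @ Bd))

B = np.eye(2)                  # initial approximation B(0) = I, as in step 1
delta = np.array([0.1, 0.0])   # w(k+1) - w(k), per (B.21)
y = np.array([0.2, 0.05])      # g(k+1) - g(k), per (B.22)
B_new = bfgs_update(B, delta, y)
```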
Step 2. Propagate each training pattern throughout the network, calculating the outputs of
every neuron for all input/output pairs.
Step 3. Calculate the elements of the Jacobian matrix associated with each input/output pair.
The simplest approach to computing the derivatives in the Jacobian matrix is to use
the approximation

J_{i,j} \approx \frac{\Delta e_i}{\Delta w_j}   (B.24)

where \Delta e_i represents the change in the output error due to a small perturbation
\Delta w_j of the weight w_j.
Step 4. Perform the update of the weights after all input/output pairs have been presented.
In this weight update, the Jacobian and the error vector used are averages over each
input/output pair.
\mathbf{w}(k+1) = \mathbf{w}(k) - \left[ \mathbf{J}_k^T \mathbf{J}_k + \mu_k \mathbf{I} \right]^{-1} \mathbf{J}_k^T \mathbf{e}_k   (B.25)
Step 5. Until the network reaches a predetermined level of accuracy, repeat steps 2 to 4.
N.B.
The weight update approach presented here is a batch version of the Levenberg-Marquardt
backpropagation algorithm.
This method represents a transition between the steepest descent method and Newton's method.
When the scalar \mu is small, it approaches Newton's method, using the approximate Hessian matrix.
When \mu is large, it becomes gradient descent with a small step size. Newton's method is faster and
more accurate near an error minimum, so the aim is to shift towards Newton's method as quickly as
possible. Thus, \mu is decreased after each successful step (reduction of the performance function) and is
increased only when a tentative step would increase the performance function. In this way, the
performance function is always reduced at each iteration of the algorithm [Govindaraju, 2000].
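The \mu-switching behaviour of (B.25) can be sketched as follows; the Jacobian, error vector and weights are illustrative assumptions:

```python
import numpy as np

def lm_step(w, J, e, mu):
    # one Levenberg-Marquardt weight update per (B.25)
    n = J.shape[1]
    return w - np.linalg.solve(J.T @ J + mu * np.eye(n), J.T @ e)

w = np.array([0.5, -0.2])
J = np.array([[1.0, 0.0],
              [0.0, 2.0]])     # Jacobian of the errors w.r.t. the weights
e = np.array([0.3, -0.4])      # current error vector

w_newton = lm_step(w, J, e, mu=1e-9)  # small mu: (Gauss-)Newton-like step
w_grad = lm_step(w, J, e, mu=1e3)     # large mu: tiny gradient-descent step
```

The large-\mu step barely moves the weights, while the small-\mu step jumps directly towards the minimum of the local quadratic model.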
Step 2. Propagate each training pattern throughout the network, calculating the outputs of
every neuron for all input/output pairs.
Step 3. Calculate the local error at every neuron in the network for each training pair.
For the output neurons the local error is calculated as
\delta_{j,q}^{(s)} = \left( t_{j,q} - y_{j,q}^{(s)} \right) g\left( v_{j,q}^{(s)} \right)   (B.26)
Step 4. Update the weight vector for every neuron in the network as follows.
The weight update is calculated using the weight update of the previous time step:
\Delta \mathbf{w}(k) = \Delta \mathbf{w}(k-1) \, \frac{ S(k) }{ S(k-1) - S(k) }   (B.28)

where S(k) and S(k-1) are the current and previous values of the gradient of the
error surface, \frac{\partial E}{\partial \mathbf{w}} = -\delta^{(s)} y^{(s-1)} (see Appendix A).
Consequently:
\mathbf{w}(k+1) = \mathbf{w}(k) + \Delta \mathbf{w}(k)   (B.29)
N.B.
Initial weight changes and weight changes after a previous weight change of zero are
calculated using gradient descent:
\Delta \mathbf{w}(k+1) = \eta \, \delta^{(s)} y^{(s-1)}   (B.30)
Furthermore, Fahlman [1988] proposed to limit the magnitude of the weight change
to the weight change of the previous step times a constant factor.
Step 5. Until the network reaches a predetermined level of accuracy, repeat steps 2 to 4.
N.B.
The weight update approach presented here is a batch version of the Quickprop algorithm.
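The secant step (B.28) and Fahlman's growth limit can be sketched as follows (the scalar values are illustrative):

```python
def quickprop_step(dw_prev, S_curr, S_prev, max_growth=1.75):
    # weight change per (B.28), limited to a constant factor times the
    # previous change, as proposed by Fahlman [1988]
    dw = dw_prev * S_curr / (S_prev - S_curr)
    limit = max_growth * abs(dw_prev)
    return max(-limit, min(limit, dw))

# gradient shrinking with unchanged sign: step towards the parabola's minimum
dw = quickprop_step(dw_prev=0.10, S_curr=0.2, S_prev=0.6)
```

When the gradient barely changes between passes, the raw secant step would explode; the growth limit caps it at max_growth times the previous change.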
Figure B.2 - The Cascade Correlation architecture, initial state and after adding two hidden units. The
vertical lines sum all incoming activation. Boxed connections are frozen, X connections are trained
repeatedly. [after Fahlman and Lebiere, 1991]
Step 2. Train the network over the training data set (e.g. using the delta rule).
w_{ji}(k+1) = w_{ji}(k) + \eta \, \delta_j \, t_i   (B.31)
S = \sum_o \left| \sum_p \left( V_p - \bar{V} \right) \left( E_{p,o} - \bar{E}_o \right) \right|   (B.32)
where V_p is the output of the new hidden node for pattern p; \bar{V} is the average output
over all patterns; E_{p,o} is the network output error for output node o on pattern p; and
\bar{E}_o is the average network error over all patterns. Pass the training patterns one by
one and adjust the input weights of the new neuron after each pass until S no longer
changes appreciably.
The aim is to maximize S, so that when the neuron is actually entered into the
network as a fully connected unit, it acts as a feature detector.
Step 5. Go to step 3, and repeat the procedure until the network attains a prespecified
minimum error within a fixed number of training cycles.
The incorporation of each new hidden unit and the subsequent error minimisation
phase should lead to a lower residual error at the output layer. Hidden units are
incorporated in this way until the output error has stopped decreasing or has reached
a satisfactory level.
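The covariance score (B.32) can be sketched as follows; the candidate outputs and residual errors below are illustrative:

```python
import numpy as np

def cascor_score(V, E):
    # S per (B.32): V holds one candidate-unit output per pattern (shape P);
    # E holds the residual network errors (shape P x O)
    Vc = V - V.mean()
    Ec = E - E.mean(axis=0)
    return np.abs(Vc @ Ec).sum()

V = np.array([0.1, 0.9, 0.2, 0.8])          # candidate hidden-unit outputs
E = np.array([[0.0], [1.0], [0.1], [0.9]])  # errors at one output node
S = cascor_score(V, E)
```

A candidate whose output covaries strongly with the remaining error scores high, which is exactly the feature-detector property the text describes.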
Appendix C
[Ni, M] = size(train_patterns);
Uc = length(unique(train_targets));
%If there are only two classes, remap to {-1,1}
if (Uc == 2)
train_targets = (train_targets>0)*2-1;
end
%Initialize the net: In this implementation there is only one output unit, so there
%will be a weight vector from the hidden units to the output units, and a weight
%matrix from the input units to the hidden units.
%The matrices are defined with one more weight so that there will be a bias
w0 = max(abs(std(train_patterns')'));
Wd = rand(1, Ni+1).*w0*2-w0; %Direct unit weights
Wd = Wd/mean(std(Wd'))*(Ni+1)^(-0.5);
rate = 10*Theta;
J = 1e3;
end
iter = iter + 1;
if (iter/NiterDisp == floor(iter/NiterDisp)),
disp(['Direct unit, iteration ' num2str(iter) ': Average error is ' num2str(J(iter))])
end
end
iter = iter + 1;
J(iter) = M;
rate = 10*Theta;
iter = iter + 1;
if (iter/NiterDisp == floor(iter/NiterDisp)),
disp(['Hidden unit ' num2str(Nh) ', Iteration ' num2str(iter) ': Total error is ' num2str(J(iter))])
end
end
end
if (Uc == 2)
test_targets = test_targets >0;
end
a = 1.716;
b = 2/3;
f = a*tanh(b*x);
df = a*b*sech(b*x).^2;
M-file containing Cascade-Correlation algorithm, as implemented by the author in the CT5960 ANN
Tool.
load BasisWorkspace
% Initialize
iter = 1;
for i = 1:Ni,
trainp(i,:) = train_patterns{i,1};
crossp(i,:) = cross_patterns{i,1};
end
traint = train_targets{1,1};
crosst = cross_targets{1,1};
%------------------
%Initialize the net
%------------------
%Wd is the weight matrix between the input units and the output neuron
%The matrices are defined with one more weight so that there will be a bias (constant at value 1)
w0 = max(abs(std(trainp)'));
Wd = rand(1, Ni+1).*w0*2-w0; %Direct unit weights
GL = 0;
P5 = 100;
%----------------------------------------------------------------------------------
%Training without hidden neurons
%----------------------------------------------------------------------------------
while (iter < 25) | ((iter < Max_iter) & (GL < 2) & (P5 > 0.4)),
cumdeltaWd = zeros(1,length(Wd));
deltaWdprev = zeros(1,Ni+1);
for m=1:M,
for p=1:(length(Wd)),
if iter==1,
deltaWd(p) = LR*grad{iter}(p);
elseif (deltaWdprev(p) > 0),
if (grad{iter}(p) < (Mu/(1+Mu)) * grad{iter-1}(p)),
deltaWd(p) = LR*grad{iter}(p) + Mu*deltaWdprev(p);
else
if (grad{iter}(p) < 0) & (grad{iter}(p) > grad{iter-1}(p)),
deltaWd(p) = LR*grad{iter}(p) + ((grad{iter}(p)*deltaWdprev(p)) / (grad{iter-1}(p)-grad{iter}(p)));
elseif (grad{iter}(p) > 0) & (grad{iter}(p) > grad{iter-1}(p)),
deltaWd(p) = (grad{iter}(p)*deltaWdprev(p)) / (grad{iter-1}(p)-grad{iter}(p));
else
deltaWd(p) = LR*grad{iter}(p);
end
end
end
Wd = Wd + cumdeltaWd;
if abs(max(Wd))>100,
disp('Training process unstable.')
break
end
iter = iter + 1;
JV(iter) = 0;
for j = 1:V,
JV(iter) = JV(iter) + (crosst(j) - activation(Wd*[crossp(:,j);1])).^2;
end
JV(iter) = JV(iter)/V;
JVmin = min(JV(2:iter));
GL = 100*((JV(iter) / JVmin) - 1);
k = 5;
if iter<(k+1),
P5 = 1000*((sum(J(2:iter))) / (5* min(J(2:iter))) - 1);
else
P5 = 1000*((sum(J(iter-k+1:iter))) / (5*min(J(iter-k+1:iter))) - 1);
end
if (iter/NiterDisp == floor(iter/NiterDisp)),
disp(['Direct unit, iteration ' num2str(iter) '. Training: ' num2str(J(iter)) ', Cross-training: ' num2str(JV(iter))])
end
end
JDT = J(iter);
JDV = JV(iter);
%----------------------------------------------------------------------------------
%Training while adding neurons
%----------------------------------------------------------------------------------
disp('Adding neurons...')
Nh = 0;
Wo = Wd;
pre_iter = iter;
improv_e = pre_iter;
J(iter) = 1000;
VLV = 0;
VLT = 0;
GL = 0;
R1 = 1;
R2 = 1;
while (iter < Max_iter) & (GL < 5) & ((P5 > 0.1) | (R1 | R2)) & (Nh < Max_Nh-1),
iterc = 0;
if Nh>1,
%Add NaNs to previous columns of the Wh-matrix to make matrix dimension correct
for i=1:(Nh-1),
Wh(Nh-i,Ni+1+Nh-1)= 0;
end
end
%Add column (connections between previous neurons and new one) and initialize it
Wh(Nh,:) = rand(1, Ni+1+Nh-1).*w0*2-w0;
%Add value (connections between new neuron and output neuron)
Wo(:,Ni+1+Nh) = rand(1,1).*w0*2-w0;
Wbest = Wh;
%-----------------------------------------------
%Training hidden neuron weights (last row Wh)
%-----------------------------------------------
while (iter-improv_e < 40) & ((VLV < 25) | (iter-pre_iter < 25) | (VLT ~= 0)) & ...
(iterc < 150),
iterc = iterc + 1;
cum_delta_j = 0;
cumdeltaWh = zeros(1,length(Wh(Nh,:)));
deltaWhprev = zeros(1,length(Wh(Nh,:)));
for m=1:M,
Xm = trainp(:,m);
tk = traint(m);
for i = 1:Nh,
Whtemp = Wh(i,:);
Whtemp((Ni+1+i):end) = []; %delete NaNs from column
Whtempi = Whtemp;
Whtempi((Ni+1+1):end) = []; %delete non-input connection weights from column
if i>1,
g(i) = g(i) + Whtemp(Ni+1+i-1)*y(Ni+1+i-1); %connections from hidden neurons
end
yprev(end) = [];
grad{iter} = delta_k * yprev;
for p=1:(length(Wh(Nh,:))),
if (iterc==1),
deltaWh(p) = LR*grad{iter}(p);
elseif (deltaWhprev(p) > 0),
if (grad{iter}(p) < (Mu/(1+Mu)) * grad{iter-1}(p)),
deltaWh(p) = LR*grad{iter}(p) + Mu*deltaWhprev(p);
else
if (grad{iter}(p) < 0) & (grad{iter}(p) > grad{iter-1}(p)),
deltaWh(p) = LR*grad{iter}(p) + ...
((grad{iter}(p)*deltaWhprev(p)) / (grad{iter-1}(p)-grad{iter}(p)));
elseif (grad{iter}(p) > 0) & (grad{iter}(p) > grad{iter-1}(p)),
deltaWh(p) = (grad{iter}(p)*deltaWhprev(p)) / (grad{iter-1}(p)-grad{iter}(p));
else
deltaWh(p) = LR*grad{iter}(p);
end
end
deltaWh = wdecay*deltaWh;
deltaWhprev = deltaWh;
cumdeltaWh = cumdeltaWh + deltaWh;
end
iter = iter + 1;
JV(iter) = 0;
for j = 1:V,
Xm = crossp(:,j);
%determine goodness
GoodT(iter) = 100 * ( (J(iter)*M / abs(cum_delta_j)) - 1);
GoodV(iter) = 100 * ( (JV(iter)*V / abs(cum_delta_j)) - 1);
if GoodV(iter) == max(GoodV),
if GoodV(iter)~=GoodV(iter-1),
Wbest = Wh;
end
end
if (iter/NiterDisp == floor(iter/NiterDisp)),
disp(['Hidden unit ' num2str(Nh) ', Iteration ' num2str(iter) ...
'. Training: ' num2str(J(iter)) ', Cross-training: ' num2str(JV(iter))])
end
end
%-----------------------------------
%Training output neuron weights (Wo)
%-----------------------------------
rate = 10;
m = 0;
pre_iter= iter;
while (iter - pre_iter < 25) | ((iter < Max_iter) & (GL < 2) & (P5 > 0.4)),
cumdeltaWo = zeros(1,length(Wo));
deltaWoprev = zeros(1,length(Wo));
for m=1:M,
Xm = trainp(:,m);
tk = traint(m);
for i = 1:Nh,
Whtemp = Wh(i,:);
Whtemp((Ni+1+i):end) = []; %delete NaNs from column
Whtempi = Whtemp;
Whtempi((Ni+1+1):end) = []; %delete non-input connection weights from column
if i>1,
g(i) = g(i) + Whtemp(Ni+1+i-1)*y(Ni+1+i-1); %connections from hidden neurons
end
grad{iter} = delta_k * y;
for p=1:(length(Wo)),
if (iter-pre_iter==0),
deltaWo(p) = LR*grad{iter}(p);
elseif (deltaWoprev(p) > 0),
if (grad{iter}(p) < (Mu/(1+Mu)) * grad{iter-1}(p)),
deltaWo(p) = LR*grad{iter}(p) + Mu*deltaWoprev(p);
else
if (grad{iter}(p) < 0) & (grad{iter}(p) > grad{iter-1}(p)),
deltaWo(p) = LR*grad{iter}(p) + ...
((grad{iter}(p)*deltaWoprev(p)) / (grad{iter-1}(p)-grad{iter}(p)));
elseif (grad{iter}(p) > 0) & (grad{iter}(p) > grad{iter-1}(p)),
deltaWo(p) = (grad{iter}(p)*deltaWoprev(p)) / (grad{iter-1}(p)-grad{iter}(p));
else
deltaWo(p) = LR*grad{iter}(p);
end
end
Wo = Wo + cumdeltaWo;
if max(abs(Wo))>100,
disp('Training process unstable.')
break
end
iter = iter + 1;
JV(iter) = 0;
for j = 1:V,
Xm = crossp(:,j);
JV(iter) = JV(iter) + (crosst(j) - cas_cor_activation(Xm, Wh, Wo, Ni, Nh)).^2;
end
JV(iter) = JV(iter)/V;
JVmin = min(JV(2:iter));
GL = 100*((JV(iter) / JVmin) - 1);
k = 5;
P5 = 1000*((sum(J(iter-k+1:iter))) / (5*min(J(iter-k+1:iter))) - 1);
if (iter/NiterDisp == floor(iter/NiterDisp)),
disp(['Hidden unit ' num2str(Nh) ' (post), Iteration ' num2str(iter) ...
'. Training: ' num2str(J(iter)) ', Cross-training: ' num2str(JV(iter))])
end
end
JNT(Nh) = J(iter);
JNV(Nh) = JV(iter);
if JNV(Nh) == min(JNV),
Wh_best = Wh;
Wo_best = Wo;
Nh_best = Nh;
end
if Nh > 1,
R1 = ((JNT(Nh-1) - JNT(Nh)) / JNT(Nh-1))*100 > 0.1; %relative training-error improvement exceeds 0.1%
R2 = JNV(Nh) - JNV(Nh-1) < 0;
else
R1 = 1;
R2 = 1;
end
end
Wh = Wh_best;
Wo = Wo_best;
Nh = Nh_best;
if (min(JNV)>JDV),
Nh=0;
Wo=Wd;
end
if Nh == 0,
Wh = 0;
end
save BasisWorkspace
for i = 1:Nh,
Whtemp = Wh(i,:);
Whtemp((Ni+1+i):end) = []; %delete NaNs from column
Whtempi = Whtemp;
Whtempi((Ni+1+1):end) = []; %delete non-input connection weights from column
if i>1,
g(i) = g(i) + Whtemp(Ni+1+i-1)*y(Ni+1+i-1); %connections from hidden neurons
end
f = tanh(x);
df = sech(x).^2 + SPO;
Appendix D
[Figures: target values and network predictions of discharge (x 10^4) over time points 0-400 of the test set, one panel per network, ANN 1 through ANN 24. Legible performance scores per panel:
ANN 1: RMSE 3707.7, R2 62.6; ANN 2: RMSE 3296.9, R2 67.7; ANN 5: RMSE 3623.6, R2 63.3; ANN 6: RMSE 3474, R2 55.5; ANN 17: RMSE 3004.8, R2 82.8; ANN 18: RMSE 3040, R2 79.3; ANN 19: RMSE 3558.0, R2 53.6; ANN 20: RMSE 3504.3, R2 53.5; ANN 21: RMSE 3471.1, R2 53.1; ANN 22: RMSE 3073.2, R2 74.9; ANN 23: RMSE 3083.1, R2 72.6; ANN 24: RMSE 3141, R2 77.1]
Appendix E
Data is often imported into Matlab using the Import Data wizard. See the Matlab documentation for details on this wizard.
The CT5960 ANN Tool has two possibilities for reading variables:
Reading all variables from a .MAT-file;
Several Matlab variables can be stored in a Matlab .MAT-file using the save command.
For example, the following command:
save data.mat discharge prec
saves the discharge and prec variables into a data file called data.mat, which can then
be loaded into the CT5960 ANN Tool.
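For completeness: such a file can also be restored into the Matlab workspace itself with the complementary load command (the file and variable names follow the example above):

```matlab
load data.mat    % restores the discharge and prec variables
```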
Procedure
After variables have been loaded, the following procedure can be followed.
The data selection must take place first, after which the data is split sampled. Input and output variables can be added to and deleted from the network. When adding variables to the network, the pop-up windows require time steps to be entered for these variables. The reason for this is that the tool can only use static networks, in which the time dimension is incorporated through a so-called window-of-time input approach. For example, a prediction of discharge at the following time step based on three previous rainfall values results in inputs of R at time steps -2, -1 and 0 and an output of Q at +1. Split sampling parameters are set in the appropriate field on the right. The first step is concluded by pressing the Finish Data Selection button.
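The window-of-time example above can be sketched in Matlab as follows. This is an illustrative sketch, not code from the tool itself; the variable names prec and discharge follow the earlier save example.

```matlab
% Build a window-of-time input/output set: R at t-2, t-1 and t as inputs,
% Q at t+1 as output. prec and discharge are column vectors of length N.
N = length(prec);
P = zeros(3, N-3);              % one column of inputs per training pattern
T = zeros(1, N-3);              % corresponding target outputs
for t = 3:(N-1),
    P(:, t-2) = prec(t-2:t);    % rainfall window R(-2), R(-1), R(0)
    T(t-2) = discharge(t+1);    % discharge one step ahead, Q(+1)
end
```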
Secondly, the ANN architecture is set up by choosing the number of neurons, the type of transfer functions and the error function used during ANN training. The Cascade-Correlation algorithm disables these settings: the number of neurons is chosen automatically, the transfer function defaults to tansig (hyperbolic tangent) and the error function to MSE.
The training and testing of the ANN is the third and final step in the procedure. Several training parameters can be chosen, depending on the training algorithm. All regular training algorithms require the maximum number of epochs and the training goal to be defined. The Cascade-Correlation algorithm requires the training goal and the learning rate for the embedded Quickprop algorithm. Good values for the learning rate lie between 1 (slow but stable learning) and 10 (faster but possibly unstable learning). Using cross-training is often wise because it reduces the risk of overtraining. An ANN is tested by pressing the Test ANN Performance button. This shows a window with the target values, the ANN predictions and two measures of model performance: the Root Mean Squared Error (RMSE) and the Nash-Sutcliffe coefficient (R2).
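These two measures can be computed from the target values and ANN predictions along the following lines. This is a sketch of the standard formulas, not the tool's own code; R2 is expressed here as a percentage, matching the plots in Appendix D.

```matlab
% target and pred are column vectors of equal length.
RMSE = sqrt(mean((target - pred).^2));
% Nash-Sutcliffe coefficient, in percent: 100 means a perfect fit,
% 0 means no better than predicting the mean of the observations.
R2 = 100 * (1 - sum((target - pred).^2) / sum((target - mean(target)).^2));
```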
Other functions
The Re-initialize Interface button clears the entire state of the tool, returning the GUI to the state it was in when the tool was started.
The View Variable button creates a figure in which the currently selected variable is plotted.
The user can exit the tool either by pressing the Exit button (which asks for confirmation) or by closing the window with the small cross in the upper right corner (which asks for no confirmation).