CS 476: Networks of Neural Computation

WK5 Dynamic Networks: Time Delayed & Recurrent Networks

Dr. Stathis Kasderidis
Dept. of Computer Science
University of Crete

Spring Semester, 2009
Contents

- Sequence Learning
- Time Delayed Networks I: Implicit Representation
- Time Delayed Networks II: Explicit Representation
- Recurrent Networks I: Elman + Jordan Networks
- Recurrent Networks II: BackPropagation Through Time
- Conclusions


Sequence Learning

- MLP & RBF networks are static networks, i.e. they learn a mapping from a single input signal to a single output response, for an arbitrarily large number of pairs.

- Dynamic networks learn a mapping from a single input signal to a sequence of response signals, for an arbitrary number of (signal, sequence) pairs.

- Typically the input signal to a dynamic network is an element of the sequence, and the network then produces as a response the rest of the sequence.

- To learn sequences we need to include some form of memory (short-term memory) in the network.
Sequence Learning II

- We can introduce memory effects in two principal ways:
  - Implicit: e.g. a time-lagged signal presented as input to a static network, or recurrent connections.
  - Explicit: e.g. the Temporal Backpropagation method.

- In the implicit form, we assume that the environment from which we collect examples of (input signal, output sequence) is stationary. In the explicit form the environment can be non-stationary, i.e. the network can track changes in the structure of the signal.
Time Delayed Networks I

- The time delayed approach includes two basic types of networks:
  - Implicit Representation of Time: we combine a memory structure in the input layer of the network with a static network model.
  - Explicit Representation of Time: we explicitly allow the network to code time, by generalising the network weights from scalars to vectors, as in TBP (Temporal Backpropagation).

- Typical forms of memory that are used are the Tapped Delay Line and the Gamma Memory family.
Time Delayed Networks I

- The Tapped Delay Line form of memory is shown below for an input signal x(n):

  [Figure: Tapped Delay Line memory for the input signal x(n)]

- The Gamma form of memory is defined by the impulse response:

  $g_p(n) = \binom{n-1}{p-1}\,\mu^p (1-\mu)^{n-p}, \qquad n \ge p$

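A minimal numpy sketch of a p-stage gamma memory, assuming the usual leaky recursion $g_k(n) = (1-\mu)\,g_k(n-1) + \mu\,g_{k-1}(n-1)$ with $g_0(n) = x(n)$; for $\mu = 1$ the structure degenerates to an ordinary tapped delay line. The function names and the check at the end are illustrative only.

```python
import numpy as np
from math import comb

def gamma_impulse_response(p, mu, n_max):
    """Impulse response g_p(n) = C(n-1, p-1) * mu^p * (1-mu)^(n-p), for n >= p."""
    g = np.zeros(n_max + 1)
    for n in range(p, n_max + 1):
        g[n] = comb(n - 1, p - 1) * mu**p * (1 - mu) ** (n - p)
    return g

def gamma_memory(x, p, mu):
    """Run a p-stage gamma memory over the signal x.

    Returns an array of shape (len(x), p+1) holding the memory taps
    [g_0(n), g_1(n), ..., g_p(n)] at every time step n."""
    taps = np.zeros((len(x), p + 1))
    state = np.zeros(p + 1)
    for n, xn in enumerate(x):
        new_state = np.empty_like(state)
        new_state[0] = xn                                     # g_0(n) = x(n)
        new_state[1:] = (1 - mu) * state[1:] + mu * state[:-1]
        state = new_state
        taps[n] = state
    return taps

# For mu = 1 the gamma memory reduces to a tapped delay line:
x = np.random.randn(50)
taps = gamma_memory(x, p=3, mu=1.0)
assert np.allclose(taps[10], [x[10], x[9], x[8], x[7]])
```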

Time Delayed Networks I

- The Gamma Memory is shown below:

  [Figure: The Gamma Memory structure]


Time Delayed Networks I

- In the implicit representation approach we combine a static network (e.g. MLP / RBF) with a memory structure (e.g. tapped delay line). An example is shown below (the NETtalk network):

  [Figure: The NETtalk network]


Time Delayed Networks I

- We present the data in a sliding window. For example, in NETtalk the middle group of input neurons presents the letter in focus. The rest of the input groups, three before & three after, present context.

- The purpose is to predict, for example, the next symbol in the sequence.

- The NETtalk model (Sejnowski & Rosenberg, 1987) has:
  - 203 input nodes
  - 80 hidden neurons
  - 26 output neurons
  - 18629 weights
  - Used the BP method for training
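A minimal sketch of a NETtalk-style sliding-window input encoding: a window of 7 symbols, each one-hot coded over a 29-symbol group (26 letters plus 3 boundary/punctuation marks), giving the 7 × 29 = 203 input nodes quoted above. The exact symbol set and the blank padding are assumptions for illustration.

```python
import numpy as np

ALPHABET = list("abcdefghijklmnopqrstuvwxyz") + [" ", ".", ","]  # assumed 29-symbol group
SYM2IDX = {s: i for i, s in enumerate(ALPHABET)}
WINDOW = 7                      # letter in focus plus 3 letters of context on each side
HALF = WINDOW // 2

def encode_window(text, focus):
    """One-hot encode the 7-symbol window centred on position `focus`."""
    vec = np.zeros(WINDOW * len(ALPHABET))
    for k in range(-HALF, HALF + 1):
        pos = focus + k
        symbol = text[pos] if 0 <= pos < len(text) else " "  # pad with blanks at the ends
        vec[(k + HALF) * len(ALPHABET) + SYM2IDX[symbol]] = 1.0
    return vec

x = encode_window("hello world", focus=4)  # window centred on the letter 'o'
print(x.shape)                             # (203,) -> input to the static MLP
```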
Time Delayed Networks II

- In explicit time representation, neurons have a spatio-temporal structure, i.e. each synapse arriving at a neuron is not a scalar number but a vector of weights, which is used to convolve the time-delayed signal coming from the previous neuron with the synapse.

- A schematic representation of such a neuron is given below:

  [Figure: A neuron with vector-valued synapses]


Time Delayed Networks II

- The output of the neuron in this case is given by:

  $y_j(n) = \varphi\!\left(\sum_{l=0}^{p} w_j(l)\,x(n-l) + b_j\right)$

- In the case of a whole network, for example assuming a single output node and a linear output layer, the response is given by:

  $y(n) = \sum_{j=1}^{m_1} w_j\,y_j(n) + b_0 = \sum_{j=1}^{m_1} w_j\,\varphi\!\left(\sum_{l=0}^{p} w_j(l)\,x(n-l) + b_j\right) + b_0$

- Where p is the depth of the memory and $b_0$ is the bias of the output neuron.


Time Delayed Networks II

- In the more general case, where we have multiple neurons at the output layer, we have for neuron j of any layer:

  $y_j(n) = \varphi\!\left(\sum_{i=1}^{m_0}\sum_{l=0}^{p} w_{ji}(l)\,x_i(n-l) + b_j\right)$

- The output of any synapse is given by the convolution sum:

  $s_{ji}(n) = \mathbf{w}_{ji}^{T}\mathbf{x}_i(n) = \sum_{l=0}^{p} w_{ji}(l)\,x_i(n-l)$


Time Delayed Networks II

- Where the state vector $\mathbf{x}_i(n)$ and weight vector $\mathbf{w}_{ji}$ for synapse i are defined as follows:

  $\mathbf{x}_i(n) = [x_i(n), x_i(n-1), \ldots, x_i(n-p)]^{T}$

  $\mathbf{w}_{ji} = [w_{ji}(0), w_{ji}(1), \ldots, w_{ji}(p)]^{T}$

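A minimal numpy sketch of the convolution sum and of a neuron with vector-valued synapses, following the definitions above; the tanh activation and the toy sizes are assumptions for illustration.

```python
import numpy as np

def synapse_output(w_ji, x_i):
    """Convolution sum of one synapse:
       s_ji(n) = w_ji^T x_i(n) = sum_{l=0}^{p} w_ji(l) * x_i(n-l)."""
    return np.dot(w_ji, x_i)

def neuron_output(W_j, b_j, X, phi=np.tanh):
    """Output of neuron j:  y_j(n) = phi( sum_i w_ji^T x_i(n) + b_j ).

    W_j: (m0, p+1) array, row i holds the weight vector w_ji.
    X:   (m0, p+1) array, row i holds the state vector [x_i(n), ..., x_i(n-p)]."""
    v_j = np.sum(W_j * X) + b_j          # sum of the m0 convolution sums plus bias
    return phi(v_j)

# Toy example with m0 = 2 incoming synapses and memory depth p = 3
rng = np.random.default_rng(0)
W_j = rng.normal(size=(2, 4))
X = rng.normal(size=(2, 4))
print(neuron_output(W_j, b_j=0.1, X=X))
```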

Time Delayed Networks II: Learning Law

- To train such a network we use the Temporal BackPropagation algorithm. We present the algorithm below.

- Assume that neuron j lies in the output layer and its response at time n is denoted by $y_j(n)$, while its desired response is given by $d_j(n)$.

- We can define an instantaneous value for the sum of squared errors produced by the network as follows:

  $E(n) = \frac{1}{2}\sum_j e_j^2(n)$


Time Delayed Networks II: Learning Law-1

- The error signal at the output layer is defined by:

  $e_j(n) = d_j(n) - y_j(n)$

- The idea is to minimise an overall cost function, calculated over all time:

  $E_{total} = \sum_n E(n)$

- We could proceed as usual by calculating the gradient of the cost function over the weights. This implies that we need to calculate the instantaneous gradient:

  $\frac{\partial E_{total}}{\partial \mathbf{w}_{ji}} = \sum_n \frac{\partial E(n)}{\partial \mathbf{w}_{ji}}$


Time Delayed Networks II: Learning Law-2

- However, for this approach to work we need to unfold the network in time (i.e. to convert it to an equivalent static network and then calculate the gradient). This option presents a number of disadvantages:
  - A loss of symmetry between the forward and backward pass for the calculation of the instantaneous gradient;
  - No nice recursive formula for the propagation of error terms;
  - The need for global bookkeeping to keep track of which static weights are actually the same in the equivalent network.


Time Delayed Networks II: Learning Law-3

- For these reasons we prefer to calculate the gradient of the cost function as follows:

  $\frac{\partial E_{total}}{\partial \mathbf{w}_{ji}} = \sum_n \frac{\partial E_{total}}{\partial v_j(n)}\,\frac{\partial v_j(n)}{\partial \mathbf{w}_{ji}}$

- Note that, in general:

  $\frac{\partial E_{total}}{\partial v_j(n)}\,\frac{\partial v_j(n)}{\partial \mathbf{w}_{ji}} \neq \frac{\partial E(n)}{\partial \mathbf{w}_{ji}}$

- The equality is correct only when we take the sum over all time.


Time Delayed Networks II: Learning Law-4

- To calculate the weight update we use the steepest descent method:

  $\mathbf{w}_{ji}(n+1) = \mathbf{w}_{ji}(n) - \eta\,\frac{\partial E_{total}}{\partial v_j(n)}\,\frac{\partial v_j(n)}{\partial \mathbf{w}_{ji}(n)}$

- Where $\eta$ is the learning rate.

- We calculate the terms in the above relation as follows:

  $\frac{\partial v_j(n)}{\partial \mathbf{w}_{ji}(n)} = \mathbf{x}_i(n)$

- This follows from the definition of the induced local field, $v_j(n) = \sum_i \mathbf{w}_{ji}^{T}(n)\,\mathbf{x}_i(n)$.


Time Delayed Networks II: Learning Law-5

- We define the local gradient as:

  $\delta_j(n) = -\frac{\partial E_{total}}{\partial v_j(n)}$

- Thus we can write the weight update equations in the familiar form:

  $\mathbf{w}_{ji}(n+1) = \mathbf{w}_{ji}(n) + \eta\,\delta_j(n)\,\mathbf{x}_i(n)$

- We need to calculate the $\delta_j(n)$ for the cases of output and hidden layers.


Time Delayed Networks II: Learning Law-6

- For the output layer the local gradient is given by:

  $\delta_j(n) = -\frac{\partial E_{total}}{\partial v_j(n)} = -\frac{\partial E(n)}{\partial v_j(n)} = e_j(n)\,\varphi'(v_j(n))$

- For a hidden layer we assume that neuron j is connected to a set A of neurons in the next layer (hidden or output). Then we have:

  $\delta_j(n) = -\frac{\partial E_{total}}{\partial v_j(n)} = -\sum_{r \in A}\sum_{k} \frac{\partial E_{total}}{\partial v_r(k)}\,\frac{\partial v_r(k)}{\partial v_j(n)}$


Time Delayed Networks II: Learning Law-7

- By re-writing we get the following:

  $\delta_j(n) = \sum_{r \in A}\sum_{k} \delta_r(k)\,\frac{\partial v_r(k)}{\partial v_j(n)} = \sum_{r \in A}\sum_{k} \delta_r(k)\,\frac{\partial v_r(k)}{\partial y_j(n)}\,\frac{\partial y_j(n)}{\partial v_j(n)}$

- Finally, putting it all together, we get:

  $\delta_j(n) = \varphi'(v_j(n)) \sum_{r \in A}\sum_{k=n}^{n+p} \delta_r(k)\,w_{rj}(k-n) = \varphi'(v_j(n)) \sum_{r \in A}\sum_{l=0}^{p} \delta_r(n+l)\,w_{rj}(l)$

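A minimal numpy sketch of the hidden-layer local gradient and the weight update, as reconstructed above; the array layout (one row per neuron r in A) and the tanh derivative are assumptions for illustration.

```python
import numpy as np

def hidden_delta(v_j_n, deltas_next, W_next_j, phi_prime=lambda v: 1.0 - np.tanh(v) ** 2):
    """Local gradient of hidden neuron j at time n under temporal backpropagation:

       delta_j(n) = phi'(v_j(n)) * sum_{r in A} sum_{l=0}^{p} delta_r(n+l) * w_rj(l)

    deltas_next: (|A|, p+1) array, row r holds [delta_r(n), ..., delta_r(n+p)]
    W_next_j:    (|A|, p+1) array, row r holds the filter weights w_rj(0..p)"""
    return phi_prime(v_j_n) * np.sum(deltas_next * W_next_j)

def weight_update(w_ji, eta, delta_j_n, x_i_n):
    """Familiar LMS-like form:  w_ji(n+1) = w_ji(n) + eta * delta_j(n) * x_i(n)."""
    return w_ji + eta * delta_j_n * x_i_n

# Toy numbers: 2 downstream neurons, memory depth p = 3
rng = np.random.default_rng(1)
d_j = hidden_delta(0.3, rng.normal(size=(2, 4)), rng.normal(size=(2, 4)))
w_new = weight_update(rng.normal(size=4), eta=0.05, delta_j_n=d_j, x_i_n=rng.normal(size=4))
```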

Time Delayed Networks II: Learning Law-8

- Here l = k - n indexes the taps of the synaptic filters (l = 0, ..., p) at the given layer level; the recursion is applied layer by layer, propagating the local gradients backwards as in standard backpropagation.


Recurrent I

- A network is called recurrent when there are connections which feed back to previous layers or neurons, including self-connections. An example is shown next:

  [Figure: A recurrent network with feedback connections]

- Successful early models of recurrent networks are:
  - The Jordan Network
  - The Elman Network


Recurrent I

- The Jordan Network has the structure of an MLP plus additional context units. The output neurons feed back to the context neurons in a 1-1 fashion. The context units also feed back to themselves.

- The network is trained by using the Backpropagation algorithm.

- A schematic is shown in the next figure:

  [Figure: The Jordan network]


Recurrent I

- The Elman Network also has the structure of an MLP plus additional context units. The hidden neurons feed back to the context neurons in a 1-1 fashion. The hidden neurons' connections to the context units are constant and equal to 1. It is also called a Simple Recurrent Network (SRN).

- The network is trained by using the Backpropagation algorithm.

- A schematic is shown in the next figure:

  [Figure: The Elman network (SRN)]

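A minimal numpy sketch of one forward step of an Elman SRN: the context units hold a copy of the previous hidden activations (copy connections fixed at 1) and feed the hidden layer together with the current input. The sizes, tanh hidden layer and linear output are assumptions for illustration; in a Jordan network the context would instead copy the previous output (and leak back onto itself).

```python
import numpy as np

def elman_step(u, context, W_in, W_ctx, W_out, b_h, b_o):
    """One forward step of an Elman (Simple Recurrent) network."""
    h = np.tanh(W_in @ u + W_ctx @ context + b_h)  # hidden layer sees input + context
    y = W_out @ h + b_o                            # output layer (linear here)
    return y, h                                    # the new context is the hidden state

# Hypothetical sizes: 3 inputs, 5 hidden/context units, 2 outputs
rng = np.random.default_rng(2)
W_in, W_ctx = rng.normal(size=(5, 3)), rng.normal(size=(5, 5))
W_out, b_h, b_o = rng.normal(size=(2, 5)), np.zeros(5), np.zeros(2)

context = np.zeros(5)
for u in rng.normal(size=(10, 3)):                 # a short input sequence
    y, context = elman_step(u, context, W_in, W_ctx, W_out, b_h, b_o)
```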

Recurrent II

- More complex forms of recurrent networks are possible. We can start by extending an MLP as a basic building block.

- Typical paradigms of complex recurrent models are:
  - The Nonlinear Autoregressive with Exogenous Inputs Network (NARX)
  - The State Space Model
  - The Recurrent Multilayer Perceptron (RMLP)

- Schematic representations of the networks are given in the next slides.


Recurrent II-1

- The structure of the NARX model includes:
  - A static MLP network;
  - The current input u(n) and its delayed versions up to lag q;
  - Time delayed versions of the current output y(n), which feed back to the input layer. The memory of the delayed output vector is in general p.

- The output is calculated as:

  $y(n+1) = F(y(n), \ldots, y(n-p+1), u(n), \ldots, u(n-q+1))$

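A minimal numpy sketch of one NARX prediction step, feeding the regressor [y(n), ..., y(n-p+1), u(n), ..., u(n-q+1)] into a static network; the one-hidden-layer MLP, the sizes and the random weights are stand-ins for a trained model.

```python
import numpy as np

def narx_step(y_hist, u_hist, F):
    """y(n+1) = F( y(n), ..., y(n-p+1), u(n), ..., u(n-q+1) ),
    with y_hist and u_hist ordered newest-first and F a static network."""
    return F(np.concatenate([y_hist, u_hist]))

# Hypothetical static MLP F with one tanh hidden layer
rng = np.random.default_rng(3)
p, q = 3, 2
W1, b1 = rng.normal(size=(8, p + q)), np.zeros(8)
W2, b2 = rng.normal(size=(1, 8)), np.zeros(1)
F = lambda z: (W2 @ np.tanh(W1 @ z + b1) + b2)[0]

y_hist, u_hist = np.zeros(p), np.zeros(q)
for u_n in rng.normal(size=20):
    u_hist = np.concatenate(([u_n], u_hist[:-1]))     # shift the input delay line
    y_next = narx_step(y_hist, u_hist, F)
    y_hist = np.concatenate(([y_next], y_hist[:-1]))  # feed the new output back
```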

Recurrent II-2

- A schematic of the NARX model is as follows:

  [Figure: The NARX model]


Recurrent II-3

- The structure of the State Space model includes:
  - An MLP network with a single hidden layer;
  - The hidden neurons define the state of the network;
  - A linear output layer;
  - A feedback of the hidden layer to the input layer, assuming a memory of q lags.

- The output is determined by the coupled equations:

  $x(n+1) = f(x(n), u(n))$

  $y(n+1) = C\,x(n+1)$
Recurrent II-4

- Where f is a suitable nonlinear function characterising the hidden layer. x is the state vector, as produced by the hidden layer, and has q components. y is the output vector and has p components. The input vector u has m components.

- A schematic representation of the network is given below:

  [Figure: The state-space recurrent model]

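A minimal numpy sketch of the state-space recurrent model, realising f as a single tanh hidden layer; the dimensions (q = 4, m = 2, p = 1) and random weights are assumptions for illustration.

```python
import numpy as np

def state_space_step(x, u, W_x, W_u, b, C):
    """x(n+1) = f(x(n), u(n)) with a tanh hidden layer;  y(n+1) = C x(n+1)."""
    x_next = np.tanh(W_x @ x + W_u @ u + b)  # nonlinear hidden (state) layer
    y_next = C @ x_next                      # linear output layer
    return x_next, y_next

# Hypothetical dimensions: q = 4 state units, m = 2 inputs, p = 1 output
rng = np.random.default_rng(4)
W_x, W_u, b = rng.normal(size=(4, 4)), rng.normal(size=(4, 2)), np.zeros(4)
C = rng.normal(size=(1, 4))

x = np.zeros(4)
for u in rng.normal(size=(15, 2)):
    x, y = state_space_step(x, u, W_x, W_u, b, C)
```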

Recurrent II-5

- The structure of the RMLP includes:
  - One or more hidden layers;
  - Feedback around each layer;
  - The general structure of a static MLP network.

- The output is calculated as follows (assuming that $x_I$, $x_{II}$ and $x_O$ are the first, second and output layer outputs):

  $x_I(n+1) = \varphi_I(x_I(n), u(n))$

  $x_{II}(n+1) = \varphi_{II}(x_{II}(n), x_I(n+1))$

  $x_O(n+1) = \varphi_O(x_O(n), x_{II}(n+1))$
Recurrent II-6

- Where the functions $\varphi_I(\cdot)$, $\varphi_{II}(\cdot)$ and $\varphi_O(\cdot)$ denote the activation functions of the corresponding layers.

- A schematic representation is given below:

  [Figure: The Recurrent Multilayer Perceptron (RMLP)]

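A minimal numpy sketch of one step of a two-hidden-layer RMLP, where each $\varphi_*$ is realised as a tanh layer acting on its own fed-back state plus the feedforward input; the weight shapes and the tanh choice are assumptions for illustration.

```python
import numpy as np

def rmlp_step(x_I, x_II, x_O, u, params, phi=np.tanh):
    """One step of an RMLP with feedback around each layer:
       x_I(n+1)  = phi_I ( x_I(n),  u(n)      )
       x_II(n+1) = phi_II( x_II(n), x_I(n+1)  )
       x_O(n+1)  = phi_O ( x_O(n),  x_II(n+1) )"""
    (W1f, W1b, b1), (W2f, W2b, b2), (W3f, W3b, b3) = params
    x_I  = phi(W1b @ x_I  + W1f @ u    + b1)   # first hidden layer with self-feedback
    x_II = phi(W2b @ x_II + W2f @ x_I  + b2)   # second hidden layer with self-feedback
    x_O  = phi(W3b @ x_O  + W3f @ x_II + b3)   # output layer with self-feedback
    return x_I, x_II, x_O

# Hypothetical sizes: 2 inputs, 4 units per hidden layer, 1 output
rng = np.random.default_rng(5)
mk = lambda *s: rng.normal(size=s)
params = [(mk(4, 2), mk(4, 4), np.zeros(4)),
          (mk(4, 4), mk(4, 4), np.zeros(4)),
          (mk(1, 4), mk(1, 1), np.zeros(1))]
x_I, x_II, x_O = np.zeros(4), np.zeros(4), np.zeros(1)
for u in rng.normal(size=(10, 2)):
    x_I, x_II, x_O = rmlp_step(x_I, x_II, x_O, u, params)
```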

Recurrent II-7

- Some theorems on the computational power of recurrent networks:

  - Thm 1: All Turing machines may be simulated by fully connected recurrent networks built on neurons with sigmoid activation functions.

  - Thm 2: NARX networks with one layer of hidden neurons with bounded, one-sided saturated (BOSS) activation functions and a linear output neuron can simulate fully connected recurrent networks with bounded, one-sided saturated activation functions, except for a linear slowdown.


Recurrent II-8

- Corollary: NARX networks with one hidden layer of neurons with BOSS activation functions and a linear output neuron are Turing equivalent.


Recurrent II-9

- The training of recurrent networks can be done with two methods:
  - BackPropagation Through Time
  - Real-Time Recurrent Learning

- We can train a recurrent network with either epoch-based or continuous training operation. However, an epoch in recurrent networks does not mean the presentation of all learning patterns, but rather denotes the length of a single sequence that we use for training. So an epoch in a recurrent network corresponds to presenting only one pattern to the network.

- At the end of an epoch the network stabilises.
Recurrent II-10

- Some useful heuristics for the training are given below:
  - Lexicographic order of training samples should be followed, with the shortest strings of symbols being presented to the network first;
  - The training should begin with a small training sample and then its size should be incrementally increased as the training proceeds;
  - The synaptic weights of the network should be updated only if the absolute error on the training sample currently being processed by the network is greater than some prescribed criterion;
Recurrent II-11

  - The use of weight decay during training is recommended; weight decay was discussed in WK3.

- The BackPropagation Through Time algorithm proceeds by unfolding a network in time. To be more specific:
  - Assume that we have a recurrent network N which is required to learn a temporal task starting from time n0 and going all the way to time n.
  - Let N* denote the feedforward network that results from unfolding the temporal operation of the recurrent network N.
Recurrent II-12

- The network N* is related to the original network N as follows:
  - For each time step in the interval (n0, n], the network N* has a layer containing K neurons, where K is the number of neurons contained in network N;
  - In every layer of network N* there is a copy of each neuron in network N;
  - For every time step $l \in [n_0, n]$, the synaptic connection from neuron i in layer l to neuron j in layer l+1 of the network N* is a copy of the synaptic connection from neuron i to neuron j in the network N.

- The following example explains the idea of unfolding:
Recurrent II-13

- We assume that we have a network with two neurons which is unfolded for a number of steps, n:

  [Figure: A two-neuron recurrent network unfolded in time]


Recurrent II-14

- We present now the method of Epochwise BackPropagation Through Time.

- Let the dataset used for training the network be partitioned into independent epochs, with each epoch representing a temporal pattern of interest. Let n0 denote the start time of an epoch and n1 denote its end time.

- We can define the following cost function:

  $E_{total}(n_0, n_1) = \frac{1}{2}\sum_{n=n_0}^{n_1}\sum_{j \in A} e_j^2(n)$


Recurrent II-15

- Where A is the set of indices j pertaining to those neurons in the network for which desired responses are specified, and $e_j(n)$ is the error signal at the output of such a neuron, measured with respect to some desired response.


Recurrent II-16

- The algorithm proceeds as follows:

  1. For a given epoch, the recurrent network starts running from some initial state until it reaches a new state, at which point the training is stopped and the network is reset to an initial state for the next epoch. The initial state doesn't have to be the same for each epoch of training. Rather, what is important is for the initial state of the new epoch to be different from the state reached by the network at the end of the previous epoch;


Recurrent II-17

  2. First a single forward pass of the data through the network for the interval (n0, n1) is performed. The complete record of input data, network state (i.e. synaptic weights), and desired responses over this interval is saved;

  3. A single backward pass over this past record is performed to compute the values of the local gradients:

     $\delta_j(n) = -\frac{\partial E_{total}(n_0, n_1)}{\partial v_j(n)}$

     for all $j \in A$ and $n_0 < n \le n_1$. This computation is performed by the formula:
Recurrent II-18

     $\delta_j(n) = \varphi'(v_j(n))\,e_j(n)$, for $n = n_1$

     $\delta_j(n) = \varphi'(v_j(n))\left[e_j(n) + \sum_{k \in A} w_{jk}\,\delta_k(n+1)\right]$, for $n_0 < n < n_1$

     Where $\varphi'(\cdot)$ is the derivative of the activation function with respect to its argument, and $v_j(n)$ is the induced local field of neuron j.

     The use of the above formula is repeated, starting from time n1 and working back, step by step, to time n0; the number of steps involved here is equal to the number of time steps contained in the epoch.
Recurrent II-19

  4. Once the computation of back-propagation has been performed back to time n0+1, the following adjustment is applied to the synaptic weight $w_{ji}$ of neuron j:

     $\Delta w_{ji} = -\eta\,\frac{\partial E_{total}}{\partial w_{ji}} = \eta \sum_{n=n_0+1}^{n_1} \delta_j(n)\,x_i(n-1)$

     Where $\eta$ is the learning rate parameter and $x_i(n-1)$ is the input applied to the i-th synapse of neuron j at time n-1.

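A minimal numpy sketch of one epoch of epochwise BPTT for a small fully recurrent network, following steps 1-4 above; the tanh dynamics, the convention W[k, j] = weight from neuron j to neuron k, and the assumption that all neurons are visible (the set A) are choices made for illustration.

```python
import numpy as np

def epochwise_bptt(W, W_in, u_seq, d_seq, eta=0.01):
    """One epoch of epochwise BPTT on a fully recurrent tanh network:
        v(n) = W y(n-1) + W_in u(n),   y(n) = tanh(v(n)),   n = 1..n1,   y(0) = 0."""
    n1, K = len(u_seq), W.shape[0]
    y, v, e = np.zeros((n1 + 1, K)), np.zeros((n1 + 1, K)), np.zeros((n1 + 1, K))

    # Step 2: single forward pass over the epoch, recording states and errors
    for n in range(1, n1 + 1):
        v[n] = W @ y[n - 1] + W_in @ u_seq[n - 1]
        y[n] = np.tanh(v[n])
        e[n] = d_seq[n - 1] - y[n]

    # Step 3: single backward pass computing the local gradients; each neuron
    # collects the future gradients flowing back through its outgoing weights
    phi_prime = lambda vv: 1.0 - np.tanh(vv) ** 2
    delta = np.zeros((n1 + 2, K))
    for n in range(n1, 0, -1):
        delta[n] = phi_prime(v[n]) * (e[n] + W.T @ delta[n + 1])

    # Step 4: accumulate the weight adjustments over the whole epoch:
    #   dw_ji = eta * sum_{n=n0+1}^{n1} delta_j(n) * x_i(n-1)
    dW = eta * sum(np.outer(delta[n], y[n - 1]) for n in range(1, n1 + 1))
    dW_in = eta * sum(np.outer(delta[n], u_seq[n - 1]) for n in range(1, n1 + 1))
    return W + dW, W_in + dW_in

# Toy epoch: K = 2 neurons, one external input, a target sequence of length 5
rng = np.random.default_rng(6)
W, W_in = 0.1 * rng.normal(size=(2, 2)), 0.1 * rng.normal(size=(2, 1))
u_seq, d_seq = rng.normal(size=(5, 1)), rng.normal(size=(5, 2))
W, W_in = epochwise_bptt(W, W_in, u_seq, d_seq)
```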

Recurrent II-20

- There is a potential problem with the method, which is called the Vanishing Gradients Problem, i.e. the corrections calculated for the weights are not large enough when using methods based on steepest descent.

- However, this is currently a research problem and one has to consult the literature for details.


Conclusions

- Dynamic networks learn sequences, in contrast to the static mappings of MLP and RBF networks.

- Time representation takes place explicitly or implicitly.

- The implicit form includes time-delayed versions of the input vector, with a static network model used afterwards, or the use of recurrent networks.

- The explicit form uses a generalisation of the MLP model where a synapse is now modelled as a weight vector and not as a single number. The synapse activation is no longer the product of the synapse's weight with the output of a previous neuron, but rather the inner product of the synaptic weight vector with the vector of time-delayed outputs of the previous neuron.
Conclusions I

- The extended MLP networks with explicit temporal structure are trained with the Temporal BackPropagation algorithm.

- The recurrent networks include a number of simple and complex architectures. In the simpler cases we train the networks using the standard BackPropagation algorithm.

- In the more complex cases we first unfold the network in time and then train it using the BackPropagation Through Time algorithm.