
Learning and Memory in Neural Networks

Guy Billings, Neuroinformatics Doctoral Training Centre, The School of Informatics, The University of
Edinburgh, UK.
Neural networks consist of computational units (neurons) that are linked by a directed graph with some degree
of connectivity (the network). The connections comprising the edges of the graph are termed weights. As the name
suggests, the magnitude of a weight determines the magnitude of the effect that the connecting neuron can have
upon its target partner. In caricature, neural networks use the many parallel operations of simple units to perform
computations with uncertain data, rather than serial operations with logical blocks to perform computations with
exact data. Neural networks are useful computational devices for learning data classifications, for autoassociative
(content addressable) memories and for associative (classical conditioning) memories. In this brief, neural networks
performing each of these tasks are introduced, respectively: the multilayer perceptron, the Hopfield network and
the associative network.
Two-class data classification: The Perceptron
The perceptron (Rosenblatt, 1958) can be used to learn a distinction between two clusters within some data set,
Fig. 1A. The aim of the perceptron is to classify data into two classes $C_1$ and $C_2$ by labelling each data point $\mathbf{x}$
with its output $f(a) \in \{-1, 1\}$ such that $f(a) = 1$ for class $C_1$ and $f(a) = -1$ for class $C_2$. As input the perceptron
is passed a feature vector $\phi(x_i) \in \{-1, 1\}$, $i \in \{1, \ldots, N\}$, constructed from the $N$ dimensional data point to be
classified by means of some fixed nonlinear transformation (for example brightness thresholding of the pixels of an
image).
In order to perform a classification, the activation is first calculated

$$a(\mathbf{x}) = \sum_i w_i \phi(x_i) + \theta \quad (1)$$

where $\theta$ is the bias. The activation is then passed through the step function

$$f(a) = \begin{cases} 1 & a \ge 0 \\ -1 & a < 0. \end{cases} \quad (2)$$
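As a concrete illustration, Eqs. (1) and (2) can be sketched in a few lines of Python. The weights, bias, and feature values below are arbitrary illustrative choices, not taken from the text.

```python
# Sketch of Eqs. (1) and (2): perceptron activation and step output.

def activation(w, phi_x, theta):
    """a(x) = sum_i w_i * phi(x_i) + theta  (Eq. 1)."""
    return sum(wi * pi for wi, pi in zip(w, phi_x)) + theta

def step(a):
    """f(a) = 1 if a >= 0, else -1  (Eq. 2)."""
    return 1 if a >= 0 else -1

w = [0.5, -0.3, 0.8]   # illustrative weight vector
theta = -0.1           # illustrative bias
phi_x = [1, -1, 1]     # feature vector with components in {-1, 1}

print(step(activation(w, phi_x, theta)))  # activation is 1.5, so prints 1
```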
What classifications of the data can the perceptron perform given some feature vector $\phi(\mathbf{x})$? The perceptron
can only separate feature vectors that are linearly separable. The decision between attributing a given input vector
to class $C_1$ or to class $C_2$ occurs when $a(\mathbf{x}) = 0$. This criterion is satisfied by an $N-1$ dimensional hyperplane
within feature space. This surface is the decision boundary of the classification. Data that are separable with the
perceptron are described as linearly separable because the decision boundary for a single linear thresholder is linear
(a hyperplane). The bias determines the displacement of the decision boundary from the origin.
How does the decision boundary relate to the correct weight vector for a given classification? Consider two
locations on the decision boundary, $\phi(\mathbf{x}_1)$ and $\phi(\mathbf{x}_2)$. Since $a(\mathbf{x}_1) = a(\mathbf{x}_2) = 0$ it is the case that $\mathbf{w}^T(\phi(\mathbf{x}_1) - \phi(\mathbf{x}_2)) = 0$,
which can only be satisfied if the displacement from $\phi(\mathbf{x}_1)$ to $\phi(\mathbf{x}_2)$ is orthogonal to the weight vector (Bishop, 2006).
Thus, the components of the weight vector associated with some decision boundary are identical to the components of
the normal to the decision boundary in input space.
The aim of learning with the perceptron is to find the weight vector corresponding to a decision boundary
that renders the input data disjoint. One method for achieving this is the perceptron learning algorithm, an
explanation of which is beyond the scope of this brief but see Rojas or Bishop (Rojas, 1996; Bishop, 2006). The
perceptron convergence theorem states that the perceptron learning algorithm is guaranteed to find a weight vector
corresponding to a decision boundary in a finite number of steps, as long as the feature vectors are linearly separable
(Minsky and Papert, 1969; Bishop, 2006). When the data are linearly separable there may be more than one valid
classification, in which case the one achieved will depend upon the initial conditions. If the data are not linearly
separable, then the perceptron learning algorithm does not converge.
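Although the full treatment is left to the references, the classic perceptron learning algorithm is short enough to sketch: each misclassified example pulls the weight vector and bias toward its correct side of the boundary. The function names and the toy linearly separable data set below are this sketch's own illustrative choices.

```python
def train_perceptron(data, eta=1.0, max_epochs=100):
    """data: list of (phi_x, t) pairs with target t in {-1, 1}.
    Returns the learned weight vector and bias (w, theta)."""
    n = len(data[0][0])
    w, theta = [0.0] * n, 0.0
    for _ in range(max_epochs):
        errors = 0
        for phi_x, t in data:
            a = sum(wi * pi for wi, pi in zip(w, phi_x)) + theta
            y = 1 if a >= 0 else -1
            if y != t:                    # misclassified: nudge the boundary
                for i in range(n):        # toward this example's correct side
                    w[i] += eta * t * phi_x[i]
                theta += eta * t
                errors += 1
        if errors == 0:                   # converged: all examples correct
            break
    return w, theta

# Linearly separable toy data: the class is the sign of the first component.
data = [([1, 1], 1), ([1, -1], 1), ([-1, 1], -1), ([-1, -1], -1)]
w, theta = train_perceptron(data)
```

Because the toy data are linearly separable, the convergence theorem guarantees the loop exits with zero errors after finitely many updates.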
For any given data set there may be many possible decision boundaries. Assume that the learning algorithm
has converged upon some successful decision boundary. The information transmitted about the data by
the perceptron in this case is a disambiguation between the chosen labelling and the other possible labellings. By
counting all possible classifications of a set of random points in feature space it can be shown that the limiting
capacity of the perceptron is 2 bits per weight (MacKay, 2003).
Figure 1: Diagrams contrasting the architectures of the neural networks discussed in this brief. A: The perceptron
consists of a single unit (open circle) that takes a threshold of the scalar product of its feedforward weights (lines)
with its inputs (filled circles). The perceptron can classify two-class linearly separable data. B: The multilayer
perceptron (MLP) consists of one hidden layer of units (grey circles) between the input layer (filled circles) and the
output layer (open circles). Each unit in the hidden layer computes a continuous threshold function of the scalar
product of its feedforward weights (lines) with the values of the input layer. Each unit in the output layer takes the
scalar product of its feedforward weights (lines) with the inputs received from the hidden layer. The MLP can fit
nonlinear curves to classifications of data sets where the number of classes is determined by the number of outputs.
C: A Hopfield network is a feedback network. Thus the synaptic weights are directed (arrows) and symmetric. The
Hopfield net functions as a content addressable memory by completing a disrupted pattern. Each unit operates as
both an input and an output unit. D: The associative network learns an association between two binary patterns
on the inputs (filled circles) and the outputs (open circles). At each crossing of an input line and an output line,
there is a binary synaptic weight. The weights in this diagram show the result of storing the association between
the input pattern and the output pattern using the learning rule in the text. Each filled square represents a weight
that has been set to one. All other weights on the grid are zero.
Multiclass data classification with a neural network: The Multilayer Perceptron
As we have seen, the classifications that can be performed by a single perceptron are limited to those that are
linearly separable. This is an enormous restriction, since many interesting patterns in data give rise to feature
vectors that are not linearly separable. This point was demonstrated by Minsky and Papert (Minsky and Papert,
1969). However, it turns out that this restriction does not apply in the case of the multilayer perceptron (MLP), and so
these neural networks find far greater utility in learning data classifications. The most common implementation of
the multilayer perceptron makes use of three layers: an input layer, a hidden layer and an output layer, Fig. 1B.
Multilayer perceptrons are fully feedforward networks, meaning that the graph describing their structure is acyclic.
The input layer of the network provides the feature space into which the data must be mapped. Each input
comprising the $N$ dimensional feature vector $\phi(\mathbf{x})$ is connected to every perceptron in the hidden layer. There are
$H$ perceptrons in the hidden layer. In turn, each of the perceptrons in the hidden layer passes its output to every
perceptron in the output layer. There are $K$ perceptrons in the output layer. The classification of the input feature
vector is read out from the output layer.
The activations of the units in the hidden layer, layer (1), are

$$a^{(1)}_j = \sum_l w^{(1)}_{jl} \phi(x_l) + \theta^{(1)}_j \quad (3)$$

where $j \in \{1, \ldots, H\}$, $l \in \{1, \ldots, N\}$ and the $\theta^{(1)}_j$ are the biases of the hidden layer perceptrons. The activations
in the output layer, layer (2), are

$$a^{(2)}_i = \sum_j w^{(2)}_{ij} h_j + \theta^{(2)}_i \quad (4)$$

where $i \in \{1, \ldots, K\}$ and the $\theta^{(2)}_i$ are the biases of the output layer perceptrons. The outputs for units in each
layer respectively are

$$h_j = f^{(1)}(a^{(1)}_j), \qquad y_i = f^{(2)}(a^{(2)}_i) \quad (5)$$

where the activation function $f^{(2)}$ is the identity function $f^{(2)}(a) = a$. The form of the activation function $f^{(1)}$
can differ from implementation to implementation. For learning in the MLP, however, it is important that this
function be a continuous nonlinearity. Consequently the logistic sigmoid is chosen,

$$f^{(1)}(a) = \frac{1}{1 + \exp(-a)}. \quad (6)$$
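Eqs. (3) to (6) amount to the following forward pass, sketched here in Python with randomly initialised weights. The sizes N, H, K and the input values are arbitrary illustrative choices, not taken from the text.

```python
import math
import random

def mlp_forward(x, W1, b1, W2, b2):
    """Forward pass through a three-layer MLP (Eqs. 3-6).
    W1: H x N hidden weights, b1: H hidden biases,
    W2: K x H output weights, b2: K output biases."""
    # Hidden layer: logistic sigmoid of the activations (Eqs. 3 and 6).
    h = [1.0 / (1.0 + math.exp(-(sum(wl * xl for wl, xl in zip(row, x)) + bj)))
         for row, bj in zip(W1, b1)]
    # Output layer: identity activation function (Eqs. 4 and 5).
    return [sum(wj * hj for wj, hj in zip(row, h)) + bi
            for row, bi in zip(W2, b2)]

random.seed(0)
N, H, K = 3, 4, 2                       # inputs, hidden units, outputs
W1 = [[random.uniform(-1, 1) for _ in range(N)] for _ in range(H)]
b1 = [0.0] * H
W2 = [[random.uniform(-1, 1) for _ in range(H)] for _ in range(K)]
b2 = [0.0] * K

y = mlp_forward([1.0, -1.0, 0.5], W1, b1, W2, b2)  # K real-valued outputs
```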
Training in the MLP adjusts the weights to minimize a sum squared error function. In this manner the network
fits a nonlinear function to the data (the target classification). Again, a full discussion of the learning algorithm is
beyond the scope of this brief, but see Bishop and MacKay (MacKay, 2003; Bishop, 2006). One important algorithm
is the backpropagation algorithm (Rumelhart, Hinton, and Williams, 1986).
The multilayer perceptron is a nonlinear curve fitting device. The number of hidden units $H$ that should be
chosen in order to perform that fit is not constrained by the data but is a free parameter. Since increasing the
number of hidden units increases the complexity of the model, we might expect that the complexity of the function
that is fit to the data should also increase. Indeed this is the case, but it turns out that the complexity of the
curve becomes independent of the number of hidden units as $H \to \infty$ (Neal, 1996; MacKay, 2003), and is instead determined
by the magnitude of the weights themselves. It may seem advantageous to choose the most complex fit that is
possible with available computational resources, but this is not the case. Since the input data contain noise that
is unreproducible and peculiar to the example under consideration, it is possible for the neural network to overfit
the data. Overfitting results in a match that is too close to one particular example and leads to a decrease in
the performance of recognising examples that should belong to the same class but that have random variations
(generalisation). The approach taken to prevent overfitting is to add a regularizer that penalises excessively large
weights. One is then left with the problem of how to choose the regularizer so as to optimise the trade off between
specificity and generalisation performance (MacKay, 2003; Bishop, 2006).
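The objective just described, a sum squared error plus a penalty on large weights, can be written down directly. This is a minimal sketch; the quadratic (weight-decay) form of the penalty and the coefficient alpha are common illustrative choices, not details given in the text.

```python
def regularized_error(targets, outputs, weights, alpha=0.01):
    """Sum squared error E = 1/2 * sum_k (y_k - t_k)^2 plus a
    weight-decay regularizer alpha/2 * sum_i w_i^2 that penalises
    excessively large weights."""
    e_data = 0.5 * sum((y - t) ** 2 for y, t in zip(outputs, targets))
    e_reg = 0.5 * alpha * sum(w ** 2 for w in weights)
    return e_data + e_reg
```

Minimising this combined quantity trades fit against weight magnitude; choosing alpha is the specificity/generalisation trade off mentioned above.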
Auto-associative memory: Hopfield networks
Hopfield networks can be used to create content addressable memories that store data in such a way that it can be
retrieved by supplying a partial version of the original pattern. The graph of the connections in a Hopfield network
is cyclic and thus the network is a feedback rather than a feedforward network, Fig. 1C.
Hopfield networks are constructed from $N$ neurons where each neuron $i \in \{1, \ldots, N\}$ is connected to every other
neuron $j \in \{1, \ldots, N\}$ by a symmetric connection $w_{ij}$ such that $w_{ij} = w_{ji}$. There are no self connections, $w_{ii} = 0$.
Each neuron can have a bias $w_{i0}$ that can be considered as resulting from the feedforward input from a zeroth layer
of neurons with constant activity. The activation of each neuron is

$$a_i = \sum_j w_{ij} x_j. \quad (7)$$
For the binary Hopfield network the output is a threshold function of the activation as in Eq. (2), where
$x_i = f(a_i) \in \{-1, 1\}$. Alternatively, for a continuous Hopfield network where outputs vary between $-1$ and $1$, the
output function is $x_i = f(a_i) = \tanh(a_i)$.
In feedback networks the output of each neuron is also an input to other neurons. Due to this, updates
to the outputs of the neurons can be either synchronous or asynchronous. In the synchronous case all neurons
calculate their activations according to Eq. (7) and then update their outputs only after the calculation of all other
activations is complete. In the asynchronous case each neuron in turn first calculates its activation and then updates its
output, before the activations of the other neurons are calculated.
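The distinction between the two update schemes can be sketched as follows. This is a minimal illustrative implementation, with the threshold convention of Eq. (2); the function names are this sketch's own.

```python
def sign(a):
    """Threshold output as in Eq. (2)."""
    return 1 if a >= 0 else -1

def synchronous_step(W, x):
    """All neurons compute their activations (Eq. 7) from the CURRENT
    state, then every output is replaced at once."""
    return [sign(sum(wij * xj for wij, xj in zip(row, x))) for row in W]

def asynchronous_step(W, x):
    """Neurons update one at a time; later neurons in the sweep see the
    already-updated outputs of earlier neurons."""
    x = list(x)  # copy so the caller's state is not mutated
    for i, row in enumerate(W):
        x[i] = sign(sum(wij * xj for wij, xj in zip(row, x)))
    return x
```

For a state that is already a fixed point, both schemes leave the state unchanged; away from a fixed point their trajectories can differ.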
For some fixed set of weights, the Hopfield network stores data as features in the $N$ dimensional phase space
defined by the $x_i$ activity variables. Each memory is stored as a fixed point of the activity of the network. The
partial pattern to be recalled is presented as the initial condition of the activity in the network. After some time
the network converges upon the fixed point corresponding to the basin of attraction of that initial condition. The
aim is that this fixed point should be the complete pattern and that the initial condition is the partial pattern to
be completed. The values of the weights in the Hopfield network determine the locations of the fixed points in the
activity space. For an asynchronous Hopfield net to store a set of $M$ patterns $\{\mathbf{x}^{(m)}\}$, $m \in \{1, \ldots, M\}$, the weights
are set with one-shot learning according to

$$w_{ij} = \eta \sum_m x^{(m)}_i x^{(m)}_j \quad (8)$$

where $\eta$ is a constant. How do we know that the simple operation in Eq. (8) ensures that the Hopfield
net recalls the given patterns? Asynchronous Hopfield networks have Lyapunov functions of their dynamics. A
Lyapunov function is a function that always decreases under the evolution of the system, the presence of which
ensures that the system settles upon a fixed point. Proof of this is beyond the scope of this brief, but see MacKay
(MacKay, 2003).
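One-shot storage via Eq. (8) and asynchronous recall can be sketched together. The choice eta = 1, the five-unit pattern, and the function names are illustrative, not from the references.

```python
def store(patterns, eta=1.0):
    """One-shot Hebbian learning (Eq. 8): w_ij = eta * sum_m x_i^m x_j^m,
    with no self connections (w_ii = 0)."""
    n = len(patterns[0])
    W = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    W[i][j] += eta * p[i] * p[j]
    return W

def recall(W, x, max_sweeps=10):
    """Asynchronous threshold updates until the state stops changing."""
    for _ in range(max_sweeps):
        prev = list(x)
        for i, row in enumerate(W):
            a = sum(wij * xj for wij, xj in zip(row, x))
            x[i] = 1 if a >= 0 else -1
        if x == prev:          # reached a fixed point
            break
    return x

pattern = [1, 1, -1, -1, 1]
W = store([pattern])
corrupted = [1, 1, -1, -1, -1]   # last bit flipped
```

Starting from the corrupted state, `recall(W, corrupted)` settles on the stored pattern, correcting the flipped bit: a content addressable lookup.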
What about synchronous Hopfield nets? In the synchronous case we can guarantee that the activity of the
Hopfield net settles upon a fixed point as long as time is continuous. In this case the activations $a_i(t)$ and
the activities of the neurons $x_i(t)$ are defined as continuous functions of time,

$$a_i(t) = \sum_j w_{ij} x_j(t) \quad (9)$$

and the response of each neuron to its activation is filtered with some time constant $\tau$,

$$\frac{dx_i(t)}{dt} = -\frac{1}{\tau}\left(x_i(t) - f(a_i)\right) \quad (10)$$

where the output function is the hyperbolic tangent $x_i = f(a_i) = \tanh(a_i)$. The synchronous continuous time
Hopfield net has a Lyapunov function and is thus guaranteed to settle to a fixed point if the weights are set with
Eq. (8).
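Under these definitions the continuous-time dynamics can be integrated numerically, for example with a simple Euler scheme. The two-neuron weight matrix, step size, and duration below are illustrative choices for this sketch only.

```python
import math

def simulate(W, x0, tau=1.0, dt=0.01, steps=2000):
    """Euler integration of Eq. (10), dx_i/dt = -(1/tau)*(x_i - tanh(a_i)),
    with the activations a_i given by Eq. (9)."""
    x = list(x0)
    for _ in range(steps):
        a = [sum(wij * xj for wij, xj in zip(row, x)) for row in W]
        x = [xi + dt * (-(1.0 / tau) * (xi - math.tanh(ai)))
             for xi, ai in zip(x, a)]
    return x

# Two mutually excitatory neurons; from a small symmetric start the
# state relaxes to the fixed point x = tanh(2x), roughly 0.96 each.
W = [[0.0, 2.0], [2.0, 0.0]]
x = simulate(W, [0.1, 0.1])
```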
As mentioned, due to the presence of the Lyapunov function, the Hopfield network is guaranteed to settle into
some stable state (output pattern) when supplied with an initial activity state (input pattern). However, as the
number of patterns stored in the network is increased, there comes a point at which the output patterns are garbled
and are not valid completions of the input patterns. This failure of performance of the Hopfield net is graceful rather
than catastrophic. Typically, when the limit of overload is approached, some memories survive with a minority of
bits flipped. The limiting number of stored patterns at which this transition from memory storage to spurious
spin glass states occurs is $M = 0.138N$ (Amit, Gutfreund, and Sompolinski, 1985).
Associative networks
One distinctive feature of biological learning is the ability for associations to be made between stimuli (for example,
in the famous case of Pavlovian, or classical, conditioning where a dog learns to associate the ring of a bell with being
fed). Associative networks are feedforward neural networks that learn an association between two input patterns.
When one pattern is presented to the inputs of the network, the outputs of the associative network present a pattern
that has been associated with the input pattern. Adjustment of the weights allows this pairing between an arbitrary
input and an arbitrary output pattern.
The associative network can be envisaged as a grid of horizontal input lines and vertical output lines. At the
intersections between these lines - the points on the grid - are the weights. One edge of the input lines terminates
in an array of points that are the $N_I$ inputs. One edge of the output lines terminates in an array of points that are
the $N_O$ outputs, Fig. 1D. The inputs $x_i \in \{0, 1\}$, $i \in \{1, \ldots, N_I\}$ are binary, and the outputs are binary, $y_j \in \{0, 1\}$,
$j \in \{1, \ldots, N_O\}$. The weights at each grid point $w_{ij} \in \{0, 1\}$ are also binary. When one pattern $\{\mathbf{x}^{(k)}\}$,
$k \in \{1, \ldots, K\}$ is presented, each output is set to one, $y_j = 1$, if its dendritic sum

$$d^{(k)}_j = \sum_i w_{ij} x^{(k)}_i \quad (11)$$

is equal to the input activity

$$a^{(k)} = \sum_i x^{(k)}_i \quad (12)$$

but is set to zero, $y_j = 0$, otherwise (Willshaw, Buneman, and Longuet-Higgins, 1969; Buckingham and Willshaw,
1992).
Associations are stored in the network by applying the input pattern to be associated to the inputs $\mathbf{x}$ while the
output pattern to be associated is applied to the outputs $\mathbf{y}$. The following rule is then applied to each weight $w_{ij}$
associated with input line $i$ and output line $j$:

$$w_{ij} = \begin{cases} 1 & \text{if } x_i = 1 \text{ and } y_j = 1 \\ 0 & \text{otherwise.} \end{cases} \quad (13)$$
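The storage rule of Eq. (13) and the recall rule of Eqs. (11) and (12) can be sketched together. The grid sizes and the pair of binary patterns below are illustrative choices, and all names are this sketch's own.

```python
def store_association(W, x, y):
    """Eq. (13): set w_ij = 1 wherever input line i and output line j
    are both active in the pair being stored."""
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            if xi == 1 and yj == 1:
                W[i][j] = 1

def recall_output(W, x):
    """Eqs. (11)-(12): output j fires iff its dendritic sum equals the
    number of active inputs."""
    a = sum(x)                                   # input activity, Eq. (12)
    n_out = len(W[0])
    d = [sum(W[i][j] * x[i] for i in range(len(x)))   # dendritic sums,
         for j in range(n_out)]                       # Eq. (11)
    return [1 if dj == a else 0 for dj in d]

NI, NO = 6, 4
W = [[0] * NO for _ in range(NI)]            # all weights start at zero
x = [1, 0, 1, 0, 0, 1]                       # input pattern
y = [0, 1, 1, 0]                             # output pattern to associate
store_association(W, x, y)
```

Presenting the stored input pattern now reproduces the associated output: `recall_output(W, x)` yields `y`, as in the grid of Fig. 1D.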
There is a rich literature dealing with the optimisation of the storage capacity of associative networks under
differing conditions (Dayan and Willshaw, 1991). An important factor in determining the performance of
associative networks is the sparsity of the patterns: sparse patterns have a small fraction of their units
active. Consider $N_I$ inputs where for each pattern to be stored $M_I$ inputs are active on average, stored
in association with $N_O$ outputs where $M_O$ outputs are active on average. For the simple network described here,
the limit to the number of associations that can be stored before the expected number of output units that fire
spuriously approaches one is

$$R \approx -\frac{N_I N_O}{M_I M_O} \ln\!\left(1 - \frac{1}{N_O}\right)$$

(Buckingham and Willshaw, 1992).
References
Amit, D.J., H. Gutfreund, and H. Sompolinski (1985). Storing infinite numbers of patterns in a spin glass model
of neural networks. Phys. Rev. Lett. 55: 1530-1533.
Bishop, C. (2006). Pattern Recognition and Machine Learning. Springer-Verlag, New York.
Buckingham, J. and D. Willshaw (1992). Performance characteristics of the associative net. Network 3: 407-414.
Dayan, P. and D.J. Willshaw (1991). Optimising synaptic learning rules in linear associative memories. Biological
Cybernetics 65: 253-265.
MacKay, D.J.C. (2003). Information Theory, Inference and Learning Algorithms. Cambridge University Press,
Cambridge, UK.
Minsky, M. and S. Papert (1969). Perceptrons. MIT Press, Cambridge, Mass.
Neal, R. (1996). Bayesian Learning for Neural Networks. Springer, Berlin.
Rojas, R. (1996). Neural Networks. Springer, Berlin.
Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain.
Psychological Review 65: 386-408.
Rumelhart, D.E., G.E. Hinton, and R.J. Williams (1986). Learning internal representations by error propagation,
pp. 318-362. MIT Press, Cambridge, Mass.
Willshaw, D.J., O.P. Buneman, and H.C. Longuet-Higgins (1969). Non-holographic associative memory.
Nature 222: 960-969.