The Process of Learning

Learning Tasks

The learning algorithm for a neural network depends on the learning task to be performed by the network. Such learning tasks include:

- Pattern association
- Pattern recognition
- Function approximation
- Filtering
- Beamforming
- Identification and control

Learning Methods

Supervised Learning

This is learning with a teacher.

- The environment is unknown to the neural network.
- Conceptually, the teacher has knowledge of the environment.
- As a result we have a set of input-output examples.
- These input-output examples provide the samples for training.
- Suppose we have a set of input signals (input vectors) from the environment and the teacher is capable of supplying the desired response for each; together these form a training set.
- So we have an input matrix P and an output matrix T as the training set.
- For a particular input, the network will give an output that differs from the desired output given by the teacher.
- So there is an error between the actual and the desired response.
- This error is then used to correct the free parameters of the network.
- This correction continues until the actual output matches the desired output (within a tolerance limit).
- Thus, using the matrices P and T, we train the neural network. The network is then adapted to the environment.
- Now, if we give an arbitrary input (vector) to the network, the network will supply the required response (vector). A condensed sketch of this whole procedure follows.
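This is a minimal illustration only, written for a single output neuron; the helper names `predict` and `update` stand for whichever learning rule is in use, and `P`, `T`, `tol`, and `max_epochs` are hypothetical names, not notation from this handout.

```python
import numpy as np

def train_supervised(P, T, predict, update, tol=1e-3, max_epochs=100):
    """Generic error-correction training loop for one output neuron.

    P: (m, N) matrix whose N columns are input vectors from the environment.
    T: (N,) desired responses supplied by the teacher.
    predict(w, x) -> actual response y; update(w, x, d, y) -> corrected w.
    """
    w = np.zeros(P.shape[0])            # free parameters of the network
    for _ in range(max_epochs):
        worst = 0.0
        for x, d in zip(P.T, T):
            y = predict(w, x)           # actual response of the network
            w = update(w, x, d, y)      # correct w using the error d - y
            worst = max(worst, abs(d - y))
        if worst <= tol:                # actual matches desired within tolerance
            return w
    return w
```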

Unsupervised Learning

This is learning without a teacher to oversee the learning process.

- This is self-organized learning.
- It attempts to develop a network on the basis of given sample data.
- One popular, efficient, and somewhat obvious approach is clustering.
- Clustering is nothing but mode separation or class separation.
- The objective is to design a mechanism that clusters the given sample data. This can be achieved by computing similarity; a small sketch is given after this list.
- On many occasions the data fall into easily observable groups, and the task is simple; on other occasions this is not the case.
- To perform unsupervised learning we may use a competitive learning rule. For example, consider a neural network consisting of an input layer and a competitive layer. Here what is used is a task-independent measure of quality that the network is designed to learn.
- The free parameters of the network are adapted and finally optimized on the basis of the task-independent measure mentioned above.
- Fundamentally, unsupervised learning algorithms (or laws) may be characterized by first-order differential equations. These equations describe how the network's free parameters evolve (adjust) over time (or over iterations, in the discrete case).
- Here some sort of pattern associability (similarity) is used to guide the learning process. Such an operation leads to network correlation, clustering, or competitive behavior.
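As one concrete illustration of clustering by similarity, here is a bare-bones sketch in the spirit of k-means. It is not one of the rules from this handout; the number of clusters `k`, the iteration count, and the data shape are invented for illustration.

```python
import numpy as np

def cluster_by_similarity(X, k, iters=20, seed=0):
    """Group the rows of X into k clusters by similarity to k centroids."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # similarity measured as negative Euclidean distance to each centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        for j in range(k):              # move each centroid to its cluster mean
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids
```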

Some Supervised / Unsupervised Learning Rules

1. Perceptron learning rule
2. Widrow-Hoff learning rule
3. Delta learning rule
4. Hebbian learning
5. Competitive learning

1. Rosenblatt's perceptron learning rule

The learning signal is the difference between the desired and the actual response. This learning is supervised. This type of learning can be applied only if the neuron response is binary (0 or 1) or bipolar (1 or -1). The weight adjustment in this method is obtained as
$$\Delta w_{kj}(n) = \eta\,[d_k - \operatorname{sgn}(v_k(n))]\,x_j$$

$$\varphi(v_k(n)) = \operatorname{sgn}(v_k(n)) = \begin{cases} +1 & \text{if } \mathbf{w}^T\mathbf{x} \ge 0 \\ -1 & \text{if } \mathbf{w}^T\mathbf{x} < 0 \end{cases}$$

$$w_{kj}(n+1) = w_{kj}(n) + \Delta w_{kj}(n)$$

[Figure: error-correction learning for a single neuron k, with inputs x_1, x_2, ..., x_m, synaptic weights w_kj, net activity v_k, activation phi(v_k), and output y_k, which is compared with the desired response d_k to form the error e_k that drives the weight correction.]

where $n = 1, 2, \ldots$ is the iteration number, $x_j$, $j = 1, 2, \ldots, m$, are the inputs, $\eta$ is the learning-rate parameter, $v_k(n)$ is the net activity of neuron $k$, $\varphi(v_k(n)) = y_k(n)$ is the output of neuron $k$, $d_k$ is the desired response, $e_k(n)$ is the error between the output and the desired response of neuron $k$, and $\Delta w_{kj}(n)$ is the correction applied to the synaptic weight between neuron $k$ and input node $j = 1, 2, \ldots, m$. No weight correction is applied in the cases where the actual response and the desired response are equal.

Example
Consider a single perceptron with the set of input training vectors (samples) and initial weight vector

$$\mathbf{x}_1 = \begin{bmatrix} 1 \\ -2 \\ 0 \\ -1 \end{bmatrix},\quad \mathbf{x}_2 = \begin{bmatrix} 0 \\ 1.5 \\ -0.5 \\ -1 \end{bmatrix},\quad \mathbf{x}_3 = \begin{bmatrix} -1 \\ 1 \\ 0.5 \\ -1 \end{bmatrix};\qquad \mathbf{w}(1) = \begin{bmatrix} 1 \\ -1 \\ 0 \\ 0.5 \end{bmatrix}$$

Let the learning-rate parameter $\eta = 0.1$. The teacher's desired responses for $\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3$ are $d_1 = -1$, $d_2 = -1$, and $d_3 = 1$, respectively. Learning according to the perceptron learning rule progresses as follows:
Step 1: The input is $\mathbf{x}_1$ and the desired response is $d_1$.

$$v_k(1) = \mathbf{w}(1)^T\mathbf{x}_1 = \begin{bmatrix} 1 & -1 & 0 & 0.5 \end{bmatrix}\begin{bmatrix} 1 \\ -2 \\ 0 \\ -1 \end{bmatrix} = 2.5, \qquad \operatorname{sgn}(v_k(1)) = 1 \neq d_1$$

$$\mathbf{w}(2) = \mathbf{w}(1) + 0.1\,(-1 - 1)\,\mathbf{x}_1 = \begin{bmatrix} 1 \\ -1 \\ 0 \\ 0.5 \end{bmatrix} - 0.2\begin{bmatrix} 1 \\ -2 \\ 0 \\ -1 \end{bmatrix} = \begin{bmatrix} 0.8 \\ -0.6 \\ 0 \\ 0.7 \end{bmatrix}$$

Step 2: The input is $\mathbf{x}_2$ and the desired response is $d_2$.

$$v_k(2) = \mathbf{w}(2)^T\mathbf{x}_2 = \begin{bmatrix} 0.8 & -0.6 & 0 & 0.7 \end{bmatrix}\begin{bmatrix} 0 \\ 1.5 \\ -0.5 \\ -1 \end{bmatrix} = -1.6$$

No correction is performed in this step because $d_2 = \operatorname{sgn}(v_k(2)) = -1$; hence $\mathbf{w}(3) = \mathbf{w}(2)$.

Step 3: The input is $\mathbf{x}_3$ and the desired response is $d_3$.

$$v_k(3) = \mathbf{w}(3)^T\mathbf{x}_3 = \begin{bmatrix} 0.8 & -0.6 & 0 & 0.7 \end{bmatrix}\begin{bmatrix} -1 \\ 1 \\ 0.5 \\ -1 \end{bmatrix} = -2.1, \qquad \operatorname{sgn}(v_k(3)) = -1 \neq d_3$$

$$\mathbf{w}(4) = \mathbf{w}(3) + 0.1\,(1 - (-1))\,\mathbf{x}_3 = \begin{bmatrix} 0.8 \\ -0.6 \\ 0 \\ 0.7 \end{bmatrix} + 0.2\begin{bmatrix} -1 \\ 1 \\ 0.5 \\ -1 \end{bmatrix} = \begin{bmatrix} 0.6 \\ -0.4 \\ 0.1 \\ 0.5 \end{bmatrix}$$

This completes one epoch of training. The training examples are now presented to the network again. As an exercise you may do this and comment on the result obtained. A short sketch of the rule follows.
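For concreteness, the three steps above (including the second epoch left as an exercise) can be reproduced with a minimal NumPy sketch of the perceptron rule; the two-epoch loop count is arbitrary.

```python
import numpy as np

eta = 0.1
X = [np.array([ 1.0, -2.0,  0.0, -1.0]),
     np.array([ 0.0,  1.5, -0.5, -1.0]),
     np.array([-1.0,  1.0,  0.5, -1.0])]
d = [-1.0, -1.0, 1.0]
w = np.array([1.0, -1.0, 0.0, 0.5])

sgn = lambda v: 1.0 if v >= 0 else -1.0     # bipolar hard limiter

for epoch in (1, 2):
    for x_i, d_i in zip(X, d):
        v = w @ x_i                         # net activity v_k
        w = w + eta * (d_i - sgn(v)) * x_i  # perceptron weight correction
    print(epoch, w)                         # w after each epoch
```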

2. Widrow-Hoff learning rule

Here the neurons are assumed to have linear activation functions, characterized by

$$y_k(n) = \varphi(v_k(n)) = v_k(n)$$

The correction in the weights at each time step n is obtained as

$$\Delta w_{kj}(n) = \eta\,e_k(n)\,x_j(n), \qquad w_{kj}(n+1) = w_{kj}(n) + \Delta w_{kj}(n)$$

Remarks:

- This is learning with a teacher.
- The output of the neuron k should be directly available so that the desired response can be supplied.
- The correction applied to a synaptic weight is proportional to the product of the error signal and the input signal.
- $w_{kj}(n)$ and $w_{kj}(n+1)$ may be viewed as the past and present values of the synaptic weight $w_{kj}$. In computational terms we may write

$$w_{kj}(n) = z^{-1}[\,w_{kj}(n+1)\,]$$

where $z^{-1}$ is the unit-delay operator and represents a storage element. We see that error-correction learning is a closed-loop control system. A minimal sketch of this rule is given below.
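This sketch assumes a single linear neuron; the toy regression data, noise level, and learning rate are invented for illustration.

```python
import numpy as np

def lms_step(w, x, d, eta=0.05):
    """One Widrow-Hoff (LMS) correction for a linear neuron y = w^T x."""
    y = w @ x               # linear activation: y_k(n) = v_k(n)
    e = d - y               # error signal e_k(n)
    return w + eta * e * x  # correction proportional to error times input

# illustrative use: recover w_true from noisy linear measurements
rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0, 0.5])
w = np.zeros(3)
for _ in range(2000):
    x = rng.standard_normal(3)
    d = w_true @ x + 0.01 * rng.standard_normal()
    w = lms_step(w, x, d)
print(w)                    # approaches w_true
```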

Method of steepest descent


Consider the cost function $\mathcal{E}(\mathbf{w})$ of some unknown weight vector $\mathbf{w}$. The function $\mathcal{E}(\mathbf{w})$ maps $\mathbf{w}$ into the real numbers, and we assume it is continuously differentiable with respect to $\mathbf{w}$. The problem is to find the optimal weight vector $\mathbf{w}^*$ such that $\mathcal{E}(\mathbf{w}^*) \le \mathcal{E}(\mathbf{w})$ for all $\mathbf{w}$. This is an unconstrained optimization problem, which can be stated as follows:

Minimize the cost function $\mathcal{E}(\mathbf{w})$ with respect to the weight vector $\mathbf{w}$.

In this method the correction in weight is applied in the direction of steepest descent, that is, in a direction opposite to the gradient vector $\nabla\mathcal{E}(\mathbf{w})$, where
$$\nabla = \left[\frac{\partial}{\partial w_1}, \frac{\partial}{\partial w_2}, \ldots, \frac{\partial}{\partial w_m}\right]^T, \qquad \nabla\mathcal{E}(\mathbf{w}) = \left[\frac{\partial\mathcal{E}}{\partial w_1}, \frac{\partial\mathcal{E}}{\partial w_2}, \ldots, \frac{\partial\mathcal{E}}{\partial w_m}\right]^T$$

Now the weight correction is effected as

$$\mathbf{w}(n+1) = \mathbf{w}(n) - \eta\,\nabla\mathcal{E}(\mathbf{w}(n))$$

that is,

$$\mathbf{w}(n+1) = \mathbf{w}(n) + \Delta\mathbf{w}(n), \qquad \Delta\mathbf{w}(n) = -\eta\,\nabla\mathcal{E}(\mathbf{w}(n))$$

Using the first-order Taylor series expansion around $\mathbf{w}(n)$ to approximate $\mathcal{E}(\mathbf{w}(n+1))$:

$$\mathcal{E}(\mathbf{w}(n+1)) \approx \mathcal{E}(\mathbf{w}(n)) + (\nabla\mathcal{E}(\mathbf{w}(n)))^T \Delta\mathbf{w}(n) = \mathcal{E}(\mathbf{w}(n)) - \eta\,(\nabla\mathcal{E}(\mathbf{w}(n)))^T \nabla\mathcal{E}(\mathbf{w}(n)) = \mathcal{E}(\mathbf{w}(n)) - \eta\,\|\nabla\mathcal{E}(\mathbf{w}(n))\|^2$$

Thus we see that $\mathcal{E}(\mathbf{w}(n+1)) < \mathcal{E}(\mathbf{w}(n))$, i.e., the performance index decreases iteration after iteration, and finally it converges to the optimal solution $\mathbf{w}^*$. The convergence behavior depends on the learning-rate parameter $\eta$. The following points are worth noting (see the sketch after this list):
- When $\eta$ is small, the transient response of the algorithm is overdamped and the trajectory traced by $\mathbf{w}(n)$ takes a smooth but slow path in the w-plane.
- When $\eta$ is large, the transient response of the algorithm is underdamped and the trajectory traced by $\mathbf{w}(n)$ takes a fast but oscillatory path in the w-plane.
- When $\eta$ exceeds a critical value, the algorithm becomes unstable.
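These three regimes can be made concrete on the one-dimensional quadratic cost $\mathcal{E}(w) = \tfrac{1}{2}aw^2$, for which $\nabla\mathcal{E} = aw$ and the iteration becomes $w(n+1) = (1 - \eta a)\,w(n)$; the critical value is therefore $\eta = 2/a$. The sketch below, with an arbitrary $a = 1$, is illustrative only.

```python
def steepest_descent(eta, a=1.0, w0=1.0, steps=8):
    """Gradient steps on E(w) = 0.5*a*w**2, whose gradient is a*w."""
    w, path = w0, [w0]
    for _ in range(steps):
        w = w - eta * a * w      # w(n+1) = w(n) - eta * grad E(w(n))
        path.append(round(w, 4))
    return path

print(steepest_descent(0.1))     # small eta: smooth but slow (overdamped)
print(steepest_descent(1.5))     # large eta: fast but oscillatory (underdamped)
print(steepest_descent(2.5))     # eta above 2/a: unstable, diverges
```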

3. The delta learning rule

The delta learning rule is also built around a single neuron and is valid only for continuous activation functions. It can be derived by minimizing a cost function or performance index. Since we are interested in error-correction learning, this performance index can take the form

$$\mathcal{E}(\mathbf{w}) = \tfrac{1}{2}e_k^2 = \tfrac{1}{2}(d_k - y_k)^2 = \tfrac{1}{2}\big(d_k - \varphi(v_k)\big)^2$$

where $\mathbf{w} = [w_{kj}]$. The cost function $\mathcal{E}(\mathbf{w})$ denotes the instantaneous energy, which can be used to make the necessary changes in the synaptic weights. This is obviously error-correction learning. Following the steepest-descent algorithm, the minimization of the error requires the weight changes to be in the direction of the negative gradient, so we take
$$\Delta\mathbf{w} = -\eta\,\nabla\mathcal{E}(\mathbf{w})$$

where the operator

$$\nabla = \left[\frac{\partial}{\partial w_{1k}}, \frac{\partial}{\partial w_{2k}}, \ldots, \frac{\partial}{\partial w_{mk}}\right]^T \quad \text{and} \quad \nabla\mathcal{E}(\mathbf{w}) = \left[\frac{\partial\mathcal{E}}{\partial w_{1k}}, \frac{\partial\mathcal{E}}{\partial w_{2k}}, \ldots, \frac{\partial\mathcal{E}}{\partial w_{mk}}\right]^T$$

for a particular k. Now, the components of the gradient vector are

$$\frac{\partial\mathcal{E}(\mathbf{w})}{\partial w_{kj}} = -\big(d_k - \varphi(v_k)\big)\,\varphi'(v_k)\,x_j$$

Since the minimization of the error requires the changes in weight to be in the negative gradient direction, we have

$$\Delta w_{kj} = \eta\,\big(d_k - \varphi(v_k)\big)\,\varphi'(v_k)\,x_j = \eta\,e_k\,\varphi'(v_k)\,x_j$$

Example:
Consider the set of input training vectors and initial weight vector

$$\mathbf{x}_1 = \begin{bmatrix} 1 \\ -2 \\ 0 \\ -1 \end{bmatrix},\quad \mathbf{x}_2 = \begin{bmatrix} 0 \\ 1.5 \\ -0.5 \\ -1 \end{bmatrix},\quad \mathbf{x}_3 = \begin{bmatrix} -1 \\ 1 \\ 0.5 \\ -1 \end{bmatrix};\qquad \mathbf{w}^1 = \begin{bmatrix} 1 \\ -1 \\ 0 \\ 0.5 \end{bmatrix}$$

Let the learning-rate parameter $\eta = 0.1$. The teacher's desired responses for $\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3$ are $d_1 = -1$, $d_2 = -1$, and $d_3 = 1$, respectively. Let the continuous bipolar activation function be

$$\varphi(v_k) = \frac{1 - e^{-v_k}}{1 + e^{-v_k}}, \qquad \varphi'(v_k) = \frac{2e^{-v_k}}{(1 + e^{-v_k})^2} = \frac{1}{2}\big(1 - \varphi^2(v_k)\big)$$

Such an activation function is continuous and bipolar. Here the slope of the activation function is expressed in terms of the output signal of the neuron. For the given learning-rate parameter, delta-rule training can be summarized as follows:

Step 1: We present the first input sample $\mathbf{x}_1$ and the initial weight vector $\mathbf{w}^1$, yielding

$$v_k^1 = (\mathbf{w}^1)^T\mathbf{x}_1 = 2.5, \qquad y_k^1 = \varphi(v_k^1) = 0.848, \qquad \varphi'(v_k^1) = \tfrac{1}{2}[1 - \varphi^2(v_k^1)] = 0.140$$

$$\mathbf{w}^2 = \mathbf{w}^1 + 0.1\,[d_1 - \varphi(v_k^1)]\,\varphi'(v_k^1)\,\mathbf{x}_1 = \begin{bmatrix} 0.974 \\ -0.948 \\ 0 \\ 0.526 \end{bmatrix}$$

Step 2: We present the second input sample $\mathbf{x}_2$ and the weight vector $\mathbf{w}^2$, yielding

$$v_k^2 = (\mathbf{w}^2)^T\mathbf{x}_2 = -1.948, \qquad y_k^2 = \varphi(v_k^2) = -0.75, \qquad \varphi'(v_k^2) = \tfrac{1}{2}[1 - \varphi^2(v_k^2)] = 0.218$$

$$\mathbf{w}^3 = \mathbf{w}^2 + 0.1\,[d_2 - \varphi(v_k^2)]\,\varphi'(v_k^2)\,\mathbf{x}_2 = \begin{bmatrix} 0.974 \\ -0.956 \\ 0.002 \\ 0.531 \end{bmatrix}$$

Step 3: We present the third input sample $\mathbf{x}_3$ and the weight vector $\mathbf{w}^3$, yielding

$$v_k^3 = (\mathbf{w}^3)^T\mathbf{x}_3 = -2.46, \qquad y_k^3 = \varphi(v_k^3) = -0.842, \qquad \varphi'(v_k^3) = \tfrac{1}{2}[1 - \varphi^2(v_k^3)] = 0.145$$

$$\mathbf{w}^4 = \mathbf{w}^3 + 0.1\,[d_3 - \varphi(v_k^3)]\,\varphi'(v_k^3)\,\mathbf{x}_3 = \begin{bmatrix} 0.947 \\ -0.929 \\ 0.016 \\ 0.505 \end{bmatrix}$$

Since the desired values are +1 or -1 while the continuous activation can never reach these values exactly, a correction is applied in every step. The algorithm has not converged, so the training samples should be presented again; the sketch below carries out the three steps.
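This minimal NumPy sketch uses the same data, activation function, and learning rate as the example above.

```python
import numpy as np

eta = 0.1
X = [np.array([ 1.0, -2.0,  0.0, -1.0]),
     np.array([ 0.0,  1.5, -0.5, -1.0]),
     np.array([-1.0,  1.0,  0.5, -1.0])]
d = [-1.0, -1.0, 1.0]
w = np.array([1.0, -1.0, 0.0, 0.5])

phi  = lambda v: (1 - np.exp(-v)) / (1 + np.exp(-v))  # bipolar continuous activation
dphi = lambda v: 0.5 * (1 - phi(v) ** 2)              # slope in terms of the output

for x_i, d_i in zip(X, d):
    v = w @ x_i                                       # net activity v_k
    w = w + eta * (d_i - phi(v)) * dphi(v) * x_i      # delta-rule correction
    print(np.round(w, 3))   # reproduces w^2, w^3, w^4 up to rounding
```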

All the learning methods we have seen so far involve a single output neuron k at a time. In what follows, we will also meet a learning method where a layer of output neurons k = 1, 2, ..., p is involved.

4. Hebbian learning

Due to Donald Hebb, in his famous book The Organization of Behavior (1949):

"When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased."

For more complex kinds of learning, almost every learning model that has been proposed involves both output activity and input activity in the learning rule. The essential idea is that the amount of synaptic change is a function of both pre-synaptic and post-synaptic activity. Built on this idea, Hebbian learning is the oldest and most famous of all learning rules.

Hebb's statement is made in a neurobiological context. We may expand and rephrase it as a two-part rule:

- If the two neurons on either side of a synaptic connection are activated simultaneously (synchronously), then the strength of that synapse is selectively increased.
- If the two neurons on either side of a synaptic connection are activated asynchronously, then the strength of that synapse is selectively decreased.

Hebbian learning can be applied to neurons with binary or continuous activation functions. Putting the above mathematically:

Consider the single neuron k. The net activity of the neuron k is obtained as

$$v_k(n) = \mathbf{w}^T(n)\,\mathbf{x}(n), \qquad y_k(n) = \varphi\big(\mathbf{w}^T(n)\,\mathbf{x}(n)\big)$$

in the case of a logistic (continuous) bipolar activation function, and

$$y_k(n) = \operatorname{sign}(v_k(n)) = \pm 1$$

in the case of a bipolar hard-limiting activation function.

The corresponding weight correction is effected as

$$\Delta\mathbf{w}(n) = f\big(y_k(n), \mathbf{x}(n)\big)$$

where the function f can take a variety of different forms. One such form is

$$\Delta\mathbf{w}(n) = \eta\,y_k(n)\,\mathbf{x}(n), \qquad \mathbf{w}(n+1) = \mathbf{w}(n) + \Delta\mathbf{w}(n)$$

or, componentwise,

$$\Delta w_{kj}(n) = \eta\,y_k(n)\,x_j(n), \qquad w_{kj}(n+1) = w_{kj}(n) + \Delta w_{kj}(n)$$

where j denotes the input node just before (presynaptic to) neuron k.

Example
Consider a single perceptron with the set of input training vectors (samples) and initial weight vector

$$\mathbf{x}_1 = \begin{bmatrix} 1 \\ -2 \\ 1.5 \\ 0 \end{bmatrix},\quad \mathbf{x}_2 = \begin{bmatrix} 1 \\ -0.5 \\ -2 \\ -1.5 \end{bmatrix},\quad \mathbf{x}_3 = \begin{bmatrix} 0 \\ 1 \\ -1 \\ 1.5 \end{bmatrix};\qquad \mathbf{w}^1 = \begin{bmatrix} 1 \\ -1 \\ 0 \\ 0.5 \end{bmatrix}$$

Assume the learning rate $\eta = 1$ and a nonlinear bipolar hard-limiting activation function with output $\pm 1$.

Step 1

$$v_k = (\mathbf{w}^1)^T\mathbf{x}_1 = \begin{bmatrix} 1 & -1 & 0 & 0.5 \end{bmatrix}\begin{bmatrix} 1 \\ -2 \\ 1.5 \\ 0 \end{bmatrix} = 3$$

The updated weight is

$$\mathbf{w}^2 = \mathbf{w}^1 + \operatorname{sgn}(v_k)\,\mathbf{x}_1 = \mathbf{w}^1 + \mathbf{x}_1 = \begin{bmatrix} 2 \\ -3 \\ 1.5 \\ 0.5 \end{bmatrix}$$

Step 2

$$v_k = (\mathbf{w}^2)^T\mathbf{x}_2 = \begin{bmatrix} 2 & -3 & 1.5 & 0.5 \end{bmatrix}\begin{bmatrix} 1 \\ -0.5 \\ -2 \\ -1.5 \end{bmatrix} = -0.25$$

$$\mathbf{w}^3 = \mathbf{w}^2 + \operatorname{sgn}(v_k)\,\mathbf{x}_2 = \mathbf{w}^2 - \mathbf{x}_2 = \begin{bmatrix} 1 \\ -2.5 \\ 3.5 \\ 2 \end{bmatrix}$$

Step 3

$$v_k = (\mathbf{w}^3)^T\mathbf{x}_3 = -3$$

$$\mathbf{w}^4 = \mathbf{w}^3 + \operatorname{sgn}(v_k)\,\mathbf{x}_3 = \mathbf{w}^3 - \mathbf{x}_3 = \begin{bmatrix} 1 \\ -3.5 \\ 4.5 \\ 0.5 \end{bmatrix}$$
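A minimal NumPy sketch reproducing the three Hebbian steps above with the bipolar hard-limiting activation:

```python
import numpy as np

eta = 1.0
X = [np.array([1.0, -2.0,  1.5,  0.0]),
     np.array([1.0, -0.5, -2.0, -1.5]),
     np.array([0.0,  1.0, -1.0,  1.5])]
w = np.array([1.0, -1.0, 0.0, 0.5])

sgn = lambda v: 1.0 if v >= 0 else -1.0   # bipolar hard limiter

for x_i in X:                 # no teacher: only input and output activity
    v = w @ x_i               # net activity of neuron k
    y = sgn(v)                # output of neuron k
    w = w + eta * y * x_i     # Hebbian correction: delta-w = eta * y * x
    print(v, w)               # reproduces w^2, w^3, w^4
```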

Exercise:
Repeat the same problem using a continuous bipolar activation function instead of the hard-limiting function.

5. Competitive learning

With no available information regarding the desired outputs, unsupervised learning networks update their weights only on the basis of the input patterns. The competitive learning network is a popular scheme for achieving this type of unsupervised data clustering or classification.
Here we have p output neurons. The output of the winning neuron k is set equal to one, and the outputs of all the other neurons are set equal to zero:

$$y_k = \begin{cases} 1 & \text{if } v_k > v_i \text{ for all } i,\ i \neq k \\ 0 & \text{otherwise} \end{cases}$$

The weights connected to neuron k are normalized as

$$\sum_j w_{kj} = 1 \quad \text{for all } k$$

The weight correction is effected as

$$\Delta w_{kj} = \begin{cases} \eta\,(x_j - w_{kj}) & \text{if neuron } k \text{ wins the competition} \\ 0 & \text{if neuron } k \text{ loses the competition} \end{cases}$$

[Figure: a single competitive layer with output neurons O_1, ..., O_k, ..., O_p fully connected to the input nodes.]
Now consider three inputs that fall into the range [0, 1]. One can see that all the activity takes place on the surface of a unit sphere. The rule has the overall effect of moving the synaptic weight vector of the winning neuron towards the input pattern $\mathbf{x}$, so that the weight vector of the winning neuron k finally orients itself towards $\mathbf{x}$.

A more general scheme of competitive learning uses the Euclidean distance as the dissimilarity measure, in which the activation of the output unit k is

$$v_k = \left[\sum_{j=1}^{3}(x_j - w_{kj})^2\right]^{0.5} = \|\mathbf{x} - \mathbf{w}_k\|$$

The weight correction is effected as

$$\Delta w_{kj}(n) = \eta\,[x_j(n) - w_{kj}(n)], \qquad w_{kj}(n+1) = w_{kj}(n) + \Delta w_{kj}(n)$$

and in vector terms

$$\Delta\mathbf{w}_k(n) = \eta\,[\mathbf{x}(n) - \mathbf{w}_k(n)], \qquad \mathbf{w}_k(n+1) = \mathbf{w}_k(n) + \Delta\mathbf{w}_k(n)$$

In this case neither the data nor the weights need be of unit length. A minimal sketch of this scheme follows.
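In this sketch the winner is the neuron whose weight vector lies closest to the input; the number of neurons, input dimension, and learning rate are invented for illustration.

```python
import numpy as np

def competitive_step(W, x, eta=0.1):
    """One winner-take-all update; W holds one weight row per output neuron."""
    k = int(np.argmin(np.linalg.norm(W - x, axis=1)))  # winner: closest to x
    W[k] += eta * (x - W[k])     # move the winner towards x; losers unchanged
    return k

# illustrative use: three output neurons self-organize on random 2-D inputs
rng = np.random.default_rng(0)
W = rng.random((3, 2))           # initial weight vectors
for _ in range(500):
    competitive_step(W, rng.random(2))
print(W)                         # weight vectors settle near cluster centers
```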

Exercise

The weight correction proposed in the competitive learning method may be viewed as a steepest-descent step that minimizes some cost function. What is that cost function?
