
Neural Nets Using Backpropagation

Chris Marriott Ryan Shirley CJ Baker Thomas Tannahill

Agenda

Review of Neural Nets and Backpropagation
Backpropagation: The Math
Advantages and Disadvantages of Gradient Descent and other algorithms
Enhancements of Gradient Descent
Other ways of minimizing error

Review

Approach that developed from an analysis of the human brain
Nodes created as an analog to neurons
Mainly used for classification problems (e.g. character recognition, voice recognition, medical applications, etc.)

Review

Neurons have weighted inputs, threshold values, an activation function, and an output

[Diagram: a single neuron; the weighted inputs feed the activation function, output = f(sum(inputs * weights))]
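A minimal sketch of this computation in Python (names are illustrative, and a sigmoid is assumed for the activation function f):

import math

def neuron_output(inputs, weights):
    s = sum(x * w for x, w in zip(inputs, weights))  # weighted sum of inputs
    return 1.0 / (1.0 + math.exp(-s))                # activation f(s) = sigmoid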

Review
4-input AND

[Diagram: a 4-input AND built from threshold units, two in a first layer feeding one combining unit; threshold = 1.5 for each unit, all weights = 1, and each unit outputs 1 if active, 0 otherwise]
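A small sketch of these threshold units (function names are illustrative; a unit fires when its weighted input sum exceeds the threshold):

def threshold_unit(inputs, weights, threshold=1.5):
    # Output 1 if the weighted sum of inputs exceeds the threshold, else 0.
    return 1 if sum(x * w for x, w in zip(inputs, weights)) > threshold else 0

# 2-input AND: with all weights = 1 and threshold = 1.5, only (1, 1) fires.
# A 4-input AND can then be built by feeding two such units into a third.
def and4(a, b, c, d):
    return threshold_unit([threshold_unit([a, b], [1, 1]),
                           threshold_unit([c, d], [1, 1])], [1, 1])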

Review

Output space for the AND gate

[Diagram: the four inputs (0,0), (0,1), (1,0), (1,1) plotted on the Input 1 / Input 2 plane; the line 1.5 = w1*I1 + w2*I2 separates (1,1) from the other three points]

Review

Output space for the XOR gate: demonstrates the need for a hidden layer

[Diagram: the four inputs (0,0), (0,1), (1,0), (1,1) plotted on the Input 1 / Input 2 plane; no single straight line separates (0,1) and (1,0) from (0,0) and (1,1)]

Backpropagation: The Math

General multi-layered neural network


[Diagram: a general multi-layered network with an input layer, a hidden layer, and an output layer, with labelled connection weights (W, X) between the layers]

Backpropagation: The Math

Backpropagation

Calculation of hidden layer activation values
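The usual form of this calculation (an assumption; the equation image is not reproduced in this text, and w_ij here denotes the weight from input i to hidden unit j):

h_j = f( sum_i w_ij * x_i ),  with f typically the sigmoid f(s) = 1 / (1 + e^-s)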

Backpropagation: The Math

Backpropagation

Calculation of output layer activation values
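Similarly, the usual form (again an assumption, with w_jk the weight from hidden unit j to output unit k):

O_k = f( sum_j w_jk * h_j )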

Backpropagation: The Math

Backpropagation

Calculation of error

d_k = f(D_k) - f(O_k)

where D_k and O_k are the desired and actual values at output unit k, and f is the activation function

Backpropagation: The Math

Backpropagation

Gradient Descent objective function

Gradient Descent termination condition
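The standard forms these take (an assumption based on the error definition above):

Objective:    E = 1/2 * sum_k (d_k)^2,  summed over all output units k
Termination:  stop when E falls below a chosen threshold epsilon on every training pattern, or after a maximum number of iterations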

Backpropagation: The Math

Backpropagation

Output layer weight recalculation, using the learning rate n (e.g. 0.25) and the error d_k at output unit k
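The standard gradient-descent update this slide presumably showed (an assumption, using the labels above):

w_jk(new) = w_jk(old) + n * d_k * h_j

where n is the learning rate (e.g. 0.25), d_k the error at output unit k, and h_j the activation of hidden unit j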

Backpropagation: The Math

Backpropagation

Hidden Layer weight recalculation
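The corresponding standard updates (an assumption), which back-propagate the output errors through the hidden-to-output weights:

d_j = f'(S_j) * sum_k ( d_k * w_jk )
w_ij(new) = w_ij(old) + n * d_j * x_i

where S_j is the weighted input sum of hidden unit j.

A compact sketch of one training step in Python (illustrative names; sigmoid activations are assumed, and the textbook delta rule with the sigmoid derivative is used for the errors):

import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def train_step(x, target, W_in, W_out, lr=0.25):
    # Forward pass: hidden and output activations.
    h = [sigmoid(sum(W_in[i][j] * x[i] for i in range(len(x))))
         for j in range(len(W_in[0]))]
    o = [sigmoid(sum(W_out[j][k] * h[j] for j in range(len(h))))
         for k in range(len(W_out[0]))]

    # Output-layer errors (sigmoid derivative is o * (1 - o)).
    d_out = [(target[k] - o[k]) * o[k] * (1 - o[k]) for k in range(len(o))]

    # Hidden-layer errors: back-propagate d_out through the output weights.
    d_hid = [h[j] * (1 - h[j]) *
             sum(W_out[j][k] * d_out[k] for k in range(len(o)))
             for j in range(len(h))]

    # Gradient-descent weight updates.
    for j in range(len(h)):
        for k in range(len(o)):
            W_out[j][k] += lr * d_out[k] * h[j]
    for i in range(len(x)):
        for j in range(len(h)):
            W_in[i][j] += lr * d_hid[j] * x[i]
    return o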

Backpropagation Using Gradient Descent

Advantages

Relatively simple implementation
Standard method and generally works well

Disadvantages

Slow and inefficient
Can get stuck in local minima, resulting in sub-optimal solutions

Error Back-Propagation Algorithm

Back-propagation is one of the simplest and most general methods for training multilayer neural networks. Its power is that it enables us to compute an effective error for each hidden unit, and thus derive a learning rule for the input-to-hidden weights. Our goal is to set the interconnection weights based on the training patterns and the desired outputs. A key disadvantage of the error back-propagation algorithm is its slow convergence speed.

The BP algorithm suffers from the problem of local minima

Advantages & Disadvantages

MLP and BP are used in cognitive and computational neuroscience modelling, but the algorithm still has no real neuro-physiological support
The algorithm can be used to build encoding/decoding and compression systems, useful for data pre-processing operations
The MLP with the BP algorithm is a universal approximator of functions
The algorithm is computationally efficient, with O(W) complexity in the number of model parameters W
The algorithm has local robustness
The convergence of BP can be very slow, especially in large problems, depending on the method used

Advantages

A neural network can perform tasks that a linear program cannot.
When an element of the neural network fails, it can continue without any problem because of its parallel nature.
A neural network learns and does not need to be reprogrammed.
It can be implemented in any application without any problem.

Disadvantages

The neural network needs training to operate.
The architecture of a neural network is different from the architecture of microprocessors and therefore needs to be emulated.
Requires high processing time for large neural networks.

Alternatives To Gradient Descent

Simulated Annealing

Advantages

Can guarantee an optimal solution (global minimum)

Disadvantages

May be slower than gradient descent
Much more complicated implementation

Alternatives To Gradient Descent

Genetic Algorithms/Evolutionary Strategies

Advantages

Faster than simulated annealing
Less likely to get stuck in local minima

Disadvantages

Slower than gradient descent
Memory intensive for large nets

Alternatives To Gradient Descent

Simplex Algorithm

Advantages

Similar to gradient descent but faster
Easy to implement

Disadvantages

Does not guarantee a global minimum

Enhancements To Gradient Descent

Momentum

Adds a percentage of the last movement to the current movement

Enhancements To Gradient Descent

Momentum

Useful to get over small bumps in the error function
Often finds a minimum in fewer steps

Δw(t) = -n*d*y + a*Δw(t-1)

where Δw is the change in weight, n is the learning rate, d is the error, y differs depending on which layer we are calculating, and a is the momentum parameter
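A minimal sketch of this update rule in Python (illustrative names; prev_dw holds the previous step's weight change):

def momentum_update(w, grad, prev_dw, lr=0.25, alpha=0.9):
    # grad plays the role of d*y above; alpha is the momentum parameter a.
    dw = -lr * grad + alpha * prev_dw   # Δw(t) = -n*d*y + a*Δw(t-1)
    return w + dw, dw                   # updated weight and the change to remember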

Enhancements To Gradient Descent

Adaptive Backpropagation Algorithm


It assigns each weight its own learning rate
That learning rate is determined by the sign of the gradient of the error function from the last iteration

If the signs are equal, the slope is more likely to be shallow, so the learning rate is increased
The signs are more likely to differ on a steep slope, so the learning rate is decreased

This speeds up the advancement on gradual slopes (see the sketch below)
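A sketch of the sign-comparison rule (the increase/decrease factors are illustrative assumptions):

def adapt_rate(rate, grad, prev_grad, up=1.1, down=0.5):
    # Same sign as last iteration -> probably a shallow slope -> speed up;
    # sign flipped -> probably a steep slope or overshoot -> slow down.
    return rate * up if grad * prev_grad > 0 else rate * down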

Enhancements To Gradient Descent

Adaptive Backpropagation

Possible Problems:

Since we minimize the error for each weight separately, the overall error may increase

Solution:

Calculate the total output error after each adaptation; if it is greater than the previous error, reject that adaptation and calculate new learning rates

Enhancements To Gradient Descent

SuperSAB (Super Self-Adapting Backpropagation)


Combines the momentum and adaptive methods: it uses the adaptive method and momentum so long as the sign of the gradient does not change

This gives an additive effect of both methods, resulting in faster traversal of gradual slopes

When the sign of the gradient does change, the momentum will cancel the drastic drop in learning rate

This allows the search to roll up the other side of the minimum, possibly escaping local minima

Enhancements To Gradient Descent

SuperSAB

Experiments show that SuperSAB converges faster than gradient descent
Overall, this algorithm is less sensitive (and so is less likely to get caught in local minima)

Other Ways To Minimize Error

Varying training data

Cycle through input classes
Randomly select from input classes

Add noise to training data

Randomly change the value of an input node (with low probability), e.g. in speech recognition (see the sketch below)
Retrain with expected inputs after initial training
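A sketch of the noise-injection idea (the probability and noise scale are illustrative assumptions):

import random

def add_noise(inputs, prob=0.05, scale=0.1):
    # With low probability, perturb an input value with small Gaussian noise.
    return [x + random.gauss(0.0, scale) if random.random() < prob else x
            for x in inputs]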

Other Ways To Minimize Error

Adding and removing neurons from layers

Adding neurons speeds up learning but may cause loss in generalization
Removing neurons has the opposite effect

Resources

Artificial Neural Networks, Backpropagation, J. Henseler
Artificial Intelligence: A Modern Approach, S. Russell & P. Norvig
501 notes, J.R. Parker
www.dontveter.com/bpr/bpr.html
www.dse.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html

Local Minima

[Diagram: an error surface with a local minimum and the global minimum]

Problems with Local Minima

Backpropagation is gradient descent search


Where the height of the hills is determined by error
But there are many dimensions to the space, one for each weight in the network

Therefore backpropagation can find its way into local minima

One partial solution: random re-start

Learn lots of networks, starting with different random weight settings
Can take the best network
Or can set up a committee of networks to categorise examples

Another partial solution: Momentum

Adding Momentum

Imagine rolling a ball down a hill

[Diagram: an error surface with a small dip before the global minimum; without momentum the ball gets stuck in the dip, with momentum it rolls through to the bottom]

Momentum in Backpropagation

For each weight

Remember what was added in the previous epoch

In the current epoch

Add on a small amount of the previous weight change

The amount is determined by the momentum parameter, denoted α (alpha), which is taken to be between 0 and 1

How Momentum Works

If the direction of the weight change stays the same


Then the movement of the search gets bigger
The amount of extra movement is compounded in each epoch
May mean that narrow local minima are avoided
May also mean that the convergence rate speeds up

Caution:

May not have enough momentum to get out of local minima
Also, too much momentum might carry the search back out of the global minimum and into a local minimum

Convergence and Local Minima


Gradient descent converges to some local minimum
Perhaps not the global minimum...

Heuristics to alleviate the problem of local minima


Add momentum
Use stochastic gradient descent rather than true gradient descent
Train multiple nets with different initial weights using the same data
