
Artificial Neural Networks
CSL465/603 - Fall 2016
Narayanan C Krishnan
ckn@iitrpr.ac.in

Outline
Perceptron
Stochastic Gradient Descent
Multi-layer perceptron
Backpropagation algorithm
Variants of backpropagation networks


Properties of Neural Networks


Inspired by biological neural systems
Neuron-like switching units
Weighted interconnections among units
Highly parallel, distributed processing
The weights are learned automatically


Perceptron
Developed by Frank Rosenblatt in the late 1950s and early 1960s
The initial version was a piece of hardware (the Mark 1 perceptron)

Figure 4.8 Illustration of the Mark 1 perceptron hardware. The photograph on the left shows how the inputs
were obtained using a simple camera system in which an input scene, in this case a printed character, was
illuminated by powerful lights, and an image focussed onto a 20×20 array of cadmium sulphide photocells,
giving a primitive 400 pixel image. The perceptron also had a patch board, shown in the middle photograph,
which allowed different configurations of input features to be tried. Often these were wired up at random to
demonstrate the ability of the perceptron to learn without the need for precise wiring, in contrast to a modern
digital computer. The photograph on the right shows one of the racks of adaptive weights. Each weight was
implemented using a rotary variable resistor, also called a potentiometer, driven by an electric motor thereby
allowing the value of the weight to be adjusted automatically by the learning algorithm.

Perceptron
Input vector x (a column vector)
Weight vector w (a column vector)
[Figure: a perceptron unit with inputs x_1, ..., x_n, a bias input x_0 = 1, weights w_0, w_1, ..., w_n, a weighted sum, and a threshold output.]

$net = \sum_{i=0}^{n} w_i x_i$

Output value: $o(\mathbf{x}) = +1$ if $\mathbf{w}^T\mathbf{x} > 0$, $-1$ otherwise

Representational Power of
Perceptron (1)
The perceptron represents a hyperplane decision surface
[Figure: two 2-D datasets with positive (+) and negative (-) examples: (a) a linearly separable set with a separating line, (b) a set that cannot be separated by a single line.]
Decision boundary: $\mathbf{w}^T\mathbf{x} = 0$
Datasets that can be separated by a hyperplane are called linearly separable.

Representational Power of
Perceptron (2)
A single perceptron can represent many Boolean functions (e.g., AND, OR, NAND, NOR), as sketched below.
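Not from the slides: a minimal Python sketch of a threshold unit with hand-picked (purely illustrative) weights computing AND and OR.

```python
import numpy as np

def perceptron(x, w):
    """Threshold unit: +1 if w . [1, x] > 0, else -1 (x0 = 1 is the bias input)."""
    return 1 if w @ np.concatenate(([1.0], x)) > 0 else -1

# Hand-picked weights (illustrative): w = [w0, w1, w2]
w_and = np.array([-0.8, 0.5, 0.5])   # fires only when both inputs are 1
w_or  = np.array([-0.3, 0.5, 0.5])   # fires when at least one input is 1

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, "AND:", perceptron(np.array(x), w_and), "OR:", perceptron(np.array(x), w_or))
```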

Perceptron Criterion
$o(\mathbf{x}) = +1$ if $\mathbf{w}^T\mathbf{x} > 0$, $-1$ if $\mathbf{w}^T\mathbf{x} \le 0$
Using the target coding scheme $t_n \in \{+1, -1\}$, it follows that all data points should satisfy
$\mathbf{w}^T\mathbf{x}_n t_n > 0, \quad n = 1, \ldots, N$
A possible error function (the perceptron criterion) could be
$E_P(\mathbf{w}) = -\sum_{n \in \mathcal{M}} \mathbf{w}^T\mathbf{x}_n t_n$
where $\mathcal{M}$ is the set of misclassified points.
Parameter update using stochastic gradient descent (for a misclassified point), as sketched in the code below:
$\mathbf{w}^{new} = \mathbf{w}^{old} + \eta\, \mathbf{x}_n t_n$
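A minimal sketch of this learning rule, assuming numpy, targets coded as ±1, an input matrix X with a leading bias column of 1s, and a fixed learning rate eta (all names illustrative).

```python
import numpy as np

def train_perceptron(X, t, eta=1.0, max_epochs=100):
    """Perceptron learning: for each misclassified point, w <- w + eta * t_n * x_n.
    X has a leading column of 1s (bias); t contains +1/-1 labels."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x_n, t_n in zip(X, t):
            if t_n * (w @ x_n) <= 0:        # misclassified (or on the boundary)
                w = w + eta * t_n * x_n     # perceptron update
                errors += 1
        if errors == 0:                     # converged: all points correctly classified
            break
    return w
```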

Perceptron Update Rule Illustration


Perceptron Convergence
Perceptron convergence theorem
If there exists an exact solution (i.e., if the training data set is linearly separable), then the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of steps.

However, it might require a substantial number of steps to converge.
In practice it is hard to distinguish a non-separable problem from one that is merely slow to converge.


Perceptron Training Through Gradient Descent
Best-fit solutions for non-linearly separable data
Gradient descent to find the weights that best fit the training examples
Least squares training error (with $o_n$ the unthresholded output $\mathbf{w}^T\mathbf{x}_n$):
$E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N} (t_n - o_n)^2$
Weight update equation (see the sketch below):
$\mathbf{w}^{new} = \mathbf{w}^{old} + \eta \sum_{n=1}^{N} (t_n - o_n)\,\mathbf{x}_n$
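A sketch of the batch update above for an unthresholded linear unit; X, t, eta, and the iteration count are illustrative assumptions.

```python
import numpy as np

def gradient_descent(X, t, eta=0.01, n_iters=1000):
    """Batch update: w <- w + eta * sum_n (t_n - o_n) x_n, with o_n = w^T x_n."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        o = X @ w                          # outputs for all training examples
        w = w + eta * (X.T @ (t - o))      # gradient step on E(w) = 1/2 sum (t_n - o_n)^2
    return w
```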

Stochastic Approximation
Practical difficulties with gradient descent
Convergence to a (local) minimum can be slow
Incremental/stochastic gradient descent
Update the weights following the calculation of the error for each individual example
Error for a single example:
$E_n(\mathbf{w}) = \frac{1}{2}(t_n - o_n)^2$
Parameter update (sketched below):
$\mathbf{w}^{new} = \mathbf{w}^{old} + \eta\,(t_n - o_n)\,\mathbf{x}_n$
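For comparison with the batch sketch, the incremental (stochastic) variant of the same rule: one weight update per training example.

```python
import numpy as np

def stochastic_gradient_descent(X, t, eta=0.01, n_epochs=50):
    """Incremental update: w <- w + eta * (t_n - o_n) x_n after each example."""
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for x_n, t_n in zip(X, t):
            o_n = w @ x_n                      # unthresholded output for this example
            w = w + eta * (t_n - o_n) * x_n    # immediate update
    return w
```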


Limitation of Perceptron
Can represent only linearly separable functions
Example: the XOR function
For inputs $x_1, x_2 \in \{0, 1\}$, a perceptron firing when $w_0 + w_1 x_1 + w_2 x_2 > 0$ would need
$w_0 \le 0$ (for input $(0,0)$)
$w_0 + w_2 > 0$ (for input $(0,1)$)
$w_0 + w_1 > 0$ (for input $(1,0)$)
$w_0 + w_1 + w_2 \le 0$ (for input $(1,1)$)
Adding the two middle inequalities gives $2w_0 + w_1 + w_2 > 0$; with $w_0 \le 0$ this forces $w_0 + w_1 + w_2 > 0$, contradicting the last constraint, so no such weights exist.

Multilayer Perceptron
Architecture of the multilayer network
Nodes in the hidden layer
Algorithm to learn the weights of the connections between nodes

[Figure (Mitchell, Fig. 4.5): decision regions of a multilayer feedforward network. The network shown was trained to recognize vowel sounds (e.g., head, hid, had, hod, heed, hud, who'd, hood, hawed, hoard) from two input features F1 and F2.]

Architecture and Hidden Layer (1)


Key difference from the perceptron: a differentiable non-linear activation function
Sigmoid
Tanh
[Figure: a sigmoid unit with inputs x_1, ..., x_n, bias input x_0 = 1, and weights w_0, w_1, ..., w_n.]

$net = \sum_{i=0}^{n} w_i x_i$

$o = \sigma(net) = \dfrac{1}{1 + e^{-net}}$


Architecture and Hidden Layer (2)


[Figure: a two-layer network with inputs x_1, ..., x_d, hidden units z_1, ..., z_m, and outputs y_1, ..., y_K; bias units are included at the input and hidden layers.]

Input: $\mathbf{x}$
Connection weights between the input and hidden layer: $\mathbf{w}_j$ (weights into hidden unit $j$)
Output of the $j$-th hidden layer node: $z_j = \sigma(\mathbf{w}_j^T \mathbf{x})$
Connection weights between the hidden and output layer: $\mathbf{v}_k$ (weights into output unit $k$)
Output of the $k$-th output layer node: $y_k = \sigma(\mathbf{v}_k^T \mathbf{z})$

Forward Pass of MLP


[Figure: the same two-layer network as on the previous slide.]

Given an input $\mathbf{x}$, the output of the neural network is estimated using a forward pass (sketched below):
Hidden layer outputs: $z_j = \sigma(\mathbf{w}_j^T \mathbf{x})$, $j = 1, \ldots, m$
Output layer: $y_k = \sigma(\mathbf{v}_k^T \mathbf{z})$, $k = 1, \ldots, K$
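A minimal numpy sketch of this forward pass for a single hidden layer of sigmoid units; the weight-matrix shapes (W: m x (d+1), V: K x (m+1)) and the convention of prepending a bias 1 are assumptions of the sketch.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W, V):
    """Forward pass: z_j = sigmoid(w_j^T x), y_k = sigmoid(v_k^T z).
    W: m x (d+1) input-to-hidden weights, V: K x (m+1) hidden-to-output weights."""
    x = np.concatenate(([1.0], x))      # prepend bias input x0 = 1
    z = sigmoid(W @ x)                  # hidden layer outputs
    z = np.concatenate(([1.0], z))      # prepend hidden bias z0 = 1
    y = sigmoid(V @ z)                  # output layer
    return z, y
```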


Backpropagation Algorithm (1)


Stochastic gradient descent
Error for the $n$-th data point:
$E_n(\mathbf{w}, \mathbf{v}) = \frac{1}{2}\|\mathbf{t}_n - \mathbf{y}_n\|^2 = \frac{1}{2}\sum_{k=1}^{K}(t_{nk} - y_{nk})^2$
Training rule for the output node weights $v_{jk}$: apply the chain rule to $\partial E_n / \partial v_{jk}$ through the output $y_k$.

Backpropagation Algorithm (2)


Training rule for hidden node weights $w_{ij}$:
$\dfrac{\partial E_n}{\partial w_{ij}} = \sum_{k=1}^{K} \dfrac{\partial E_n}{\partial y_k}\,\dfrac{\partial y_k}{\partial z_j}\,\dfrac{\partial z_j}{\partial w_{ij}}$
Since hidden unit $z_j$ feeds all $K$ output units (through the weights $v_{j1}, \ldots, v_{jK}$), the chain rule sums its contribution to the error over all outputs.

Backpropagation Algorithm (3)


Weight update terms
Output layer weights:
$\Delta v_{jk} = \eta\,(t_k - y_k)\,y_k(1 - y_k)\,z_j$
Hidden layer weights:
$\Delta w_{ij} = \eta\left[\sum_{k=1}^{K}(t_k - y_k)\,y_k(1 - y_k)\,v_{jk}\right] z_j(1 - z_j)\,x_i$
Forward pass: propagate the input $\mathbf{x}_n$ to obtain the output $\mathbf{y}_n$
Backward pass: calculate the error and propagate it backwards (see the sketch below)
$v_{jk}^{new} = v_{jk}^{old} + \Delta v_{jk}$
$w_{ij}^{new} = w_{ij}^{old} + \Delta w_{ij}$
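A sketch of one stochastic backpropagation step implementing the update terms above; it reuses the forward-pass sketch from the earlier slide, and eta and the bias handling are assumptions of the sketch.

```python
import numpy as np

def backprop_update(x, t, W, V, eta=0.1):
    """One stochastic gradient step for the squared error on a single example (x, t)."""
    z, y = forward(x, W, V)                     # z includes the leading bias term z0 = 1
    x = np.concatenate(([1.0], x))              # input with bias
    delta_out = (t - y) * y * (1.0 - y)         # (t_k - y_k) y_k (1 - y_k) for each output
    delta_hid = (V[:, 1:].T @ delta_out) * z[1:] * (1.0 - z[1:])  # backpropagated error
    V += eta * np.outer(delta_out, z)           # Delta v_jk = eta * delta_k * z_j
    W += eta * np.outer(delta_hid, x)           # Delta w_ij = eta * delta_j * x_i
    return W, V
```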

Backpropagation Algorithm (4)


Randomize the order of the training data
Iterate over a number of epochs / until the termination criterion is met
For each input data point:
perform a forward pass
backpropagate the error
Termination criteria:
all training samples are correctly classified
the error between two consecutive epochs does not change significantly
a limit on the number of epochs is reached
(a minimal sketch of this loop is given below)
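A sketch of the overall loop described above, assuming the forward and backprop_update sketches from the previous slides; the error tolerance and epoch limit are illustrative.

```python
import numpy as np

def train(X, T, W, V, eta=0.1, max_epochs=1000, tol=1e-4):
    """Stochastic backpropagation over epochs with simple termination criteria."""
    prev_error = np.inf
    for epoch in range(max_epochs):
        order = np.random.permutation(len(X))        # randomize the order of the training data
        for n in order:
            W, V = backprop_update(X[n], T[n], W, V, eta)
        # total squared error over the training set after this epoch
        error = sum(np.sum((T[n] - forward(X[n], W, V)[1]) ** 2) for n in range(len(X))) / 2
        if abs(prev_error - error) < tol:            # error no longer changes significantly
            break
        prev_error = error
    return W, V
```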

MLP for Classification - K = 2
A sigmoid output layer node models the posterior probability $p(C = 1 \mid \mathbf{x})$
Error function - cross entropy:
$E(\mathbf{w}, \mathbf{v}) = -\left[t \log y + (1 - t)\log(1 - y)\right]$


MLP for Classification - K ≥ 3
Number of nodes in the output layer = K
Softmax function at the output layer:
$y_k = \dfrac{\exp(a_k)}{\sum_{k'}\exp(a_{k'})}$, where $a_k = \mathbf{v}_k^T\mathbf{z}$
Error function (cross entropy):
$E = -\sum_{k=1}^{K} t_k \log y_k$
Weight update equations (sketched below):
$\Delta v_{jk} = \eta\,(t_k - y_k)\,z_j$
$\Delta w_{ij} = \eta \sum_{k=1}^{K}(t_k - y_k)\,v_{jk}\,z_j(1 - z_j)\,x_i$
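A sketch of a softmax output layer with the cross-entropy error; with this pairing the output-layer error term reduces to (t_k − y_k), as in the update above. Shapes and names follow the earlier sketches and are assumptions.

```python
import numpy as np

def softmax(a):
    a = a - np.max(a)                    # subtract the max for numerical stability
    e = np.exp(a)
    return e / np.sum(e)

def forward_softmax(x, W, V):
    """Forward pass with a sigmoid hidden layer and a softmax output layer."""
    x = np.concatenate(([1.0], x))
    z = np.concatenate(([1.0], 1.0 / (1.0 + np.exp(-(W @ x)))))
    y = softmax(V @ z)                   # y_k = exp(a_k) / sum_k' exp(a_k')
    return z, y

def cross_entropy(t, y):
    return -np.sum(t * np.log(y))        # E = -sum_k t_k log y_k (t is one-hot)
```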

Improving Convergence (1)
Gradient descent can be slow to converge
Successive weight updates can lead to large oscillations (stochastic updates)
Use the previous weight update to smooth the trajectory
Introduce a momentum term $\alpha$ (a constant between 0 and 1); see the sketch below:
$\Delta w_{ij}^{(t)} = -\eta\,\dfrac{\partial E}{\partial w_{ij}} + \alpha\,\Delta w_{ij}^{(t-1)}$
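A sketch of the momentum update for one weight matrix; grad stands for the current gradient dE/dW, and the default alpha is an illustrative choice.

```python
import numpy as np

def momentum_step(W, grad, prev_delta, eta=0.1, alpha=0.9):
    """Delta W^(t) = -eta * dE/dW + alpha * Delta W^(t-1)."""
    delta = -eta * grad + alpha * prev_delta
    return W + delta, delta             # return delta so it can be reused at the next step
```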


Improving Convergence (2)
Learning rate $\eta$ (e.g., a value in $(0, 0.3)$)
A low learning rate avoids oscillations, but convergence is slow
Adaptive learning rate (sketched below):
$\eta^{(t)} = \eta^{(t-1)} + \Delta\eta$
$\Delta\eta = +a$ if $E^{(t)} < E^{(t-1)}$, $\;-b\,\eta^{(t-1)}$ otherwise
Increase $\eta$ if the error decreases
Decrease $\eta$ if the error increases
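A sketch of the adaptive learning-rate rule above; the constants a and b are assumed small positive values.

```python
def adapt_learning_rate(eta, error, prev_error, a=0.01, b=0.1):
    """eta^(t) = eta^(t-1) + delta_eta: add a if the error decreased,
    otherwise shrink eta by a factor b."""
    if error < prev_error:
        return eta + a          # error decreased: increase the learning rate
    return eta - b * eta        # error increased: decrease the learning rate
```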


Representational Power of MLP with Non-linear Activation Functions
The XOR problem can now be solved (e.g., with two hidden units), as sketched below
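An illustration (hand-picked weights, not from the slides) that a network with a non-linear hidden layer can represent XOR: with steep sigmoids the two hidden units approximate OR and AND, and the output fires when OR is on but AND is off.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def xor_net(x1, x2):
    # Hidden units (the scale factor 10 makes each sigmoid nearly a step function)
    h1 = sigmoid(10 * (x1 + x2 - 0.5))   # approx OR(x1, x2)
    h2 = sigmoid(10 * (x1 + x2 - 1.5))   # approx AND(x1, x2)
    y = sigmoid(10 * (h1 - h2 - 0.5))    # approx (h1 AND NOT h2) = XOR
    return 1 if y > 0.5 else 0

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, xor_net(*x))                # prints 0, 1, 1, 0
```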


Hidden Layer Representations (1)


Do the hidden layer nodes encode something meaningful?
Autoencoder network
[Figure: a network trained to reproduce its inputs at its outputs; the hidden layer forms a compressed encoding.]

Hidden Layer Representations (2)


[Figures (values plotted over roughly 2500 training iterations): hidden unit encoding for input 01000000; sum of squared errors for each output unit; weights from the inputs to one hidden unit.]

Overfitting (1)
Overtraining could result in overfitting!
[Figures: training set error and test set error plotted against training iterations for two tasks; the training error keeps decreasing while the test error eventually starts to rise.]

Overfitting (2)
A network with $d + 1$ inputs, $K$ outputs and $m + 1$ hidden layer units (the +1s being bias units) has a total of $m(d + 1) + K(m + 1)$ parameters to be learned!
$m$ determines the complexity of the model
A large value of $m$: complex functions that are prone to overfitting
A small value of $m$: simpler functions that underfit the data

Solutions to avoid overfitting (1)


Tuning the number of hidden layer units
Test the learned network on an unseen validation set


Solutions to avoid overfitting (2)


Tuning the number of hidden layer units
Test the learned network on an unseen validation set
Dynamic node creation
Start with an initial-size hidden layer
If the training error is high, add another hidden layer unit
Randomly initialize the weights of the new unit
Continue with backpropagation


Solutions to avoid overfitting (3)


Regularization
Weight decay: penalize models with many non-zero connection weights (see the sketch below)
$E'(\mathbf{w}) = E(\mathbf{w}) + \dfrac{\lambda}{2}\|\mathbf{w}\|^2$
Grouping of connections (weight sharing)
Force a set of connections to have the same weight
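A sketch of how the weight-decay penalty enters the gradient step: the extra lambda*W term shrinks every weight towards zero at each update; lam is an assumed regularization constant.

```python
import numpy as np

def weight_decay_step(W, grad, eta=0.1, lam=1e-3):
    """Gradient step on E'(W) = E(W) + (lambda/2) * ||W||^2,
    whose gradient is dE/dW + lambda * W."""
    return W - eta * (grad + lam * W)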


Convolutional Neural Networks


Three key properties
Local receptive fields
Weight sharing
Subsampling

Successful in Computer Vision tasks

[Figure: a convolutional network stage consisting of an input image, a convolutional layer, and a sub-sampling layer; a toy sketch of these operations follows below.]
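A toy sketch (not from the slides) of the three properties on a small grayscale image: one shared kernel slid over local receptive fields (convolution), followed by 2x2 average subsampling.

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid 2-D convolution: the same (shared) kernel is applied to every
    local receptive field of the image."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def subsample(fmap, size=2):
    """Average pooling over non-overlapping size x size blocks."""
    h, w = (fmap.shape[0] // size) * size, (fmap.shape[1] // size) * size
    f = fmap[:h, :w].reshape(h // size, size, w // size, size)
    return f.mean(axis=(1, 3))
```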

Recurrent Networks
Neural network models applied to time series data
The outputs of the network at time t are also fed as inputs to other units at time t+1


Summary
Perceptron
Linearly separable data
Perceptron criterion

Best fit approximation for non-linearly separable data


Stochastic gradient descent

Multilayer perceptrons
Non-linear activation functions
Hidden layer units
Backpropagation algorithm for training

Overfitting in backpropagation networks


Variants
CNN, RNN

