
Artificial Neural Networks
CSL465/603 - Fall 2016
Narayanan C Krishnan
ckn@iitrpr.ac.in

Outline
Perceptron
Stochastic Gradient Descent
Multi-layer perceptron
Backpropagation algorithm
Variants of backpropagation networks


Properties of Neural Networks


Inspired by biological neural systems
Neuron-like switching units
Weighted interconnections among units
Highly parallel, distributed processing
The weights are learned automatically


Perceptron
Developed by Frank Rosenblatt in the late 1950s and early 1960s
The initial version was a piece of hardware (the Mark 1 perceptron)

Figure 4.8 Illustration of the Mark 1 perceptron hardware. The photograph on the left shows how the inputs
were obtained using a simple camera system in which an input scene, in this case a printed character, was
illuminated by powerful lights, and an image focussed onto a 20×20 array of cadmium sulphide photocells,
giving a primitive 400 pixel image. The perceptron also had a patch board, shown in the middle photograph,
which allowed different configurations of input features to be tried. Often these were wired up at random to
demonstrate the ability of the perceptron to learn without the need for precise wiring, in contrast to a modern
digital computer. The photograph on the right shows one of the racks of adaptive weights. Each weight was
implemented using a rotary variable resistor, also called a potentiometer, driven by an electric motor thereby
allowing the value of the weight to be adjusted automatically by the learning algorithm.

Perceptron
Input vector x (a column vector)
Weight vector w (a column vector)
[Figure: a perceptron unit with inputs x_1, ..., x_n, a bias input x_0 = 1, weights w_0, w_1, ..., w_n, a weighted sum, and a threshold output.]

$net = \sum_{i=0}^{n} w_i x_i$

Output value: $o(\mathbf{x}) = +1$ if $\mathbf{w}^T\mathbf{x} > 0$, $-1$ otherwise

Representational Power of
Perceptron (1)
The perceptron represents a hyperplane decision surface
[Figure: two 2-D datasets with positive (+) and negative (-) examples: (a) a linearly separable set with a separating line, (b) a set that cannot be separated by a single line.]
Decision boundary: $\mathbf{w}^T\mathbf{x} = 0$
Datasets that can be separated by a hyperplane are called linearly separable.

Representational Power of
Perceptron (2)
A single perceptron can represent many Boolean functions (e.g., AND, OR, NAND, NOR), as sketched below.
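Not from the slides: a minimal Python sketch of a threshold unit with hand-picked (purely illustrative) weights computing AND and OR.

```python
import numpy as np

def perceptron(x, w):
    """Threshold unit: +1 if w . [1, x] > 0, else -1 (x0 = 1 is the bias input)."""
    return 1 if w @ np.concatenate(([1.0], x)) > 0 else -1

# Hand-picked weights (illustrative): w = [w0, w1, w2]
w_and = np.array([-0.8, 0.5, 0.5])   # fires only when both inputs are 1
w_or  = np.array([-0.3, 0.5, 0.5])   # fires when at least one input is 1

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, "AND:", perceptron(np.array(x), w_and), "OR:", perceptron(np.array(x), w_or))
```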

Perceptron Criterion
$o(\mathbf{x}) = +1$ if $\mathbf{w}^T\mathbf{x} > 0$, $-1$ if $\mathbf{w}^T\mathbf{x} \le 0$
Using the target coding scheme $t_n \in \{+1, -1\}$, it follows that all data points should satisfy
$\mathbf{w}^T\mathbf{x}_n t_n > 0, \quad n = 1, \ldots, N$
A possible error function (the perceptron criterion) could be
$E_P(\mathbf{w}) = -\sum_{n \in \mathcal{M}} \mathbf{w}^T\mathbf{x}_n t_n$
where $\mathcal{M}$ is the set of misclassified points.
Parameter update using stochastic gradient descent (for a misclassified point), as sketched in the code below:
$\mathbf{w}^{new} = \mathbf{w}^{old} + \eta\, \mathbf{x}_n t_n$
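A minimal sketch of this learning rule, assuming numpy, targets coded as ±1, an input matrix X with a leading bias column of 1s, and a fixed learning rate eta (all names illustrative).

```python
import numpy as np

def train_perceptron(X, t, eta=1.0, max_epochs=100):
    """Perceptron learning: for each misclassified point, w <- w + eta * t_n * x_n.
    X has a leading column of 1s (bias); t contains +1/-1 labels."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x_n, t_n in zip(X, t):
            if t_n * (w @ x_n) <= 0:        # misclassified (or on the boundary)
                w = w + eta * t_n * x_n     # perceptron update
                errors += 1
        if errors == 0:                     # converged: all points correctly classified
            break
    return w
```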

Perceptron Update Rule Illustration


Perceptron Convergence
Perceptron convergence theorem
If there exists an exact solution (i.e., if the training data set is linearly separable), then the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of steps.

However, it might require a substantial number of steps to converge.
In practice it is hard to distinguish a non-separable problem from one that is merely slow to converge.


Perceptron Training Through Gradient Descent
Best-fit solutions for non-linearly separable data
Gradient descent to find the weights that best fit the training examples
Least squares training error (with $o_n$ the unthresholded output $\mathbf{w}^T\mathbf{x}_n$):
$E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N} (t_n - o_n)^2$
Weight update equation (see the sketch below):
$\mathbf{w}^{new} = \mathbf{w}^{old} + \eta \sum_{n=1}^{N} (t_n - o_n)\,\mathbf{x}_n$
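A sketch of the batch update above for an unthresholded linear unit; X, t, eta, and the iteration count are illustrative assumptions.

```python
import numpy as np

def gradient_descent(X, t, eta=0.01, n_iters=1000):
    """Batch update: w <- w + eta * sum_n (t_n - o_n) x_n, with o_n = w^T x_n."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        o = X @ w                          # outputs for all training examples
        w = w + eta * (X.T @ (t - o))      # gradient step on E(w) = 1/2 sum (t_n - o_n)^2
    return w
```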

Stochastic Approximation
Practical difficulties with gradient descent
Convergence to a (local) minimum can be slow
Incremental/stochastic gradient descent
Update the weights following the calculation of the error for each individual example
Error for a single example:
$E_n(\mathbf{w}) = \frac{1}{2}(t_n - o_n)^2$
Parameter update (sketched below):
$\mathbf{w}^{new} = \mathbf{w}^{old} + \eta\,(t_n - o_n)\,\mathbf{x}_n$
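For comparison with the batch sketch, the incremental (stochastic) variant of the same rule: one weight update per training example.

```python
import numpy as np

def stochastic_gradient_descent(X, t, eta=0.01, n_epochs=50):
    """Incremental update: w <- w + eta * (t_n - o_n) x_n after each example."""
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for x_n, t_n in zip(X, t):
            o_n = w @ x_n                      # unthresholded output for this example
            w = w + eta * (t_n - o_n) * x_n    # immediate update
    return w
```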


Limitation of Perceptron
Can represent only linearly separable functions
Example: the XOR function
For inputs $x_1, x_2 \in \{0, 1\}$, a perceptron firing when $w_0 + w_1 x_1 + w_2 x_2 > 0$ would need
$w_0 \le 0$ (for input $(0,0)$)
$w_0 + w_2 > 0$ (for input $(0,1)$)
$w_0 + w_1 > 0$ (for input $(1,0)$)
$w_0 + w_1 + w_2 \le 0$ (for input $(1,1)$)
Adding the two middle inequalities gives $2w_0 + w_1 + w_2 > 0$; with $w_0 \le 0$ this forces $w_0 + w_1 + w_2 > 0$, contradicting the last constraint, so no such weights exist.

Multilayer Perceptron
Architecture of the multilayer network
Nodes in the hidden layer
Algorithm to learn the weights of the connections between nodes

[Figure (Mitchell, Fig. 4.5): decision regions of a multilayer feedforward network. The network shown was trained to recognize vowel sounds (e.g., head, hid, had, hod, heed, hud, who'd, hood, hawed, hoard) from two input features F1 and F2.]

Architecture and Hidden Layer (1)


Key difference from the perceptron: a differentiable non-linear activation function
Sigmoid
Tanh
[Figure: a sigmoid unit with inputs x_1, ..., x_n, bias input x_0 = 1, and weights w_0, w_1, ..., w_n.]

$net = \sum_{i=0}^{n} w_i x_i$

$o = \sigma(net) = \dfrac{1}{1 + e^{-net}}$


Architecture and Hidden Layer (2)


[Figure: a two-layer network with inputs x_1, ..., x_d, hidden units z_1, ..., z_m, and outputs y_1, ..., y_K; bias units are included at the input and hidden layers.]

Input: $\mathbf{x}$
Connection weights between the input and hidden layer: $\mathbf{w}_j$ (weights into hidden unit $j$)
Output of the $j$-th hidden layer node: $z_j = \sigma(\mathbf{w}_j^T \mathbf{x})$
Connection weights between the hidden and output layer: $\mathbf{v}_k$ (weights into output unit $k$)
Output of the $k$-th output layer node: $y_k = \sigma(\mathbf{v}_k^T \mathbf{z})$

Forward Pass of MLP


[Figure: the same two-layer network as on the previous slide.]

Given an input $\mathbf{x}$, the output of the neural network is estimated using a forward pass (sketched below):
Hidden layer outputs: $z_j = \sigma(\mathbf{w}_j^T \mathbf{x})$, $j = 1, \ldots, m$
Output layer: $y_k = \sigma(\mathbf{v}_k^T \mathbf{z})$, $k = 1, \ldots, K$
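A minimal numpy sketch of this forward pass for a single hidden layer of sigmoid units; the weight-matrix shapes (W: m x (d+1), V: K x (m+1)) and the convention of prepending a bias 1 are assumptions of the sketch.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W, V):
    """Forward pass: z_j = sigmoid(w_j^T x), y_k = sigmoid(v_k^T z).
    W: m x (d+1) input-to-hidden weights, V: K x (m+1) hidden-to-output weights."""
    x = np.concatenate(([1.0], x))      # prepend bias input x0 = 1
    z = sigmoid(W @ x)                  # hidden layer outputs
    z = np.concatenate(([1.0], z))      # prepend hidden bias z0 = 1
    y = sigmoid(V @ z)                  # output layer
    return z, y
```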


Backpropagation Algorithm (1)


Stochastic gradient descent
Error for the $n$-th data point:
$E_n(\mathbf{w}, \mathbf{v}) = \frac{1}{2}\|\mathbf{t}_n - \mathbf{y}_n\|^2 = \frac{1}{2}\sum_{k=1}^{K}(t_{nk} - y_{nk})^2$
Training rule for the output node weights $v_{jk}$: apply the chain rule to $\partial E_n / \partial v_{jk}$ through the output $y_k$.

Backpropagation Algorithm (2)


Training rule for hidden node weights $w_{ij}$:
$\dfrac{\partial E_n}{\partial w_{ij}} = \sum_{k=1}^{K} \dfrac{\partial E_n}{\partial y_k}\,\dfrac{\partial y_k}{\partial z_j}\,\dfrac{\partial z_j}{\partial w_{ij}}$
Since hidden unit $z_j$ feeds all $K$ output units (through the weights $v_{j1}, \ldots, v_{jK}$), the chain rule sums its contribution to the error over all outputs.

Backpropagation Algorithm (3)


Weight update terms
Output layer weights:
$\Delta v_{jk} = \eta\,(t_k - y_k)\,y_k(1 - y_k)\,z_j$
Hidden layer weights:
$\Delta w_{ij} = \eta\left[\sum_{k=1}^{K}(t_k - y_k)\,y_k(1 - y_k)\,v_{jk}\right] z_j(1 - z_j)\,x_i$
Forward pass: propagate the input $\mathbf{x}_n$ to obtain the output $\mathbf{y}_n$
Backward pass: calculate the error and propagate it backwards (see the sketch below)
$v_{jk}^{new} = v_{jk}^{old} + \Delta v_{jk}$
$w_{ij}^{new} = w_{ij}^{old} + \Delta w_{ij}$
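A sketch of one stochastic backpropagation step implementing the update terms above; it reuses the forward-pass sketch from the earlier slide, and eta and the bias handling are assumptions of the sketch.

```python
import numpy as np

def backprop_update(x, t, W, V, eta=0.1):
    """One stochastic gradient step for the squared error on a single example (x, t)."""
    z, y = forward(x, W, V)                     # z includes the leading bias term z0 = 1
    x = np.concatenate(([1.0], x))              # input with bias
    delta_out = (t - y) * y * (1.0 - y)         # (t_k - y_k) y_k (1 - y_k) for each output
    delta_hid = (V[:, 1:].T @ delta_out) * z[1:] * (1.0 - z[1:])  # backpropagated error
    V += eta * np.outer(delta_out, z)           # Delta v_jk = eta * delta_k * z_j
    W += eta * np.outer(delta_hid, x)           # Delta w_ij = eta * delta_j * x_i
    return W, V
```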

Backpropagation Algorithm (4)


Randomize the order of the training data
Iterate over a number of epochs / until the termination criterion is met
For each input data point:
perform a forward pass
backpropagate the error
Termination criteria:
all training samples are correctly classified
the error between two consecutive epochs does not change significantly
a limit on the number of epochs is reached
(a minimal sketch of this loop is given below)
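A sketch of the overall loop described above, assuming the forward and backprop_update sketches from the previous slides; the error tolerance and epoch limit are illustrative.

```python
import numpy as np

def train(X, T, W, V, eta=0.1, max_epochs=1000, tol=1e-4):
    """Stochastic backpropagation over epochs with simple termination criteria."""
    prev_error = np.inf
    for epoch in range(max_epochs):
        order = np.random.permutation(len(X))        # randomize the order of the training data
        for n in order:
            W, V = backprop_update(X[n], T[n], W, V, eta)
        # total squared error over the training set after this epoch
        error = sum(np.sum((T[n] - forward(X[n], W, V)[1]) ** 2) for n in range(len(X))) / 2
        if abs(prev_error - error) < tol:            # error no longer changes significantly
            break
        prev_error = error
    return W, V
```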

MLP for Classification - K = 2
A sigmoid output layer node models the posterior probability $p(C = 1 \mid \mathbf{x})$
Error function - cross entropy:
$E(\mathbf{w}, \mathbf{v}) = -\left[t \log y + (1 - t)\log(1 - y)\right]$


MLP for Classification - K ≥ 3
Number of nodes in the output layer = K
Softmax function at the output layer:
$y_k = \dfrac{\exp(a_k)}{\sum_{k'}\exp(a_{k'})}$, where $a_k = \mathbf{v}_k^T\mathbf{z}$
Error function (cross entropy):
$E = -\sum_{k=1}^{K} t_k \log y_k$
Weight update equations (sketched below):
$\Delta v_{jk} = \eta\,(t_k - y_k)\,z_j$
$\Delta w_{ij} = \eta \sum_{k=1}^{K}(t_k - y_k)\,v_{jk}\,z_j(1 - z_j)\,x_i$
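A sketch of a softmax output layer with the cross-entropy error; with this pairing the output-layer error term reduces to (t_k − y_k), as in the update above. Shapes and names follow the earlier sketches and are assumptions.

```python
import numpy as np

def softmax(a):
    a = a - np.max(a)                    # subtract the max for numerical stability
    e = np.exp(a)
    return e / np.sum(e)

def forward_softmax(x, W, V):
    """Forward pass with a sigmoid hidden layer and a softmax output layer."""
    x = np.concatenate(([1.0], x))
    z = np.concatenate(([1.0], 1.0 / (1.0 + np.exp(-(W @ x)))))
    y = softmax(V @ z)                   # y_k = exp(a_k) / sum_k' exp(a_k')
    return z, y

def cross_entropy(t, y):
    return -np.sum(t * np.log(y))        # E = -sum_k t_k log y_k (t is one-hot)
```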

Improving Convergence (1)
Gradient descent can be slow to converge
Successive weight updates can lead to large oscillations (stochastic updates)
Use the previous weight update to smooth the trajectory
Introduce a momentum term $\alpha$ (a constant between 0 and 1); see the sketch below:
$\Delta w_{ij}^{(t)} = -\eta\,\dfrac{\partial E}{\partial w_{ij}} + \alpha\,\Delta w_{ij}^{(t-1)}$
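A sketch of the momentum update for one weight matrix; grad stands for the current gradient dE/dW, and the default alpha is an illustrative choice.

```python
import numpy as np

def momentum_step(W, grad, prev_delta, eta=0.1, alpha=0.9):
    """Delta W^(t) = -eta * dE/dW + alpha * Delta W^(t-1)."""
    delta = -eta * grad + alpha * prev_delta
    return W + delta, delta             # return delta so it can be reused at the next step
```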


Improving Convergence (2)
Learning rate $\eta$ (e.g., a value in $(0, 0.3)$)
A low learning rate avoids oscillations, but convergence is slow
Adaptive learning rate (sketched below):
$\eta^{(t)} = \eta^{(t-1)} + \Delta\eta$
$\Delta\eta = +a$ if $E^{(t)} < E^{(t-1)}$, $\;-b\,\eta^{(t-1)}$ otherwise
Increase $\eta$ if the error decreases
Decrease $\eta$ if the error increases
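A sketch of the adaptive learning-rate rule above; the constants a and b are assumed small positive values.

```python
def adapt_learning_rate(eta, error, prev_error, a=0.01, b=0.1):
    """eta^(t) = eta^(t-1) + delta_eta: add a if the error decreased,
    otherwise shrink eta by a factor b."""
    if error < prev_error:
        return eta + a          # error decreased: increase the learning rate
    return eta - b * eta        # error increased: decrease the learning rate
```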


Representational Power of MLP with Non-linear Activation Functions
The XOR problem can now be solved (e.g., with two hidden units), as sketched below
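An illustration (hand-picked weights, not from the slides) that a network with a non-linear hidden layer can represent XOR: with steep sigmoids the two hidden units approximate OR and AND, and the output fires when OR is on but AND is off.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def xor_net(x1, x2):
    # Hidden units (the scale factor 10 makes each sigmoid nearly a step function)
    h1 = sigmoid(10 * (x1 + x2 - 0.5))   # approx OR(x1, x2)
    h2 = sigmoid(10 * (x1 + x2 - 1.5))   # approx AND(x1, x2)
    y = sigmoid(10 * (h1 - h2 - 0.5))    # approx (h1 AND NOT h2) = XOR
    return 1 if y > 0.5 else 0

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, xor_net(*x))                # prints 0, 1, 1, 0
```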


Hidden Layer Representations (1)


Do the hidden layer nodes encode something meaningful?
Autoencoder network
[Figure: a network trained to reproduce its inputs at its outputs; the hidden layer forms a compressed encoding.]

Hidden Layer Representations (2)


[Figures (values plotted over roughly 2500 training iterations): hidden unit encoding for input 01000000; sum of squared errors for each output unit; weights from the inputs to one hidden unit.]

Overfitting (1)
Overtraining could result in overfitting!
[Figures: training set error and test set error plotted against training iterations for two tasks; the training error keeps decreasing while the test error eventually starts to rise.]

Overfitting (2)
A network with $d + 1$ inputs, $K$ outputs and $m + 1$ hidden layer units (the +1s being bias units) has a total of $m(d + 1) + K(m + 1)$ parameters to be learned!
$m$ determines the complexity of the model
A large value of $m$: complex functions that are prone to overfitting
A small value of $m$: simpler functions that underfit the data

Solutions to avoid overfitting (1)


Tuning the number of hidden layer units
Test the learned network on an unseen validation set


Solutions to avoid overfitting (2)


Tuning the number of hidden layer units
Test the learned network on an unseen validation set
Dynamic node creation
Start with an initial-size hidden layer
If the training error is high, add another hidden layer unit
Randomly initialize the weights of the new unit
Continue with backpropagation


Solutions to avoid overfitting (3)


Regularization
Weight decay: penalize models with many non-zero connection weights (see the sketch below)
$E'(\mathbf{w}) = E(\mathbf{w}) + \dfrac{\lambda}{2}\|\mathbf{w}\|^2$
Grouping of connections (weight sharing)
Force a set of connections to have the same weight
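A sketch of how the weight-decay penalty enters the gradient step: the extra lambda*W term shrinks every weight towards zero at each update; lam is an assumed regularization constant.

```python
import numpy as np

def weight_decay_step(W, grad, eta=0.1, lam=1e-3):
    """Gradient step on E'(W) = E(W) + (lambda/2) * ||W||^2,
    whose gradient is dE/dW + lambda * W."""
    return W - eta * (grad + lam * W)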


Convolutional Neural Networks


Three key properties
Local receptive fields
Weight sharing
Subsampling

Successful in Computer Vision tasks

[Figure: a convolutional network stage consisting of an input image, a convolutional layer, and a sub-sampling layer; a toy sketch of these operations follows below.]
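A toy sketch (not from the slides) of the three properties on a small grayscale image: one shared kernel slid over local receptive fields (convolution), followed by 2x2 average subsampling.

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid 2-D convolution: the same (shared) kernel is applied to every
    local receptive field of the image."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def subsample(fmap, size=2):
    """Average pooling over non-overlapping size x size blocks."""
    h, w = (fmap.shape[0] // size) * size, (fmap.shape[1] // size) * size
    f = fmap[:h, :w].reshape(h // size, size, w // size, size)
    return f.mean(axis=(1, 3))
```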

Recurrent Networks
Neural network models applied to time series data
The outputs of the network at time t are also fed as inputs to other units at time t+1


Summary
Perceptron
Linearly separable data
Perceptron criterion

Best fit approximation for non-linearly separable data


Stochastic gradient descent

Multilayer perceptrons
Non-linear activation functions
Hidden layer units
Backpropagation algorithm for training

Overfitting in backpropagation networks


Variants
CNN, RNN

