$u_k = \sum_{j=1}^{m} w_{kj} x_j$    Adder, weighted sum, linear combiner
$v_k = u_k + b_k$    Activation potential; $b_k$: bias
$y_k = \varphi(v_k)$    Output; $\varphi$: activation function
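A minimal sketch of these three steps in Python; the example input, weights, bias, and the tanh activation below are illustrative assumptions, not values from the slides:

```python
import numpy as np

def neuron_output(x, w, b, phi=np.tanh):
    """Single neuron: weighted sum, add bias, apply activation."""
    u = np.dot(w, x)      # u_k = sum_j w_kj * x_j   (adder / linear combiner)
    v = u + b             # v_k = u_k + b_k          (activation potential)
    return phi(v)         # y_k = phi(v_k)           (output)

# Illustrative numbers only
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.2])
print(neuron_output(x, w, b=0.3))
```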
Another way of including bias
Set $x_0 = +1$ and $w_{k0} = b_k$
So we have
$v_k = \sum_{j=0}^{m} w_{kj} x_j$
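A quick numeric check of this trick (same illustrative numbers as the sketch above):

```python
import numpy as np

x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.2])
b = 0.3

# Augment the input with x_0 = +1 so the bias becomes the weight w_0 = b
x_aug = np.concatenate(([1.0], x))
w_aug = np.concatenate(([b], w))

assert np.isclose(np.dot(w, x) + b, np.dot(w_aug, x_aug))
```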
McCulloch-Pitts model
$x_i \in \{-1, +1\}$    Bipolar input
$y = \varphi\left(\sum_{i=1}^{m} w_i x_i + b\right)$
$\varphi(v) = +1$ if $v \ge 0$; $-1$ if $v < 0$    A form of signum (sign) function
Note difference from textbook; also number of neurons in the brain
McCulloch-Pitts model (cont.)
Example logic gates (see blackboard; a small sketch follows below)
McCulloch-Pitts networks (introduced in 1943) are the first
class of abstract computing machines: finite-state automata
Finite-state automata can compute any logic (Boolean) function
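A minimal sketch of two such gates with bipolar inputs; the specific weights and biases are illustrative choices, not necessarily the blackboard example:

```python
import numpy as np

def mp_neuron(x, w, b):
    """McCulloch-Pitts neuron: signum of the weighted sum plus bias."""
    v = np.dot(w, x) + b
    return 1 if v >= 0 else -1

# Bipolar AND fires only when both inputs are +1; bipolar OR fires if at least one is +1
for x1 in (-1, 1):
    for x2 in (-1, 1):
        x = np.array([x1, x2])
        AND = mp_neuron(x, np.array([1, 1]), b=-1)
        OR  = mp_neuron(x, np.array([1, 1]), b=+1)
        print(x1, x2, "AND:", AND, "OR:", OR)
```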
Network architecture
View an NN as a connected, directed graph, which defines
its architecture
Feedforward nets: loop-free graph
Recurrent nets: with loops
Feedforward net
Since the input layer
consists of source nodes,
it is typically not
counted when we talk
about the number of
layers in a feedforward
net
For example, the
architecture of 10-4-2
counts as a two-layer net
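A minimal sketch of a 10-4-2 net; the random weights and the tanh activations are illustrative assumptions (the slides do not specify them):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 10)), np.zeros(4)   # hidden layer: 10 -> 4
W2, b2 = rng.standard_normal((2, 4)),  np.zeros(2)   # output layer: 4 -> 2

def forward(x):
    h = np.tanh(W1 @ x + b1)      # first computation layer
    return np.tanh(W2 @ h + b2)   # second computation layer

# The 10 source nodes of the input layer are not counted as a layer
print(forward(rng.standard_normal(10)))
```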
A one-layer recurrent net
In this net, the input
typically sets the initial
condition of the output
layer
Network components
Three components characterize a neural net
Architecture
Activation function
Learning rule (algorithm)
CSE 5526: Introduction to Neural Networks
Perceptrons
Perceptrons
Architecture: one-layer feedforward net
Without loss of generality, consider a single-neuron perceptron
[Figure: a one-layer perceptron with inputs $x_1, x_2, \ldots, x_m$ and outputs $y_1, y_2$]
Definition
$y = \varphi(v)$
$v = \sum_{i=1}^{m} w_i x_i + b$
$\varphi(v) = +1$ if $v \ge 0$; $-1$ otherwise
Hence a McCulloch-Pitts neuron, but with real-valued inputs
Pattern recognition
With a bipolar output, the perceptron performs 2-class classification
Apples vs. oranges
How do we learn to perform a classification task?
Task: the perceptron is given pairs of input $x_p$ and desired output $d_p$. How do we find $w$ (with $b$ incorporated) so that $y_p = d_p$ for all $p$?
Decision boundary
The decision boundary for a given $w$:
$g(\mathbf{x}) = \sum_{i=1}^{m} w_i x_i + b = 0$
$g$ is also called the discriminant function for the perceptron, and it is a linear function of $\mathbf{x}$. Hence it is a linear discriminant function
Example
See blackboard
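As one illustrative instance (not necessarily the blackboard example): with $m = 2$, $w = (1, 1)$, and $b = -1$, the boundary $g(\mathbf{x}) = x_1 + x_2 - 1 = 0$ is a line. The point $(1, 1)$ gives $g = 1 > 0$, so $y = +1$, while $(0, 0)$ gives $g = -1 < 0$, so $y = -1$.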
Decision boundary (cont.)
For an m-dimensional input space, the decision boundary is an (m − 1)-dimensional hyperplane perpendicular to $w$. The hyperplane separates the input space into two halves, with one half having $y = +1$ and the other half having $y = -1$
When $b = 0$, the hyperplane goes through the origin
[Figure: a 2-D example in the $(x_1, x_2)$ plane; the line $g = 0$, with normal vector $w$, separates the 'x' points ($y = +1$) from the 'o' points ($y = -1$)]
Linear separability
For a set of input patterns $x_p$, if there exists one $w$ that separates the $d = +1$ patterns from the $d = -1$ patterns, then the classification problem is linearly separable
In other words, there exists a linear discriminant function that produces no classification error
Examples: AND, OR, XOR (see blackboard; a brute-force check is sketched below)
A very important concept
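A minimal sketch of such a check for bipolar AND and XOR, scanning a small grid of candidate weights (a brute-force illustration, not the blackboard argument):

```python
import numpy as np
from itertools import product

def separable(patterns, targets, grid=np.linspace(-2, 2, 9)):
    """Return True if some (w1, w2, b) on the grid classifies every pattern correctly."""
    for w1, w2, b in product(grid, repeat=3):
        y = np.sign(patterns @ np.array([w1, w2]) + b)
        y[y == 0] = 1                      # treat v = 0 as +1, as in the definition
        if np.array_equal(y, targets):
            return True
    return False

X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]])
AND = np.array([-1, -1, -1, 1])
XOR = np.array([-1, 1, 1, -1])
print("AND separable:", separable(X, AND))   # expected: True
print("XOR separable:", separable(X, XOR))   # expected: False
```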
Linear separability: a more general illustration
Perceptron learning rule
Strengthen an active synapse if the postsynaptic neuron fails
to fire when it should have fired; weaken an active synapse if
the neuron fires when it should not have fired
Formulated by Rosenblatt based on biological intuition
Quantitatively
$w(n+1) = w(n) + \Delta w(n)$
$n$: iteration number
$\eta$: step size or learning rate
$\Delta w(n) = \eta\,[d(n) - y(n)]\,x(n)$
In vector form: $\Delta \mathbf{w} = \eta\,[d - y]\,\mathbf{x}$
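A minimal sketch of this rule as a training loop in Python; the bipolar AND data, learning rate, and epoch cap are illustrative assumptions:

```python
import numpy as np

def train_perceptron(X, d, eta=0.5, epochs=100):
    """Perceptron learning: w(n+1) = w(n) + eta * [d(n) - y(n)] * x(n)."""
    X = np.hstack([X, np.ones((len(X), 1))])   # absorb the bias: x_0 = +1
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        errors = 0
        for x, target in zip(X, d):
            y = 1 if np.dot(w, x) >= 0 else -1
            if y != target:
                w += eta * (target - y) * x
                errors += 1
        if errors == 0:                         # converged: all patterns correct
            break
    return w

# Toy linearly separable problem (illustrative): bipolar AND
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]])
d = np.array([-1, -1, -1, 1])
print(train_perceptron(X, d))
```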
Geometric interpretation
Assume $\eta = 1/2$
[Figure, built up over several slides: points in the $(x_1, x_2)$ plane, with 'x' marking $d = +1$ patterns and 'o' marking $d = -1$ patterns; starting from $w(0)$, successive updates produce $w(1)$, $w(2)$, and $w(3)$]
Each weight update moves $w$ closer to the $d = +1$ patterns, or away from the $d = -1$ patterns. $w(3)$ solves the classification problem
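Concretely, with $\eta = 1/2$ an update on a misclassified pattern adds $\eta\,(d - y)\,\mathbf{x} = d\,\mathbf{x}$ to $\mathbf{w}$ (since $y = -d$ at an update): $+\mathbf{x}$ for a $d = +1$ pattern and $-\mathbf{x}$ for a $d = -1$ pattern, which is exactly the motion shown in the figure.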
Perceptron convergence theorem
Theorem: If a classification problem is linearly separable, a
perceptron will reach a solution in a finite number of
iterations
[Proof]
Given a finite number of training patterns, because of linear separability, there exists a weight vector $\mathbf{w}_o$ so that
$d_p\, \mathbf{w}_o^T \mathbf{x}_p > 0$    (1)
where
$\alpha = \min_p \left( d_p\, \mathbf{w}_o^T \mathbf{x}_p \right)$
Proof (cont.)
We assume that the initial weights are all zero. Let $N_p$ denote the number of times $\mathbf{x}_p$ has been used for actually updating the weight vector at some point in learning
At that time:
$\mathbf{w} = \sum_p N_p\, \eta\,[d_p - y_p]\, \mathbf{x}_p = 2\eta \sum_p N_p\, d_p\, \mathbf{x}_p$
(at each such update $y_p = -d_p$, so $d_p - y_p = 2 d_p$)
Proof (cont.)
Consider first $\mathbf{w}_o^T \mathbf{w}$:
$\mathbf{w}_o^T \mathbf{w} = 2\eta \sum_p N_p\, d_p\, \mathbf{w}_o^T \mathbf{x}_p \ge 2\eta\, \alpha \sum_p N_p = 2\eta\, \alpha\, P$    (2)
where
$P = \sum_p N_p$
Proof (cont.)
Now consider the change in squared length after a single update by $\mathbf{x}$:
$\Delta \|\mathbf{w}\|^2 = \|\mathbf{w} + 2\eta\, d\, \mathbf{x}\|^2 - \|\mathbf{w}\|^2 = 4\eta\, d\, (\mathbf{w}^T \mathbf{x}) + 4\eta^2 \|\mathbf{x}\|^2 \le 4\eta^2 \beta^2$
since upon an update, $d\,(\mathbf{w}^T \mathbf{x}) \le 0$
where $\beta^2 = \max_p \|\mathbf{x}_p\|^2$
Proof (cont.)
By summing for $P$ steps we have the bound:
$\|\mathbf{w}\|^2 \le 4\eta^2 \beta^2 P$    (3)
Now square the cosine of the angle between $\mathbf{w}_o$ and $\mathbf{w}$; by (2) and (3) we have
$1 \ge \dfrac{(\mathbf{w}_o^T \mathbf{w})^2}{\|\mathbf{w}_o\|^2\, \|\mathbf{w}\|^2} \ge \dfrac{(2\eta\, \alpha\, P)^2}{\|\mathbf{w}_o\|^2 \cdot 4\eta^2 \beta^2 P} = \dfrac{\alpha^2\, P}{\beta^2\, \|\mathbf{w}_o\|^2}$
(the first inequality follows from the Cauchy-Schwarz inequality)
Proof (cont.)
Thus, $P$ must be finite to satisfy the above inequality: $P \le \beta^2 \|\mathbf{w}_o\|^2 / \alpha^2$. This completes the proof
Remarks
In the case of $w(0) = 0$, the learning rate has no effect on the proof. That is, the theorem holds no matter what $\eta$ ($\eta > 0$) is
The solution weight vector is not unique
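A minimal numerical check of the flavor of this result; the random separable data, the choice of $\eta$, and using the generating vector as $\mathbf{w}_o$ are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
w_o = rng.standard_normal(3)                      # a separating weight vector (bias absorbed)
X = np.hstack([rng.standard_normal((50, 2)), np.ones((50, 1))])
d = np.sign(X @ w_o); d[d == 0] = 1               # labels are separable by construction

eta, w, P = 0.5, np.zeros(3), 0                   # P counts the total number of updates
changed = True
while changed:
    changed = False
    for x, t in zip(X, d):
        y = 1 if np.dot(w, x) >= 0 else -1
        if y != t:
            w += eta * (t - y) * x
            P += 1
            changed = True

alpha = np.min(d * (X @ w_o))                     # margin with respect to w_o
beta2 = np.max(np.sum(X**2, axis=1))              # largest squared pattern length
print("updates:", P, "bound:", beta2 * np.dot(w_o, w_o) / alpha**2)
```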
Generalization
Performance of a learning machine on test patterns not used
during training
Example: class 1: handwritten 'm'; class 2: handwritten 'n'
See blackboard
Perceptrons generalize by deriving a decision boundary in
the input space. Selection of training patterns is thus
important for generalization
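A minimal sketch of measuring generalization on held-out test patterns; it reuses the hypothetical train_perceptron helper from the earlier sketch, and the data and split are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
w_true = np.array([1.0, -2.0, 0.5])
X = np.hstack([rng.standard_normal((100, 2)), np.ones((100, 1))])
d = np.sign(X @ w_true); d[d == 0] = 1            # separable labels by construction

train, test = (X[:80], d[:80]), (X[80:], d[80:])  # test patterns held out from training
w = train_perceptron(train[0][:, :2], train[1])   # helper from the earlier sketch (assumed defined)
y_test = np.sign(test[0] @ w); y_test[y_test == 0] = 1
print("test accuracy:", np.mean(y_test == test[1]))
```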