CSE 5526: Introduction to Neural Networks

Instructor: DeLiang Wang
What is this course about?
AI (artificial intelligence) in the broad sense, in particular learning
The human brain and its amazing abilities, e.g. vision
Human brain
[Figure: lateral view of the human brain, with the lateral fissure and central sulcus labeled]
Brain versus computer
Brain
Computer
Brain-like computation: neural networks (NN or ANN), or neural computation
Discuss syllabus
A single neuron
Real neurons, real synapses
Properties
Action potential (impulse) generation
Impulse propagation
Synaptic transmission & plasticity
Spatial summation
Terminology
Neurons = units = nodes
Synapses = connections (their overall pattern: architecture)
Synaptic weight = connection strength (either positive or negative)
Model of a single neuron
Neuronal model
$u_k = \sum_{j=1}^{m} w_{kj} x_j$
Adder, weighted sum, linear combiner
$v_k = u_k + b_k$
Activation potential; $b_k$: bias
$y_k = \varphi(v_k)$
Output; $\varphi$: activation function
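To make the three stages concrete, here is a minimal sketch of one model neuron in Python/NumPy. The function and variable names, and the use of tanh as the activation $\varphi$, are illustrative choices, not from the slides.

import numpy as np

def neuron_output(x, w, b, phi=np.tanh):
    """One model neuron: weighted sum, add bias, apply activation.

    x: input vector (m,); w: weight vector (m,); b: scalar bias.
    phi is the activation function (tanh is an illustrative choice).
    """
    u = np.dot(w, x)   # adder / linear combiner: u_k = sum_j w_kj * x_j
    v = u + b          # activation potential: v_k = u_k + b_k
    return phi(v)      # output: y_k = phi(v_k)

# Example with m = 3 inputs
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.2, 0.4, -0.1])
print(neuron_output(x, w, b=0.1))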
Another way of including bias
Set $x_0 = +1$ and $w_{k0} = b_k$
So we have
$v_k = \sum_{j=0}^{m} w_{kj} x_j$
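A quick sketch of the same trick in NumPy (the variable names are mine): prepend the constant $+1$ to the input and the bias to the weight vector, and a single dot product recovers $v_k$.

import numpy as np

x = np.array([0.5, -1.0, 2.0])
w = np.array([0.2, 0.4, -0.1])
b = 0.1

x_aug = np.concatenate(([1.0], x))   # x_0 = +1
w_aug = np.concatenate(([b], w))     # w_0 = b_k
assert np.isclose(np.dot(w_aug, x_aug), np.dot(w, x) + b)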
McCulloch-Pitts model
$x_i \in \{-1, +1\}$ (bipolar input)
$y = \varphi\left(\sum_{i=1}^{m} w_i x_i + b\right)$
$\varphi(v) = \begin{cases} 1 & \text{if } v \ge 0 \\ -1 & \text{if } v < 0 \end{cases}$
A form of signum (sign) function
Note difference from textbook; also number of neurons in the brain
McCulloch-Pitts model (cont.)
Example: logic gates (see blackboard; a sketch in code follows below)

McCulloch-Pitts networks (introduced in 1943) are the first class of abstract computing machines: finite-state automata
Finite-state automata can compute any logic (Boolean) function
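As a hedged illustration of such gates (the weight and bias values below are my own choices, not from the slides): with bipolar inputs, one M-P neuron computes AND with $w = (1, 1)$, $b = -1$, and OR with $w = (1, 1)$, $b = +1$.

import numpy as np

def mp_neuron(x, w, b):
    """McCulloch-Pitts neuron with bipolar (signum) output."""
    v = np.dot(w, x) + b
    return 1 if v >= 0 else -1

for x in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
    # AND: fires (+1) only when both inputs are +1
    y_and = mp_neuron(np.array(x), np.array([1, 1]), b=-1)
    # OR: fires (+1) when at least one input is +1
    y_or = mp_neuron(np.array(x), np.array([1, 1]), b=1)
    print(x, "AND:", y_and, "OR:", y_or)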
Network architecture
View an NN as a connected, directed graph, which defines its architecture
Feedforward nets: loop-free graph
Recurrent nets: with loops
Feedforward net
Since the input layer consists of source nodes, it is typically not counted when we talk about the number of layers in a feedforward net
For example, a 10-4-2 architecture counts as a two-layer net (see the shape sketch below)
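A minimal sketch of what "10-4-2" means in terms of weight shapes (the layer sizes come from the slide; the random initialization, tanh activation, and omission of biases are illustrative simplifications):

import numpy as np

rng = np.random.default_rng(0)

# 10-4-2: 10 source nodes (not counted), then two layers of neurons
W1 = rng.standard_normal((4, 10))  # first (hidden) layer: 4 neurons, 10 inputs each
W2 = rng.standard_normal((2, 4))   # second (output) layer: 2 neurons, 4 inputs each

x = rng.standard_normal(10)        # one input pattern
h = np.tanh(W1 @ x)                # hidden activations (biases omitted for brevity)
y = np.tanh(W2 @ h)                # network output
print(y.shape)                     # (2,)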
A one-layer recurrent net
In this net, the input typically sets the initial condition of the output layer
Network components
Three components characterize a neural net
Architecture
Activation function
Learning rule (algorithm)


Perceptrons
Architecture: one-layer feedforward net
Without loss of generality, consider a single-neuron perceptron








[Figure: one-layer perceptron with inputs x_1, x_2, ..., x_m and outputs y_1, y_2]
Definition
$y = \varphi(v)$
$\varphi(v) = \begin{cases} 1 & \text{if } v \ge 0 \\ -1 & \text{otherwise} \end{cases}$
$v = \sum_{i=1}^{m} w_i x_i + b$
Hence a McCulloch-Pitts neuron, but with real-valued inputs
Pattern recognition
With a bipolar output, the perceptron performs 2-class classification
Apples vs. oranges
How do we learn to perform a classification task?
Task: the perceptron is given pairs of input $\mathbf{x}_p$ and desired output $d_p$. How do we find $\mathbf{w}$ (with $b$ incorporated) so that $y_p = d_p$ for all $p$?
Decision boundary
The decision boundary for a given $\mathbf{w}$:
$g(\mathbf{x}) = \sum_{i=1}^{m} w_i x_i + b = 0$
$g$ is also called the discriminant function for the perceptron, and it is a linear function of $\mathbf{x}$. Hence it is a linear discriminant function
Example
See blackboard (a worked example is sketched below)
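As an illustrative worked case (my numbers, not necessarily the blackboard's), take $m = 2$, $\mathbf{w} = (1, 1)^T$, and $b = -1$. Then

$g(\mathbf{x}) = x_1 + x_2 - 1 = 0$

is the line $x_2 = 1 - x_1$. The input $(1, 1)$ gives $g = 1 > 0$, hence $y = +1$; the origin gives $g = -1 < 0$, hence $y = -1$. The line is perpendicular to $\mathbf{w}$, as the next slide states in general.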
Decision boundary (cont.)
For an m-dimensional input space, the decision boundary is an (m − 1)-dimensional hyperplane perpendicular to $\mathbf{w}$. The hyperplane separates the input space into two halves, with one half having $y = +1$ and the other half having $y = -1$
When $b = 0$, the hyperplane goes through the origin


[Figure: a 2-D input space (axes x_1, x_2) with the line g = 0 perpendicular to w; x markers (y = +1) on one side, o markers (y = -1) on the other]
Linear separability
For a set of input patterns $\mathbf{x}_p$, if there exists one $\mathbf{w}$ that separates $d = 1$ patterns from $d = -1$ patterns, then the classification problem is linearly separable
In other words, there exists a linear discriminant function that produces no classification error
Examples: AND, OR, XOR (see blackboard and the sketch below)
A very important concept
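To make the XOR case concrete, here is a short argument (standard, though this particular write-up is mine and not from the slides). With bipolar inputs, XOR requires $d = +1$ for $(1,-1)$ and $(-1,1)$, and $d = -1$ for $(1,1)$ and $(-1,-1)$. A separating $(\mathbf{w}, b)$ would need

$w_1 - w_2 + b \ge 0$ and $-w_1 + w_2 + b \ge 0$, which sum to $2b \ge 0$
$w_1 + w_2 + b < 0$ and $-w_1 - w_2 + b < 0$, which sum to $2b < 0$

The two conclusions contradict each other, so no linear discriminant function exists for XOR. AND and OR, in contrast, are separable, e.g. by $\mathbf{w} = (1,1)^T$ with $b = -1$ and $b = +1$ respectively, as in the gate sketch earlier.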

Linear separability: a more general illustration
Perceptron learning rule
Strengthen an active synapse if the postsynaptic neuron fails to fire when it should have fired; weaken an active synapse if the neuron fires when it should not have fired
Formulated by Rosenblatt based on biological intuition

Quantitatively
$\mathbf{w}(n+1) = \mathbf{w}(n) + \Delta\mathbf{w}(n)$
$n$: iteration number
$\eta$: step size or learning rate
$\Delta\mathbf{w}(n) = \eta\,[d(n) - y(n)]\,\mathbf{x}(n)$
In vector form
$\Delta\mathbf{w} = \eta\,[d - y]\,\mathbf{x}$
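A minimal sketch of the rule as a training loop (the function name and the toy data are my own; the bias is incorporated as a leading +1 column, as on the earlier slide):

import numpy as np

def train_perceptron(X, d, eta=0.5, max_epochs=100):
    """Perceptron learning: w(n+1) = w(n) + eta * [d(n) - y(n)] * x(n).

    X: patterns with a leading +1 column (bias incorporated).
    d: desired outputs in {-1, +1}.
    """
    w = np.zeros(X.shape[1])              # w(0) = 0
    for _ in range(max_epochs):
        errors = 0
        for x, target in zip(X, d):
            y = 1 if np.dot(w, x) >= 0 else -1
            if y != target:               # update only on misclassified patterns
                w += eta * (target - y) * x
                errors += 1
        if errors == 0:                   # all patterns correct: done
            return w
    return w

# Toy linearly separable problem: AND with bipolar values
X = np.array([[1, -1, -1], [1, -1, 1], [1, 1, -1], [1, 1, 1]])
d = np.array([-1, -1, -1, 1])
print(train_perceptron(X, d))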
Geometric interpretation
Assume $\eta = 1/2$
[Figure, built up step by step over several slides: a 2-D input space (axes x_1, x_2) with x markers (d = +1) and o markers (d = -1); the weight vector is updated from w(0) through w(1) and w(2) to w(3)]
Geometric interpretation
Each weight update moves w closer to the d = +1 patterns, or away from the d = -1 patterns. w(3) solves the classification problem
[Figure: the final weight vector w(3) correctly separating the x markers (d = +1) from the o markers (d = -1)]
Perceptron convergence theorem
Theorem: If a classification problem is linearly separable, a perceptron will reach a solution in a finite number of iterations
[Proof]
Given a finite number of training patterns, because of linear separability, there exists a weight vector $\mathbf{w}_o$ so that
$\mathbf{w}_o^T \mathbf{x}_p d_p > 0 \quad \text{for all } p \quad (1)$
where
$\gamma = \min_p\,(\mathbf{w}_o^T \mathbf{x}_p d_p)$
Proof (cont.)
We assume that the initial weights are all zero. Let $N_p$ denote the number of times $\mathbf{x}_p$ has been used for actually updating the weight vector at some point in learning
At that time:
$\mathbf{w} = \eta \sum_p N_p\,[d_p - y(\mathbf{x}_p)]\,\mathbf{x}_p = 2\eta \sum_p N_p d_p \mathbf{x}_p$
(since each update occurs on a misclassified pattern, $d_p - y(\mathbf{x}_p) = 2 d_p$)
Proof (cont.)
Consider first
$\mathbf{w}_o^T \mathbf{w} = 2\eta \sum_p N_p\, \mathbf{w}_o^T \mathbf{x}_p d_p \ge 2\eta\gamma \sum_p N_p = 2\eta\gamma P \quad (2)$
where
$P = \sum_p N_p$
Proof (cont.)
Now consider the change in squared length after a single update by $\mathbf{x}$:
$\|\mathbf{w} + \Delta\mathbf{w}\|^2 = \|\mathbf{w}\|^2 + 4\eta d\,\mathbf{w}^T \mathbf{x} + 4\eta^2 \|\mathbf{x}\|^2 \le \|\mathbf{w}\|^2 + 4\eta^2 \beta^2$
since upon an update, $d\,(\mathbf{w}^T \mathbf{x}) \le 0$
where
$\beta^2 = \max_p \|\mathbf{x}_p\|^2$
Proof (cont.)
By summing over the P updates we have the bound:
$\|\mathbf{w}\|^2 \le 4\eta^2 \beta^2 P \quad (3)$
Now square the cosine of the angle between $\mathbf{w}_o$ and $\mathbf{w}$; we have by (2) and (3)
$1 \ge \cos^2\theta = \frac{(\mathbf{w}_o^T \mathbf{w})^2}{\|\mathbf{w}_o\|^2 \|\mathbf{w}\|^2} \ge \frac{4\eta^2\gamma^2 P^2}{\|\mathbf{w}_o\|^2 \cdot 4\eta^2\beta^2 P} = \frac{\gamma^2 P}{\beta^2 \|\mathbf{w}_o\|^2}$
(the first inequality is the Cauchy-Schwarz inequality)
Proof (cont.)
Thus $P \le \beta^2 \|\mathbf{w}_o\|^2 / \gamma^2$: P must be finite to satisfy the above inequality. This completes the proof

Remarks
In the case of w(0) = 0, the learning rate has no effect on the proof. That is, the theorem holds no matter what $\eta$ ($\eta > 0$) is
The solution weight vector is not unique

Generalization
Performance of a learning machine on test patterns not used during training
Example: class 1: handwritten "m"; class 2: handwritten "n"
See blackboard
Perceptrons generalize by deriving a decision boundary in the input space. Selection of training patterns is thus important for generalization
