
1

Introduction to
Predictive Learning
Electrical and Computer Engineering
LECTURE SET 6
Neural Network Learning
2
OUTLINE
Objectives
- introduce biologically inspired NN learning methods for
clustering, regression and classification
- explain similarities and differences between statistical and NN
methods
- show examples using synthetic and real-life data
Brief history and motivation for artificial
neural networks
Sequential estimation of model parameters
Methods for supervised learning
Methods for unsupervised learning
Summary and discussion
3
Brief history and motivation for ANN
Huge interest in understanding the nature and
mechanism of biological/ human learning
Biologists + psychologists do not adopt classical
parametric statistical learning, because:
- parametric modeling is not biologically plausible
- biological info processing is clearly different from
algorithmic models of computation
Mid 1980s: growing interest in applying biologically
inspired computational models to:
- developing computer models (of human brain)
- various engineering applications
New field Artificial Neural Networks (~1986-1987)
ANNs represent nonlinear estimators implementing
the ERM approach (usually squared-loss function)

4
History and motivation (contd)
Relationship to the problem of inductive learning:






The same learning problem setting
Neural-style learning algorithm:
- on-line (flow through)
- simple processing
Biological terminology


[Figure: the inductive learning setting - Generator of samples → x → System → y; the Learning Machine observes (x, y) and produces an estimate ŷ]
[Figure: a synapse with weight w, input x and output y; Hebbian Rule: Δw ~ xy]
5
Neural vs Algorithmic computation
Biological systems do not use principles of
digital circuits
                  Digital         Biological
Connectivity      1~10            ~10,000
Signal            digital         analog
Timing            synchronous     asynchronous
Signal propag.    feedforward     feedback
Redundancy        no              yes
Parallel proc.    no              yes
Learning          no              yes
Noise tolerance   no              yes


6
Neural vs Algorithmic computation
Computers excel at algorithmic tasks (well-
posed mathematical problems)
Biological systems are superior to digital
systems for ill-posed problems with noisy data
Example: object recognition [Hopfield, 1987]
PIGEON: ~10^9 neurons, cycle time ~0.1 sec,
each neuron sends 2 bits to ~1K other neurons
→ ~2x10^13 bit operations per sec
OLD PC: ~10^7 gates, cycle time 10^-7 sec, connectivity = 2
→ ~2x10^14 bit operations per sec
Both have similar raw processing capability, but pigeons
are better at recognition tasks
7
Neural terminology and artificial neurons
Some general descriptions of ANNs:
http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html
http://en.wikipedia.org/wiki/Neural_network
McCulloch-Pitts neuron (1943)





Threshold (indicator) function of weighted sum of inputs
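As a rough illustration (not from the original slides), a McCulloch-Pitts unit can be written in a few lines of Python; the function name and the AND example are my own choices:

```python
import numpy as np

def mcculloch_pitts(x, w, threshold):
    """Threshold (indicator) unit: output 1 if the weighted sum of inputs
    reaches the threshold, otherwise 0."""
    return 1 if np.dot(w, x) >= threshold else 0

# With unit weights and threshold 2, a 2-input unit implements logical AND
print(mcculloch_pitts([1, 1], [1, 1], 2))  # 1
print(mcculloch_pitts([1, 0], [1, 1], 2))  # 0
```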

8
Goals of ANNs

Develop models of computation inspired by
biological systems
Study computational capabilities of networks
of interconnected neurons
Apply these models to real-life applications

Learning in NNs = modification (adaptation) of
synaptic connections (weights) in response to
external inputs


9
Historical highlights of ANN
1943 McCulloch-Pitts neuron
1949 Hebbian learning
1960s Rosenblatt (perceptron), Widrow
60s-70s dominance of hard AI
1980s resurgence of interest (PDP
group, MLP etc.)
1990s connection to statistics/VC-theory
2000s mature field/ unnecessary
fragmentation

10
OUTLINE
Objectives
Brief history and motivation for
artificial neural networks
Sequential estimation of model
parameters
Methods for supervised learning
Methods for unsupervised learning
Summary and Discussion
11
Sequential estimation of model parameters
Batch vs on-line (iterative) learning
- Algorithmic (statistical) approaches ~ batch
- Neural-network inspired methods ~ on-line
BUT the difference is only on the implementation level (so
both types of learning methods should yield similar
generalization performance)
Recall ERM inductive principle (for regression):


Assume dictionary parameterization with fixed basis fcts


    R_{emp}(w) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(x_i, w)) = \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i, w))^2

    \hat{y} = f(x, w) = \sum_{j=1}^{m} w_j g_j(x)
12
Sequential (on-line) least squares minimization
Training pairs (x(k), y(k)) presented sequentially
On-line update equations for minimizing
empirical risk (MSE) wrt parameters w are:

    w_j(k+1) = w_j(k) - \gamma_k \frac{\partial}{\partial w_j} L(x(k), y(k), w(k))

(gradient descent learning)
where the gradient is computed via the chain rule:

    \frac{\partial L(x, y, w)}{\partial w_j} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w_j} = 2 (\hat{y} - y) g_j(x)

the learning rate \gamma_k is a small positive value
(decreasing with k)
13
On-line least-squares minimization algorithm
Known as delta-rule (Widrow and Hoff, 1960):
Given initial parameter estimates w(0), update
parameters during each presentation of the k-th
training sample (x(k), y(k)):
Step 1: forward pass computation

    z_j(k) = g_j(x(k)),   j = 1, ..., m
    \hat{y}(k) = \sum_{j=1}^{m} w_j(k) z_j(k)   - estimated output

Step 2: backward pass computation

    \delta(k) = \hat{y}(k) - y(k)   - error term (delta)
    w_j(k+1) = w_j(k) - \gamma_k \delta(k) z_j(k),   j = 1, ..., m
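A minimal sketch of the delta rule in Python (my own illustration, assuming fixed basis functions supplied by the caller and a simple decreasing learning rate schedule; the polynomial basis and the noisy sine data are hypothetical):

```python
import numpy as np

def delta_rule(x_data, y_data, basis, m, lr=0.1, n_epochs=50):
    """On-line (sequential) least-squares learning of w in f(x,w) = sum_j w_j g_j(x)."""
    w = np.zeros(m)                      # initial parameter estimates w(0)
    for epoch in range(n_epochs):
        gamma = lr / (1 + epoch)         # learning rate decreasing over presentations
        for x, y in zip(x_data, y_data):
            z = basis(x)                 # forward pass: z_j = g_j(x)
            y_hat = w @ z                #               y_hat = sum_j w_j z_j
            delta = y_hat - y            # backward pass: error term
            w -= gamma * delta * z       # w_j <- w_j - gamma * delta * z_j
    return w

# Example with polynomial basis functions g_j(x) = x^j (hypothetical data)
def poly(x, m=4):
    return np.array([x ** j for j in range(m)])

x_train = np.linspace(0, 1, 30)
y_train = np.sin(2 * np.pi * x_train) + 0.1 * np.random.randn(30)
w_hat = delta_rule(x_train, y_train, poly, m=4)
```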
14
Neural network interpretation of delta rule
Forward pass:
    \hat{y}(k) = w_0(k) + \sum_{j=1}^{m} w_j(k) z_j(k)
Backward pass:
    \delta(k) = \hat{y}(k) - y(k)
    \Delta w_j(k) = -\gamma_k \delta(k) z_j(k)
    w_j(k+1) = w_j(k) + \Delta w_j(k)
Biological learning
[Figure: a synapse with weight w, input x and output y]
    Hebbian Rule: \Delta w \sim x y
15
Theoretical basis for on-line learning
Standard inductive learning: given training
data z_1, ..., z_n find the model providing min of
prediction risk

    R(\omega) = \int L(z, \omega) p(z) dz

Stochastic Approximation guarantees
minimization of risk (asymptotically):

    \omega(k+1) = \omega(k) - \gamma_k \, \mathrm{grad}_\omega L(z_k, \omega(k))

under general conditions
on the learning rate:

    \lim_{k \to \infty} \gamma_k = 0, \qquad \sum_{k=1}^{\infty} \gamma_k = \infty, \qquad \sum_{k=1}^{\infty} \gamma_k^2 < \infty
16
Practical issues for on-line learning
Given a finite training set of n samples z_1, ..., z_n:
this set is presented sequentially to a learning algorithm
many times. Each presentation of the n samples is called
an epoch, and the process of repeated presentations is
called recycling (of the training data)

Learning rate schedule: initially set large, then slowly
decreasing with k (iteration number). Typically good
learning rate schedules are data-dependent.
Stopping conditions:
(1) monitor the gradient (i.e., stop when the gradient
falls below some small threshold)
(2) early stopping can be used for complexity control



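A possible way to organize these practical choices in code (a sketch under my own assumptions: a linear-in-parameters model, a simple 1/(1+epoch) learning rate schedule, and a gradient-norm stopping threshold):

```python
import numpy as np

def train_epochs(x_data, y_data, basis, m, lr0=0.5, max_epochs=200, tol=1e-4):
    """Recycle a finite training set over many epochs, decreasing the
    learning rate and stopping when the average gradient becomes small."""
    w = np.zeros(m)
    for epoch in range(max_epochs):
        gamma = lr0 / (1 + epoch)                    # decreasing learning rate schedule
        grad_norm = 0.0
        for x, y in zip(x_data, y_data):             # one epoch = one pass over data
            z = basis(x)
            grad = 2 * (w @ z - y) * z               # gradient of squared loss
            w -= gamma * grad
            grad_norm += np.linalg.norm(grad)
        if grad_norm / len(x_data) < tol:            # stopping condition (1): small gradient
            break
    return w
```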
17
OUTLINE
Objectives
Brief history and motivation for artificial
neural networks
Sequential estimation of model parameters
Methods for supervised learning
- MultiLayer Perceptron (MLP) networks
- Radial Basis Function (RBF) Networks
Methods for unsupervised learning
Summary and discussion
18
Multilayer Perceptrons (MLP)
Recall graphical NN
representation for
dictionary methods:
where
[Diagram: MLP network with inputs x_1, ..., x_d, hidden units z_1, ..., z_m and output ŷ; W is m x 1, V is d x m]

    \hat{y} = \sum_{j=1}^{m} w_j z_j,   z_j = g(x, v_j)
    g(x, v_i) = s\left( v_{i0} + \sum_{k=1}^{d} x_k v_{ik} \right) = s(x \cdot v_i)

with sigmoid activation

    s(t) = \frac{1}{1 + \exp(-t)}   or   s(t) = \tanh(t) = \frac{\exp(t) - \exp(-t)}{\exp(t) + \exp(-t)}

How to estimate parameters (weights) via ERM?
19
Learning for a single neuron (delta rule):
Forward pass:
    \hat{y}(k) = w_0(k) + \sum_{j=1}^{m} w_j(k) z_j(k)
Backward pass:
    \delta(k) = \hat{y}(k) - y(k)
    \Delta w_j(k) = -\gamma_k \delta(k) z_j(k),   w_j(k+1) = w_j(k) + \Delta w_j(k)
How to implement gradient-descent learning in a
network of neurons?



20
Backpropagation training
Minimization of

    R_{emp}(W, V) = \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i, W, V))^2

with respect to parameters (weights) W, V
Gradient descent optimization for k = 1, ..., n, ...

    V(k+1) = V(k) - \gamma_k \, \mathrm{grad}_V L(x(k), y(k), V(k), w(k))
    w(k+1) = w(k) - \gamma_k \, \mathrm{grad}_w L(x(k), y(k), V(k), w(k))

where

    L(x(k), y(k), V(k), w(k)) = \frac{1}{2} (f(x(k), w, V) - y(k))^2

Careful application of gradient descent leads
to the backpropagation algorithm
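One hedged sketch of these update equations for a single-hidden-layer MLP with sigmoid hidden units and a linear output (my own minimal implementation; the sine-squared training data and the learning rate schedule are illustrative choices, and the output bias is omitted for brevity):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def backprop_step(x, y, V, w, gamma):
    """One stochastic gradient (backpropagation) update for
    f(x, w, V) = sum_j w_j * s(v_j . [1, x]) with squared loss."""
    x1 = np.append(1.0, x)            # input with bias term
    z = sigmoid(V @ x1)               # forward pass: hidden-unit outputs
    y_hat = w @ z                     #               network output (linear output unit)
    delta = y_hat - y                 # backward pass: output error
    grad_w = delta * z                # dL/dw_j for L = 0.5*(y_hat - y)^2
    grad_V = np.outer(delta * w * z * (1.0 - z), x1)   # dL/dV via the chain rule
    return V - gamma * grad_V, w - gamma * grad_w

# Illustrative run: m = 5 hidden units, d = 1 input, sine-squared target
rng = np.random.default_rng(0)
V = 0.1 * rng.standard_normal((5, 2))   # small initial weights
w = 0.1 * rng.standard_normal(5)
for k in range(1000):
    x = rng.uniform(0, 1, size=1)
    y = np.sin(2 * np.pi * x[0]) ** 2 + 0.1 * rng.standard_normal()
    V, w = backprop_step(x, y, V, w, gamma=0.5 / (1 + k / 200))
```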
21
Backpropagation: forward pass
for training input x(k), estimate the predicted output ŷ(k)

22
Backpropagation: backward pass
update the weights by propagating the error
23
Details of backpropagation
Sigmoid activation function and its simple derivative (see below)
Poor behaviour for large t ~ saturation
How to avoid saturation?
- Proper initialization (small weights)
- Pre-scaling of inputs (zero mean, unit variance)
Learning rate schedule (initial, final)
Stopping rules, number of epochs
Number of hidden units
    s(t) = \frac{1}{1 + \exp(-t)}, \qquad s'(t) = s(t) (1 - s(t))
24
Additional enhancements
The problem: convergence may be very slow
for error functional with different curvatures:



Solution: add a momentum term to smooth out oscillations:

    w(k+1) = w(k) - \gamma_k \delta(k) z(k) + \mu \Delta w(k)

where \Delta w(k) = w(k) - w(k-1) and \mu is the momentum parameter

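A small illustration of the momentum update in Python (my own sketch; variable names are arbitrary):

```python
def momentum_update(w, w_prev, grad, gamma, mu=0.9):
    """Gradient step with a momentum term: the previous weight change
    Delta_w = w - w_prev is added (scaled by mu) to smooth oscillations."""
    w_new = w - gamma * grad + mu * (w - w_prev)
    return w_new, w      # also return the current w as the new "previous" value
```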
25
Regularization Effect of Backpropagation
Backpropagation ~ iterative optimization
Final model (weights) depends on:
- initial point + final point (stopping rules)
initialization and/ or stopping rules can be used
for model complexity control
26
Various forms of complexity control
MLP topology ~ number of hidden units
Constraints on parameters (weights) ~
weight decay
Type of optimization algorithm (many
versions of backprop., other opt. methods)
Stopping rules
Initial conditions (initial small weights)
Multiple factors make it difficult to control
complexity; usually vary one complexity
parameter while keeping all others fixed
27
Example: univariate regression
Data set: 30 samples generated using sine-squared
target function with Gaussian noise (st. deviation 0.1).
MLP network
(two hidden units)
underfitting

[Plot: training data and fitted MLP model, X vs Y]
28
Example: univariate regression
Data set: 30 samples generated using sine-squared
target function with Gaussian noise (st. deviation 0.1).
MLP network
(five hidden units)
near optimal

[Plot: training data and fitted MLP model, X vs Y]
29
Example: univariate regression
Data set: 30 samples generated using sine-squared
target function with Gaussian noise (st. deviation 0.1).
MLP network
(20 hidden units)
little overfitting

[Plot: training data and fitted MLP model, X vs Y]
30
Backpropagation for classification
Original MLP is for regression
(as shown)





For classification:
- use sigmoid output unit
- during training, use real values 0/1 for class labels
- during operation, threshold the output of a trained MLP
classifier at 0.5 to predict class labels

[Diagram: MLP network ŷ = \sum_{j=1}^{m} w_j z_j, z_j = g(x, v_j), with inputs x_1, ..., x_d; W is m x 1, V is d x m]
31
Classification example (Ripley's data set)
Data set: 250 samples ~ mixture of gaussians, where
Class 0 data has centers (-0.3, 0.7) and (0.4, 0.7), and
Class 1 data has centers (-0.7, 0.3) and (0.3, 0.3).
The variance of all gaussians is 0.03.
MLP classifier
(two hidden units)
underfitting
[Plot: training data and decision boundary of the MLP classifier]
32
Classification Example

MLP classifier (three hidden units)
~ near optimal solution

[Plot: training data and decision boundary of the MLP classifier]
33
Classification Example

MLP classifier (six hidden units)
some overfitting

[Plot: training data and decision boundary of the MLP classifier]
34
MLP software

MLP software widely available in public domain
For example, Netlab toolbox (in Matlab) at
http://www1.aston.ac.uk/eas/research/groups/n
crg/resources/netlab/

Many commercial products (full of Neural
Network marketing hype), e.g.:
Nearly 80% Accurate Market Forecasting Software
Get FREE up to date predictions and see for yourself!
35
NetTalk
(Sejnowski and Rosenberg, 1987)
One of the first successful applications of backpropagation:
http://www.cnl.salk.edu/ParallelNetsPronounce/index.php
Goal: Learning to read (English text) aloud, i.e.
learn the mapping: English text → phonemes
using MLP network

Network inputs encode 7-letter window (the 4-th letter
in the middle needs to be pronounced)
Network outputs (26 units) encode phonemes that
drive a speech synthesizer
The MLP network is trained using labeled data (both
individual words and unrestricted text)

36
NetTalk architecture


Input encoding: 7x29 = 203 units
Output encoding: 26 units (phonemes)
Hidden layer: 80 hidden units
37
Listening to NetTalk-generated speech
Listen to tape recordings illustrating NETtalk operation. These recordings
are available (in MP3 format) from an article in Wikipedia at
http://en.wikipedia.org/wiki/NETtalk_(artificial_neural_network)
This article has a link to the audio examples of the neural network as it
progresses through training. Specifically, it has three recordings containing 3
different audio outputs of NETtalk:
(a) during the first 5 minutes of training, starting with weights initialized to zero.
(b) after training using the set of 10,000 words. This training set corresponds to 20
passes (epochs) over 500-word text.
(c) generated with new text input from transcription that was not part of the training
set.
After listening to these recordings, answer and comment on the following questions:
- can you recognize words in the beginning of recording (a)? in the end of (a)?
- compare the quality of outputs (b) and (c). Which one seems closer to human
speech and why?


38
NETtalk: question for discussion
NETtalk system uses a seven-letter window for
text input. Try to justify this choice (of window
size) based on the properties of natural English
language. How would the performance of NETtalk
change if a small window (of size 3 letters) or a
large window (of size 21 letters) were used instead?


39
Radial Basis Function (RBF) networks
Dictionary parameterization:
[Diagram: RBF network with inputs x_1, ..., x_d, hidden units z_1, ..., z_m and output ŷ; W is m x 1, V is d x m]

    \hat{y} = f_m(x) = w_0 + \sum_{j=1}^{m} w_j g_j(x),   z_j = g_j(x) = g(x, v_j)

- each basis function is (usually) local,
  with center v_j and width \sigma_j,
  i.e. Gaussian:

    g_j(x) = \exp\left( -\frac{\| x - v_j \|^2}{2 \sigma_j^2} \right) = \prod_{k=1}^{d} \exp\left( -\frac{(x_k - v_{jk})^2}{2 \sigma_j^2} \right)

Typically used for regression or classification
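A short Python sketch of the Gaussian basis function and the resulting RBF model output (my own illustration; function names are arbitrary):

```python
import numpy as np

def gaussian_rbf(x, center, width):
    """Local Gaussian basis function g_j(x) with center v_j and width sigma_j."""
    return np.exp(-np.sum((x - center) ** 2) / (2.0 * width ** 2))

def rbf_output(x, centers, widths, w, w0=0.0):
    """RBF network output f_m(x) = sum_j w_j g_j(x) + w_0."""
    g = np.array([gaussian_rbf(x, c, s) for c, s in zip(centers, widths)])
    return w @ g + w0
```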
40
RBF network training
RBF training (learning) ~ estimation of
(1) RBF parameters (centers, width)
(2) linear weights w
Non-adaptive implementation:
(1) Estimate RBF parameters via unsupervised learning
(only x-values of training data) can use SOM, GLA etc.
(2) Estimate weights w via linear least squares
Advantages:
- fast training;
- when x-samples are plenty, but (x,y) data are few
Limitations: cannot discard irrelevant inputs
→ the curse of dimensionality
41
Non-adaptive RBF training algorithm
1. Choose the number of basis functions
(centers) m.
2. Estimate centers using x-values of training data
via unsupervised learning (SOM, GLA, clustering etc.)
3. Determine width parameters using the heuristic:
For a given center v_j
(a) find the distance to the closest center:

    r_j = \min_{k \ne j} \| v_k - v_j \|

(b) set the width parameter

    \sigma_j = \lambda r_j

where the parameter \lambda controls the degree of overlap between
adjacent basis functions. Typically 1 \le \lambda \le 3
4. Estimate weights w via linear least squares
(minimization of the empirical risk).
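A compact sketch of the whole non-adaptive procedure (my own illustration: a plain k-means loop stands in for the unsupervised step, the width heuristic uses the nearest-center rule above with overlap parameter lambda, and the weights come from numpy least squares):

```python
import numpy as np

def train_rbf(X, y, m, overlap=2.0, seed=0):
    """Non-adaptive RBF training sketch: (1) centers via k-means on x-values only,
    (2) widths sigma_j = lambda * r_j, (3) weights via linear least squares."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), m, replace=False)].astype(float)
    for _ in range(100):                              # simple k-means (GLA) loop
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == j].mean(0) if np.any(labels == j)
                            else centers[j] for j in range(m)])
    d = np.sqrt(((centers[:, None, :] - centers[None]) ** 2).sum(-1))
    np.fill_diagonal(d, np.inf)
    widths = overlap * d.min(axis=1)                  # sigma_j = lambda * min_k ||v_k - v_j||
    G = np.exp(-((X[:, None, :] - centers[None]) ** 2).sum(-1) / (2 * widths ** 2))
    G = np.hstack([G, np.ones((len(X), 1))])          # constant term w_0
    w, *_ = np.linalg.lstsq(G, y, rcond=None)         # linear least squares
    return centers, widths, w
```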
42
RBF network complexity control
RBF model complexity can be controlled by
The number of RBFs:
Goal: select opt number of units (RBFs)
RBF width:
Goal: select opt width parameter (for large
number of RBFs)
Penalization of large weights ws
See toy examples next (using the number of
units as the complexity parameter)
43
Example: RBF regression
Data set: 30 samples generated using sine-squared
target function with Gaussian noise (st. deviation 0.1).
RBF network: automatic width selection
2 RBFs
underfitting

[Plot: training data and fitted RBF model, X vs Y]
44
Example: RBF regression
Data set: 30 samples generated using sine-squared
target function with Gaussian noise (st. deviation 0.1).
RBF network: automatic width selection
5 RBFs
~ optimal

[Plot: training data and fitted RBF model, X vs Y]
45
Example: RBF regression
Data set: 30 samples generated using sine-squared
target function with Gaussian noise (st. deviation 0.1).
RBF network: automatic width selection
20 RBFs
overfitting

[Plot: training data and fitted RBF model, X vs Y]
46
RBF Classification example (Ripley's data)
Data set: 250 samples ~ mixture of gaussians, where
Class 0 data has centers (-0.3, 0.7) and (0.4, 0.7), and
Class 1 data has centers (-0.7, 0.3) and (0.3, 0.3).
The variance of all gaussians is 0.03.
RBF classifier
(4 units)
little underfitting
[Plot: training data and decision boundary of the RBF classifier]
47
RBF Classification example (contd)
RBF classifier (9 units)

Optimal
[Plot: training data and decision boundary of the RBF classifier]
48
RBF Classification example (contd)
RBF classifier (25 units)
Little overfitting
[Plot: training data and decision boundary of the RBF classifier]
49
OUTLINE
Objectives
Brief history and motivation for artificial
neural networks
Sequential estimation of model parameters
Methods for supervised learning
Methods for unsupervised learning
- Overview
- Clustering and vector quantization
- Self-Organizing Maps (SOM)
- Application example
Summary and discussion
50
Overview
Recall from Lecture Set 2:
unsupervised learning
data reduction approach
Example: Training data represented by 3 centers

[Figure: training data represented by 3 centers]
51
Two types of problems
1. Data reduction:
VQ + clustering

Vector Quantizer Q:
Given training data X = {x_1, x_2, ..., x_n}
Model:

    f(x, \omega) = Q(x) = \sum_{j=1}^{m} c_j I(x \in R_j)

VQ setting: given n training samples,
find the coordinates of the m centers (prototypes) C = {c_1, c_2, ..., c_m} such
that the total squared error distortion is minimized:

    R_{emp}(C) = \frac{1}{n} \sum_{i=1}^{n} \| x_i - Q(x_i) \|^2 \to \min
52
2. Dimensionality reduction:
linear vs nonlinear
[Figure: 2D data (x_1, x_2) projected onto a line (linear) or a curve (nonlinear)]
Note: the goal is to estimate a mapping from d-dimensional
input space (d = 2) to a low-dim. feature space (m = 1), minimizing

    R(\omega) = \int \| x - f(x, \omega) \|^2 p(x) dx
53
Unsupervised Learning: Formalization
Unsupervised learning
~ mapping from the input space (x) to some
model space
For VQ/ clustering:
a model is a set of centers (cluster centers)
For dimensionality reduction:
a model is a low-dimensional space
Note 1: the two types of problems can be combined
Note 2: unsupervised learning requires
estimation of two mappings x → z → x*,
i.e. z = F(x) and x* = G(z)
54
Vector Quantization and Clustering
Two complementary goals of VQ:
1. partition the input space into disjoint regions
2. find positions of units (coordinates of prototypes)








Note: optimal partitioning into regions is according to
the nearest-neighbor rule (~ the Voronoi regions)

55
Generalized Lloyd Algorithm(GLA) for VQ
Given data points x(k), k = 1, 2, ..., loss function L (i.e.,
squared loss) and initial centers c_j(0), j = 1, ..., m

Perform the following updates upon presentation of x(k):
1. Find the nearest center to the data point (the
winning unit):

    j = \arg\min_i \| x(k) - c_i(k) \|

2. Update the winning unit coordinates (only) via

    c_j(k+1) = c_j(k) + \gamma_k [ x(k) - c_j(k) ]

Increment k and iterate steps (1)-(2) above
Note: - the learning rate \gamma_k decreases with iteration number k
      - biological interpretations of steps (1)-(2) exist
56
Batch version of GLA
Given data points x_i, i = 1, ..., n, loss function L (i.e.,
squared loss) and initial centers c_j(0), j = 1, ..., m
Iterate the following two steps
1. Partition the data (assign sample x_i to unit j)
using the nearest neighbor rule. Partitioning matrix Q:

    q_{ij} = 1  if  L(x_i, c_j(k)) = \min_l L(x_i, c_l(k)),   q_{ij} = 0  otherwise

2. Update unit coordinates as centroids of the data:

    c_j(k+1) = \frac{\sum_{i=1}^{n} q_{ij} x_i}{\sum_{i=1}^{n} q_{ij}},   j = 1, ..., m

Note: the final solution may depend on initialization (local min) -
a potential problem for both on-line and batch GLA
57
Statistical Interpretation of GLA
Iterate the following two steps
1. Partition the data (assign sample x_i to unit j) using
the nearest neighbor rule. Partitioning matrix Q:

    q_{ij} = 1  if  L(x_i, c_j(k)) = \min_l L(x_i, c_l(k)),   q_{ij} = 0  otherwise

~ Projection of the data onto model space (units), F(x)

2. Update unit coordinates as centroids of the data:

    c_j(k+1) = \frac{\sum_{i=1}^{n} q_{ij} x_i}{\sum_{i=1}^{n} q_{ij}},   j = 1, ..., m

~ Conditional Expectation (averaging, smoothing) G(z),
conditional upon results of partitioning step (1)
58
Numeric Example of univariate VQ
Given data: {2,4,10,12,3,20,30,11,25}, set m=2
Initialization (random): c1 = 3, c2 = 4
Iteration 1
  Projection: P1 = {2,3}, P2 = {4,10,12,20,30,11,25}
  Expectation (averaging): c1 = 2.5, c2 = 16
Iteration 2
  Projection: P1 = {2,3,4}, P2 = {10,12,20,30,11,25}
  Expectation (averaging): c1 = 3, c2 = 18
Iteration 3
  Projection: P1 = {2,3,4,10}, P2 = {12,20,30,11,25}
  Expectation (averaging): c1 = 4.75, c2 = 19.6
Iteration 4
  Projection: P1 = {2,3,4,10,11,12}, P2 = {20,30,25}
  Expectation (averaging): c1 = 7, c2 = 25
Stop, as the algorithm has stabilized with these values



59
GLA Example 1
Modeling doughnut distribution using 5 units

(a) initialization (b) final position (of units)


60
GLA Example 2
Modeling doughnut distribution using 20 units:
7 units were never moved by the GLA
the problem of unused units (dead units)
61
Avoiding local minima with GLA
Starting with many random initializations,
and then choosing the best GLA solution
Conscience mechanism: forcing dead
units to participate in competition, by keeping
the frequency count (of past winnings) for
each unit,
i.e. for the on-line version of GLA, in Step 1 use

    j = \arg\min_i [ freq_i(k) \| x(k) - c_i(k) \| ]

Self-Organizing Map: introduce a topological
relationship (map), thus forcing the neighbors
of the winning unit to move towards the data.
62
Clustering methods
Clustering: separating a data set into
several groups (clusters) according to some
measure of similarity
Goals of clustering:
interpretation (of resulting clusters)
exploratory data analysis
preprocessing for supervised learning
often the goal is not formally stated
VQ-style methods (GLA) often used for
clustering, i.e. k-means or c-means
Many other clustering methods as well

63
Clustering (contd)
Clustering: partition a set of n objects
(samples) into k disjoint groups, based on
some similarity measure. Assumptions:
- similarity ~ distance metric dist (i,j)
- usually k given a priori (but not always!)
Intuitive motivation:
similar objects into one cluster
dissimilar objects into different clusters
the goal is not formally stated
Similarity (distance) measure is critical
but usually hard to define (objectively).
Distance needs to be defined for different
types of input variables.


64
Applications of clustering
Marketing:
explore customers data to identify buying
patterns for targeted marketing (Amazon.com)
Economic data:
identify similarity between different countries,
states, regions, companies, mutual funds etc.
Web data:
cluster web pages or web users to discover
groups of similar access patterns
Etc., etc.

65
Clustering Methods
Many different approaches developed in
- neural networks
- mathematics (graph theory, linear
algebra)
- pattern recognition
- data mining etc.
Example Graph-theoretic approach:
Minimum Spanning Tree (MST) clustering
Types of clustering methods:
hierarchical, partitional, fuzzy clustering.


66
K-means clustering (~ GLA)
This is a representative partitional clustering method.
Given a data set of n samples x_i and the value of k:
Step 0: (arbitrarily) initialize cluster centers
Step 1: assign each data point (object) x_i to the cluster
with the closest cluster center
Step 2: calculate the mean (centroid) of data points
in each cluster as estimated cluster centers
Iterate steps 1 and 2 until the cluster membership is
stabilized
67
The K-Means Clustering Method
Example (K = 2)
[Figure: a sequence of five scatter plots (axes 0-10 by 0-10) illustrating K-means:
arbitrarily choose K objects as initial cluster centers → assign each object to the
most similar center → update the cluster means → reassign → update the cluster
means → reassign]
68
Clustering of High-Dimensional Data
Additional Challenges
- many clustering methods rely on intuition for
low-dimensional data
- visualization is possible only in 2D or 3D.
Multidimensional scaling (MDS)
aims to produce a 2D representation of the inter-point
distances \delta_{ij} between high-dimensional samples.
MDS finds a set of points Z = {z_1, ..., z_n} in 2D space,
which minimizes the stress function

    S(z_1, ..., z_n) = \sum_{i \ne j} ( \| z_i - z_j \| - \delta_{ij} )^2
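A small Python helper computing this stress for a candidate 2D configuration (my own sketch; delta is assumed to be the precomputed matrix of high-dimensional inter-point distances):

```python
import numpy as np

def mds_stress(Z, delta):
    """Stress of a 2D configuration Z (n x 2) against the matrix delta
    of inter-point distances measured between high-dimensional samples."""
    d = np.sqrt(((Z[:, None, :] - Z[None]) ** 2).sum(-1))   # pairwise 2D distances
    off_diag = ~np.eye(len(Z), dtype=bool)                  # sum over i != j only
    return (((d - delta) ** 2)[off_diag]).sum()
```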
69
Self-Organizing Maps
History and biological motivation
Brain changes its internal structure to reflect
life experiences interaction with
environment is critical at early stages of
brain development (first 1-2 years of life)
Existence of various regions (maps) in the
brain
How might these maps be formed?
i.e. an information-processing model leading to
map formation
T. Kohonen (early 1980s) proposed SOM

70
Goal of SOM
Dimensionality reduction: project given (high-
dimensional) data onto low-dimensional space (map)
Feature space (Z-space) is 1D or 2D and is discretized
as a number of units, i.e., 10x10 map
Z-space has a distance metric → ordering of units
Similarities and differences between VQ and SOM












[Diagram: mappings between input space X and feature (map) space Z:
projection G(X): X → Z and map F(Z): Z → X]
71
Self-Organizing Map
Discretization of 2D space via 10x10 map. In this discrete
space, distance relations exist between all pairs of units.
Distance relation ~ map topology
Units in 2D feature space
72
SOM Algorithm (flow through)
Given data points x(k), k = 1, 2, ..., a distance metric in the
input space (~ Euclidean), map topology (in z-space),
initial position of units c_j(0), j = 1, ..., m (in x-space)

Perform the following updates upon presentation of x(k):
1. Find the nearest center to the data point (the
winning unit):

    z^*(k) = \arg\min_i \| x(k) - c_i(k-1) \|

2. Update all units around the winning unit via

    c_j(k) = c_j(k-1) + \beta(k) K_{\sigma(k)}(z_j, z^*(k)) [ x(k) - c_j(k-1) ]

Increment k, decrease the learning rate and the
neighborhood width, and repeat steps (1)-(2) above
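A hedged Python sketch of this on-line SOM (my own illustration assuming a 2D rectangular map, a Gaussian neighborhood with exponentially decreasing width, and a linearly decreasing learning rate, as described on the following slides):

```python
import numpy as np

def online_som(X, map_shape=(10, 10), k_max=10000, lr0=0.1,
               sigma0=3.0, sigma_final=0.05, seed=0):
    """Flow-through SOM: move the winner and its map neighbors toward each sample."""
    rng = np.random.default_rng(seed)
    rows, cols = map_shape
    grid = np.array([[i, j] for i in range(rows) for j in range(cols)], dtype=float)
    centers = X[rng.choice(len(X), rows * cols)].astype(float)   # units in input space
    for k in range(k_max):
        x = X[rng.integers(len(X))]
        sigma = sigma0 * (sigma_final / sigma0) ** (k / k_max)   # neighborhood width
        beta = lr0 * (1 - k / k_max)                             # learning rate
        winner = np.argmin(((centers - x) ** 2).sum(-1))         # step 1: winning unit
        K = np.exp(-((grid - grid[winner]) ** 2).sum(-1) / (2 * sigma ** 2))
        centers += beta * K[:, None] * (x - centers)             # step 2: move all units
    return centers.reshape(rows, cols, -1)
```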
73
SOM example (first iteration)
Step 1:
Step 2:
74
SOM example (next iteration)
Step 1:
Step 2:
Final map
75
Hyper-parameters of SOM
SOM performance depends on parameters (~ user-defined):
Map dimension and topology (usually 1D or 2D)
Number of SOM units ~ quantization level (of z-space)
Neighborhood function ~ rectangular or gaussian (not
important)
Neighborhood width decrease schedule (important),
i.e. exponential decrease for a Gaussian neighborhood:

    K_{\sigma(k)}(z, z') = \exp\left( -\frac{\| z - z' \|^2}{2 \sigma^2(k)} \right),
    \sigma(k) = \sigma_{initial} \left( \frac{\sigma_{final}}{\sigma_{initial}} \right)^{k / k_{max}}

with user-defined k_{max}, \sigma_{initial}, \sigma_{final}
Also linear decrease of the neighborhood width can be used
Learning rate schedule (important), i.e.

    \beta(k) = \beta_{initial} \left( \frac{\beta_{final}}{\beta_{initial}} \right)^{k / k_{max}}

(also linear decrease)
Note: learning rate and neighborhood decrease schedules should be
set jointly
76
Modeling uniform distribution via SOM
(a) 300 random samples (b) 10X10 map
[Figure: (a) 300 random samples from the uniform distribution on the unit square, (b) fitted 10x10 SOM map]
SOM neighborhood: Gaussian
Learning rate: linear decrease, \beta(k) = 0.1 (1 - k / k_{max})
77
Position of SOM units: (a) initial, (b) after 50 iterations,
(c) after 100 iterations, (d) after 10,000 iterations
[Figure: four panels (a)-(d) showing SOM unit positions at the stated numbers of iterations]
78
Batch SOM (similar to batch GLA)
Given data points x_i, i = 1, ..., n, loss function L (i.e.,
squared loss) and initial centers c_j(0), j = 1, ..., m
Iterate the following steps
1. Partition the data (assign sample x_i to unit j) using the
nearest neighbor rule. Partitioning matrix Q:

    q_{ij} = 1  if  L(x_i, c_j(k)) = \min_l L(x_i, c_l(k)),   q_{ij} = 0  otherwise

2. Update unit coordinates as a weighted average of all samples:

    c_j(k+1) = \frac{\sum_{i=1}^{n} K_\sigma(z_j, z_i) x_i}{\sum_{i=1}^{n} K_\sigma(z_j, z_i)}

where K_\sigma(z_j, z_i) is the (neighborhood) weight of sample x_i
3. Decrease the neighborhood width \sigma
Repeat the above steps for a maximum number of iterations K_max
79
SOM Example 1
Modeling doughnut distribution using batch SOM
linear (1D) SOM topology with 5 units
- initial neighborhood =1, final neighborhood = 0.05
- number of iterations Kmax=50


Note: no unused units:



[Figure: 5-unit linear (1D) SOM fitted to the doughnut distribution, X1 vs X2]
80
SOM Example 2
Modeling the same doughnut distribution using:
square grid (2D) SOM topology with 5x5=25 units
- initial neighborhood =1, final neighborhood = 0.05
- number of iterations Kmax=50


Note:
final model not sensitive
to poor initialization



[Figure: 5x5 (2D grid) SOM fitted to the doughnut distribution, X1 vs X2]
81
SOM Applications and Variations
Main web site: Helsinki University of Technology (HUT)
http://www.cis.hut.fi/research/som-research/
Numerous Applications
Marketing surveys/ segmentation
Financial/ stock market data
Text data / document map WEBSOM
Image data / picture map - PicSOM
see HUT web site

82
Practical Issues for SOM
Pre-scaling of inputs, usually to [0, 1]
range. Why?
Map topology: usually 1D or 2D
Number of map units (per dimension)
Learning rate schedule (for on-line
version)
Neighborhood type and schedule:
Initial size (~1), final size
Final neighborhood size + number of
units determine model complexity.

83
Modeling US states using 1D SOM
Purpose: clustering of US states
Data encoding: each state described by 5
socio-economic indicators: obesity index,
result of 2004 presidential elections, median
income, mean NAEP, IQ score
Data scaling: each input scaled
independently to [0,1] range
SOM specs: 1D map, 9 units, initial
neighborhood width 1, final width 0.05

84


State Obesity index Election_04 Median Income Mean NAEP IQ score
Hawaii 17 0 49775 238 94
Colorado 17 1 49617 252 104
Connecticut 18 0 53325 255 99
Massachusetts 18 0 50587 257 111
New Hampshire 18 1 53549 257 102
Utah 18 1 48537 250 89
California 19 0 48113 238 94
Maryland 19 0 55912 248 95
New Jersey 19 0 53266 253 103
Rhode Island 19 0 44311 245 89
Vermont 19 0 41929 256 102
Florida 19 1 38533 245 87
Montana 19 1 33900 254 100
Oregon 20 0 42704 250 100
Arizona 20 1 41554 241 92
Idaho 20 1 38613 249 96
New Mexico 20 0 35251 235 85
Wyoming 20 1 40499 253 102
Maine 21 0 37654 253 99
New York 21 0 42432 251 90
Washington 21 0 44252 251 92
South Dakota 21 1 38755 254 100
Delaware 22 0 50878 250 90
Illinois 22 0 45906 248 93
Minnesota 22 0 54931 256 113
Wisconsin 22 0 46351 252 105
Nevada 22 1 46289 239 92
Alaska 23 1 55412 245 92
85


Iowa 23 0 41827 253 109
Kansas 23 1 42523 253 101
Missouri 23 1 43955 251 92
Nebraska 23 1 43566 251 101
North Dakota 23 1 36717 254 111
Ohio 23 1 43332 252 107
Oklahoma 23 1 35500 244 98
Pennsylvania 24 0 43577 249 99
Arkansas 24 1 32423 242 98
Georgia 24 1 43316 243 93
Indiana 24 1 41581 251 105
North Carolina 24 1 38432 252 106
Virginia 24 1 49974 253 99
Michigan 25 0 45335 249 99
Kentucky 25 1 37893 247 94
Tennessee 25 1 36329 241 90
Alabama 26 1 36771 236 90
Louisiana 26 1 33312 238 99
South Carolina 26 1 38460 246 87
Texas 26 1 40659 247 98
Mississippi 27 1 32447 236 90
West Virginia 28 1 30072 245 92
86
SOM Modeling 1 of US states


Unit States (assigned to each unit)
1 HI, CA, MD, RI, NM,
2 OR, ME, NY, WA, DE, IL, PA, MI,
3 CT, MA, NJ, VT, MN, WI,
4
5 CO, NH, MT, WY, SD,
6 KS, NE, ND, OH, IN, NC, VA,
7 UT, ID, AK, IA, MO,
8 FL, AZ, NV, OK, GA, KY, TX
9 AR, TN, AL, LA, SC, MS, WV
87


88
SOM Modeling 2 of US states
- remove voting input and apply 1D SOM:


Unit States
1 CO, CT, MA, NH, NJ, MN,
2 WI, IA, ND, OH, IN, NC,
3 VT, MT, OR, ID, WY, ME, SD,
4 KS, MO, NE, PA, VA, MI,
5 UT, MD, NY, WA, DE, IL, AK,
6 HI, CA , RI,
7 FL, AZ, NM, NV,
8 OK, GA, KY, SC, TX,
9 AR, TN, AL, LA, MS, WV
89
SOM Modeling 2 of US states (contd)
- remove voting input and apply 1D SOM:


90
Tree-structured SOM
Fixed SOM topology gives poor modeling of
structured distributions:


91
Minimum Spanning Tree SOM
Define SOM topology adaptively during each iteration
of SOM algorithm
Minimum Spanning Tree (MST) topology ~ according
to distance between units (in the input space)
Topological distance ~ number of hops in MST

[Figure: MST topology example; topological distances of 1, 2 and 3 hops from a given unit]
92
Example of using MST SOM
Modeling cross distribution
MST topology vs fixed 2D grid map


93
Application: skeletonization of images
Singh et al., Self-organizing maps for the skeletonization of sparse
shapes, IEEE Trans Neural Networks, 11, Issue 1, Jan 2000


Skeletonization of noisy images
Application of MST SOM: robustness with
respect to noise

94
Clustering of European Languages
Background: historical linguistics studies
relatedness between languages based on
phonology, morphology, syntax and lexicon
Difficulty of the problem: due to evolving
nature of human languages and globalization.
Hypothesis: similarity based on analysis of a
small stable word set.
See glottochronology, Swadesh list, at
http://en.wikipedia.org/wiki/Glottochronology

95
SOM for clustering European languages
Modeling approach: language ~ 10 word set.
Assuming words in different languages are encoded
in the same alphabet, it is possible to perform
clustering using some distance measure.
Issues:
selection of stable word set
data encoding + distance metric
Stable word set: numbers 1 to 10
Data encoding: Latin (English) alphabet,
use 3 first letters (in each word)


96
Numbers word set in 18 European languages
Each language is a feature vector encoding 10 words



Languages (columns, left to right): English, Norwegian, Polish, Czech, Slovakian, Flemish, Croatian, Portuguese, French, Spanish, Italian, Swedish, Danish, Finnish, Estonian, Dutch, German, Hungarian
one en jeden jeden jeden ien jedan um un uno uno en en yksi uks een erins egy
two to dwa dva dva twie dva dois deux dos due tva to kaksi kaks twee zwei ketto
three tre trzy tri tri drie tri tres trois tres tre tre tre kolme kolme drie drie harom
four fire cztery ctyri styri viere cetiri quarto quatre cuatro quattro fyra fire nelja neli vier vier negy
five fem piec pet pat vuvve pet cinco cinq cinco cinque fem fem viisi viis vijf funf ot
six seks szesc sest sest zesse sest seis six seis sei sex seks kuusi kuus zes sechs hat
seven sju sediem sedm sedem zevne sedam sete sept siete sette sju syv seitseman seitse zeven sieben het
eight atte osiem osm osem achte osam oito huit ocho otto atta otte kahdeksan kaheksa acht acht nyolc
nine ni dziewiec devet devat negne devet nove neuf nueve nove nio ni yhdeksan uheksa negen neun kilenc
ten ti dziesiec deset desat tiene deset dez dix dies dieci tio ti kymmenen kumme tien zehn tiz
97
Data Encoding
Word ~ feature vector encoding 3 first letters
Alphabet ~ 26 letters + 1 symbol BLANK
vector encoding:





For example, ONE: O~15, N~14, E~05



ALPHABET INDEX
BLANK 00
A 01
B 02
C 03
D 04

X 24
Y 25
Z 26

98
Word Encoding (contd)
Word → 27-dimensional feature vector





Encoding is insensitive to the order (of the 3 letters)
Encoding of the 10-word set: concatenate the feature
vectors of all words: one + two + ... + ten
→ each word set is encoded as a vector of dim. [1 x 270]


Example: the word "one" → letter indices 15 (O), 14 (N), 05 (E)
→ a 27-dimensional binary vector with 1s at positions 5, 14 and 15:

index:  0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
vector: 0 0 0 0 0 1 0 0 0 0 0  0  0  0  1  1  0  0  0  0  0  0  0  0  0  0  0
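A small Python sketch of this encoding (my own illustration; function names are arbitrary):

```python
import numpy as np

def encode_word(word):
    """27-dimensional bag-of-letters encoding of the first 3 letters of a word
    (index 0 = BLANK, 1-26 = A-Z); insensitive to letter order."""
    v = np.zeros(27)
    for ch in word.upper()[:3]:
        v[ord(ch) - ord('A') + 1 if ch.isalpha() else 0] = 1
    return v

def encode_language(words):
    """Concatenate the encodings of the 10 number words -> 1 x 270 vector."""
    return np.concatenate([encode_word(w) for w in words])

english = ["one", "two", "three", "four", "five",
           "six", "seven", "eight", "nine", "ten"]
x_english = encode_language(english)   # nonzero at positions 5, 14, 15 for "one", etc.
```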
99
SOM Modeling Approach
2-Dimensional SOM (Batch Algorithm)
Number of Knots per dimension=4
Initial Neighborhood =1 Final Neighborhood = 0.15
Total Number of Iterations= 70







100
OUTLINE
Objectives
Brief history and motivation for artificial
neural networks
Sequential estimation of model
parameters
Methods for supervised learning
Methods for unsupervised learning
Summary and discussion
101
Summary and Discussion
Neural Network methods (vs statistical
approaches):
- new techniques/ new insights
- simple (brute-force) computational approaches
- biological motivation
The same fundamental issues: small-sample
problems, curse-of-dimensionality, non-linear
optimization, complexity control
Neural network methods implement risk-
minimization (predictive learning setting)
Hype and controversy
