
Introduction to Predictive Learning

LECTURE SET 4
Statistical Learning Theory

Electrical and Computer Engineering



OUTLINE of Set 4
Objectives and Overview
Inductive Learning Problem Setting
Keep-It-Direct Principle
Analysis of ERM
VC-dimension
Generalization Bounds
Structural Risk Minimization (SRM)
Summary and Discussion

Objectives
Problems with philosophical approaches:
- lack quantitative description/characterization of ideas;
- no real predictive power (unlike the Natural Sciences);
- no agreement on basic definitions/concepts (unlike the Natural Sciences).

Goal: to introduce Predictive Learning as a scientific discipline.

Characteristics of Scientific Theory
Problem setting
Solution approach
Math proofs (technical analysis)
Constructive methods
Applications
Note: Problem Setting and Solution Approach are independent (of each other).

History and Overview


SLT, aka VC-theory (Vapnik-Chervonenkis)
Theory for estimating dependencies from finite samples (predictive learning setting)
Based on the risk minimization approach
All main results originally developed in the 1970s for classification (pattern recognition) - why? - but remained largely unknown
Recent renewed interest due to the practical success of Support Vector Machines (SVM)

History and Overview (cont'd)
MAIN CONCEPTUAL CONTRIBUTIONS
Distinction between problem setting, inductive principle and learning algorithms
Direct approach to estimation with finite data (KID principle)
Math analysis of ERM (standard inductive setting)
Two factors responsible for generalization:
- empirical risk (fitting error)
- complexity (capacity) of approximating functions

Importance of VC-theory
Math results addressing the main question:
- under what general conditions does the ERM approach lead to (good) generalization?
New approach to induction:
- predictive vs generative modeling (in classical statistics)
Connection to philosophy of science:
- VC-theory developed for binary classification (pattern recognition) ~ the simplest generalization problem
- natural sciences: from observations to scientific law

Inductive Learning Setting


The learning machine observes samples (x, y) and returns an estimated response ŷ = f(x, w)
Two modes of inference: identification vs imitation
Risk: R(w) = ∫ Loss(y, f(x, w)) dP(x, y) → min

The Problem of Inductive Learning

Given: finite training samples Z = {(x_i, y_i), i = 1, 2, ..., n}, choose from a given set of functions f(x, w) the one that best approximates the true output (in the sense of risk minimization).
Concepts and Terminology:
- approximating functions f(x, w)
- (non-negative) loss function L(f(x, w), y)
- expected risk functional R(w)
Goal: find the function f(x, w_0) minimizing R(w) when the joint distribution P(x, y) is unknown.

Empirical Risk Minimization

ERM principle in model-based learning:
Model parameterization: f(x, w)
Loss function: L(f(x, w), y)
Estimate risk from data: R_emp(w) = (1/n) Σ_{i=1}^n L(f(x_i, w), y_i)
Choose w* that minimizes R_emp
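A minimal sketch of the ERM principle (illustration only; the polynomial model, squared loss, and toy data below are assumptions, not taken from the slides): estimate R_emp from data and pick the parameters that minimize it.

```python
# Minimal ERM sketch: squared loss, polynomial model f(x, w).
import numpy as np

def empirical_risk(w, x, y):
    """R_emp(w) = (1/n) * sum of squared losses over the training sample."""
    preds = np.polyval(w, x)                 # f(x, w): polynomial with coefficients w
    return np.mean((preds - y) ** 2)

# Toy training sample (x_i, y_i), i = 1..n
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 30)
y = np.sin(2 * np.pi * x) ** 2 + 0.1 * rng.standard_normal(30)

# ERM: choose w* minimizing R_emp over 3rd-degree polynomials
# (for squared loss this is an ordinary least-squares problem)
w_star = np.polyfit(x, y, deg=3)
print("R_emp(w*) =", empirical_risk(w_star, x, y))
```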

Statistical Learning Theory developed from the theoretical analysis of the ERM principle.

Probabilistic Modeling vs ERM


Probabilistic Modeling vs ERM: Example

Known class distribution → optimal decision boundary

[Figure: two-class data in the (x1, x2) plane with the optimal decision boundary]

Probabilistic Approach

Estimate parameters of Gaussian class distributions, and plug them into the quadratic decision boundary

[Figure: estimated quadratic decision boundary in the (x1, x2) plane]

ERM Approach

Quadratic and linear decision boundaries estimated via minimization of squared loss

[Figure: quadratic and linear decision boundaries in the (x1, x2) plane]

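A rough sketch of the contrast above, assuming synthetic 2-D Gaussian classes (the data and any numbers are illustrative, not the ones plotted on the slides): a plug-in quadratic discriminant estimated from class distributions vs a linear decision function fit directly by squared-loss ERM.

```python
# Probabilistic (generative) modeling vs direct ERM on synthetic 2-D Gaussian classes.
import numpy as np

rng = np.random.default_rng(1)
n = 50
X0 = rng.multivariate_normal([0, 0], np.eye(2), n)    # class -1
X1 = rng.multivariate_normal([2, 2], np.eye(2), n)    # class +1
X = np.vstack([X0, X1])
y = np.hstack([-np.ones(n), np.ones(n)])

# Probabilistic approach: estimate class means/covariances, plug into the
# (quadratic) discriminant; equal class priors are assumed here.
def plugin_quadratic_score(x):
    score = 0.0
    for Xc, sign in [(X1, +1), (X0, -1)]:
        mu, cov = Xc.mean(axis=0), np.cov(Xc.T)
        d = x - mu
        logdens = -0.5 * d @ np.linalg.solve(cov, d) - 0.5 * np.log(np.linalg.det(cov))
        score += sign * logdens
    return score                                        # positive -> class +1

# ERM approach: fit a linear decision function by minimizing squared loss directly.
A = np.hstack([X, np.ones((2 * n, 1))])                 # features [x1, x2, 1]
w = np.linalg.lstsq(A, y, rcond=None)[0]
erm_pred = np.sign(A @ w)

gen_pred = np.array([np.sign(plugin_quadratic_score(x)) for x in X])
print("training error, plug-in quadratic:", np.mean(gen_pred != y))
print("training error, ERM linear (squared loss):", np.mean(erm_pred != y))
```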

Estimation of multivariate functions

Is it possible to estimate a function from finite data?
Simplified problem: estimation of an unknown continuous function from noise-free samples
Many results from function approximation theory:
- To estimate a d-dimensional function accurately, one needs on the order of n^d data points (where n is the number of points needed per dimension)
- For example, if 3 points are needed to estimate a 2nd-order polynomial for d=1, then 3^10 points are needed to estimate a 2nd-order polynomial in 10-dimensional space
- Similar results in signal processing
Never enough data points to estimate multivariate functions in most practical applications (image recognition, genomics, etc.)
For multivariate function estimation, the number of free parameters increases exponentially with problem dimensionality (the curse of dimensionality).

Properties of high-dimensional data

Sparse data looks like a porcupine: the volume of a unit sphere inscribed in a d-dimensional cube gets smaller even as the volume of the d-cube gets exponentially larger
A point is closer to an edge than to another point
Pairwise distances between points become nearly the same
Intuition behind kernel (local) methods no longer holds
How is generalization possible, in spite of the curse of dimensionality?
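A small numerical sketch of these effects (illustration only; the sample sizes and dimensions are arbitrary choices): it estimates the volume fraction of the inscribed sphere and the relative spread of pairwise distances as the dimension d grows.

```python
# Sketch of high-dimensional geometry effects.
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 50):
    # Monte-Carlo estimate of the volume fraction of the sphere inscribed in [0,1]^d
    pts = rng.uniform(0, 1, size=(100_000, d))
    in_sphere = np.mean(np.sum((pts - 0.5) ** 2, axis=1) <= 0.25)

    # Concentration of pairwise distances among 200 random points
    sample = rng.uniform(0, 1, size=(200, d))
    dists = np.linalg.norm(sample[:, None, :] - sample[None, :, :], axis=-1)
    dists = dists[np.triu_indices(200, k=1)]
    print(f"d={d:3d}  inscribed-sphere volume fraction ~ {in_sphere:.4f}  "
          f"relative spread of distances ~ {dists.std() / dists.mean():.3f}")
```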

OUTLINE of Set 4
Objectives and Overview
Inductive Learning Problem Setting
Keep-It-Direct Principle
Analysis of ERM
VC-dimension
Generalization Bounds
Structural Risk Minimization (SRM)
Summary and Discussion

Keep-It-Direct Principle

The goal of learning is generalization rather than estimation of the true function (system identification):
R(w) = ∫ Loss(y, f(x, w)) dP(x, y) → min
Keep-It-Direct Principle (Vapnik, 1995):
Do not solve an estimation problem of interest by solving a more general (harder) problem as an intermediate step.
A good predictive model reflects some properties of the unknown distribution P(x, y).
Since model estimation with finite data is ill-posed, one should never try to solve a more general problem than required by the given application.

Learning vs System Identification

Consider the regression problem y = g(x) + noise, where the unknown target function is g(x) = E(y|x)

Goal 1: Prediction
R(w) = ∫ (y - f(x, w))² dP(x, y) → min

Goal 2: Function Approximation (system identification)
R(w) = ∫ (f(x, w) - g(x))² dx → min,  or  ‖f(x, w) - E(y|x)‖ → min

Admissible models: algebraic polynomials
Purpose of comparison: contrast goals (1) and (2)
NOTE: most applications assume Goal 2, i.e. Noisy Data ~ true signal + noise

Empirical Comparison

Target function: sine-squared, g(x) = sin²(2πx), x ∈ [0, 1]

[Figure: sine-squared target function on [0, 1]]

Input distribution: non-uniform Gaussian pdf
Additive Gaussian noise with st. deviation = 0.1

Empirical Comparison (cont'd)

Model selection: use separate data sets
- training: for parameter estimation
- validation: for selecting polynomial degree
- test: for estimating prediction risk (MSE)
Validation set generated differently to contrast (1) & (2):
- Predictive Learning (1) ~ Gaussian sampling
- Function Approximation (2) ~ uniform fixed sampling
Training + test data ~ Gaussian
Training set size: 30; Validation set size: 30
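A hedged sketch of this experiment (the Gaussian input parameters and random seed are assumptions, not the exact settings used on the slides): polynomial degree is selected twice, once with a validation set drawn from the same Gaussian input pdf (prediction goal) and once with a uniform fixed validation grid (function approximation goal).

```python
# Sketch of the predictive-learning vs function-approximation comparison.
import numpy as np

rng = np.random.default_rng(2)
target = lambda x: np.sin(2 * np.pi * x) ** 2          # sine-squared target
noise = 0.1

def sample_gaussian_x(n):
    # non-uniform input density: Gaussian centered at 0.5, clipped to [0, 1] (assumed)
    return np.clip(rng.normal(0.5, 0.2, n), 0.0, 1.0)

x_tr = sample_gaussian_x(30); y_tr = target(x_tr) + noise * rng.standard_normal(30)

# Two validation sets, reflecting the two goals
x_val_pred = sample_gaussian_x(30)                      # Goal 1: same (Gaussian) input pdf
y_val_pred = target(x_val_pred) + noise * rng.standard_normal(30)
x_val_fa = np.linspace(0, 1, 30)                        # Goal 2: uniform fixed sampling
y_val_fa = target(x_val_fa) + noise * rng.standard_normal(30)

def best_degree(x_val, y_val):
    errs = []
    for deg in range(1, 11):
        w = np.polyfit(x_tr, y_tr, deg)
        errs.append(np.mean((np.polyval(w, x_val) - y_val) ** 2))
    return 1 + int(np.argmin(errs))

print("degree chosen for prediction goal     :", best_degree(x_val_pred, y_val_pred))
print("degree chosen for function approx goal:", best_degree(x_val_fa, y_val_fa))
```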

Regression estimates (2 typical realizations of data):
- Dotted line ~ estimate obtained using the predictive learning setting
- Dashed line ~ estimate obtained using the function approximation setting

Conclusion

The goal of prediction (1) is different from (and less demanding than) the goal of estimating the true target function (2) everywhere in the input space.
The curse of dimensionality applies to the system identification setting (2), but may not hold under the predictive setting (1).
Both settings coincide if the input distribution is uniform (e.g., in signal and image denoising applications).

Philosophical Interpretation of KID

Interpretation of predictive models:
- Realism ~ objective truth (hidden in Nature)
- Instrumentalism ~ creation of the human mind (imposed on the data) - favored by KID
Objective evaluation is still possible (via prediction risk reflecting application needs) → Natural Science
Methodological implications:
- Importance of good learning formulations (asking the right question)

OUTLINE of Set 4
Objectives and Overview
Inductive Learning Problem Setting
Keep-It-Direct Principle
Analysis of ERM
VC-dimension
Generalization Bounds
Structural Risk Minimization (SRM)
Summary and Discussion

VC-theory has 4 parts:

1. Analysis of consistency/convergence of ERM:
   R_emp(w) = (1/n) Σ_{i=1}^n L(y_i, f(x_i, w)) → min
2. Generalization bounds
3. Inductive principles (for finite samples)
4. Constructive methods (learning algorithms) for implementing (3)
NOTE: (1) → (2) → (3) → (4)

Consistency/Convergence of ERM

Empirical Risk is known, but Expected Risk is unknown
Asymptotic consistency requirement: under what (general) conditions will models providing min Empirical Risk also provide min Prediction Risk, as the number of samples grows large?
Why is asymptotic analysis needed?
- helps to develop useful concepts
- necessary and sufficient conditions ensure that VC-theory is general and cannot be improved

Consistency of ERM

Convergence of the empirical risk R_emp to the expected risk R does not imply consistency of ERM
Models estimated via ERM (w*) are always biased estimates of the functions minimizing true risk:
R_emp(w_n*) ≤ R(w_n*)

Conditions for Consistency of ERM

Main insight: consistency is not possible without restricting the set of possible models
Example: the 1-nearest-neighbor classification method - is it consistent?
Consider binary decision functions (classification): how to measure their flexibility, or ability to explain the training data (for binary classification)?
Such a complexity index for indicator functions:
- is independent of the unknown data distribution;
- measures the capacity of a set of possible models, rather than characteristics of the true model

OUTLINE of Set 4
Objectives and Overview
Inductive Learning Problem Setting
Keep-It-Direct Principle
Analysis of ERM
VC-dimension
Generalization Bounds
Structural Risk Minimization (SRM)
Summary and Discussion

SHATTERING

Linear indicator functions can split 3 data points in 2D in all 2^3 = 8 possible binary partitions
If a set of n samples can be separated by a set of functions in all 2^n possible ways, this sample is said to be shattered (by the set of functions)
Shattering ~ a set of models can explain a given sample of size n (for all possible labelings)
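A brute-force check of shattering for linear indicator functions in 2-D (a sketch; the three sample points and the use of a linear-programming feasibility test are choices made here, not part of the slides):

```python
# Check that linear indicator functions shatter 3 points in general position in 2-D.
import itertools
import numpy as np
from scipy.optimize import linprog

def separable(points, labels):
    """Return True if some line w.x + b = 0 separates labels +1 / -1 strictly."""
    # Feasibility LP: find (w1, w2, b) with y_i * (w . x_i + b) >= 1 for all i
    A_ub = np.array([[-y * x[0], -y * x[1], -y] for x, y in zip(points, labels)])
    b_ub = -np.ones(len(points))
    res = linprog(c=[0, 0, 0], A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3, method="highs")
    return res.success

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])    # 3 points in general position
shattered = all(separable(points, labels)
                for labels in itertools.product([-1, 1], repeat=len(points)))
print("all 2^3 = 8 labelings separable:", shattered)        # expected: True
```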

VC DIMENSION

Definition: A set of functions has VC-dimension h if there exist h samples that can be shattered by this set of functions, but there are no h+1 samples that can be shattered
VC-dimension h = 3 for linear indicator functions in 2D (h = d+1 for linear functions in d dimensions)
VC-dim. is a positive integer (combinatorial index)

VC-dimension and Consistency of ERM

VC-dimension is infinite if, for any n, a sample of size n can be split in all 2^n possible ways (in this case, no valid generalization is possible)
Finite VC-dimension gives necessary and sufficient conditions for:
(1) consistency of ERM-based learning
(2) fast rate of convergence
(these conditions are distribution-independent)
Interpretation of the VC-dimension via falsifiability:

VC-dimension and Falsifiability

A set of functions has VC-dimension h if
(a) it can explain (shatter) some set of h samples
~ there exist h samples that cannot falsify it, and
(b) it cannot shatter h+1 samples
~ any h+1 samples falsify this set
→ Finiteness of the VC-dimension is a necessary and sufficient condition for generalization (for any learning method based on ERM)

Recall Occam's Razor

Main problem in predictive learning:
- complexity control (model selection)
- how to measure complexity?
Interpretation of Occam's razor (in Statistics):
- Entities ~ model parameters
- Complexity ~ degrees-of-freedom
- Necessity ~ explaining (fitting) available data
→ Model complexity = number of parameters (DoF), consistent with the classical statistical view

Philosophical Principle of VC-falsifiability

Occam's Razor: select the model that explains the available data and has the smallest number of free parameters (entities)
VC theory: select the model that explains the available data and has low VC-dimension (i.e. can be easily falsified)
→ New principle of VC-falsifiability

Calculating the VC-dimension

How to estimate the VC-dimension (for a given set of functions)?
Apply the definition (via shattering) to derive analytic estimates - this works for simple sets of functions
Generally, such analytic estimates are not possible for complex nonlinear parameterizations (i.e., for practical machine learning and statistical methods)

Example: VC-dimension of spherical indicator functions

Consider spherical decision surfaces in a d-dimensional x-space, parameterized by center c and radius r:
f(x, c, r) = I(‖x - c‖² ≤ r²)
In a 2-dim space (d=2) there exist 3 points that can be shattered, but 4 points cannot be shattered → h = 3

Example: VC-dimension of a linear combination of fixed basis functions (i.e. polynomials, Fourier expansion, etc.)
Assuming that the basis functions are linearly independent, the VC-dim equals the number of basis functions (free parameters).

Example: single parameter but infinite VC-dimension:
f(x, w) = I(sin(wx) > 0)
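The infinite VC-dimension of sin(wx) can be checked numerically with the classical construction (points x_i = 10^-i and an explicit choice of w); the construction is standard but not spelled out on the slides.

```python
# Numerical check that f(x, w) = I(sin(wx) > 0) can realize any labeling of the
# points x_i = 10^-i (classical construction).
import itertools
import numpy as np

n = 6
x = np.array([10.0 ** -(i + 1) for i in range(n)])           # x_1 .. x_n

def w_for_labels(labels):
    # labels in {+1, -1}; w = pi * (1 + sum_i ((1 - y_i)/2) * 10^i)
    return np.pi * (1 + sum(((1 - y) // 2) * 10 ** (i + 1)
                            for i, y in enumerate(labels)))

ok = True
for labels in itertools.product([-1, 1], repeat=n):
    w = w_for_labels(labels)
    preds = np.where(np.sin(w * x) > 0, 1, -1)
    ok &= np.array_equal(preds, np.array(labels))
print(f"all 2^{n} labelings realized by sin(wx):", ok)        # expected: True
```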

Example: Wide linear decision boundaries

Consider linear decision functions D(x) = (w · x) + b such that the distance between the boundary D(x) = 0 and the closest data sample is larger than a given value Δ (the margin)
Then the VC-dimension depends on the width parameter Δ, rather than on d (as in linear models):
h ≤ min(R²/Δ², d) + 1
where R is the radius of the smallest sphere containing the training data

Linear combination of fixed basis functions

f(x, w) = I( Σ_{i=1}^m w_i g_i(x) + w_0 > 0 )
is equivalent to linear functions in m-dimensional space
→ VC-dimension = m + 1 (this assumes linear independence of the basis functions)
In general, analytic estimation of the VC-dimension is hard
VC-dimension can be equal to, smaller than, or larger than DoF (see next slide)

VC-dimension vs number of parameters

VC-dimension can be equal to DoF (number of parameters)
Example: linear estimators
VC-dimension can be smaller than DoF
Example: penalized estimators
VC-dimension can be larger than DoF
Examples: feature selection; sin(wx)

VC-dimension for Regression Problems

VC-dimension was defined for indicator functions
Can be extended to real-valued functions, e.g. a third-order polynomial for univariate regression:
f(x, w, b) = w3 x³ + w2 x² + w1 x + b
linear parameterization → VC-dim = 4
Qualitatively, the VC-dimension ~ the ability to fit (or explain) finite training data for regression.

Example: what is the VC-dim of kNN Regression?

Ten training samples from y = x² + 0.1x + N(0, σ²), where σ = 0.25
Using k-nn regression with k=1 and k=4:

[Figure: k-nn regression estimates with k=1 and k=4 on the ten training samples]
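A minimal k-nn regression sketch for this example (the sampling of x-values and the random seed are assumptions):

```python
# k-nn regression for the example above (k = 1 vs k = 4).
import numpy as np

rng = np.random.default_rng(3)
x_train = np.sort(rng.uniform(0, 1, 10))                      # ten training samples
y_train = x_train ** 2 + 0.1 * x_train + rng.normal(0, 0.25, 10)

def knn_predict(x_query, k):
    """Average the y-values of the k nearest training points for each query x."""
    preds = []
    for xq in np.atleast_1d(x_query):
        nearest = np.argsort(np.abs(x_train - xq))[:k]
        preds.append(y_train[nearest].mean())
    return np.array(preds)

x_grid = np.linspace(0, 1, 5)
print("k=1:", np.round(knn_predict(x_grid, k=1), 3))           # interpolates the data
print("k=4:", np.round(knn_predict(x_grid, k=4), 3))           # smoother estimate
```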

OUTLINE of Set 4
Objectives and Overview
Inductive Learning Problem Setting
Keep-It-Direct Principle
Analysis of ERM
VC-dimension
Generalization Bounds
Structural Risk Minimization (SRM)

Recall consistency of ERM

Two types of VC-bounds:
(1) How close is the empirical risk R_emp(w*) to the true risk R(w*)?
(2) How close is the empirical risk to the minimal possible risk?

Generalization Bounds

Bounds for learning machines (implementing ERM) evaluate the difference between the (unknown) risk and the known empirical risk, as a function of sample size n and general properties of admissible models (their VC-dimension)

Classification: the following bound holds with probability 1 - η for all approximating functions:
R(w) ≤ R_emp(w) + Φ(R_emp(w), n/h, ln η / n)
where Φ is called the confidence interval

Regression: the following bound holds with probability 1 - η for all approximating functions:
R(w) ≤ R_emp(w) / (1 - c·√ε)₊
where ε = a1 · [ h (ln(a2·n/h) + 1) - ln(η/4) ] / n

Practical VC Bound for regression

A practical regression bound can be obtained by setting the confidence level η = min(4/√n, 1) and appropriate values of the theoretical constants:
R(h) ≤ R_emp(h) / [ 1 - √( h/n - (h/n)·ln(h/n) + (ln n)/(2n) ) ]₊
Can be used for model selection (examples given later)
Compare to analytic bounds (SC, FPE) in Lecture Set 2
Analysis (of the denominator) shows that, in practice, h < 0.8 n for any estimator
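A sketch of model selection with this practical bound (assuming, as an illustration, that the VC-dimension of a degree-m polynomial equals m + 1 and using squared loss):

```python
# Model selection with the practical VC regression bound above.
import numpy as np

def vc_penalty(h, n):
    """Penalization factor [1 - sqrt(p - p*ln(p) + ln(n)/(2n))]^-1 with p = h/n."""
    p = h / n
    val = 1.0 - np.sqrt(p - p * np.log(p) + np.log(n) / (2 * n))
    return np.inf if val <= 0 else 1.0 / val

rng = np.random.default_rng(4)
n = 25
x = rng.uniform(0, 1, n)
y = np.sin(2 * np.pi * x) ** 2 + 0.1 * rng.standard_normal(n)    # noisy sine-squared data

best_deg, best_bound = None, np.inf
for deg in range(1, 11):
    w = np.polyfit(x, y, deg)
    r_emp = np.mean((np.polyval(w, x) - y) ** 2)                  # empirical risk (MSE)
    bound = r_emp * vc_penalty(h=deg + 1, n=n)                    # estimated upper bound on risk
    if bound < best_bound:
        best_deg, best_bound = deg, bound
print("degree selected by the VC bound:", best_deg)
```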

VC Regression Bound for model selection

The VC-bound can be used for analytic model selection (if the VC-dimension is known)
Example: polynomial regression for estimating the Sine-Squared target function from 25 noisy samples
Optimal model found: 6th-degree polynomial (no resampling needed)

[Figure: 6th-degree polynomial fit to 25 noisy samples of the sine-squared target]

Modeling pure noise with x in [0, 1] via polynomial regression, sample size n = 30

[Figure: box plots of prediction risk (MSE, log scale) and selected DoF for the fpe, gcv, vc and cv model selection methods]

Comparison of different model selection methods:
- prediction risk (MSE)
- selected DoF (~ h)

OUTLINE of Set 4
Objectives and Overview
Inductive Learning Problem Setting
Keep-It-Direct Principle
Analysis of ERM
VC-dimension
Generalization Bounds
Structural Risk Minimization (SRM)

Structural Risk Minimization

Analysis of the generalization bound
R(w) ≤ R_emp(w) + Φ(R_emp(w), n/h, ln η / n)
suggests that when n/h is large, the term Φ is small, so R(w) ~ R_emp(w)
→ This leads to the parametric modeling approach (ERM)
When n/h is not large (say, less than 20), both terms on the right-hand side of the VC-bound need to be minimized
→ make the VC-dimension a controlling variable
SRM = formal mechanism for controlling model complexity:
the set of admissible models f(x, w) has a nested structure
S1 ⊂ S2 ⊂ ... ⊂ Sk ⊂ ...   such that   h1 ≤ h2 ≤ ... ≤ hk ≤ ...

Structural Risk Minimization

An upper bound on the true risk and the empirical risk, as a function of VC-dimension h (for fixed sample size n)

SRM vs ERM modeling


SRM Approach

Use the VC-dimension as a controlling parameter for minimizing the VC bound:
R(w) ≤ R_emp(w) + Φ(n/h)
Two general strategies for implementing SRM:
1. Keep Φ(n/h) fixed and minimize R_emp(w)
   (most statistical and neural network methods)
2. Keep R_emp(w) fixed and minimize Φ(n/h)
   (Support Vector Machines)

Common SRM structures

Dictionary structure
A set of algebraic polynomials f_m(x, w, b) = b + Σ_{i=1}^m w_i x^i is a structure, since f_1 ⊂ f_2 ⊂ ... ⊂ f_k ⊂ ...
More generally, f_m(x, w, V, b) = b + Σ_{i=1}^m w_i g(x, v_i), where g(x, v_i) is a set of basis functions (dictionary).
The number of terms (basis functions) m specifies an element of the structure.
For fixed basis functions, VC-dim ~ number of parameters.
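A short sketch of a dictionary structure (the toy target function and sample size are arbitrary): empirical risk is minimized within each nested element S_m, and can only decrease as m grows.

```python
# Dictionary structure: nested polynomial models f_1 c f_2 c ... c f_6.
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 1, 40)
y = 0.8 * np.sin(2 * np.pi * x) + 0.5 * x + 0.05 * rng.standard_normal(40)

for m in range(1, 7):                        # elements S_1, ..., S_6 of the structure
    w = np.polyfit(x, y, m)                  # minimize empirical risk within S_m
    r_emp = np.mean((np.polyval(w, x) - y) ** 2)
    print(f"element S_{m} (degree {m}, VC-dim ~ {m + 1}): R_emp = {r_emp:.5f}")
```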

Common SRM structures

Feature selection (aka subset selection)
Consider sparse polynomials of degree m:
- for m=1: f_1(x, w, b, k1) = b + w x^{k1}
- for m=2: f_2(x, w, b, k1, k2) = b + w1 x^{k1} + w2 x^{k2}
- etc.
Each monomial is a feature. The goal is to select a set of m features providing minimum empirical risk (MSE).
This is a structure since f_1 ⊂ f_2 ⊂ ... ⊂ f_m ⊂ ...
More generally, f_m(x, w, V) = Σ_{i=1}^m w_i g(x, v_i), where m basis functions are selected from a (large) set of M functions.
Note: nonlinear optimization; the VC-dimension is unknown (generally larger than the number of selected features m).

Common SRM structures

Penalization
Consider an algebraic polynomial of fixed degree 10: f(x, w) = Σ_{i=0}^{10} w_i x^i, with ‖w‖² ≤ c_k, where c1 < c2 < c3 < ...
For each (positive) value c_k this set of functions specifies an element of a structure: S_k = { f(x, w) : ‖w‖² ≤ c_k }
Minimization of empirical risk (MSE) on each element S_k of the structure is a constrained minimization problem
This optimization problem can be equivalently stated as minimization of the penalized empirical risk functional R_pen(w) = R_emp(w) + λ_k ‖w‖², where the choice of λ_k ~ c_k
Note: VC-dimension is unknown
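A sketch of the penalization structure (ridge regression on a fixed degree-10 polynomial; the toy data and lambda grid are illustrative): decreasing lambda corresponds to elements S_k with larger ‖w‖² and lower empirical risk.

```python
# Penalization structure: ridge regression on a fixed degree-10 polynomial.
import numpy as np

rng = np.random.default_rng(6)
n = 40
x = rng.uniform(0, 1, n)
y = 0.8 * np.sin(2 * np.pi * x) + 0.5 * x + 0.05 * rng.standard_normal(n)

X = np.vander(x, N=11, increasing=True)             # design matrix: 1, x, ..., x^10

for lam in (1e1, 1e-1, 1e-3, 1e-5):
    # minimize R_emp(w) + lam * ||w||^2  (closed-form ridge solution)
    w = np.linalg.solve(X.T @ X + lam * n * np.eye(11), X.T @ y)
    r_emp = np.mean((X @ w - y) ** 2)
    print(f"lambda = {lam:7.0e}  ||w||^2 = {w @ w:9.3f}  R_emp = {r_emp:.5f}")
```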

Example: SRM structures for regression

Regression data set
x-values ~ uniformly sampled in [0, 1]
y-values ~ target function y = 0.8 sin(2πx) + 0.2x² + 0.5x
additive Gaussian noise with st. dev 0.05
Experimental set-up
training set ~ 40 samples
validation set ~ 40 samples (for model selection)
SRM structures defined on algebraic polynomials:
- dictionary (polynomial degrees 1 to 10)
- penalization (fixed degree-10 polynomial)

Estimated models using different SRM structures:
- dictionary
- penalization (lambda = 1.013e-005)
- sparse polynomial (feature selection)
Visual results: target function ~ red line, feature selection ~ black solid, dictionary ~ green, penalization ~ yellow line

[Figure: estimated models vs the target function on [0, 1]]

SRM Summary
SRM structure ~ complexity ordering on a set of admissible models (approximating functions)
Many different structures are possible on the same set of approximating functions (possible models)
How to choose the best structure?
- depends on application data
- VC theory cannot provide the answer
SRM = mechanism for complexity control
- selecting optimal complexity for a given data set

OUTLINE of Set 4
Objectives and Overview
Inductive Learning Problem Setting
Keep-It-Direct Principle
Analysis of ERM
VC-dimension
Generalization Bounds
Structural Risk Minimization (SRM)

Summary and Discussion: VC-theory

Methodology
- learning problem setting (KID principle)
- concepts (risk minimization, VC-dimension, structure)
Interpretation/evaluation of existing methods
Model selection using VC-bounds
New types of inference (TBD later)
What the theory cannot do:
- provide formalization (for a given application)
- select a good structure
- there is always a gap between theory and applications
