
Introduction to Predictive Learning

LECTURE SET 4
Statistical Learning Theory

Electrical and Computer Engineering



OUTLINE of Set 4
Objectives and Overview
Inductive Learning Problem Setting
Keep-It-Direct Principle
Analysis of ERM
VC-dimension
Generalization Bounds
Structural Risk Minimization (SRM)
Summary and Discussion

Objectives
Problems with philosophical approaches:
- lack quantitative description/characterization of ideas;
- no real predictive power (unlike the Natural Sciences);
- no agreement on basic definitions/concepts (unlike the Natural Sciences).

Goal: to introduce Predictive Learning as a scientific discipline.

Characteristics of Scientific Theory
Problem setting
Solution approach
Math proofs (technical analysis)
Constructive methods
Applications
Note: Problem Setting and Solution Approach are independent (of each other).

History and Overview


SLT, aka VC-theory (Vapnik-Chervonenkis)
Theory for estimating dependencies from finite samples (predictive learning setting)
Based on the risk minimization approach
All main results originally developed in the 1970s for classification (pattern recognition) - why? - but remained largely unknown
Recent renewed interest due to the practical success of Support Vector Machines (SVM)

History and Overview (cont'd)
MAIN CONCEPTUAL CONTRIBUTIONS
Distinction between problem setting, inductive principle and learning algorithms
Direct approach to estimation with finite data (KID principle)
Math analysis of ERM (standard inductive setting)
Two factors responsible for generalization:
- empirical risk (fitting error)
- complexity (capacity) of approximating functions

Importance of VC-theory
Math results addressing the main question:
- under what general conditions does the ERM approach lead to (good) generalization?
New approach to induction:
- predictive vs generative modeling (in classical statistics)
Connection to philosophy of science:
- VC-theory developed for binary classification (pattern recognition) ~ the simplest generalization problem
- natural sciences: from observations to scientific law

Inductive Learning Setting


The learning machine observes samples (x, y) and returns an estimated response ŷ = f(x, w)
Two modes of inference: identification vs imitation
Risk: R(w) = ∫ Loss(y, f(x, w)) dP(x, y) → min

The Problem of Inductive Learning

Given: finite training samples Z = {(x_i, y_i), i = 1, 2, ..., n}, choose from a given set of functions f(x, w) the one that best approximates the true output (in the sense of risk minimization).
Concepts and Terminology:
- approximating functions f(x, w)
- (non-negative) loss function L(f(x, w), y)
- expected risk functional R(w)
Goal: find the function f(x, w_0) minimizing R(w) when the joint distribution P(x, y) is unknown.

Empirical Risk Minimization

ERM principle in model-based learning:
Model parameterization: f(x, w)
Loss function: L(f(x, w), y)
Estimate risk from data: R_emp(w) = (1/n) Σ_{i=1}^n L(f(x_i, w), y_i)
Choose w* that minimizes R_emp
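A minimal sketch of the ERM principle (illustration only; the polynomial model, squared loss, and toy data below are assumptions, not taken from the slides): estimate R_emp from data and pick the parameters that minimize it.

```python
# Minimal ERM sketch: squared loss, polynomial model f(x, w).
import numpy as np

def empirical_risk(w, x, y):
    """R_emp(w) = (1/n) * sum of squared losses over the training sample."""
    preds = np.polyval(w, x)                 # f(x, w): polynomial with coefficients w
    return np.mean((preds - y) ** 2)

# Toy training sample (x_i, y_i), i = 1..n
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 30)
y = np.sin(2 * np.pi * x) ** 2 + 0.1 * rng.standard_normal(30)

# ERM: choose w* minimizing R_emp over 3rd-degree polynomials
# (for squared loss this is an ordinary least-squares problem)
w_star = np.polyfit(x, y, deg=3)
print("R_emp(w*) =", empirical_risk(w_star, x, y))
```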

Statistical Learning Theory developed from the theoretical analysis of the ERM principle.

Probabilistic Modeling vs ERM


Probabilistic Modeling vs ERM: Example

Known class distribution → optimal decision boundary

[Figure: two-class data in the (x1, x2) plane with the optimal decision boundary]

Probabilistic Approach

Estimate parameters of Gaussian class distributions, and plug them into the quadratic decision boundary

[Figure: estimated quadratic decision boundary in the (x1, x2) plane]

ERM Approach

Quadratic and linear decision boundaries estimated via minimization of squared loss

[Figure: quadratic and linear decision boundaries in the (x1, x2) plane]

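A rough sketch of the contrast above, assuming synthetic 2-D Gaussian classes (the data and any numbers are illustrative, not the ones plotted on the slides): a plug-in quadratic discriminant estimated from class distributions vs a linear decision function fit directly by squared-loss ERM.

```python
# Probabilistic (generative) modeling vs direct ERM on synthetic 2-D Gaussian classes.
import numpy as np

rng = np.random.default_rng(1)
n = 50
X0 = rng.multivariate_normal([0, 0], np.eye(2), n)    # class -1
X1 = rng.multivariate_normal([2, 2], np.eye(2), n)    # class +1
X = np.vstack([X0, X1])
y = np.hstack([-np.ones(n), np.ones(n)])

# Probabilistic approach: estimate class means/covariances, plug into the
# (quadratic) discriminant; equal class priors are assumed here.
def plugin_quadratic_score(x):
    score = 0.0
    for Xc, sign in [(X1, +1), (X0, -1)]:
        mu, cov = Xc.mean(axis=0), np.cov(Xc.T)
        d = x - mu
        logdens = -0.5 * d @ np.linalg.solve(cov, d) - 0.5 * np.log(np.linalg.det(cov))
        score += sign * logdens
    return score                                        # positive -> class +1

# ERM approach: fit a linear decision function by minimizing squared loss directly.
A = np.hstack([X, np.ones((2 * n, 1))])                 # features [x1, x2, 1]
w = np.linalg.lstsq(A, y, rcond=None)[0]
erm_pred = np.sign(A @ w)

gen_pred = np.array([np.sign(plugin_quadratic_score(x)) for x in X])
print("training error, plug-in quadratic:", np.mean(gen_pred != y))
print("training error, ERM linear (squared loss):", np.mean(erm_pred != y))
```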

Estimation of multivariate functions

Is it possible to estimate a function from finite data?
Simplified problem: estimation of an unknown continuous function from noise-free samples
Many results from function approximation theory:
- To estimate a d-dimensional function accurately, one needs on the order of n^d data points (where n is the number of points needed per dimension)
- For example, if 3 points are needed to estimate a 2nd-order polynomial for d=1, then 3^10 points are needed to estimate a 2nd-order polynomial in 10-dimensional space
- Similar results in signal processing
Never enough data points to estimate multivariate functions in most practical applications (image recognition, genomics, etc.)
For multivariate function estimation, the number of free parameters increases exponentially with problem dimensionality (the curse of dimensionality).

Properties of high-dimensional data

Sparse data looks like a porcupine: the volume of a unit sphere inscribed in a d-dimensional cube gets smaller even as the volume of the d-cube gets exponentially larger
A point is closer to an edge than to another point
Pairwise distances between points become nearly the same
Intuition behind kernel (local) methods no longer holds
How is generalization possible, in spite of the curse of dimensionality?
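A small numerical sketch of these effects (illustration only; the sample sizes and dimensions are arbitrary choices): it estimates the volume fraction of the inscribed sphere and the relative spread of pairwise distances as the dimension d grows.

```python
# Sketch of high-dimensional geometry effects.
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 50):
    # Monte-Carlo estimate of the volume fraction of the sphere inscribed in [0,1]^d
    pts = rng.uniform(0, 1, size=(100_000, d))
    in_sphere = np.mean(np.sum((pts - 0.5) ** 2, axis=1) <= 0.25)

    # Concentration of pairwise distances among 200 random points
    sample = rng.uniform(0, 1, size=(200, d))
    dists = np.linalg.norm(sample[:, None, :] - sample[None, :, :], axis=-1)
    dists = dists[np.triu_indices(200, k=1)]
    print(f"d={d:3d}  inscribed-sphere volume fraction ~ {in_sphere:.4f}  "
          f"relative spread of distances ~ {dists.std() / dists.mean():.3f}")
```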

OUTLINE of Set 4
Objectives and Overview
Inductive Learning Problem Setting
Keep-It-Direct Principle
Analysis of ERM
VC-dimension
Generalization Bounds
Structural Risk Minimization (SRM)
Summary and Discussion

Keep-It-Direct Principle

The goal of learning is generalization rather than estimation of the true function (system identification):
R(w) = ∫ Loss(y, f(x, w)) dP(x, y) → min
Keep-It-Direct Principle (Vapnik, 1995):
Do not solve an estimation problem of interest by solving a more general (harder) problem as an intermediate step.
A good predictive model reflects some properties of the unknown distribution P(x, y).
Since model estimation with finite data is ill-posed, one should never try to solve a more general problem than required by the given application.

Learning vs System Identification

Consider the regression problem y = g(x) + noise, where the unknown target function is g(x) = E(y|x)

Goal 1: Prediction
R(w) = ∫ (y - f(x, w))² dP(x, y) → min

Goal 2: Function Approximation (system identification)
R(w) = ∫ (f(x, w) - g(x))² dx → min,  or  ‖f(x, w) - E(y|x)‖ → min

Admissible models: algebraic polynomials
Purpose of comparison: contrast goals (1) and (2)
NOTE: most applications assume Goal 2, i.e. Noisy Data ~ true signal + noise

Empirical Comparison

Target function: sine-squared, g(x) = sin²(2πx), x ∈ [0, 1]

[Figure: sine-squared target function on [0, 1]]

Input distribution: non-uniform Gaussian pdf
Additive Gaussian noise with st. deviation = 0.1

Empirical Comparison (cont'd)

Model selection: use separate data sets
- training: for parameter estimation
- validation: for selecting polynomial degree
- test: for estimating prediction risk (MSE)
Validation set generated differently to contrast (1) & (2):
- Predictive Learning (1) ~ Gaussian sampling
- Function Approximation (2) ~ uniform fixed sampling
Training + test data ~ Gaussian
Training set size: 30; Validation set size: 30
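A hedged sketch of this experiment (the Gaussian input parameters and random seed are assumptions, not the exact settings used on the slides): polynomial degree is selected twice, once with a validation set drawn from the same Gaussian input pdf (prediction goal) and once with a uniform fixed validation grid (function approximation goal).

```python
# Sketch of the predictive-learning vs function-approximation comparison.
import numpy as np

rng = np.random.default_rng(2)
target = lambda x: np.sin(2 * np.pi * x) ** 2          # sine-squared target
noise = 0.1

def sample_gaussian_x(n):
    # non-uniform input density: Gaussian centered at 0.5, clipped to [0, 1] (assumed)
    return np.clip(rng.normal(0.5, 0.2, n), 0.0, 1.0)

x_tr = sample_gaussian_x(30); y_tr = target(x_tr) + noise * rng.standard_normal(30)

# Two validation sets, reflecting the two goals
x_val_pred = sample_gaussian_x(30)                      # Goal 1: same (Gaussian) input pdf
y_val_pred = target(x_val_pred) + noise * rng.standard_normal(30)
x_val_fa = np.linspace(0, 1, 30)                        # Goal 2: uniform fixed sampling
y_val_fa = target(x_val_fa) + noise * rng.standard_normal(30)

def best_degree(x_val, y_val):
    errs = []
    for deg in range(1, 11):
        w = np.polyfit(x_tr, y_tr, deg)
        errs.append(np.mean((np.polyval(w, x_val) - y_val) ** 2))
    return 1 + int(np.argmin(errs))

print("degree chosen for prediction goal     :", best_degree(x_val_pred, y_val_pred))
print("degree chosen for function approx goal:", best_degree(x_val_fa, y_val_fa))
```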

Regression estimates (2 typical realizations of data):
- Dotted line ~ estimate obtained using the predictive learning setting
- Dashed line ~ estimate obtained using the function approximation setting

Conclusion

The goal of prediction (1) is different from (and less demanding than) the goal of estimating the true target function (2) everywhere in the input space.
The curse of dimensionality applies to the system identification setting (2), but may not hold under the predictive setting (1).
Both settings coincide if the input distribution is uniform (e.g., in signal and image denoising applications).

Philosophical Interpretation of KID

Interpretation of predictive models:
- Realism ~ objective truth (hidden in Nature)
- Instrumentalism ~ creation of the human mind (imposed on the data) - favored by KID
Objective evaluation is still possible (via prediction risk reflecting application needs) → Natural Science
Methodological implications:
- Importance of good learning formulations (asking the right question)

OUTLINE of Set 4
Objectives and Overview
Inductive Learning Problem Setting
Keep-It-Direct Principle
Analysis of ERM
VC-dimension
Generalization Bounds
Structural Risk Minimization (SRM)
Summary and Discussion

VC-theory has 4 parts:

1. Analysis of consistency/convergence of ERM:
   R_emp(w) = (1/n) Σ_{i=1}^n L(y_i, f(x_i, w)) → min
2. Generalization bounds
3. Inductive principles (for finite samples)
4. Constructive methods (learning algorithms) for implementing (3)
NOTE: (1) → (2) → (3) → (4)

Consistency/Convergence of ERM

Empirical Risk is known, but Expected Risk is unknown
Asymptotic consistency requirement: under what (general) conditions will models providing min Empirical Risk also provide min Prediction Risk, as the number of samples grows large?
Why is asymptotic analysis needed?
- helps to develop useful concepts
- necessary and sufficient conditions ensure that VC-theory is general and cannot be improved

Consistency of ERM

Convergence of the empirical risk R_emp to the expected risk R does not imply consistency of ERM
Models estimated via ERM (w*) are always biased estimates of the functions minimizing true risk:
R_emp(w_n*) ≤ R(w_n*)

Conditions for Consistency of ERM

Main insight: consistency is not possible without restricting the set of possible models
Example: the 1-nearest-neighbor classification method - is it consistent?
Consider binary decision functions (classification): how to measure their flexibility, or ability to explain the training data (for binary classification)?
Such a complexity index for indicator functions:
- is independent of the unknown data distribution;
- measures the capacity of a set of possible models, rather than characteristics of the true model

OUTLINE of Set 4
Objectives and Overview
Inductive Learning Problem Setting
Keep-It-Direct Principle
Analysis of ERM
VC-dimension
Generalization Bounds
Structural Risk Minimization (SRM)
Summary and Discussion

SHATTERING

Linear indicator functions can split 3 data points in 2D in all 2^3 = 8 possible binary partitions
If a set of n samples can be separated by a set of functions in all 2^n possible ways, this sample is said to be shattered (by the set of functions)
Shattering ~ a set of models can explain a given sample of size n (for all possible labelings)
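A brute-force check of shattering for linear indicator functions in 2-D (a sketch; the three sample points and the use of a linear-programming feasibility test are choices made here, not part of the slides):

```python
# Check that linear indicator functions shatter 3 points in general position in 2-D.
import itertools
import numpy as np
from scipy.optimize import linprog

def separable(points, labels):
    """Return True if some line w.x + b = 0 separates labels +1 / -1 strictly."""
    # Feasibility LP: find (w1, w2, b) with y_i * (w . x_i + b) >= 1 for all i
    A_ub = np.array([[-y * x[0], -y * x[1], -y] for x, y in zip(points, labels)])
    b_ub = -np.ones(len(points))
    res = linprog(c=[0, 0, 0], A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3, method="highs")
    return res.success

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])    # 3 points in general position
shattered = all(separable(points, labels)
                for labels in itertools.product([-1, 1], repeat=len(points)))
print("all 2^3 = 8 labelings separable:", shattered)        # expected: True
```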

VC DIMENSION

Definition: A set of functions has VC-dimension h if there exist h samples that can be shattered by this set of functions, but there are no h+1 samples that can be shattered
VC-dimension h = 3 for linear indicator functions in 2D (h = d+1 for linear functions in d dimensions)
VC-dim. is a positive integer (combinatorial index)

VC-dimension and Consistency of ERM

VC-dimension is infinite if, for any n, a sample of size n can be split in all 2^n possible ways (in this case, no valid generalization is possible)
Finite VC-dimension gives necessary and sufficient conditions for:
(1) consistency of ERM-based learning
(2) fast rate of convergence
(these conditions are distribution-independent)
Interpretation of the VC-dimension via falsifiability:

VC-dimension and Falsifiability

A set of functions has VC-dimension h if
(a) it can explain (shatter) some set of h samples
~ there exist h samples that cannot falsify it, and
(b) it cannot shatter h+1 samples
~ any h+1 samples falsify this set
→ Finiteness of the VC-dimension is a necessary and sufficient condition for generalization (for any learning method based on ERM)

Recall Occam's Razor

Main problem in predictive learning:
- complexity control (model selection)
- how to measure complexity?
Interpretation of Occam's razor (in Statistics):
- Entities ~ model parameters
- Complexity ~ degrees-of-freedom
- Necessity ~ explaining (fitting) available data
→ Model complexity = number of parameters (DoF), consistent with the classical statistical view

Philosophical Principle of VC-falsifiability

Occam's Razor: select the model that explains the available data and has the smallest number of free parameters (entities)
VC theory: select the model that explains the available data and has low VC-dimension (i.e. can be easily falsified)
→ New principle of VC-falsifiability

Calculating the VC-dimension

How to estimate the VC-dimension (for a given set of functions)?
Apply the definition (via shattering) to derive analytic estimates - this works for simple sets of functions
Generally, such analytic estimates are not possible for complex nonlinear parameterizations (i.e., for practical machine learning and statistical methods)

Example: VC-dimension of spherical indicator functions

Consider spherical decision surfaces in a d-dimensional x-space, parameterized by center c and radius r:
f(x, c, r) = I(‖x - c‖² ≤ r²)
In a 2-dim space (d=2) there exist 3 points that can be shattered, but 4 points cannot be shattered → h = 3

Example: VC-dimension of a linear combination of fixed basis functions (i.e. polynomials, Fourier expansion, etc.)
Assuming that the basis functions are linearly independent, the VC-dim equals the number of basis functions (free parameters).

Example: single parameter but infinite VC-dimension:
f(x, w) = I(sin(wx) > 0)
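The infinite VC-dimension of sin(wx) can be checked numerically with the classical construction (points x_i = 10^-i and an explicit choice of w); the construction is standard but not spelled out on the slides.

```python
# Numerical check that f(x, w) = I(sin(wx) > 0) can realize any labeling of the
# points x_i = 10^-i (classical construction).
import itertools
import numpy as np

n = 6
x = np.array([10.0 ** -(i + 1) for i in range(n)])           # x_1 .. x_n

def w_for_labels(labels):
    # labels in {+1, -1}; w = pi * (1 + sum_i ((1 - y_i)/2) * 10^i)
    return np.pi * (1 + sum(((1 - y) // 2) * 10 ** (i + 1)
                            for i, y in enumerate(labels)))

ok = True
for labels in itertools.product([-1, 1], repeat=n):
    w = w_for_labels(labels)
    preds = np.where(np.sin(w * x) > 0, 1, -1)
    ok &= np.array_equal(preds, np.array(labels))
print(f"all 2^{n} labelings realized by sin(wx):", ok)        # expected: True
```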

Example: Wide linear decision boundaries

Consider linear decision functions D(x) = (w · x) + b such that the distance between the boundary D(x) = 0 and the closest data sample is larger than a given value Δ (the margin)
Then the VC-dimension depends on the width parameter Δ, rather than on d (as in linear models):
h ≤ min(R²/Δ², d) + 1
where R is the radius of the smallest sphere containing the training data

Linear combination of fixed basis functions

f(x, w) = I( Σ_{i=1}^m w_i g_i(x) + w_0 > 0 )
is equivalent to linear functions in m-dimensional space
→ VC-dimension = m + 1 (this assumes linear independence of the basis functions)
In general, analytic estimation of the VC-dimension is hard
VC-dimension can be equal to, smaller than, or larger than DoF (see next slide)

VC-dimension vs number of parameters

VC-dimension can be equal to DoF (number of parameters)
Example: linear estimators
VC-dimension can be smaller than DoF
Example: penalized estimators
VC-dimension can be larger than DoF
Examples: feature selection; sin(wx)

VC-dimension for Regression Problems

VC-dimension was defined for indicator functions
Can be extended to real-valued functions, e.g. a third-order polynomial for univariate regression:
f(x, w, b) = w3 x³ + w2 x² + w1 x + b
linear parameterization → VC-dim = 4
Qualitatively, the VC-dimension ~ the ability to fit (or explain) finite training data for regression.

Example: what is the VC-dim of kNN Regression?

Ten training samples from y = x² + 0.1x + N(0, σ²), where σ = 0.25
Using k-nn regression with k=1 and k=4:

[Figure: k-nn regression estimates with k=1 and k=4 on the ten training samples]
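A minimal k-nn regression sketch for this example (the sampling of x-values and the random seed are assumptions):

```python
# k-nn regression for the example above (k = 1 vs k = 4).
import numpy as np

rng = np.random.default_rng(3)
x_train = np.sort(rng.uniform(0, 1, 10))                      # ten training samples
y_train = x_train ** 2 + 0.1 * x_train + rng.normal(0, 0.25, 10)

def knn_predict(x_query, k):
    """Average the y-values of the k nearest training points for each query x."""
    preds = []
    for xq in np.atleast_1d(x_query):
        nearest = np.argsort(np.abs(x_train - xq))[:k]
        preds.append(y_train[nearest].mean())
    return np.array(preds)

x_grid = np.linspace(0, 1, 5)
print("k=1:", np.round(knn_predict(x_grid, k=1), 3))           # interpolates the data
print("k=4:", np.round(knn_predict(x_grid, k=4), 3))           # smoother estimate
```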

OUTLINE of Set 4
Objectives and Overview
Inductive Learning Problem Setting
Keep-It-Direct Principle
Analysis of ERM
VC-dimension
Generalization Bounds
Structural Risk Minimization (SRM)

Recall consistency of ERM

Two types of VC-bounds:
(1) How close is the empirical risk R_emp(w*) to the true risk R(w*)?
(2) How close is the empirical risk to the minimal possible risk?

Generalization Bounds

Bounds for learning machines (implementing ERM) evaluate the difference between the (unknown) risk and the known empirical risk, as a function of sample size n and general properties of admissible models (their VC-dimension)

Classification: the following bound holds with probability 1 - η for all approximating functions:
R(w) ≤ R_emp(w) + Φ(R_emp(w), n/h, ln η / n)
where Φ is called the confidence interval

Regression: the following bound holds with probability 1 - η for all approximating functions:
R(w) ≤ R_emp(w) / (1 - c·√ε)₊
where ε = a1 · [ h (ln(a2·n/h) + 1) - ln(η/4) ] / n

Practical VC Bound for regression

A practical regression bound can be obtained by setting the confidence level η = min(4/√n, 1) and appropriate values of the theoretical constants:
R(h) ≤ R_emp(h) / [ 1 - √( h/n - (h/n)·ln(h/n) + (ln n)/(2n) ) ]₊
Can be used for model selection (examples given later)
Compare to analytic bounds (SC, FPE) in Lecture Set 2
Analysis (of the denominator) shows that, in practice, h < 0.8 n for any estimator
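A sketch of model selection with this practical bound (assuming, as an illustration, that the VC-dimension of a degree-m polynomial equals m + 1 and using squared loss):

```python
# Model selection with the practical VC regression bound above.
import numpy as np

def vc_penalty(h, n):
    """Penalization factor [1 - sqrt(p - p*ln(p) + ln(n)/(2n))]^-1 with p = h/n."""
    p = h / n
    val = 1.0 - np.sqrt(p - p * np.log(p) + np.log(n) / (2 * n))
    return np.inf if val <= 0 else 1.0 / val

rng = np.random.default_rng(4)
n = 25
x = rng.uniform(0, 1, n)
y = np.sin(2 * np.pi * x) ** 2 + 0.1 * rng.standard_normal(n)    # noisy sine-squared data

best_deg, best_bound = None, np.inf
for deg in range(1, 11):
    w = np.polyfit(x, y, deg)
    r_emp = np.mean((np.polyval(w, x) - y) ** 2)                  # empirical risk (MSE)
    bound = r_emp * vc_penalty(h=deg + 1, n=n)                    # estimated upper bound on risk
    if bound < best_bound:
        best_deg, best_bound = deg, bound
print("degree selected by the VC bound:", best_deg)
```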

VC Regression Bound for model selection

The VC-bound can be used for analytic model selection (if the VC-dimension is known)
Example: polynomial regression for estimating the Sine-Squared target function from 25 noisy samples
Optimal model found: 6th-degree polynomial (no resampling needed)

[Figure: 6th-degree polynomial fit to 25 noisy samples of the sine-squared target]

Modeling pure noise with x in [0, 1] via polynomial regression, sample size n = 30

[Figure: box plots of prediction risk (MSE, log scale) and selected DoF for the fpe, gcv, vc and cv model selection methods]

Comparison of different model selection methods:
- prediction risk (MSE)
- selected DoF (~ h)

OUTLINE of Set 4
Objectives and Overview
Inductive Learning Problem Setting
Keep-It-Direct Principle
Analysis of ERM
VC-dimension
Generalization Bounds
Structural Risk Minimization (SRM)

Structural Risk Minimization

Analysis of the generalization bound
R(w) ≤ R_emp(w) + Φ(R_emp(w), n/h, ln η / n)
suggests that when n/h is large, the term Φ is small, so R(w) ~ R_emp(w)
→ This leads to the parametric modeling approach (ERM)
When n/h is not large (say, less than 20), both terms on the right-hand side of the VC-bound need to be minimized
→ make the VC-dimension a controlling variable
SRM = formal mechanism for controlling model complexity:
the set of admissible models f(x, w) has a nested structure
S1 ⊂ S2 ⊂ ... ⊂ Sk ⊂ ...   such that   h1 ≤ h2 ≤ ... ≤ hk ≤ ...

Structural Risk Minimization

An upper bound on the true risk and the empirical risk, as a function of VC-dimension h (for fixed sample size n)

SRM vs ERM modeling


SRM Approach

Use the VC-dimension as a controlling parameter for minimizing the VC bound:
R(w) ≤ R_emp(w) + Φ(n/h)
Two general strategies for implementing SRM:
1. Keep Φ(n/h) fixed and minimize R_emp(w)
   (most statistical and neural network methods)
2. Keep R_emp(w) fixed and minimize Φ(n/h)
   (Support Vector Machines)

Common SRM structures

Dictionary structure
A set of algebraic polynomials f_m(x, w, b) = b + Σ_{i=1}^m w_i x^i is a structure, since f_1 ⊂ f_2 ⊂ ... ⊂ f_k ⊂ ...
More generally, f_m(x, w, V, b) = b + Σ_{i=1}^m w_i g(x, v_i), where g(x, v_i) is a set of basis functions (dictionary).
The number of terms (basis functions) m specifies an element of the structure.
For fixed basis functions, VC-dim ~ number of parameters.
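A short sketch of a dictionary structure (the toy target function and sample size are arbitrary): empirical risk is minimized within each nested element S_m, and can only decrease as m grows.

```python
# Dictionary structure: nested polynomial models f_1 c f_2 c ... c f_6.
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 1, 40)
y = 0.8 * np.sin(2 * np.pi * x) + 0.5 * x + 0.05 * rng.standard_normal(40)

for m in range(1, 7):                        # elements S_1, ..., S_6 of the structure
    w = np.polyfit(x, y, m)                  # minimize empirical risk within S_m
    r_emp = np.mean((np.polyval(w, x) - y) ** 2)
    print(f"element S_{m} (degree {m}, VC-dim ~ {m + 1}): R_emp = {r_emp:.5f}")
```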

Common SRM structures

Feature selection (aka subset selection)
Consider sparse polynomials of degree m:
- for m=1: f_1(x, w, b, k1) = b + w x^{k1}
- for m=2: f_2(x, w, b, k1, k2) = b + w1 x^{k1} + w2 x^{k2}
- etc.
Each monomial is a feature. The goal is to select a set of m features providing minimum empirical risk (MSE).
This is a structure since f_1 ⊂ f_2 ⊂ ... ⊂ f_m ⊂ ...
More generally, f_m(x, w, V) = Σ_{i=1}^m w_i g(x, v_i), where m basis functions are selected from a (large) set of M functions.
Note: nonlinear optimization; the VC-dimension is unknown (generally larger than the number of selected features m).

Common SRM structures

Penalization
Consider an algebraic polynomial of fixed degree 10: f(x, w) = Σ_{i=0}^{10} w_i x^i, with ‖w‖² ≤ c_k, where c1 < c2 < c3 < ...
For each (positive) value c_k this set of functions specifies an element of a structure: S_k = { f(x, w) : ‖w‖² ≤ c_k }
Minimization of empirical risk (MSE) on each element S_k of the structure is a constrained minimization problem
This optimization problem can be equivalently stated as minimization of the penalized empirical risk functional R_pen(w) = R_emp(w) + λ_k ‖w‖², where the choice of λ_k ~ c_k
Note: VC-dimension is unknown
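A sketch of the penalization structure (ridge regression on a fixed degree-10 polynomial; the toy data and lambda grid are illustrative): decreasing lambda corresponds to elements S_k with larger ‖w‖² and lower empirical risk.

```python
# Penalization structure: ridge regression on a fixed degree-10 polynomial.
import numpy as np

rng = np.random.default_rng(6)
n = 40
x = rng.uniform(0, 1, n)
y = 0.8 * np.sin(2 * np.pi * x) + 0.5 * x + 0.05 * rng.standard_normal(n)

X = np.vander(x, N=11, increasing=True)             # design matrix: 1, x, ..., x^10

for lam in (1e1, 1e-1, 1e-3, 1e-5):
    # minimize R_emp(w) + lam * ||w||^2  (closed-form ridge solution)
    w = np.linalg.solve(X.T @ X + lam * n * np.eye(11), X.T @ y)
    r_emp = np.mean((X @ w - y) ** 2)
    print(f"lambda = {lam:7.0e}  ||w||^2 = {w @ w:9.3f}  R_emp = {r_emp:.5f}")
```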

Example: SRM structures for regression

Regression data set
x-values ~ uniformly sampled in [0, 1]
y-values ~ target function y = 0.8 sin(2πx) + 0.2x² + 0.5x
additive Gaussian noise with st. dev 0.05
Experimental set-up
training set ~ 40 samples
validation set ~ 40 samples (for model selection)
SRM structures defined on algebraic polynomials:
- dictionary (polynomial degrees 1 to 10)
- penalization (fixed degree-10 polynomial)

Estimated models using different SRM structures:
- dictionary
- penalization (lambda = 1.013e-005)
- sparse polynomial (feature selection)
Visual results: target function ~ red line, feature selection ~ black solid, dictionary ~ green, penalization ~ yellow line

[Figure: estimated models vs the target function on [0, 1]]

SRM Summary
SRM structure ~ complexity ordering on a set of admissible models (approximating functions)
Many different structures are possible on the same set of approximating functions (possible models)
How to choose the best structure?
- depends on application data
- VC theory cannot provide the answer
SRM = mechanism for complexity control
- selecting optimal complexity for a given data set

OUTLINE of Set 4
Objectives and Overview
Inductive Learning Problem Setting
Keep-It-Direct Principle
Analysis of ERM
VC-dimension
Generalization Bounds
Structural Risk Minimization (SRM)

Summary and Discussion: VC-theory

Methodology
- learning problem setting (KID principle)
- concepts (risk minimization, VC-dimension, structure)
Interpretation/evaluation of existing methods
Model selection using VC-bounds
New types of inference (TBD later)
What the theory cannot do:
- provide formalization (for a given application)
- select a good structure
- there is always a gap between theory and applications
