You are on page 1of 66

Machine Learning, spring 2005.

Extended hints and tentative solutions to exercises.


Send flames and corrections to jtlindgr@cs.helsinki.fi.

On many cases, I have given only a general outline of the solution. It should be
enough to let you fill in the details. For a student, these kinds of answers
may not be acceptable in a test you should give as detailed account as
you can.
On some exercises, I have included a paragraph tagged rant. This is a piece
of assistents subjective speculation and handwaving that usually tries to relate
the exercise to the universe as the empirist sees it. If you bother to read them,
do so with a large grain of salt. They can be skipped safely.
Exercise 1.1
Mostly, use what you know from elementary logic.
b) The formula can be multiplied from left to right, resulting in DNF
f (x) = x1 x2 + x1 x3 x4 + x1 x3 x5 x6 .
c) Can be expanded e.g. deepest first, using the distributive law ( x + yz =
(x + y)(x + z) ), towards CNF
f (x)

= x1 (x2 + x3 (x4 + x5 )(x4 + x6 ))

(1)

= x1 (x2 + x3 )(x2 + x4 + x5 )(x2 + x4 + x6 )

(2)

a) Can be reasoned from the CNF, giving


((x1 , ), (x2 , +), (x3 , ), (x4 , +), (x5 , ), (x6 , +), )
rant: Algorithms exist for conversions like these. Less formal attempts with
small amount of literals can be verified for correctness using a truth table (basically brute force checking of all variable/truth assignment possibilities). The
moral of the story could be that the truth might exist in different forms and
they can occasionally represent the same thing, although it might not be apparent at all. Around the eighties, it was a great pastime to prove the relations
between different hypotheses (or -classes) . Current status: on real data, algorithms working explicitly on logic tend to produce less accurate classifiers than
their more graded competitors. Often a linear classifier or a nearest neighbour
does reasonably. With DNF-style rule learners, your mileage might vary. Attempts to enhance (e.g. logic-based) algorithms by heuristical kludging easily
hit a Murphy-variant of the the law of diminishing returns (gains grow logarithmically w.r.t. the amount of workhours).

Exercise 1.2
Informally, start by drawing a 2D illustration of the situation and generalize to
larger dimensions. The idea is to see that we are in effect separating one corner
of a hypercube, we dont have to care about the rest of the examples as they
are not close to the margin.
a) As the situation is symmetrical, set w1:k = 1 and wk+1:n = 0. By flipping
one bit of the positive example that has just bits 1 : k true, reason that the
threshold
b = k 1. Hence, by simple arithmetic the margin will turn out to
be 1/ k.

b) Divide the (w, b) of part a) by 1/ k. Youll see that the unnormalized margin
ends up the same as the normalized margin was in part a).
c) The same as in a).
d) Using the method from the lectures, add coefficient w n+1 = (k 1) = 1 k
and set b = 0. As the biaspis now a part of the
vector, its calculated along in
the margin, resulting in 1/ k + (1 k)2 1/ k.

note: you could show, using linear inequalities, that any linear classifier having
a larger margin than those given leads to a contradiction. Or, you might apply
geometrical ideas of convexity with optimization theory. Not required here.
rant: Why we are so interested in the margin might get clearer further down
the course. For example, the perceptron convergence theorem (Novikoff) gives
a mistake-bound on the perceptron algorithm that is inversely related to the
square of the margin. Also, according to Structural Risk Minimization theory
(SRM), getting the classes well-separated with a simple model is often a good
thing. But. Real-life data might be continuous and the labeling might be just
an opinion of some person. There is not necessarily a margin in the input space,
if the data comes from smooth, partially overlapping distributions (imagine e.g.
a candy space. Co-ordinate axis represent percentages of candy ingredients,
i.e. sugar. Sample candies. Taste. Label each as good or bad. Is there a
margin?). We might be more interested about keeping the most of the data
away from the decision surface than attempting to get a hard margin by
some complexity (or norm) increasing tricks.
Exercise 1.3
i) Note that a parity function is TRUE iff an odd number of literals are TRUE.
Hence, create a binary decision tree with three levels that essentially counts the
number of TRUE literals in each root-to-leaf -path. The tree is not unique, as
you can do sums in any order.
ii) Write out parity of xi , xj as xi xj + xi xj . Expand the formula by basic logic.
Up to ordering, should give something like
x1 x2 x3 + x 1 x2 x3 + x 1 x2 x3 + x 2 x1 x3 .
b) Hard way: we have four examples, (-1,-1,FALSE),(-1,1,TRUE),(1,-1,TRUE),
(1,1,FALSE). Set up a system of linear inequalities to say that these should
be classified correctly by some (w, b), i.e. w x b for positive examples etc.
Trying to solve this linear system should result in a contradiction. Less formally,
2

looking at a 2D illustration of the case and trying to put a hyperplane there


should demonstrate the case rather convincingly to everyone except the most
unreformed theorists. A third possibility is to note the monotonicity of linear
models: each weight can only increase or decrease the relation of the example
to a class. In the XOR problem, whether value of x1 should have a positive or
negative effect depends on the value of x2 .
Exercise 1.4
M. Anthony 2002, Decision Lists and Threshold Decision Lists, CDAM Research Report LSE-CDAM-2002-11, has an induction proof of the required mapping at pages 15-16 (theorem 5.1) 1 .
The idea shortly is to start from the first non-redundant (positive-label) node
of the decision list tail and make a linear classifier for that single node. Then,
moving from the declist tail to front, start to add weights to the linear classifier
(while editing the bias) so that the new weight is always large enough to override
all the weights to the right (those nearer the tail) as they are less important.
The hinted sequence wi = 2 wi+1 should also manage this. This should give
a linear classifier where the weights grow exponentially w.r.t. the decision list
length, so the margin will get small really fast.
rant From this you can see e.g. that with a linear classifier, you could simulate
the fragility of a 1-decision list: try adding a little errors to the data to the right
place.
Exercise 1.5
Note that the data sample conforms to some multiliteral, monotone conjunction
h and it has no errors. The sample does not necessarily suffice to specify the
conjunction exactly, but here we are interested just in conforming to just the
sample. Put the positive examples to a matrix, each row an example. Apply
logical AND on the matrix columnwise. The result is the hypothesis. It predicts
TRUE on all positive examples, as it necessarily has atleast the bits on that h
has (because all positive examples must have them on). It predicts FALSE on
all negative examples, because they cannot have those all the bits in h on (that
our hypothesis atleast requires). If we have m examples and dimension n, the
complexity is clearly O(mn), which is polynomial in n.
rant: It should be remembered that any algorithm that makes irrevocable
choices based on just one example is almost always unsuitable for practical
use. This is due to errors/uncertainty present in natural data. Also, often
the hypothesis class used by the method can only approximate the underlying
labeling phenomena, not equal it (resulting in errors).

1 http://www.maths.lse.ac.uk/Personal/martin/cdambfdls.pdf

Exercise 2.1
i) For each starting location in 1, ..., k, sum the number of intervals starting
from it. This results in
k + (k 1) + (k 2) + ... + 1 =

Pk

i=1

i = k (k + 1)/2 O(k 2 ),

the well-known arithmetic series. The result can also be seen as the number of
all unique pairs k(k 1)/2 incremented with k more cases where each number
is paired with itself.
ii) First, note that we can not represent and update the intervals explicitly by
brute force, we dont have enough time for that. One attempt is as follows.
1. Create two abstract intervals P OS and N EG that are both initially [1, k].
2. Interpret that each of the two intervals covers all of the possible subintervals
in its range.
3. On receiving a new example x [1, k], find out if it belongs to P OS, N EG
or both. This can be done in O(k). For intervals x belongs to (maximally two),
calculate how many of its subintervals overlap the location. If the interval is
[a, b], the overlap can be calculated atleast in time O(b a) O(k) by traversal.
4. Vote the class that got more weight

5. If there was a nonconforming abstract interval [a, b] covering x, split it to


[a, x 1] and [x + 1, b] with pathological cases handled by common sense. In
subsequent rounds, P OS and N EG are sets of abstract intervals, but neither
can have more than O(k) components and in O() sense the running time is not
affected.
6. Read next example, repeat from 3.
Visualization of the algorithm state after receiving one positive and one negative
example from the range [1, k]:
POS
NEG
range

[-----] [-------------------]
[------------] [------------]
[
+
]
1
k

This algorithm, although not the cleanest one possible, is relatively easy to
understand and can operate in O(k). It is a halving algorithm, as each time an
example is received, all nonconforming intervals are removed. Unfortunately,
the hypothesis class supported by this algorithm (set of intervals) is larger than
the one asked for in the exercise. A more conformant solution delivering just
one [a, b] interval would start with a h = [1, k] interval and truncate it each time
a nonconforming example is seen; here we would need to upkeep a couple of
arrays to calculate the voting strengths. That would be O(k) algorithm as well.
Details omitted.


Exercise 2.2
Jensen: if f is convex, and x is a random variable,
Ef (x) f (Ex), or,

pi f (xi ) f (

p i xi )

For this exercise, lets take f (x) = ln x. We should know from function analysis
that if f is twice differentiable and f 00 (x) 0, f is convex at x. As 2x ln x =
1
x2 0 everywhere, ln x is convex.
a) Let pi = 1/n. Start from an expression where Jensen holds and crank,

1/n ln ai

ln(

1/nai )
X
ln(1/n
ai )
X
exp(ln(1/n
ai ))
X
exp(ln(1/n
ai ))
X
1/n
ai .

1/n ln ai
X
exp(1/n
ln ai )
Y
exp(1/n ln ai )
Y
( ln ai )1/n

(3)
(4)
(5)
(6)
(7)

P
b) Let dkl (p, q) = pi ln pi /qi , the Kullback-Leibler divergence.
Is dkl (p, q) 0, q, p?

Apply Jensen for concave ln(x) (just flip the inequality direction).

Hence,

pi ln pi /qi

pi ln qi /pi
X
ln
pi qi /pi
X
= ln
qi = ln 1 = 0

(8)
(9)
(10)

pi ln pi /qi 0.

Exercise 2.3
a) We have the estimate of theorem 2.9 of the slides,

L(W A) c ln W1 c ln WT +1 ,

where Wt = i Wt,i . Now we can bound the second term on the right more
efficiently. Let I be the set of those k experts with losses bounded by M .
ln WT +1

ln

wT +1,i

(11)

iI

min( ln kw1,i exp(L(i )))

(12)

= ln k min( ln w1,i exp(L(i )))

(13)

= ln k + min L(i )

(14)

iI

iI

iI

plugging this to the L(W A) bound, we get

L(W A) c ln W1 c ln k + c min L(i )


iI

(15)

= c ln w1 /k + c min L(i )

(16)

cM + c ln N/k.

(17)

iI

Explanation: First we applied the definition of WT , then reasoned that for


decreasing ln(x), less x is more. Next, we used the update rule of the algorithm
to expand wT +1,i and used the min expert value k times. With some ln cranking,
we got the estimate for the second term that we subsequetly plugged in.
We can see from the bound that it can be benefical to have loss-bounded, preferably good experts.
b) Just switch the wi to pi and do quite similarly to a).

ln WT +1

min( ln w1,i exp(L(i )))

(18)

= min( ln pi ln exp(L(i )))

(19)

= min( ln pi + L(i )).

(20)

Again, plug to the loss bound to get

L(W A) c ln W1 + c(min( ln pi + L(i )))


i

= 0 + min(c ln 1/pi + cL(i )),


i

(21)
(22)

as the initial potential W1 = 1. We can see from the bound that it benefits
the algorithm if we can give it a vector of good starting weights. These weights
could then be interpreted as our prior beliefs of the expert performances.

Exercise 2.4
Start by simplifying the log loss
Llog (y, y) =

1y
2

ln 1y
1
y +

1+y
2

ln 1+y
1+
y

for the two cases {1, 1},


L(1, y)
L(1, y)

1 y
2
1+y
= ln
2

= ln

(23)
(24)

which allows us to simplify the exp terms (with = 1) as

exp(L(1, y))
exp(L(1, y))

1 y
1 y
)=
2
2
1 + y
1 + y
)=
= exp(ln
2
2

= exp(ln

(25)
(26)

Now,
Wt+1
Wt
= ln
ln
Wt+1
Wt

wt+1,i
Wt
X wt,i exp(L(yt , xt,i ))
= ln
Wt
i
X
= ln
vt,i exp(L(yt , xt,i )),
= ln

(27)
(28)
(29)

where the main idea was to apply the update rule of the WA algorithm. Finally,
inputting the exponentiated losses derived earlier, we can get the log losses,
shown below for the case yt = 1.

yt = 1 : ln

X
i

1 xt,i
vt,i
2

= ln(

i vt,i

v t xt
)
2

1 v t xt
2
= Llog (1, vt xt )

= ln

The other case is handled similarly.

(30)
(31)
(32)


Exercise 2.5
Tedious, brute-force differentiations, such as in this exercise, might be better
done on a computer, if you already know how to differentiate. If you dont, this
exercise might not be the best place to start learning it.
Suitable computer software to attempt symbolic computation are e.g. Maple
(which is commercial) and Maxima (GPL, can be downloaded from Sourceforge).
Below we show how Maple can be used to help solving this exercise.
>

# Machine Learning spring 2005 exercise 2.5, using MAPLE

>

# a) log loss

>

logl := (1-y)/2*ln( (1-y)/(1-x) ) + (1+y)/2*ln( (1+y)/(1+x) );

>

logld1:=diff(logl,x);

>

logld2:=diff(logld1,x);

>

logtmp:=simplify( (logld1^2)/logld2);




1+y
1y
+ 1/2 (1 + y) ln 1+x
logl := 1/2 (1 y) ln 1x
1+y
1y
1/2 1+x
logld1 := 1/2 1x
1y
1+y
logld2 := 1/2 (1x)2 + 1/2 (1+x)
2
2

>

(x+y)
logtmp := 1x
2 +2 yx
logt1:=simplify(eval(logtmp,y=1));

logt2:=simplify(eval(logtmp,y=-1));
values

>

>

# consider y=-1,y=1 separately


# -> results in two constant

logchat:=max(logt1,logt2);
logt1 := 1
logt2 := 1
logchat := 1

>
>

# b) hellinger loss
hell := 1/2*( (sqrt(1-y) - sqrt(1-x))^2 + (sqrt(1+y) - sqrt(1+x))^2);

>

helld1:=diff(hell,x);

>

helld2:=diff(helld1,x);

helltmp:=simplify( (helld1^2)/helld2);
2
2

hell := 1/2
1 y 1 x + 1/2
1+y 1+x
>

1+y
1+x

1+x

1
1x
helld2 := 1/4 (1 x) + 1/4 1y
+ 1/4 (1 + x)
(1x)3/2

2
1+x 1x( 1+x 1y 1x 1+y )
helltmp := 1+x1yx1x1+yx+1+x1y+1x1+y

helld1 := 1/2

1y
1x

1x
1

1/2

+ 1/4

1+y 1+x
(1+x)3/2

>

hellt1:=simplify(eval(helltmp,y=1));

>

hellt2:=simplify(eval(helltmp,y=-1));

>

hellm1:=maximize(hellt1,x=-1...1);

>

hellm2:=maximize(hellt2,x=-1...1);

>

hellhat:=max(hellm1,hellm2);

>

# 2

# consider y=-1,y=1 separately

>


hellt1 := 2 1 +
x
hellt2 := 1 x 2
hellm1 := 2
hellm2 := 2
hellhat := 2
# For comparison, square loss goes nice and clean ;)

>

tmp := (y-x)^2;

>

tmpd1 := diff(tmp,x);

>

tmpd2 := diff(tmpd1,x);

>

sqrrat := simplify( (tmpd1^2)/tmpd2);

>

sqrchat := maximize(sqrrat,x=-1...1,y=-1...1);

>

# 8
2

tmp := (x + y)
tmpd1 := 2 x 2 y
tmpd2 := 2
2
sqrrat := 2 (x + y)
sqrchat := 8


Exercise 3.1
i) Again, lets continue exploiting Maple to handle tedious differentiations.
Lemma 2.18 gives the formula for calculating cL, so just get to it.
First, specify Hellinger loss,
> L := 1/2*( (sqrt(1-y) - sqrt(1-x))^2 + (sqrt(1+y) - sqrt(1+x))^2);
L := 1/2

1y

2
2

1 x + 1/2
1+y 1+x

We are interested only in y {1, 1},


> Lminus:=eval(L,y=-1);

2

2 1 x + 1/2 + 1/2 x
Lminus := 1/2
> Lplus:=eval(L,y=1);

2

Lplus := 1/2 1/2 x + 1/2


2 1+x
Need resp. 1st and 2nd order derivatives,
> Lminusd1:=diff(Lminus,x);
Lminusd1 := 1/2
>

2
1x
1x

Lminusd2:=diff(Lminusd1,x);
Lminusd2 := 1/4 (1 x)

>

>

+ 1/2

+ 1/4

2 1x
(1x)3/2

Lplusd1:=diff(Lplus,x);

Lplusd1 := 1/2 1/2


2
1+x
1+x

Lplusd2:=diff(Lplusd1,x);

Lplusd2 := 1/4 (1 + x)

+ 1/4


2 1+x
(1+x)3/2

Construct the ratio of the lemma,


> upper:= Lminusd1*(Lplusd1)^2 - Lplusd1*(Lminusd1)^2;


2


2
1x + 1/2
1+x
upper := 1/2 2
1/2

1/2
1x
1+x


2


2
1+x
2
1x
1/2 1/2 1+x
1/2 1x + 1/2
>

>

>

lower:= Lminusd1*Lplusd2 - Lplusd1*Lminusd2;







1
2 1+x
1x + 1/2
lower := 1/2 2
1/4
(1
+
x)
+
1/4
3/2
1x
(1+x)





1
2
1+x
2
1x
1/2 1/2 1+x
1/4 (1 x) + 1/4 (1x)3/2
ratio := simplify(upper/lower);


ratio := 1/2 1 + x 1 x 1 x + 1 + x 2
maximize(ratio,x=0...1);

The maximization can of course be done by hand, by setting the first derivative
to zero, etc.

10

ii) This can be handled similarly to ex. 2.17. of the lecture notes. We are
interested in cases y {1, 1}. Simplify the Hellinger loss for these two cases,
leading to
L(1, x) = 2
L(1, x) = 2


2 1x


2 1 + x.

According to the prediction rule of the Aggregating Algorithm (AA), we need


L(y, y) (y),
where is as specified in the lecture notes. Plugging in the simplified losses,
2


2 1 x (1)


2 1 + x (1)

and solving for x, we get

(2 (1))2
(2 (1))2
+1x
1
2
2

We can suppose that a reasonable choice for prediction lies in the middle of
these two bounds (i.e. [a, b] a + 21 (b a)), leading to the selection of
y = (1) 1/4(1)2 (1) + 1/4(1)2 .


11

Exercise 3.2
Following the hint, we can proceed by first transforming the given constraint
to an easier form and then showing that the related function is convex. As we
know that the constraint holds for the range endpoints {1, 1}, the convexity
of the function will show us that the resp. constraint must also hold for every
y [1, 1] (intuition e.g. through visualizing the idea for convex f (x) = x2 ).
Now that we have this plan, tweak the constraint,
(y y)2 (y)

(y y)2 (y) 0

f (y) := exp((y y)2 (y)) exp(0) = 1.


Separating the exp() for f (y), we get
f (y) = exp((y y)2 ) exp((y)),
and plugging in the definition of
t
(y)) = c ln(
(y) = c ln( WWt+1

PN

Wt,i
i=1 Wt

exp((y xt,i )2 )),

we can arrive at
f (y) =

PN

Wt,i
i=1 Wt

exp((y y)2 (y xt,i )2 ).

Since Wt,i
is positive, it suffices to examine the exp() part on the right. Denote
t
it by g(y). Now
g 0 (y) = (2(y y) 2(y xi ) exp((y y)2 (y xi )2 )
g 00 (y) = (2(y y) 2(y xi ))2 exp((y y)2 (y xi )2 ) 0,
clearly. From the form of f (y) we can conclude that f 00 (y) 0 as well. According to a well-known result from analysis, f (y) is then convex (prodiving it
is continuous and twice differentiable, as our f is).

12

Exercise 3.3
Use the same potential technique as in the lecture notes.
Write out the one-step potential difference as

Pt Pt+1

1
1
||u w||22 ||u w0 ||22
2
2
1
0
= (w w)(u w) ||w w0 ||22
2
1
= (wt+1 wt )(u wt ) ||wt wt+1 ||22 ,
2

(33)
(34)
(35)

where on the last line we just substituted w = wt and w0 = wt+1 .


Supposing a mistake made by the algorithm and applying the perceptron update
rule, we know that wt+1 wt = yt xt and wt wt+1 = yt xt . Plugging these
in, we get

Pt Pt+1

1
= (yt xt )(u wt ) || yt xt ||22
2
1
= (yt xt )(u wt ) 2 ||xt ||22
2
1
= yt xt u yt xt wt 2 ||xt ||22
2
1 2 2
yt xt u X
2
1 2 2
X
2
X 2
),
= (1
2

(36)
(37)
(38)
(39)
(40)
(41)

where we applied the knowledge that wt made a mistake (yt xt wt is positive),


that u had a margin 1, and that the norm of each x was bounded by X.
Summing the potential differences over the trials gives
T
X

T
X
1
(Pt Pt+1 ) (1 X 2 )
t
2
t=1
t=1

(42)

P1 PT +1

T
X
1
(1 X 2 )
t
2
t=1

(43)

P1

T
X
1
(1 X 2 )
t
2
t=1

(44)

1
||u winit ||22
2

T
X
1
2
(1 X )
t
2
t=1

(45)

where we knew that PT +1 is positive and the value of the starting potential P1 .
From this and selecting = X12 , we can get the mistake bound
13

PT

t=1

||u winit ||22 X 2 .

Exercise 3.4
Although this excercise could be solved by optimization techniques (i.e. minimize a target function w.r.t. a constraint), it is rather tedious. An easier
way for now is to visualize the feasible region, the old wt , and the new x we
made a mistake on. As we made a mistake, old wt is not in the feasible region
{w|yt wx > 1}, whereas x is. Now clearly the point w 0 that both fulfills the margin criteria and is closest to the old wt , relies on the hyperplane {w|w x = 1}.
(You can think of w 0 as a sum of two vectors, perpendicular to each other and
the other collinear to the hyperplane. Set the length of the collinear vector to
0 to minimize the total norm. In 2D, think of what kind of triangle gives the
smallest hypothenusa length).
Thus, the solution is just the orthogonal projection of wt to the hyperplane
{w|w x = 1}. From the orthogonal projection formula we get the update rule
t xwt
wt+1 = wt + ( 1y
)x.
||wt ||2
2

As the Euclidean distance to the hyperplane is ||wt+1 wt ||2 , the distance is also
the norm of the vector (perpendicular to the hyperplane!) we are incrementing
our hypothesis wt with. To put it another way, we are adding a scaled version
of the hyperplane normal to our previous hypothesis.
Exercise 3.5
Here you were supposed to learn how to do basic things on high-level languages
such as R or Matlab. In empirical machine learning, data-analysis and other similar areas, it is often the best to start with some interactive, high-level language
to experiment with prototype algorithms2 . Possibility to use matrix operations,
plots and all kinds of helpful add-on toolboxes can make exploratory work much
smoother.
After implementing the requirements of this particular assignment, we can see
that
The algorithm always converges to zero training error as there is no noise
in the data/labels.
Repeating the experiment shows the high variance in the number of iterations required, depending on the data we happened to create (especially
its margin).
With 48 irrelevant attributes, the test error varies roughly around 10%
and 30%, compared to the typical rate (around 2%) of the basic twodimensional case. Thus, the perceptron algorithm is clearly misled by the
extra attributes.
2 Later on, if your algorithm requires e.g. explicit loops, it may be fruitful to either implement the loops in e.g. C, but this probably only after you already have fixed your specification.

14

This time I have included an implementation on R. Troubles with availability


of Matlab (and licenses for its different toolboxes) should probably make R (or
Octave) your first choice, unless you specifically need some features provided
only by Matlab. Next time, Ill present a Matlab solution for comparison.

#################################################################
EXC<-200;
DIM<-2;
# Create random data from [-1,1]^DIM
X<-matrix(data=runif(EXC*DIM),nrow=EXC,ncol=DIM)*2-1;
# Make the target concept classifier (just a vector)
w<-rep(0,DIM);
w[c(1,2)]<-0.5;
# Label (classify) the data
labels<-as.numeric(X%*%w>=0)*2-1;
# split the data and labels to train and test sets
trainX<-X[1:(EXC/2),];
testX<-X[(EXC/2+1):NROW(X),];
trainLab<-labels[1:(EXC/2)];
testLab<-labels[(EXC/2+1):NROW(X)];
wm<-rep(0,DIM);
alpha<-1;

# initial hypothesis
# learning rate coeff.

# loop the perceptron training until no mistakes are made


doIter<-TRUE;
while(doIter) {
didMistake<-FALSE;
cat(Start iter...\n);
for(i in 1:NROW(trainX)) {
pred<-as.numeric(sum(wm*trainX[i,])>=0)*2-1;
if(pred!=trainLab[i]) { # update on mistake
wm<-wm + trainLab[i] * alpha*trainX[i,];
didMistake<-TRUE;
# visualize the current model
wmt<-wm/sqrt(sum(wm^2));
cat( , i, wm, angle , acos( (wmt%*%w)/sqrt(sum(w^2))) ,\n);
plot.default(c(0,wmt[1]),c(0,wmt[2]),type="l",col="red",
xlim=c(-1,1),ylim=c(-1,1))
par(new=TRUE);
plot.default(c(0,w[1]),c(0,w[2]),type="l",col="blue",
xlim=c(-1,1),ylim=c(-1,1))
points(trainX,col=palette()[trainLab+2]); # use 2 colors
15

Sys.sleep(0.5);
}
}
doIter<-didMistake;
}
# show the final model
wmt<-wm/sqrt(sum(wm^2));
plot.default(c(0,wmt[1]),c(0,wmt[2]),type="l",col="red",
xlim=c(-1,1),ylim=c(-1,1))
par(new=TRUE);
plot.default(c(0,w[1]),c(0,w[2]),type="l",col="blue",
xlim=c(-1,1),ylim=c(-1,1))
points(trainX,col=palette()[trainLab+2]);
# calculate final training and test set errors
trainErr<-sum(as.numeric(trainX%*%wm>=0)*2-1 != trainLab)
/ length(trainLab) ;
testErr<-sum(as.numeric(testX%*%wm>=0)*2-1 != testLab)
/ length(testLab) ;
cat(train: , trainErr, test: , testErr,\n);
cat( final angle , acos( (wmt%*%w)/sqrt(sum(w^2))) ,\n);
par(new=FALSE);
################################################################

Most of the mess of the code was related to data handling and diagnostic output
and plots. For comparison, if you were doing LMS linear regression (a standard
method for numerical prediction in statistics), you could have learned the model
with just one line of code (using matrix inversion).
rant: Considering domains like visual learning, bioinformatics or text classification, 50 dimensions is small beans. Next time we will be having a Winnowexercise, and to that, Ill write a more hearty rant regarding the nature of
irrelevant attributes. Stay tuned!

16

Exercise 4.1
Let kq (x, z) = (x z + c)q , x, z <n . Consider n = 2, q = 2, c = 1. Now
k2 (x, z)

= (x) (z)
= (x z + 1)2
X
= ( (xi zi ) + 1)2

(46)
(47)

= (x1 z1 + x2 z2 + 1)2
= x21 z12 + 2x1 z1 x2 z2 + 2x1 z1 + 2x2 z2 + x22 z22 + 1

= (1, 2x1 , 2x2 , x21 , 2x1 x2 , x22 )

(1, 2z1 , 2z2 , z12 , 2z1 z2 , z22 ),

(49)
(50)

(48)

so c1 = 1, c2 = c3 = c5 =

(51)
(52)

2, c4 = c6 = 1.

rant: Doing the feature transformation explicitly by some engineered(!) ()


has been standard statistical practice. The tradition in this regard has been
to first fit a linear model (more or less our perceptron) to the data, and if
its not good enough, scrutinize the data and the model, and then try to add
such nonlinearities to the representation that you have reason to believe could
be fruitful, and fit another linear model to the new representation. Possibilities regarding what might be done are virtually endless. Usually e.g. power
transformations, products of features, etc. are tried. Adding such new features
is called basis expansion. The idea of constructing () explicitly has the benefit that it is driven by an attempt to understand the data. In comparison,
pulling some standard kernel from the sleeve and hoping for the best does not
reveal much about the problem at hand. Trying to make very specific kernels
for specific problems is called kernel engineering, which is basically just a more
difficult form of basis expansion, because (atleast in theory) youll have to make
sure that the kernel is a kernel, i.e. it fulfills certain conditions. The actual
underlying problem of hypothesis class selection does not vanish by tricks, but
the representation change trick has the benefit that if you are willing to tinker
with your representation or kernel, you might get quite far with just a single
good algorithm to learn linear models.

17

Exercise 4.2
For an ANOVA kernel,
A (x) =

iA

xi .

where A {1, ...n}. Start by writing out the kernel,


kqn (x, z)

A (x)A (z)

(53)

|A|=q

X Y

xi z i

|A|=q

|A|=q,nA

= x n zn

(54)
xi z i +

|A|=q,nA
/

|A|=q1,nA
/

xi z i +

xi z i

|A|=q,nA
/

n1
= xn zn kq1
(x, z) + kqn1 (x, z).

(55)
Y

xi z i

(56)
(57)

The idea was to separate the sum to two parts, where the left part has all those
subsets having attributes with index n.
It is apparent from this recursive formula that it can be computed by dynamic
programming. The basic algorithm would have an (q + 1) (n + 1) matrix k,
with initialization k(1, :) = 1, k(:, 2 : (q + 1)) = 0. Using the recursive formula,
the matrix would have been filled e.g. columnwise (process all rows of a single
column, then proceed to next column) and hence the algorithm would require
time O(nq) to process the matrix.

18

Exercise 4.3
This exercise again demonstrates the potential method that weve already got
some familiarity with. Denote Pt = 21 ||u wt ||2 . All norms here are euclidean.
Lets start playing with the potential difference,

Pt Pt+1

=
=

1
1
||u wt ||2 ||u wt+1 ||2
2
2
X
1 X
(ui wt+1,i )2 ).
( (ui wt,i )2
2 i
i

(58)
(59)

Now by plugging in wt+1,i = wt,i (


yt yt )xt,i (note the minus sign!) and
doing some tedious manipulations, we can end up with

Pt Pt+1

1
= (
yt yt )2 ( 2 ||xt ||2 ).
2

(60)

One potential problem with the manipulations to reach the above expression is
to forget to square the when you take it out from the squared norm.
Now well just sum over t and observe the telescoping property, giving us

P1 Pt+1

P1

X
1
(
yt yt )2 ( 2 ||xt ||2 )
2
t
X
1
(
yt yt )2 ( 2 ||xt ||2 )
2
t
X
1
(
yt yt )2 ( 2 X 2 )
2
t
X
1
(
yt yt )2 ( 2 X 2 ),
2
t

where we used the assumption that ||xt || X, t. By selecting =


remembering that P1 = 12 ||u||2 , we get
1
||u||2
2
||u||2 X 2

(yt yt )2 (
(yt yt )2 .

1
1 X2

)
2
X
2 X4

(61)
(62)
(63)
(64)
1
X2

and

(65)
(66)


19

Exercise 4.4 - 4.5


Here we were supposed to try out some simple algorithms and experience their
behaviour on some easily understandable artificial data.
Something like the following can be learned from doing so,
Even on a simple problem like this, learning rate can have great effect on
convergence and the test results of the algorithms.
The theorem-suggested learning rate for winnow does not appear to be
particularly good for this problem. Instead, a rate of 2 seems to work
well.
Perceptron and marginalised perceptron tend to increase their test error as
the dimension is increased. Winnow still works acceptably with d = 5000.
The development of the f () ratio suggested in the exercise gets worse and
worse for perceptron and marginalised perceptron as the dimension grows.
With high learning rates, f () can be very erratic for Winnow.
The lower the learning rate for Winnow is, the more it behaves like the
two other algorithms. Empirically this can be seen from the plot of f () or
the test error behaviour.
We do not appear to get anywhere near the mistake bounds proposed by
the theorems. For example, the bounds that are inversely related to the
margin can be particularly out of scale here.
For completeness, Ive included a code draft. This time its in Matlab.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
EXC=200;
DIM=50;
algorithms={winnow,perceptron,margperceptron};
% create data and target concept
X=rand(EXC,DIM)*2-1;
w=zeros(DIM,1);
w(1:2)=0.5;
% label data by concept
labels=(X*w>=0)*2-1;
% calc margin
margin=min(abs(X*w));
% calc norms of the data
Xnorm = max( sqrt(sum(X.^2,2)) );
XmaxNorm = max(max(abs(X),[],2));

20

% split it to train and test


trainX=X(1:(EXC/2),:);
testX=X((EXC/2+1):EXC,:);
trainLab=labels(1:(EXC/2));
testLab=labels((EXC/2+1):EXC);
% initialize our hypotheses
wM=zeros(length(algorithms),DIM);
% winnow needs positive starting coefficients
wM(1,:)=1;
rho=margin - 0.1; % cheating a bit for marg. perc.
thresh=zeros(length(algorithms),1);
thresh(1)=0;
thresh(2)=0;
thresh(3)=rho;
alpha=zeros(length(algorithms),1);
%alpha(1)= margin/(XmaxNorm^2);
alpha(1)= 2;
alpha(2)= (Xnorm^2) / (margin^2);
alpha(3)=(margin-thresh(3))/(Xnorm^2);
% winnow bound
wbound=2*(Xnorm^2)*log(DIM) / margin^2;
fprintf(1, winnow bound %f\n, wbound);
% perceptron bound
pbound=(Xnorm^2)/(margin^2);
fprintf(1, perceptron bound %f\n, pbound);
% margperc bound
% for our reference model, hinge loss is zero due to our choice
% of mu. hence forget the first term.
mbound=1/(2*alpha(3)*(margin-thresh(3)-alpha(3)*(Xnorm^2)/2));
fprintf(1, marg. perceptron bound %f\n, mbound);
% learn the models
for alg=1:length(algorithms)
doIter=1;
f=[];
totIters=0;totMistakes=0;
while(doIter)
didMistake=0;
for i=1:size(trainX,1)
pred=(sum(wM(alg,:).*trainX(i,:))-thresh(alg))*trainLab(i);
if(pred<=0)
wM(alg,:)=feval(char(algorithms(alg)),wM(alg,:), ...
21

trainX(i,:),alpha(alg),trainLab(i));
didMistake=didMistake+1;
end
f=[f; (abs(wM(alg,1))+abs(wM(alg,2))) / ...
(eps+sum(abs(wM(alg,:))))];
end
totMistakes=totMistakes+didMistake;
if(rand(1)<0.01)
fprintf( alg %d did %d mistakes, tot %d \n, ...
alg, didMistake, totMistakes);
en
doIter=didMistake;
totIters=totIters+1;
end
fprintf( alg %d reqd %d iters %d mistakes \n, ...
alg, totIters, totMistakes);
testerr=sum( ((testX*wM(alg,:)).*testLab)<0);
fprintf(
testerr %1.2f \n, testerr/length(testLab));
plot(f)
pause;
end
function RES = winnow(w,x,alpha,label)
RES=w.*exp(label.*alpha.*x);
function RES = perceptron(w,x,alpha,label)
RES=w+label*alpha*x;
function RES = margperceptron(w,x,alpha,label)
RES=w+label*alpha*x;
rnorm=sqrt(sum(RES.^2));
if(rnorm>1)
RES=RES./rnorm;
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
rant: Now that we have applied Winnow on a data where there are clearly
relevant and irrelevant attributes, a word of caution is in order. The irrelevant
attributes expression, together with the property of attribute efficiency (such
as Winnow has) are convenient theoretical abstractions adapted by the online
machine learning community. There, the attributes are deemed irrelevant if they
do not belong to a particular supposed target concept. This might be fine in e.g.
logic (a literal either belongs to the formula or it does not), but in continuous,
natural phenomena, the usefulness of some attribute might be more graded.
Also, for the given concept class (like a linear classifier for the parity problem)
the attributes might be irrelevant in another way, as no model of the hypothesis
class is expressive enough to make use of them. In comparison, instead of
speaking of relevancy as a binary phenomenon, a statistician might claim some
22

level of dependency to exist between feature xi and the class label according
to the model he estimated (i.e. feature xi can explain the class variable with
... and so on.). In machine learning, the various kinds of relevancy have been
attempted to be conceptualized by e.g. Blum & Langley: Selection of Relevant
Features and Examples in Machine Learning. Artif. Intell. 97(1-2): 245-271
(1997). Suffice to say, an empirist would often be in a very lucky situation if
the only thing he had to do was to remove a few irrelevant variables, in order
to make the learning succeed. More likely we have to do expand the basis like
in exercise 4.1 and then do pruning in a potentially combinatorially exploding
situation.
It should be noted that some variables might not be only irrelevant, but even
downright harmful. For example, if one is learning a decision tree, and there is
a complex but correct explanation, and a wrong, but very tempting and easy
explanation, it might be that the only practical way to prevent the learner from
going the easy way is to remove that possibility from the data (in a Bayesian
setup we would make the prior probability of the easy solution a really low one
or even zero). However, it might be very difficult to know when the learning
method has somehow cheated on real world data.
Next time we will have an exercise with natural data of car images. There
it is rather intuitive to imagine what properties of the data could be irrelevant/harmful with relation to the concept we really wish to learn.

23

Exercise 5.1
a) Write out the error,

err(h; p)

= P (h(x) 6= f (x))P (f (x) = y)


+P (h(x) = f (x))P (f (x) 6= y)
= P (h(x) 6= f (x))(1 )
+P (h(x) = f (x))

= P (h(x) 6= f (x))(1 2) + ,
from which we can see that the only possibility to reduce the error is to reduce
the probability of disagreeing with f . Clearly f is the best choice.
b) Let f be the classifier that predicts the most probable class for each x
according to P . Suppose that h 6= f , i.e. h sometimes disagrees on some
instances. This means that h occasionally predicts the less likely class. It is
clear err(h; P ) > err(f ; P ).
This can also be done formally e.g. by writing out the expected zero-one loss
and minimizing it. This leads to selection of the label with smaller probability
of error w.r.t. P .
c) This time, minimize the square loss,

E(L2 ) = E[(f (x) y)2 ]


Z Z
=
(f (x) y)2 P (x, y)dxdy
x y
Z Z
=
(f (x) y)2 P (y|x)P (x)dxdy
x y
Z
Z
=
P (x) (f (x) y)2 P (y|x)dydx
x
y
Z
=
P (x)Ey|x [(f (x) y)2 ]
x

= Ex Ey|x [(f (x) y)2 ],


where we transformed expectations to integrals and then used the basic knowledge from analysis to play with them. The result can be minimised pointwise,
i.e. select
f (x) = argminc Ey|x [(c y)2 |X = x],
which equals selecting h(x) = f (x) = E[y|X = x]. Writing the expectation out
in this particular case gives h(x) = (1)P (y = 1|x) + (1)P (y = 1|x).


24

Exercise 5.2
Basically by rewriting theorem 3.4 of the lecture notes.
Let p = err(h). Select k = min{k|1 Bin(k, m, err) }. Note that k,
= 1 Bin(k, m, p) = ,
P (merr(h;

S) k)
1 . Now from definition of k,

from which follows P (merr(h,

S) k)
merr(h;

S) k

1 Bin(merr(h;

S), m, p)
Bin(merr(h;

S), m, p) 1

Bin(merr(h;

S), m, 1 ) p.

Hence
P (Bin(merr(h;

S), m, 1 ) p) 1
and finally
P (Bin(merr(h;

S), m, 1 ) p) .
Changing Bin to Bin form and using Hoeffding bound for it should give us
 err(h;

S)

1
2m

ln 1 .


25

Exercise 5.3
From the lecture notes and the previous exercise we know

Choose 0 =
esis is

2|H| .

P1 = P (e(h) e(h; S) +

1
2m

ln 1 )

P2 = P (e(h) e(h; S)

1
2m

ln 1 )

The probability of violating the first bound by some hypothS

hH

P1

P1

0 =

|H| 2|H|

= 2 .

Doing the same approximation for the lower bound we get that the probability
of some hypothesis violating either bound is at most 2 + 2 = . Hence, the
probability that no hypothesis violates the bounds is 1 . Now we have new
bounds similar to those in P1 and P2 but for all the hypothesis. These two can
now be combined as
P (|err(h; P ) err(h;

S)|

1
2m

ln 2|H|
) 1 .


26

Exercise 5.4-5.5
In this exercise we can see that the simple perceptron algorithm does relatively
well on a set of high-dimensional natural data if we are just looking at measures like the test error.
Running the algorithm as suggested, you should get test errors around 6% and
convergence usually before 20 iterations. The early stopping criteria will usually
pick a hypothesis from some iteration in the first half of the iterations. If the
norms of the perceptrons are examined, these can be seen to grow each iteration,
which can sometimes be interpreted as overfitting (i.e. in a vague sense the
algorithm is just memorizing the training examples and losing the ability to
generalize). However, the early stopping method does not appear to reduce the
test error in any significant manner in this case. The method can be seen as a
heuristic that sometimes works (for example when training feedforward neural
networks with backpropagation), but your mileage might vary.
If visualized, the final hypothesis should resemble a car, you can see a form of
the chassis and the tires. It could be interpreted as a prototype.
We were also asked to compute the test set bound. The test set bound could
be easier to understand if you visualize the Bin() density as a function of p
with a fixed m (number of trials == test set size). Now we are interested in
finding the largest p (true error rate, the mean of the distribution) that fulfills
our requirement for left tail area (right tail area is 1 ). Increasing p equals
moving the mode of the reference distribution until we find a location of a good
fit to what we experienced empirically.
Test set bound can be plotted against , for example. Or you can give just some
value, such as 0.05, and get rather reasonable claims from the bound, such as
P (truee rr >= 0.09) < 0.05.
The code for R follows.
##########################################################
library(pixmap);
# for plotting a greyscale image
load(Lcars.RData); # contains instances, labels
# lets specify the test set bound function first
testSetBound<-function(testErr,testSetSizeM,delta,step=0.01) {
# note that Bin() is just the standard cumulative distribution function
# of the binomial distribution. This is already in R as pbinom(). Then
# we do just a lazy mans search with a specified granularity "step".
# note how R allows specifying default values in the function definition
# (in contrast to matlab).
failedExamples<-floor(testErr*testSetSizeM);
p<-0;go<-TRUE;tmp<-delta;
while(tmp>=delta) {
tmp<-pbinom(failedExamples,testSetSizeM, p);
p<-p+step;
27

}
return(p-step);
}
# ok, the actual learning script
dims<-dim(instances); # (rows,cols,instance#)
X<-matrix(data=0,nrow=length(labels),ncol=prod(dims[1:2]));
# turn it to a matrix
for(i in 1:length(labels)) {
X[i,]<-as.vector(instances[,,i]);
# X[i,]<-X[i,]-mean(X[i,]); # standardization is standard practice
# X[i,]<-X[i,]/sd(X[i,]); # for certain methods, see what happens here...
}
# sample train,val,test indexes
idxsleft<-1:length(labels);
trainidxs<-sample(idxsleft,600,replace=FALSE);
idxsleft<-setdiff(idxsleft,trainidxs);
validxs<-sample(idxsleft,199,replace=FALSE);
testidxs<-setdiff(idxsleft,validxs);
# split the data and labels to train and test sets
trainX<-X[trainidxs,];
valX<-X[validxs,]
testX<-X[testidxs,];
trainLab<-labels[trainidxs];
valLab<-labels[validxs];
testLab<-labels[testidxs];
wm<-rep(0,dim(X)[2]); # initial hypothesis
wstack<-NULL;
# our stored hypotheses
eta<-1;
# learning rate coeff.
# loop the perceptron training until no mistakes are made
doIter<-1;totalMistakes<-0;
while(doIter) {
mistakesNow<-0;
cat(Starting iter , doIter, ... );
for(i in 1:NROW(trainX)) {
pred<-as.numeric(sum(wm*trainX[i,])>=0)*2-1;
if(pred!=trainLab[i]) { # update on mistake
wm<-wm + trainLab[i]*eta*trainX[i,] ;
mistakesNow<-mistakesNow+1;
# visualize the current model
plot(pixmapGrey(matrix(data=wm,nrow=dims[1],ncol=dims[2])));
Sys.sleep(0.01);
}
}
28

wstack<-rbind(wstack,matrix(data=wm,nrow=1)); # append current hypothesis


totalMistakes<-totalMistakes+mistakesNow;
normNow<-sqrt(sum(wm^2));
cat(norm, normNow, , made, mistakesNow, mistakes now, ,
totalMistakes, total.\n);
doIter<-(mistakesNow>0)*(doIter+1);
}
# pick the best one according to validation set
minErr<-1;pickIdx<-1;
for(i in 1:dim(wstack)[1]) {
tmpErr<-sum(as.numeric(valX%*%wstack[i,]>=0)*2-1 != valLab)
/ length(valLab) ;
if(tmpErr<minErr) {
pickIdx<-i;
minErr<-tmpErr;
}
}
cat(picked hypothesis # , pickIdx, of, dim(wstack)[1], \n);
wm<-wstack[pickIdx,];
# show the final model
plot(pixmapGrey(matrix(data=wm,nrow=dims[1],ncol=dims[2])));
# calculate final training and test set errors
trainErr<-sum(as.numeric(trainX%*%wm>=0)*2-1 != trainLab)
/ length(trainLab) ;
testErr<-sum(as.numeric(testX%*%wm>=0)*2-1 != testLab)
/ length(testLab) ;
cat(train: , trainErr, test: , testErr,\n);
# what does the test set bound say?
delta<-0.05;
tSB<-testSetBound(testErr, length(testLab), delta);
cat(TSB claims P( true_error >=, tSB, ) < ,delta,\n);
##########################################################

rant: Sometimes machine learning has been motivated by potentially allowing


the machines to learn how to do things that we dont exactly know how to code
in. Recognition of patterns (like cars, or characters) is such a problem, and
sometimes machine learning methods have been proposed for such problems,
or, atleast the methods have been marketed on the basis of their ability to
handle such problems to some extent.
One educative aspect of this exercise is the possibility to look at such data and

29

the related classification problem. We can visualize the data, have an intuitive
interpretation for it, and we can visualize the hypothesis. Then we can ask, if
the method really is even close solving the problem or not? We should be able
to see that this can not be the case.
The first reason is that the dataset is not a representative sample from any
realistic application situation. In the case where we wish to detect a car, we are
either having photographs from various sources, or photographs from a camera
that can either be in a fixed location or mounted on a moving platform (such
as a robot). In the realistic case, we get for example the following problems
The classes are not nearly balanced

(Note that a 6% error can be disastrous in e.g. 25 frames/sec operation


or a sliding window approach.)

The cars are not necessarily centered


The cars are not necessarily sideviews
The cars are not necessarily vertical
The cars can be monster trucks
The cars can be lighted oddly
The cars can be partly occluded by other objects
...
The dataset used in this exercise doesnt reflect these problems. Instead of being
adversarial, it is almost beneficial. Even if it did adequately represent reality, a
linear algorithm on raw data could not solve this problem (for example, inverting
the pixel values of a car image will not make it a non-car, but will make the
prediction fail).
Often machine learning algorithms are run on data that we do not understand.
There is not necessarily a-priori reason to believe that all kinds of similar problems (and many others) were absent from such data. Analysing the data and
the domain, perhaps using machine learning methods as a guiding tool, may
be required if the problem is to be solved. Especially when attempting to do
industrial applications, it should be made very certain that the supplied data
reflects the problem adequately.
As a final note, the dataset also shows how on some problems there are no lowlevel irrelevant attributes, but that the relevance can depend on the values of
surrounding attributes. Visual datasets can often also allow the classifier to be
statistically misled, i.e. it can learn to classify the surroundings instead of the
objects.For example, learning to recognize humans in office environment and
sheep on pasture will lead your average method to say that a human on a green
pasture is definitely a sheep.

30

Exercise 6.1
Let Xn = {1, 1}n, Hn : Xn Y, Y = {1, 1} and V CDim(Hn ) = d.
if. V CDim(Hn ) P oly(n) log |Hn | P oly(n)?

The set D of all boolean functions in {1, 1} is finite. Restrict the function class
H to this set, choose m = 2n (want to shatter all possible boolean functions)
and use Sauers lemma.
SH (m) = max SH (D) = |HD | (
|D|=m

e2n d
em d
) =(
)
d
d

(67)

where SH () is the shattering coefficient, that is, the maximum number of ways
H is able to split some set of size m (or a given set D). Applying log on both
sides shows
log |Hn | d log

e2n
,
d

(68)

which is polynomial in n.
only. Shattering d examples clearly requires atleast 2d functions. Hence
|Hn | 2d log |Hn | d log 2 d.
Thus, if d is not polynomial, neither is log |Hn |.

(69)

Exercise 6.2
a) Lets handle k = 1. Suppose V CDim(H2 ) = d+2. Drop arbitrary hypothesis
from H2 , i.e.
H1 = H2 \ {h} H2 = H1 h.

(70)

Doing this, you lose the ability to classify 1 example. This is because H2 is able
to make all 2V CDim(H2 ) dichotomies for a set of size V CDim(H2 ), there must
be one or more hypotheses in H2 that differ from the removed hypothesis h
only by the label of 1 example. Thus H1 is able to make atleast 2V CDim(H2 )1
dichotomies for a set of size V CDim(H2 )1 and it follows that V CDim(H1)
d + 1.
General k inductively. Result: the claim is true.
b) Take two hypothesis classes, H1 = {1} and H2 = {1}. That is, both classes
contain just one constant function. As neither function class can shatter any set
of one point (for example X = {1}) arbitrarily, V CDim(H1 ) = V CDim(H2 ) =
0. However, let H = H1 H2 . Now the combined hypothesis class can shatter
a set of single point. Thus,
1 = V CDim(H) > V CDim(H1 ) + V CDim(H2 ) = 0.

(71)

Result: the claim is false.



31

Exercise 6.3
With our hypothesis class, we can set upper and lower limit separately for each
coordinate axis. So, in each dimension, we can label a set of 2 well-chosen
points arbitrarily, by selecting either of the points, both points, or no points
at all. Hence, with n dimensions, we can label atleast 2n points arbitrarily if
theyre well chosen. Then V CDim(H) 2n.
The following dataset illustrates a suitable configuration.
-1 0
1 0
0 1
0 -1
0
0

0
0
0
0
...
0 0
0 0

...
...
...
...

0
0
0
0

... 1
... -1

However, trying to add a 2n+1th point will necessarily result in a set containing
atleast three examples with each having a nonzero value in a single dimension.
Our hypothesis class is clearly not expressive enough to label three such points
arbitrarily. Geometrically, the example added after 2n will result in some point
being necessarily inside the box and thus cant be labeled in both ways.


32

Exercise 6.4
Graphically, the two interval classifiers can be illustrated roughly as follows.
+
a1

f
fhat

+
+ ...
a2
a3
a4
(XXXX]
(XXXX]
(XXXX]
(XXXX]

+
ak+1
XXX]
(XXXXXX)

Given a set of labeled examples on the real axis, the task of learning an interval
classifier is to find k split locations a1 , ...ak to classify the data as well as
possible. The other way to think about is that you have a set of intervals and
you wish to choose their endpoints so, that each interval labels the examples
ending up inside it as positive while minimizing the total classification error.
The number of intervals given designate the maximum complexity we allow to
the classifier (clearly selecting number of splits on the same order as the number
of examples allow us to overfit randomly labeled data quite well, unless some
labels of otherwise identical points differ).
The interval classifier can also be seen as a special case of the one-dimensional
segmentation problem (assume signal that comes from source 1 or source -1.
Segment the signal to segments denoting either origin 1 or -1).
a) Note that having k splits means that we can make all dichotomies where
the sign is changed at most k times (consider reading a label stream +-+++-...
from left to right). Now k examples can change sign k 1 times. Hence we can
split k + 1 examples with k splits. k + 2 examples can no longer be splitted, we
are out of splits. It follows that V CDim(H) = k + 1.
b) Dynamic programming can be used. First proceed by sorting the onedimensional data of n points to get (x01 , x02 , ..., x0n ). Now create a dynamic
programming matrix M
x1
1
2
.
k

x2

...

xn

0
0

The idea is to fill the matrix from top-down, left-right so that we consider
putting the first split to data point x01 and so on. While progressing, we sum up
the errors made by the previous choices and add the error caused by the current
choice. At M (i, j) we use the information from M (i 1, j) and M (i 1, j 1)
and add the local increase. We might have to keep a separate matrix for fa and
fa . The final hypothesis is found by backtracking the path that resulted in the
smallest total error (the empirical risk minimizer). The time-complexity of this
solution is O(kn) for filling the matrix plus O(nlogn) for sorting the data.


33

Exercise 6.5
The lecture notes presented how empirical risk minimization can be used to
estimate the Rademacher complexity. This solution is a slight modification of
the technique shown in the slides following theorem 3.23.
Theorem 3.23 the second part states that given fixed S Z m , we have
m

Rm (F ) sup |
f F

2 X
ri f (zi )| +
m i=1

1
8
ln
m

(72)

with prob. atleast 1 over random choice of r and S. Now we are interested
in estimating the first term on the right.
Let F = L01 (H) be the discrete loss class for some H.
Assume h H h
/ H(h = h).

Let S = ((x1 , y1 ), ..., (xm , ym )) (X Y )m , r {1, 1}m fixed,


S 0 = ((x1 , r1 y1 ), ..., (xm , rm ym )). L01 (h, x, y) = 21 (1 yh(x)). Now
2

ri f (zi )

= 2

ri L(h, xi , yi )

(73)

1
= 2
ri (1 yi h(xi ))
2
X
X
=
ri
yi ri h(xi )
X
X
=
ri
(1 2L(h, xi , ri yi )
X
X
=
ri m + 2
L(h, xi , ri yi )
X
=
ri m + 2merr(h,
c
S 0 ).

(74)
(75)
(76)
(77)
(78)

P
P
similarly 2 ri f (zi ) = (ri ) + m 2merr(h,
c
S 0 ). The assumption that the
hypothesis class is not closed under complementation means must look at the
flipped hypotheses separately, unlike what could be done in the lecture notes
case.
Notice that sup |x| = max(sup(x), sup(x)). We can write
sup |

2 X
2 X
2 X
ri f (zi )| = max(sup(
ri f (zi )), sup(
ri f (zi ))),
m
m
m

(79)

which is
max(sup

1
m(

1
ri m + 2merr
d1 (h, S 0 )), sup m
(

Hence, maximize err


d1 and minimize err
d2 . Calculate
error estimates. Choose the max of the two.

34

ri + m 2merr
d2 (h, S 0 ))).
P

ri and plug in with the




Exercise 7.1
In this exercise we are interested about Rademacher complexities. It suffices to
examine the term
m

X
m (F, S) = Er sup | 2
ri f (zi )|.
R
f F m i

(80)

Denote conv(F) as Fc and absconv(F) as Fa .


m (Fa , S) R
m (F, S), as in Fa we are allowed to choose v = e1 (without
i) R
loss of generality), where e1 is the standard basis vector (i.e. the first slot is 1
and the rest are 0). That is, we can atleast choose all members from F by Fa
(and by Fc too).
ii)
m

X
m (Fa , S) = Er sup | 2
ri f (zi )|
R
f Fa m i
n

Er sup |
f F ,v

Er sup

f F ,v

sup
v

sup
v

n
X
j
n
X

2 X X
vj f (zi )|
ri
m i
j

n
X
j

(81)
(82)

vj |

2 X
ri f (zi )|
m i

(83)

vj Er sup |
f F

2 X
ri f (zi )|
m i

m (F, S)
vj R

(84)
(85)

m (F, S),
= 1R

(86)

where we used the definition


P of functions in Fa and Jensen for abs (convex. by
triangle ineq.) to get out
v. The hint
Pgiven in the exercise sheet could also
be used. Finally, due to definition, 1 =
v.

We have shown that Rm (F) = Rm (Fa ). The equality to conv(Fc ) follows from
noting that Fc Fa .
Note: an interesting practical corollary of this result is that in the Rademacher
complexity sense, it appears that we can take weighted (linear) combinations
of models without increasing the complexity cost. Some practically successful
methods, such as boosting, can be seen as optimizing (roughly) similar kind of
convex hull hypothesis by an incremental, greedy search.

35

Exercise 7.2
a) Note that we are throwing dice and that the throws are independent (
E[C|A] = E[C], for example). It follows that

E[E[X|A, B]|A]

= E[E[A + B + C|A, B]|A]

(87)

= E[A + B + E[C]|A]
= A + E[B] + E[C].

(88)
(89)

and similarly E[A + B + C|A] = A + E[B] + E[C]. Just keep in mind which
variables have been fixed in the given condition and can be thus seen as
constants. Numerically, E[A] = E[B] = E[C] = 61 (1 + 2 + 3 + 4 + 5 + 6) = 3.5,
so the sum is A + 7 in both cases.
b) Need to show E[E[X|A, B]|A] = E[X|A]. This is sometimes called the law
of total probability for conditional expectations. Lets follow the hint given in the
exercise sheet and denote Y (a, b) = E[X|A = b, B = b]. By definition,

and

P
xP r(X = x, A = a, B = b)
Y (a, b) = Px
x P r(X = x, A = a, B = b)

E[Y |A = a] =
=
=
=
=
=
=

Py

yP r(Y = y, A = a)

P r(Y = y, A = a)
P P
P r(Y = y, A = a, B = b)
yy
Pb
y P r(Y = y, A = a)
P P
b
y yP r(Y = y, A = a, B = b)

(90)

(91)

P r(A = a)
b Y (a, b)P r(A = a, B = b)
P r(A = a)
P P
b
x xP r(X = x, A = a, B = b)
P r(A = a)
P
xP r(X = x, A = a)
Px
x P r(X = x, A = a)
E[X|A = a].
P

(92)
(93)
(94)
(95)
(96)
(97)

In the derivation we used the definition of Y (a, b) to get an expression for


Y (a, b)P
P r(A = a, B = b) which we inserted to (94). We also used the knowledge
that x P (X = x, A) = P (A) (i.e. we can safely add a dimension if we sum
over it, and we can also cut away such a dimension - just by summing over all
the possibilities).


36

Exercise 7.3
a)
The equivalence classes of are the basic events of F. Our requirement for
all Y means that each Y can tell the difference between any two basic events,
but not between two members of the same event.
Fix some arbitrary Y1 and Y2 . Given w , let [w] F be its equivalency class
w.r.t. . Thus Y1 and Y2 are constant over [w], a. Let now X be a function
(not necessarily F -measurable), and let Ci be the r.v. E[X|Yi ], i {1, 2}. Now
Ci

= E[X|Yi = yi ]
P
xP (X = x, Yi = yi )
= Px
P (X = x, Yi = yi )
Px
0
0
0
0
w X(w )(yi , Yi (w ))P (w )
P
=
(yi , Yi (w0 ))P (w0 )
0
P w 0
0
w 0 [w] X(w )P (w )
=
,
P ([w])

(98)
(99)
(100)
(101)

where is the Kronecker delta. Hence, Ci does not depend on i and C1 , C2 are
the same r.v.
b)
Note that in the fixed version of the exercise sheet we assume that X i is a
function of Z0 , ..., Zi .
For i {0, ..., n}, and w , define
{w}i = {w0 |Zj (w) = Zj (w0 ), j i},

(102)

which can be interpreted as the set of cases that are equivalent to w w.r.t. the
side-information available up to index i.
Due to assumptions, {w 0 |Zj (w) = a} is Fj -measurable for all j. Since Fj Fi
when j i, it follows that {w}i is Fi -measurable for all w . Also, since Xi
is determined by Z0 , ..., Zi , Xi is constant inside {w}i .

Define Y as in part (a) and [w]i as the equivalency class (basic event) of w
w.r.t. Fi similarly.

Since [w]i is the smallest Fi -measurable set containing w, we have [w]i {w}i
(think of set {w 0 |Zi (w0 ) = a}).
Further, each {w}i is a disjoint union

{w}i = [w1 ]i ... [wk ]i ,

(103)

for some w1 , ..., wk . Thus, for w let S(w) = {w 1 , ..., wk } so that (103)
holds.

37

Assume now that E[Xi+1 | Y] = Xi, or, for any w ∈ Ω,

Xi(w) = Σ_x x P(Xi+1 = x, Y = Y(w)) / Σ_x P(Xi+1 = x, Y = Y(w)).         (104)

Let's start to expand the formula with the side information,

E[Xi+1 | Z0 = Z0(w), ..., Zi = Zi(w)]
  = Σ_x x P(Xi+1 = x, Z = Z(w)) / Σ_x P(Xi+1 = x, Z = Z(w))              (105)
  = Σ_x x Σ_{w' ∈ S_i(w)} P(Xi+1 = x, Y = Y(w')) / Σ_x P(Xi+1 = x, Z = Z(w))   (106)
  = Σ_{w' ∈ S_i(w)} Σ_x x P(Xi+1 = x, Y = Y(w')) / Σ_x P(Xi+1 = x, Z = Z(w)).  (107)

Now using the definition of Xi above (applied at w') to replace the Σ_x x P(...) sum,

  = Σ_{w' ∈ S_i(w)} Xi(w') Σ_x P(Xi+1 = x, Y = Y(w')) / Σ_x P(Xi+1 = x, Z = Z(w))  (108)
  = Σ_{w' ∈ S_i(w)} Xi(w') P(Y = Y(w')) / Σ_x P(Xi+1 = x, Z = Z(w))      (109)
  = Σ_{w' ∈ S_i(w)} Xi(w) P(Y = Y(w')) / Σ_{w' ∈ S_i(w)} P(Y = Y(w'))    (110)
  = Xi(w).                                                               (111)

The last steps follow from noting that Xi is constant within {w}_i and that
Σ_x P(Xi+1 = x, Z = Z(w)) = P(Z = Z(w)) = Σ_{w' ∈ S_i(w)} P(Y = Y(w')).

It may be possible to construct a more elementary proof. You can do that as


an optional exercise if you like.



Exercise 7.4-7.5
In this exercise you were asked to calculate the bound of theorem 3.27 for the
car dataset used in exercise 5, when the marginalised perceptron was used as the
learning algorithm. The bounds were then to be used to select a model.
Unfortunately, the results appear discouraging. With this setting, the bound
given by the theorem appears to decrease monotonically as the margin size
increases, at least on the range of margins suggested by the exercise sheet.
The starting margin γ0 (the maximum of the Euclidean norms of the training
vectors, around 40 on this data) gives the best bound, which is still over one.
Clearly, such a margin doesn't appear realizable except perhaps on pathological
datasets. The marginalised perceptron could still work: it just keeps updating its
hypothesis all the time. However, the test set error for the choice γ0 can be
close to 0.5 on this data - the guessing accuracy for a label-balanced set. On the
other hand, the test set error appears to reach its minimum (about 0.07) with a
wished margin of about 0.17, which gets a worse bound from 3.27. This suggests
that the way we used the bound in this exercise is not a good way of doing
model selection in this setting.³
³ This statement is left vague on purpose, as it is not too clear what the problem is:
is it the way we used the bound, the characteristics of the dataset/algorithm
combination, some property of the bound itself, or perhaps several of these at the
same time, resulting in useless bound behaviour.
Perhaps the lesson of this exercise is that a) theoretical bounds are often not
practical, b) if you use them, you should make sure you're using them correctly,
and c) if you wish to select a model in practice, you should at least verify that
the bounds you're using agree with cross-validation or some other empirically
reasonable error estimate.
For completeness, I have included the R code I used.

#####################################################
# 7.4-5
library(pixmap);          # for plotting a greyscale image
load("Lcars.RData");      # contains instances, labels

# the actual learning script
dims<-dim(instances);     # (rows,cols,instance#)
# turn it to a matrix
X<-matrix(data=0,nrow=length(labels),ncol=prod(dims[1:2]));
for(i in 1:length(labels)) {
  X[i,]<-as.vector(instances[,,i]);
}
# sample train,test indexes
idxsleft<-1:length(labels);
trainidxs<-sample(idxsleft,floor(2/3*dim(X)[1]),replace=FALSE);
testidxs<-setdiff(idxsleft,trainidxs);
# split the data and labels to train and test sets
trainX<-X[trainidxs,];
testX<-X[testidxs,];
trainLab<-labels[trainidxs];
testLab<-labels[testidxs];

wstack<-NULL;             # our stored hypotheses
eta<-0.001;               # learning rate coeff.
maxIter<-20;              # don't ever bother with more iterations
B<-1;                     # keep the norm under this

tmp<-sqrt(apply(trainX^2,1,sum));   # ||x_t||_2 for all t
Xnorm<-max(tmp);                    # max_t ||x_t||_2
XnormSum<-sqrt(sum(tmp^2));         # sqrt(sum_t ||x_t||_2^2) for 3.27
exCount<-length(trainLab);          # m, the training example count

# generate the decreasing margin sequence
margins<-rep(2,15);
margins<-1/cumprod(margins);
margins<-margins*Xnorm;

for (marg in margins) {
  cat("For margin ", marg, " ...\n");
  # loop the perceptron training until no mistakes are made
  wm<-rep(0,dim(X)[2]);   # initial hypothesis
  doIter<-1; totalMistakes<-0;
  while(doIter && doIter<maxIter) {
    mistakesNow<-0;
    cat("  Starting iter ", doIter, " ... ");
    for(i in 1:NROW(trainX)) {
      pred<-trainLab[i]*sum(wm*trainX[i,]);
      if(pred<=marg) {    # update if margin wasn't met
        wm<-wm + trainLab[i]*eta*trainX[i,];
        mistakesNow<-mistakesNow+1;
        wnorm<-sqrt(sum(wm^2));
        if(wnorm>B) {
          wm<-wm/wnorm;
        }
        # visualize the current model once in a while
        if(runif(1)<0.01) {
          plot(pixmapGrey(matrix(data=wm,nrow=dims[1],ncol=dims[2])));
          Sys.sleep(0.01);
        }
      }
    }
    totalMistakes<-totalMistakes+mistakesNow;
    normNow<-sqrt(sum(wm^2));
    cat("  norm ", normNow, ", made ", mistakesNow, " mistakes now, ",
        totalMistakes, " total.\n");
    doIter<-(mistakesNow>0)*(doIter+1);
  }
  # append current hypothesis
  wstack<-rbind(wstack,matrix(data=wm,nrow=1));
  # show the final model
  plot(pixmapGrey(matrix(data=wm,nrow=dims[1],ncol=dims[2])));
  # calculate final training and test set errors
  trainErr<-sum((as.numeric(trainX%*%wm>=0)*2-1) != trainLab)/length(trainLab);
  testErr<-sum((as.numeric(testX%*%wm>=0)*2-1) != testLab)/length(testLab);
  cat("  train: ", trainErr, " test: ", testErr, "\n");
  # what does the test set bound say (function from prev. exer.)?
  delta<-0.05;
  tSB<-testSetBound(testErr, length(testLab), delta);
  cat("  TSB claims P( true_error >= ", tSB, " ) < ", delta, "\n");
  # calculate theorem 3.27 bound
  mu<-2*marg;
  delta<-0.01/length(margins);
  hingeLosses<-apply(mu-trainLab*(trainX%*%wm),1,max,0);
  bound<-( 1/(mu*exCount)*sum(hingeLosses)
           + (4*B)/(mu*exCount)*XnormSum
           + 3*sqrt(1/(2*exCount)*log(4/delta)) );
  cat("  theorem 3.27 bound says ", bound, "\n");
  a<-readline("Next... ?");
}
#####################################################


Exercise 8.1
a) Let Y denote a single throw of the die, so that E[Y] is the expected value of
one throw, and let Xi = E[Y1 + ... + Yn | Y1, ..., Yi]. Now

Xi = E[Y1 + ... + Yn | Y1, ..., Yi]                                      (112)
   = Y1 + ... + Yi + (n − i) E[Y],                                       (113)

as the throws are independent. But also

E[Xi+1 | Y1, ..., Yi] = E[Y1 + ... + Yi+1 + (n − (i + 1)) E[Y] | Y1, ..., Yi]
   = Y1 + ... + Yi + E[Yi+1 + (n − i − 1) E[Y] | Y1, ..., Yi]
   = Y1 + ... + Yi + E[Yi+1 + (n − i − 1) E[Y]]
   = Y1 + ... + Yi + E[Y] + (n − i − 1) E[Y]
   = Y1 + ... + Yi + (n − i) E[Y] = Xi,

due to the independence assumptions and E[Yi] = E[Y] for all i.
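
A quick simulation makes the martingale property concrete. The R sketch below
(not part of the solution) fixes the first i throws, samples the remaining throw
Y_{i+1} many times, and compares the empirical E[X_{i+1} | Y1,...,Yi] with Xi:

set.seed(1)
n <- 10                      # total number of throws
i <- 4                       # how many throws have been revealed
EY <- mean(1:6)              # expected value of a single throw, 3.5

y_seen <- sample(1:6, i, replace = TRUE)         # the fixed first i throws
Xi <- sum(y_seen) + (n - i) * EY                 # X_i from the derivation above

reps <- 100000
y_next <- sample(1:6, reps, replace = TRUE)      # many samples of Y_{i+1}
Xi1 <- sum(y_seen) + y_next + (n - i - 1) * EY   # the corresponding X_{i+1} values

cat("X_i =", Xi, " empirical E[X_{i+1}|first i throws] =", mean(Xi1), "\n")
# the two numbers agree up to sampling noise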

b) For (Xt) to be a martingale sequence w.r.t. (Ft), we need to show
E[Xt | Z0, ..., Zt−1] = Xt−1 for all random sequences (Zt) such that Zt is
Ft-measurable.

Due to exercise 7.3, it suffices to show that E[Xt+1 | Ft] = Xt; the wanted result
follows from this. As we defined Xt = E[Y | Ft], insert this definition into the
right side of the previous formula, yielding

E[E[Y | Ft+1] | Ft],                                                     (114)

which is equal to

E[Y | Ft] = Xt,                                                          (115)

according to the more general version of 7.2 mentioned in the exercise sheet.
We also need measurability of Xi w.r.t. Fi for this to work, but this follows from
the definition of Xi. Hence, we are done.



Exercise 8.2
Here we are interested in proving that the number of colors required to color a
graph G is concentrated around its expectation.
First we need to show that

|Xi+1 − Xi| = |E[χ(G)|Fi+1] − E[χ(G)|Fi]| ≤ 1.                            (116)

Do this by numbering the nodes of the graph arbitrarily. Examine node i + 1.
Due to the measurability w.r.t. Fi+1, this is equal to revealing information about
the edges from the node i + 1 to the nodes with smaller index. Clearly, adding all
such possible edges ending at i + 1 to G, or removing all such edges from G, can
only increase or decrease the number of required colors by at most 1 (the new
node i + 1 either needs a new color compared to the rest when adding edges, or
had a unique color when removing edges).
Now, from (Fi) being a filter sequence it follows that (Xi) is a martingale
(previous exercise). We also now know that it fulfills the criterion
|Xi+1 − Xi| ≤ 1 = ci as required by Azuma's inequality. Now apply that by writing

P(|Xt − X0| ≥ λ) ≤ 2 exp(−λ²/(2t))                                        (117)
P(|E[χ(G)|Ft] − E[χ(G)|F0]| ≥ λ) ≤ 2 exp(−λ²/(2t))                        (118)
P(|χ(G) − E[χ(G)]| ≥ λ) ≤ 2 exp(−λ²/(2n)),                                (119)

where we selected t = n and noted that F0 leaves χ(G) unconstrained while Fn
defines it exactly. The result follows from choosing λ = ε√n.
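
If you want to see the concentration phenomenon itself, a small simulation is easy
to set up. The R sketch below samples random graphs G(n, 1/2) and colors them with
a simple greedy heuristic; the greedy count only upper-bounds χ(G), but the small
spread of the counts is already visible:

set.seed(2)
greedy_colors <- function(adj) {
  # number of colors used by greedy colouring (an upper bound on chi(G))
  n <- nrow(adj)
  col <- rep(0, n)
  for (v in 1:n) {
    used <- col[adj[v, ] == 1 & col > 0]     # colors of already-colored neighbours
    col[v] <- min(setdiff(1:n, used))        # smallest unused color
  }
  max(col)
}

random_graph <- function(n, p = 0.5) {
  adj <- matrix(0, n, n)
  adj[upper.tri(adj)] <- rbinom(n * (n - 1) / 2, 1, p)
  adj + t(adj)                               # symmetric 0/1 adjacency matrix
}

n <- 60
counts <- replicate(200, greedy_colors(random_graph(n)))
cat("mean", mean(counts), "sd", sd(counts), "\n")   # the spread is small compared to sqrt(n)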





Exercise 8.3
a) For (R^n, ⟨·, ·⟩) to be an inner-product space, the product ⟨·, ·⟩ is required
to fulfill (this is sufficient):

1. ⟨x, y⟩ = ⟨y, x⟩ for all x, y ∈ R^n,
2. ⟨αx + y, z⟩ = α⟨x, z⟩ + ⟨y, z⟩ for all x, y, z ∈ R^n and α ∈ R,
3. ⟨x, x⟩ ≥ 0 for all x ∈ R^n.

Let's see what happens with our k(x, z) = x^T A z.

if: We need to show that if A is symmetric and positive semidefinite, we have an
inner-product space.

1. ⟨x, y⟩ = k(x, y) = x^T A y = (x^T A y)^T                              (120)
          = y^T A^T x = y^T A x = k(y, x) = ⟨y, x⟩,                      (121)-(122)

where we used the fact that x^T A y is a scalar (its transpose is itself) and the
symmetry of A.

2. ⟨αx + y, z⟩ = (αx + y)^T A z                                          (123)
             = αx^T A z + y^T A z                                        (124)-(125)
             = αk(x, z) + k(y, z).                                       (126)

3. ⟨x, x⟩ = x^T A x ≥ 0,                                                 (127)

due to positive semidefiniteness of A being defined as x^T A x ≥ 0 for all x ∈ R^n.

only if: Asymmetry of A breaks condition 1, and A not being positive semidefinite
breaks condition 3. Hence, A must be symmetric and positive semidefinite.
b) Here we want to convert our kernel into a dot-product form. Use the
eigendecomposition of A to write A = V D V^T. This leads to

x^T A z = x^T (V D V^T) z                                                (128)
        = (x^T V) D (V^T z)                                              (129)
        = (x^T V √D)(√D V^T z)                                           (130)-(131)
        = ⟨ √D V^T x, √D V^T z ⟩                                         (132)-(133)
        = ⟨ φ(x), φ(z) ⟩,                                                (134)

where the steps can be checked by examining the sizes of the matrices involved in
the multiplications, and noting that √D is symmetric (D is diagonal with
non-negative entries since A is pos. sem. def., so √D exists).
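
This construction is easy to check numerically. The R sketch below builds a random
symmetric PSD matrix A, forms φ(x) = √D V^T x from its eigendecomposition, and
compares ⟨φ(x), φ(z)⟩ with x^T A z:

set.seed(3)
n <- 5
M <- matrix(rnorm(n * n), n, n)
A <- t(M) %*% M                      # symmetric positive semidefinite by construction

e <- eigen(A, symmetric = TRUE)      # A = V D V^T
phi <- function(x) sqrt(pmax(e$values, 0)) * (t(e$vectors) %*% x)   # sqrt(D) V^T x

x <- rnorm(n); z <- rnorm(n)
cat(sum(phi(x) * phi(z)), "vs", as.numeric(t(x) %*% A %*% z), "\n")  # equal up to rounding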


Exercise 8.4
Since we already know that (R^n, ⟨·, ·⟩) is a Hilbert space, if we select that, we
only need to construct the proper mappings φ.

i) Choose (R^{2^n}, ⟨·, ·⟩) as the space. Define φ : 2^Z → R^{2^n} s.t.

φ(A)_i = 1 if S_i ⊆ A, and 0 otherwise,

where S_i is the i:th member of P(Z). That is, φ(A) has one dimension (index)
for each possible subset. Now k1(A, B) = ⟨φ(A), φ(B)⟩.
All binary vectors in the feature space are of the required form.

ii) Choose (R^n, ⟨·, ·⟩), and suppose P(A) = Σ_{a∈A} P(a). Now select

φ(A)_i = √P(a_i) if a_i ∈ A, and 0 otherwise.

This results in k2(A, B) = ⟨φ(A), φ(B)⟩ = P(A ∩ B).

Loosely speaking, the suitable vectors are constrained by P's nature as a
distribution to be non-negative, and forced by Σ_i P(a_i) = 1 to lie inside some
ball. The main point perhaps is that not all points of the feature space correspond
to a dot-product between some two examples.
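
Part ii) can again be sanity-checked numerically. A minimal R sketch, assuming for
illustration that Z = {1,...,6} with a uniform distribution P:

Z <- 1:6
P <- rep(1/6, length(Z))                          # a distribution on Z

phi <- function(A) ifelse(Z %in% A, sqrt(P), 0)   # phi(A)_i = sqrt(P(a_i)) if a_i in A

A <- c(1, 2, 3); B <- c(2, 3, 5)
cat(sum(phi(A) * phi(B)), "vs", sum(P[Z %in% intersect(A, B)]), "\n")   # both are P(A n B)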


Exercise 8.5
Let X = R^n, S = ((x1, y1), ..., (xm, ym)), C > 0,

V = { Σ_{i=1}^m α_i x_i | α_i ∈ R },  and
R(w) = 1/2 Σ_t (w·x_t − y_t)² + (C/2) ||w||₂².

We should show that w* = argmin_w R(w) ∈ V. Following the given hint, choose an
arbitrary w from V, that is, an arbitrary α ∈ R^m. Now

R(w) = R(Σ_i α_i x_i)
     = 1/2 Σ_t ((Σ_i α_i x_i)·x_t − y_t)² + (C/2) ||Σ_i α_i x_i||₂².     (135)

Denote by z some addition to w that is orthogonal to all the examples x_t (and
hence to V). The first piece of the right-hand side then turns into

1/2 Σ_t ((Σ_i α_i x_i + z)·x_t − y_t)² = 1/2 Σ_t ((Σ_i α_i x_i)·x_t + z·x_t − y_t)²;

noticing that z·x_t is zero due to the orthogonality, we see that we cannot improve
the squared-error term by adding z. However, the norm term (the second one on the
right-hand side) ends up as

(C/2)(||w||² + 2⟨w, z⟩ + ||z||²),

that is, the norm can only increase (⟨w, z⟩ = 0).
Rant: the term on the right is called the regularization term, which can be loosely
interpreted as an extra cost paid for model complexity (here the norm). In this
setting, a learning process can be interpreted as minimization of a cost function
such as R. If we did regression and set C to zero, the minimization would result in
standard least-squares linear regression with no concern for, e.g., very high
absolute values of the coefficients. Such an estimation procedure is easily
thwarted by outlying examples (the squared error is unbounded). Hence,
regularization can be seen both as addressing overfitting and as a way to control
the sensitivity of the learning process to errors in the data (which are not
necessarily the same issue).
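
A quick numerical illustration of the representer-style claim (a sketch on made-up
data, not part of the solution): solve the regularized problem in closed form and
verify that the minimizer lies in the span of the examples, i.e. that w = X^T α
has an exact solution α:

set.seed(4)
m <- 5; n <- 20                    # fewer examples than dimensions
X <- matrix(rnorm(m * n), m, n)    # rows are the examples x_t
y <- rnorm(m)
C <- 0.5

# minimizer of 1/2 sum_t (w.x_t - y_t)^2 + C/2 ||w||^2
w <- solve(t(X) %*% X + C * diag(n), t(X) %*% y)

# try to write w as X^T alpha: alpha solves (X X^T) alpha = X w
alpha <- solve(X %*% t(X), X %*% w)
cat("max |w - X^T alpha| =", max(abs(w - t(X) %*% alpha)), "\n")   # ~ 0, so w is in the span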



Exercise 9.1
Reminder: A is pos. sem. def. iff z^T A z ≥ 0 holds for all z ∈ R^m.

i. Is k(x, y) = k1(x, y) + k2(x, y) a kernel if k1 and k2 are?

Take arbitrary n points from X. Now G1 and G2 are valid Gram matrices w.r.t.
k1 and k2, and thus pos. sem. def. The candidate Gram matrix G for k is
clearly G1 + G2. Now

z^T G z = z^T (G1 + G2) z = z^T G1 z + z^T G2 z ≥ 0.                     (136)

Hence G is pos. sem. def., k is symmetric (as a sum of two symmetric functions),
and it is continuous as a sum of two continuous functions (by assumption). Now
theorem 4.5 shows that k is a kernel.

ii. Is k(x, y) = a·k1(x, y), a > 0, a kernel?

z^T (aG) z = a (z^T G z) ≥ 0,                                            (137)

and the result follows as before.


iii. Let k(x, y) = ⟨φ(x), φ(y)⟩ for the original φ. Let's expand the dot product
related to the new mapping φ̃,

⟨φ̃(x), φ̃(y)⟩ = ⟨ φ(x)/||φ(x)||, φ(y)/||φ(y)|| ⟩                          (138)
             = (1/||φ(x)||)(1/||φ(y)||) ⟨φ(x), φ(y)⟩                     (139)
             = k(x, y) / (√k(x, x) √k(y, y))                             (140)
             = k(x, y) / (k(x, x) k(y, y))^{1/2} = k̃(x, y).              (141)

Hence k̃ is a kernel, as we could explicitly construct the related dot product of
the new feature mapping, inheriting the suitable space from the original mapping φ.
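
These closure properties are easy to spot-check on concrete Gram matrices. A small
R sketch (taking the linear kernel and its square as concrete choices for k1 and
k2) verifies that the sum, a positive multiple, and the normalized version all give
positive semidefinite Gram matrices:

set.seed(5)
X <- matrix(rnorm(8 * 3), 8, 3)               # 8 points in R^3
G1 <- X %*% t(X)                              # linear kernel Gram matrix
G2 <- (X %*% t(X))^2                          # quadratic kernel (k1^2) Gram matrix

min_eig <- function(G) min(eigen((G + t(G)) / 2, symmetric = TRUE)$values)

Gsum  <- G1 + G2                              # i.   sum of kernels
Gscal <- 3 * G1                               # ii.  positive multiple
Gnorm <- G2 / sqrt(outer(diag(G2), diag(G2))) # iii. normalized kernel

cat(min_eig(Gsum), min_eig(Gscal), min_eig(Gnorm), "\n")   # all >= 0 (up to rounding)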



Exercise 9.2
We know from (4.7.1)-(4.7.3) of the lecture notes that sums, positive scalar
multiples and products of kernels are kernels.
i. It appears we have to restrict ourselves to polynomials with positive
coefficients.
First, show that any single monomial of the polynomial is a kernel. This can be
done by induction: we know k^1 is a kernel. Now suppose that k' = k^n is a kernel.
Due to (4.7.3), k'·k = k^{n+1} is a kernel. This completes the induction w.r.t.
exponentiation of a kernel. Now (4.7.2) tells us that a·k' = a·k^n is a kernel for
all a > 0, as k' is a kernel. Finally, we know from (4.7.1) that k1 + k2 is a
kernel for all kernels k1, k2; a similar inductive argument shows that finite sums
of kernels are kernels. This shows that we can construct any such polynomial out of
k while still retaining the kernel property.
ii. It can be taken as known that

e^x = Σ_{n=0}^∞ x^n / n!,   x ∈ R.                                       (142)

Hence the required e^k is an infinite sum of kernels (the monomials are kernels due
to the previous part). In terms of Gram matrices, G = lim_n Σ_{i=1}^n G_i. We need
to show that G is pos. sem. def. We have an infinite sum of non-negative terms,

z^T G z = z^T G1 z + z^T G2 z + ...                                      (143)

The limit is clearly non-negative as each of the terms is.
iii. k(x, y) = exp(−||x − y||²/(2σ²)), where ||z||² = ⟨z, z⟩ = z^T z. Is it a
kernel?
First, let's show that exp(x^T y/σ²) is a kernel. This is true because ⟨·, ·⟩ is a
(linear) kernel, a·k is a kernel (with a = 1/σ²), and exp(k) is a kernel due to the
previous part. We also know that a normalized kernel is a kernel (prev. ex.), so we
normalize the exp kernel. A few manipulations show the result as follows,

exp(x^T z/σ²) / ( exp(x^T x/σ²)^{1/2} exp(z^T z/σ²)^{1/2} )              (144)
    = exp( x^T z/σ² − (x^T x/σ² + z^T z/σ²)/2 )                          (145)
    = exp( −1/(2σ²) (x^T x − x^T z + z^T z − x^T z) )                    (146)
    = exp( −1/(2σ²) (x^T (x − z) + z^T (z − x)) )                        (147)
    = exp( −1/(2σ²) (x − z)^T (x − z) )                                  (148)
    = exp( −1/(2σ²) ||x − z||² ).                                        (149)


Exercise 9.3
i. The optimization problem can be written as

min    x²
s.t.   (x − 2)² − 1 ≤ 0.

The related Lagrangian is

L(x, α) = x² + α((x − 2)² − 1).                                          (150)

The KKT conditions can be written as

∂L/∂x = 2x + α(2x − 4) = 0                                               (151)
(x − 2)² − 1 ≤ 0                                                         (152)
α ≥ 0                                                                    (153)
α((x − 2)² − 1) = 0.                                                     (154)

Suppose α ≠ 0. Then from condition (154) it follows that (x − 2)² − 1 = 0, i.e.
x = 1 or x = 3. Inserting x = 1 into the first condition reveals 2 + α(2 − 4) = 0,
i.e. α = 1 (x = 3 would require α = −3 < 0). The pair (x = 1, α = 1) can be seen to
fulfill the KKT conditions, so we are done (supposing α = 0 would result in
condition (152) being violated).
The dual is specified as d(α) = inf_x L(x, α). Solving the derivative in the first
KKT condition for x results in x = 2α/(1 + α). Inserting this back into the
Lagrangian and simplifying a little results in the dual

d(α) = α(3 − α)/(1 + α)                                                  (155)

(inserting α = 1 into d indeed returns 1, the primal optimum).


ii. In the case g(x) ≤ 8 the second and last conditions change to

(x − 2)² − 8 ≤ 0                                                         (156)
α((x − 2)² − 8) = 0.                                                     (157)

The choice α ≠ 0 allows no solutions. Hence α = 0. Solving for this choice results
in x = 0. With the new Lagrangian L' (omitted) the dual is −4α(2α + 1)/(1 + α),
solved as previously.
Plotting the situation instantly reveals the intuitive acceptability of these
results: with g(x) ≤ 8 the feasible region is large enough to include the lowest
point of the figure f(x) = x², that is, the constraint is no longer constraining
the optimization. In the first part the g(x) ≤ 1 constraint left x = 0 outside the
feasible region.
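
A one-line numerical check of part i) (a sketch using base R only): minimize x²
over the feasible interval and compare with the KKT solution and the dual value:

f <- function(x) x^2
# feasible region of part i): (x - 2)^2 - 1 <= 0, i.e. the interval [1, 3]
opt <- optimize(f, interval = c(1, 3))
cat("minimizer", opt$minimum, "value", opt$objective, "\n")   # x ~ 1, f(x) ~ 1

# dual of part i): d(alpha) = alpha (3 - alpha) / (1 + alpha), maximized at alpha = 1
d <- function(a) a * (3 - a) / (1 + a)
cat("dual value at alpha = 1:", d(1), "\n")                   # also 1, no duality gap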


Exercise 9.4
In this exercise, Wt is the set of all linear models with the squared error for the
current example (x, y) less than R². We wish to choose our new model w from this
set so that it is as close as possible to our current model wt. Drawing a picture
can help here too.
This can be formulated as an optimization problem,

min    ||wt − w||²
s.t.   (w^T x − y)² − R² ≤ 0.

This is a convex problem, the functions are differentiable, and due to Slater's
theorem strong duality holds (we can satisfy its conditions), so the problem seems
suitable for a KKT-style solution. The Lagrangian is

L(w, α) = ||wt − w||² + α((w^T x − y)² − R²).                            (158)

∇_w L = 0 gives the first KKT condition, and writing out the rest,

−2wt + 2w + 2α(w^T x − y)x = 0                                           (159)
(w^T x − y)² − R² ≤ 0                                                    (160)
α ≥ 0                                                                    (161)
α((w^T x − y)² − R²) = 0.                                                (162)

Supposing α = 0 results in the solution w = wt. This means that wt is already in
the feasible region, and naturally it is the closest point to itself.
If α ≠ 0, we know from the last KKT condition that

(w^T x − y)² − R² = 0   ⟺   w^T x − y = kR,   k ∈ {−1, 1}.               (163)

Inserting this into the first KKT condition gives

w = wt − αkRx.                                                           (164)

Inserting this back into (163) returns, after a few manipulations,

α = (wt^T x − y − kR) / (kR x^T x),                                      (165)

which we can insert into (164) to get the update rule

w = wt + (kR + y − wt^T x) x / (x^T x).                                  (166)

Finally, we can solve k from the second KKT condition (and α ≥ 0), leading to the
selection of k = 1 if wt^T x − y ≥ R and k = −1 otherwise.
Note the close resemblance of the solution to the one found in exercise 3.4. The
geometric intuition should be similar.
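
The derived update is easy to test. The R sketch below (with made-up data) takes a
current model wt that violates the constraint, applies the rule, and checks that
the new w satisfies w^T x − y = kR:

set.seed(7)
n <- 5; R <- 0.5
wt <- rnorm(n); x <- rnorm(n)
y <- sum(wt * x) - 2              # the current prediction misses y by 2 > R, so wt is infeasible

k <- if (sum(wt * x) - y >= R) 1 else -1                 # here k = 1
w <- wt + (k * R + y - sum(wt * x)) * x / sum(x * x)     # the update rule derived above

cat("old residual", sum(wt * x) - y, " new residual", sum(w * x) - y, "\n")  # 2 -> 0.5
cat("step norm", sqrt(sum((w - wt)^2)), "\n")            # the step is along x only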



Exercise 9.5
This problem might be solvable more easily through convexity and Jensen's
inequality, by demonstrating the lower bound of 0 and showing that the choice
p = q meets the bound. However, here we take another route to drill the KKT
techniques.
Reformulating, we need to solve the problem

min    Σ_i p_i ln(p_i/q_i)
s.t.   p_i ≥ 0,   i ∈ {1, ..., n}
       Σ_i p_i − 1 = 0.

As a difference to the previous exercises, we now have n inequality constraints and
one equality constraint. The Lagrangian can be written as

L(p, α, β) = Σ_i p_i ln(p_i/q_i) − Σ_i α_i p_i + β(Σ_i p_i − 1)          (167)
           = Σ_i p_i (ln p_i − ln q_i − α_i + β) − β.                    (168)

Now

∂L/∂p_i = ln p_i − ln q_i − α_i + β + 1.                                 (169)

Setting this to zero gives

p_i = q_i exp(α_i − β − 1).                                              (170)

Further, inserting this back into the Lagrangian reveals the dual

g(α, β) = −Σ_i q_i exp(α_i − β − 1) − β.                                 (171)

For this choice of p_i we know that

p_i = q_i exp(α_i − β − 1) > 0,                                          (172)

because q_i > 0 by assumption and exp(·) is always positive. This means that
α_i = 0 for all i, because all the related constraints are inactive. From the
condition requiring p to be a distribution, we get

1 = Σ_i p_i                                                              (173)
  = Σ_i q_i exp(−β − 1) = exp(−β − 1) Σ_i q_i                            (174)
  = exp(−β − 1)   ⟹   exp(−β − 1) = 1.                                   (175)

Inserting these into (170) results in p_i = q_i for all i. It can be checked that
the KKT conditions are fulfilled by these choices.
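
A quick numerical confirmation in R (a sketch; the simplex constraint is handled by
a softmax parametrization so that plain unconstrained optim can be used):

set.seed(8)
q <- c(0.2, 0.5, 0.3)
kl <- function(p, q) sum(p * log(p / q))

# parametrize p by logits so that p is automatically a distribution
obj <- function(theta) { p <- exp(theta) / sum(exp(theta)); kl(p, q) }
fit <- optim(rnorm(3), obj)
p_hat <- exp(fit$par) / sum(exp(fit$par))
print(round(p_hat, 3))    # close to q = (0.2, 0.5, 0.3); minimum value close to 0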


Exercise 10.1
The optimization criterion is

min    −γ + C Σ_{i=1}^m ξ_i²
s.t.   ||w||² − 1 ≤ 0
       γ − ξ_i − y_i(w^T φ(x_i) − b) ≤ 0,   i ∈ 1, ..., m
       ( ξ_i ≥ 0,                           i ∈ 1, ..., m )

Notice that we have dropped the parenthesized constraint as it is redundant. This
can be seen by noting that for all i, if ξ_i < 0, we can set ξ_i = 0 while still
fulfilling the remaining constraints and yet getting a smaller objective function.
Continue as usual by writing the Lagrangian,

L(w, b, γ, ξ, α, λ) = −γ + C Σ_i ξ_i² − Σ_i α_i(−γ + ξ_i + y_i(w^T φ(x_i) − b))
                      + λ(||w||² − 1),                                   (176)

and differentiating (note that w and ξ are vectors) and setting to zero,

∂L/∂w = −Σ_i α_i y_i φ(x_i) + 2λw = 0   ⟹   w = (1/(2λ)) Σ_i α_i y_i φ(x_i)   (177)-(178)
∂L/∂b = Σ_i α_i y_i = 0                 ⟹   Σ_i α_i y_i = 0              (179)
∂L/∂γ = −1 + Σ_i α_i = 0                ⟹   Σ_i α_i = 1                  (180)
∂L/∂ξ_i = 2Cξ_i − α_i = 0               ⟹   ξ_i = α_i/(2C).              (181)

Inserting these back into the Lagrangian and beautifying eventually yields the
dual,

G(α, λ) = −(1/(4C)) Σ_i α_i² − (1/(4λ)) Σ_i Σ_j α_i α_j y_i y_j k(x_i, x_j) − λ,   (182)

where we have changed to the kernel notation (k(x_i, x_j) = φ(x_i)^T φ(x_j)).
Denote the double summation term in the above formula by a. Now

∂G/∂λ = (1/4) a λ^{−2} − 1 = 0   ⟹   λ = √a / 2.                         (183)

Note that λ must be positive as a Lagrange multiplier, and that if a were negative,
the derivative wouldn't have a valid (real-valued) root. Inserting this back into
the dual G results in the wanted solution

W(α) = −(1/(4C)) Σ_i α_i² − ( Σ_i Σ_j α_i α_j y_i y_j k(x_i, x_j) )^{1/2}.   (184)



Exercise 10.2
a) The task is to minimize

G(α) = Σ_i exp(−y_i α h(x_i) − y_i f(x_i))                               (185)
     = Σ_i exp(−y_i α h(x_i)) exp(−y_i f(x_i))                           (186)
     = Σ_i exp(−y_i α h(x_i)) d_i,                                       (187)

where we noticed that the exp term on the right is actually the weighting
distribution d_i. Then proceed by decomposing the sum into the parts where the
hypothesis makes a mistake and where it does not (since y, h ∈ {−1, 1}, the exps
simplify nicely),

     = Σ_i d_i exp(−y_i α h(x_i))                                        (188)
     = Σ_{err} d_i exp(α) + Σ_{ok} d_i exp(−α)                           (189)
     = exp(α) Σ_{err} d_i + exp(−α) (Σ_i d_i − Σ_{err} d_i)              (190)
     = exp(α) Σ_{err} d_i + exp(−α) (1 − Σ_{err} d_i),                   (191)

which, remembering the definition of ε = Σ_{err} d_i, is straightforwardly

     = ε exp(α) + exp(−α) − ε exp(−α).                                   (192)

Differentiating this gives

∂G/∂α = ε exp(α) − exp(−α) + ε exp(−α).                                  (193)

Setting the above to zero and solving for α yields

α = (1/2) ln((1 − ε)/ε).                                                 (194)
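
As a sanity check, the R snippet below minimizes G(α) = ε e^α + (1 − ε) e^{−α}
numerically and compares with the closed form ½ ln((1 − ε)/ε):

eps <- 0.2                                        # weighted error of the weak hypothesis
G <- function(a) eps * exp(a) + (1 - eps) * exp(-a)

num <- optimize(G, interval = c(-10, 10))$minimum # numerical minimizer
cls <- 0.5 * log((1 - eps) / eps)                 # closed-form minimizer from (194)
cat(num, "vs", cls, "\n")                         # the two agree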

b) Let's denote our new distribution by u. Then we have from AdaBoost that

err(h, S, u) = Σ_{i=1}^n u_i I(h(x_i) ≠ y_i) =: ε(u),                    (195)
u_i = (1/Z) d_i exp(−α y_i h(x_i)),                                      (196)
Σ_i u_i = 1,                                                             (197)
α = (1/2) ln((1 − ε)/ε)  (computed from the error under the current d).  (198)

We need to show that ε(u) = 1/2. Let's insert u_i into ε(u) first. This results in

(1/Z) Σ_i d_i exp(−α y_i h(x_i)) I(h(x_i) ≠ y_i) = (exp(α)/Z) Σ_{err} d_i.   (199)

Requiring Z to be such that Σ_i u_i = 1 allows us to solve for Z; decomposing it
into the error and non-error sums and plugging it in gives

exp(α) Σ_{err} d_i / ( Σ_{err} d_i exp(α) + Σ_{ok} d_i exp(−α) ),        (200)

which we can further manipulate to get

exp(α) Σ_{err} d_i / ( (exp(α) − exp(−α)) Σ_{err} d_i + exp(−α) ).       (201)

Now inserting the algorithm's choice for α and noting again that ε = Σ_{err} d_i,
we get the slightly formidable expression

√((1 − ε)/ε) · ε / ( (√((1 − ε)/ε) − √(ε/(1 − ε))) ε + √(ε/(1 − ε)) ).   (202)

However, with some sweating this can be simplified to 1/2.
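
This is again easy to check numerically in R: draw random weights and predictions,
apply the reweighting (196) with the α from (194), and compute the new weighted
error:

set.seed(9)
m <- 50
d <- runif(m); d <- d / sum(d)                # current weights d_i
agree <- sample(c(1, -1), m, replace = TRUE)  # y_i * h(x_i), i.e. +1 if h is correct
eps <- sum(d[agree == -1])                    # weighted error of h under d

alpha <- 0.5 * log((1 - eps) / eps)
u <- d * exp(-alpha * agree); u <- u / sum(u) # AdaBoost reweighting, normalized

cat("new weighted error:", sum(u[agree == -1]), "\n")   # exactly 0.5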





Exercise 10.3
Here we wish to find the closest distribution d (in the relative-entropy sense) to
the previous distribution q, such that the previous hypothesis has zero edge on the
new distribution. This can be written as

min    Σ_i d_i ln(d_i/q_i)
s.t.   d_i ≥ 0
       Σ_i d_i − 1 = 0
       Σ_i d_i y_i h(x_i) = 0.

The related Lagrangian is

L(d, α, β, γ) = Σ_i d_i ln(d_i/q_i) − Σ_i α_i d_i + β(Σ_i d_i − 1) + γ Σ_i d_i y_i h(x_i)
             = Σ_i d_i (ln d_i − ln q_i − α_i + β + γ y_i h(x_i)) − β.

Differentiating and setting to zero gives

∂L/∂d_i = ln d_i − ln q_i − α_i + β + γ y_i h(x_i) + 1 = 0,

so

d_i = q_i exp(α_i − β − γ y_i h(x_i) − 1)                                (203)-(204)
    = q_i exp(α_i − β − 1) exp(−γ y_i h(x_i))                            (205)
    = q_i exp(−β − 1) exp(−γ y_i h(x_i)),                                (206)

where we dropped α_i since, similarly to exercise 9.5, we know that d_i > 0 and
hence α_i = 0 for all i. From the requirement Σ_i d_i = 1 we can also use the above
to solve for β, giving

exp(−β − 1) = 1 / Σ_j q_j exp(−γ y_j h(x_j)) = 1/Z.                      (207)

Inserting this into the previous expression for d_i yields

d_i = (1/Z) q_i exp(−γ y_i h(x_i)),                                      (208)

the AdaBoost update formula for d_i (with the multiplier γ in the role of
AdaBoost's coefficient). Note that Z is a function of γ until we fix γ. Hence, to
wrap this up, we need to show that AdaBoost's choice of the coefficient minimizes
the primal problem (maximizes the dual).
Inserting d_i, α_i = 0 and β back into L(·) gives the dual

G(γ) = Σ_i d_i ( ln( q_i exp(−β − 1) exp(−γ y_i h(x_i)) / q_i ) + β + γ y_i h(x_i) ) − β
     = Σ_i d_i ( −β − 1 − γ y_i h(x_i) + β + γ y_i h(x_i) ) − β
     = −Σ_i d_i − β
     = −1 − β = −ln Z
     = −ln Σ_j q_j exp(−γ y_j h(x_j)).

Now differentiating the dual w.r.t. γ yields

∂G/∂γ = −(1/Z) Σ_i q_i (−y_i h(x_i)) exp(−γ y_i h(x_i))
      = (1/Z) Σ_i q_i y_i h(x_i) exp(−γ y_i h(x_i)),

which can be zero iff Σ_i q_i y_i h(x_i) exp(−γ y_i h(x_i)) = 0. By decomposing
this sum into the cases where the hypothesis made an error and where it was
correct, we get

Σ_{err} (−1) q_i exp(γ) + Σ_{ok} (+1) q_i exp(−γ) = 0.                   (209)

Then, remembering that ε = Σ_{err} q_i,

−ε exp(γ) + (1 − ε) exp(−γ) = 0.                                         (210)

Solving this for γ yields

γ = (1/2) ln((1 − ε)/ε),                                                 (211)

the same value as is chosen by AdaBoost.





Exercise 10.4
In this and the next exercise you were supposed to experiment with an existing SVM
package called OsuSVM. With more complex machine learning algorithms (such as SVMs,
decision trees, neural networks, etc.) making an efficient and practical
implementation can be very time consuming. It's often a better idea to make
feasibility tests (i.e. does the solution given by some type of algorithm look
anything like acceptable?) with existing toolboxes, if available.
I have included the code below. It's more illustrative to run it yourself and play
with it, but suffice it to say that a linear kernel is unsuitable for the problem,
a 2nd order kernel does nicely, and larger orders can overfit. Choosing the value
of C is not very critical here; the proper kernel is often the more important of
the two (and its selection is clearly a much more difficult problem). More noise in
the data would make C more important.
clear;
% add the SVM package to path
addpath('/home/jtlindgr/packages/osu_svm/');
load('ball.mat');     % load the provided data

for degree=[1 2 3 4];
  for C=[0.01 0.1 0.5 1 2 4 8];
    % learn a model. notice the changed data order reqd. by OsuSVM
    [AlphaY,SVs,Bias,Parameters,nSV,nLabel] = PolySVC(x,y,degree,C);
    % draw a nice plot
    SVMPlot2(AlphaY,SVs,Bias,Parameters,x,y);
    % all models can be tested with the same command on osu svm
    [Label,DecisionValue] = SVMClass(x,AlphaY,SVs,Bias, ...
        Parameters,nSV,nLabel);
    % what about the achieved training error?
    err = sum(Label~=nLabel)/length(nLabel);
    fprintf(1, 'd=%d C=%f - Train err was %f\n', degree, C, err);
    pause
  end
end
With a linear kernel I got training errors around 40%, and the algorithm selected
nearly 400 support vectors (but remember that one vector is enough to represent a
linear classifier, and the solution could be converted if wished). Using a second
order polynomial kernel and C = 1, the training error was a little over 5%. Also,
the required number of support vectors got smaller as C was increased; C = 1
resulted in only 60 support vectors used. With a very expressive kernel and large
C, the training error could probably be pushed to zero - not a good idea; we'd just
be fitting noise and getting worse test errors.


Exercise 10.5
This exercise is similar to the previous one, with the difference that now we
are considering the car dataset again, trying out RBF (Gaussian) kernel, and
looking at the test error too.
The code could look as follows,
%%%%%% 10.5 %%%%%%%
clear;
% add the SVM package to path
addpath('/home/jtlindgr/packages/osu_svm/');
load('Lcars.mat');
% turn the car image array into an example matrix (examples as rows)
[rows,cols,pics]=size(instances);
instances=reshape(instances,[rows*cols pics])';
% permute randomly
rp=randperm(pics);
instances=instances(rp,:);
labels=labels(rp);
% split data to train and test sets
split=floor(0.7*pics);
trainX=instances(1:split,:);
testX=instances((split+1):end,:);
trainLab=labels(1:split);
testLab=labels((split+1):end);
gammas=[0.001 0.01 0.05 0.1];
Cs=[0.5 1 2 4 8];
errmat=zeros(length(gammas),length(Cs));
didInit=0;
for i=1:length(gammas)
  for j=1:length(Cs)
    % train SVM. note the transposed inputs required by OsuSVM. for tuning
    % parameters, use "help PolySVC" etc.
    %[AlphaY,SVs,Bias,Parameters,nSV,nLabel] = LinearSVC(trainX',trainLab');
    %[AlphaY,SVs,Bias,Parameters,nSV,nLabel] = PolySVC(trainX',trainLab');
    [AlphaY,SVs,Bias,Parameters,nSV,nLabel] = RbfSVC(trainX',trainLab', ...
        gammas(i),Cs(j));
    % training error
    [Label,DecisionValue] = SVMClass(trainX',AlphaY,SVs,Bias, ...
        Parameters,nSV,nLabel);
    trainErr = sum(Label~=trainLab)/length(trainLab);
    % prediction error on test set
    [Label,DecisionValue] = SVMClass(testX',AlphaY,SVs,Bias, ...
        Parameters,nSV,nLabel);
    testErr = sum(Label~=testLab)/length(testLab);
    fprintf(1, 'For gamma=%f, C=%f, trainerr %f, testerr %f...\n', ...
        gammas(i), Cs(j), trainErr, testErr);
    errmat(i,j)=testErr;
  end
end
%%%%%%%%%%%%%%%%%%%
In this problem, the choice of the kernel parameter γ is critical. Small values as
in the code above usually work well, whereas with larger values the training error
is easy to get to zero, but the testing accuracy starts to resemble random
guessing. Using γ = 0.001 and C = 1, a test error of around 3% can be reached, with
some fluctuation due to the random permutations of the data. Note that although the
achieved error is generally better than what we previously got with linear
perceptrons (that was around 6%), the critical notes about the dataset made in the
reference solutions of exercises 5.4-5.5 still apply.
rant: In this mostly theoretical course we have not too seriously considered
empirically comparing the solutions returned by the learning algorithms. In some
situations the comparisons should be done more properly using established
statistical methods: e.g. calculate the statistical significance of differences in
error rates (if any), analyze error variances, consider the importance of the type
and amount of errors to the application domain, make very sure that the test set is
not contaminated by parameter tuning or by the training set data, and so on. Such
practices, however, could merit a course of their own.



Exercise 11.1
a) Let X, Y and f : X × Y → R be arbitrary. We need to show that

sup_{x∈X} inf_{y∈Y} f(x, y) ≤ inf_{y∈Y} sup_{x∈X} f(x, y).               (212)

The reasoning goes quite straightforwardly from the definitions of sup and inf,

f(x, y) ≤ sup_{x∈X} f(x, y)                      for all x, y            (213)
inf_{y∈Y} f(x, y) ≤ sup_{x∈X} f(x, y)            for all x, y            (214)
sup_{x∈X} inf_{y∈Y} f(x, y) ≤ sup_{x∈X} f(x, y)  for all y               (215)
sup_{x∈X} inf_{y∈Y} f(x, y) ≤ inf_{y∈Y} sup_{x∈X} f(x, y).               (216)

On the first line we noted that f (x, y) must always be smaller than its supremum.
Since this holds for all x, y, it definitely holds for inf yY , and on the next line, for
supxX too (sup can at most make the bound tight). The last step is reasoned
similarly.
b) Let I = {1, ..., n} and Δ(I) = {p ∈ R^n | p_i ≥ 0, Σ_i p_i = 1}. Show that

sup_{p∈Δ(I)} Σ_i p_i f(i) = max_{i∈I} f(i),   f : I → R.                 (217)

That is, Δ(I) is the set of all n-dimensional probability distributions, and we
wish to show that in this setting the supremum is achieved by a distribution that
puts all its weight on a single case (this will be useful in the next exercise).
Denote f(m) = max_i f(i). By selecting p_m = 1 and p_i = 0 for i ≠ m, we instantly
see that the maximum can be achieved. It remains to show that the right side
upper-bounds the left side:

sup_p Σ_i p_i f(i) ≤ sup_p Σ_i p_i f(m)                                  (218)
                   = f(m) sup_p Σ_i p_i                                  (219)-(220)
                   = f(m) · 1 = f(m) = max_{i∈I} f(i).                   (221)



Exercise 11.2
a) Von Neumann's theorem states

max_p min_q V(p, q) = min_q max_p V(p, q).                               (222)

In this case, we denote the distribution chosen by the max-margin player by p and
the one used by the edge player by q (w and d in the lecture notes). The payoff
matrix and the expected payoff can be written as

M(i, j) = h_i(x_j) y_j,                                                  (223)
V(p, q) = Σ_{i=1}^n Σ_{j=1}^m p_i M_ij q_j.                              (224)

Writing out Von Neumann's theorem gives

max_p min_q Σ_i Σ_j p_i h_i(x_j) y_j q_j = min_q max_p Σ_i Σ_j p_i h_i(x_j) y_j q_j   (225)
max_p min_q Σ_j q_j Σ_i h_i(x_j) y_j p_i = min_q max_p Σ_i p_i Σ_j h_i(x_j) y_j q_j.  (226)

Let's denote g(j) = Σ_i h_i(x_j) y_j p_i and f(i) = Σ_j h_i(x_j) y_j q_j. Note that
f : I → R and g : J → R, where I = {1, ..., n} and J = {1, ..., m}. This results in

max_p min_q Σ_j q_j g(j) = min_q max_p Σ_i p_i f(i),                     (227)

and applying the result from exercise 11.1 b (and the easily seen similar result
for inf), this turns into

max_p min_{j∈J} g(j) = min_q max_{i∈I} f(i),                             (228)

and again writing out g and f,

max_p min_{j∈J} Σ_i h_i(x_j) y_j p_i = min_q max_{i∈I} Σ_j h_i(x_j) y_j q_j.   (229)

Reformatting, we get the wanted result of page 322 (remember that the set of
hypotheses H was assumed finite),

min_q max_{i∈I} Σ_j q_j y_j h_i(x_j) = max_p min_{j∈J} y_j Σ_i p_i h_i(x_j).   (230)


b) Interpret player A as the row player. The row player wishes to select a row of M
such that his gains are maximized regardless of which column B (the column player)
selects. Note that due to exercise 11.1, for the inner player it suffices to
consider pure strategies (the optimal distribution can put all its weight on a
single choice).
From A's point of view, the task is

max    ρ                          (== min −ρ)
s.t.   Σ_{i=1}^n p_i M_ij ≥ ρ,    j ∈ 1, ..., m
       Σ_i p_i = 1
       p_i ≥ 0.

Writing out the Lagrangian for the respective minimization problem gives

L(ρ, p, q, γ, β) = −ρ + Σ_{j=1}^m q_j (ρ − Σ_{i=1}^n p_i M_ij)
                   − Σ_{i=1}^n γ_i p_i + β(1 − Σ_{i=1}^n p_i).           (231)

Differentiating and setting to zero gives

∂L/∂ρ = −1 + Σ_j q_j = 0   ⟹   Σ_j q_j = 1                               (232)
∂L/∂p_i = −Σ_j q_j M_ij − γ_i − β = 0   ⟹   Σ_j q_j M_ij = −γ_i − β ≤ −β   (233)-(235)

(since γ_i ≥ 0 for all i, from the KKT conditions). Inserting these back into L
leaves us with just

G(β) = β.                                                                (236)

Using G and the derived constraints, the dual maximization problem becomes

max    β
s.t.   Σ_{j=1}^m q_j M_ij ≤ −β,   i ∈ 1, ..., n
       Σ_j q_j = 1
       q_j ≥ 0.

By changing sign (renaming μ = −β), the dual problem can be rewritten as

min    μ
s.t.   Σ_{j=1}^m q_j M_ij ≤ μ,    i ∈ 1, ..., n
       Σ_j q_j = 1
       q_j ≥ 0.

Consider the optimal values ρ* and μ*. These are

ρ* = max_p min_j Σ_i p_i M_ij                                            (237)
   = max_p min_q Σ_i Σ_j p_i M_ij q_j                                    (238)
   = max_p min_q V(p, q),                                                (239)
μ* = min_q max_i Σ_j M_ij q_j = ... = min_q max_p V(p, q),               (240)

using exercise 11.1 b again in (238) and (240). Due to the optimization problem
being convex (linear), strong duality holds and the duality gap is zero. Hence
ρ* = μ* and Von Neumann's theorem holds.
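
Von Neumann's theorem can also be checked on a small example by solving both
linear programs. A sketch in R, assuming the lpSolve package is available (the
payoff matrix is taken to have positive entries so that the value variable can stay
non-negative, which lpSolve assumes by default):

library(lpSolve)
set.seed(10)
M <- matrix(sample(1:9, 12, replace = TRUE), nrow = 3, ncol = 4)   # 3 x 4 payoff matrix

# row player: max v s.t. sum_i p_i M_ij >= v for all j, sum_i p_i = 1, p >= 0
# variables: (p_1, p_2, p_3, v)
row_lp <- lp("max", c(0, 0, 0, 1),
             rbind(cbind(t(M), -1), c(1, 1, 1, 0)),
             c(rep(">=", ncol(M)), "="),
             c(rep(0, ncol(M)), 1))

# column player: min u s.t. sum_j M_ij q_j <= u for all i, sum_j q_j = 1, q >= 0
# variables: (q_1, ..., q_4, u)
col_lp <- lp("min", c(rep(0, 4), 1),
             rbind(cbind(M, -1), c(rep(1, 4), 0)),
             c(rep("<=", nrow(M)), "="),
             c(rep(0, nrow(M)), 1))

cat("max-min value:", row_lp$objval, " min-max value:", col_lp$objval, "\n")  # equal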



Exercise 11.3
We need to show that the optimization problem 5.16 of the lecture notes is the dual
of problem 5.15. Let's start with 5.15. Letting ρ ∈ R, w ∈ R^n, ξ ∈ R^m, the
problem was

max    ρ − C Σ_{i=1}^m ξ_i
s.t.   y_i (Σ_{j=1}^n w_j h_j(x_i)) ≥ ρ − ξ_i,   i ∈ 1, ..., m
       ξ_i ≥ 0,                                   i ∈ 1, ..., m
       w_j ≥ 0,                                   j ∈ 1, ..., n
       Σ_{j=1}^n w_j = 1,

which equals

min    −ρ + C Σ_{i=1}^m ξ_i
s.t.   ρ − ξ_i − y_i (Σ_{j=1}^n w_j h_j(x_i)) ≤ 0,   i ∈ 1, ..., m
       −ξ_i ≤ 0,                                      i ∈ 1, ..., m
       −w_j ≤ 0,                                      j ∈ 1, ..., n
       1 − Σ_{j=1}^n w_j = 0.
The Lagrangian is

L(ρ, w, ξ, α, β, η, δ) = −ρ + C Σ_{i=1}^m ξ_i − Σ_{i=1}^m β_i ξ_i
    − Σ_{i=1}^m α_i (y_i Σ_{j=1}^n w_j h_j(x_i) + ξ_i − ρ)
    − Σ_{j=1}^n η_j w_j + δ(1 − Σ_{j=1}^n w_j).                          (241)-(243)

Differentiating L w.r.t. the primal variables and setting to zero yields

∂L/∂ρ = −1 + Σ_{i=1}^m α_i = 0   ⟹   Σ_i α_i = 1                         (244)
∂L/∂w_j = −Σ_i α_i y_i h_j(x_i) − η_j − δ = 0   ⟹   Σ_i α_i y_i h_j(x_i) = −η_j − δ   (245)
∂L/∂ξ_i = C − α_i − β_i = 0   ⟹   α_i = C − β_i ≤ C.                     (246)

As usual, insert these into L to get the dual. The dual simplifies into just

G(δ) = δ.                                                                (247)

Hence, using G and the constraints we got from the differentiation, the dual
optimization problem can be written as

max    δ
s.t.   Σ_{i=1}^m α_i y_i h_j(x_i) ≤ −δ,   j ∈ 1, ..., n
       α_i ≥ 0,                            i ∈ 1, ..., m
       Σ_i α_i = 1
       α_i ≤ C,                            i ∈ 1, ..., m.

The first constraint straightforwardly follows from (245) (since η_j ≥ 0) and the
last one from (246) (since β_i ≥ 0). The constraints α_i ≥ 0 follow from the α_i
being Lagrange multipliers related to inequality constraints in the primal problem.
Finally, by changing sign (writing γ = −δ), the dual optimization can be rewritten
as

min    γ
s.t.   Σ_{i=1}^m α_i y_i h_j(x_i) ≤ γ,    j ∈ 1, ..., n
       0 ≤ α_i ≤ C,                        i ∈ 1, ..., m
       Σ_i α_i = 1,

the optimization problem 5.16 of the lecture notes.





Exercise 11.4
Consider the optimization problem 5.16 of the lecture notes. This time, let
C = 1/(νm) for some 0 < ν ≤ 1. Formulated as a minimization problem (see the
previous exercise), the problem was

min    γ
s.t.   Σ_{i=1}^m α_i y_i h_j(x_i) ≤ γ,    j ∈ 1, ..., n
       0 ≤ α_i ≤ C,                        i ∈ 1, ..., m
       Σ_i α_i = 1.

Let β_i be the Lagrange multiplier related to the bound α_i ≤ C; by duality it
plays the role of the primal slack variable ξ_i. The fourth KKT condition then
states that for the optimal solution

β_i (α_i − C) = 0.                                                       (248)

Similarly to theorem 4.21 of the lecture notes, we can use this condition and the
constraints of the optimization problem to reason about the solution.
Suppose first that ξ_i > 0. Then α_i − C = 0 and α_i = C = 1/(νm). Since
Σ_i α_i = 1, this case can happen at most νm times. Hence at most νm examples
violate the margin (the respective slack variable ξ_i in the primal is active).
We also know that α_i ≤ C = 1/(νm) and that α_i ≥ 0. Hence we need at least νm
examples with α_i > 0 to reach Σ_i α_i = 1. As m = νm + (1 − ν)m, we can have at
most (1 − ν)m examples left that are allowed to exceed the margin.

- Finis -

