
Homework 7: Gaussian Processes & Neural Networks

Brown University CSCI1420 & ENGN2520: Machine Learning


Homework due at 11:59pm on November 12, 2015

Question 1:
The first question explores binary Gaussian process classification on a dataset of images
of pedestrians, and background regions from similar environments. The data, collected by
Daimler-Chrysler, has been preprocessed into vector form. We will compare two sets of
features: raw intensity values (from cropped image windows), and so-called histogram of
oriented gradient (HOG) features, which have been hand-engineered to improve robustness.
For each feature set, we will consider classifiers based on linear and squared exponential
kernels. The linear kernel is defined as
k(x_p, x_q) = \frac{1}{\sigma^2} \left( 1 + x_p^T x_q \right)

where \sigma^2 is a variance hyperparameter. The squared exponential kernel is defined as

k(x_p, x_q) = \sigma^2 \exp\left( -\frac{(x_p - x_q)^T (x_p - x_q)}{2 \ell^2} \right)

where \sigma^2 is the variance hyperparameter, and \ell is the length-scale hyperparameter.
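
To make the two kernel definitions concrete, here is a minimal Matlab sketch that evaluates them directly on a pair of feature vectors. The variable names and values are illustrative only; the toolbox functions covLINone and covSEiso compute the same quantities internally from log-domain hyperparameters.

    % Minimal sketch: evaluate the two kernels directly on a pair of inputs.
    xp = randn(10, 1);                        % example 10-dimensional inputs
    xq = randn(10, 1);
    sigma2 = 2.0;                             % variance hyperparameter sigma^2
    ell    = 0.5;                             % length-scale hyperparameter

    k_lin = (1 + xp' * xq) / sigma2;          % linear kernel
    d2    = (xp - xq)' * (xp - xq);           % squared Euclidean distance
    k_se  = sigma2 * exp(-d2 / (2 * ell^2));  % squared exponential kernel
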
In this homework, we will use the Gaussian process Matlab toolbox (Rasmussen et al.),
which is distributed as part of the pmtk3 Matlab toolbox. The two covariances of interest
are supported by the toolbox through the functions covLINone and covSEiso. We will use
the Laplace approximation to implement binary GP classifiers, which is also supported in
the toolbox through the binaryLaplaceGP function. Laplace approximations are based on
MAP parameter estimates, but also approximate the variance around the posterior mode,
and account for this when making predictions.
We will select the hyperparameters by maximizing the marginal likelihood. A (possibly
local) minimum of the negative log marginal likelihood can be found via the minimize
function from the GP toolbox, as illustrated in the example Matlab script. For the questions
below, the specified initializations are for log-domain representations of the hyperparameters,
as assumed by the GP toolbox.
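
As a rough sketch of how these pieces fit together, the fragment below follows the classic GPML calling convention for Laplace-approximation classification. The argument order, the likelihood name, and the data variable names (Xtrain, ytrain, Xtest, ytest, with labels in {-1, +1}) are assumptions here; treat the provided example script as the authoritative reference.

    % Sketch only: optimize hyperparameters, then predict on test inputs.
    init_guess = 15;                                    % log-domain initialization
    opt_hyp = minimize(init_guess, 'binaryLaplaceGP', -40, ...
                       'covLINone', 'logistic', Xtrain, ytrain);

    % Predictive probability of class +1 for each test input, thresholded at 0.5.
    p    = binaryLaplaceGP(opt_hyp, 'covLINone', 'logistic', Xtrain, ytrain, Xtest);
    yhat = 2 * (p > 0.5) - 1;
    err  = mean(yhat ~= ytest);                         % test error rate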

a) Load the intensity dataset. Use the minimize function to estimate the linear kernel
hyperparameter \sigma^2. Initialize minimize by setting init_guess to 15. What is the optimal
hyperparameter determined by the marginal likelihood optimization? Using this hyperparameter,
train a linear kernel Gaussian process classifier on the train data. Report the
corresponding error rate on the test data.

Figure 1: Toy 2D datasets for neural network classification. (a) Toy Data A; (b) Toy Data B.

b) Repeat part (a) with init_guess equaling -5. Do you observe any differences? Explain.

c) Repeat parts (a-b) with the hog dataset.

d) Load the intensity dataset. Use the minimize function to estimate the squared exponential
kernel hyperparameters \ell and \sigma^2. Initialize minimize by setting init_guess to
[10; 3]. What are the optimal hyperparameters found by the optimization? Using these
hyperparameters, train a squared exponential kernel Gaussian process classifier on the
train data. Report the corresponding error rate on the test data.

e) Repeat part (d) with init_guess equaling [1; 3]. Do you observe any differences? Explain.

f ) Repeat parts (d-e) with the hog dataset.

g) For this data, are the results sensitive to the choice of kernel hyperparameters? Which has
a bigger impact on performance, the chosen kernel function or the chosen input feature
representation?

Question 2:
This problem focuses on the two-dimensional datasets illustrated in Figure 1, each with a few
hundred examples. Let x_n = [x_{n1} \; x_{n2}]^T denote a training input vector, and t_n \in \{-1, +1\}
its binary class label. Assume classification decisions are made by thresholding the predicted
probability at 0.5, minimizing the probability of error.

A two-layer neural network classifier combines multiple layers of simple classifiers to produce
a more complex, combined decision rule. The model begins by independently processing an
input vector x_n with J different logistic regression models, producing a vector of intermediate
responses z_n = [z_{n1} \; z_{n2} \; \ldots \; z_{nJ}]^T as follows:

z_{nj} = \sigma(w_j^T \phi_j(x_n))

where \sigma(\cdot) denotes the logistic sigmoid function.

The final, predicted class label is then determined by applying another logistic regression
model to these intermediate responses:
p(y_n = 1 \mid z_n, w_0) = \sigma(w_0^T \phi_0(z_n)).
The weights w = {w0 , w1 , . . . , wJ } are all parameters of the neural network model, which
can be learned from training data.
We let the feature functions at each layer simply add a bias term to the input:
\phi_0(z_n) = [1 \; z_{n1} \; \ldots \; z_{nJ}]^T, \qquad \phi_j(x_n) = [1 \; x_{n1} \; x_{n2}]^T.
Thus, w0 has length J + 1, while each of w1 , w2 , . . . wJ has length 2 + 1 = 3.
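
As a concrete illustration of this forward pass, here is a minimal Matlab sketch for a single input. The weight layout (a 3-by-J matrix W holding the w_j as columns, plus a length J + 1 vector w0) is an assumption for illustration and need not match the layout used by the provided trainMLP code.

    % Two-layer forward pass for a single 2D input x_n (illustrative layout).
    sigmoid = @(a) 1 ./ (1 + exp(-a));        % logistic function

    xn = [0.3; -0.7];                         % a 2D input
    J  = 3;                                   % number of hidden units
    W  = randn(3, J);                         % columns are w_1, ..., w_J
    w0 = randn(J + 1, 1);                     % output-layer weights

    phi_x = [1; xn];                          % phi_j(x_n) = [1 x_n1 x_n2]'
    zn    = sigmoid(W' * phi_x);              % intermediate responses z_n (J-by-1)
    phi_z = [1; zn];                          % phi_0(z_n) = [1 z_n1 ... z_nJ]'
    p1    = sigmoid(w0' * phi_z);             % p(y_n = 1 | z_n, w_0)
    yhat  = 2 * (p1 > 0.5) - 1;               % threshold at 0.5 -> label in {-1, +1}
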
a) Consider a standard, single-layer logistic regression model with features \phi(x_n) = [1 \; x_{n1} \; x_{n2}]^T.
Could this model perfectly classify the training data in Figure 1(a)? Justify your answer.
b) Consider a two-layer neural network model of the data in Figure 1(a). What is the
smallest number of input neurons J for which the neural network can perfectly classify
this data? Clearly explain your reasoning. Numeric specification of the weight vectors is
not required.
We now use gradient descent algorithms to fit two-layer neural networks to the data in
Figure 1. Specifically, we train the weight vector to minimize the following negative-log-loss
objective function, with L2 regularization penalty \lambda:

J(w) = -\sum_{n=1}^{N} \log p(y_n \mid x_n, w) + \lambda \sum_{j=0}^{J} \|w_j\|_2^2.

Follow the provided demonstration code, where the key training routines (including random
restarts) are implemented in trainMLP. This relies on PMTK3's minFunc for optimization
and plotDecisionBoundary for viewing decision boundaries. For all questions, set the
regularization parameter \lambda = 10^{-8}.
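
For intuition about what is being minimized, the sketch below evaluates J(w) directly for a given set of weights, reusing the layout from the forward-pass sketch above. Here X (N-by-2 inputs), y (labels in {-1, +1}), W, and w0 are assumed to already exist; the provided trainMLP/minFunc code handles this computation (and its gradients) for you.

    % Sketch: J(w) = -sum_n log p(y_n | x_n, w) + lambda * sum_j ||w_j||_2^2.
    sigmoid = @(a) 1 ./ (1 + exp(-a));
    lambda  = 1e-8;

    nll = 0;
    for n = 1:size(X, 1)
        phi_x = [1; X(n, :)'];
        zn    = sigmoid(W' * phi_x);
        p1    = sigmoid(w0' * [1; zn]);              % p(y_n = +1 | x_n, w)
        if y(n) == 1
            nll = nll - log(p1);
        else
            nll = nll - log(1 - p1);
        end
    end
    penalty = lambda * (norm(w0)^2 + sum(sum(W.^2)));  % L2 penalty over all w_j
    Jw = nll + penalty;
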
c) Fit a neural network with J = 1 hidden units to ToyDataA.mat. Run the optimization
from 10 random initial guesses, as outlined in the template code. Report the accuracy (on
training data) and the objective function (on training data) for the estimated parameters
w that yield the best objective function value across all initializations. Are the resulting
decision boundaries always the same across initializations? Why or why not?
d) Now fit a neural network with J = 2 hidden units to ToyDataA.mat. Run the optimization
from 10 random initial guesses, as outlined in the template code. Report the accuracy (on
training data) and the objective function (on training data) for the estimated parameters
w that yield the best and worst objective function value across all initializations. Plot the
decision boundaries of the best and worst estimated parameters w.
e) Fit neural networks with J = 2, 3, 4, 5, 10, 20 hidden units to ToyDataB.mat. Keep only
the best estimate w of 10 random guesses at each value of J, as outlined in the template
code. Create plots of the resulting decision boundaries for each value of J. How many
hidden units are needed to perfectly classify this training data?

Question 3:
In this question, we examine the properties of max-margin separating hyperplanes on toy,
two-dimensional data. First consider a dataset with N = 4 points. There are 2 examples
of class +1, located at x = (-1, -1) and x = (+1, +1). There are 2 examples of class -1,
located at x = (-1, +1) and x = (+1, -1).

a) Consider the feature mapping \phi(x) = [x_1, x_1 x_2]^T. Plot the four input points in this space,
and the maximum margin separating hyperplane. What max-margin weight vector w would
produce predictions w^T \phi(x) that, when thresholded, perfectly classify the training data?
What is the corresponding margin?
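
To sanity-check a hand-drawn plot, a short Matlab sketch like the one below maps the four points through \phi and plots them; the variable names are illustrative, and it does not compute the max-margin solution for you.

    % Sketch: plot the training points in the feature space phi(x) = [x1, x1*x2]'.
    X = [-1 -1; +1 +1; -1 +1; +1 -1];       % inputs (one per row)
    y = [+1; +1; -1; -1];                    % class labels
    Phi = [X(:, 1), X(:, 1) .* X(:, 2)];     % feature-space coordinates

    figure; hold on;
    plot(Phi(y == +1, 1), Phi(y == +1, 2), 'b+', 'MarkerSize', 10);
    plot(Phi(y == -1, 1), Phi(y == -1, 2), 'ro', 'MarkerSize', 10);
    xlabel('x_1'); ylabel('x_1 x_2'); title('Points in feature space');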

b) Plot the max-margin separator from part (a) as a curve in the original input space.

Now consider a different dataset with N = 6 points. There are 3 examples of class +1,
located at x = (1, 1), x = (2, 2), and x = (2, 0). There are 3 examples of class -1, located
at x = (0, 0), x = (1, 0), and x = (0, 1).

c) Consider the feature mapping \phi(x) = [1, x_1, x_2]^T, which adds a bias feature to the raw
inputs. Plot the data and a maximum margin separating hyperplane in the original input
space. What max-margin weight vector w would produce predictions w^T \phi(x) that, when
thresholded, perfectly classify the training data? What is the corresponding margin?

d) Which data points are support vectors? If you remove one of these support vectors, does
the size of the optimal margin decrease, stay the same, or increase? Justify your answer.
