a.h.thiery@nus.edu.sg
Version: 0.1
Contents

1 A simple polynomial example
  1.1 Least square estimate
  1.2 Performance vs. complexity of the model
  1.3 Estimation of the generalization performance
  1.4 $\hat{\beta}$ is a Maximum Likelihood Estimator
We fit polynomials of the form $P(x) = \sum_{k=1}^{d} \beta_k x^{k-1}$; the function create_X_matrix builds the design matrix whose $k$-th column contains the values $x^{k-1}$:

create_X_matrix = function(x_data, d){
  #design matrix: column k holds x_data**(k-1)
  X = matrix(0, nrow = length(x_data), ncol = d)
  for(k in 1:d){
    X[,k] = x_data**(k-1)
  }
  return( X )
}
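The helper compute_beta is used throughout but its definition does not appear in this excerpt; a minimal sketch, assuming the columns of the design matrix X are linearly independent, is to solve the normal equations:

```r
#least square estimate: solves (X^T X) beta = X^T y  (normal equations)
compute_beta = function(y_data, X){
  solve( t(X) %*% X, t(X) %*% y_data )
}
```

For larger or ill-conditioned problems, lm() or qr.solve() are numerically safer alternatives to forming the normal equations explicitly.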
#create a polynomial
poly_deg = 2
poly_coef = c(0,0,1)
#P(x) = x**2
[Figure: plot of the simulated data.]
1.1 Least square estimate
[Figure: least square polynomial fits to the data for Degree = 1, Degree = 2, and Degree = 11.]
1.2 Performance vs. complexity of the model
Let us look at the performance of the least square estimate for different values of d. One needs a way of measuring performance, and a common approach in such situations is to define
\[ \text{(performance)} = \sum_{i=1}^{n} \text{Loss}(y_i, \hat{y}_i) \]
where the loss function $\text{Loss}(\cdot, \cdot)$ measures how well the prediction $\hat{y}_i$ approximates the true value $y_i$. It is standard practice in this case, mainly because this leads to tractable computations, to use the squared error loss function $\text{Loss}(y, \hat{y}) = (y - \hat{y})^2$. The resulting measure of performance is called the Residual Sum of Squares,
\[ \text{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2. \]
We will now simply compute the RSS for different values of d; indeed, it is completely equivalent to look at the Mean Squared Error MSE = (1/n) RSS.
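As a quick sanity check of the relation MSE = (1/n) RSS, with toy values for the observations $y_i$ and the predictions $\hat{y}_i$:

```r
y     = c(1.0, 2.0, 3.0, 4.0)   #observed values
y_hat = c(1.1, 1.9, 3.2, 3.8)   #predictions
RSS = sum( (y - y_hat)**2 )     #residual sum of squares
MSE = mean( (y - y_hat)**2 )    #MSE = RSS / n
```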
#MSE as a function of the degree
deg_max = 10
mse_list = rep(0, deg_max)
for(d in 1:deg_max){
  XX = create_X_matrix(x_data, d)
  beta = compute_beta(y_data, XX)
  y_fit = XX %*% beta          #fitted values on the training data
  mse_list[d] = mean( (y_data - y_fit)**2 )
}

#display the results
plot(mse_list, col="red", type="o", pch=20,
     main = "Mean Squared Error v.s. Degree",
     xlab = "degree", ylab = "MSE")
[Figure: Mean Squared Error vs. Degree; MSE on the vertical axis, degree on the horizontal axis.]
The higher the degree d, the lower the MSE [Exercise]: this is not helpful at all if one wants to find a suitable value for d. In most situations of interest, we are trying to make predictions on data that have not been used to train the model. In the situation above, the coefficient $\hat{\beta}$ has been determined using the whole dataset $\{y_i\}_{i=1}^{n}$, and the MSE has been estimated on the same dataset!
1.3 Estimation of the generalization performance
#generalization estimation
n_bootstrap = 100
deg_max = 6
mse_list = rep(0, deg_max*n_bootstrap)
deg_list = rep(0, deg_max*n_bootstrap)
for(d in 1:deg_max){
  for(k in 1:n_bootstrap){
    #split the data: half for training, the other half for testing
    sampled_index = sample(1:n_data, round(length(x_data)/2),
                           replace=FALSE)
    XX_train = create_X_matrix(x_data[sampled_index], d)
    XX_test  = create_X_matrix(x_data[-sampled_index], d)
    yy_train = y_data[sampled_index]
    yy_test  = y_data[-sampled_index]
    #fit on the training set, evaluate on the test set
    beta = compute_beta(yy_train, XX_train)
    yy_fit = XX_test %*% beta
    mse = mean( (yy_test - yy_fit)**2 )
    mse_list[(d-1)*n_bootstrap + k] = mse
    deg_list[(d-1)*n_bootstrap + k] = d
  }
}
Let us now plot the estimate of the MSE as a function of d.
validation = data.frame(mse = mse_list, deg = deg_list)
boxplot(mse ~ deg, data = validation,
log = "y", col = "bisque",
main="Generalization",
xlab="degree",
ylab="mean MSE")
[Figure: "Generalization" boxplots of the test MSE (log scale) as a function of the degree.]
It is now clear that choosing too high a degree leads to suboptimal performance.
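The boxplots can be summarized into a single choice of degree by averaging the test MSE per degree and taking the minimizer; a sketch with toy stand-ins for the mse_list and deg_list vectors computed above:

```r
#toy stand-ins for the validation results (deg_list, mse_list)
deg_list = rep(1:3, each = 2)
mse_list = c(1.0, 1.2, 0.4, 0.6, 0.9, 1.1)
avg_mse  = tapply(mse_list, deg_list, mean)       #average test MSE per degree
best_deg = as.integer(names(which.min(avg_mse)))  #degree with the lowest average
```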
1.4 $\hat{\beta}$ is a Maximum Likelihood Estimator
Recall that we postulated that the data were generated through the model $Y = X\beta + \epsilon$ for some noise $\epsilon$. Under the assumption that $\epsilon$ is Gaussian, the least square estimate $\hat{\beta}$ is also the maximum likelihood estimate [Exercise].
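To see why, note that under i.i.d. Gaussian noise $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ the log-likelihood of $\beta$ is, up to terms that do not depend on $\beta$, a negative multiple of the residual sum of squares:

```latex
\log L(\beta) = -\frac{n}{2}\log(2\pi\sigma^2)
                - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( y_i - x_i^\top \beta \right)^2
```

Maximizing $\log L(\beta)$ over $\beta$ is therefore equivalent to minimizing $\sum_{i=1}^{n} (y_i - x_i^\top \beta)^2$, whose minimizer is precisely the least square estimate $\hat{\beta}$.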