
Linear vs k-NN classification

Simulation study*

Arce Plata, M. I., Corrales Álvarez, J. D., García Baccino, C., Kohler, P.
and Herrera Rozo, C. A. Coordinated by Munilla, S.
February 2013

* This study was undertaken as a group exercise during a linear models course held last semester
at the Facultad de Agronomía, Universidad de Buenos Aires, Argentina.
Corresponding e-mail: munilla@agro.uba.ar

INTRODUCTION
In Chapter 2 of their book The Elements of Statistical Learning (2011), Hastie, Tibshirani and Friedman
describe a simulation study aimed at comparing two prediction methods in a classification context. The
methods are the linear model fit by least squares and the k-nearest neighbor (k-NN) prediction
rule. The training data consisted of a two-class output variable, with each sample point simulated on
a pair of input variables. The data generation process they used produced a mixture of
Gaussian clusters for each class. In this scenario, as the authors show by classifying new, independent
data generated from the model, k-NN outperforms the linear model fit. They argue that this would
not be the case if the data in each class were simulated from independent Gaussian distributions. Our
objective was to undertake such a stochastic simulation study.

METHODS
Simulation scenario outline
Each simulated data point pertained to one of two classes, labeled blue and orange. Within each class,
the two input variables were sampled from a bivariate Gaussian distribution with mean vector
μ_Blue = (1, 0)' and μ_Orange = (0, 1)',
where
(X1, X2) ~ N2(μ_Class, I2).
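
For illustration, training data of this form could be drawn in R along the following lines (a minimal sketch written for this report; object names such as n.per.class are ours, and the appended script may differ). Because the covariance matrix is the identity, the two coordinates can be sampled independently with rnorm:

# Sketch of the data generation step (illustrative, not the appended script itself).
set.seed(2013)                                  # arbitrary seed for reproducibility
n.per.class <- 100                              # training points per class
x.blue   <- cbind(rnorm(n.per.class, 1), rnorm(n.per.class, 0))   # mean (1, 0)
x.orange <- cbind(rnorm(n.per.class, 0), rnorm(n.per.class, 1))   # mean (0, 1)
train.x  <- rbind(x.blue, x.orange)             # 200 x 2 matrix of inputs
train.y  <- rep(c(0, 1), each = n.per.class)    # class coded 0 = blue, 1 = orange
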
A total of 100 points for each class was generated for training. First, a linear model was fitted
to these data, coding the class color as a binary variable (0 for blue and 1 for orange). Next,
the k-NN method with k = 15 was employed, using the knn function from the R package
class. A decision boundary for each method was overlaid on the scatterplot of the input space to get
an overview of the performance of both procedures. Finally, 10,000 new, independent sampling
points were simulated using the same data generation process and then classified by the decision
boundaries obtained from the training sample. Results were compared graphically, in terms of a
test error defined as the rate of misclassified points. At each step, we tried to stick closely to
the simulation by Hastie et al. (2011).

Figure 1: Decision boundaries and classification regions for the training sample using a linear
model fit and the 15-nearest-neighbor prediction rule. For both plots, the classes were
coded as a binary variable (blue = 0, orange = 1), and the classification regions were
obtained by evaluating a 100 × 100 grid of points according to the corresponding
decision rule, aided by the R function knn in the case of the nearest-neighbor method.

R-project software
All steps necessary to undertake the simulation study were coded and executed using the R software
(R Core Team, 2012); the script is appended. The program was written so that the user may change
several of the parameters to test different scenarios. To keep the report self-contained, the source
code is extensively commented.
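
As a rough sketch of the core steps such a script performs (reusing the objects train.x and train.y from the sketch above; the actual appended code may be organized differently), the two classifiers and the classification regions of Figure 1 could be obtained as follows:

# Sketch of the two classifiers and the grid behind the classification regions.
library(class)                                   # provides the knn function

train <- data.frame(x1 = train.x[, 1], x2 = train.x[, 2], y = train.y)

# Linear model fit by least squares; classify as orange when the fit exceeds 0.5.
fit.lm <- lm(y ~ x1 + x2, data = train)

# 100 x 100 grid spanning the input space, evaluated by both decision rules.
grid <- expand.grid(x1 = seq(min(train$x1), max(train$x1), length.out = 100),
                    x2 = seq(min(train$x2), max(train$x2), length.out = 100))
pred.lm  <- as.numeric(predict(fit.lm, newdata = grid) > 0.5)
pred.knn <- knn(train = train[, c("x1", "x2")], test = grid,
                cl = factor(train$y), k = 15)

# Training points plotted over the 15-NN classification regions.
plot(train$x1, train$x2, col = ifelse(train$y == 1, "orange", "blue"),
     xlab = "x1", ylab = "x2")
points(grid$x1, grid$x2, pch = ".", col = ifelse(pred.knn == "1", "orange", "blue"))
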

RESULTS
Figure 1 shows the decision boundaries and the classification regions for both the linear model fit
and the 15-NN classification rule. The linear model misclassified 43 out of 200 data points (i.e.,
21.5%), whereas the 15-NN misclassified 45 out of 200 data points (i.e., 22.5%).
Figure 2 shows the results of classifying 10,000 new input points generated by the same data
generation process. Notice that as the number of neighbors used to assess the majority vote
approaches one, the training misclassification rate approaches zero, whereas the test error reaches
its maximum. On the other hand, when the number of neighbors is greater than about 40, the
misclassification rates for the linear model fit and for the k-NN prediction rule are similar. In fact,
the decision boundaries for k > 40 are much less wiggly and approximate a straight line. This can be
verified by changing the kk parameter in the appended R script.
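
The computation behind the curves in Figure 2 could be sketched along these lines (again reusing the objects from the earlier sketches; the grid of k values shown here only illustrates roughly log-spaced choices, and the values actually used in the appended script may differ):

# Sketch of the misclassification curves: training and test error across values of k.
library(class)

# Test set: 10,000 points (5,000 per class) from the same data generation process.
n.test <- 5000
test.x <- rbind(cbind(rnorm(n.test, 1), rnorm(n.test, 0)),    # blue
                cbind(rnorm(n.test, 0), rnorm(n.test, 1)))    # orange
test.y <- rep(c(0, 1), each = n.test)

kk  <- c(1, 3, 5, 9, 15, 25, 45, 83, 151)        # roughly log-spaced values of k
err <- t(sapply(kk, function(k) {
  pred.train <- knn(train.x, train.x, cl = factor(train.y), k = k)
  pred.test  <- knn(train.x, test.x,  cl = factor(train.y), k = k)
  c(train = mean(pred.train != train.y),          # training misclassification rate
    test  = mean(pred.test  != test.y))           # test misclassification rate
}))

# Linear model misclassification rates, for the horizontal reference lines.
fit.lm   <- lm(train.y ~ train.x)
lm.train <- mean(as.numeric(fitted(fit.lm) > 0.5) != train.y)
lm.test  <- mean(as.numeric(cbind(1, test.x) %*% coef(fit.lm) > 0.5) != test.y)

matplot(kk, err, type = "b", log = "x", xlab = "k", ylab = "misclassification rate")
abline(h = c(lm.train, lm.test), lty = 2)
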

Figure 2: Misclassification curves for training (blue) and test (orange) data. Each point represents
the k-NN misclassification rate for one value of k; the values of k were chosen to be
approximately equally spaced on a logarithmic scale. The blue and orange squares correspond
to the linear regression misclassification rate for training and test data, respectively. Straight
dashed lines were drawn at these values to improve the visual comparison.

DISCUSSION
In summary, we have undertaken a simulation study to test the two procedures used by Hastie
et al. (2011) as an example in their book, when the class data arise from independent Gaussian
distributions. We chose the parameters of both distributions so that the two classes overlap
substantially. In our study, the linear model fit by least squares outperformed the 15-NN method when
the classification rule was applied to data not used in training the procedures. Of course, the
value k = 15 was chosen arbitrarily to emulate the simulation by Hastie et al. (2011). However, after
scanning the entire range of possible values of k, we observed that k-NN delivered approximately
the same misclassification rate as the linear model when a large enough number of neighbors was used.

REFERENCES
Hastie, T., Tibshirani, R. & Friedman, J. 2011. The Elements of Statistical Learning: Data
Mining, Inference, and Prediction. 2nd Edition. Springer.
R Core Team. 2012. R: A language and environment for statistical computing. R Foundation for
Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org/.
