Homework 4

Problem 1
Recall the body dataset from problem 4 of Homework 3. In that problem we used PCR and PLSR to predict
someone's weight. Here we will re-visit this objective, using bagging and random forests. Start by setting aside
200 observations from your dataset to act as a test set, using the remaining 307 as a training set. Ideally, you
would be able to use your code from Homework 3 to select the same test set as you did on that problem.

load("/Users/alexnutkiewicz/Downloads/body.rdata")

# separate the data into training and test sets
set.seed(36)
testing = sample(seq_len(nrow(X)), size = 200)
OJtest = X[testing,]
OJtrain = X[-testing,]

Using the ranger package in CRAN, use Bagging and Random Forests to predict the weights in the test set, so
that you have two sets of predictions. Then answer the following questions:

library(ranger)
predictData = data.frame(Weight = Y$Weight[-testing], OJtrain)

# random forest: only sqrt(p) of the predictors are considered at each split
rf.Weight = ranger(Weight ~ ., data = predictData, mtry = sqrt(ncol(X)), importance = "impurity")

# bagging considers all p predictors when splitting, i.e. mtry = p
bag.Weight = ranger(Weight ~ ., data = predictData, mtry = ncol(X), importance = "impurity")

The mean squared error and % variance explained reported for these fits are based on out-of-bag estimates. Because
mtry = sqrt(p) (sqrt(21) ≈ 4.6 here), only a handful of the 21 predictors are randomly considered at each split in the
random forest, whereas bagging uses all p = 21.
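As a quick check (a sketch using the two fits above), these out-of-bag quantities can be read directly off the fitted ranger objects:

# OOB mean squared error and % variance explained for each fit
rf.Weight$prediction.error    # OOB MSE, random forest
rf.Weight$r.squared           # OOB R^2, random forest
bag.Weight$prediction.error   # OOB MSE, bagging
bag.Weight$r.squared          # OOB R^2, bagging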

a. Produce a plot of test MSE (as in Figure 8.8 in the text) as a function of number of trees for Bagging and
Random Forests. You should produce one plot with two curves: one corresponding to Bagging and the
other to Random Forests.


rfMSE = rep(0, 300)
bagMSE = rep(0, 300)
for (i in 1:300) {
  rf.Weight = ranger(Weight ~ ., data = predictData, mtry = sqrt(ncol(X)), num.trees = i, importance = "impurity")
  rfPreds = predict(rf.Weight, data = OJtest)$predictions
  bag.Weight = ranger(Weight ~ ., data = predictData, mtry = ncol(X), num.trees = i, importance = "impurity")
  bagPreds = predict(bag.Weight, data = OJtest)$predictions
  rfMSE[i] = mean((rfPreds - Y$Weight[testing])^2)
  bagMSE[i] = mean((bagPreds - Y$Weight[testing])^2)
}
allData = data.frame(num = 1:300, rfMSE, bagMSE)

library(ggplot2)
library(reshape2)
# melt to long format; id.vars = the variable to keep fixed
iceCream = melt(allData, id.vars = "num")
ggplot(iceCream, aes(x = num, y = value, col = variable)) + geom_line() +
  labs(title = "Test MSE of Random Forest and Bagging", x = "Number of Trees", y = "Test MSE")

b. Which variables does your random forest identify as most important? How do they compare with the most
important variables as identified by Bagging?


set.seed(36)
rf.Weight = ranger(Weight ~ ., data = predictData, mtry = sqrt(ncol(X)), importance = "impurity")
ranger::importance(rf.Weight)

## Wrist.Diam Wrist.Girth Forearm.Girth
## 830.4598 2345.3725 6386.7745
## Elbow.Diam Bicep.Girth Shoulder.Girth
## 2169.4676 4098.4510 5176.0857
## Biacromial.Diam Chest.Depth Chest.Diam
## 826.5204 1406.5377 2656.3576
## Chest.Girth Navel.Girth Waist.Girth
## 7625.7852 825.9595 7453.7234
## Pelvic.Breadth Bitrochanteric.Diam Hip.Girth
## 301.2205 424.3708 2085.0039
## Thigh.Girth Knee.Diam Knee.Girth
## 1005.8425 934.6713 1831.6493
## Calf.Girth Ankle.Diam Ankle.Girth
## 1256.9831 267.7527 1120.4317

bag.Weight = ranger(Weight ~ ., data = predictData, mtry = ncol(X), importance = "impurity")
ranger::importance(bag.Weight)

## Wrist.Diam Wrist.Girth Forearm.Girth
## 301.34315 379.00006 7021.25154
## Elbow.Diam Bicep.Girth Shoulder.Girth
## 226.32558 3033.17893 3235.13185
## Biacromial.Diam Chest.Depth Chest.Diam
## 170.78835 297.45789 576.25497
## Chest.Girth Navel.Girth Waist.Girth
## 17175.25933 276.15938 10138.24990
## Pelvic.Breadth Bitrochanteric.Diam Hip.Girth
## 132.30439 188.58822 3237.71442
## Thigh.Girth Knee.Diam Knee.Girth
## 483.41152 526.93071 1784.72922
## Calf.Girth Ankle.Diam Ankle.Girth
## 900.09897 84.94235 429.03810

Based on the importance values above, the random forest identifies Chest.Girth, Waist.Girth, and Forearm.Girth as
the most important variables. Bagging ranks the same three variables highest (Chest.Girth, Waist.Girth,
Forearm.Girth), so the two methods identify essentially the same variables as most important.
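A quick way to see this ranking directly (a sketch using the two fits above):

# top three predictors by impurity importance for each fit
sort(ranger::importance(rf.Weight), decreasing = TRUE)[1:3]
sort(ranger::importance(bag.Weight), decreasing = TRUE)[1:3]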

c. Compare the test error of your random forest (with 500 trees) against the test errors of the three methods
you evaluated in Homework 3. Does your random forest make better predictions than your predictions from
Homework 3?


set.seed(36)
rf.500 = predict(rf.Weight, data = OJtest)$predictions
rf.MSE500 = mean(((rf.500 - Y$Weight[testing])^2))
rf.MSE500

## [1] 9.160329

The test MSE values of our PCR, PLSR, and lasso predictions in Homework 3 were 8.562, 7.952, and 8.141,
respectively. Compared with those, the random forest (test MSE ≈ 9.16) does slightly worse, suggesting that the
linear methods from Homework 3 capture the relationship between the body measurements and weight at least as
well as the tree-based approach does.
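A side-by-side view of these numbers (a sketch; the Homework 3 MSEs are the values quoted above):

# test MSEs: Homework 3 methods vs. the 500-tree random forest
c(PCR = 8.562, PLSR = 7.952, lasso = 8.141, randomForest = rf.MSE500)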

d. The ranger() function uses 500 as the default number of trees. For this problem, is 500 enough trees? How
can you tell?

set.seed(36)
rf.2000 = ranger(Weight ~ ., data = predictData, num.trees = 2000, importance = "impurity")
preds2000 = predict(rf.2000, data = OJtest)$predictions
rf.MSE2000 = mean((preds2000 - Y$Weight[testing])^2)
rf.MSE2000

## [1] 9.295284

After refitting the model with 2000 trees, the test MSE is essentially unchanged (in fact very slightly higher), so the
error has already stabilized: 500 trees is enough, and adding more trees does not improve the predictions.
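The test MSE curve from part (a) tells the same story; a sketch using the rfMSE vector computed there:

# the test MSE has flattened out well before 300 trees
mean(rfMSE[250:300])
rf.MSE500
rf.MSE2000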

Problem 2
Here we explore the maximal margin classifier on a toy data set.

a. We are given n = 7 observations in p = 2 dimensions. For each observation, there is an associated class
label. Sketch the observations.

X1 = c(5.29, 3.30, 7.30, 1.28, 3.32, 7.30, 7.29)
X2 = c(7.30, 3.29, 7.29, 7.30, 1.29, 5.31, 1.30)
Y = c("green", "green", "green", "green", "red", "red", "red")
plot(X1, X2, col = Y, xlim = c(0, 8), ylim = c(0, 8))


b. Sketch the optimal separating hyperplane, and provide the equation for this hyperplane (of the form β0 +
β1X1 + β2X2 = 0).

plot(X1, X2, col = Y, xlim = c(0, 8), ylim = c(0, 8))
abline(a = -1, b = 1)


Based on the plot, the optimal separating hyperplane passes midway between the red point (3.32, 1.29) and the
green point (3.30, 3.29). Its equation is -1 + X1 - X2 = 0, i.e. the line X2 = X1 - 1.

c. Describe the classification rule for the maximal margin classifier. It should be something along the lines of
"Classify to Red if β0 + β1X1 + β2X2 > 0, and classify to Green otherwise." Provide the values for β0, β1,
and β2.

Classify to green if β0 + β1X1 + β2X2 < 0 and classify to red if β0 + β1X1 + β2X2 > 0, where β0 = -1, β1 = 1, and β2 = -1.
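A quick check that this rule separates the training points (a sketch using the X1, X2, Y vectors defined above):

# apply the decision rule -1 + X1 - X2: positive -> red, negative -> green
predicted = ifelse(-1 + X1 - X2 > 0, "red", "green")
table(predicted, Y)   # all seven points are classified correctly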

d. On your sketch, indicate the margin for the maximal margin hyperplane. How wide is the margin?

plot(X1, X2, col = Y, xlim = c(0, 8), ylim = c(0, 8))
abline(a = -1, b = 1)              # maximal margin hyperplane
abline(a = -2, b = 1, lty = 2)     # margin boundary through the red support vectors
abline(a = 0, b = 1, lty = 2)      # margin boundary through the green support vectors


The margin boundaries (dashed lines) sit a vertical distance of 1 above and below the hyperplane; measured
perpendicular to the hyperplane, the margin is 1/sqrt(2) ≈ 0.71 wide on each side, so the full margin between the
two dashed lines is sqrt(2) ≈ 1.41 units wide.
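This width can be computed directly as the distance between the two parallel margin boundaries X2 = X1 and X2 = X1 - 2 (a sketch):

# distance between the parallel lines x1 - x2 = 0 and x1 - x2 = 2
abs(2 - 0) / sqrt(1^2 + (-1)^2)   # ~1.41 in total, i.e. ~0.71 on each side of the hyperplane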

e. Indicate the support vectors for the maximal margin classifier.

The support vectors are the observations that lie on the margin boundaries: the green points (3.30, 3.29) and (7.30, 7.29) on the upper dashed line, and the red points (3.32, 1.29) and (7.30, 5.31) on the lower dashed line.
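These can also be picked out programmatically by checking which points fall (approximately) on the two margin boundaries; a sketch, with an arbitrary 0.05 tolerance:

# points on X2 = X1 (green boundary) and on X2 = X1 - 2 (red boundary)
which(abs(X2 - X1) < 0.05)         # observations 2 and 3
which(abs(X2 - (X1 - 2)) < 0.05)   # observations 5 and 6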

f. Argue that a slight movement of the seventh observation would not affect the maximal margin hyperplane.

Slightly moving the seventh observation (7.29, 1.30) would not affect the maximal margin hyperplane, because the
hyperplane depends only on the support vectors. Since this observation lies far from the hyperplane and is not a
support vector, a small movement of it has no impact.

g. Sketch a hyperplane that is not the optimal separating hyperplane, and provide the equation for this
hyperplane.

plot(X1, X2, col = Y, xlim = c(0, 8), ylim = c(0, 8))
abline(a = -0.5, b = 1)   # a separating hyperplane, but not the maximal margin one


This new hyperplane has the equation -0.5 + X1 - X2 = 0. It still separates the two classes, but it is not the maximal
margin hyperplane, since it sits closer to the green support vectors than to the red ones.

h. Draw an additional observation on the plot so that the two classes are no longer separable by a
hyperplane.

newX1 = c(5.29, 3.30, 7.30, 1.28, 3.32, 7.30, 7.29, 6.02)
newX2 = c(7.30, 3.29, 7.29, 7.30, 1.29, 5.31, 1.30, 2.19)
newY = c("green", "green", "green", "green", "red", "red", "red", "green")
plot(newX1, newX2, col = newY, xlim = c(0, 8), ylim = c(0, 8))


As the plot shows, the new green eighth observation at (6.02, 2.19) falls well inside the red region, so the two
classes can no longer be separated by a hyperplane.

Problem 3
This problem involves the OJ data set which is part of the ISLR package.

a. Create a training set containing a random sample of 535 observations, and a test set containing the
remaining observations.

library(ISLR)
summary(OJ)


## Purchase WeekofPurchase StoreID PriceCH PriceMM
## CH:653 Min. :227.0 Min. :1.00 Min. :1.690 Min. :1.690
## MM:417 1st Qu.:240.0 1st Qu.:2.00 1st Qu.:1.790 1st Qu.:1.990
## Median :257.0 Median :3.00 Median :1.860 Median :2.090
## Mean :254.4 Mean :3.96 Mean :1.867 Mean :2.085
## 3rd Qu.:268.0 3rd Qu.:7.00 3rd Qu.:1.990 3rd Qu.:2.180
## Max. :278.0 Max. :7.00 Max. :2.090 Max. :2.290
## DiscCH DiscMM SpecialCH SpecialMM
## Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.00000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.05186 Mean :0.1234 Mean :0.1477 Mean :0.1617
## 3rd Qu.:0.00000 3rd Qu.:0.2300 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :0.50000 Max. :0.8000 Max. :1.0000 Max. :1.0000
## LoyalCH SalePriceMM SalePriceCH PriceDiff
## Min. :0.000011 Min. :1.190 Min. :1.390 Min. :-0.6700
## 1st Qu.:0.325257 1st Qu.:1.690 1st Qu.:1.750 1st Qu.: 0.0000
## Median :0.600000 Median :2.090 Median :1.860 Median : 0.2300
## Mean :0.565782 Mean :1.962 Mean :1.816 Mean : 0.1465
## 3rd Qu.:0.850873 3rd Qu.:2.130 3rd Qu.:1.890 3rd Qu.: 0.3200
## Max. :0.999947 Max. :2.290 Max. :2.090 Max. : 0.6400
## Store7 PctDiscMM PctDiscCH ListPriceDiff
## No :714 Min. :0.0000 Min. :0.00000 Min. :0.000
## Yes:356 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.140
## Median :0.0000 Median :0.00000 Median :0.240
## Mean :0.0593 Mean :0.02731 Mean :0.218
## 3rd Qu.:0.1127 3rd Qu.:0.00000 3rd Qu.:0.300
## Max. :0.4020 Max. :0.25269 Max. :0.440
## STORE
## Min. :0.000
## 1st Qu.:0.000
## Median :2.000
## Mean :1.631
## 3rd Qu.:3.000
## Max. :4.000

set.seed(36)
train = sample(1:nrow(OJ), 535)
OJ.train = OJ[train,]
OJ.test = OJ[-train,]

b. Fit a (linear) support vector classifier to the training data using cost=0.05, with Purchase as the response
and the other variables as predictors. Use the summary() function to produce summary statistics about the
SVM, and describe the results obtained.

library(e1071)
set.seed(36)
OJ.svm = svm(Purchase~., data=OJ.train, kernel = "linear", cost = 0.05)
summary(OJ.svm)


##
## Call:
## svm(formula = Purchase ~ ., data = OJ.train, kernel = "linear",
## cost = 0.05)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 0.05
## gamma: 0.05555556
##
## Number of Support Vectors: 262
##
## ( 131 131 )
##
##
## Number of Classes: 2
##
## Levels:
## CH MM

This summary shows that the model uses 262 of the 535 training observations as support vectors, 131 from each
of the two classes (CH and MM).

c. What are the training and test error rates?

# training confusion table and error rate
set.seed(36)
OJtrainPreds = predict(OJ.svm, newdata = OJ.train)
train.table = table(obs = OJ.train$Purchase, pred = OJtrainPreds)
train.table

## pred
## obs CH MM
## CH 267 46
## MM 46 176

1-sum(diag(train.table))/sum(train.table)

## [1] 0.1719626

# test confusion table and error rate
set.seed(36)
OJtestPreds = predict(OJ.svm, newdata = OJ.test)
test.table = table(obs = OJ.test$Purchase, pred = OJtestPreds)
test.table


## pred
## obs CH MM
## CH 302 38
## MM 49 146

1-sum(diag(test.table))/sum(test.table)

## [1] 0.1626168

Based on the confusion tables above, the training and test error rates are similar: about 17.2% on the training set
(82.8% accuracy) and 16.3% on the test set (83.7% accuracy).
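Since this table/diag computation is repeated throughout the rest of the problem, a small helper keeps it in one place (a sketch; err_rate is a hypothetical helper, not part of e1071):

# misclassification rate from predicted and observed class labels
err_rate = function(pred, obs) mean(pred != obs)
err_rate(OJtrainPreds, OJ.train$Purchase)   # ~0.172
err_rate(OJtestPreds, OJ.test$Purchase)     # ~0.163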

d. Use the tune() function to select an optimal cost. Consider values in the range 0.01 to 10.

set.seed(36)
svmTune = tune(svm, Purchase ~ ., data = OJ.train,
               ranges = list(cost = c(.01, .02, .05, .1, .2, .5, 1, 2, 5, 10)), kernel = "linear")
summary(svmTune)

##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 0.2
##
## - best performance: 0.1718728
##
## - Detailed performance results:
## cost error dispersion
## 1 0.01 0.1832285 0.04877999
## 2 0.02 0.1868623 0.04710860
## 3 0.05 0.1850454 0.04910731
## 4 0.10 0.1756813 0.04388344
## 5 0.20 0.1718728 0.04230377
## 6 0.50 0.1811670 0.03851654
## 7 1.00 0.1792802 0.04410072
## 8 2.00 0.1793152 0.03971763
## 9 5.00 0.1830538 0.04395067
## 10 10.00 0.1812020 0.04424007

plot(svmTune)


Tuning the SVM over this range shows that many different values of cost give similar cross-validation error, roughly
17-18%. The tuning output identifies cost = 0.2 as the best parameter (CV error ≈ 0.172), with cost = 0.1 performing
nearly identically.
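The tuned cost and the refit model can also be pulled straight from the tune object (a sketch; these are standard components of e1071's tune() result):

svmTune$best.parameters              # cost selected by 10-fold CV
svmTune$best.performance             # corresponding CV error
bestLinearSVM = svmTune$best.model   # svm refit on the full training set at that cost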

e. Compute the training and test error rates using this new value for cost.

newOJsvm = svm(Purchase ~ ., data = OJ.train, kernel = "linear", cost = 0.1)

# training
newOJtrainPreds = predict(newOJsvm, newdata = OJ.train)
newtrain.table = table(obs = OJ.train$Purchase, pred = newOJtrainPreds)
newtrain.table

## pred
## obs CH MM
## CH 263 50
## MM 39 183

1-sum(diag(newtrain.table))/sum(newtrain.table)

## [1] 0.1663551

#testing
newOJtestPreds = predict(newOJsvm, newdata = OJ.test)
newtest.table = table(obs = OJ.test$Purchase, pred = newOJtestPreds)
newtest.table


## pred
## obs CH MM
## CH 298 42
## MM 44 151

1-sum(diag(newtest.table))/sum(newtest.table)

## [1] 0.1607477

Refitting the SVM with cost = 0.1 and recomputing the training and test predictions gives slightly lower error rates:
16.6% on the training set (83.4% accuracy) and 16.1% on the test set (83.9% accuracy). As the tuning results in
part (d) showed, costs in this range give very similar error, which is why the improvement is modest.

f. Repeat parts (b) through (e) using a support vector machine with a radial kernel. Use the default value for
gamma.

radSVM = svm(Purchase ~ ., data = OJ.train, kernel = "radial", cost = 0.05)
summary(radSVM)

##
## Call:
## svm(formula = Purchase ~ ., data = OJ.train, kernel = "radial",
## cost = 0.05)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 0.05
## gamma: 0.05555556
##
## Number of Support Vectors: 447
##
## ( 222 225 )
##
##
## Number of Classes: 2
##
## Levels:
## CH MM

# training confusion table and error rate
set.seed(36)
OJradtrainPreds = predict(radSVM, newdata = OJ.train)
radsvm.train.table = table(obs = OJ.train$Purchase, pred = OJradtrainPreds)
radsvm.train.table


## pred
## obs CH MM
## CH 286 27
## MM 88 134

1-sum(diag(radsvm.train.table))/sum(radsvm.train.table)

## [1] 0.2149533

# test confusion table and error rate
set.seed(36)
OJradtestPreds = predict(radSVM, newdata = OJ.test)
radsvm.test.table = table(obs = OJ.test$Purchase, pred = OJradtestPreds)
radsvm.test.table

## pred
## obs CH MM
## CH 316 24
## MM 95 100

1-sum(diag(radsvm.test.table))/sum(radsvm.test.table)

## [1] 0.2224299

set.seed(36)
radialSVM = tune(svm, Purchase ~ ., data = OJ.train,
                 ranges = list(cost = c(.01, .02, .05, .1, .2, .5, 1, 2, 5, 10)), kernel = "radial")
summary(radialSVM)


##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 1
##
## - best performance: 0.1866177
##
## - Detailed performance results:
## cost error dispersion
## 1 0.01 0.4149196 0.05553547
## 2 0.02 0.4149196 0.05553547
## 3 0.05 0.2356045 0.06045365
## 4 0.10 0.2018868 0.05730822
## 5 0.20 0.1980084 0.06225772
## 6 0.50 0.1885744 0.05898148
## 7 1.00 0.1866177 0.05977225
## 8 2.00 0.1978686 0.05464430
## 9 5.00 0.1959818 0.05531844
## 10 10.00 0.2034242 0.06308126

radialOJsvm = svm(Purchase~., data=OJ.train, kernel = "radial", cost = 1)

#training
radialOJtrainPreds = predict(radialOJsvm, newdata = OJ.train)
radtrain.table = table(obs = OJ.train$Purchase, pred = radialOJtrainPreds)
radtrain.table

## pred
## obs CH MM
## CH 275 38
## MM 49 173

1-sum(diag(radtrain.table))/sum(radtrain.table)

## [1] 0.1626168

#testing
radialOJtestPreds = predict(radialOJsvm, newdata = OJ.test)
radtest.table = table(obs = OJ.test$Purchase, pred = radialOJtestPreds)
radtest.table

## pred
## obs CH MM
## CH 312 28
## MM 62 133


1-sum(diag(radtest.table))/sum(radtest.table)

## [1] 0.1682243

With the tuned cost of 1, the radial kernel does much better than it did at cost = 0.05: the training error drops to 16.3% (83.7% accuracy) and the test error to 16.8% (83.2% accuracy), roughly on par with the tuned linear classifier.

g. Repeat parts (b) through (e) using a support vector machine with a polynomial kernel of degree 2.

polyOJSVM = svm(Purchase ~ ., data = OJ.train, kernel = "polynomial", degree = 2, cost = 0.05)
summary(polyOJSVM)

##
## Call:
## svm(formula = Purchase ~ ., data = OJ.train, kernel = "polynomial",
## degree = 2, cost = 0.05)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: polynomial
## cost: 0.05
## degree: 2
## gamma: 0.05555556
## coef.0: 0
##
## Number of Support Vectors: 427
##
## ( 212 215 )
##
##
## Number of Classes: 2
##
## Levels:
## CH MM

# training confusion table and error rate
set.seed(36)
OJpolytrainPreds = predict(polyOJSVM, newdata = OJ.train)
polysvm.train.table = table(obs = OJ.train$Purchase, pred = OJpolytrainPreds)
polysvm.train.table

## pred
## obs CH MM
## CH 307 6
## MM 172 50

1-sum(diag(polysvm.train.table))/sum(polysvm.train.table)


## [1] 0.3327103

# test confusion table and error rate
set.seed(36)
OJpolytestPreds = predict(polyOJSVM, newdata = OJ.test)
polysvm.test.table = table(obs = OJ.test$Purchase, pred = OJpolytestPreds)
polysvm.test.table

## pred
## obs CH MM
## CH 330 10
## MM 164 31

1-sum(diag(polysvm.test.table))/sum(polysvm.test.table)

## [1] 0.3252336

set.seed(36)
polySVM = tune(svm, Purchase ~ ., data = OJ.train,
               ranges = list(cost = c(.01, .02, .05, .1, .2, .5, 1, 2, 5, 10)), kernel = "polynomial", degree = 2)
summary(polySVM)

##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 5
##
## - best performance: 0.19413
##
## - Detailed performance results:
## cost error dispersion
## 1 0.01 0.4149196 0.05553547
## 2 0.02 0.3851153 0.05320332
## 3 0.05 0.3458071 0.05901895
## 4 0.10 0.3233054 0.06237070
## 5 0.20 0.2858840 0.06399353
## 6 0.50 0.2278127 0.06978194
## 7 1.00 0.2221523 0.07434275
## 8 2.00 0.2016073 0.07409238
## 9 5.00 0.1941300 0.06193817
## 10 10.00 0.2016073 0.06151487


polyOJsvm = svm(Purchase~., data=OJ.train, kernel = "polynomial", degree = 2, cost = 5)

#training
polyOJtrainPreds = predict(polyOJsvm, newdata = OJ.train)
polytrain.table = table(obs = OJ.train$Purchase, pred = polyOJtrainPreds)
polytrain.table

## pred
## obs CH MM
## CH 286 27
## MM 55 167

1-sum(diag(polytrain.table))/sum(polytrain.table)

## [1] 0.153271

#testing
polyOJtestPreds = predict(polyOJsvm, newdata = OJ.test)
polytest.table = table(obs = OJ.test$Purchase, pred = polyOJtestPreds)
polytest.table

## pred
## obs CH MM
## CH 309 31
## MM 69 126

1-sum(diag(polytest.table))/sum(polytest.table)

## [1] 0.1869159

Looking at our results, the tuned degree-2 polynomial kernel (test error ≈ 18.7%) does not beat the radial kernel
(≈ 16.8%) or the linear classifier (≈ 16.1%) from earlier.
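Collecting the test error rates computed so far makes the comparison explicit (a sketch; the values are copied from the confusion-table results above):

# test error rates of the tuned models
c(linear = 0.161, radial = 0.168, polynomial = 0.187)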

h. Repeat parts (b) through (e) using a linear support vector machine, applied to an expanded feature set
consisting of linear and all possible quadratic terms for the predictors. How does this compare to the
polynomial kernel both conceptually and in terms of the results for this problem?


set.seed(315)
# earlier attempts at building the quadratic expansion (kept for reference):
# quadratic = apply(OJ[,-1], 2, as.numeric)   # MARGIN = 2 applies the function over columns
# quadratic = quadratic^2
# quadOJ = cbind(quadratic, OJ)
# quadratic = do.call(poly, c(lapply(2:18, function(x) as.numeric(OJ[,x])), degree = 2, raw = TRUE))
# quadOJ = cbind(quadratic, OJ$Purchase)

newquadOJ = data.frame(model.matrix(~ .^2 - 1, OJ))
keepquadOJ = newquadOJ[-c(1, 2)]
quadOJ = cbind(OJ$Purchase, keepquadOJ)
colnames(quadOJ)[1] = "Purchase"

quadOJtrain = quadOJ[train, ]
quadOJtest = quadOJ[-train, ]

hquadSVM = svm(Purchase ~ polym(PriceCH, PriceMM, DiscCH, DiscMM, LoyalCH, SalePriceMM, SalePriceCH,
                                PriceDiff, PctDiscMM, PctDiscCH, ListPriceDiff, degree = 2),
               data = quadOJtrain, kernel = "linear", cost = 0.05)
summary(hquadSVM)

##
## Call:
## svm(formula = Purchase ~ polym(PriceCH, PriceMM, DiscCH, DiscMM,
## LoyalCH, SalePriceMM, SalePriceCH, PriceDiff, PctDiscMM,
## PctDiscCH, ListPriceDiff, degree = 2), data = quadOJtrain,
## kernel = "linear", cost = 0.05)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 0.05
## gamma: 0.01298701
##
## Number of Support Vectors: 256
##
## ( 127 129 )
##
##
## Number of Classes: 2
##
## Levels:
## CH MM


quadOJsvm = tune(svm, Purchase ~ polym(PriceCH, PriceMM, DiscCH, DiscMM, LoyalCH, SalePriceMM, SalePriceCH,
                                       PriceDiff, PctDiscMM, PctDiscCH, ListPriceDiff, degree = 2),
                 data = quadOJtrain, ranges = list(cost = c(.01, .02, .05, .1, .2, .5, 1, 2, 5, 10)),
                 kernel = "linear")
summary(quadOJsvm)

##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 0.01
##
## - best performance: 0.2242837
##
## - Detailed performance results:
## cost error dispersion
## 1 0.01 0.2242837 0.07316077
## 2 0.02 0.2673305 0.06020821
## 3 0.05 0.3026555 0.04910140
## 4 0.10 0.3009783 0.05457131
## 5 0.20 0.3046820 0.04299572
## 6 0.50 0.3066038 0.05260060
## 7 1.00 0.3028302 0.04732529
## 8 2.00 0.2990566 0.05611746
## 9 5.00 0.3008386 0.05382667
## 10 10.00 0.2989518 0.04491017

hquadtrainPreds = predict(hquadSVM, newdata = quadOJtrain)
hquadtrain.table = table(obs = quadOJtrain$Purchase, pred = hquadtrainPreds)
hquadtrain.table

## pred
## obs CH MM
## CH 278 35
## MM 62 160

1-sum(diag(hquadtrain.table))/sum(hquadtrain.table)

## [1] 0.1813084

hquadtestPreds = predict(hquadSVM, newdata = quadOJtest)
hquadtest.table = table(obs = quadOJtest$Purchase, pred = hquadtestPreds)
hquadtest.table


## pred
## obs CH MM
## CH 304 36
## MM 64 131

1-sum(diag(hquadtest.table))/sum(hquadtest.table)

## [1] 0.1869159

quadsvm = svm(Purchase ~ polym(PriceCH, PriceMM, DiscCH, DiscMM, LoyalCH, SalePriceMM, SalePriceCH,
                               PriceDiff, PctDiscMM, PctDiscCH, ListPriceDiff, degree = 2),
              data = quadOJtrain, kernel = "linear", cost = 0.01)

#training
quadOJtrainPreds = predict(quadsvm, newdata = OJ.train)
quadtrain.table = table(obs = OJ.train$Purchase, pred = quadOJtrainPreds)
quadtrain.table

## pred
## obs CH MM
## CH 275 38
## MM 65 157

1-sum(diag(quadtrain.table))/sum(quadtrain.table)

## [1] 0.1925234

#testing
quadOJtestPreds = predict(quadsvm, newdata = OJ.test)
quadtest.table = table(obs = OJ.test$Purchase, pred = quadOJtestPreds)
quadtest.table

## pred
## obs CH MM
## CH 301 39
## MM 65 130

1-sum(diag(quadtest.table))/sum(quadtest.table)

## [1] 0.1943925

Compared to the tuned polynomial kernel, the linear SVM on the expanded quadratic feature set gives similar but
slightly worse test error here. Conceptually the two approaches are closely related: a degree-2 polynomial kernel
implicitly fits a linear boundary in the same space of linear, squared, and pairwise interaction terms, but it does so
via the kernel trick, without ever constructing those features explicitly and with a slightly different weighting of the
terms, so the two fits need not be identical.
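To see the size of the feature space both approaches are (implicitly or explicitly) working in, for the 11 numeric predictors used above (a quick sketch of the arithmetic):

# linear terms + squared terms + pairwise interactions for p = 11 predictors
p = 11
p + p + choose(p, 2)   # 11 + 11 + 55 = 77 quadratic features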

i. Overall, which approach seems to give the best results on this data?


Overall, comparing the tuned models, the linear support vector classifier achieves the lowest test error (≈ 16.1%), slightly ahead of the radial kernel (≈ 16.8%), while the polynomial kernel and the explicit quadratic expansion do noticeably worse. So the linear kernel seems to give the best results on this data.

Problem 4
Consider a dataset with n observations, x_i ∈ R^p for i = 1, …, n. In this problem we show that the K-means
algorithm is guaranteed to converge, but not necessarily to the globally optimal solution.

All solutions attached as images.

a. At the beginning of each iteration of the K-means algorithm, we have K clusters C1, …, CK ⊆ R^p, and each
data point is assigned to the cluster with the nearest centroid (at this point, the centroids are not
necessarily equal to the mean of the data points assigned to the cluster). Show (according to the problem
specifications):

b. Define (according to the problem specifications):

c. Show that the K-means algorithm is guaranteed to converge.

d. Give, as an example, a toy data set and a pair of initial centroids for which the 2-means algorithm does not
converge to the globally optimal minimum. (A small R sketch illustrating such a case is given below.)
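Although the worked solutions are attached as images, the part (d) phenomenon is easy to reproduce in R. The following is a minimal sketch under my own illustrative choice of data and starting centroids (not necessarily the example in the attached solution): four points at the corners of a long, thin rectangle, with initial centroids placed so that Lloyd's algorithm converges immediately to a poor local optimum.

# four points: two tight pairs far apart along the x-axis
pts = rbind(c(0, 0), c(0, 1), c(10, 0), c(10, 1))

# bad initialization: centroids stacked vertically in the middle
badStart = rbind(c(5, 0), c(5, 1))
km.bad = kmeans(pts, centers = badStart, algorithm = "Lloyd")
km.bad$tot.withinss   # 100: splits the points by y-coordinate (a local optimum)

# good initialization: centroids near the two natural clusters
goodStart = rbind(c(0, 0.5), c(10, 0.5))
km.good = kmeans(pts, centers = goodStart, algorithm = "Lloyd")
km.good$tot.withinss  # 1: splits the points by x-coordinate (the global optimum)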

[Attached images: solutions to Problems 4a, 4b, 4c, and 4d.]
