
The 5th Tribe, Support Vector Machines and caret

by Joseph Rickert
In his new book, The Master Algorithm, Pedro Domingos takes on the heroic task of explaining
machine learning to a wide audience and classifies machine learning practitioners into 5 tribes*,
each with its own fundamental approach to learning problems. To the 5th tribe, the analogizers,
Pedro ascribes the Support Vector Machine (SVM) as its master algorithm. Although the SVM
has been a competitive and popular algorithm since its discovery in the 1990s, this might be the
breakout moment for SVMs into pop culture. (What algorithm has a cooler name?) More people
than ever will want to give them a try and face the challenge of tuning them. Fortunately, there
is plenty of help out there and some good tools for the task.
In A Practical Guide to Support Vector Classification, which is aimed at beginners, Hsu et al.
suggest the following methodology:

Transform data to the format of an SVM package

Conduct simple scaling on the data

Consider the RBF kernel K(x, y) = exp(-γ||x - y||²)

Use cross-validation to find the best parameters C and γ

Use the best parameters C and γ to train the whole training set

Test
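
To get a concrete feel for what the RBF kernel computes, here is a minimal sketch using kernlab's rbfdot() function (kernlab writes the kernel as exp(-sigma * ||x - y||^2)); the sigma value and toy vectors below are arbitrary illustrations, not values used later in the post.

# A toy illustration (not part of the analysis below): evaluating the RBF kernel
# directly with kernlab's rbfdot(); sigma and the vectors are arbitrary.
library(kernlab)
rbf <- rbfdot(sigma = 0.05)     # kernel function k(x, y) = exp(-sigma * ||x - y||^2)
x <- c(1, 2, 3)
y <- c(1, 2, 2.5)
rbf(x, y)                       # kernel value for one pair of points
kernelMatrix(rbf, matrix(rnorm(20), ncol = 2))   # kernel matrix for a small random data set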

In this post, we present a variation of the methodology using R and the caret package. First, we
set up for an analysis, loading the segmentation data set from the caret package and using
caret's createDataPartition() function to produce training and test data sets.
# Training SVM Models
library(caret)
library(dplyr)         # Used by caret
library(kernlab)       # support vector machine
library(pROC)          # plot the ROC curves

### Get the Data
# Load the data and construct indices to divide it into training and test data sets.
data(segmentationData)    # Load the segmentation data set
trainIndex <- createDataPartition(segmentationData$Case, p = .5, list = FALSE)
trainData <- segmentationData[trainIndex, ]
testData  <- segmentationData[-trainIndex, ]
trainX <- trainData[, 4:61]    # Pull out the variables for training
sapply(trainX, summary)        # Look at a summary of the training data

Next, we carry out a two-pass training and tuning process. In the first pass, shown in the code
block below, we arbitrarily pick some tuning parameters and use the default caret settings for
others. In the trainControl() function we specify 5 repetitions of 10-fold cross validation.
In the train() function, which actually does the work, we specify the radial kernel using the
method parameter and ROC as the metric for assessing performance. The tuneLength
parameter is set to pick 9 arbitrary values for C, the "cost" of the radial kernel. This
parameter controls the complexity of the boundary between support vectors. The radial kernel
also requires setting a smoothing parameter, sigma. In this first pass we let train() use its
default method of calculating an analytically derived estimate for sigma. Also note that we
instruct train() to center and scale the data before running the analysis with the preProc
parameter.
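
(For reference, caret's analytic sigma estimate for the "svmRadial" method is based on kernlab's sigest() function. A minimal sketch of calling it directly on the training data follows; it assumes trainX has been created as above.)

# A minimal sketch of the analytic sigma estimate: kernlab's sigest() returns low,
# middle and high candidate values for sigma based on the distribution of squared
# distances between (scaled) training points; train() derives its single default
# sigma from this range.
library(kernlab)
sigDist <- sigest(as.matrix(trainX), scaled = TRUE)
sigDist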
## SUPPORT VECTOR MACHINE MODEL
# First pass
set.seed(1492)
# Set up for cross validation
ctrl <- trainControl(method = "repeatedcv",   # 10-fold cross validation
                     repeats = 5,             # do 5 repetitions of cv
                     summaryFunction = twoClassSummary,   # Use AUC to pick the best model
                     classProbs = TRUE)

# Train and Tune the SVM
svm.tune <- train(x = trainX,
                  y = trainData$Class,
                  method = "svmRadial",            # Radial kernel
                  tuneLength = 9,                  # 9 values of the cost function
                  preProc = c("center","scale"),   # Center and scale data
                  metric = "ROC",
                  trControl = ctrl)
svm.tune
# Support Vector Machines with Radial Basis Function Kernel
#
# 1010 samples
# 58 predictor
# 2 classes: 'PS', 'WS'
#
# Pre-processing: centered, scaled
# Resampling: Cross-Validated (10 fold, repeated 5 times)
# Summary of sample sizes: 908, 909, 909, 909, 909, 909, ...
# Resampling results across tuning parameters:
#
#   C      ROC        Sens       Spec       ROC SD      Sens SD     Spec SD
#    0.25  0.8695054  0.8540559  0.6690476  0.03720951  0.04389913  0.07282584
#    0.50  0.8724147  0.8592774  0.6912857  0.03618794  0.04234003  0.07830249
#    1.00  0.8746137  0.8718648  0.6968254  0.03418700  0.04204607  0.06918850
#    2.00  0.8709825  0.8755478  0.6969048  0.03345607  0.03927223  0.06714838
#    4.00  0.8609396  0.8795478  0.6702063  0.03437846  0.04189803  0.06597494
#    8.00  0.8456799  0.8703357  0.6310635  0.03610988  0.04105803  0.07540066
#   16.00  0.8293339  0.8666667  0.5943492  0.03717344  0.04773906  0.08006023
#   32.00  0.8220839  0.8636131  0.5759365  0.03622665  0.04531028  0.07587914
#   64.00  0.8123889  0.8605315  0.5541746  0.03795353  0.04494173  0.07140262
#
# Tuning parameter 'sigma' was held constant at a value of 0.01521335
# ROC was used to select the optimal model using the largest value.
# The final values used for the model were sigma = 0.01521335 and C = 1.

The results show that the best model resulted from setting C = 1, with sigma held constant at the estimated value of 0.01521335.
In the second pass, having seen the parameter values selected in the first pass, we
use train()'s tuneGrid parameter to do some sensitivity analysis around the values C = 1 and
sigma = 0.015 that produced the model with the best ROC value. Note that R's
expand.grid() function is used to build a data frame containing all the combinations of C and
sigma we want to look at.
# Second pass
# Look at the results of svm.tune and refine the parameter space
set.seed(1492)
# Use expand.grid to specify the search space
grid <- expand.grid(sigma = c(.01, .015, 0.2),
                    C = c(0.75, 0.9, 1, 1.1, 1.25))

# Train and Tune the SVM
svm.tune <- train(x = trainX,
                  y = trainData$Class,
                  method = "svmRadial",
                  preProc = c("center","scale"),
                  metric = "ROC",
                  tuneGrid = grid,
                  trControl = ctrl)
svm.tune
# Support Vector Machines with Radial Basis Function Kernel
#
# 1010 samples
# 58 predictor
# 2 classes: 'PS', 'WS'
#
# Pre-processing: centered, scaled
# Resampling: Cross-Validated (10 fold, repeated 5 times)
# Summary of sample sizes: 909, 909, 908, 910, 909, 909, ...
# Resampling results across tuning parameters:
#
#   sigma  C     ROC        Sens       Spec       ROC SD      Sens SD     Spec SD
#   0.010  0.75  0.8727381  0.8614685  0.6817619  0.04096223  0.04183900  0.07664910
#   0.010  0.90  0.8742107  0.8633193  0.6878889  0.04066995  0.04037202  0.07817537
#   0.010  1.00  0.8748389  0.8630023  0.6873016  0.04079094  0.04061032  0.08189960
#   0.010  1.10  0.8747998  0.8642378  0.6884444  0.04076756  0.04004827  0.07892234
#   0.010  1.25  0.8749384  0.8657762  0.6923492  0.04083294  0.03911751  0.08070616
#   0.015  0.75  0.8726557  0.8660793  0.6923333  0.04171842  0.04324822  0.08203598
#   0.015  0.90  0.8727037  0.8688531  0.6945714  0.04164810  0.04082448  0.08379649
#   0.015  1.00  0.8727079  0.8713147  0.6906667  0.04184851  0.04273855  0.08174494
#   0.015  1.10  0.8724013  0.8719301  0.6895556  0.04197524  0.04108930  0.08377854
#   0.015  1.25  0.8721560  0.8722331  0.6900952  0.04207802  0.04096501  0.08639355
#   0.200  0.75  0.8193497  0.8863263  0.4478413  0.04695531  0.04072159  0.08061632
#   0.200  0.90  0.8195377  0.8903263  0.4405397  0.04688797  0.04091728  0.07844983
#   0.200  1.00  0.8193307  0.8915478  0.4361111  0.04719399  0.04004779  0.07815045
#   0.200  1.10  0.8195696  0.8958508  0.4333651  0.04694670  0.04026003  0.08252021
#   0.200  1.25  0.8198250  0.8983077  0.4271905  0.04705685  0.03900879  0.07945602
#
# ROC was used to select the optimal model using the largest value.
# The final values used for the model were sigma = 0.01 and C = 1.25.

This was quite a bit of calculation for an improvement of 0.0003247 in the ROC score, but it
shows off some of what caret can do.
To finish up, we build a model with a different kernel. The linear kernel is the simplest way
to go. There is only the C parameter to set for this kernel, and train() hardwires in a value of
C = 1 (a sketch of how C could be tuned instead appears after the output below). The resulting ROC value of 0.87 is not too shabby.
# Linear Kernel
set.seed(1492)
# Train and Tune the SVM
svm.tune2 <- train(x = trainX,
                   y = trainData$Class,
                   method = "svmLinear",
                   preProc = c("center","scale"),
                   metric = "ROC",
                   trControl = ctrl)
svm.tune2
# Support Vector Machines with Linear Kernel
#
# 1010 samples
#   58 predictor
#    2 classes: 'PS', 'WS'
#
# Pre-processing: centered, scaled
# Resampling: Cross-Validated (10 fold, repeated 5 times)
# Summary of sample sizes: 909, 909, 908, 910, 909, 909, ...
# Resampling results
#
#   ROC        Sens       Spec       ROC SD      Sens SD     Spec SD
#   0.8654818  0.8774312  0.6327302  0.03975929  0.04210438  0.0824907
#
# Tuning parameter 'C' was held constant at a value of 1
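
If we did want to tune C for the linear kernel rather than accept the hardwired value, a sketch of how that might look is below; the grid of candidate C values is arbitrary, and svm.tune.linear is just an illustrative name.

# Sketch only (not run above): tune C for the linear kernel over a small arbitrary grid
set.seed(1492)
linear.grid <- expand.grid(C = c(0.25, 0.5, 1, 2, 4))
svm.tune.linear <- train(x = trainX,
                         y = trainData$Class,
                         method = "svmLinear",
                         preProc = c("center","scale"),
                         metric = "ROC",
                         tuneGrid = linear.grid,
                         trControl = ctrl)
svm.tune.linear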

Because I took the trouble to set the seed for the pseudorandom number generator to the same
value before each of the resampling operations above, we can use caret's resamples() function
to compare the results generated by the radial kernel and linear kernel models. The following
block of code and results shows just the first five lines of the comparison table but includes the
full summary of the comparison.
rValues <- resamples(list(svm=svm.tune,svm.tune2))
rValues$values
#      Resample   svm~ROC  svm~Sens  svm~Spec Model2~ROC Model2~Sens Model2~Spec
# 1 Fold01.Rep1 0.8995726 0.8153846 0.7777778  0.8914530   0.8461538   0.6666667
# 2 Fold01.Rep2 0.9452991 0.9076923 0.7500000  0.9346154   0.9384615   0.6666667
# 3 Fold01.Rep3 0.8153846 0.8153846 0.6111111  0.8115385   0.8153846   0.5555556
# 4 Fold01.Rep4 0.9115385 0.9076923 0.6944444  0.8918803   0.9076923   0.4722222
# 5 Fold01.Rep5 0.9017094 0.8615385 0.6944444  0.8645299   0.8923077   0.6666667
summary(rValues)
# Call:
# summary.resamples(object = rValues)
#
# Models: svm, Model2
# Number of resamples: 50
#
# ROC
#          Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
# svm    0.7910  0.8503 0.8785 0.8749  0.9013 0.9620    0
# Model2 0.7789  0.8399 0.8609 0.8655  0.8918 0.9457    0
#
# Sens
#          Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
# svm    0.7538  0.8346 0.8769 0.8658  0.8935 0.9385    0
# Model2 0.7692  0.8462 0.8923 0.8774  0.9077 0.9538    0
#
# Spec
#          Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
# svm    0.4286  0.6312 0.6944 0.6923  0.7500 0.8333    0
# Model2 0.4722  0.5556 0.6389 0.6327  0.6667 0.8333    0
bwplot(rValues, metric = "ROC", ylab = c("linear kernel", "radial kernel"))   # boxplot

The boxplot below would appear to give the radial kernel the edge as the better model, but
of course, we are far from done. We will leave the last step of Hsu et al.'s methodology, testing
the models on held-out testing data, for another day.
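
For the curious, here is a rough sketch of what that final step might look like with the objects already in hand, predicting on testData and then using caret's confusionMatrix() and pROC's roc(); treat it as an outline rather than a reported result.

# Sketch of the held-out evaluation step (results not shown or discussed here)
testX <- testData[, 4:61]                                       # same predictors used for training
svm.pred  <- predict(svm.tune, newdata = testX)                 # predicted classes
svm.probs <- predict(svm.tune, newdata = testX, type = "prob")  # class probabilities
confusionMatrix(svm.pred, testData$Class)                       # accuracy, sensitivity, specificity
svm.roc <- roc(response = testData$Class, predictor = svm.probs$PS)
plot(svm.roc)                                                   # ROC curve on the test data
auc(svm.roc)                                                    # area under the curve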
