TEAM MATES:
Abstract Submission:
1. We reviewed the predictor variables and dropped ‘Id’ and ‘Phone number’, because
both are unique for each customer. The importance plot from the randomForest
package in R confirms this.
2. In the importance plot from Random Forest, ‘Area Code’ is the least important
variable, with less than 5 % importance.
3. One-hot encoding the categorical variable ‘State’ decreased the accuracy, so we
dropped this variable.
4. We found no missing values in the data. We performed stratified sampling using
‘createDataPartition’ from the caret package and split the whole dataset into a
train set and a test set in a 70:30 ratio.
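The stratified 70:30 split described in step 4 can be sketched as follows. This is a minimal Python illustration of the idea only, not the R code used for the analysis; the function name stratified_split, the seed, and the toy labels are assumptions made for this example.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.7, seed=42):
    """Return (train_idx, test_idx), splitting every class in the same ratio."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])   # first 70 % of this class
        test_idx.extend(indices[cut:])    # remaining 30 % of this class
    return sorted(train_idx), sorted(test_idx)


# Toy labels (hypothetical, not the real data): 20 churners among 100 customers.
labels = [True] * 20 + [False] * 80
train_idx, test_idx = stratified_split(labels)

train_churn_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_churn_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), train_churn_rate, test_churn_rate)
```

Because each class is cut separately, the train and test sets keep the same proportion of churners as the full dataset, which matters when one class is much rarer than the other.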
RESULTS-
             Reference
Prediction   False   True
False         1225    116
True            62     96
Accuracy: 88.12 %
Precision: 91.13 %
Recall: 95.2 %
2. For Decision Tree:
             Reference
Prediction   False   True
False         1269     56
True            18    156
Accuracy: 95.06 %
Precision: 95.8 %
Recall: 98.6 %
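The reported metrics can be recomputed from a confusion matrix. The sketch below is written in Python for illustration and uses the Decision Tree matrix; the function name metrics is an assumption. The reported precision and recall appear consistent with treating ‘False’ as the positive class, so the sketch makes the same assumption.

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall from the four confusion-matrix cells."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Decision Tree matrix, assuming 'False' is the positive class:
#   tp = predicted False, reference False = 1269
#   fp = predicted False, reference True  = 56
#   fn = predicted True,  reference False = 18
#   tn = predicted True,  reference True  = 156
m = metrics(tp=1269, fp=56, fn=18, tn=156)

for name, value in m.items():
    print(f"{name}: {100 * value:.2f} %")
# accuracy: 95.06 %
# precision: 95.77 %
# recall: 98.60 %
```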
GREAT STEP – SAFETY DATA ANALYTICS ABSTRACT SUBMISSION
             Reference
Prediction   False   True
False         1275    117
True            12     95
Accuracy: 91.39 %
Precision: 91.6 %
Recall: 99.1 %
             Reference
Prediction   False   True
False         1280    123
True             7     89
Accuracy: 91.32 %
Precision: 91.2 %
Recall: 99.5 %
5. For SVM – Linear kernel:
             Reference
Prediction   False   True
False         1287    212
True             0      0
Accuracy: 85.85 %
Precision: 85.9 %
Recall: 100 %
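In the linear-kernel matrix every test case is predicted ‘False’, so the accuracy equals the share of ‘False’ cases in the test set. The short Python sketch below makes this explicit and also computes the metrics with ‘True’ taken as the positive class. That choice of positive class and the helper safe_div are assumptions for illustration; they are not part of the reported results.

```python
def safe_div(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


# Linear-kernel SVM matrix (rows: prediction, columns: reference).
pred_false_ref_false, pred_false_ref_true = 1287, 212
pred_true_ref_false, pred_true_ref_true = 0, 0

total = (pred_false_ref_false + pred_false_ref_true
         + pred_true_ref_false + pred_true_ref_true)

accuracy = (pred_false_ref_false + pred_true_ref_true) / total
# Share of reference 'False' cases: what always predicting 'False' would score.
majority_share = (pred_false_ref_false + pred_true_ref_false) / total

# With 'True' as the positive class (an assumption for this sketch):
recall_true = safe_div(pred_true_ref_true,
                       pred_true_ref_true + pred_false_ref_true)    # 0 / 212
precision_true = safe_div(pred_true_ref_true,
                          pred_true_ref_true + pred_true_ref_false)  # 0 / 0

print(accuracy == majority_share, recall_true, precision_true)
# True 0.0 None
```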
             Reference
Prediction   False   True
False         1195    190
True            92     22
Accuracy: 81.18 %
Precision: 86.3 %
Recall: 92.9 %
We did not tune any parameters for the rpart and Naive Bayes models. To improve the
accuracy of the support vector classifier, we need to select the best parameters for
the model. We trained many models over different pairs of ϵ (epsilon) and cost, and
chose the best one based on its root mean square error (RMSE).
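The search over (epsilon, cost) pairs described above can be sketched as a plain grid search. This Python example is illustrative only: the parameter ranges are assumed, and toy_evaluate is a synthetic stand-in for training an SVM and measuring its RMSE, so its error surface (lowest at epsilon 0.2, cost 8) is made up and says nothing about the real models.

```python
import itertools
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def grid_search(evaluate, epsilons, costs):
    """Evaluate every (epsilon, cost) pair; return (best_rmse, epsilon, cost)."""
    results = [(evaluate(eps, cost), eps, cost)
               for eps, cost in itertools.product(epsilons, costs)]
    return min(results)  # tuples compare on RMSE first


def toy_evaluate(epsilon, cost):
    """Hypothetical stand-in for 'train a model, then score it'.

    The predictions are shifted by a bias that grows with the distance
    from epsilon = 0.2 and cost = 8; this surface is entirely synthetic.
    """
    actual = [0.0, 1.0, 0.0, 1.0]
    bias = abs(epsilon - 0.2) + abs(math.log2(cost) - 3) / 10
    predicted = [a + bias for a in actual]
    return rmse(actual, predicted)


epsilons = [i / 10 for i in range(11)]   # 0.0, 0.1, ..., 1.0 (assumed range)
costs = [2 ** k for k in range(2, 10)]   # 4, 8, ..., 512 (assumed range)

best_rmse, best_epsilon, best_cost = grid_search(toy_evaluate, epsilons, costs)
print(best_rmse, best_epsilon, best_cost)
```

In a real run, toy_evaluate would be replaced by a function that fits the model with the given parameters and returns its RMSE on held-out data.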
For SVM:
In the figure below, darker blue regions represent SVM models with lower RMSE: the
darker the region, the lower the RMSE of the model.