
Kaggle Titanic

Chia
Wednesday, December 16, 2015
Problem Description
Competition Task: Titanic
The task is to predict whether a passenger survived the sinking of the Titanic, given the passenger's attributes in the
archival data, e.g., age, sex, number of siblings or spouses aboard, number of parents or children aboard,
passenger class, ticket fare paid, etc. (https://www.kaggle.com/c/titanic)
Evaluation:
The accuracy of the prediction model is calculated by comparing the predicted results with the ground truth
in the test data set.
Data:
Kaggle has separated the data into train.csv and test.csv. Here, a summary and some simple analyses are
run to get a brief idea of the data (training set).
Note: (Source: Kaggle.com)
survival Survival (0 = No; 1 = Yes)
pclass Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
name Name
sex Sex
age Age
sibsp Number of Siblings/Spouses Aboard
parch Number of Parents/Children Aboard
ticket Ticket Number
fare Passenger Fare
cabin Cabin
embarked Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
Training Data Summary

##   PassengerId       Survived          Pclass     
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000  
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000  
##  Median :446.0   Median :0.0000   Median :3.000  
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309  
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000  
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000  
##                                    Name         Sex           Age       
##  Abbing, Mr. Anthony                  :  1   female:314   Min.   : 0.42  
##  Abbott, Mr. Rossmore Edward          :  1   male  :577   1st Qu.:20.12  
##  Abbott, Mrs. Stanton (Rosa Hunt)     :  1                Median :28.00  
##  Abelson, Mr. Samuel                  :  1                Mean   :29.70  
##  Abelson, Mrs. Samuel (Hannah Wizosky):  1                3rd Qu.:38.00  
##  Adahl, Mr. Mauritz Nils Martin       :  1                Max.   :80.00  
##  (Other)                              :885                NA's   :177    
##      SibSp           Parch            Ticket         Fare       
##  Min.   :0.000   Min.   :0.0000   1601    :  7   Min.   :  0.00  
##  1st Qu.:0.000   1st Qu.:0.0000   347082  :  7   1st Qu.:  7.91  
##  Median :0.000   Median :0.0000   CA. 2343:  7   Median : 14.45  
##  Mean   :0.523   Mean   :0.3816   3101295 :  6   Mean   : 32.20  
##  3rd Qu.:1.000   3rd Qu.:0.0000   347088  :  6   3rd Qu.: 31.00  
##  Max.   :8.000   Max.   :6.0000   CA 2144 :  6   Max.   :512.33  
##                                   (Other) :852                   
##          Cabin     Embarked
##             :687    :  2   
##  B96 B98    :  4   C:168   
##  C23 C25 C27:  4   Q: 77   
##  G6         :  4   S:644   
##  C22 C26    :  3           
##  D          :  3           
##  (Other)    :186           
First Look

Some attributes (i.e., Name, PassengerId, Ticket, Embarked, and Cabin) seem either irrelevant to a
passenger's survival or too sparse, so they are not included in the following analyses.
The correlation table suggests that survival is relatively highly correlated with Pclass and
Sex. The correlation between Age and survival is not yet clear. Not surprisingly, Pclass is highly correlated
with Fare.
##               Survived         Sex      Pclass         Age       SibSp       Parch        Fare
## Survived   1.00000000 -0.53882559 -0.35965268 -0.07722109 -0.01735836  0.09331701  0.26818862
## Sex       -0.53882559  1.00000000  0.15546030  0.09325358 -0.10394968 -0.24697204 -0.18499425
## Pclass    -0.35965268  0.15546030  1.00000000 -0.36922602  0.06724737  0.02568307 -0.55418247
## Age       -0.07722109  0.09325358 -0.36922602  1.00000000 -0.30824676 -0.18911926  0.09606669
## SibSp     -0.01735836 -0.10394968  0.06724737 -0.30824676  1.00000000  0.38381986  0.13832879
## Parch      0.09331701 -0.24697204  0.02568307 -0.18911926  0.38381986  1.00000000  0.20511888
## Fare       0.26818862 -0.18499425 -0.55418247  0.09606669  0.13832879  0.20511888  1.00000000
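A correlation table like the one above can be produced by recoding the factor columns numerically before calling `cor()`. The sketch below uses a small simulated stand-in for the Titanic training frame (with the real data, the simulation block would be replaced by the loaded `train` data frame); the `female=1, male=2` coding via factor codes is an assumption.

```r
set.seed(1)
n <- 100
# simulated stand-in for the Titanic training frame
train <- data.frame(
  Survived = sample(0:1, n, TRUE),
  Sex      = sample(c("female", "male"), n, TRUE),
  Pclass   = sample(1:3, n, TRUE),
  Age      = sample(c(NA, 1:80), n, TRUE),   # Age has missing values
  SibSp    = sample(0:3, n, TRUE),
  Parch    = sample(0:2, n, TRUE),
  Fare     = rexp(n, 1/32)
)

tab <- train[, c("Survived", "Pclass", "Age", "SibSp", "Parch", "Fare")]
tab$Sex <- as.numeric(factor(train$Sex))   # cor() needs numeric input
cmat <- cor(tab, use = "complete.obs")     # drop rows with missing Age
```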

The following figures show similar information.


[Figures: stacked bar charts of survival counts by Pclass, Fare, Sex, and Age, and a plot of Fare against Pclass]

Analysis Approach
I further separate the training data provided by Kaggle into two sets, for training and testing purposes.
Split Data

library("caret")
# split each gender separately so both halves keep the same sex ratio
train_F<-train[which(train[,"Sex"]=="female"),]
train_M<-train[which(train[,"Sex"]=="male"),]
train_index_f<-createDataPartition(1:nrow(train_F),times=1,p=.5,list=FALSE)
train_index_m<-createDataPartition(1:nrow(train_M),times=1,p=.5,list=FALSE)
trainset_f<-train_F[train_index_f,]
testset_f<-train_F[-train_index_f,]
trainset_m<-train_M[train_index_m,]
testset_m<-train_M[-train_index_m,]
trainset<-rbind(trainset_f,trainset_m)
testset<-rbind(testset_f,testset_m)
#trainset<-train[train_index,]
#testset<-train[-train_index,]
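An alternative to splitting each gender by hand is to stratify the partition directly on the outcome: `createDataPartition` samples within each class when given a factor. The sketch below uses a simulated outcome vector as a stand-in (with the real data, `y` would be `train$Survived`).

```r
library(caret)

set.seed(7)
# stand-in outcome vector; with the real data this would be train$Survived
y <- factor(sample(c("0", "1"), 100, replace = TRUE, prob = c(0.6, 0.4)))

# createDataPartition samples within each class, so the 0/1 proportions
# are approximately preserved in both halves
idx <- createDataPartition(y, p = 0.5, list = FALSE)
prop_train <- mean(y[idx] == "1")
prop_test  <- mean(y[-idx] == "1")
```

Stratifying on the outcome keeps the survival rate comparable across the two halves, which the gender-wise split above achieves only indirectly.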

Based on the information learned from the correlation table and figures above, I consider including Sex, Age,
and Pclass/Fare in the model first.
First, a quick decision tree is built on the training set to see which attributes are the key factors for
survival.
Second, a random forest is built using the attributes selected in the first step.

Initial Solution and Analysis


Single Tree
It shows that Sex is the most important factor, followed by Pclass (for females) and Fare (for males).
Building a Decision Tree

library("rpart")
library("partykit")
fol<- formula(Survived~Sex+Pclass+Fare+Age)
model_dt <- rpart(fol, method="class", data=trainset)
print(model_dt)
## n= 447 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 447 163 0 (0.63534676 0.36465324)  
##     2) Sex=male 289 56 0 (0.80622837 0.19377163)  
##       4) Fare< 15.62085 180 20 0 (0.88888889 0.11111111) *
##       5) Fare>=15.62085 109 36 0 (0.66972477 0.33027523)  
##        10) Age>=36.5 50 11 0 (0.78000000 0.22000000) *
##        11) Age< 36.5 59 25 0 (0.57627119 0.42372881)  
##          22) Pclass>=2.5 27  7 0 (0.74074074 0.25925926) *
##          23) Pclass< 2.5 32 14 1 (0.43750000 0.56250000)  
##            46) Fare>=40.2896 12  3 0 (0.75000000 0.25000000) *
##            47) Fare< 40.2896 20  5 1 (0.25000000 0.75000000) *
##     3) Sex=female 158 51 1 (0.32278481 0.67721519)  
##       6) Pclass>=2.5 82 38 0 (0.53658537 0.46341463)  
##        12) Fare>=22.90415 17  3 0 (0.82352941 0.17647059) *
##        13) Fare< 22.90415 65 30 1 (0.46153846 0.53846154)  
##          26) Fare< 15.675 49 23 0 (0.53061224 0.46938776)  
##            52) Fare>=8.0396 21  6 0 (0.71428571 0.28571429) *
##            53) Fare< 8.0396 28 11 1 (0.39285714 0.60714286)  
##             106) Age>=22.5 12  5 0 (0.58333333 0.41666667) *
##             107) Age< 22.5 16  4 1 (0.25000000 0.75000000) *
##          27) Fare>=15.675 16  4 1 (0.25000000 0.75000000) *
##       7) Pclass< 2.5 76  7 1 (0.09210526 0.90789474) *

predict_dt<-predict(model_dt,newdata=testset)
# predicted class = column (survival label) with the highest probability
accuracy_dt<-sum(testset[,'Survived']==colnames(predict_dt)[apply(predict_dt,1,which.max)])/dim(predict_dt)[1]
print(c("accuracy: ", accuracy_dt))
## [1] "accuracy: "       "0.79954954954955"

plot_dt<- as.party(model_dt)
plot(plot_dt)

[Figure: decision tree plot. Root split on Sex; the male branch splits on Fare, Age, and Pclass; the female branch splits on Pclass and Fare]
Given that Pclass and Fare are highly correlated, it makes sense to use just one of the two attributes in
the model. Therefore, I first run the model including Sex, Age, and Pclass.
Remove Fare
fol<- formula(Survived~Sex+Pclass+Age)
model_dt <- rpart(fol, method="class", data=trainset)
print(model_dt)
## n= 447 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 447 163 0 (0.63534676 0.36465324)  
##    2) Sex=male 289 56 0 (0.80622837 0.19377163)  
##      4) Age>=8.5 274 47 0 (0.82846715 0.17153285) *
##      5) Age< 8.5 15  6 1 (0.40000000 0.60000000) *
##    3) Sex=female 158 51 1 (0.32278481 0.67721519)  
##      6) Pclass>=2.5 82 38 0 (0.53658537 0.46341463)  
##       12) Age>=5.5 74 32 0 (0.56756757 0.43243243) *
##       13) Age< 5.5 8  2 1 (0.25000000 0.75000000) *
##      7) Pclass< 2.5 76  7 1 (0.09210526 0.90789474) *

predict_dt<-predict(model_dt,newdata=testset)
accuracy_dt<-sum(testset[,'Survived']==colnames(predict_dt)[apply(predict_dt,1,which.max)])/dim(predict_dt)[1]
print(c("accuracy: ", accuracy_dt))
## [1] "accuracy: "        "0.813063063063063"

Remove Pclass, Add Fare Back


fol<- formula(Survived~Sex+Fare+Age)
model_dt <- rpart(fol, method="class", data=trainset)
print(model_dt)
## n= 447 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##    1) root 447 163 0 (0.63534676 0.36465324)  
##      2) Sex=male 289 56 0 (0.80622837 0.19377163)  
##        4) Fare< 15.62085 180 20 0 (0.88888889 0.11111111) *
##        5) Fare>=15.62085 109 36 0 (0.66972477 0.33027523)  
##         10) Age>=36.5 47 10 0 (0.78723404 0.21276596) *
##         11) Age< 36.5 62 26 0 (0.58064516 0.41935484)  
##           22) Fare>=39.34375 20  4 0 (0.80000000 0.20000000) *
##           23) Fare< 39.34375 42 20 1 (0.47619048 0.52380952)  
##             46) Fare< 25.9625 14  4 0 (0.71428571 0.28571429) *
##             47) Fare>=25.9625 28 10 1 (0.35714286 0.64285714)  
##               94) Fare>=30.5979 10  4 0 (0.60000000 0.40000000) *
##               95) Fare< 30.5979 18  4 1 (0.22222222 0.77777778) *
##      3) Sex=female 158 51 1 (0.32278481 0.67721519)  
##        6) Fare< 48.2021 122 49 1 (0.40163934 0.59836066)  
##         12) Age>=5.5 109 47 1 (0.43119266 0.56880734)  
##           24) Age< 14.75 8  1 0 (0.87500000 0.12500000) *
##           25) Age>=14.75 101 40 1 (0.39603960 0.60396040)  
##             50) Fare< 10.16875 34 16 0 (0.52941176 0.47058824)  
##              100) Fare>=8.0396 7  0 0 (1.00000000 0.00000000) *
##              101) Fare< 8.0396 27 11 1 (0.40740741 0.59259259)  
##                202) Age>=22.5 12  5 0 (0.58333333 0.41666667) *
##                203) Age< 22.5 15  4 1 (0.26666667 0.73333333) *
##             51) Fare>=10.16875 67 22 1 (0.32835821 0.67164179) *
##         13) Age< 5.5 13  2 1 (0.15384615 0.84615385) *
##        7) Fare>=48.2021 36  2 1 (0.05555556 0.94444444) *

predict_dt<-predict(model_dt,newdata=testset)
accuracy_dt<-sum(testset[,'Survived']==colnames(predict_dt)[apply(predict_dt,1,which.max)])/dim(predict_dt)[1]
print(c("accuracy: ", accuracy_dt))
## [1] "accuracy: "        "0.765765765765766"

The three models suggest that including both Pclass and Fare in the model would produce higher accuracy.
Remove Age
fol<- formula(Survived~Sex+Pclass+Fare)
model_dt <- rpart(fol, method="class", data=trainset)
print(model_dt)
## n= 447 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 447 163 0 (0.63534676 0.36465324)  
##    2) Sex=male 289 56 0 (0.80622837 0.19377163) *
##    3) Sex=female 158 51 1 (0.32278481 0.67721519)  
##      6) Pclass>=2.5 82 38 0 (0.53658537 0.46341463)  
##       12) Fare>=22.90415 17  3 0 (0.82352941 0.17647059) *
##       13) Fare< 22.90415 65 30 1 (0.46153846 0.53846154)  
##         26) Fare< 15.675 49 23 0 (0.53061224 0.46938776)  
##           52) Fare>=8.0396 21  6 0 (0.71428571 0.28571429) *
##           53) Fare< 8.0396 28 11 1 (0.39285714 0.60714286) *
##         27) Fare>=15.675 16  4 1 (0.25000000 0.75000000) *
##      7) Pclass< 2.5 76  7 1 (0.09210526 0.90789474) *

predict_dt<-predict(model_dt,newdata=testset)
accuracy_dt<-sum(testset[,'Survived']==colnames(predict_dt)[apply(predict_dt,1,which.max)])/dim(predict_dt)[1]
print(c("accuracy: ", accuracy_dt))
## [1] "accuracy: "        "0.831081081081081"

Revised Solution and Analysis


Random Forest
I train a random forest model with the same attributes used in the previous section.
Build a Random Forest
Include Age


library("randomForest")
library("e1071")
fol<- formula(Survived~Sex+Age+Pclass+Fare)
model_rf <- randomForest(fol, data=trainset,na.action=na.omit)
#model_rf <- randomForest(fol, data=trainset,importance=T,keep.forest=T)
#testing
predict_rf<-predict(model_rf,newdata=testset)
survival<-which(predict_rf>.5)
predict_rf[survival]<-1
predict_rf[-survival]<-0
accuracy_rf<-(sum(predict_rf==testset[,'Survived']))/length(predict_rf)
print(c("accuracy: ",accuracy_rf))
## [1] "accuracy: "        "0.804054054054054"

#Gini coefficient
#"The higher the number, the more the gini impurity score decreases by branching on this variable, indicating the variable is more important."
importance(model_rf)
##        IncNodePurity
## Sex        14.170492
## Age         8.330348
## Pclass      5.594081
## Fare        9.542739
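The manual 0.5 threshold above is needed because `Survived` is numeric, so `randomForest` fits a regression forest. Converting the outcome to a factor makes `randomForest` fit a classification forest instead, so `predict()` returns class labels directly. A sketch on simulated stand-in data (with the real data, `trainset` built earlier would replace the simulated `df`):

```r
library(randomForest)

set.seed(1)
n <- 300
# simulated stand-in for trainset
df <- data.frame(
  Survived = factor(sample(0:1, n, replace = TRUE)),  # factor => classification
  Sex      = factor(sample(c("male", "female"), n, replace = TRUE)),
  Fare     = rexp(n, rate = 1/32)
)

model_rf <- randomForest(Survived ~ Sex + Fare, data = df)
pred <- predict(model_rf, newdata = df)   # already "0"/"1" labels
accuracy <- mean(pred == df$Survived)
```

With a factor outcome the forest also reports an out-of-bag error estimate, which avoids relying on a single held-out split.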

Remove Age
library("randomForest")
library("e1071")
fol<- formula(Survived~Sex+Pclass+Fare)
model_rf <- randomForest(fol, data=trainset,na.action=na.omit)
#model_rf <- randomForest(fol, data=trainset,importance=T,keep.forest=T)
#testing
predict_rf<-predict(model_rf,newdata=testset)
survival<-which(predict_rf>.5)
predict_rf[survival]<-1
predict_rf[-survival]<-0
accuracy_rf<-(sum(predict_rf==testset[,'Survived']))/length(predict_rf)
print(c("accuracy: ",accuracy_rf))
## [1] "accuracy: "

"0.81981981981982"

#Gini coefficient
#"The higher the number, the more the gini impurity score decreases by branching on this variable, indicating the variable is more important."
importance(model_rf)


##        IncNodePurity
## Sex        18.687389
## Pclass      5.819844
## Fare       10.649627
Interestingly, a forest is not necessarily better than a single tree.
Next, I am interested in whether adding SibSp and Parch to the model changes the accuracy and
how important each attribute is. As we can see, Sex is the most important factor for survival, followed by
Fare and Age.
Add SibSp and Parch
fol<- formula(Survived~Age+Sex+Pclass+Fare+SibSp+Parch)
model_rf <- randomForest(fol, data=trainset,na.action=na.omit)
#model_rf <- randomForest(fol, data=trainset,importance=T,keep.forest=T)
#testing
predict_rf<-predict(model_rf,newdata=testset)
survival<-which(predict_rf>.5)
predict_rf[survival]<-1
predict_rf[-survival]<-0
accuracy_rf<-(sum(predict_rf==testset[,'Survived']))/length(predict_rf)
print(c("accuracy: ",accuracy_rf))
## [1] "accuracy: "        "0.808558558558559"

#Gini coefficient
#"The higher the number, the more the gini impurity score decreases by branching on this variable, indicating the variable is more important."
importance(model_rf)
##        IncNodePurity
## Age        15.009582
## Sex        15.912053
## Pclass      6.666768
## Fare       15.543072
## SibSp       4.386713
## Parch       3.276624

Build a Random Forest for Each Gender


Male
fol<- formula(Survived~Age+Pclass+Fare+SibSp+Parch)
model_rf <- randomForest(fol, data=trainset_m,na.action=na.omit)
#model_rf <- randomForest(fol, data=trainset,importance=T,keep.forest=T)
#testing
predict_rf_m<-predict(model_rf,newdata=testset_m)
survival<-which(predict_rf_m>.5)

predict_rf_m[survival]<-1
predict_rf_m[-survival]<-0
accuracy_rf<-(sum(predict_rf_m==testset_m[,'Survived']))/length(predict_rf_m)
print(c("accuracy: ",accuracy_rf))
## [1] "accuracy: "        "0.836805555555556"

#Gini coefficient
#"The higher the number, the more the gini impurity score decreases by branching on this variable, indicating the variable is more important."
importance(model_rf)
##        IncNodePurity
## Age         5.581538
## Pclass      1.582550
## Fare        4.395235
## SibSp       1.661592
## Parch       1.041310

Female
fol<- formula(Survived~Age+Pclass+Fare+SibSp+Parch)
model_rf <- randomForest(fol, data=trainset_f,na.action=na.omit)
#model_rf <- randomForest(fol, data=trainset,importance=T,keep.forest=T)
#testing
predict_rf_f<-predict(model_rf,newdata=testset_f)
survival<-which(predict_rf_f>.5)
predict_rf_f[survival]<-1
predict_rf_f[-survival]<-0
accuracy_rf<-(sum(predict_rf_f==testset_f[,'Survived']))/length(predict_rf_f)
print(c("accuracy: ",accuracy_rf))
## [1] "accuracy: "        "0.743589743589744"

#Gini coefficient
#"The higher the number, the more the gini impurity score decreases by branching on this variable, indicating the variable is more important."
importance(model_rf)
##        IncNodePurity
## Age         3.695320
## Pclass      3.883030
## Fare        4.539020
## SibSp       1.830367
## Parch       1.657559

Combining the female and male data sets to get the overall accuracy


predict_rf_mix<-c(predict_rf_m,predict_rf_f)
truth_mix<-c(testset_m[,'Survived'],testset_f[,'Survived'])
accuracy_rf<-(sum(predict_rf_mix==truth_mix))/length(predict_rf_mix)
print(c("accuracy: ",accuracy_rf))

## [1] "accuracy: "        "0.804054054054054"

Based on the previous sections, we see that a single tree might be better than a forest. I decided to run a
single-tree model with the additional attributes SibSp and Parch.
Rebuild a Decision Tree and Add SibSp and Parch
fol<- formula(Survived~Age+Sex+Pclass+Fare+SibSp+Parch)
model_dt <- rpart(fol, method="class", data=trainset)
print(model_dt)
## n= 447 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##    1) root 447 163 0 (0.63534676 0.36465324)  
##      2) Sex=male 289 56 0 (0.80622837 0.19377163)  
##        4) Fare< 15.62085 180 20 0 (0.88888889 0.11111111) *
##        5) Fare>=15.62085 109 36 0 (0.66972477 0.33027523)  
##         10) SibSp>=2.5 13  0 0 (1.00000000 0.00000000) *
##         11) SibSp< 2.5 96 36 0 (0.62500000 0.37500000)  
##           22) Age>=9.5 89 29 0 (0.67415730 0.32584270)  
##             44) Age>=36.5 50 11 0 (0.78000000 0.22000000) *
##             45) Age< 36.5 39 18 0 (0.53846154 0.46153846)  
##               90) SibSp>=0.5 19  5 0 (0.73684211 0.26315789) *
##               91) SibSp< 0.5 20  7 1 (0.35000000 0.65000000)  
##                182) Fare>=31.75 10  4 0 (0.60000000 0.40000000) *
##                183) Fare< 31.75 10  1 1 (0.10000000 0.90000000) *
##           23) Age< 9.5 7  0 1 (0.00000000 1.00000000) *
##      3) Sex=female 158 51 1 (0.32278481 0.67721519)  
##        6) Pclass>=2.5 82 38 0 (0.53658537 0.46341463)  
##         12) Fare>=22.90415 17  3 0 (0.82352941 0.17647059) *
##         13) Fare< 22.90415 65 30 1 (0.46153846 0.53846154)  
##           26) Fare< 15.675 49 23 0 (0.53061224 0.46938776)  
##             52) Fare>=8.0396 21  6 0 (0.71428571 0.28571429) *
##             53) Fare< 8.0396 28 11 1 (0.39285714 0.60714286)  
##              106) Age>=22.5 12  5 0 (0.58333333 0.41666667) *
##              107) Age< 22.5 16  4 1 (0.25000000 0.75000000) *
##           27) Fare>=15.675 16  4 1 (0.25000000 0.75000000) *
##        7) Pclass< 2.5 76  7 1 (0.09210526 0.90789474) *

predict_dt<-predict(model_dt,newdata=testset)
accuracy_dt<-sum(testset[,'Survived']==colnames(predict_dt)[apply(predict_dt,1,which.max)])/dim(predict_dt)[1]
print(c("accuracy: ",accuracy_dt))
## [1] "accuracy: "       "0.81981981981982"


plot_dt<- as.party(model_dt)
plot(plot_dt)

[Figure: decision tree plot. Root split on Sex; the male branch splits on Fare, SibSp, and Age; the female branch splits on Pclass, Fare, and Age]

Rebuild Decision Trees for Each Gender


Male
fol<- formula(Survived~Age+Pclass+Fare+SibSp+Parch)
model_dt <- rpart(fol, method="class", data=trainset_m)
print(model_dt)
## n= 289 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##    1) root 289 56 0 (0.80622837 0.19377163)  
##      2) Fare< 15.62085 180 20 0 (0.88888889 0.11111111)  
##        4) Age>=17.5 173 16 0 (0.90751445 0.09248555) *
##        5) Age< 17.5 7 3 1 (0.42857143 0.57142857) *
##      3) Fare>=15.62085 109 36 0 (0.66972477 0.33027523)  
##        6) SibSp>=2.5 13 0 0 (1.00000000 0.00000000) *
##        7) SibSp< 2.5 96 36 0 (0.62500000 0.37500000)  
##         14) Age>=9.5 89 29 0 (0.67415730 0.32584270)  
##           28) Age>=36.5 50 11 0 (0.78000000 0.22000000) *
##           29) Age< 36.5 39 18 0 (0.53846154 0.46153846)  
##             58) SibSp>=0.5 19 5 0 (0.73684211 0.26315789) *
##             59) SibSp< 0.5 20 7 1 (0.35000000 0.65000000)  
##              118) Fare>=31.75 10 4 0 (0.60000000 0.40000000) *
##              119) Fare< 31.75 10 1 1 (0.10000000 0.90000000) *
##         15) Age< 9.5 7 0 1 (0.00000000 1.00000000) *

predict_dt_m<-predict(model_dt,newdata=testset_m)
accuracy_dt<-sum(testset_m[,'Survived']==colnames(predict_dt_m)[apply(predict_dt_m,1,which.max)])/dim(predict_dt_m)[1]
print(c("accuracy: ",accuracy_dt))
## [1] "accuracy: "        "0.788194444444444"

plot_dt<- as.party(model_dt)
plot(plot_dt)

[Figure: male-only decision tree plot with splits on Fare, Age, and SibSp]

Female
fol<- formula(Survived~Pclass+Fare+SibSp+Parch)
model_dt <- rpart(fol, method="class", data=trainset_f)
print(model_dt)

## n= 158 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 158 51 1 (0.32278481 0.67721519)  
##    2) Pclass>=2.5 82 38 0 (0.53658537 0.46341463)  
##      4) Fare>=22.90415 17 3 0 (0.82352941 0.17647059) *
##      5) Fare< 22.90415 65 30 1 (0.46153846 0.53846154)  
##       10) Fare< 15.675 49 23 0 (0.53061224 0.46938776)  
##         20) Fare>=8.0396 21 6 0 (0.71428571 0.28571429)  
##           40) Parch< 0.5 12 1 0 (0.91666667 0.08333333) *
##           41) Parch>=0.5 9 4 1 (0.44444444 0.55555556) *
##         21) Fare< 8.0396 28 11 1 (0.39285714 0.60714286) *
##       11) Fare>=15.675 16 4 1 (0.25000000 0.75000000) *
##    3) Pclass< 2.5 76 7 1 (0.09210526 0.90789474) *

predict_dt_f<-predict(model_dt,newdata=testset_f)
accuracy_dt<-sum(testset_f[,'Survived']==colnames(predict_dt_f)[apply(predict_dt_f,1,which.max)])/dim(predict_dt_f)[1]
print(c("accuracy: ",accuracy_dt))
## [1] "accuracy: "        "0.846153846153846"

plot_dt<- as.party(model_dt)
plot(plot_dt)

[Figure: female-only decision tree plot with splits on Pclass, Fare, and Parch]

Combining the female and male data sets to get the overall accuracy

predict_dt_mix<-c(colnames(predict_dt_m)[apply(predict_dt_m,1,which.max)],colnames(predict_dt_f)[apply(predict_dt_f,1,which.max)])
truth_mix<-c(testset_m[,'Survived'],testset_f[,'Survived'])
accuracy_dt<-(sum(predict_dt_mix==truth_mix))/length(predict_dt_mix)
print(c("accuracy: ",accuracy_dt))
## [1] "accuracy: "        "0.808558558558559"

Build a Random Forest for Each Gender with a Different Formula


Male
fol<- formula(Survived~Age+Fare)
model_rf <- randomForest(fol, data=trainset_m,na.action=na.omit)
#model_rf <- randomForest(fol, data=trainset,importance=T,keep.forest=T)
#testing
predict_rf_m<-predict(model_rf,newdata=testset_m)
survival<-which(predict_rf_m>.5)
predict_rf_m[survival]<-1
predict_rf_m[-survival]<-0
accuracy_rf<-(sum(predict_rf_m==testset_m[,'Survived']))/length(predict_rf_m)
print(c("accuracy: ",accuracy_rf))
## [1] "accuracy: "        "0.819444444444444"

#Gini coefficient
#"The higher the number, the more the gini impurity score decreases by branching on this variable, indicating the variable is more important."
importance(model_rf)
##      IncNodePurity
## Age       13.90931
## Fare      13.90745
Female
fol<- formula(Survived~Age+Pclass+Fare)
model_rf <- randomForest(fol, data=trainset_f,na.action=na.omit)
#model_rf <- randomForest(fol, data=trainset,importance=T,keep.forest=T)
#testing
predict_rf_f<-predict(model_rf,newdata=testset_f)
survival<-which(predict_rf_f>.5)
predict_rf_f[survival]<-1
predict_rf_f[-survival]<-0
accuracy_rf<-(sum(predict_rf_f==testset_f[,'Survived']))/length(predict_rf_f)
print(c("accuracy: ",accuracy_rf))
## [1] "accuracy: " "0.75"


#Gini coefficient
#"The higher the number, the more the gini impurity score decreases by branching on this variable, indicating the variable is more important."
importance(model_rf)
##        IncNodePurity
## Age         5.090393
## Pclass      5.326668
## Fare        5.784793
Combining the female and male data sets to get the overall accuracy
predict_rf_mix<-c(predict_rf_m,predict_rf_f)
truth_mix<-c(testset_m[,'Survived'],testset_f[,'Survived'])
accuracy_rf<-(sum(predict_rf_mix==truth_mix))/length(predict_rf_mix)
print(c("accuracy: ",accuracy_rf))
## [1] "accuracy: "        "0.795045045045045"

Summary
A simpler model might work better; one concern may be over-fitting.
Notably, the key factors for predicting survival differ by gender. For males, Fare and SibSp are the
two main factors in predicting survival; for females, Pclass and Fare are the two main factors.
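The over-fitting concern above could be probed with k-fold cross-validation instead of a single 50/50 split. A sketch using `caret::train` on simulated stand-in data (with the real data, the `trainset` built earlier would replace the simulation):

```r
library(caret)
library(rpart)

set.seed(42)
n <- 400
# simulated stand-in for the real Titanic trainset
trainset <- data.frame(
  Survived = factor(sample(0:1, n, replace = TRUE)),
  Sex      = factor(sample(c("male", "female"), n, replace = TRUE)),
  Pclass   = sample(1:3, n, replace = TRUE),
  Fare     = round(rexp(n, rate = 1/32), 2)
)

# 5-fold CV averages accuracy over held-out folds, a more stable
# estimate than a single train/test split
ctrl <- trainControl(method = "cv", number = 5)
model_cv <- train(Survived ~ Sex + Pclass + Fare, data = trainset,
                  method = "rpart", trControl = ctrl)
cv_acc <- max(model_cv$results$Accuracy)
```

Comparing the cross-validated accuracy against the single-split accuracies reported above would indicate how much of the tree-versus-forest gap is just split noise.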
