
Kaggle Titanic

Chia
Wednesday, December 16, 2015
Problem Description
Competition Task: Titanic
The task is to predict whether a passenger survived the sinking of the Titanic, given the passenger's attributes in the
archival data, e.g., age, sex, number of siblings or spouses aboard, number of parents or children aboard,
passenger class, ticket fare paid, etc. (https://www.kaggle.com/c/titanic)
Evaluation:
The accuracy of the prediction model is calculated by comparing the predicted results with the ground truth
in the test data set.
Data:
Kaggle has separated the data into train.csv and test.csv. Here, a summary and some simple analyses are
run to get a brief idea of the data (training set).
Note: (Source: Kaggle.com)
survival Survival (0 = No; 1 = Yes)
pclass Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
name Name
sex Sex
age Age
sibsp Number of Siblings/Spouses Aboard
parch Number of Parents/Children Aboard
ticket Ticket Number
fare Passenger Fare
cabin Cabin
embarked Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
Training Data Summary

##   PassengerId       Survived          Pclass     
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000  
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000  
##  Median :446.0   Median :0.0000   Median :3.000  
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309  
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000  
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000  
##                                    Name         Sex           Age       
##  Abbing, Mr. Anthony                  :  1   female:314   Min.   : 0.42  
##  Abbott, Mr. Rossmore Edward          :  1   male  :577   1st Qu.:20.12  
##  Abbott, Mrs. Stanton (Rosa Hunt)     :  1                Median :28.00  
##  Abelson, Mr. Samuel                  :  1                Mean   :29.70  
##  Abelson, Mrs. Samuel (Hannah Wizosky):  1                3rd Qu.:38.00  
##  Adahl, Mr. Mauritz Nils Martin       :  1                Max.   :80.00  
##  (Other)                              :885                NA's   :177    
##      SibSp           Parch            Ticket         Fare       
##  Min.   :0.000   Min.   :0.0000   1601    :  7   Min.   :  0.00  
##  1st Qu.:0.000   1st Qu.:0.0000   347082  :  7   1st Qu.:  7.91  
##  Median :0.000   Median :0.0000   CA. 2343:  7   Median : 14.45  
##  Mean   :0.523   Mean   :0.3816   3101295 :  6   Mean   : 32.20  
##  3rd Qu.:1.000   3rd Qu.:0.0000   347088  :  6   3rd Qu.: 31.00  
##  Max.   :8.000   Max.   :6.0000   CA 2144 :  6   Max.   :512.33  
##                                   (Other) :852                   
##          Cabin     Embarked
##             :687    :  2   
##  B96 B98    :  4   C:168   
##  C23 C25 C27:  4   Q: 77   
##  G6         :  4   S:644   
##  C22 C26    :  3           
##  D          :  3           
##  (Other)    :186           
First Look

Some attributes (i.e., Name, PassengerId, Ticket, Embarked, and Cabin) seem either irrelevant to a
passenger's survival or too sparse, so they are not included in the following analyses.
The correlation table suggests that survival is relatively highly correlated with Pclass and
Sex. The correlation between Age and survival is not yet clear. Not surprisingly, Pclass is highly correlated
with Fare.
##               Survived         Sex      Pclass         Age       SibSp       Parch        Fare
## Survived   1.00000000 -0.53882559 -0.35965268 -0.07722109 -0.01735836  0.09331701  0.26818862
## Sex       -0.53882559  1.00000000  0.15546030  0.09325358 -0.10394968 -0.24697204 -0.18499425
## Pclass    -0.35965268  0.15546030  1.00000000 -0.36922602  0.06724737  0.02568307 -0.55418247
## Age       -0.07722109  0.09325358 -0.36922602  1.00000000 -0.30824676 -0.18911926  0.09606669
## SibSp     -0.01735836 -0.10394968  0.06724737 -0.30824676  1.00000000  0.38381986  0.13832879
## Parch      0.09331701 -0.24697204  0.02568307 -0.18911926  0.38381986  1.00000000  0.20511888
## Fare       0.26818862 -0.18499425 -0.55418247  0.09606669  0.13832879  0.20511888  1.00000000
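A correlation table like the one above can be produced by recoding the factor columns numerically before calling `cor()`. The sketch below uses a small simulated stand-in for the Titanic training frame (with the real data, the simulation block would be replaced by the loaded `train` data frame); the `female=1, male=2` coding via factor codes is an assumption.

```r
set.seed(1)
n <- 100
# simulated stand-in for the Titanic training frame
train <- data.frame(
  Survived = sample(0:1, n, TRUE),
  Sex      = sample(c("female", "male"), n, TRUE),
  Pclass   = sample(1:3, n, TRUE),
  Age      = sample(c(NA, 1:80), n, TRUE),   # Age has missing values
  SibSp    = sample(0:3, n, TRUE),
  Parch    = sample(0:2, n, TRUE),
  Fare     = rexp(n, 1/32)
)

tab <- train[, c("Survived", "Pclass", "Age", "SibSp", "Parch", "Fare")]
tab$Sex <- as.numeric(factor(train$Sex))   # cor() needs numeric input
cmat <- cor(tab, use = "complete.obs")     # drop rows with missing Age
```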

The following figures show similar information.


[Figures: stacked bar charts of survival counts by Pclass, Fare, Sex, and Age, and a plot of Fare against Pclass]

Analysis Approach
I further separate the training data provided by Kaggle into two sets, for training and testing purposes.
Split Data

library("caret")
# split each gender separately so both halves keep the same sex ratio
train_F<-train[which(train[,"Sex"]=="female"),]
train_M<-train[which(train[,"Sex"]=="male"),]
train_index_f<-createDataPartition(1:nrow(train_F),times=1,p=.5,list=FALSE)
train_index_m<-createDataPartition(1:nrow(train_M),times=1,p=.5,list=FALSE)
trainset_f<-train_F[train_index_f,]
testset_f<-train_F[-train_index_f,]
trainset_m<-train_M[train_index_m,]
testset_m<-train_M[-train_index_m,]
trainset<-rbind(trainset_f,trainset_m)
testset<-rbind(testset_f,testset_m)
#trainset<-train[train_index,]
#testset<-train[-train_index,]
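An alternative to splitting each gender by hand is to stratify the partition directly on the outcome: `createDataPartition` samples within each class when given a factor. The sketch below uses a simulated outcome vector as a stand-in (with the real data, `y` would be `train$Survived`).

```r
library(caret)

set.seed(7)
# stand-in outcome vector; with the real data this would be train$Survived
y <- factor(sample(c("0", "1"), 100, replace = TRUE, prob = c(0.6, 0.4)))

# createDataPartition samples within each class, so the 0/1 proportions
# are approximately preserved in both halves
idx <- createDataPartition(y, p = 0.5, list = FALSE)
prop_train <- mean(y[idx] == "1")
prop_test  <- mean(y[-idx] == "1")
```

Stratifying on the outcome keeps the survival rate comparable across the two halves, which the gender-wise split above achieves only indirectly.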

Based on the information learned from the correlation table and figures above, I consider including Sex, Age,
and Pclass/Fare in the model first.
First, a quick decision tree is built on the training set to see which attributes are the key factors for
survival.
Second, a random forest is built using the attributes selected in the first step.

Initial Solution and Analysis


Single Tree
It shows that Sex is the most important factor, followed by Pclass (for females) and Fare (for males).
Building a Decision Tree

library("rpart")
library("partykit")
fol<- formula(Survived~Sex+Pclass+Fare+Age)
model_dt <- rpart(fol, method="class", data=trainset)
print(model_dt)
## n= 447 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 447 163 0 (0.63534676 0.36465324)  
##     2) Sex=male 289 56 0 (0.80622837 0.19377163)  
##       4) Fare< 15.62085 180 20 0 (0.88888889 0.11111111) *
##       5) Fare>=15.62085 109 36 0 (0.66972477 0.33027523)  
##        10) Age>=36.5 50 11 0 (0.78000000 0.22000000) *
##        11) Age< 36.5 59 25 0 (0.57627119 0.42372881)  
##          22) Pclass>=2.5 27  7 0 (0.74074074 0.25925926) *
##          23) Pclass< 2.5 32 14 1 (0.43750000 0.56250000)  
##            46) Fare>=40.2896 12  3 0 (0.75000000 0.25000000) *
##            47) Fare< 40.2896 20  5 1 (0.25000000 0.75000000) *
##     3) Sex=female 158 51 1 (0.32278481 0.67721519)  
##       6) Pclass>=2.5 82 38 0 (0.53658537 0.46341463)  
##        12) Fare>=22.90415 17  3 0 (0.82352941 0.17647059) *
##        13) Fare< 22.90415 65 30 1 (0.46153846 0.53846154)  
##          26) Fare< 15.675 49 23 0 (0.53061224 0.46938776)  
##            52) Fare>=8.0396 21  6 0 (0.71428571 0.28571429) *
##            53) Fare< 8.0396 28 11 1 (0.39285714 0.60714286)  
##             106) Age>=22.5 12  5 0 (0.58333333 0.41666667) *
##             107) Age< 22.5 16  4 1 (0.25000000 0.75000000) *
##          27) Fare>=15.675 16  4 1 (0.25000000 0.75000000) *
##       7) Pclass< 2.5 76  7 1 (0.09210526 0.90789474) *

predict_dt<-predict(model_dt,newdata=testset)
# predicted class = column (survival label) with the highest probability
accuracy_dt<-sum(testset[,'Survived']==colnames(predict_dt)[apply(predict_dt,1,which.max)])/dim(predict_dt)[1]
print(c("accuracy: ", accuracy_dt))
## [1] "accuracy: "       "0.79954954954955"

plot_dt<- as.party(model_dt)
plot(plot_dt)

[Figure: decision tree plot. Root split on Sex; the male branch splits on Fare, Age, and Pclass; the female branch splits on Pclass and Fare]
Given that Pclass and Fare are highly correlated, it makes sense to use just one of the two attributes in
the model. Therefore, I first run the model including Sex, Age, and Pclass.
Remove Fare
fol<- formula(Survived~Sex+Pclass+Age)
model_dt <- rpart(fol, method="class", data=trainset)
print(model_dt)
## n= 447 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 447 163 0 (0.63534676 0.36465324)  
##    2) Sex=male 289 56 0 (0.80622837 0.19377163)  
##      4) Age>=8.5 274 47 0 (0.82846715 0.17153285) *
##      5) Age< 8.5 15  6 1 (0.40000000 0.60000000) *
##    3) Sex=female 158 51 1 (0.32278481 0.67721519)  
##      6) Pclass>=2.5 82 38 0 (0.53658537 0.46341463)  
##       12) Age>=5.5 74 32 0 (0.56756757 0.43243243) *
##       13) Age< 5.5 8  2 1 (0.25000000 0.75000000) *
##      7) Pclass< 2.5 76  7 1 (0.09210526 0.90789474) *

predict_dt<-predict(model_dt,newdata=testset)
accuracy_dt<-sum(testset[,'Survived']==colnames(predict_dt)[apply(predict_dt,1,which.max)])/dim(predict_dt)[1]
print(c("accuracy: ", accuracy_dt))
## [1] "accuracy: "        "0.813063063063063"

Remove Pclass, Add Fare Back


fol<- formula(Survived~Sex+Fare+Age)
model_dt <- rpart(fol, method="class", data=trainset)
print(model_dt)
## n= 447 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##    1) root 447 163 0 (0.63534676 0.36465324)  
##      2) Sex=male 289 56 0 (0.80622837 0.19377163)  
##        4) Fare< 15.62085 180 20 0 (0.88888889 0.11111111) *
##        5) Fare>=15.62085 109 36 0 (0.66972477 0.33027523)  
##         10) Age>=36.5 47 10 0 (0.78723404 0.21276596) *
##         11) Age< 36.5 62 26 0 (0.58064516 0.41935484)  
##           22) Fare>=39.34375 20  4 0 (0.80000000 0.20000000) *
##           23) Fare< 39.34375 42 20 1 (0.47619048 0.52380952)  
##             46) Fare< 25.9625 14  4 0 (0.71428571 0.28571429) *
##             47) Fare>=25.9625 28 10 1 (0.35714286 0.64285714)  
##               94) Fare>=30.5979 10  4 0 (0.60000000 0.40000000) *
##               95) Fare< 30.5979 18  4 1 (0.22222222 0.77777778) *
##      3) Sex=female 158 51 1 (0.32278481 0.67721519)  
##        6) Fare< 48.2021 122 49 1 (0.40163934 0.59836066)  
##         12) Age>=5.5 109 47 1 (0.43119266 0.56880734)  
##           24) Age< 14.75 8  1 0 (0.87500000 0.12500000) *
##           25) Age>=14.75 101 40 1 (0.39603960 0.60396040)  
##             50) Fare< 10.16875 34 16 0 (0.52941176 0.47058824)  
##              100) Fare>=8.0396 7  0 0 (1.00000000 0.00000000) *
##              101) Fare< 8.0396 27 11 1 (0.40740741 0.59259259)  
##                202) Age>=22.5 12  5 0 (0.58333333 0.41666667) *
##                203) Age< 22.5 15  4 1 (0.26666667 0.73333333) *
##             51) Fare>=10.16875 67 22 1 (0.32835821 0.67164179) *
##         13) Age< 5.5 13  2 1 (0.15384615 0.84615385) *
##        7) Fare>=48.2021 36  2 1 (0.05555556 0.94444444) *

predict_dt<-predict(model_dt,newdata=testset)
accuracy_dt<-sum(testset[,'Survived']==colnames(predict_dt)[apply(predict_dt,1,which.max)])/dim(predict_dt)[1]
print(c("accuracy: ", accuracy_dt))
## [1] "accuracy: "        "0.765765765765766"

The three models suggest that including both Pclass and Fare in the model would produce higher accuracy.
Remove Age
fol<- formula(Survived~Sex+Pclass+Fare)
model_dt <- rpart(fol, method="class", data=trainset)
print(model_dt)
## n= 447 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 447 163 0 (0.63534676 0.36465324)  
##    2) Sex=male 289 56 0 (0.80622837 0.19377163) *
##    3) Sex=female 158 51 1 (0.32278481 0.67721519)  
##      6) Pclass>=2.5 82 38 0 (0.53658537 0.46341463)  
##       12) Fare>=22.90415 17  3 0 (0.82352941 0.17647059) *
##       13) Fare< 22.90415 65 30 1 (0.46153846 0.53846154)  
##         26) Fare< 15.675 49 23 0 (0.53061224 0.46938776)  
##           52) Fare>=8.0396 21  6 0 (0.71428571 0.28571429) *
##           53) Fare< 8.0396 28 11 1 (0.39285714 0.60714286) *
##         27) Fare>=15.675 16  4 1 (0.25000000 0.75000000) *
##      7) Pclass< 2.5 76  7 1 (0.09210526 0.90789474) *

predict_dt<-predict(model_dt,newdata=testset)
accuracy_dt<-sum(testset[,'Survived']==colnames(predict_dt)[apply(predict_dt,1,which.max)])/dim(predict_dt)[1]
print(c("accuracy: ", accuracy_dt))
## [1] "accuracy: "        "0.831081081081081"

Revised Solution and Analysis


Random Forest
I train a random forest model with the same attributes used in the previous section.
Build a Random Forest
Include Age


library("randomForest")
library("e1071")
fol<- formula(Survived~Sex+Age+Pclass+Fare)
model_rf <- randomForest(fol, data=trainset,na.action=na.omit)
#model_rf <- randomForest(fol, data=trainset,importance=T,keep.forest=T)
#testing
predict_rf<-predict(model_rf,newdata=testset)
survival<-which(predict_rf>.5)
predict_rf[survival]<-1
predict_rf[-survival]<-0
accuracy_rf<-(sum(predict_rf==testset[,'Survived']))/length(predict_rf)
print(c("accuracy: ",accuracy_rf))
## [1] "accuracy: "        "0.804054054054054"

#Gini coefficient
#"The higher the number, the more the gini impurity score decreases by branching on this variable, indicating the variable is more important."
importance(model_rf)
##        IncNodePurity
## Sex        14.170492
## Age         8.330348
## Pclass      5.594081
## Fare        9.542739
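The manual 0.5 threshold above is needed because `Survived` is numeric, so `randomForest` fits a regression forest. Converting the outcome to a factor makes `randomForest` fit a classification forest instead, so `predict()` returns class labels directly. A sketch on simulated stand-in data (with the real data, `trainset` built earlier would replace the simulated `df`):

```r
library(randomForest)

set.seed(1)
n <- 300
# simulated stand-in for trainset
df <- data.frame(
  Survived = factor(sample(0:1, n, replace = TRUE)),  # factor => classification
  Sex      = factor(sample(c("male", "female"), n, replace = TRUE)),
  Fare     = rexp(n, rate = 1/32)
)

model_rf <- randomForest(Survived ~ Sex + Fare, data = df)
pred <- predict(model_rf, newdata = df)   # already "0"/"1" labels
accuracy <- mean(pred == df$Survived)
```

With a factor outcome the forest also reports an out-of-bag error estimate, which avoids relying on a single held-out split.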

Remove Age
library("randomForest")
library("e1071")
fol<- formula(Survived~Sex+Pclass+Fare)
model_rf <- randomForest(fol, data=trainset,na.action=na.omit)
#model_rf <- randomForest(fol, data=trainset,importance=T,keep.forest=T)
#testing
predict_rf<-predict(model_rf,newdata=testset)
survival<-which(predict_rf>.5)
predict_rf[survival]<-1
predict_rf[-survival]<-0
accuracy_rf<-(sum(predict_rf==testset[,'Survived']))/length(predict_rf)
print(c("accuracy: ",accuracy_rf))
## [1] "accuracy: "

"0.81981981981982"

#Gini coefficient
#"The higher the number, the more the gini impurity score decreases by branching on this variable, indicating the variable is more important."
importance(model_rf)


##        IncNodePurity
## Sex        18.687389
## Pclass      5.819844
## Fare       10.649627
Interestingly, a forest is not necessarily better than a single tree.
Next, I am interested in whether adding SibSp and Parch to the model changes the accuracy and
how important each attribute is. As we can see, Sex is the most important factor for survival, followed by
Fare and Age.
Add SibSp and Parch
fol<- formula(Survived~Age+Sex+Pclass+Fare+SibSp+Parch)
model_rf <- randomForest(fol, data=trainset,na.action=na.omit)
#model_rf <- randomForest(fol, data=trainset,importance=T,keep.forest=T)
#testing
predict_rf<-predict(model_rf,newdata=testset)
survival<-which(predict_rf>.5)
predict_rf[survival]<-1
predict_rf[-survival]<-0
accuracy_rf<-(sum(predict_rf==testset[,'Survived']))/length(predict_rf)
print(c("accuracy: ",accuracy_rf))
## [1] "accuracy: "        "0.808558558558559"

#Gini coefficient
#"The higher the number, the more the gini impurity score decreases by branching on this variable, indicating the variable is more important."
importance(model_rf)
##        IncNodePurity
## Age        15.009582
## Sex        15.912053
## Pclass      6.666768
## Fare       15.543072
## SibSp       4.386713
## Parch       3.276624

Build a Random Forest for Each Gender


Male
fol<- formula(Survived~Age+Pclass+Fare+SibSp+Parch)
model_rf <- randomForest(fol, data=trainset_m,na.action=na.omit)
#model_rf <- randomForest(fol, data=trainset,importance=T,keep.forest=T)
#testing
predict_rf_m<-predict(model_rf,newdata=testset_m)
survival<-which(predict_rf_m>.5)

predict_rf_m[survival]<-1
predict_rf_m[-survival]<-0
accuracy_rf<-(sum(predict_rf_m==testset_m[,'Survived']))/length(predict_rf_m)
print(c("accuracy: ",accuracy_rf))
## [1] "accuracy: "        "0.836805555555556"

#Gini coefficient
#"The higher the number, the more the gini impurity score decreases by branching on this variable, indicating the variable is more important."
importance(model_rf)
##        IncNodePurity
## Age         5.581538
## Pclass      1.582550
## Fare        4.395235
## SibSp       1.661592
## Parch       1.041310

Female
fol<- formula(Survived~Age+Pclass+Fare+SibSp+Parch)
model_rf <- randomForest(fol, data=trainset_f,na.action=na.omit)
#model_rf <- randomForest(fol, data=trainset,importance=T,keep.forest=T)
#testing
predict_rf_f<-predict(model_rf,newdata=testset_f)
survival<-which(predict_rf_f>.5)
predict_rf_f[survival]<-1
predict_rf_f[-survival]<-0
accuracy_rf<-(sum(predict_rf_f==testset_f[,'Survived']))/length(predict_rf_f)
print(c("accuracy: ",accuracy_rf))
## [1] "accuracy: "        "0.743589743589744"

#Gini coefficient
#"The higher the number, the more the gini impurity score decreases by branching on this variable, indicating the variable is more important."
importance(model_rf)
##        IncNodePurity
## Age         3.695320
## Pclass      3.883030
## Fare        4.539020
## SibSp       1.830367
## Parch       1.657559

Combining the female and male data sets to get the overall accuracy


predict_rf_mix<-c(predict_rf_m,predict_rf_f)
truth_mix<-c(testset_m[,'Survived'],testset_f[,'Survived'])
accuracy_rf<-(sum(predict_rf_mix==truth_mix))/length(predict_rf_mix)
print(c("accuracy: ",accuracy_rf))

## [1] "accuracy: "        "0.804054054054054"

Based on the previous sections, we see that a single tree might be better than a forest. I decided to run a
single-tree model with the additional attributes SibSp and Parch.
Rebuild a Decision Tree and Add SibSp and Parch
fol<- formula(Survived~Age+Sex+Pclass+Fare+SibSp+Parch)
model_dt <- rpart(fol, method="class", data=trainset)
print(model_dt)
## n= 447 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##    1) root 447 163 0 (0.63534676 0.36465324)  
##      2) Sex=male 289 56 0 (0.80622837 0.19377163)  
##        4) Fare< 15.62085 180 20 0 (0.88888889 0.11111111) *
##        5) Fare>=15.62085 109 36 0 (0.66972477 0.33027523)  
##         10) SibSp>=2.5 13  0 0 (1.00000000 0.00000000) *
##         11) SibSp< 2.5 96 36 0 (0.62500000 0.37500000)  
##           22) Age>=9.5 89 29 0 (0.67415730 0.32584270)  
##             44) Age>=36.5 50 11 0 (0.78000000 0.22000000) *
##             45) Age< 36.5 39 18 0 (0.53846154 0.46153846)  
##               90) SibSp>=0.5 19  5 0 (0.73684211 0.26315789) *
##               91) SibSp< 0.5 20  7 1 (0.35000000 0.65000000)  
##                182) Fare>=31.75 10  4 0 (0.60000000 0.40000000) *
##                183) Fare< 31.75 10  1 1 (0.10000000 0.90000000) *
##           23) Age< 9.5 7  0 1 (0.00000000 1.00000000) *
##      3) Sex=female 158 51 1 (0.32278481 0.67721519)  
##        6) Pclass>=2.5 82 38 0 (0.53658537 0.46341463)  
##         12) Fare>=22.90415 17  3 0 (0.82352941 0.17647059) *
##         13) Fare< 22.90415 65 30 1 (0.46153846 0.53846154)  
##           26) Fare< 15.675 49 23 0 (0.53061224 0.46938776)  
##             52) Fare>=8.0396 21  6 0 (0.71428571 0.28571429) *
##             53) Fare< 8.0396 28 11 1 (0.39285714 0.60714286)  
##              106) Age>=22.5 12  5 0 (0.58333333 0.41666667) *
##              107) Age< 22.5 16  4 1 (0.25000000 0.75000000) *
##           27) Fare>=15.675 16  4 1 (0.25000000 0.75000000) *
##        7) Pclass< 2.5 76  7 1 (0.09210526 0.90789474) *

predict_dt<-predict(model_dt,newdata=testset)
accuracy_dt<-sum(testset[,'Survived']==colnames(predict_dt)[apply(predict_dt,1,which.max)])/dim(predict_dt)[1]
print(c("accuracy: ",accuracy_dt))
## [1] "accuracy: "       "0.81981981981982"


plot_dt<- as.party(model_dt)
plot(plot_dt)

[Figure: decision tree plot. Root split on Sex; the male branch splits on Fare, SibSp, and Age; the female branch splits on Pclass, Fare, and Age]

Rebuild Decision Trees for Each Gender


Male
fol<- formula(Survived~Age+Pclass+Fare+SibSp+Parch)
model_dt <- rpart(fol, method="class", data=trainset_m)
print(model_dt)
## n= 289 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##    1) root 289 56 0 (0.80622837 0.19377163)  
##      2) Fare< 15.62085 180 20 0 (0.88888889 0.11111111)  
##        4) Age>=17.5 173 16 0 (0.90751445 0.09248555) *
##        5) Age< 17.5 7 3 1 (0.42857143 0.57142857) *
##      3) Fare>=15.62085 109 36 0 (0.66972477 0.33027523)  
##        6) SibSp>=2.5 13 0 0 (1.00000000 0.00000000) *
##        7) SibSp< 2.5 96 36 0 (0.62500000 0.37500000)  
##         14) Age>=9.5 89 29 0 (0.67415730 0.32584270)  
##           28) Age>=36.5 50 11 0 (0.78000000 0.22000000) *
##           29) Age< 36.5 39 18 0 (0.53846154 0.46153846)  
##             58) SibSp>=0.5 19 5 0 (0.73684211 0.26315789) *
##             59) SibSp< 0.5 20 7 1 (0.35000000 0.65000000)  
##              118) Fare>=31.75 10 4 0 (0.60000000 0.40000000) *
##              119) Fare< 31.75 10 1 1 (0.10000000 0.90000000) *
##         15) Age< 9.5 7 0 1 (0.00000000 1.00000000) *

predict_dt_m<-predict(model_dt,newdata=testset_m)
accuracy_dt<-sum(testset_m[,'Survived']==colnames(predict_dt_m)[apply(predict_dt_m,1,which.max)])/dim(predict_dt_m)[1]
print(c("accuracy: ",accuracy_dt))
## [1] "accuracy: "        "0.788194444444444"

plot_dt<- as.party(model_dt)
plot(plot_dt)

[Figure: male-only decision tree plot with splits on Fare, Age, and SibSp]

Female
fol<- formula(Survived~Pclass+Fare+SibSp+Parch)
model_dt <- rpart(fol, method="class", data=trainset_f)
print(model_dt)

## n= 158 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 158 51 1 (0.32278481 0.67721519)  
##    2) Pclass>=2.5 82 38 0 (0.53658537 0.46341463)  
##      4) Fare>=22.90415 17 3 0 (0.82352941 0.17647059) *
##      5) Fare< 22.90415 65 30 1 (0.46153846 0.53846154)  
##       10) Fare< 15.675 49 23 0 (0.53061224 0.46938776)  
##         20) Fare>=8.0396 21 6 0 (0.71428571 0.28571429)  
##           40) Parch< 0.5 12 1 0 (0.91666667 0.08333333) *
##           41) Parch>=0.5 9 4 1 (0.44444444 0.55555556) *
##         21) Fare< 8.0396 28 11 1 (0.39285714 0.60714286) *
##       11) Fare>=15.675 16 4 1 (0.25000000 0.75000000) *
##    3) Pclass< 2.5 76 7 1 (0.09210526 0.90789474) *

predict_dt_f<-predict(model_dt,newdata=testset_f)
accuracy_dt<-sum(testset_f[,'Survived']==colnames(predict_dt_f)[apply(predict_dt_f,1,which.max)])/dim(predict_dt_f)[1]
print(c("accuracy: ",accuracy_dt))
## [1] "accuracy: "        "0.846153846153846"

plot_dt<- as.party(model_dt)
plot(plot_dt)

[Figure: female-only decision tree plot with splits on Pclass, Fare, and Parch]

Combining the female and male data sets to get the overall accuracy

predict_dt_mix<-c(colnames(predict_dt_m)[apply(predict_dt_m,1,which.max)],colnames(predict_dt_f)[apply(predict_dt_f,1,which.max)])
truth_mix<-c(testset_m[,'Survived'],testset_f[,'Survived'])
accuracy_dt<-(sum(predict_dt_mix==truth_mix))/length(predict_dt_mix)
print(c("accuracy: ",accuracy_dt))
## [1] "accuracy: "        "0.808558558558559"

Build a Random Forest for Each Gender with a Different Formula


Male
fol<- formula(Survived~Age+Fare)
model_rf <- randomForest(fol, data=trainset_m,na.action=na.omit)
#model_rf <- randomForest(fol, data=trainset,importance=T,keep.forest=T)
#testing
predict_rf_m<-predict(model_rf,newdata=testset_m)
survival<-which(predict_rf_m>.5)
predict_rf_m[survival]<-1
predict_rf_m[-survival]<-0
accuracy_rf<-(sum(predict_rf_m==testset_m[,'Survived']))/length(predict_rf_m)
print(c("accuracy: ",accuracy_rf))
## [1] "accuracy: "        "0.819444444444444"

#Gini coefficient
#"The higher the number, the more the gini impurity score decreases by branching on this variable, indicating the variable is more important."
importance(model_rf)
##      IncNodePurity
## Age       13.90931
## Fare      13.90745
Female
fol<- formula(Survived~Age+Pclass+Fare)
model_rf <- randomForest(fol, data=trainset_f,na.action=na.omit)
#model_rf <- randomForest(fol, data=trainset,importance=T,keep.forest=T)
#testing
predict_rf_f<-predict(model_rf,newdata=testset_f)
survival<-which(predict_rf_f>.5)
predict_rf_f[survival]<-1
predict_rf_f[-survival]<-0
accuracy_rf<-(sum(predict_rf_f==testset_f[,'Survived']))/length(predict_rf_f)
print(c("accuracy: ",accuracy_rf))
## [1] "accuracy: " "0.75"


#Gini coefficient
#"The higher the number, the more the gini impurity score decreases by branching on this variable, indicating the variable is more important."
importance(model_rf)
##        IncNodePurity
## Age         5.090393
## Pclass      5.326668
## Fare        5.784793
Combining the female and male data sets to get the overall accuracy
predict_rf_mix<-c(predict_rf_m,predict_rf_f)
truth_mix<-c(testset_m[,'Survived'],testset_f[,'Survived'])
accuracy_rf<-(sum(predict_rf_mix==truth_mix))/length(predict_rf_mix)
print(c("accuracy: ",accuracy_rf))
## [1] "accuracy: "        "0.795045045045045"

Summary
A simpler model might work better; one concern may be over-fitting.
Notably, the key factors for predicting survival differ by gender. For males, Fare and SibSp are the
two main factors in predicting survival; for females, Pclass and Fare are the two main factors.
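The over-fitting concern above could be probed with k-fold cross-validation instead of a single 50/50 split. A sketch using `caret::train` on simulated stand-in data (with the real data, the `trainset` built earlier would replace the simulation):

```r
library(caret)
library(rpart)

set.seed(42)
n <- 400
# simulated stand-in for the real Titanic trainset
trainset <- data.frame(
  Survived = factor(sample(0:1, n, replace = TRUE)),
  Sex      = factor(sample(c("male", "female"), n, replace = TRUE)),
  Pclass   = sample(1:3, n, replace = TRUE),
  Fare     = round(rexp(n, rate = 1/32), 2)
)

# 5-fold CV averages accuracy over held-out folds, a more stable
# estimate than a single train/test split
ctrl <- trainControl(method = "cv", number = 5)
model_cv <- train(Survived ~ Sex + Pclass + Fare, data = trainset,
                  method = "rpart", trControl = ctrl)
cv_acc <- max(model_cv$results$Accuracy)
```

Comparing the cross-validated accuracy against the single-split accuracies reported above would indicate how much of the tree-versus-forest gap is just split noise.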
