Prepared by: Data Miner Junior, Yi Han Low 42424658, Yean Seang Ng 42365783, Jiayi Zhu 42185211, Miduo Tian 42300266
Introduction
Aim
Data Description
The data file contains information about 45,211 respondents (rows) in the
campaign, described by 17 variables (columns). Seven of them are continuous
variables, six are nominal and the remaining four are binary, including
the outcome variable, which denotes whether the customer bought the term
deposit.
In the outcome variable, 88.3% of clients had not subscribed to the term
deposit and 11.7% had responded positively.
DATA PRE-PROCESSING
Classification Models
We divided the initial data set into three subsets (training, validation and
testing) with weights of 60%, 20% and 20%, respectively. The same
misclassification cost matrix, shown in Table 1, is used in all of our
classification models to reflect the higher cost of predicting that a
customer will not buy when they actually will buy the product.
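The 60/20/20 partition described above could be sketched as follows. This is only an illustration in Python, not the SPSS procedure used in the project; the function name `split_indices` and the seed are hypothetical, and the cut by exact counts is an assumption (the partition sizes reported in Table 6 suggest the original tool assigned records at random rather than by exact count).

```python
import random

def split_indices(n_rows, weights=(0.6, 0.2, 0.2), seed=42):
    """Shuffle row indices and cut them into training, validation and testing parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # reproducible shuffle (seed is arbitrary)
    n_train = int(weights[0] * n_rows)
    n_valid = int(weights[1] * n_rows)
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]     # remainder goes to the testing part
    return train, valid, test

# 45211 is the number of rows stated in the data description.
train_idx, valid_idx, test_idx = split_indices(45211)
print(len(train_idx), len(valid_idx), len(test_idx))
```

Every row lands in exactly one subset, so the three parts can be used independently for fitting, tuning and the final check.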
Table 1: Misclassification cost matrix

              Predicted No   Predicted Yes
Actual No     0.0            1.0
Actual Yes    5.0            0.0

The models we attempted include C5.0, C&R Tree, CHAID, QUEST, Neural
Network (NN) and Support Vector Machine (SVM) to develop predictions
about the positive response to the target variable, i.e. who will
subscribe to the term deposit (y = 1). These models were all performed in
SPSS, whereas R was used to develop the C5.0 tree and for SVM tuning
(e1071 package). Our modelling approach is to start with the default
options and then adjust parameters to decrease the error rate in
predicting the positive response of customers.
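To illustrate what the cost matrix in Table 1 implies, the sketch below picks the class with the lower expected cost for a given estimated probability of y = yes, and totals the cost of a confusion matrix. It is a minimal Python illustration under our own assumptions; the report does not show how SPSS or R applies the costs internally, and the function names are hypothetical.

```python
# Costs from Table 1, indexed as COST[actual][predicted].
COST = {
    "no": {"no": 0.0, "yes": 1.0},
    "yes": {"no": 5.0, "yes": 0.0},
}

def expected_cost(p_yes, predicted):
    """Expected cost of a prediction, given the estimated probability of y = yes."""
    return (1.0 - p_yes) * COST["no"][predicted] + p_yes * COST["yes"][predicted]

def min_cost_prediction(p_yes):
    """Return the class with the lower expected cost (ties go to 'no')."""
    return min(("no", "yes"), key=lambda c: expected_cost(p_yes, c))

def total_cost(confusion):
    """Total cost of a confusion matrix given as confusion[actual][predicted] counts."""
    return sum(COST[a][p] * n for a, row in confusion.items() for p, n in row.items())

print(min_cost_prediction(0.10), min_cost_prediction(0.20))
```

With these costs, predicting yes becomes the cheaper choice once the estimated probability exceeds 1/6, since 5p > 1 - p exactly when p > 1/6. This is why a cost-weighted model flags many more customers as likely buyers than an unweighted one would.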
Evaluation
The results from the methods were compared in terms of class-specific
rates and their consistency across all data sets, lifts, and the lift and
gain charts for y = yes.

Table 2: Class-specific accuracy rates of all models on training, testing and validation datasets

Partition   Class  CART    CHAID   QUEST   C5.0/C4.5  Neural Networks  SVM
Training    Yes    70.069  81.124  82.726  82.758     38.003           52.293
Training    No     87.598  77.827  72.151  81.839     96.241           97.585
Validation  No     87.413  77.734  72.184  82.177     96.211           96.413

Table 3: Lift for y= yes and no of all models on training, testing and validation datasets

Partition   Class  CART   CHAID  QUEST  C5.0/C4.5  Neural Networks  SVM
Training    Yes    3.663  2.792  2.415  3.220      4.900            6.350
Training    No     1.083  1.097  1.097  1.101      1.043            1.063
Testing     Yes    3.666  2.812  2.462  3.206      4.873            5.132
Testing     No     1.085  1.100  1.099  1.104      1.046            1.048
Validation  Yes    3.673  2.780  2.404  3.254      4.807            5.134
Validation  No     1.086  1.097  1.097  1.102      1.042            1.047

The C5.0 model delivers the highest accuracy rate for Yes (82.758%) as
well as a relatively high lift (3.220) on the training data. The testing
and validation data sets show consistent results. In the lift chart for
y = yes, the C5.0 model was the third highest line (the yellow line),
which still indicates a very high lift for our prediction purposes. In
the gain chart for y = yes, the C5.0 model was also the third highest
line (the yellow line), which still indicates a very high gain for our
prediction purposes. The comparisons of lifts and gains were consistent
across the testing and validation sets. Therefore, C5.0 was chosen as the
best model in our analysis.

Clustering Analysis
Exclusion of variables
-Day, Poutcome
Balance, Campaign,
Methods
Two-Step Clustering - We initially chose to exclude the month and duration
variables, as these two variables contain past information that does not
help in characterising our future customers. We excluded outliers using a
percentage of 25% to improve the clustering results, since the data
include outliers and extremes. Empirical internal testing showed that
violating the normality assumption is acceptable. We also used the
log-likelihood as the distance measure and Schwarz's Bayesian Criterion
(BIC) as the clustering criterion.
Results
Conclusion
Understanding of the best segments of customers
With clustering analysis performed as unsupervised learning, we can gain
a better understanding of the distinct groups in the customer portfolio.
We can then predict the needs of each customer segment and promote
specific products through campaigns that suit those needs, improving
overall profitability.
Classify customers who buy term deposits
- Encourage marketers to increase the length of their phone calls
  (to around 4 to 6 minutes).
- Increase the number of calls made, or employ more agents, in certain
  months, since the month strongly affects the probability of a
  successful contact.
- Advise the marketing department to avoid calling the same customers
  too soon after the last campaign.
Limitations
- The overall proportion of yes in the target variable was quite low
  (11.7%), which led to an error rate of around 17.242% for the yes
  class on the training data even in the best model.
- Since we filtered out the poutcome variable, we cannot identify
  whether the previous outcome affects the current marketing campaign.
- The choice of misclassification costs is difficult and may not be
  accurate.
Further Research
- Perform time series analysis on the data sets to identify trends or
  patterns in the buying power of the customers over time.
- Run competitive analysis to assess the strengths and weaknesses of
  the current product in relation to market standards.
Figure 2: Web graph showing the importance of the predictors according to Cluster 1 and 2
Figure 4: Gain Charts (above) and Lift Charts (below) of all classification
models performed on training datasets
Table 6: Class-specific error rates using C5.0 model performed on the training, testing and validation data sets

Training
Actual output (subscribe)      Predicted no   Predicted yes   Total
no      Count                  19724          4377            24101
        Row %                  81.83893       18.16107        100
yes     Count                  549            2635            3184
        Row %                  17.24246       82.75754        100
Total   Count                  20273          7012            27285
        Row %                  74.3009        25.6991         100
Lift (yes): 3.22025

Testing
Actual output (subscribe)      Predicted no   Predicted yes   Total
no      Count                  6428           1448            7876
        Row %                  81.61503       18.38497        100
yes     Count                  173            879             1052
        Row %                  16.44487       83.55513        100
Total   Count                  6601           2327            8928
        Row %                  73.93593       26.06407        100
Lift (yes): 3.205759

Validation
Actual output (subscribe)      Predicted no   Predicted yes   Total
no      Count                  6529           1416            7945
        Row %                  82.17747       17.82253        100
yes     Count                  182            871             1053
        Row %                  17.28395       82.71605        100
Total   Count                  6711           2287            8998
        Row %                  74.58324       25.41676        100
Lift (yes): 3.25439
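The class-specific accuracy and the lift in Table 6 can be recomputed from the counts alone. The Python sketch below is our own illustration; the report does not state the lift formula, so we assume lift is the share of actual yes among predicted yes divided by the overall share of yes, which reproduces the reported training value. The function name is hypothetical.

```python
def class_accuracy_and_lift(confusion, target="yes"):
    """Accuracy (%) and lift for one class from confusion[actual][predicted] counts."""
    total = sum(sum(row.values()) for row in confusion.values())
    actual_target = sum(confusion[target].values())                  # row total
    predicted_target = sum(row[target] for row in confusion.values())  # column total
    hits = confusion[target][target]
    accuracy = 100.0 * hits / actual_target
    # Assumed definition: precision for the class divided by its base rate.
    lift = (hits / predicted_target) / (actual_target / total)
    return accuracy, lift

# Training counts from Table 6.
training = {
    "no": {"no": 19724, "yes": 4377},
    "yes": {"no": 549, "yes": 2635},
}
accuracy_yes, lift_yes = class_accuracy_and_lift(training)
print(round(accuracy_yes, 3), round(lift_yes, 3))
```

The error rate for the yes class quoted in the limitations is simply 100 minus this accuracy.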