Data Miner Junior Poster

INVESTIGATION OF BANK'S TERM DEPOSIT SUBSCRIPTION
Prepared by: Data Miner Junior, Yi Han Low 42424658, Yean Seang Ng 42365783, Jiayi Zhu 42185211, Miduo Tian 42300266
Introduction
With significant changes in macroeconomic factors and market behaviour,
as well as in the types of customers in the bank's portfolio, there is a
far greater need to discover new information about customer profiles and
their association with buying decisions.
The data mining team was approached by the bank to apply data mining
techniques to pinpoint more accurately the customers who are more likely
to subscribe to the term deposit product.
Aim
By implementing various clustering and classification methods, the data
mining team aims to answer the following questions:
- Are there any demographic characteristics of customers that are
  associated with their decision to buy term deposits?
- Which data mining techniques and further analyses are best suited to
  making predictions and improving the results of the bank's marketing
  campaigns for similar products?
- Which factors related to the marketing process have the most
  influence on clients' buying decisions?
Data Description
The data file contains information about 45,211 respondents in the
campaign (rows) and 17 variables (columns). Seven of the variables are
continuous, six are nominal and the remaining four are binary, including
the outcome variable, which denotes whether the customer bought the term
deposit.
For the outcome variable, 88.3% of clients had not subscribed to the
term deposit and 11.7% had responded positively.
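This imbalance affects how any classifier on these data should be read. As a minimal sketch using only the two figures stated above (the function name is illustrative and not part of the original analysis), a model that always predicts "no" would already be correct for about 88.3% of records while identifying no subscribers at all:

```python
def majority_baseline(n_rows, positive_rate):
    """Return (overall accuracy, yes-class accuracy) of always predicting 'no'."""
    n_yes = round(n_rows * positive_rate)   # approximate number of subscribers
    n_no = n_rows - n_yes
    overall_accuracy = n_no / n_rows        # every 'no' record is classified correctly
    yes_accuracy = 0.0                      # no subscriber is ever predicted
    return overall_accuracy, yes_accuracy

overall, yes_acc = majority_baseline(45211, 0.117)
print(f"overall accuracy: {overall:.3f}, accuracy on yes: {yes_acc:.3f}")
```

Overall accuracy alone is therefore a weak yardstick here, which is one reason to look at class-specific rates.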
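DATA PRE-PROCESSING
Figure 1: Data Pre-processing
Analysis of outliers and extreme values: Age, Balance, Campaign,
Duration, Pdays, Previous
Concept hierarchies and binning
-Pdaysnew, Jobnew, Durationnew
Exclusion of variables
-Day, Poutcome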
Classification Models
We divided the initial data set into three subsets (training, validation
and testing) with weights of 60%, 20% and 20%, respectively. The same
misclassification cost matrix, shown in Table 1, is used in all of our
classification models. It assigns a higher cost to classifying a customer
as a non-buyer when they would actually buy the product (cost 5.0) than
to the opposite error (cost 1.0).
Table 1: Misclassification cost matrix
              Predicted No   Predicted Yes
Actual No     0.0            1.0
Actual Yes    5.0            0.0
The models we attempted include C5.0, C&R Tree, CHAID, QUEST, Neural
Network (NN) and Support Vector Machine (SVM). They were used to predict
the positive response of the target variable, i.e. who will subscribe to
a term deposit (y = 1). The models were built in SPSS, while R was used
to develop the C5.0 tree and to tune the SVM (e1071 package). Our
modelling approach was to start with the default options and then adjust
parameters to reduce the error rate in predicting customers' positive
responses.
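To illustrate how the costs in Table 1 act on a single prediction, here is a minimal sketch (not the SPSS implementation; the function and variable names are illustrative). For a customer with estimated subscription probability p, predicting "no" has expected cost 5.0 * p and predicting "yes" has expected cost 1.0 * (1 - p), so "yes" is the cheaper prediction whenever p exceeds 1/6.

```python
# Misclassification costs from Table 1, indexed as COST[actual][predicted].
COST = {
    "no": {"no": 0.0, "yes": 1.0},
    "yes": {"no": 5.0, "yes": 0.0},
}

def expected_cost(p_yes, predicted):
    """Expected cost of a prediction given P(actual = yes) = p_yes."""
    return (1.0 - p_yes) * COST["no"][predicted] + p_yes * COST["yes"][predicted]

def min_cost_prediction(p_yes):
    """Pick the class with the lower expected cost (ties go to 'no')."""
    if expected_cost(p_yes, "yes") < expected_cost(p_yes, "no"):
        return "yes"
    return "no"

# Break-even probability: cost(no -> yes) / (cost(no -> yes) + cost(yes -> no)).
threshold = COST["no"]["yes"] / (COST["no"]["yes"] + COST["yes"]["no"])

print(round(threshold, 4))
print(min_cost_prediction(0.10), min_cost_prediction(0.20))
```

Under these costs a group of customers can be labelled "yes" even when fewer than half of them are buyers, which is consistent with the yes leaves in Figure 3 whose confidence is below 0.5.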
Clustering Analysis
Methods
Two-Step Clustering - We initially chose to exclude the month and
duration variables, as these two variables contain past information that
does not help in characterising our future customers. Because the data
include outliers and extreme values, we applied outlier handling with a
percentage of 25% to improve the clustering results. Empirical internal
testing indicates that the procedure is reasonably robust to violations
of the normality assumption. We used log-likelihood as the distance
measure and Schwarz's Bayesian Criterion (BIC) as the clustering
criterion.
Evaluation
The classification models were compared on their class-specific error
rates and the consistency of those rates across all data sets, on their
lifts, and on the lift and gain charts for y = yes.
Table 2: Class-specific accuracy rates (%) of all models on training, testing and validation datasets
Partition   Class  CART    CHAID   QUEST   C5.0/C4.5  Neural Networks  SVM
Training    Yes    70.069  81.124  82.726  82.758     38.003           52.293
            No     87.598  77.827  72.151  81.839     96.241           97.585
Testing     Yes    70.627  82.224  83.080  83.555     39.734           41.445
            No     87.595  77.831  72.842  81.615     96.064           96.381
Validation  Yes    71.605  81.007  82.146  82.716     36.752           40.741
            No     87.413  77.734  72.184  82.177     96.211           96.413
Table 3: Lift for y = yes and no of all models on training, testing and validation datasets
Partition   Class  CART    CHAID   QUEST   C5.0/C4.5  Neural Networks  SVM
Training    Yes    3.663   2.792   2.415   3.220      4.900            6.350
            No     1.083   1.097   1.097   1.101      1.043            1.063
Testing     Yes    3.666   2.812   2.462   3.206      4.873            5.132
            No     1.085   1.100   1.099   1.104      1.046            1.048
Validation  Yes    3.673   2.780   2.404   3.254      4.807            5.134
            No     1.086   1.097   1.097   1.102      1.042            1.047
The C5.0 model delivers the highest training accuracy rate for Yes
(82.758%) together with a relatively high lift (3.220), and its results
on the testing and validation data sets are consistent with this. In the
lift chart for y = yes (Figure 4), the C5.0 model is the third highest
line (the yellow line), which still indicates a very high lift for our
prediction purposes. In the gain chart for y = yes (Figure 4), the C5.0
model is also the third highest line, which still indicates a very high
gain. The comparisons of lift and gain were consistent across the
testing and validation sets. Therefore, C5.0 was chosen as the best
model in our analysis.
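For readers unfamiliar with these charts, the sketch below shows one common way the points of a gain chart and a lift chart for y = yes can be computed from model scores. The scores and labels are made-up toy values, not the campaign data, and the function name is illustrative; SPSS may bin the records differently.

```python
def gain_and_lift(scores, labels, n_bins=5):
    """Cumulative gain and lift for the positive class, per score-ranked bin.

    scores: predicted probability of 'yes' per record (hypothetical values)
    labels: 1 if the record actually subscribed, else 0
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total = len(ranked)
    total_yes = sum(label for _, label in ranked)
    points = []
    for b in range(1, n_bins + 1):
        cutoff = round(total * b / n_bins)      # records contacted so far
        found = sum(label for _, label in ranked[:cutoff])
        gain = found / total_yes                # share of all buyers captured
        lift = gain / (cutoff / total)          # gain relative to random contact
        points.append((cutoff / total, gain, lift))
    return points

# Toy example: 10 records, 2 buyers, both ranked near the top.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
for fraction, gain, lift in gain_and_lift(toy_scores, toy_labels):
    print(f"top {fraction:.0%}: gain {gain:.2f}, lift {lift:.2f}")
```

A model whose line sits higher on either chart finds more buyers for the same number of customers contacted; at 100% of records the gain is 1 and the lift falls back to 1.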
Results
Two-step clustering identified two clusters. Cluster 1 consists of
customers who are older (around 44 years old) and married, whereas
Cluster 2 consists of customers who are younger (around 34 years old)
and single.
The C5.0 model selected call duration, the number of days since the
last campaign, the month of contact and whether the client has a housing
loan as the most important factors.
The most important rules, ranked by rule confidence, are outlined below:
Rule 1: If the duration of the call is less than 600 seconds, the number
of days between the previous and current campaign is between 300 and
399, the month of contact is August, January, July, June, May or
November, and the contact type is unknown, then we have 50% confidence
that the customer will buy a term deposit. In the decision tree
(Figure 3) this rule is supported by only 2 training records.
Rule 2: If the duration of contact with the customer is 600 seconds or
longer, then we have 49.30% confidence that they will buy a term deposit.
Rule 3: If the duration of contact is neither too short nor too long
(between 120 and 360 seconds), the number of days between the previous
and current campaign is between 200 and 399 (or the customer has never
been contacted before), the month of contact is June and the contact is
by cellular or telephone, then we have 38.8% confidence that the
response is positive.
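As an illustration only, the three rules can be written as a small lookup. This is a sketch that follows the branch conditions of the tree in Figure 3, not an export of the SPSS model. The function name, the argument names, the use of None for "never contacted" and the exact treatment of bin boundaries (for example, whether 360 seconds falls inside "240 to 360") are assumptions.

```python
def rule_confidence(duration_s, pdays, month, contact):
    """Return (rule name, confidence of 'yes') for the matching rule, else None.

    duration_s: call duration in seconds
    pdays: days since the previous campaign contact, or None if never contacted
    month: three-letter lowercase month of contact, e.g. "jun"
    contact: "cellular", "telephone" or "unknown"
    """
    summer_group = {"aug", "jan", "jul", "jun", "may", "nov"}
    # Rule 2: long calls, regardless of the other variables.
    if duration_s >= 600:
        return ("Rule 2", 0.493)
    # Rule 1: Figure 3 places this leaf under contact = unknown.
    if (pdays is not None and 300 <= pdays <= 399
            and month in summer_group and contact == "unknown"):
        return ("Rule 1", 0.50)
    # Rule 3: medium-length call in June by cellular or telephone.
    if (120 <= duration_s < 360
            and (pdays is None or 200 <= pdays <= 399)
            and month == "jun"
            and contact in {"cellular", "telephone"}):
        return ("Rule 3", 0.388)
    return None

print(rule_confidence(700, None, "may", "cellular"))
print(rule_confidence(200, None, "jun", "cellular"))
print(rule_confidence(100, None, "may", "cellular"))
```

A return value of None only means that none of these three rules applies; the full tree still assigns such a customer to one of its other leaves.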
Conclusion
Understanding of the best segments of customers
Using cluster analysis as an unsupervised learning step, we gain a
better understanding of the distinct groups in the customer portfolio.
We can then anticipate the needs of each customer segment and promote
specific products through campaigns that suit those needs, improving
overall profitability.
Classify customers who buy term deposits
- Encourage marketers to increase the length of their phone calls
  (to around 4 to 6 minutes).
- Increase the number of calls made, or employ more agents, in the
  months that most strongly affect the probability of a successful
  contact.
- Advise the marketing department to avoid calling the same customers
  too soon after the last campaign.
Limitations
- The overall proportion of yes in the target variable was quite low
  (11.7%), which led to an error rate of about 17.242% for the yes
  class on the training data even in the best model.
- Since we filtered out the poutcome variable, we cannot determine
  whether the outcome of the previous campaign affects the current one.
- Choosing the misclassification costs is difficult, and the values
  used may not be accurate.
Further Research
- Perform time series analysis on the data sets to identify trends or
  patterns in customers' buying power over time.
- Run a competitive analysis to assess the strengths and weaknesses of
  the current product relative to market standards.
Table 4: All 17 variables with their types and explanations
Figure 2: Web graph showing the importance of the predictors according to Cluster 1 and 2
Table 5: Clusters generated by Two-Step Clustering in SPSS with their underlying predictors
Figure 4: Gain Charts (above) and Lift Charts (below) of all classification
models performed on training datasets
Figure 3: Decision Tree produced by the C5.0 model in SPSS

DurationNew in [ "0 to 120" "120 to 240" "240 to 360" "360 to 480" "480 to 600" ] [ Mode: no ] (25,032)
    pdaysnew in [ "0 to 99" "100 to 199" "400 or above" ] [ Mode: yes ] (2,500)
        DurationNew in [ "0 to 120" ] [ Mode: no ] => no (679; 0.953)
        DurationNew in [ "120 to 240" "240 to 360" "360 to 480" "480 to 600" ] [ Mode: yes ] => yes (1,821; 0.382)
    pdaysnew in [ "200 to 299" "300 to 399" "never" ] [ Mode: no ] (22,532)
        month in [ "apr" "dec" "feb" "mar" "oct" "sep" ] [ Mode: yes ] (3,341)
            housing = yes [ Mode: no ] => no (1,678; 0.911)
            housing = no [ Mode: yes ] (1,663)
                DurationNew in [ "0 to 120" ] [ Mode: no ] => no (505; 0.895)
                DurationNew in [ "120 to 240" "240 to 360" "360 to 480" "480 to 600" ] [ Mode: yes ] => yes (1,158; 0.38)
        month in [ "aug" "jan" "jul" "jun" "may" "nov" ] [ Mode: no ] (19,191)
            contact in [ "cellular" "telephone" ] [ Mode: no ] (12,049)
                DurationNew in [ "0 to 120" ] [ Mode: no ] => no (4,345; 0.995)
                DurationNew in [ "120 to 240" "240 to 360" ] [ Mode: no ] (6,160)
                    month in [ "aug" "jan" "jul" "may" "nov" ] [ Mode: no ] => no (5,972; 0.961)
                    month in [ "jun" ] [ Mode: yes ] => yes (188; 0.388)
                DurationNew in [ "360 to 480" "480 to 600" ] [ Mode: yes ] => yes (1,544; 0.198)
            contact in [ "unknown" ] [ Mode: no ] (7,142)
                pdaysnew in [ "300 to 399" ] [ Mode: yes ] => yes (2; 0.5)
                pdaysnew in [ "never" ] [ Mode: no ] (7,140)
                    month in [ "aug" "nov" ] [ Mode: yes ] => yes (46; 0.196)
                    month in [ "jan" "jul" "jun" "may" ] [ Mode: no ] => no (7,094; 0.992)
                pdaysnew in [ "200 to 299" ] [ Mode: no ] => no (0)
DurationNew in [ "600 to 720" "720 to 840" "840 and above" ] [ Mode: yes ] => yes (2,253; 0.493)
DurationNew in [ "default" ] [ Mode: no ] => no (0)
Table 6: Class-specific error rates using the C5.0 model on the training, testing and validation data sets
(rows: actual output (subscribe); columns: predicted output)

Training
                 no          yes         Total
no     Count     19724       4377        24101
       Row %     81.83893    18.16107    100
yes    Count     549         2635        3184
       Row %     17.24246    82.75754    100
Total  Count     20273       7012        27285
       Row %     74.3009     25.6991     100
Lift (yes)       3.22025

Testing
                 no          yes         Total
no     Count     6428        1448        7876
       Row %     81.61503    18.38497    100
yes    Count     173         879         1052
       Row %     16.44487    83.55513    100
Total  Count     6601        2327        8928
       Row %     73.93593    26.06407    100
Lift (yes)       3.205759

Validation
                 no          yes         Total
no     Count     6529        1416        7945
       Row %     82.17747    17.82253    100
yes    Count     182         871         1053
       Row %     17.28395    82.71605    100
Total  Count     6711        2287        8998
       Row %     74.58324    25.41676    100
Lift (yes)       3.25439
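The class-specific accuracy rates and lifts reported for C5.0 can be recomputed from the counts in Table 6. The sketch below does this for the training partition, treating rows as actual classes and columns as predicted classes (an assumption that reproduces the reported figures); the function name is illustrative.

```python
def class_metrics(tn, fp, fn, tp):
    """Class-specific accuracy (%) and lift from a 2x2 table of counts.

    tn: actual no, predicted no     fp: actual no, predicted yes
    fn: actual yes, predicted no    tp: actual yes, predicted yes
    """
    total = tn + fp + fn + tp
    acc_yes = 100.0 * tp / (tp + fn)            # share of buyers found
    acc_no = 100.0 * tn / (tn + fp)             # share of non-buyers found
    # Lift = hit rate among records predicted as a class / base rate of that class.
    lift_yes = (tp / (tp + fp)) / ((tp + fn) / total)
    lift_no = (tn / (tn + fn)) / ((tn + fp) / total)
    return acc_yes, acc_no, lift_yes, lift_no

# Training counts from Table 6.
acc_yes, acc_no, lift_yes, lift_no = class_metrics(tn=19724, fp=4377, fn=549, tp=2635)
print(f"accuracy yes {acc_yes:.3f}%, no {acc_no:.3f}%")
print(f"lift yes {lift_yes:.3f}, no {lift_no:.3f}")
```

The misclassification rate for actual buyers is 100 minus the yes accuracy, which gives the 17.242% error rate quoted under Limitations.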
