
INVESTIGATION OF BANK'S TERM DEPOSIT SUBSCRIPTION

Prepared by: Data Miner Junior, Yi Han Low 42424658, Yean Seang Ng 42365783, Jiayi Zhu 42185211, Miduo Tian 42300266

Introduction

With significant changes in macroeconomic factors and market behaviour, as well as in the types of customers in the bank's portfolio, there is a far greater need to discover new information about customer profiles and how they are associated with buying decisions.
The bank approached the data mining team to apply data mining techniques to more accurately pinpoint the customers who have a higher chance of subscribing to the term deposit product.

Aim

By implementing various clustering and classification methods, the data mining team aims to answer the following questions:
1. Are there any demographic characteristics of the customers that are associated with their decision to buy term deposits?
2. Which data mining techniques and further analyses are best performed to make predictions and to improve the results of marketing campaigns for similar products of the bank?
3. Which factors related to the marketing process are the most influential on the clients' buying decision?

Data Description

The data file contains information about 45,211 respondents (rows) in the campaign, described by 17 variables (columns). Seven of the variables are continuous, six are nominal and four are binary, one of which is the outcome variable denoting whether the customer bought the term deposit.
The outcome variable consists of 88.3% of clients who did not subscribe to the term deposit and 11.7% who responded positively.
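
As a quick illustration of how the figures above can be verified, a minimal R sketch is shown below; the file name bank-full.csv and the semicolon separator are assumptions about how the campaign data are stored, not details given in this report.

# Load the campaign data (file name and separator are assumed, not taken from the report).
bank <- read.csv("bank-full.csv", sep = ";", stringsAsFactors = TRUE)

dim(bank)                                   # expect 45211 rows, 17 columns
round(100 * prop.table(table(bank$y)), 1)   # expect roughly: no 88.3, yes 11.7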

Data Pre-processing

Figure 1: Data pre-processing steps performed before modelling:
- Analysis of outliers and extreme values: Age, Balance, Campaign, Duration, Pdays, Previous
- Concept hierarchies and binning: Pdaysnew, Jobnew, Durationnew
- Exclusion of variables: Day, Poutcome
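
Because Durationnew and Pdaysnew reappear as binned predictors in the C5.0 tree (Figure 3), a minimal R sketch of how such concept hierarchies can be built with cut() is given below. The break points mirror the bin labels visible in Figure 3; the column names and the -1 coding of pdays for "never contacted" are assumptions about the underlying data.

# Bin call duration (seconds) into the intervals used as DurationNew in the tree.
bank$DurationNew <- cut(bank$duration,
                        breaks = c(0, 120, 240, 360, 480, 600, 720, 840, Inf),
                        labels = c("0 to 120", "120 to 240", "240 to 360", "360 to 480",
                                   "480 to 600", "600 to 720", "720 to 840", "840 and above"),
                        include.lowest = TRUE, right = FALSE)

# Bin days since the previous campaign; a pdays value of -1 (assumed coding) means never contacted.
bank$pdaysnew <- cut(bank$pdays,
                     breaks = c(-Inf, 0, 100, 200, 300, 400, Inf),
                     labels = c("never", "0 to 99", "100 to 199", "200 to 299",
                                "300 to 399", "400 or above"),
                     right = FALSE)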

Classification Models

We divided the initial data set into three subsets, training, validation and testing, with weights of 60%, 20% and 20%, respectively. A consistent misclassification cost matrix, shown below in Table 1, is employed in all of our classification models to reflect the higher cost of misclassifying customers who would buy the product as non-buyers.

Table 1: Misclassification cost matrix
              Predicted No   Predicted Yes
Actual No     0.0            1.0
Actual Yes    5.0            0.0

The models we attempted include C5.0, C&R Tree (CART), CHAID, QUEST, Neural Network (NN) and Support Vector Machine (SVM), all used to develop predictions of a positive response on the target variable, i.e. who will subscribe to the term deposit (y = 1). These models were all run in SPSS, whereas R was used to develop the C5.0 tree and to tune the SVM (e1071 package). Our modelling approach was to start with the default options and then adjust parameters to decrease the error rate when predicting a positive customer response.
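
The report states that SPSS was used for most models while R handled the C5.0 tree and the SVM tuning (e1071); the sketch below is a minimal, assumption-laden version of that workflow, not the team's exact code. The 60/20/20 split and the 5:1 cost follow the text; the random seed, the tuning grid and the orientation of the C50 cost matrix (which should be verified against the package documentation) are assumptions.

library(C50)     # C5.0 decision trees
library(e1071)   # SVM and grid-search tuning

set.seed(42)
parts <- sample(c("train", "valid", "test"), nrow(bank),
                replace = TRUE, prob = c(0.6, 0.2, 0.2))
train <- bank[parts == "train", ]

# 5:1 penalty for missing a buyer, mirroring Table 1.
# NOTE: the row/column orientation expected by C5.0() is an assumption here - check ?C5.0.
costs <- matrix(c(0, 5, 1, 0), nrow = 2,
                dimnames = list(c("no", "yes"), c("no", "yes")))
c50_fit <- C5.0(y ~ ., data = train, costs = costs)
summary(c50_fit)

# Grid-search the SVM hyperparameters (slow on the full training set; consider a subsample).
svm_tune <- tune.svm(y ~ ., data = train,
                     gamma = 10^(-3:-1), cost = 10^(0:2),
                     class.weights = c(no = 1, yes = 5))
summary(svm_tune)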

Evaluation

The results from the methods were compared in terms of class-specific accuracy rates and their consistency across all data sets, the lifts, and the lift and gain charts for y = yes.
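
As a point of reference for how the lift figures below are read: lift for the yes class is the share of actual buyers among the customers predicted to buy, divided by the overall buying rate. A minimal R sketch, using the C5.0 training-partition counts later reported in Table 6, reproduces the lift of about 3.22; the object names are illustrative only.

# Confusion matrix for C5.0 on the training partition (counts taken from Table 6).
conf <- matrix(c(19724, 549,      # predicted no:  actual no, actual yes
                 4377, 2635),     # predicted yes: actual no, actual yes
               nrow = 2,
               dimnames = list(actual = c("no", "yes"), predicted = c("no", "yes")))

class_accuracy <- diag(conf) / rowSums(conf)              # no: ~0.818, yes: ~0.828
precision_yes  <- conf["yes", "yes"] / sum(conf[, "yes"]) # buyers among predicted buyers
base_rate_yes  <- sum(conf["yes", ]) / sum(conf)          # overall buying rate (~11.7%)
lift_yes       <- precision_yes / base_rate_yes           # about 3.22, matching Table 6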
Table 2: Class-specific accuracy rates (%) of all models on the training, testing and validation datasets

Partition    Class   CART     CHAID    QUEST    C5.0/C4.5   Neural Networks   SVM
Training     Yes     70.069   81.124   82.726   82.758      38.003            52.293
Training     No      87.598   77.827   72.151   81.839      96.241            97.585
Testing      Yes     70.627   82.224   83.080   83.555      39.734            41.445
Testing      No      87.595   77.831   72.842   81.615      96.064            96.381
Validation   Yes     71.605   81.007   82.146   82.716      36.752            40.741
Validation   No      87.413   77.734   72.184   82.177      96.211            96.413

Table 3: Lift for y = yes and no of all models on the training, testing and validation datasets

Partition    Class   CART    CHAID   QUEST   C5.0/C4.5   Neural Networks   SVM
Training     Yes     3.663   2.792   2.415   3.220       4.900             6.350
Training     No      1.083   1.097   1.097   1.101       1.043             1.063
Testing      Yes     3.666   2.812   2.462   3.206       4.873             5.132
Testing      No      1.085   1.100   1.099   1.104       1.046             1.048
Validation   Yes     3.673   2.780   2.404   3.254       4.807             5.134
Validation   No      1.086   1.097   1.097   1.102       1.042             1.047

The C5.0 model delivers the highest accuracy rate for Yes on the training data (82.758%) as well as a relatively high lift (3.220). The testing and validation data sets also show consistent results. In the lift chart for y = yes (Figure 4), the C5.0 model is the third highest line (the yellow line), which still indicates a very high lift for our prediction purposes. In the gain chart for y = yes (Figure 4), the C5.0 model is also the third highest line, again indicating very high gains for our prediction purposes. The comparison of lifts and gains was consistent on the testing and validation sets. Therefore, C5.0 is chosen as the best model in our analysis.

Clustering Analysis

Methods
Two-Step Clustering - We initially chose to exclude the month and duration variables, as these contain past information that does not help in characterising future customers. We excluded outliers at a percentage of 25% to improve the clustering results, since the data contain outliers and extreme values. Empirical internal testing showed that it is reasonable to proceed even though the normality assumption is violated. We used the log-likelihood as the distance measure and Schwarz's Bayesian Criterion (BIC) as the clustering criterion.
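
SPSS Modeler's TwoStep procedure has no direct equivalent in base R, so the sketch below is only a rough analogue of the segmentation described above (Gower distance plus k-medoids from the cluster package), not the method the team actually ran. The demographic columns, the sample size and the reuse of the bank data frame from the earlier sketches are assumptions.

library(cluster)   # daisy() for mixed-type distances, pam() for k-medoids

set.seed(1)
# Use a sample: a full 45,211 x 45,211 dissimilarity matrix would be far too large.
demo_cols <- c("age", "job", "marital", "education", "balance", "housing")  # assumed columns
samp <- bank[sample(nrow(bank), 3000), demo_cols]

d   <- daisy(samp, metric = "gower")   # handles numeric and categorical variables together
seg <- pam(d, k = 2)                   # two segments, mirroring the two clusters reported
table(seg$clustering)                                                    # segment sizes
aggregate(samp$age, by = list(cluster = seg$clustering), FUN = median)   # age profile per segment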
Results

Two-Step Clustering identified two clusters. Cluster 1 consists of customers who are older (age 44) and married, whereas Cluster 2 consists of customers who are younger (age 34) and single.
The C5.0 model selected call duration, the number of days since the last campaign, the month of contact, and whether the client has a housing loan as the most important factors.
The most important rules, ranked by rule confidence, are outlined below:
Rule 1: If the duration of the call is less than 600 seconds, the number of days between the previous and current campaigns is between 300 and 399, and the month of contact is August, January, July, June, May or November, then we have 50% confidence that the customer will buy a term deposit.
Rule 2: If the duration of contact with the customer is 600 seconds or longer, then we have 49.3% confidence that they will buy a term deposit.
Rule 3: If the duration of contact is neither too short nor too long (between 120 and 360 seconds), the days between the previous and current campaigns are between 200 and 399 (or the customer has never been contacted before), the month of contact is June, and the contact is made by cellular phone or telephone, then we have 38.8% confidence that the response is positive.

Conclusion
Understanding the best segments of customers
With the clustering analysis performed as unsupervised learning, we gain a better understanding of the distinct groups in the portfolio of customers. We can then predict the needs of each customer segment and promote specific products through campaigns that suit those needs, improving overall profitability.
Classifying customers who buy term deposits
- Encourage marketers to increase the length of their phone calls (to around 4 to 6 minutes).
- Increase the number of calls made, or employ more agents, in the months that most strongly affect the probability of a successful contact.
- Advise the marketing department to avoid calling the same customers too soon after the last campaign.

Limitations

- The overall proportion of the target variable equal to yes was quite low (11.7%), which leads to an error rate of around 17.24% for the yes class on the training data even in the best model.
- Since we filtered out the poutcome variable, we cannot identify whether the outcome of the previous campaign affects the current marketing campaign.
- The choice of misclassification costs is difficult and may not be accurate.
Further Research
- Perform time series analysis on the data sets to identify trends or patterns in the buying power of the customers over time.
- Run a competitive analysis to assess the strengths and weaknesses of the current product in relation to market standards.

Table 4: All 17 variables, including their types and explanations

Figure 2: Web graph showing the importance of the predictors according to Clusters 1 and 2

Table 5: Clusters generated by Two-Step Clustering in SPSS with their underlying predictors

Figure 4: Gain charts (above) and lift charts (below) of all classification models performed on the training datasets


Figure 3: Decision Tree produced by the C5.0 model in SPSS


DurationNew in [ "0 to 120" "120 to 240" "240 to 360" "360 to 480" "480 to 600" ] [ Mode: no ] (25,032)
  pdaysnew in [ "0 to 99" "100 to 199" "400 or above" ] [ Mode: yes ] (2,500)
    DurationNew in [ "0 to 120" ] [ Mode: no ] => no (679; 0.953)
    DurationNew in [ "120 to 240" "240 to 360" "360 to 480" "480 to 600" ] [ Mode: yes ] => yes (1,821; 0.382)
  pdaysnew in [ "200 to 299" "300 to 399" "never" ] [ Mode: no ] (22,532)
    month in [ "apr" "dec" "feb" "mar" "oct" "sep" ] [ Mode: yes ] (3,341)
      housing = yes [ Mode: no ] => no (1,678; 0.911)
      housing = no [ Mode: yes ] (1,663)
        DurationNew in [ "0 to 120" ] [ Mode: no ] => no (505; 0.895)
        DurationNew in [ "120 to 240" "240 to 360" "360 to 480" "480 to 600" ] [ Mode: yes ] => yes (1,158; 0.38)
    month in [ "aug" "jan" "jul" "jun" "may" "nov" ] [ Mode: no ] (19,191)
      contact in [ "cellular" "telephone" ] [ Mode: no ] (12,049)
        DurationNew in [ "0 to 120" ] [ Mode: no ] => no (4,345; 0.995)
        DurationNew in [ "120 to 240" "240 to 360" ] [ Mode: no ] (6,160)
          month in [ "aug" "jan" "jul" "may" "nov" ] [ Mode: no ] => no (5,972; 0.961)
          month in [ "jun" ] [ Mode: yes ] => yes (188; 0.388)
        DurationNew in [ "360 to 480" "480 to 600" ] [ Mode: yes ] => yes (1,544; 0.198)
      contact in [ "unknown" ] [ Mode: no ] (7,142)
        pdaysnew in [ "300 to 399" ] [ Mode: yes ] => yes (2; 0.5)
        pdaysnew in [ "never" ] [ Mode: no ] (7,140)
          month in [ "aug" "nov" ] [ Mode: yes ] => yes (46; 0.196)
          month in [ "jan" "jul" "jun" "may" ] [ Mode: no ] => no (7,094; 0.992)
        pdaysnew in [ "200 to 299" ] [ Mode: no ] => no (0)
DurationNew in [ "600 to 720" "720 to 840" "840 and above" ] [ Mode: yes ] => yes (2,253; 0.493)
DurationNew in [ "default" ] [ Mode: no ] => no (0)

Table 6: Class-specific error rates using the C5.0 model on the training, testing and validation data sets

Training
output (subscribe)        Predicted no   Predicted yes   Total
Actual no     Count       19724          4377            24101
              Row %       81.83893       18.16107        100
Actual yes    Count       549            2635            3184
              Row %       17.24246       82.75754        100
Total         Count       20273          7012            27285
              Row %       74.3009        25.6991         100
Lift (yes): 3.22025

Testing
output (subscribe)        Predicted no   Predicted yes   Total
Actual no     Count       6428           1448            7876
              Row %       81.61503       18.38497        100
Actual yes    Count       173            879             1052
              Row %       16.44487       83.55513        100
Total         Count       6601           2327            8928
              Row %       73.93593       26.06407        100
Lift (yes): 3.205759

Validation
output (subscribe)        Predicted no   Predicted yes   Total
Actual no     Count       6529           1416            7945
              Row %       82.17747       17.82253        100
Actual yes    Count       182            871             1053
              Row %       17.28395       82.71605        100
Total         Count       6711           2287            8998
              Row %       74.58324       25.41676        100
Lift (yes): 3.25439
