
CREDIT SCORING AND DATA MINING

MANG 6054
(Individual Course-Work)
13-05-2009

Sujoy Singha, M.Sc Management Science,


University of Southampton, 2008/09
Student ID: 22924299
ss21g08@soton.ac.uk

TUTOR:
Dr. Bart Baesens, School of Management,
University of Southampton
Bart.Baesens@econ.kuleuven.ac.be
Question 1 (30 marks, max. 4 pages): DATA MINING METHODOLOGY PROBLEM

A) Introduction: This paper compares two different data mining methods used to analyse a
dataset and produce predictive results on customer retention patterns in the mobile phone
industry. The given dataset is first pre-processed and then used as the input data source for
the analysis. The data pre-processing, performed in Microsoft Excel and in SAS, includes
transformation of variables, filtering of outliers, replacement of extreme values, data
partitioning and stratified sampling. The data mining techniques used are logistic regression
and decision trees, both applied in SAS Enterprise Miner 4.3.

B) Data Pre-processing

The given dataset churn.xls comprises 21 variables with 5000 entries for each variable. The
dataset is used to analyse customer behaviour, which is expressed by the binary target variable
'churn', indicating whether or not a customer will leave the organisation. The data
pre-processing procedure followed is described below:

• The dataset is randomised in Microsoft Excel™ by introducing a new column called
RANDOM and filling its 5000 cells with the RAND() function. The dataset is then sorted in
ascending order of RANDOM, the column is deleted, and the dataset is saved as the input data
source to be analysed.

• An Input Data Source is created in SAS Enterprise Miner (SAS-EM) and the churn dataset is
imported as a source in the Work folder. A Variable Selection node is then added and linked to
the Input Data Source. In the Input Data Source, the variable churn is selected as the target.
Variable selection is performed on the basis of the Chi-Square criterion to remove irrelevant
variables (a χ2 value of 3.92 is set as the cut-off, below which SAS-EM rejects a variable).

• The output from the Variable Selection node is linked to a Transform Variables node. Here the
variables are transformed into standardised form, so that in the subsequent Replacement step
outliers can easily be replaced.

• The data is then divided into training and test sets through the Data Partition node in the
ratio 2:1. Further, approximately 34% of the training data is taken as a validation set, so the
final Training:Validation:Test ratio is 50:17:33. This node is linked to a Filter Outliers node.

• The Replacement node, which follows next, replaces the extreme values which might affect the
central tendency of the data. Since the data has already been standardised, replacing values of
less than -3 with -3, and those greater than 3 with 3, winsorises the dataset within a range of
six standard deviations.

• The data output from the Replacement node is then used for further analysis in the Decision
Tree node and, via Interactive Grouping for coarse classification, in the regression analysis. A
minimal code sketch of these pre-processing steps is given after this list.
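The pre-processing chain above is specific to the SAS-EM nodes, but the same steps can be expressed compactly in code. The sketch below is a hedged Python/pandas equivalent: the file name "churn.csv", the target column "churn" and the random seed are assumptions, and the chi-square variable selection step is omitted for brevity.

```python
import numpy as np
import pandas as pd

# Hedged sketch of the pre-processing steps described above; the file name
# "churn.csv", the target column "churn" and the random seed are assumptions,
# and the chi-square variable selection step is omitted.
df = pd.read_csv("churn.csv")

# 1. Randomise the row order (the Excel RAND()/sort step).
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

# 2. Standardise the numeric inputs so extreme values sit on a common scale.
num_cols = df.drop(columns=["churn"]).select_dtypes(include=np.number).columns
df[num_cols] = (df[num_cols] - df[num_cols].mean()) / df[num_cols].std()

# 3. Winsorise: cap standardised values at -3 and +3 (six standard deviations).
df[num_cols] = df[num_cols].clip(lower=-3, upper=3)

# 4. Partition into training/validation/test in roughly 50:17:33 proportions.
n = len(df)
train = df.iloc[: n // 2]
valid = df.iloc[n // 2: (2 * n) // 3]
test = df.iloc[(2 * n) // 3:]
```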
C) Results from Data pre-processing:

Based on the Chi-Square criterion of the Variable Selection node, the selected variables were:
STATE, INTL_PLAN, MAIL_PLAN, DAY_MINS, DAY_CHARGE, EVE_MINS, EVE_CHARGE, NIGHT_MINS, INTL_MINS
and INTL_CALLS.

An Insight node is linked to the Input Data Source to detect the presence of any univariate
outlier. Vmail_Message shows a univariate outlier characteristic in its histogram plot, but
since the variable is eliminated in the selection process, no treatment is required.

Fig 1: Network diagram for Churn analysis.

D) LOGISTIC REGRESSION:

The coarse classification of variables is handled by the Interactive Grouping node. Here, the
commit criterion selected was Information Value and the commit value was set to 0.1.

For the regression analysis, logistic regression with a logit link function was used, and the
selection method chosen was Stepwise. No initial parameter estimates were used and no
interactions between variables were modelled. The network diagram for the entire analysis is
shown in Fig 1.
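For readers without SAS-EM, the fit can be approximated outside SAS. The sketch below is a hedged Python equivalent using statsmodels: it fits a plain logit model on the inputs listed in Table 1 rather than performing SAS-style stepwise selection, and the file and column names are assumptions.

```python
import pandas as pd
import statsmodels.api as sm

# Hedged sketch of a logit fit on the selected inputs. statsmodels has no
# built-in SAS-style stepwise selection, so the full model is fitted and
# p-values inspected instead; file and column names are assumptions.
df = pd.read_csv("churn_preprocessed.csv")
inputs = ["intl_plan_no", "vmail_plan_no", "eve_mins", "night_mins",
          "intl_mins", "intl_calls", "custserv_calls", "day_charge"]

X = sm.add_constant(df[inputs].astype(float))
y = df["churn"].astype(int)

result = sm.Logit(y, X).fit()
print(result.summary())                      # ML estimates and p-values, cf. Table 1
churn_prob = result.predict(X)               # estimated probability of churn
churn_pred = (churn_prob > 0.5).astype(int)  # classification at the 0.5 cut-off
```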

RESULTS OBTAINED FROM LOGISTIC REGRESSION:

i) Estimated Parameters: These are obtained from the Maximum Likelihood coefficient estimates
from the Regression Output.

Parameter                  Estimate    P-value
Intercept                  -1.4697     <.0001
Intl_Plan (No)             -1.2144     <.0001
Vmail_Plan (No)             0.6108     <.0001
Evening Usage (Mins)        0.297      <.0001
Night Usage (Mins)          0.2405     0.0005
Intl Usage (Mins)           0.2603     0.0004
Intl Calls                 -0.2771     0.0008
Customer Service Calls     -1.285      <.0001
Day Charge                 -1.2362     <.0001

Table 1: Analysis of maximum likelihood coefficient estimates
The most predictive inputs are determined on the basis of their χ2 values (equivalently, their
p-values) in Table 1. The inputs with the smallest p-values, and hence the most predictive ones,
are: International Plan, Voicemail Plan, Evening Usage, Customer Service Calls and Day Charge.

ii) CLASSIFICATION CALCULATIONS ON CONFUSION MATRIX:

The confusion matrix derived from the regression analysis output at the 0.5 cut-off is:

                    Predicted FALSE   Predicted TRUE   Total
Observed FALSE        2068 (TP)          79 (FP)       2147
Observed TRUE          224 (FN)         129 (TN)        353
Total                      2292             208         2500

Classification Accuracy = (TP+TN)/(TP+TN+FP+FN) = (2068+129)/(2068+129+224+79) = 0.8788
Sensitivity = TP/(TP+FN) = 2068/(2068+224) = 0.9022
Specificity = TN/(TN+FP) = 129/208 = 0.6201

For the ROC curve comparison, please refer to Fig 4 in the Appendix.
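The measures above follow directly from the four cells of the matrix; a minimal check of the arithmetic, using the counts and cell labels shown in the table, is:

```python
# Counts taken from the regression confusion matrix at the 0.5 cut-off,
# using the TP/FP/FN/TN labelling shown above.
TP, FP, FN, TN = 2068, 79, 224, 129

accuracy = (TP + TN) / (TP + TN + FP + FN)   # 0.8788
sensitivity = TP / (TP + FN)                 # 0.9022
specificity = TN / (TN + FP)                 # 0.6201
print(accuracy, sensitivity, specificity)
```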

E) DECISION TREE ANALYSIS:

The same dataset is considered for the decision tree analysis, and hence the Tree node is
connected to the Replacement node as shown in Fig 1. The parameter settings in the Tree node
were:
• For the tree construction, the splitting criterion was a Chi-square test with significance
level 0.2.
• The number of observations sufficient for a split search was set to 2000.
• The model assessment measure was set to proportion misclassified.

i) Prevention of Overfitting: This is done by reducing the number of leaves in the tree to an
optimal level, so as to reduce both the complexity and the overfitting of the model. The
reduction criterion is based on the misclassification plot obtained from the tree analysis
output, which is shown in Fig 2.

From the misclassification plot, we can see that increasing the tree beyond 16 leaves brings no
substantial decrease in the misclassification rate. Hence, 16 leaves is an appropriate choice to
prevent overfitting of the model as well as to reduce complexity; a sketch of an equivalently
capped tree is given below.
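As an illustration of this choice, the sketch below fits a comparable tree outside SAS. scikit-learn does not offer the SAS chi-square splitting criterion, so this only mimics the spirit of the setup; the file and column names are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hedged counterpart of the pruned tree: the tree is capped at 16 leaves and
# requires 2000 observations before a split is attempted, mirroring the SAS
# settings above in spirit. File and column names are assumptions.
df = pd.read_csv("churn_preprocessed.csv")
X, y = df.drop(columns=["churn"]), df["churn"].astype(int)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.33, random_state=42)

tree = DecisionTreeClassifier(max_leaf_nodes=16, min_samples_split=2000,
                              random_state=42)
tree.fit(X_train, y_train)
print(1 - tree.score(X_valid, y_valid))      # validation misclassification rate
```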

ii) THE TREE:


The decision tree resulting from the analysis is shown in Fig 3 (Appendix). As seen from the
figure, the splitting decisions are made on the basis of the following variables:
Customer Service Calls, Day Minutes, International Plan, VMail Plan and State.
iii) CLASSIFICATION CALCULATIONS ON CONFUSION MATRIX:

The confusion matrix derived from the tree analysis output at the 0.5 cut-off is [Baesens, B., 2009]1:

                    Predicted FALSE   Predicted TRUE   Total
Observed FALSE        2106 (TP)          41 (FP)       2147
Observed TRUE           87 (FN)         266 (TN)        353
Total                      2193             307         2500

Classification Accuracy = (TP+TN)/(TP+TN+FP+FN) = (2106+266)/2500 = 0.9488
Sensitivity = TP/(TP+FN) = 2106/(2106+87) = 0.9603
Specificity = TN/(TN+FP) = 266/307 = 0.866

The ROC curve for the decision tree is compared with the ROC curve for the regression analysis
in Fig 4 of the Appendix.

F) COMPARISON OF DECISION TREE AND REGRESSION ANALYSIS:

The following table assembles some comparative information from the corresponding analyses.

Measurement               LR        DT
CA                        0.8788    0.9488
Sensitivity               0.9022    0.9603
Specificity               0.6201    0.866
Root ASE                  0.2787    0.216
Validation ASE            0.2922    0.2414
Test ASE                  0.287     0.2246
Misclassification Rate    0.1148    0.0528

From the sensitivity comparison we can see that the decision tree model performs better than the
logistic regression. This is also reflected by the various ASE figures as well as the
misclassification rate.

The ROC curves (Appendix: Fig 4) for the two analyses intersect each other, so the
better-performing methodology is not evident from mere inspection of the curves; hence an AUC
calculation needs to be carried out, as sketched below.
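A hedged sketch of that AUC calculation is given below; y_test, p_lr and p_dt are assumed placeholder names for the observed churn flags on the test partition and the predicted churn probabilities from the regression and tree models (e.g. the earlier sketches).

```python
from sklearn.metrics import roc_auc_score

# Compare the two classifiers by their area under the ROC curve. y_test,
# p_lr and p_dt are assumed placeholders for the test-set labels and the
# predicted churn probabilities of the regression and tree models.
auc_lr = roc_auc_score(y_test, p_lr)
auc_dt = roc_auc_score(y_test, p_dt)
print(f"LR AUC = {auc_lr:.3f}, DT AUC = {auc_dt:.3f}")   # the higher AUC wins
```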
APPENDIX:

Figure 2: Misclassification Plot of Decision Tree Analysis:


FIGURE 3: DECISION TREE:
FIGURE 4: COMPARISON OF ROC CURVES FOR LR AND DT:
Question 2 (20 marks, max. 3 pages): Review of a Technical paper on Data Mining.

• Title, authors and complete citation (journal name, book title, issue, year, …)

The paper discussed here is titled: An analysis of customer retention and insurance claim
patterns using data mining: a case study. The case study was carried out by K.A. Smith and R.J.
Willis (both from Monash University) and M. Brooks (from Australian Associated Motor Insurers
Limited, Australia)4. It was published in the Journal of the Operational Research Society, Vol.
51, No. 5, Pg 532-541, in the year 2000.

• The data mining problem considered

The case study aims at demonstrating the capability of data mining in the insurance industry
through the analysis of data available from an insurance company. This company records every
financial transaction and claim in a large and effective data warehouse. The purpose of the
analysis is to produce results that can increase the market growth and profitability of the
organisation. From a holistic perspective, the expected outcome is a pricing level that
corresponds to the level of claim costs while, at the same time, pricing policies so as to
retain existing customers and attract new ones.

To set such a pricing level, knowledge needs to be gathered about which customers are likely to
renew their policies, their levels of risk and their sensitivity to price increases. Hence, Part
I of the problem is the understanding of customer retention patterns, which can be addressed by
classifying policy holders as likely to renew or terminate their policies.

Further, the pricing of insurance products is interrelated with other factors such as policy
acquisition and retention, and the relationship between pricing, market growth and profitability
also needs to be established. Part II of the problem concerns understanding claim patterns and
identifying risk at the customer level.

• The data mining techniques used

For the Part I analysis, a sample was used comprising 20,914 motor vehicle policy holders whose
policies were due for renewal in April 1998. The data contained demographic information, policy
details, policy holder history, etc. The factors affecting the renewal of a policy were agreed
to be price, service and insured value. Preliminary statistical analysis was performed to assess
the impact of these factors; it confirmed that pricing and sum insured did play a major role in
the renewal process. The significance of the other factors was not established individually,
although they could become significant in combination.

The data mining problem was handled with the SAS Enterprise Miner software. With the help of
this software, the data was pre-processed (by means of variable selection and then
transformation). Several variables were rejected at this stage based on a low χ2 value with
respect to the dependent variable. The remaining 29 independent variables and one dependent
variable (Terminated) were partitioned into training and test sets, and then regression,
decision tree analysis and neural network modelling were performed.
Part II, the analysis of claims, is carried out by studying the impact of growth in the number
of policy holders on the decrease in profitability. Claim arrivals are assumed to follow a
Poisson process and maximum likelihood is used to estimate the model. This analysis aims at
understanding the growth patterns and their impact on profitability; a tool is to be built to
predict the claim costs of policy holders and, on that basis, a strategy is to be devised to
increase profitability. Data for all transactions relating to policy holders paying premiums in
the first quarter of 1996, 1997 and 1998 is considered.

Data pre-processing here involved adding each policy holder's contribution to the key
performance indicators; the sample size for each year is different. Preliminary analysis showed
that the exceptional growth over the previous two years had occurred among people under 22 years
of age within certain risk areas and sums insured. The growth rate for this segment was almost
twice the growth rate of the population, and a 16-20 year old was found to be more likely to
have a claim than 51-71 year olds. An undirected clustering of the data was therefore performed
in the data mining step. The input variables were the same as those retained after variable
selection in the Part I analysis, and 50 clusters were used.

The pricing analysis also involves data mining, building on the previous analyses of customer
retention and risk. Information about current pricing and profitability, growth and claim
patterns is used in a unified data mining methodology to arrive at an optimal solution which
balances pricing, growth, retention and profitability.

• The results reported

From the Part I customer retention analysis, it was observed that the neural network produced
the best results for classifying policies as likely to renew or terminate. The accuracy of the
best neural network model was then calculated, with a policy classified as terminated if the
probability given by the neural network exceeds 0.5. The results for both training and test sets
are presented for this threshold of 0.5 and compared with those obtained for a threshold of 0.1.
Based on this comparison, an optimal decision threshold can be chosen. For future marketing
campaigns aimed at customers who are likely to terminate their policies, a lower threshold is
chosen so as to include as many such customers as possible, even though this means also
addressing people who are likely to renew. Conversely, if the threshold is set to a higher
value, the classification of a policy holder as a termination has to be quite certain; otherwise
the business runs the risk of addressing the wrong section of policy holders. Depending on this
analysis, the costs of misclassification for both scenarios are considered, and these can be
used to increase profitability in the subsequent pricing plan.

In the Part II claims analysis, 50 clusters were produced using a basic k-means algorithm. The
clusters exhibited different behaviour in terms of cost and frequency ratio. Applying the same
clustering algorithm to each year's dataset then produced results indicating each cluster's
behaviour across the years. This analysis provides insight into the effect of changing growth on
the key performance indicators.
The pricing analysis produces the following results:
a) The average claim cost and claim frequency of each group of policy holders.
b) The required adjustment in premium price, based upon the current policy holder cost ratio and
the acceptable cost ratio from the model.
c) A customer retention classification based on the decision threshold discussed in the customer
retention analysis.
d) A price setting based on the cost ratio, considering the risk of a customer terminating the
policy if the price increase goes beyond their tolerance limit. A price increase affects growth
and profitability, but also affects the retention of customers; the price setting therefore
attempts to balance price, profitability and retention through an iterative process which
monitors the key indicators over time.

• A critical discussion of the model and results (assumptions made, shortcomings,
limitations, ...)

This case study uses historical data from an insurance company to predict customer retention and
policy claim patterns, and on that basis to devise a pricing plan that addresses customers who
are likely to renew and attracts customers who are likely to terminate their policies. This is
done on the basis of the analyses described above. However, the analysis has certain
limitations:
• The selection of the classification threshold as well as the cost of misclassification is
essential for proper decision making, but neither can be properly judged from the given
analysis.
• Implementation of the proposed methodology will involve major changes in the organisation's
structure in terms of technology and resources.
• Many other issues of the insurance industry, which are not in the scope of this paper, still
need to be tackled, such as the early detection of customers terminating their policies and the
factors that govern it.

The assumptions made for this analysis were:
• Behaviour of policy holders whose policy is due for renewal in a certain month was
assumed to represent all policy holders.
• Arrival of claims was assumed to be an inhomogeneous Poisson process dependent upon
policy holder characteristics and environmental factors.

The methodological strengths worth mentioning in this case study are:
• A structured data mining approach comparing neural networks, decision trees and logistic
regression.
• Detailed pre-processing of the data in a systematic way.
• Appropriate use of clustering in the claims analysis.
Question 3 (20 marks, max. 30 lines per concept)
Explain the following concepts:
• Asset correlation parameter in the Basel II capital accord
• Winsorisation
• Leave-one out cross validation
• AUC based pruning of input variables

The asset correlation parameter is an important measure of systematic risk: the degree of
exposure of an obligor to systematic risk is expressed through it. The asset correlation shows
how the asset value of one borrower is interlinked with the asset value of another borrower.
Equivalently, the correlation can be described as the dependence of the asset value of a
borrower on the general state of the economy, with all borrowers linked to each other through
this single risk factor. 5

Under the Basel Accord, the asset correlation parameter was calibrated using individual exposure
data and retail exposure data from government sources and various financial institutions in the
USA and Europe. The calibration of the asset correlation depended upon the characteristics of
the borrower. A top-down approach was used by the Basel Committee for retail exposures: the
variation in the total loss rates of a portfolio over time was measured and used to estimate the
correlation of the individual asset values. If assets are highly correlated, they are more
likely to move in the same direction, which increases the volatility of the aggregate loss rate
of the portfolio. In the Basel calibration, the asset correlation declines as the probability of
default of the borrower increases.

Since different borrowers and asset classes have different degrees of dependence on the general
economy, the asset correlation is asset class dependent. Hence, in the Basel II risk weight
function model, different values (or formulas) of the asset correlation are used for different
asset classes, viz. retail, wholesale, corporate, banks, etc. A sketch of two of these formulas
is given below.
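As an illustration, the sketch below encodes two of these correlation formulas as given, to the best of my reading, in the explanatory note cited as reference 5; treat the exact coefficients as an assumption to be checked against the accord text.

```python
import math

def corporate_asset_correlation(pd_):
    # Corporate/bank formula as given (to the best of my reading) in the
    # explanatory note cited as reference 5: the correlation falls from
    # 0.24 to 0.12 as the probability of default increases.
    w = (1 - math.exp(-50 * pd_)) / (1 - math.exp(-50))
    return 0.12 * w + 0.24 * (1 - w)

def other_retail_asset_correlation(pd_):
    # 'Other retail' uses the same functional form, bounded between 0.03
    # and 0.16 with a decay factor of 35 instead of 50.
    w = (1 - math.exp(-35 * pd_)) / (1 - math.exp(-35))
    return 0.03 * w + 0.16 * (1 - w)

print(corporate_asset_correlation(0.01))     # ~0.19 for a 1% PD
print(other_retail_asset_correlation(0.01))  # ~0.12 for a 1% PD
```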

• Winsorisation

Winsorisation is a statistical data processing method used to make a data set more robust
and adjust it against the outliers or extreme values. Winsorisation involves the modification to
the value of an item; with no change to its estimation weight.6 .The presence of outliers in a
dataset can affect the mean of the dataset. Hence, to reduce this deviation in the behaviour,
Winsorisation of data is done. Winsorisation involves setting of the outlier values to a specific
percentile of the data e.g. For 90% Winsorisation of a dataset, all the data below the 5 th percentile
is set to the 5th percentile, and the data above the 95th percentile is set to the 95th percentile. The
outlier values are hence replaced.

The winsorised mean is then calculated by averaging the entire dataset winsorised by the process
described above. It differs from truncating a dataset, where the outlying data is omitted
completely: a 5% truncated (trimmed) mean, for example, would be calculated only on the data
between the 5th and 95th percentiles. The winsorised mean is a useful estimator because it
reduces the deviation caused by the outliers while preserving the central tendency of the
dataset. A minimal sketch is given below.
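A minimal numerical sketch of the 90% winsorisation described above, on synthetic data:

```python
import numpy as np

# Minimal sketch of a 90% winsorisation on synthetic data: values below the
# 5th percentile are set to the 5th percentile, values above the 95th
# percentile to the 95th percentile, and the winsorised mean is then taken.
x = np.random.default_rng(0).normal(size=1000)
lo, hi = np.percentile(x, [5, 95])
x_wins = np.clip(x, lo, hi)

winsorised_mean = x_wins.mean()
truncated_mean = x[(x >= lo) & (x <= hi)].mean()   # trimmed counterpart
print(winsorised_mean, truncated_mean)
```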

However, winsorisation has disadvantages. Because information about the outliers and their
values is taken into account rather than discarded, the method uses more information from the
distribution or sample than the median does. In the case of unsymmetrical data, where the mean
is skewed, winsorisation results in a biased estimate, since more data is replaced at one end of
the distribution than at the other.

• Leave-one out cross validation

Leave-one-out cross validation is a cross validation process to analyse the accuracy of a learning
statistical model in predicting data on which it was not trained. This procedure is carried out by
training the model multiple times, using all but one set of data points. The procedure followed is
as given below:

For a data set having N data set points;

• Temporarily remove the 1st data point from the training set.
• Run the training statistical model on the remaining N-1 points.
• Test the removed data point and make a note of the error.
• Calculate the error over all data points.
• Repeat this procedure for each data point from 1 to N.

Fig: A) Leave one out Cross Validation 7

As seen from the figure above, the single data point left out of each training run serves as the
validation set, while the remaining N-1 data points form the training set. Because the model has
to be refit N times, the procedure demands a lot of computational resources and can consequently
be very expensive. A minimal sketch is given below.
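The sketch below writes the procedure as an explicit loop on a public example dataset; the logistic model and the dataset are illustrative choices, not part of the original discussion.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

# Minimal sketch of the leave-one-out procedure on a public example dataset:
# each of the N points is held out once, the model is trained on the
# remaining N-1 points, and the held-out errors are averaged.
X, y = load_breast_cancer(return_X_y=True)

errors = []
for i in range(len(X)):
    mask = np.ones(len(X), dtype=bool)
    mask[i] = False                                        # remove the i-th point
    model = LogisticRegression(max_iter=5000).fit(X[mask], y[mask])
    errors.append(model.predict(X[i:i + 1])[0] != y[i])    # error on the held-out point

print("LOOCV error rate:", np.mean(errors))
```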
• AUC based pruning of input variables

Pruning is a process which assists in removing irrelevant variables from a dataset. By doing
this, the set of input variables on which, for example, a decision tree is built can be reduced,
and with it the complexity of the model. As the name suggests, AUC-based pruning of input
variables removes input variables from a given dataset on the basis of the AUC (Area Under the
ROC Curve) measure.

To understand the AUC, we first have to understand the Receiver Operating Characteristic (ROC)
curve. The ROC curve is a two-dimensional graphical illustration of a classifier's performance
over the whole range of classification thresholds 2. It plots the sensitivity on the Y-axis
against (1-specificity) on the X-axis for varying values of the classification threshold
[sensitivity = proportion of goods predicted as good; 1-specificity = proportion of bads
predicted as good]. A typical ROC plot is shown below. The better-performing classifier is the
one whose curve lies closer to the top-left corner, i.e. whose sensitivity is highest for a
given (1-specificity). There are, however, cases in which the ROC curves of different
classifiers intersect, making it difficult to compare the curves by inspection. This problem is
handled by the AUC.

[Figure: Illustrative ROC curves, plotting sensitivity against (1 - specificity), for Scorecard
A, Scorecard B and a random model.]

In the case of intersecting ROC curves, the Area Under the Curve (AUC) is taken into
consideration as the selection criterion. The AUC is useful in that it aggregates performance
across the entire range of threshold trade-offs, and its interpretation is easy: the higher the
AUC, the better, with 0.50 indicating random performance and 1.00 denoting perfect performance.

In statistical terms, the AUC represents the probability that a randomly sampled 'good' receives
a better score than a randomly sampled 'bad'. In modern financial institutions the AUC typically
ranges between 70% and 90%: application scorecards tend to have an AUC of around 70%, whereas
behavioural scorecards reach higher values of around 90%. A sketch of AUC-based input pruning is
given below.
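A hypothetical sketch of such pruning: each candidate input is scored by its own univariate AUC against the good/bad target, and inputs whose AUC stays close to 0.5 (random) are dropped. The synthetic data and the 0.55 cut-off are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

# Hypothetical AUC-based pruning: score each candidate input by the AUC it
# achieves on its own against the target and drop inputs close to random.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=1000)
candidates = pd.DataFrame({
    "informative": y + rng.normal(size=1000),   # related to the target
    "noise": rng.normal(size=1000),             # unrelated to the target
})

kept = []
for col in candidates.columns:
    auc = roc_auc_score(y, candidates[col])
    auc = max(auc, 1 - auc)        # the direction of the score does not matter
    if auc >= 0.55:                # prune inputs that barely beat random
        kept.append(col)

print("Retained inputs:", kept)
```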
Question 4 (25 marks, max. 2 pages)

• Explain what is meant by “long run defaulted-weighted average loss rate given default”.

Loss given default (LGD) is the fraction of the exposure at default (EAD) which will not be
recovered when a default occurs. The Basel Accord requires financial institutions following an
IRB approach to use LGD estimates that cannot be less than the long run default-weighted average
loss rate given default. The default-weighted average takes into account the default rate
incurred in each year together with that year's LGD, over a long observation period.
E.g. a simple average vs. weighted average LGD calculation is shown below [Resti A., Sironi A.,
2005]:

Year                Default Rate (%)    LGD (%)
1998                       1               20
1999                       3               30
2000                       7               45
2001                       3               40
2002                       2               20
Simple average             3               31
Weighted average                           37

Here each default is given equal weight and defaults from all years are pooled into a single
group. Hence, the long run default-weighted average loss rate given default is the
default-weighted average LGD measured over a long period.9 It is an estimate of the losses that
would be incurred over the long run, from the emergence of the default through the recovery
process.
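Restating the weighted-average figure in the table as a formula, with $DR_i$ the default rate (%) and $\mathrm{LGD}_i$ the loss rate (%) of year $i$:

$$\mathrm{LGD}_{\text{default-weighted}} \;=\; \frac{\sum_i DR_i \cdot \mathrm{LGD}_i}{\sum_i DR_i} \;=\; \frac{1(20) + 3(30) + 7(45) + 3(40) + 2(20)}{1 + 3 + 7 + 3 + 2} \;=\; \frac{585}{16} \;\approx\; 37\%$$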

• Why does Basel not use the time-weighted or exposure weighted average loss rate given
default?

Two other ways to measure the average loss rate are time-weighted and exposure-weighted
averaging. Time-weighted averaging applies default weighting only within each year, i.e. each
default has equal weight within its annual group; if a year contains more than one group of
defaults, the average loss rate for that year is computed first and the yearly averages are then
averaged. Exposure-weighted averaging, on the other hand, weights the loss rate of each group of
defaults by its exposure. E.g. suppose that
in Year 1, 20 defaults of $40 occurred with an average loss of 10%, and
in Year 2, 50 defaults of $100 occurred with an average loss of 90% and 30 defaults of $140
occurred with an average loss of 60% [Baesens, B., LGD Modelling, 2009]3. Then:
$$\text{Time-weighted average} \;=\; \frac{10 + \frac{50 \cdot 90 + 30 \cdot 60}{50 + 30}}{2} \;=\; \frac{10 + 78.75}{2} \;\approx\; 44.4\%$$

$$\text{Exposure-weighted average} \;=\; \frac{20 \cdot 40 \cdot 0.10 + 50 \cdot 100 \cdot 0.90 + 30 \cdot 140 \cdot 0.60}{20 \cdot 40 + 50 \cdot 100 + 30 \cdot 140} \;=\; 71\%$$
As seen from the first equation, the time-weighted average smooths out the high LGD: the lower
loss rate observed in the same year, and the equal weight given to each year regardless of how
many defaults it contains, pull the average down. This can understate the expected LGD and is
hence highly undesirable; it also masks the correlation between the aggregate default rate and
LGD.
In the case of the exposure-weighted average, the inclusion of the exposure component, which is
given equal weight per unit of exposure, can lead to an overestimated loss rate. As observed in
the second equation, the presence of the third component, 30 defaults of $140, shifts the loss
rate towards the higher end. This condition is also undesirable. Hence, Basel uses neither
time-weighted nor exposure-weighted averaging.

• Why does Basel require economic downturn LGDs?

Basel requires economic downturn conditions to be taken into account so that the estimated loss
rates accurately reflect the effect of such conditions. During an economic downturn, losses on
defaulted loans are likely to be higher than under regular conditions: e.g. the value of
collateral declines, and hence recovery rates on defaulted exposures decrease accordingly.

These risks need to be reflected in order to guarantee sufficient capital to cover the losses
realised under an economic downturn scenario. Average loss rates calculated over long periods of
time can understate loss rates during a downturn and may therefore need to be adjusted upward to
reflect adverse economic conditions. By including economic downturn conditions in the LGD
estimate, the systematic volatility in credit losses over time can be appropriately captured.

• Discuss different ways of coming up with an economic downturn LGD (do a literature
search for this).

There are several ways in which an economic downturn LGD can be estimated, as suggested by
various financial organisations. These are discussed below3:

Bootstrapping [Van Gestel, 2008]: a large number of resamples of the observed LGD dataset is
drawn, each obtained by sampling with replacement from the original dataset. The mean of each
resample is calculated, and the process is repeated many times (e.g. 100,000 times). This gives
the distribution of the mean LGD, and management then chooses a high percentile of this
distribution as the economic downturn LGD; a minimal sketch is given below.
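A minimal sketch of this bootstrap, on a synthetic LGD sample and with an assumed 95th-percentile choice:

```python
import numpy as np

# Minimal sketch of the bootstrap described above, on a synthetic LGD sample:
# resample the observed LGDs with replacement, record the mean of each
# resample, and take a high percentile of the resulting distribution of means
# as the downturn LGD. Both the sample and the percentile are assumptions.
rng = np.random.default_rng(0)
observed_lgd = rng.beta(2, 3, size=500)          # stand-in for observed LGDs

boot_means = [rng.choice(observed_lgd, size=observed_lgd.size, replace=True).mean()
              for _ in range(100_000)]

downturn_lgd = np.percentile(boot_means, 95)     # management-chosen percentile
print(downturn_lgd)
```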

Mapping formula: A mapping formula is used by the United States Federal Reserve to calculate
economic downturn LGD, and the equation is given by:

LGDed = 0.08 + 0.92 LGDav, where LGDav = Average Expected Loss given default

Other methods include averaging the LGDs of the worst years, although the definition of 'worst'
is relative. A further approach is to quantify the impact of an economic downturn on LGD and
adjust the LGD equation accordingly.

References:
1. Baesens, B., Using SAS Enterprise Miner 4.3, University of Southampton, 2009

2. Baesens, B., Credit Scoring and Data Mining, University of Southampton, 2009

3. Baesens, B., LGD Modelling, University of Southampton, 2009

4. Smith KA, Willis RJ, Brooks M,(2000); An analysis of customer retention and insurance claim
patterns using data mining: a case study; Journal of the Operational Research Society , Vol 51,
No 5: Pg 532-541

5. Basel Committee on Banking Supervision; An Explanatory Note on the Basel II IRB Risk
Weight Functions, (2005). Pg 8-11

6. Winsorisation; definition from the National Statistical Service, Australian Bureau of
Statistics, Australia

7. Gutierrez Osuna, R., Diagram: Leave-One-Out Cross Validation, Lecture 13: Validation, Pg 8,
Wright State University

8. Resti A., Sironi A., (2005); Recovery Risk: The Next Challenge in Credit Risk Management,
Risk Books

9. Basel Committee on Banking Supervision; Guidance on Paragraph 468 of the Framework Document
(2005), Basel II Accord

Bibliography:

RMA Capital Working Group; Downturn LGDs for Basel II (2005)

Schuermann, T. (2004): What do we know about Loss Given Default?, Federal Reserve Bank of New
York, Liberty Street, NY

Miu, P., Ozdemir, B., (2005): Basel Requirement of Downturn LGD: Modeling and Estimating
PD & LGD Correlations

Basel Committee on Banking Supervision; International Convergence of Capital Measurement and
Capital Standards (2005), Pg 8-11
