PREDICTIVE CLUSTERING FOR CREDIT RISK ANALYSIS

Jay B. Simha
Abiba Systems, Bangalore, India
jay.b.simha@abibasystems.com
ABSTRACT
Credit risk modeling is a well-researched area in both the
statistical and AI communities. Several models cited in the
research literature use a model built on the whole data set. In
this study, a hybrid predictive model framework based on fuzzy
clustering and statistical/machine learning classifiers is
proposed for credit risk analysis. This hybrid approach enables
building rules/functions for different groups of borrowers
separately. In the first stage, customers are segmented into
clusters that are characterized by similar features. In the second
step, classifiers are built for each group to obtain scoring
rules/functions that provide a risk level for each customer.
Multiple classifiers are evaluated on each segment, and the best
classifier for each segment is selected for final scoring. The
main advantage of integrating the two techniques is that the
resulting models may predict the risk of granting credit to each
client better than either method used separately. The results are
compared with those of classifiers built on the whole data set,
according to classification performance and the business
objective. The results support the hypothesis that a hybrid model
based framework provides better results than a global model.
Keywords: Hybrid models, Fuzzy C-means, Classifiers, Credit Risk
1. INTRODUCTION
One of the key decisions financial institutions have to make is
whether or not to grant a loan to a customer. This decision
basically boils down to a binary classification problem which aims
at distinguishing good payers from bad payers. Numerous methods
have been proposed in the literature to develop credit scoring
models. These models include traditional statistical methods (e.g.
logistic regression [2]), nonparametric statistical models (e.g.
k-nearest neighbor [5] and classification trees [8]), clustering
[3], fuzzy logic [7] and neural network models [1,9]. Most of
these studies focus primarily on developing classification models
with high predictive accuracy. However, all these approaches build
a global model. It can be argued that the potential savings from
predicting risks in certain segments can outweigh overall
classification accuracy across all the segments.
Zakrzewska [10] developed a model based on clustering and
decision trees. Since that approach used a single classifier
(decision tree) for scoring, it may not be applicable across
different data sets. In addition, a soft clustering method like
fuzzy clustering is superior to hard k-means clustering as it
provides better cluster quality. Hence, this research investigates
these two concepts, i.e. the use of soft clustering to identify
the segments and the use of the best classifier for each segment,
with the hypothesis that the resulting classifier system will
provide better control over scoring.
In this paper we present a framework that uses fuzzy clustering
and different classifiers to build credit scoring models from
local patterns.
2. SYSTEM ARCHITECTURE
The proposed system, which is expected to support the evaluation
of credit risk by building classifiers, is composed of three main
modules.
Figure 1. System Architecture
The first module is a segmentation module, where the data set is
split into clusters with homogeneous behavior. We use the fuzzy
C-means algorithm, described in Section 3, for clustering. The
second module is the classifier learning module, which builds a
model for each classifier on each cluster obtained by the previous
module. In the third module, the best classifier for each segment
is selected based on the configured criteria. In this research we
have selected two criteria for evaluation, namely classification
accuracy and true positive rate.
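
The selection step of the third module can be sketched as follows. This is a minimal illustration, not the system's actual code: the function names, the segment and classifier labels, the toy predictions, and the choice of label 1 as the positive class are all assumptions made for the example.

```python
def accuracy(y_true, y_pred):
    """Fraction of cases whose predicted class matches the actual class."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / len(y_true)


def true_positive_rate(y_true, y_pred, positive=1):
    """Share of actual positives (assumed: label 1) predicted as positive."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not preds_for_positives:
        return 0.0
    hits = sum(1 for p in preds_for_positives if p == positive)
    return hits / len(preds_for_positives)


CRITERIA = {"accuracy": accuracy, "tpr": true_positive_rate}


def select_best(predictions, y_true, criterion="accuracy"):
    """Pick the top-scoring classifier per segment.

    predictions: {segment: {classifier_name: predicted labels}}
    y_true:      {segment: actual labels}
    Returns      {segment: (classifier_name, score)}
    """
    score = CRITERIA[criterion]
    best = {}
    for segment, by_classifier in predictions.items():
        scored = {name: score(y_true[segment], pred)
                  for name, pred in by_classifier.items()}
        winner = max(scored, key=scored.get)
        best[segment] = (winner, scored[winner])
    return best


# Hypothetical toy data: two segments, two classifiers.
y_true = {"seg_a": [1, 0, 1, 0], "seg_b": [1, 1, 0, 0]}
predictions = {
    "seg_a": {"tree": [1, 0, 0, 0], "logit": [1, 1, 1, 1]},
    "seg_b": {"tree": [1, 1, 0, 1], "logit": [0, 0, 0, 0]},
}
by_accuracy = select_best(predictions, y_true, "accuracy")
by_tpr = select_best(predictions, y_true, "tpr")
```

In this toy data the two criteria disagree on the first segment, which is why the criterion is kept configurable.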
3. FUZZY C-MEANS CLUSTERING

Fuzzy C-means clustering (FCM) is a clustering technique that
differs from hard k-means, which employs hard partitioning. FCM
employs fuzzy partitioning, such that a data point can belong to
all groups with different membership grades between 0 and 1.
FCM is an iterative algorithm. The aim of FCM is to find cluster
centers (centroids) that minimize a dissimilarity function. A
brief summary of the considerations and major steps is given
below.
The algorithm first posits a given number 'c' of clusters and an
initial membership value (from zero to one) for each point (a
customer's attribute vector) in each of the 'c' clusters. The
pseudo-partition cluster membership values for each point are
chosen so that they add up to one, with the membership values not
all equal at first. The algorithm then successively adjusts the
membership values of each point in each of the various clusters,
based on the point's distance from the cluster's center compared
to its distances from the other cluster centers. The algorithm
then uses the new membership degrees to iteratively move the
cluster center points toward mutually better locations. The
Euclidean distance based "center" of each cluster is calculated
from all the customers' attribute vectors, weighted by their
membership degrees in the cluster. The weighting is also
recomputed based on the membership values. The algorithm stops
when the pseudo-partition memberships collectively change by less
than a predetermined amount on successive iterations. The
mathematical treatment of the algorithm can be found in [4]. The
algorithm used in the research is given in Fig 2.
Fig 2. Fuzzy clustering algorithm
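
The steps described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the fuzzifier m = 2, the tolerance, the iteration cap, the random seed and the toy customer vectors are all chosen for the example, and the update formulas are the standard ones from the FCM literature.

```python
import math
import random


def fuzzy_c_means(points, c, m=2.0, tol=1e-5, max_iter=300, seed=0):
    """Return (centers, memberships) for points given as equal-length tuples."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    power = 2.0 / (m - 1.0)

    # Initial memberships: random, unequal, each point's row adding to one.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])

    centers = []
    for _ in range(max_iter):
        # Move each center to the mean of all points weighted by u ** m.
        centers = []
        for k in range(c):
            w = [u[i][k] ** m for i in range(n)]
            w_sum = sum(w)
            centers.append(tuple(
                sum(w[i] * points[i][d] for i in range(n)) / w_sum
                for d in range(dim)))

        # Re-derive memberships from each point's distance to one center
        # relative to its distances to all centers.
        new_u = []
        for p in points:
            dist = [max(math.dist(p, ctr), 1e-12) for ctr in centers]
            new_u.append([
                1.0 / sum((dist[k] / dist[j]) ** power for j in range(c))
                for k in range(c)])

        # Stop once no membership value moves by more than the tolerance.
        change = max(abs(new_u[i][k] - u[i][k])
                     for i in range(n) for k in range(c))
        u = new_u
        if change < tol:
            break
    return centers, u


# Hypothetical customer attribute vectors forming two obvious groups.
customers = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, memberships = fuzzy_c_means(customers, c=2)
labels = [row.index(max(row)) for row in memberships]
```

Taking the cluster with the highest membership, as in the last line, turns the soft partition into the hard segments that the classifier learning module works on.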
4. CLASSIFIERS
A classifier is a statistical/machine learning function which
maps the independent attributes to the dependent attribute with
some confidence. There are different types of classifiers [6]. In
this work, five classifiers are used, namely naïve Bayes, logistic
regression, decision trees, artificial neural networks and support
vector machines. A brief overview of these techniques is given
below:
4.1 Naïve Bayes classifier
The probability model for a classifier is a conditional model
over a dependent class variable C with a small number of outcomes
or classes, conditional on several feature variables F1 through
Fn. This conditional model can be expanded using Bayes' theorem
into the class prior times the joint likelihood of the features
given the class, divided by the joint probability of the features.
However, this expanded form assumes interdependence among the
features. When the model is relaxed with the assumption of
independence, the conditional distribution over the class variable
C can be expressed as the product of the class prior and the
per-feature conditional distributions, divided by Z, where Z is a
scaling factor dependent only on F1, F2, ..., Fn, i.e., a constant
if the values of the feature variables are known.
Fig 3. Naïve Bayesian classifier
Models of this form are much more manageable, since they factor
into a so-called class prior p(C) and independent probability
distributions p(Fi|C). This is the naïve Bayes classifier, which
has shown surprisingly good performance on real-life data sets.
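
For reference, the two forms of the model discussed in this subsection can be written out in standard notation. This is a reconstruction from the definitions in the text (class variable C, features F1 through Fn, scaling factor Z), not a copy of the original figure; identifying Z with the joint probability of the features is the usual convention and is assumed here.

```latex
% Bayes' theorem applied to the class variable C and the features F_1, ..., F_n
\[
  p(C \mid F_1, \dots, F_n)
    = \frac{p(C)\, p(F_1, \dots, F_n \mid C)}{p(F_1, \dots, F_n)}
\]

% With the features assumed conditionally independent given C
\[
  p(C \mid F_1, \dots, F_n)
    = \frac{1}{Z}\, p(C) \prod_{i=1}^{n} p(F_i \mid C),
  \qquad Z = p(F_1, \dots, F_n)
\]
```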
4.2 Logistic Regression
Logistic regression is a widely used classifier in credit risk
modeling. Logistic regression can predict the probability (P)
that an example X belongs to one of two predefined classes.
Suppose example X = (x1, x2, x3, ..., xn). As in linear
regression, logistic regression gives each xi a coefficient wi
which measures the contribution of xi to variations in P. First, a
logistic transformation of P, logit(P), is defined as the natural
logarithm of the odds P/(1 - P), where P can only range from 0 to
1, while logit(P) ranges from -∞ to ∞. Logit(P) is then matched by
a linear function of the feature variables.
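
In standard notation the transformation and the linear model read as follows. This is a reconstruction from the description above; the intercept term w_0 is an assumption, since the text mentions only the per-feature coefficients.

```latex
% Logistic transformation of the probability P
\[
  \operatorname{logit}(P) = \ln\frac{P}{1 - P}
\]

% Linear model on the logit scale (w_0 is an assumed intercept)
\[
  \operatorname{logit}(P) = w_0 + \sum_{i=1}^{n} w_i x_i
  \quad\Longleftrightarrow\quad
  P = \frac{1}{1 + e^{-\left(w_0 + \sum_{i=1}^{n} w_i x_i\right)}}
\]
```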
4.3 Decision trees
Decision tree learning is a common method used in data mining.
The goal is to create a model that predicts the value of a target
variable based on several input variables. Each interior node
corresponds to one of the input variables. There are edges to
children for each of the possible values of that input variable.
Each leaf represents a value of the target variable given the
values of the input variables represented by the path from the
root to the leaf.
A tree can be "learned" by splitting the source set into subsets
based on an attribute value test. Splitting can be based on
different criteria. Two of the most widely used measures are
information gain and Gini index.
Fig 4. Decision tree classifier (information gain and Gini index)

This process is repeated on each derived subset in a recursive
manner called recursive partitioning. The recursion is completed
when the subset at a node all has the same value of the target
variable, or when splitting no longer adds value to the
predictions.
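
The two splitting measures named above can be illustrated with a short sketch. It uses the common textbook definitions (entropy-based information gain and size-weighted Gini impurity), which are assumed here rather than taken from the original figure; the attribute names and the four toy applicants are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((k / n) ** 2 for k in Counter(labels).values())


def split_by(rows, labels, attr):
    """Group the labels by the value each row takes on attribute attr."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    return groups


def information_gain(rows, labels, attr):
    """Entropy before the split minus size-weighted entropy after it."""
    n = len(labels)
    groups = split_by(rows, labels, attr).values()
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups)


def gini_index(rows, labels, attr):
    """Size-weighted Gini impurity after the split (lower is better)."""
    n = len(labels)
    groups = split_by(rows, labels, attr).values()
    return sum(len(g) / n * gini(g) for g in groups)


# Hypothetical applicants: home ownership separates the classes perfectly,
# employment status carries no information about them.
rows = [
    {"owns_home": "yes", "employed": "yes"},
    {"owns_home": "yes", "employed": "no"},
    {"owns_home": "no", "employed": "yes"},
    {"owns_home": "no", "employed": "no"},
]
labels = ["good", "good", "bad", "bad"]
gain_home = information_gain(rows, labels, "owns_home")
gain_employed = information_gain(rows, labels, "employed")
```

A tree learner would split on the attribute with the highest gain (or lowest Gini index) and then repeat the same computation on each resulting subset.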
4.4 Artificial neural networks
An Artificial Neural Network (ANN) is an information processing
paradigm that is inspired by the way biological nervous systems,
such as the brain, process information. The key element of this
paradigm is the novel structure of the information processing
system. It is composed of a large number of highly interconnected
processing elements (neurons) working in unison to solve specific
problems. Learning in neural networks is accomplished by adjusting
the connection weights iteratively until convergence.
Fig 5. Artificial Neural Networks
The output of each feed-forward connection is computed using an
activation function. Typically, feedback of the delta computations
is used to minimize the errors during learning. In credit risk
modeling, neural networks are second only to logistic regression
in use.
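
The idea of iterative weight adjustment can be shown on a single neuron. This sketch assumes a sigmoid activation and the classic delta rule with a fixed number of passes; the activation function, learning rate, epoch count and the toy OR data are illustrative choices, not details taken from the study.

```python
import math


def sigmoid(z):
    """Sigmoid activation (assumed here) squashing z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Feed-forward output of one neuron for input vector x."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_neuron(samples, targets, lr=0.5, epochs=5000):
    """Adjust weights with the delta rule for a fixed number of passes."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, t in zip(samples, targets):
            out = predict(weights, bias, x)
            # Delta: output error scaled by the slope of the sigmoid.
            delta = (t - out) * out * (1.0 - out)
            weights = [w + lr * delta * xi for w, xi in zip(weights, x)]
            bias += lr * delta
    return weights, bias


# Toy, linearly separable data: logical OR of two inputs.
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
targets = [0, 1, 1, 1]
weights, bias = train_neuron(samples, targets)
outputs = [predict(weights, bias, x) for x in samples]
```

A full network stacks many such units in layers and propagates the deltas backwards through them; a convergence check on the error would normally replace the fixed epoch count.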
4.5 Support Vector Machines (SVM)
A Support Vector Machine is a supervised learner for
classification. An SVM views the input data as two sets of
vectors in an n-dimensional space and constructs a separating
hyperplane in that space, one which maximizes the margin
between the two data sets.
Fig 6. Support vector machines
In order to calculate the margin, two parallel hyperplanes are
constructed, one on each side of the separating hyperplane, which
are "pushed up against" the two data sets. Intuitively, a good
separation is achieved by the hyperplane that has the largest
distance to the neighboring data points of both classes, since in
general the larger the margin, the lower the generalization error
of the classifier. In formal terms, an SVM can be written in its
dual form as a maximization over the multipliers αi, subject to a
condition that holds for every i and one further constraint.
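
The dual problem can be written out in standard notation as below. This is the usual hard-margin, linear form and is given as a reconstruction under that assumption; the symbols x_i (training vectors) and y_i (class labels, +1 or -1) are assumed notation that does not appear in the text.

```latex
% Hard-margin SVM dual: maximize over the multipliers \alpha_i
\[
  \max_{\alpha}\;
  \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
    \alpha_i \alpha_j \, y_i y_j \, \mathbf{x}_i^{\top} \mathbf{x}_j
\]
% subject to (for any i = 1, ..., n)
\[
  \alpha_i \ge 0
\]
% and
\[
  \sum_{i=1}^{n} \alpha_i y_i = 0
\]
```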
It has been found that support vector machines work well with
credit risk modeling.
5. EXPERIMENTAL RESULTS AND DISCUSSIONS
Experiments were done on a real-life credit risk data set
collected for an Indian bank. The experiments consist of
evaluating and comparing the quality of the results obtained by
the best classifier for each segment against a similar classifier
developed using the whole data. In the whole data set mode of
learning the classifier, ten-fold cross validation is adopted to
test the model. Since the segment sizes are small, a leave-one-out
approach is adopted for validating the classification models on
the segments.
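
The two validation schemes differ only in how the cases are partitioned, which the following sketch makes explicit. The function names are hypothetical, and the folds are taken in index order for simplicity; in practice the cases would normally be shuffled first.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def leave_one_out_splits(n_samples):
    """Leave-one-out is k-fold with one case per fold."""
    return k_fold_splits(n_samples, n_samples)


ten_fold = list(k_fold_splits(25, 10))   # whole data set mode
loo = list(leave_one_out_splits(4))      # small segment mode
```

Each case is used for testing exactly once in both schemes; leave-one-out simply keeps as many cases as possible for training, which matters when a segment is small.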
Table 1 shows the classification accuracy of the different
classification algorithms. It can be seen that all the algorithms
perform well on the validation set. One of them (decision tree)
has built-in feature selection; another (logistic regression) is
used with forward selection. The other two classifiers were built
using the full data set and all the attributes. Since a similar
approach is used in learning the classifiers on segmented data,
further pruning was not carried out on the algorithms.
Table 2 shows the true positive rates of the different
classification algorithms. It can be observed that all the
classifiers perform similarly when all the data is used for
modeling. This suggests that the classification boundaries
learned by each classifier are close to optimal for the given
data. Further data transformation and tuning of the classifier
learning parameters may still improve the classification
accuracy. However, our intention was to compare the performance
of the classifiers on segments with the same parameter settings.
It can be seen that no classifier is superior in all the segments
on all of the performance measures. This motivated our approach
of selecting the best classifier for each segment. It is clear
from the tables that the best classifier for each segment
provides superior performance.
Table 1. Classification accuracy
Table 2. True positive rates
6. CONCLUSION
In this paper, a framework connecting unsupervised (fuzzy
clustering) and supervised (classification algorithms) techniques
for credit risk evaluation is investigated. The presented
technique allows different classifiers to be built for different
groups of customers, each providing the best results for its
segment. In the proposed approach, each credit applicant is
assigned to the most similar group of clients from the training
data set, and credit risk is evaluated by applying the classifier
selected for this group.
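
The scoring path described above can be sketched as a two-step lookup. This is a minimal illustration: treating "most similar group" as the nearest cluster center in Euclidean distance is an assumption, and the centers, the stand-in classifiers and the applicant vector are hypothetical.

```python
import math


def nearest_segment(applicant, centers):
    """Index of the cluster center closest to the applicant's vector."""
    return min(range(len(centers)),
               key=lambda k: math.dist(applicant, centers[k]))


def score_applicant(applicant, centers, classifiers):
    """Route the applicant to a segment and apply that segment's classifier."""
    segment = nearest_segment(applicant, centers)
    return segment, classifiers[segment](applicant)


# Hypothetical cluster centers and per-segment classifiers (stand-in rules).
centers = [(0.2, 0.3), (0.8, 0.7)]
classifiers = [
    lambda x: "good" if x[0] < 0.4 else "bad",
    lambda x: "good" if x[1] > 0.5 else "bad",
]
result = score_applicant((0.9, 0.6), centers, classifiers)
```

With fuzzy memberships available, picking the cluster with the highest membership gives the same routing, since membership decreases with distance from the center.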
Results obtained on real credit risk data showed higher precision
and greater simplicity for the models obtained for each cluster
than for the model developed with the whole data set.
Future research will focus on using Self Organizing Maps and
Expectation Maximization clustering for segmentation, together
with multiple classification techniques for supervised learning,
and on additional performance measures such as the area under the
ROC curve.
REFERENCES
[1] B. Baesens, R. Setiono, Ch. Mues, J. Vanthienen. Using
Neural Network Rule Extraction and Decision Tables for Credit-
Risk Evaluation. Management Science, 49(3), 2003, 312-329.
[2] M. Bensic, N. Sarlija, M. Zekic-Susac. Modelling Small-
Business Credit Scoring by Using Logistic Regression, Neural
Networks and Decision Trees. Intelligent Systems in
Accounting, Finance and Management, 13, 2005, 133-150.
[3] G. Chi, J. Hao, Ch. Xiu, Z. Zhu. Cluster Analysis for Weight
of Credit Risk Evaluation Index. Systems Engineering-Theory
Methodology, Applications, 10(1), 2001, 64-67.
[4] J.C. Dunn. A Fuzzy Relative of the ISODATA Process and Its
Use in Detecting Compact Well-Separated Clusters. Journal of
Cybernetics, 3, 1973, 32-57.
[5] W.E. Henley, D.E. Hand. Construction of a k-nearest
neighbor credit-scoring system. IMA Journal of Management
Mathematics, 8, 1997, 305-321.
[6] I.H. Witten, E. Frank. Data Mining: Practical machine
learning tools and techniques, 2nd Edition. Morgan Kaufmann,
San Francisco, 2005.
[7] Y.-Z. Luo, S.-L. Pang, S.-S. Qiu. Fuzzy Cluster in Credit
Scoring. Proceedings of the Second International Conference on
Machine Learning and Cybernetics, Xi'an, 2-5 November 2003,
2731-2736.
[8] S.S. Satchidananda, Jay B. Simha. Comparing decision trees
with logistic regression for credit risk analysis. SAS APAUGC
2006, Mumbai.
[9] D. West. Neural network credit scoring models. Computers
& Operations Research, 27, 2000, 1131-1152.
[10] D. Zakrzewska. On integrating unsupervised and supervised
classification for credit risk evaluation. Information Technology
and Control, 2007, Vol. 36, No. 1A.