
IMA Journal of Mathematics Applied in Business & Industry (1997) 8, 323-346

Credit-scoring models in the credit-union environment using neural
networks and genetic algorithms
VIJAY S. DESAI*, DANIEL G. CONWAY†, JONATHAN N. CROOK‡, AND
GEORGE A. OVERSTREET JR*
* McIntire School of Commerce, University of Virginia, Charlottesville, VA 22903,
USA. Now at: HNC Software Inc., San Diego, CA, USA
† Pamplin School of Business, Virginia Tech, Blacksburg, VA 24060, USA
‡ Department of Business Studies, University of Edinburgh, 50 George Square,
Edinburgh EH8 9JY, UK

The purpose of the paper is to investigate the predictive power of feedforward neural
networks and genetic algorithms in comparison to traditional techniques such as
linear discriminant analysis and logistic regression. A particular advantage offered
by the new techniques is that they can capture nonlinear relationships. Also, previous
studies and a descriptive analysis of the data suggested that classifying loans
into three types—namely good, poor, and bad—might be preferable to classifying
them into just good and bad loans, and hence a three-way classification was
attempted.
Our results indicate that the traditional techniques compare very well with the
two new techniques studied. Neural networks performed somewhat better than the
rest of the methods for classifying the most difficult group, namely poor loans. The
fact that the AI-based techniques did not significantly outperform the conventional
techniques suggests that perhaps the most appropriate variants of the techniques
were not used. However, a post-experiment analysis suggests that a more likely reason
is the absence of important, consistent nonlinear variables in the data sets
examined.

1. Introduction
Recent issues of trade publications in the credit and banking area have published a
number of articles heralding the role of artificial intelligence (AI) techniques in helping
bankers make loans, develop markets, assess creditworthiness, and detect fraud. For
example, HNC Inc., considered a leader in neural-network technology, offers (among
other things) products for detection of credit-card fraud (Falcon), automated
mortgage underwriting (Colleague), and automated property valuation. Clients for
HNC's Falcon software include AT&T Universal Card, Household Credit Services,
Colonial National Bank, First USA Bank, First Data Resources, First Chicago Corp.,
Wells Fargo & Co., and Visa International (American Banker 1993c,d, 1994a,b).
According to Allen Jost (1993: p. 32), the director of Decision Systems for HNC Inc.,
'Traditional techniques cannot match the fine resolution across the entire range of
account profiles that a neural network produces. Fine resolution is essential when
only one in ten thousand transactions are frauds'. Other software companies
marketing AI products in this area include Cybertek-Cogensys and Nestor Inc.

The former company markets expert-system software called Judgment Processor,
which is used to evaluate potential borrowers for various consumer loan products;
its customers include Wells Fargo Bank, San Francisco, and Commonwealth
Mortgage Assurance Co. of Philadelphia (American Banker 1993a,b). It plans
to introduce a neural-net software product for under $1000 (Brennan 1993a:
p. 52). Nestor Inc.'s customers for a neural-network-based software for the detection
of credit-card fraud include Mellon Bank Corp. (American Banker 1993e).
While acknowledging the success of expert systems and neural networks in
mortgage lending and the detection of credit-card fraud, reports in trade journals
claim that artificial intelligence and neural networks have yet to make a breakthrough
in evaluating customer credit applications (Brennan 1993a). According to Mary A.
Hopper, senior vice president of the Portfolio Products Group at Fair, Isaac and
Co., a major provider of credit scoring systems, "The problem is a quality known as
robustness. The model has to be valid over time and a wide range of conditions.

When we tried a neural network, which looked great on paper, it collapsed—it was
not predictive. We could have done better sorting the list on a single field.' (Brennan
1993b: p. 62).
In spite of the reports in the trade journals indicated above, papers in academic
journals investigating and reporting on the claims appearing in the trade journals
are not common. Perhaps this is due to the lack of data available to the academic
community. Exceptions include Overstreet et al. (1992) and Overstreet & Bradley
(1994), who compare custom and generic credit-scoring models for consumer loans
in a credit-union environment using conventional statistical methods such as
regression and discriminant analysis, and Desai et al. (1995) who explore the efficacy
of neural networks in credit-scoring models.
The present paper has two important objectives. The first is to investigate whether
the predictive power of the variables employed in the above three studies can be
enhanced if the statistical methods of regression and discriminant analysis are
replaced by combinations of neural-network models with backpropagation of error,
which we will refer to as combinations of multilayer perceptrons (CMLP), and genetic
algorithms for discriminant analysis (Conway et al. 1995). The second objective is to
study a three-way classification of loans. The typical dependent variable in the
credit-scoring literature is binary; for example, Desai et al. (1995) classify a case as
'bad' if, at any time in the last 48 months, the customer's most recent loan was
charged off or if the customer went bankrupt. All other cases were classified as 'good',
provided that the most recent loan was between 48 months and 18 months old.
Descriptive data analysis reported in this paper suggests a three-way classification
scheme by further subdividing the 'good' category into 'good' and 'poor'; i.e. in the
current paper, a case is classified as 'good' only if there are no payments that have
been overdue for 31 days or more, and 'poor' if the payment has ever been overdue
for 60 days or more.
The multilayer perceptron (MLP) and genetic algorithm (GA) can be viewed as
nonlinear classification techniques. While there exist a number of nonlinear regression
techniques, in a number of these techniques one has to specify the nonlinear model
before proceeding with the estimation of parameters; hence these techniques can be
classified as model-driven approaches. In comparison, the use of MLPs or GAs is a
data-driven approach, i.e. a prespecification of the model is not required. For example,
an MLP 'learns' the relationships inherent in the data presented to it, and a GA
provides a nonlinear classification function using a search procedure borrowed from
natural phenomena. These approaches seem particularly attractive in solving the
problem at hand, because, as Allen Jost (1993: p. 30) says, 'Traditional statistical
model development includes time-consuming manual data review activities such as
searching for non-linear relationships and detecting interactions among predictor
variables'.
Desai et al. (1995) use the same data sets with a binary classification scheme to
compare neural-network models with linear discriminant analysis and logistic
regression, and report that, in terms of correctly classifying good and bad loans,
the neural-network models outperform linear discriminant analysis, but are only
marginally better than logistic regression. However, in terms of correctly classifying
bad loans, the neural-network models outperform both conventional techniques. The

present paper adds a new technique, namely genetic algorithms, to the list used in
the earlier paper, and explores a three-way classification scheme. We find that the
performance of genetic algorithms is inferior to the best method overall, namely
logistic regression, by a small but statistically significant margin. Also, we find that
one of the neural-network models does better than the rest of the models in identifying
the most difficult group, namely poor loans, and that the genetic algorithm
outperforms all the other methods for bad loans. Since poor and bad loans are only
a small portion of the total loans made, this result resonates with the claim made by
Allen Jost of HNC Inc. that traditional techniques cannot match the fine resolution
produced by neural nets.
In Sections 2, 3, and 4, we review conventional statistical techniques, MLPs and
CMLPs, and GAs respectively. Section 5 describes the data and sources of data, and
provides the specifics of the MLPs and GAs used. Section 6 sets out the results of
our experiments, and Section 7 presents the conclusions.

2. Conventional statistical techniques


Linear discriminant analysis (LDA)
The basic idea of discriminant analysis when we have $r + 1$ populations is as follows.
Let there be $p$ characteristics of each credit applicant, represented by a vector $x$. We
wish to divide the complete $p$-dimensional space into $r + 1$ regions $\Pi_0, \ldots, \Pi_r$ so
that, if $x$ falls into $\Pi_k$, the applicant is classified as a member of group $k$. For example,
the groups could be 'good payer' and 'bad payer', or alternatively, 'good payer', 'poor
payer', and 'chargeoff or bankrupt'.
Various allocation rules have been proposed, one being to minimize the expected
costs of misclassifying a case into a group of which it is not a member. Let
$f_k(x)$ denote the probability density function of $x$, given membership of group $k$.
Then the proportion of cases in group $k$ which are misclassified into group $h$
equals
$$\int_{\Pi_h} f_k(x)\,\mathrm{d}x. \qquad (1)$$
Let $p_k$ be the prior probability, in the population, that a case is a member of group
$k$, and $c_{kh}$ be the cost of misclassifying a member of group $k$ into group $h$. Then the
expected loss is
$$L = \sum_{k=0}^{r} p_k \sum_{h \ne k} c_{kh} \int_{\Pi_h} f_k(x)\,\mathrm{d}x. \qquad (2)$$
The aim is to choose an allocation rule which minimizes $L$. Suppose that the cost
of misclassification $c_{kh}$ is always equal to 1. Then it can be shown (Choi 1986) that
the solution is for each $\Pi_h$ to satisfy
$$\Pi_h = \{x : p_h f_h(x) \ge p_k f_k(x) \ \text{for all } k \ne h\}. \qquad (3)$$
Suppose that we have only two groups. Let the sets of $x$ values in groups 0 and 1 each
be multivariate normally distributed with means $\mu_0$ and $\mu_1$ respectively, and with
covariance matrices $\Sigma_0 = \Sigma_1 = \Sigma$. Denote
$$b = \Sigma^{-1}(\mu_0 - \mu_1), \qquad c = \tfrac{1}{2}\, b^{\mathrm T}(\mu_0 + \mu_1).$$
Using these definitions, it can be shown (e.g. Lachenbruch 1975; Boyle et al.
1992) that it would be optimal to classify $x$ into $\Pi_0$ if
$$b^{\mathrm T} x > c. \qquad (4)$$
This is a version of the linear discriminant function derived by Fisher (1936), but he
did so using a different argument. This paper, in which there are more than two
groups, uses a second method which is based on Fisher's original approach. Fisher
argued that the greatest difference between the groups occurs when the ratio
of the between-groups to within-groups sums of squares is largest. This ratio can be
shown to be
$$\lambda = \frac{w^{\mathrm T} D w}{w^{\mathrm T} A w}, \qquad (5)$$
where $w$ is a column vector of weights, $D$ is the matrix of the between-groups sums
of squares and cross-products, and $A$ is the matrix of within-groups sums of squares
and cross-products (Tatsuoka 1970). Differentiating equation (5) and setting the result
equal to zero, we derive
$$Dw = \lambda A w, \qquad (6)$$
where $\lambda$ is an unknown scalar. This equation is solved for values of $\lambda$, i.e. the
eigenvalues, and for values of $w$, the eigenvectors. Since there are $p$ characteristics
for each credit applicant, equation (6) gives a $p$th-order polynomial in $\lambda$. But we are
interested only in the positive values of $\lambda$. The maximum number of such values is
$\min\{r, p\}$, where $r$ and $p$ are defined as above. Each positive eigenvalue $\lambda_i$ has
associated with it a unique eigenvector $w_i$ which fulfils the equation $(A^{-1}D - \lambda_i I)w_i = 0$,
with $w_i^{\mathrm T} w_i = 1$, where $I$ is the identity matrix.
The eigenvector-eigenvalue pairs may be interpreted as follows. Of all the linear
combinations of the $p$ characteristics, the first eigenvector, $w_1$, gives the weights
yielding the greatest value of $\lambda$, say $\lambda_1$. Of all of the linear combinations which are
uncorrelated with the first linear combination, the second eigenvector, $w_2$, gives the
weights for the linear combination which gives the largest value of $\lambda$, say $\lambda_2$. Similar
interpretations apply to other eigenvector-eigenvalue pairs.
Each such linear combination of the $p$ characteristics is called a canonical
discriminant function. We may present these as a single matrix-vector equation
$$z = W^{\mathrm T} x, \qquad (7)$$
where $W$ is a $p \times m$ matrix of weights, each column being a separate eigenvector $w_i$
(there being $m$ such eigenvectors), $z$ is an $m$-vector of variables, and $x$ is a $p$-vector
of characteristics.
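As an illustration only (this sketch is not part of the original study), the canonical discriminant functions of equations (5)-(7) can be obtained by solving the generalized eigenproblem (6) numerically; the names X and y below are placeholder arrays holding the characteristics and group labels.

import numpy as np
from scipy.linalg import eig

def canonical_discriminants(X, y):
    # X is an (n, p) data matrix, y holds group labels; returns the p x m weight matrix W of (7)
    grand_mean = X.mean(axis=0)
    p = X.shape[1]
    groups = np.unique(y)
    D = np.zeros((p, p))   # between-groups sums of squares and cross-products
    A = np.zeros((p, p))   # within-groups sums of squares and cross-products
    for g in groups:
        Xg = X[y == g]
        diff = Xg.mean(axis=0) - grand_mean
        D += len(Xg) * np.outer(diff, diff)
        centred = Xg - Xg.mean(axis=0)
        A += centred.T @ centred
    vals, vecs = eig(D, A)                 # generalized eigenproblem Dw = lambda * A * w
    vals = vals.real
    order = np.argsort(vals)[::-1]
    keep = [i for i in order if vals[i] > 1e-10][: min(len(groups) - 1, p)]
    W = vecs[:, keep].real                 # columns are the eigenvectors w_i
    return W                               # canonical scores for a case: z = W^T x

The group centroids of the canonical scores, together with the prior probabilities, can then be used to allocate a case to the group with the largest posterior probability, as in equation (8) below.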
In this paper, each case was classified into the group for which the posterior probability
of membership, given its value of $x$, was largest. This posterior
probability was calculated using Bayes' rule as
$$P(O_i = k \mid x) = P(x \mid O_i = k)\, p_k \Big/ \sum_{h=0}^{r} P(x \mid O_i = h)\, p_h,$$
i.e.
$$P(O_i = k \mid x) = D_k p_k \Big/ \sum_{h=0}^{r} D_h p_h, \qquad (8)$$
where $O_i$ is the group membership in case $i$, and where

$$D_k = |C_k|^{-1/2} \exp\{-\tfrac{1}{2}(d - \bar d_k)^{\mathrm T} C_k^{-1}(d - \bar d_k)\},$$
with $C_k$, $\bar d_k$, and $d$ calculated as follows:


$\bar d_k$ is the centroid vector of group $k$,
$p_k$ is the prior probability that a case is a member of group $k$,
$C_k$ is the covariance matrix of the canonical discriminant functions for group $k$,
assumed nonsingular, and
$d$ is an $m$-vector of canonical discriminant functions defined by
$$d = Bx + e,$$
where
$x$ is a $p$-vector of discriminating variables for the case,
$B$ is an $m \times p$ matrix of coefficients for the unstandardized canonical
discriminant functions, and
$e$ is an $m$-vector of constants.
Eisenbeis & Avery (1972) and Eisenbeis (1977, 1978) discuss eight problems with
using LDA in credit scoring. For example, the linear discriminant model assumes (a)
that the discriminating variables are measured on an interval scale, (b) that the
covariance matrices of the discriminating variables are equal for the groups, and (c)
that the discriminating variables follow a multivariate normal distribution. As will
be clear from Section 5, some of the discriminating variables used in our credit-union
application are of nominal order and so assumptions (a) and (c) are violated. The
violation of the third assumption was confirmed by significant values of Box's M
statistic. It is well known that, when predictor variables are a mixture of discrete and
continuous variables, the linear discriminant function may not be optimal. In this
case, special procedures for binary variables are available (Dillon & Goldstein
1984). However, in the case of binary variables, most evidence suggests that the linear
discriminant function performs reasonably well (Gilbert 1968; Moore 1973; Krzanowski
1977). Furthermore, the literature suggests that, while quadratic discriminant analysis
is appropriate when the assumption of normality holds, but that of equal covariances
does not, the results of a classificatory quadratic rule are more sensitive to
violations of normality than the results of a linear rule. A linear rule seems to work
satisfactorily unless the violation of equal covariance matrices is drastic (Stevens
1992; Boyle 1992).

Logistic regression (LR)


Unlike LDA, which uses Bayes' rule to predict the posterior probability, logistic

regression is used under the assumption that the posterior probability that a case is
a member of group $k$ is specified directly as
$$P(O_i = k \mid x) = \exp(\beta_k^{\mathrm T} x_i) \Big/ \sum_{j=0}^{r} \exp(\beta_j^{\mathrm T} x_i) \qquad (k = 0, \ldots, r), \qquad (9)$$
where $\beta_k$ is a column vector of coefficients for group $k$ and $x_i$ is a column vector of
values for each variable for case $i$. To make the parameter estimates identifiable, it
is usual to normalize the coefficients for one group. Thus it is assumed that $\beta_0 = 0$.
The conditional probabilities then become
$$P(O_i = k \mid x) = \exp(\beta_k^{\mathrm T} x_i) \Big/ \Big(1 + \sum_{j=1}^{r} \exp(\beta_j^{\mathrm T} x_i)\Big) \qquad (k = 1, \ldots, r), \qquad (10)$$
$$P(O_i = k \mid x) = 1 \Big/ \Big(1 + \sum_{j=1}^{r} \exp(\beta_j^{\mathrm T} x_i)\Big) \qquad \text{when } k = 0. \qquad (11)$$

This implies that we can compute $r$ log-odds ratios:
$$\ln(p_{ik}/p_{i0}) = \beta_k^{\mathrm T} x_i \qquad (k = 1, \ldots, r),$$
where $p_{ik}$ denotes the probability that case $i$ is a member of group $k$.


The model so far described is often known as the multinomial logit model. The
logarithm of the likelihood function can be formulated and differentiated to give
estimators of the parameters. Notice that, when the outcome variable is binary, we
have a special case of the above, i.e. r = 1.
In this paper the parameter vectors were estimated (including a constant) using
the method of maximum likelihood, and a case was classified into the group of which
it had the highest probability of membership. Thus, when there were two groups, a
case was allocated to the group for which its probability of membership
exceeded ½. In the three-group case, the criterion was simply the group with the
highest probability of membership.
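For readers who want a concrete starting point, the classification rule just described can be reproduced with standard software; the sketch below (not the authors' code) uses scikit-learn, whose multi-class logistic regression fits the multinomial logit model of equations (9)-(11) by maximum likelihood. The data arrays are random placeholders with eighteen predictors and three groups.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 18))          # placeholder analysis sample
y_train = rng.integers(0, 3, size=300)        # placeholder labels: 0 good, 1 poor, 2 bad (assumed coding)
X_test = rng.normal(size=(100, 18))           # placeholder holdout sample

clf = LogisticRegression(solver="lbfgs", max_iter=1000)   # fits a multinomial model for 3 classes
clf.fit(X_train, y_train)
probs = clf.predict_proba(X_test)             # estimated P(O_i = k | x) for each group k
predicted_group = probs.argmax(axis=1)        # allocate each case to its most probable group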
The logistic regression model does not require the assumptions necessary for the
linear discriminant model. In fact, Harrell & Lee (1985) found that, even when the
assumptions of LDA were satisfied, LR is almost as efficient as LDA. Some studies
have found that, when using mixed binary, categorical, and continuous variables, the
logistic regression rule gives slightly better results than linear or quadratic
discriminant analysis (Knoke 1982; Titterington 1981).

3. Neural networks for classification


A neural-network model takes an input vector x and produces an output vector o.
The relationship between x and o is determined by the network architecture. The
network generally consists of at least three layers: one input layer, one output layer,
and one or more hidden layers.

3.1 Network architecture

Each layer in a multi-layer perceptron (MLP) consists of one or more processing
elements ('neurons'). In the network we will be using, the input layer will have p
processing elements, i.e. one for each predictor variable. Each processing element in
the input layer sends signals $x_i$ $(i = 1, \ldots, p)$ to each of the processing elements in the
hidden layer. Each processing element in the hidden layer (indexed by $j = 1, \ldots, q$)
produces an 'activation' $a_j = G(\sum_i w_{ij} x_i)$, where the $w_{ij}$ are the weights associated with
the connections between the $p$ processing elements of the input layer and the $j$th
processing element of the hidden layer. The processing elements in the output layer
behave in a manner similar to the processing elements of the hidden layer to produce
the output of the network
$$o_k = F(U_k) = F\Big(\sum_j a_j w_{jk}\Big) = F\Big(\sum_j w_{jk}\, G\Big(\sum_i w_{ij} x_i\Big)\Big) \qquad (k = 0, \ldots, r). \qquad (12)$$

The main requirements to be satisfied by the activation functions $F(\cdot)$ and $G(\cdot)$ are
that they be nonlinear and differentiable. Typical functions used in the hidden layer
are the sigmoid, hyperbolic tangent, and sine functions, i.e.
$$G(x) = \frac{1}{1 + e^{-x}} \quad \text{or} \quad G(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \quad \text{or} \quad G(x) = \sin x. \qquad (13)$$
Since we are working on a classification problem, the network outputs $o_0, \ldots, o_r$
are interpreted as the conditional probabilities of group membership. Thus it is required that
$0 \le o_k \le 1$ and $\sum_{k=0}^{r} o_k = 1$. This is achieved by using the following 'softmax'
activation function for the output neurons of the network:
$$o_k = \exp(U_k) \Big/ \sum_{j=0}^{r} \exp(U_j). \qquad (14)$$
The weights in the neural network can be adjusted to minimize the relative-entropy
criterion, given as
$$E = -\sum_k y_k \ln o_k. \qquad (15)$$
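The following fragment (an illustration, not taken from the paper) shows the softmax output (14) and the relative-entropy criterion (15) for a single pattern; the target vector and the weighted sums are invented numbers.

import numpy as np

def softmax(U):
    e = np.exp(U - U.max())                # subtracting the maximum improves numerical stability
    return e / e.sum()                     # outputs lie in (0, 1) and sum to one

def relative_entropy(y, o):
    return -np.sum(y * np.log(o + 1e-12))  # E = -sum_k y_k ln o_k

U = np.array([0.8, -0.2, 0.1])             # weighted sums reaching the three output neurons
o = softmax(U)
E = relative_entropy(np.array([1.0, 0.0, 0.0]), o)   # target pattern: a 'good' loan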

3.2 Network training


The most popular algorithm for training multilayer perceptrons is the backpropagation
algorithm. Essentially, backpropagation performs a local gradient search, and
hence its implementation, although not computationally demanding, does not
guarantee reaching a global minimum. A number of heuristics are available to
alleviate this problem, some of which are presented below.
Let $x_j^{[s]}$ be the current output of the $j$th neuron in layer $s$, $I_i^{[s]}$ the weighted
summation of the inputs to the $i$th neuron in layer $s$, $F'(I_i^{[s]})$ the derivative of the
activation function of the $i$th neuron in layer $s$, and $\delta_k^{[s]}$ the local error of the $k$th
neuron in layer $s$. Then, after some mathematical simplification (e.g. Haykin 1994:
pp. 144-52), the weight-change equations suggested by backpropagation and (15) can
be expressed as
$$\Delta w_{ij}^{[s]} = \eta\, F'(I_i^{[s]}) \Big(\sum_k \delta_k^{[s+1]} w_{ki}^{[s+1]}\Big) x_j^{[s-1]} + \theta\, \Delta w_{ij}^{[s],\mathrm{prev}} \quad \text{for non-output layers}, \qquad (16)$$
$$\Delta w_{ij}^{[s]} = \eta\, (y_i - o_i)\, x_j^{[s-1]} \quad \text{for the output layer}, \qquad (17)$$


where $\eta$ is the learning coefficient and $\theta$ is the momentum parameter. One heuristic
we use to prevent the network from getting stuck in a local minimum is random
presentation of the training data (e.g. Haykin 1994: pp. 149-50). If the second term
were omitted in (16), then setting a low learning coefficient results in slow learning,
whereas a high learning coefficient can produce divergent behaviour. The second
term in (16) reinforces general trends, whereas oscillatory behaviour is cancelled out,
thus allowing a low learning coefficient but faster learning. Last, it is suggested that
convergence is speeded up by starting the training with a large learning coefficient
and letting its value decay as training progresses.
There are three criteria that are commonly used to stop network training: namely,
when a fixed number of iterations have been made, or when the error reaches a
certain prespecified minimum, or when the network reaches a fairly stable state and
learning effectively ceases. In the current paper we allowed the network to run for a
maximum of 100,000 iterations. Training was stopped before 100,000 iterations if the
error criterion ($E$, defined in (15)) fell below 0.1. Also, the percentage of training
patterns correctly classified was checked after every cycle of 1000 iterations, and the
network was saved if there was an improvement over the previous best saved network.
Thus, the network used on the test data set was the one that had shown the best
performance on the training data set during a training of up to 100,000 iterations.
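A rough sketch of this training regime is given below, using scikit-learn's MLPClassifier as a stand-in for the authors' backpropagation code (so the internals differ); it trains in blocks of iterations, stops early when the loss falls below a threshold, and keeps the weights that classify the training patterns best.

import numpy as np
from copy import deepcopy
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X, y = rng.normal(size=(300, 18)), rng.integers(0, 3, size=300)   # placeholder training data

net = MLPClassifier(hidden_layer_sizes=(18,), activation="logistic",
                    solver="sgd", learning_rate_init=0.3, momentum=0.4,
                    max_iter=1000, warm_start=True)                # warm_start continues training
best_acc, best_net = -1.0, None
for block in range(100):                    # up to 100 blocks of iterations
    net.fit(X, y)                           # continue gradient descent from the current weights
    if net.loss_ < 0.1:                     # error criterion reached
        break
    acc = net.score(X, y)                   # proportion of training patterns correctly classified
    if acc > best_acc:
        best_acc, best_net = acc, deepcopy(net)   # save the best network seen so far

The saved network (best_net) would then be the one applied to the holdout sample.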

3.3 Network size


Important questions about network size that need to be answered are as follows:
first, how does one determine the number of processing elements in the hidden layer;
and second, how many hidden layers are adequate? As yet there are no firm answers to
these questions. In the case of the first question, it has been suggested that 'starting
with oversized networks rather than with tight networks seems to make it easier to
find a good solution' (Weigend et al. 1990). One then tackles the problem of
overfitting by somehow eliminating the excess neurons. While several methods for
eliminating the excess neurons have been explored, one that seems to be readily
applicable in our case is given below. The hypothesis that '... the simplest most
robust network which accounts for a data set will, on average, lead to the best
generalization to the population from which the training set has been drawn' was
made by Rumelhart (reported in Hanson & Pratt 1988). One of the simplest
implementations of this hypothesis is to check the network at periodic intervals and
eliminate nodes in the hidden layers, up to a certain maximum number, if the
elimination does not lead to a significant deterioration in performance. We have
implemented this hypothesis in our current paper, the details of which are available
from the authors.
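One simple way to realize this pruning rule is sketched below (our illustration, not the authors' procedure, whose details they offer separately): each hidden neuron is tentatively deleted, and the deletion is kept only if training-set accuracy does not fall by more than a small tolerance, up to a maximum number of removals.

import numpy as np

def forward(X, W1, W2):
    hidden = 1.0 / (1.0 + np.exp(-X @ W1))   # sigmoid activations of the hidden layer
    return (hidden @ W2).argmax(axis=1)      # predicted group for each case

def prune_hidden_units(X, y, W1, W2, tolerance=0.01, max_removed=10):
    base_acc = (forward(X, W1, W2) == y).mean()
    removed, j = 0, 0
    while j < W1.shape[1] and removed < max_removed:
        W1_trial = np.delete(W1, j, axis=1)  # drop hidden unit j
        W2_trial = np.delete(W2, j, axis=0)
        acc = (forward(X, W1_trial, W2_trial) == y).mean()
        if acc >= base_acc - tolerance:      # no significant deterioration: keep the smaller network
            W1, W2, base_acc, removed = W1_trial, W2_trial, acc, removed + 1
        else:
            j += 1                           # keep this unit and try the next one
    return W1, W2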

3.4 Combinations of neural networks

Recently there has been considerable interest in using combinations of neural
networks (e.g. Hansen & Salamon 1990) to reduce the residual error in neural
networks for classification. The basic argument for using a combination of networks
is the existence of many local minima which cause individual networks to make errors
on different subsets of the input space. Hansen & Salamon argue that the collective
decision produced by the combination of networks is less likely to be in error than
the individual networks. They explore various rules for combining the outputs of the
neural networks and report that the majority rule, i.e. accepting the classification
made by more than half the networks, seems to work best. We implemented the
majority rule using a combination of three networks.
Hansen & Salamon further report that results can be significantly improved if the
individual networks are trained using independent training data. Due to the paucity
of data, we could not train the individual networks using independent training data.
However, since the basic idea was to encourage the individual networks not to
converge to the same local minimum, we changed the probability of sampling the
three loan types for the three networks such that one network had a higher probability
of learning to classify good loans, the second network had a higher probability of
classifying poor loans, and the third network had a higher probability of classifying
bad loans.
Because we trained the individual networks using different probabilities
of sampling the three loan types, we also tested another rule, namely the 'best neuron'
rule, which involves using only one of the three outputs of each network. More
specifically, if one of the individual networks is trained to classify good loans, the
only output used from that network is the one that represents good loans. Thus we
select one output from each network and then proceed as we would with a single
network.
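For concreteness, the two combination rules can be written as follows (an illustrative sketch with assumed names; the class ordering 0 = good, 1 = poor, 2 = bad and the tie-breaking choice are ours, not the paper's).

import numpy as np

def majority_rule(out_a, out_b, out_c):
    # out_* are (n_cases, 3) output matrices from the three trained networks
    votes = np.stack([out_a.argmax(axis=1), out_b.argmax(axis=1), out_c.argmax(axis=1)], axis=1)
    decisions = []
    for row in votes:
        vals, counts = np.unique(row, return_counts=True)
        # accept the class chosen by at least two networks; fall back to network 1 on full disagreement
        decisions.append(vals[counts.argmax()] if counts.max() >= 2 else row[0])
    return np.array(decisions)

def best_neuron_rule(out_good, out_poor, out_bad):
    # keep only the 'good' output of the good-specialized network, and so on,
    # then classify by the largest of the three retained outputs
    combined = np.column_stack([out_good[:, 0], out_poor[:, 1], out_bad[:, 2]])
    return combined.argmax(axis=1)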

4. Genetic algorithms for discriminant analysis


GAs begin their search process by randomly generating an initial population of
strings. Each of these strings is a data structure which generally represents a possible
solution to the problem. Each solution is evaluated by some measurable criteria
resulting in a 'fitness' being associated with each string. A second population is
generated from the first population by 'mating' the previous population in a way
that models the mingling of genes in natural populations, and performing some
mutations. The partners in mating are chosen by the principle of survival of the fittest.
The higher the string's fitness, the more likely it is to reproduce. Search usually
terminates when the fitness has failed to improve over a predetermined number of
successive generations.
The advantages of such a heuristic approach are considerable. First, the evaluation
function determining fitness need not be linear, differentiable, or even continuous.
The parameters of the approach, including the size of the initial population, the
evaluation function, and the mutation frequency, as well as the breeding method (to
some extent), are all under control of the user. Thus, they may be manipulated in
such a manner as to achieve faster or slower convergence. GAs are also highly
parallelizable, and thus represent one of the few algorithms which can accomplish
linear speedups on a parallel machine.
GAs move toward optimality only if they satisfy the conditions of the fundamental
theorem of genetic algorithms (Goldberg 1989). Those conditions demand that the
breeding process does not destroy patterns, or schema, which then survive (in
expected value) and combine with other fit schema in hope of achieving optimality.
Many successful applications of GAs have been documented, and the curious reader
is referred to Goldberg (1989).
In this paper, we are attempting to distinguish between good loans, poor loans,
and bad loans. We describe the process for separating the good from the poor and
hope to generalize from there. For example, say we have 100 historical cases of
loans—half good and half poor. Each loan, or point, consists of 18 attributes describing
the financial environment for each loan. We wish to determine a function and a cutoff
point with which we can better predict the quality of a future loan.
The goal of the procedure is to minimize the number of misclassifications of
the observed data. This goal differs from that of conventional DA mathematical
approaches. Many methods, e.g. those of Freed & Glover (1981) attempt to minimize
some measure of the sum of distances from a line to the observations. These
approaches hope that the resulting separating hyperplane which minimizes such a
sum will also minimize the number of misclassifications. The structure of the data
set determines the degree of correlation between the two objectives. The problem
of directly minimizing the number of misclassifications is an integer programming
problem. For completely separable data sets, the problem becomes a linear pro-
gramming problem and the dual yields the desired hyperplane.
One natural approach for solving this integer problem is to employ a branch-and-
bound method, where branches would consider whether or not to include a data
point of a set in calculating the hyperplane. One would then construct two sets,
subsets of the original two sets X and Y, which would be of maximum size and yet
separable. We use the genetic algorithm in this manner. The GA chooses the points,
or branches, which create separable sets. Thus, the genetic string can be interpreted
as an equivalent branch-and-bound node. Once the points are chosen, the dual model
of discriminant analysis is employed to find the actual hyperplane, and the number
of misclassifications can be measured. We then avoid the inherent data-dependent
problems which might be associated with using modified objectives. The mathematical
justification is given below.

4.1 Generating a starting population


The goal of the procedure is to minimize the number of misclassifications for the
observed data. For the discriminant analysis, we chose an initial random population
of 20 strings. The data were encoded as a 0-1 string based on the omission or
inclusion of an observation for discrimination. A random string was generated such
that each member in the string has a probability 0.2 of being one and 0.8 of being
zero. These parameters were chosen based on their performance on preliminary
testing. A one represents inclusion of an observation for discrimination, and zero
represents the omission of an observation for discrimination. For the example above,
a string will be a vector of length 100, each element representing a data point.

A linear discriminator is of the following form. For any point $x$, if $w^{\mathrm T} x < c$, then
$x$ is classified in one set; otherwise it is classified in the other set. If the data points
represented by the string are successfully discriminated by the dual model, the fitness
of the string is the sum of the ones in the string plus other points omitted which can
be classified by the discriminant function. That is, by maximizing the number of
points included in the discriminant function, we are minimizing the number of
misclassifications. To evaluate a string and obtain a discriminant function $(w, c)$,
we need a model that is guaranteed to work when discrimination is possible. We
propose a generic dual model which has this property.

4.2 The dual model


In mathematics, if a linear discriminant function existed for the two sets $X$ and $Y$, it
would be called a separating hyperplane. For years, mathematicians have been
devising methods for proving the existence of such hyperplanes, but have had little
interest in calculating them. One method of establishing the existence of separating
hyperplanes singles out one such hyperplane in particular. This result deals with the
problem of determining the minimum distance between two convex sets $X_c$ and $Y_c$.
It states that the dual of the problem of determining the minimum distance between
two convex sets $X_c$ and $Y_c$ is that of determining the separating hyperplane which maximizes
the distance from $X_c$ to the hyperplane (Luenberger 1969). We wish to employ this
result to build a discriminating function.
Suppose that the set $X$ is composed of $r$ points in $\mathbb{R}^n$ and the set $Y$ is composed
of $s$ points in $\mathbb{R}^n$, which we represent by the $n \times r$ and $n \times s$ matrices $X$ and $Y$
respectively. Then the convex hulls of the two sets may be defined as $X_c$ and $Y_c$
respectively, where
$$X_c = \{x : X\alpha = x,\ e_r^{\mathrm T}\alpha = 1,\ \alpha \in \mathbb{R}^r_+\}, \qquad Y_c = \{y : Y\beta = y,\ e_s^{\mathrm T}\beta = 1,\ \beta \in \mathbb{R}^s_+\}.$$
Here $\mathbb{R}^r_+$ is the set of $r$-vectors with nonnegative components, and $e_r$ is the $r$-vector
with all entries 1, and similarly for $\mathbb{R}^s_+$ and $e_s$. If we denote by $\|\cdot\|$ the norm operation,
then the problem of determining whether or not the sets $X_c$ and $Y_c$ are disjoint may
be stated:
(P) minimize $z = \|x - y\|$ subject to $x \in X_c$ and $y \in Y_c$.


This problem is actually the generic dual to the problem of determining a linear
discriminator. We say it is generic because its actual form depends on the form
of the norm that we use to construct it. By virtue of the above discussion, it is the
dual problem.
Some observations are in order. Clearly, we do not want to solve problem P. It
is a nonlinear program and probably very difficult. There may be forms of P for
which we can solve the dual, however, and this leads to linear discriminators. If z = 0
in the solution to P, then there is no discriminator, since $X$ and $Y$ must have points
in common. Problem P can never be infeasible or unbounded. Our discriminator
will maximize the distance between the two groups with respect to the norm chosen
for P. We now present results for a common norm.
Suppose in problem P that we take $\|x - y\| = \sum_{j=1}^{n} |x_j - y_j|$. This is known as
the $\ell_1$ or 'rectilinear' norm. The discriminating problem is given by the linear program
(D1) maximize $\delta_1 + \delta_2$ subject to
$$X^{\mathrm T} u + e_r \delta_1 \le 0_r, \qquad Y^{\mathrm T} u - e_s \delta_2 \ge 0_s, \qquad -e_n \le u \le e_n,$$
where $u$ is an $n$-vector and $e_k$ is a vector of $k$ 1s; see Conway et al. (1995) for a proof.
Problem D1 is equivalent to the model originally formulated by Mangasarian (1965).
By a change of variables, it can be solved with any linear programming code.
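To make the construction concrete, the sketch below (not the authors' code) solves D1 with an off-the-shelf LP solver, under the constraint reading given above; X_pts and Y_pts hold the points of the two sets as rows.

import numpy as np
from scipy.optimize import linprog

def dual_discriminator(X_pts, Y_pts):
    r, n = X_pts.shape
    s = Y_pts.shape[0]
    # variables: [u (n components), delta1, delta2]; maximize delta1 + delta2
    c = np.zeros(n + 2)
    c[-2:] = -1.0                                                  # linprog minimizes, so negate
    A_X = np.hstack([X_pts, np.ones((r, 1)), np.zeros((r, 1))])    # u'x_i + delta1 <= 0
    A_Y = np.hstack([-Y_pts, np.zeros((s, 1)), np.ones((s, 1))])   # -(u'y_j - delta2) <= 0
    A_ub = np.vstack([A_X, A_Y])
    b_ub = np.zeros(r + s)
    bounds = [(-1.0, 1.0)] * n + [(None, None), (None, None)]      # -e <= u <= e, deltas free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    u, d1, d2 = res.x[:n], res.x[n], res.x[n + 1]
    cutoff = (d2 - d1) / 2.0                # classify a new point x into set X if u @ x <= cutoff
    separable = (d1 + d2) > 1e-8            # a zero objective means perfect discrimination is impossible
    return u, cutoff, separable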
The fitness value will be equal to zero if the objective function is zero. This
is a trivial solution indicating that perfect discrimination is not possible. This
occurs if the convex region of one set intersects the convex region of the other
set. Thus the distance between the sets is zero. Otherwise the number of ones
in the string, plus the number of the string's zero-valued observations classified
correctly by the discriminant function, will be the fitness value. This total represents
the total number of points correctly classified by the separating hyperplane.
For the example problem, a string with 100 ones would probably yield a value of
zero for the dual objective function, indicating that perfect discrimination is not
possible. Such a string will be assigned a fitness value of zero. A 'roulette wheel' is
formed (Goldberg 1989) which contains the solutions in the current generation; each
solution is assigned an arc length equal to its fitness value expressed as a proportion of
the total fitness value of all solutions. If a solution is a good one, then it will receive
more area on the wheel. That way, by spinning the wheel, a good solution is more
likely to be picked than a bad solution. Two strings are chosen to 'breed' by spinning
the wheel, i.e. generating a random number and scaling it to the circumference of
the wheel. The two solutions are then 'cross-bred' to produce two more solutions.
The cross-breeding occurs as follows. Consider a string of length 24 (for compact-
ness). For the two strings, two splicing positions will be generated such that one
splice will occur in positions representing set X (good loans) and the other in set Y
(poor loans). The two strings are split according to the splice positions and the
corresponding substrings are swapped, generating four new strings. As an example,
consider the cross-breeding of the following two strings, where vertical bars mark the
splicing points.
Set X: good loans              Set Y: poor loans

S1: 1111100|10001              100|111101101
S2: 0101011|01111              010|010010111.

The above splicing can be considered equivalent to:

Set X                          Set Y
X11 | X12                      Y11 | Y12
X21 | X22                      Y21 | Y22,

where X11 = 1111100, X12 = 10001, X21 = 0101011, X22 = 01111, Y11 = 100, Y12 =
111101101, Y21 = 010, Y22 = 010010111. The new strings are:

N1: X11 X22 Y11 Y22,    N2: X21 X12 Y21 Y12,
N3: X11 X22 Y21 Y12,    N4: X21 X12 Y11 Y22.
Five reproductions are carried out to generate a new population of twenty strings.
To avoid rapid convergence to local optima, each string in the new population
has a 20% probability of having an element switched from 0 to 1 or 1 to 0. The 20%
probability of mutation was used throughout the experiment, though no significant
difference occurred in the experiment from using mutation rates varying from 5% to
50%. This may be because including or excluding points that were not on the
'boundary' should make no difference in the hyperplane. In fact, it is to our
computational advantage to keep the number of 1s small, thus rendering smaller LPs
to solve for each string. By including a point through mutation which would have
already been properly classified, we gain nothing. Only by including a point which
was not previously in the convex hull of included points can we change the hyperplane
and thus possibly improve the objective.
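The genetic operators described above might be coded as in the sketch below (an illustration with assumed string lengths; the tie-handling and random-number choices are ours, not the paper's).

import numpy as np

rng = np.random.default_rng(0)
n_good, n_poor = 50, 50                      # string layout: good-loan bits followed by poor-loan bits

def roulette_pick(population, fitness):
    probs = fitness / fitness.sum()          # arc length proportional to (nonnegative) fitness
    return population[rng.choice(len(population), p=probs)]

def crossbreed(s1, s2):
    cut_x = rng.integers(1, n_good)          # one splice inside the set-X positions
    cut_y = n_good + rng.integers(1, n_poor) # one splice inside the set-Y positions
    x11, x12, y11, y12 = s1[:cut_x], s1[cut_x:n_good], s1[n_good:cut_y], s1[cut_y:]
    x21, x22, y21, y22 = s2[:cut_x], s2[cut_x:n_good], s2[n_good:cut_y], s2[cut_y:]
    children = [np.concatenate([x11, x22, y11, y22]),   # N1
                np.concatenate([x21, x12, y21, y12]),   # N2
                np.concatenate([x11, x22, y21, y12]),   # N3
                np.concatenate([x21, x12, y11, y22])]   # N4
    for child in children:
        if rng.random() < 0.2:               # 20% probability of switching one element
            j = rng.integers(child.size)
            child[j] = 1 - child[j]
    return children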
Koehler (1991) gives a GA for directly determining the coefficients of a separating
manifold. Koehler's method has been shown to give competitive results on a
particular data set. It also can be used to create nonlinear surfaces for separation.
Computationally, it is approximately 10 times more intensive than the genetic dual
method. In order to modify our method to create a nonlinear separating manifold,
we would be required to increase the number of attributes, similarly to how one
would create nonlinear regression models. For example, to create a quadratic surface,
we would need to calculate the square of each attribute, thus using 36 attributes, and
proceed as above. The dual model would generate the coefficients in the same manner,
though we would expect the computational time to increase with the complexity of
the nonlinear model.
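As a small illustration of that extension (ours, not the paper's), the attribute matrix would simply be augmented with squared terms before running the same dual model:

import numpy as np

def add_quadratic_terms(X):          # X: (n_cases, 18) matrix of attributes
    return np.hstack([X, X ** 2])    # 36 attributes, allowing a quadratic separating surface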

5. Data sample and model construction


5.1 Data samples
The loan files of credit unions (unlike those of large banks) were not kept in
readily available computerized databases. Consequently samples had to be collected
laboriously by analysing individual loan files. Data were collected from the loan files
of three credit unions in the Southeastern United States for the period 1988-91. Credit
union L is predominantly made up of teachers, and credit union N is predominantly
made up of telephone-company employees, whereas credit union M represents a more
diverse state-wide sample. The narrowness of membership is somewhat mitigated by
the inclusion of family members in all three credit unions. Only credit union M had
added select employee groups to diversify its membership.
Predictor variables commonly used in credit-scoring studies include various debt
ratios and other cash-flow-oriented surrogates, employment time, home ownership,
major credit-card ownership, and representations of past payment history (e.g.
Overstreet et al. 1992). Additional variables that can be added to the model include
detailed credit-bureau reports (e.g. Overstreet & Bradley 1994). In selecting predictor
variables, care must be taken to comply with Regulation B of the Equal
Credit Opportunity Act, so as to avoid non-intuitive variables. Based on all the
considerations mentioned above, eighteen variables were selected for the present

study, and are given in Table 1.
The data collected had 962 observations for credit union L, 918 observations for
credit union M, and 853 observations for credit union N. A three-way classification
scheme was used, i.e. a case was classified as 'good' only if there were no payments
that had been overdue for 31 days or more, 'poor' if the payment had ever been
overdue for 60 days or more, and 'bad' if, at any time in the last 48 months, either
the customer's most recent loan was charged off or the customer went bankrupt.
After eliminating observations with corrupted or missing elements, we were left with
505 observations for credit union L with 56.83% good loans, 24.75% poor loans, and
18.42% bad loans; 762 for credit union M with 51.18% good, 22.70% poor, and

TABLE 1
List of predictor variables

Majcard Number of major credit cards


Ownbuy Owns home
Income Salary plus other income
Goodcr 'Good' depending upon derogatory information and number of 01-09 ratings
on credit bureau reports
Jobtime Number of years in current job
Depends Number of dependents
Numinq Number of inquiries in past 7 months
Tradeage Number of months since trade line opened
Trade75 Number of trade lines 75% full
Payratel Monthly payments as a proportion of income
Delnq Delinquent accounts in past 12 months
Olddebt Total debt as a proportion of income
Age Age of borrower
Addrtime Number of years at the current address
Acctopen Number of open accounts on credit bureau reports
Actaccts Number of active accounts on credit bureau reports
Prevloan Number of previous loans with credit union
Deroginf Information based upon 01-09 ratings on credit bureau reports
TABLE 2
Descriptive statistics for credit union L
The number of good, poor, and bad loans equal 287, 125, and 93 respectively. The first, second,
and third numbers in each cell refer to the good, poor, and bad loans respectively

Variable Mean Median Trunc. mean Min Max

Majcard 0.659, 0.448, 1,0,0 0.676, 0.443, 0,0,0 1,1, 1


0.326 0.305
Ownbuy 0.714, 0.448, 1,0,0 0.738, 0.443, 0,0,0 1, 1,1
0.359 0.342
Income 2153, 1890, 1231 1967,1429,1005 20511684,1100 548,400,400 9000,8368,6000
Goodcr 0.707, 0.520, 0.283 1, 1,0 0.730, 0.521 0,0,0 1, 1, 1
0.256
Jobtime 5.56, 4.67, 4.00 3,2,2 4.87, 4.10, 3.51 0,0,0 34,30,23
Depends 0.7, 0.9- 0.9 0,0,0 0.65, 0.78, 0.82 0,0,0 5,5,4
Numinq 1.52, 2.60, 3.37 1,2,2 1.30, 108,178 0,0,0 10, 19, 32

Tradeage 99.53, 83.76, 86, 71, 30 94.55, 79.1, 0,0,0 491 288, 252
45.20 39.79
Trade75 1.21,190, 2.41 0,3,1 0.97,162,102 0,0,0 13, 20, 18
Payratel 0.33, 0.36, 0.39 0.30, 0.36, 0.39 0 J 1 0.36, 0.39 0,0,0 0.86, 0.97, 1.00
Delnq 0.335, 0.904, 0, 1,1 0.317, 0.947, 0,0,0 1, 1, 1
0.674 0.695
Olddebt 3.33, 5.20, 4.01 1.93, 3.16,104 177, 4.52, 3.29 0,0,0 4139, 34.20,
24.25
Age 37.68, 34.42, 37, 33, 29 37.09, 34.13, 19, 18, 18 73, 57, 49
31.10 30.74
Addrtime 7.27, 6.18, 6.49 4,13 6.27, 5.17, 5.59 0,0,0 56, 48, 40
Acctopen 0.37, 0.62, 0.75 0,0,0 0.27, 0.50, 0.56 0,0,0 4,5,6
Actaccts 4.25, 5.10, 3.23 3,4,2 4.04, 4.74,188 0,0,0 16, 33, 14
Prevloan 0.78, 0.21, 0.03 0,0,0 0.57,0.11,0.00 0,0,0 13, 4, 1
Deroginf 0.24, 0.34, 0.55 0,0, 1 021, 0.31 0.56 0,0,0 1, 1, 1

26.12% bad; and 695 observations for credit union N with 56.77% good, 22.05%
poor, and 21.18% bad.

5.1.1 Descriptive data analysis. Tables 2, 3, and 4 give the descriptive statistics for
the three credit unions. Furthermore, the descriptive statistics are broken down by
loan type. It is interesting to note that the median values for a number of variables
are identical for two out of three loan types, or in some instances for all loan types.
For example, the median number of dependents (Depends) are equal for all three
loan types for all three credit unions, and the number of delinquent accounts in the
past 12 months (Delnq) are equal for all three loan types for credit unions M and
N, and are equal for poor and bad loans for credit union L. It is also interesting to
note that the median values are identical for good and poor loans for some variables
in some credit unions, and identical for poor and bad loans for other credit unions,
e.g. the number of inquiries in past 7 months (Numinq), number of open accounts
on credit-bureau reports (Acctopen), and the number of active accounts on credit-
bureau reports (Actaccts).

5.1.2 Training and holdout data. There exist several approaches for validating statistical
models (e.g. Dillon & Goldstein 1984; Hair et al. 1992). The simplest approach,
TABLE 3
Descriptive statistics for credit union M
The number of good, poor, and bad loans equal 390, 173, and 199 respectively. The first, second,
and third numbers in each cell refer to the good, poor, and bad loans respectively

Variable Mean Median Trunc. mean Min Max


Majcard 0.628, 0.364, 1,0,0 0.643, 0.348, 0,0,0 1,1, 1
0.475 0.472
Ownbuy 0.746, 0.497, 1,0,0 0.774, 0.497, 0,0,0 1,1, 1
0.429 0.421
Income 2227,1900,1157 2016,1715,1000 2111,1819,1103 480, 538, 400 10000,6000,
4250
Goodcr 0.803, 0.434, 1,0, 1 0.837, 0.426, 0,0,0 1, 1, 1
0.788 0.820
Jobtime 9.62, 4.48, 4.76 7.5, 2, 3 8.88, 3.64, 4.17 0,0,0 45, 30, 33
Depends 1.1, 1.16, 1.34 1, 1, 1 1.00, 1.06, 1.29 0,0,0 5,5,7

Numinq 1.35,112,171 1, 1, 2 1.09, 1.77, 264 0,0,0 14, 16, 8
Tradeage 112.84,67.56, 99, 60, 43.5 108.08, 63.82, 0,0,0 409, 234, 235
49.68 46.22
Trade75 0.83, 1.06, 2.97 0,0,3 0.63, 0.87, 284 0,0,0 9,8, 10
Payratel 0.22, 0.26, 0.40 0.22, 0.26, 0.39 0.22, 0.77, 0.99 0,0,0 0.78, 0.77, 0.99
Delnq 0.136, 0.763, 0,0,0 0.043, 0.626, 0,0,0 5,5,5
0.379 0.242
Olddebt 2.70, 3.35, 5.35 1.64, 268, 4.74 227, 3.01, 4.97 0,0,0 38.00, 17.31,
27.27
Age 38.72, 32.32, 37, 30, 30 38.20, 31.59, 19, 19, 18 73,66,60
31.12 30.42
Addrtime 9.73, 6.71, 4.85 7,3,2 8.73, 5.89, 3.80 0,0,0 70, 53, 55
Actaccts 293, 2.91, 3.92 3,2,3 279, 270, 3.76 0,0,0 12, 13, 17
Acctopen 0.48, 0.45, 1.13 0,0, 1 0.31, 0.36, 1.05 0,0,0 12,3,6
Prevloan 256, 207, 256 2,1,3 214, 1.63, 243 0,0,0 23, 23, 14
Deroginf 0.12,0.35,0.15 0,0,0 0.07,0.33,0.11 0,0,0 1, 1, 1

referred to as the cross-validation method, involves dividing the data into two subsets,
one for training (analysis sample) and a second one for testing (holdout sample).
More sophisticated approaches include the U method and the jackknife method.
Both these methods are based on the 'leave-one-out' principle, where the statistical
model is fitted to repeatedly drawn samples of the original sample. Dillon & Goldstein
(1984: p. 393) suggest that, in the case of discriminant analysis, a large standard
deviation in the estimator for misclassification probabilities can overwhelm the bias
reduction achieved by the U method, and, if multivariate normality is violated, it is
questionable whether jackknifed coefficients actually represent an improvement in
general. Also, these methods can be computationally expensive. An intermediate
approach, and perhaps the most frequently used approach, is to divide the original
sample randomly into analysis and holdout samples several times. Given the
substantial amount of data and the fact that we investigated six models, we decided
to use an intermediate approach. The data sample was divided into two parts, with
two thirds of the observations being used for training and the remaining one third
for testing. Observations were randomly assigned to the training or testing data set,
and ten such pairs of data sets were created. A popular approach is to use stratified
sampling in order to keep the proportion of good loans and bad loans identical
across all data sets. Since the percentage of bad loans is different for the three credit
TABLE 4
Descriptive statistics for credit union N
The number of good, poor, and bad loans equal 394, 153, and 148 respectively. The first, second,
and third numbers in each cell refer to the good, poor, and bad loans respectively
Variable Mean Median Trunc. mean Min Max
Majcard 0.800, 0.575, 1, 1,0 0.833, 0.584, 0,0,0 1, 1, 1
0.449 0.444
Ownbuy 0.812, 0.693, 1, 1, 1 0.846, 0.715, 0,0,0 1, 1, 1
0.544 0.549
Income 2850, 2441, 2603,2203,1950 2771, 2346, 2157 584, 728, 600 7499, 8500,
2410 22000
Goodcr 0.848, 0.706, 1, 1, 1 0.887, 0.730, 0.707 0,0,0 1, 1, 1
0.687
Jobtime 13.48, 8.44, 7.42 13, 6, 5 13.14, 7.52, 6.85 0,0,0 40, 40, 33
Depends 12, U 1.2 1, 1, 1 1.1, 1.2, 1.1 0,0,0 6,4,4

Numinq 1.22, 208, 3.08 1, 1. 1 0.97, 1.72, 254 0,0,0 12, 15, 20
Tradeage 133.23, 96.39, 120, 79, 54 129.91, 90.64, 0,0,0 342, 379, 276
71.1 66.41
Trade75 1.91,138, 2.71 2,2,2 1.72, 214, 247 0,0,0 10, 16, 12
Payratel 0.29, 0.27, 0.27 0.30, 028, 0.28 0.29, 0.27, 0.27 0,0,0 0.91,0.62,0.60
Delnq 0.368, 0.647, 0,0,0 0.220,0.511, 0,0,0 6,5,8
0.626 0.451
Olddebt 4.49, 4.09, 4.79 3.75, 3.85, 3.72 4.25, 3.91, 4.46 0,0,0 20.57, 13.10,
33.32
Age 38.90, 36.17, 38, 35, 32 38, 56, 35.88, 17, 18, 18 67, 62, 69
33.27 3275
Addrtime 8.09, 8.12, 6.93 5,5,4 7.42, 7.16, 5.96 0,0,0 33, 53, 41
Acctopen 0.59, 0.80, 1.10 0,1,1 0.49, 0.69, 0.91 0,0,0 5,5,9
Actaccts 5.17, 4.88, 4.79 5,4,4 4.95, 4.54, 4.44 0,0,0 22,22, 21
Prevloan Z83, 216, 238 2,1,1 246, 1.82, 1.88 0,0,0 19, 15, 25
Deroginf 0.09, 0.16, 0.16 0,0,0 0.04, 0.12, 0.12 0,0,0 1, 1, 1

unions, and since claims by practitioners imply that the performance of neural
networks in comparison to the conventional methods would depend upon the
proportion of bad loans in the data set, we decided not to use stratified sampling,
and we let the percentage of bad loans vary across the ten data sets so that our
results would not depend upon the particular composition of the data sample at
hand. As Section 6 indicates, when the results were compared, we accounted for this
variation by performing paired t tests.
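For reference, the sampling scheme just described amounts to the following sketch (not the authors' code); scipy.stats.ttest_rel could then be used for the paired t tests over the ten splits.

import numpy as np

def make_splits(n_cases, n_splits=10, train_fraction=2 / 3, seed=0):
    rng = np.random.default_rng(seed)
    splits = []
    for _ in range(n_splits):
        order = rng.permutation(n_cases)                     # random, unstratified assignment
        n_train = int(round(train_fraction * n_cases))
        splits.append((order[:n_train], order[n_train:]))    # (analysis, holdout) indices
    return splits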

5.2 Model construction


As the preceding paragraphs indicate, we want to use eighteen predictor variables
to predict whether a loan will be 'good', 'poor', or 'bad'. For the multilayer
perceptron, this fixes the number of neurons in the input and output layers to eighteen
and two respectively. We have used a sigmoid activation function in the hidden layer,
i.e. $F(z) = 1/(1 + e^{-z})$. After some preliminary testing, we decided to use a single
hidden layer starting with eighteen neurons in all our models. All models were
trained separately for each of the 10 samples for each credit union. Also, by the
end of training, some models had fewer neurons than they started with, since
some neurons were eliminated by pruning. Thus, the number of neurons deleted
ranged from 0 to 10 out of the 18 neurons in the hidden layer for the MLP
models.
For the MLP models, the initial values of the learning parameter $\eta$ were set at
0.3 for the hidden layer and 0.15 for the output layer, and the momentum parameter
$\theta$ was set at 0.4 for all layers. These parameters were allowed to decay by
reducing their values by half after 10,000 iterations, and again by half after 30,000
iterations.
In all, the six methods tested were as follows:
lda     linear discriminant analysis
lr      logistic regression
ga      genetic algorithm
mlp     multilayer perceptron
mlp-m   combination of multilayer perceptrons and a majority rule
mlp-b   combination of multilayer perceptrons and a best-neuron rule.

6. Comparison of results
Table 5 gives the results for the traditional techniques, Table 6 does the same for
neural networks and genetic algorithms, and Table 7 for combinations of neural
networks. For each method, the first column gives the total (i.e. the good plus the
poor plus the bad) percentage correctly classified, the second column gives the
percentage of poor loans correctly classified, and the third column gives the
percentage of bad loans correctly classified. Since the cost of giving a loan to a
defaulter is far greater than that of rejecting a good or poor loan, the percentage of bad
loans correctly identified is important. Also, poor loans and good loans differ in their
profitability. While reading the results of Tables 5-7, one must keep in mind that,
in the experiments reported in the present study, we did not explicitly include
misclassification costs; this is because one of the methods, namely logistic regression,
does not allow that feature. Also note that, given the information in §5.1, the
percentage of good loans correctly classified can be easily obtained from the data
given in these tables.

6.1 Comparing credit unions


In comparing the three credit unions, one can see that models for credit union M
are the best, followed by those for credit unions L and N respectively. Given the
limited amount of information, one can only speculate as to the reasons for these
differences. First, M has the largest sample size, and secondly M is the only credit
union that added select employee groups to actively diversify its field of membership,
indicating that the diversity of examples in the data set in terms of size and variety
is important in creating good models. However, using these factors, one would expect
N to provide better results than L because L has the smallest sample size and the
narrowest field of membership. Perhaps the poor performance of N is due to
the lack of power of the explanatory variables to separate the three groups, as is
indicated by the descriptive statistics for N reported in Table 4. Of the 18 explanatory
TABLE 5
Results for linear discriminant analysis and logistic regression

Percentage correctly classified

          sample  sample  lda      lda     lda    lr       lr      lr
Data set  % poor  % bad   % total  % poor  % bad  % total  % poor  % bad

Credit union L
1 24.40 18.45 66.07 56.10 41.94 68.45 41.46 54.84
2 29.76 17.28 66.07 41.46 48.28 66.67 40.00 51.72
3 23.81 20.24 67.26 47.50 50.00 67.26 40.00 55.88
4 21.43 22.62 69.05 44.44 44.74 69.05 36.11 50.00
5 29.76 18.45 66.07 55.00 48.39 65.48 52.50 45.16
6 25.60 19.64 67.86 57.58 54.55 67.26 39.39 60.61
7 22.02 17.86 67.86 54.05 40.00 70.84 45.95 50.00

Downloaded from imaman.oxfordjournals.org by guest on February 10, 2011


8 24.40 17.26 65.48 46.34 55.17 68.45 48.78 55.17
9 20.24 17.86 72.02 43.18 43.33 73.81 38.64 53.33
10 20.24 20.83 69.05 52.94 34.29 69.05 47.06 37.14
Credit union M
1 16.14 29.13 76.77 51.22 77.03 77.17 48.78 81.08
2 20.47 29.13 72.05 46.15 67.57 71.65 38.46 71.62
3 19.69 28.74 73.23 50.00 71.23 75.20 48.00 76.71
4 20.87 25.20 71.65 41.51 71.88 72.83 37.74 79.69
5 20.87 33.48 68.90 45.28 64.71 72.44 54.72 69.41
6 23.62 27.95 76.38 63.33 72.46 76.38 51.67 81.69
7 20.87 26.38 72.44 49.06 65.67 72.83 50.94 65.67
8 20.47 27.17 70.08 51.92 71.01 73.23 46.15 75.36
9 24.41 24.02 75.20 45.16 78.69 74.41 41.94 77.05
10 20.87 26.38 71.26 38.10 70.15 74.80 49.06 74.63
Credit union N
1 19.83 23.84 57.76 10.87 32.08 59.48 12.50 32.08
2 19.40 19.83 61.64 4.64 32.61 61.64 4.44 32.61
3 25.86 21.12 56.47 10.00 42.66 56.47 10.00 42.86
4 25.00 19.83 61.21 6.90 54.35 60.78 3.45 54.35
5 28.45 23.71 56.47 7.14 38.18 56.03 7.14 32.73
6 23.71 20.69 56.90 14.55 29.17 56.47 10.91 29.17
7 24.14 20.26 56.47 14.29 29.79 55.17 5.36 36.17
8 19.40 21.12 60.78 20.00 22.46 62.50 22.22 24.99
9 22.84 21.12 63.79 9.43 46.94 62.50 7.55 44.90
10 20.69 20.69 59.48 10.42 25.00 60.78 8.25 33.33

average 22.64 22.68 66.53 35.94 50.82 67.30 32.91 54.32

p value* 0.003
* The p values are for a one-tailed paired t test comparing the lr results with the other five methods.

variables used, seven have median values that are identical for all three loan types,
four have median values that are identical for poor and bad loans, and two have
median values that are identical for good and poor loans. Similar behaviour persists
under more detailed comparisons.
In comparing the three loan types, one sees that the poor loans were most difficult
to classify, followed by bad loans. For example, for credit union L, logistic regression
TABLE 6
Results for multilayer perceptrons and genetic algorithms

Percentage correctly classified

          sample  sample  mlp      mlp     mlp    ga       ga      ga
Data set  % poor  % bad   % total  % poor  % bad  % total  % poor  % bad

Credit union L
1 24.40 18.45 66.07 48.78 51.61 63.10 26.83 74.19
2 29.76 17.28 70.83 60.00 62.07 64.29 35.00 65.52
3 23.81 20.24 70.24 77.50 38.24 61.90 30.00 64.71
4 21.43 22.62 69.43 80.56 15.79 70.24 41.67 71.05
5 29.76 18.45 65.48 28.00 58.06 67.86 32.00 70.97
6 25.60 19.64 60.71 27.91 24.24 64.29 25.58 63.64
7 22.02 17.86 66.07 45.95 43.33 66.67 37.84 60.00

Downloaded from imaman.oxfordjournals.org by guest on February 10, 2011


8 24.40 17.26 66.67 51.23 55.17 64.29 26.83 72.41
9 20.24 17.86 68.45 76.47 30.00 69.05 41.18 66.67
10 20.24 20.83 63.69 32.35 48.57 64.29 32.35 65.71
Credit union M
1 16.14 29.13 77.95 53.66 79.73 71.65 24.39 85.14
2 20.47 29.13 71.26 42.31 67.57 71.26 36.54 82.43
3 19.69 28.74 73.62 38.88 69.86 73.62 40.00 86.30
4 20.87 25.20 73.62 49.06 81.25 71.65 26.42 85.94
5 20.87 33.46 70.47 56.60 62.35 70.87 30.19 82.35
6 23.62 27.95 74.02 45.00 76.06 68.50 28.33 87.32
7 20.87 26.38 70.87 58.49 65.67 70.87 26.42 79.10
8 20.47 27.17 71.65 51.92 72.46 68.50 23.08 84.06
9 24.41 24.02 72.83 48.39 65.57 68.11 29.03 81.97
10 20.87 26.38 72.44 41.51 73.13 72.83 35.85 85.07
Credit union N
1 19.83 23.84 59.48 8.69 28.30 58.62 26.09 43.40
2 19.40 19.83 60.34 2.22 36.96 67.67 31.11 63.04
3 25.86 21.12 62.93 1.67 38.78 55.60 13.33 59.18
4 25.00 19.83 59.48 1.72 50.00 61.64 25.86 56.52
5 28.45 23.71 54.74 14.29 37.78 59.91 23.21 53.33
6 23.71 20.69 57.76 1.82 35.42 58.19 23.64 47.92
7 24.14 20.26 56.90 1.79 44.68 56.90 23.21 55.32
8 19.40 21.12 61.64 17.78 18.37 63.36 28.89 46.94
9 22.84 21.12 61.63 1.89 36.73 61.64 20.75 61.22
10 20.69 20.69 60.34 2.08 35.42 63.79 27.08 50.00
average 22.64 22.68 66.38 35.62 50.11 65.70 29.19 68.38

p value* 0.038 0.007

* The p values are for a one-tailed paired t test comparing the lr results with the other five methods.

misclassified, on average, 57.16% of poor loans as good loans or bad loans, and
30.76% of bad loans as poor loans or good loans, whereas only 14.49% of the good
loans were misclassified as poor or bad loans. These results are consistent with the descriptive statistics reported in Tables 2-4, which show that the median values of the explanatory variables for poor loans often coincide with those for good

TABLE 7
Results for combinations of multilayer perceptrons

Percentage correctly classified

                 sample    sample    mlp-m     mlp-m     mlp-m     mlp-b     mlp-b     mlp-b
Data set         % poor    % bad     % total   % poor    % bad     % total   % poor    % bad

Credit union L
 1               24.40     18.45     60.71     29.27     24.39     65.48     41.46     41.94
 2               29.76     17.28     67.86     60.00     55.17     67.86     50.00     55.17
 3               23.81     20.24     66.67     41.50     44.12     67.86     40.00     50.00
 4               21.43     22.62     66.07     55.56     13.16     69.64     61.11     23.68
 5               29.76     18.45     64.88     40.00     48.39     64.29     38.00     51.61
 6               25.60     19.64     61.90     48.83     21.21     62.50     46.51     27.27
 7               22.02     17.86     66.07     51.35     43.33     64.88     43.24     43.33
 8               24.40     17.26     61.31     63.41     24.14     66.67     73.17     34.48
 9               20.24     17.86     65.48     61.76     36.67     65.48     58.82     36.67
10               20.24     20.83     60.71     35.29     45.71     64.88     32.35     51.43
Credit union M
 1               16.14     29.13     78.35     53.66     79.73     77.56     53.66     79.73
 2               20.47     29.13     71.26     41.31     74.32     72.88     51.92     71.97
 3               19.69     28.74     72.44     54.00     72.60     74.80     54.00     75.34
 4               20.87     25.20     72.44     41.51     81.81     71.83     49.06     78.12
 5               20.87     33.46     69.29     37.73     61.18     71.26     52.83     64.71
 6               23.62     27.95     73.62     35.00     81.69     75.59     55.00     80.28
 7               20.87     26.38     70.47     52.83     71.64     73.62     52.83     64.18
 8               20.47     27.17     70.47     50.00     71.01     71.44     48.08     71.46
 9               24.41     24.02     73.62     48.39     77.05     73.23     48.39     65.57
10               20.87     26.38     71.05     43.40     70.15     73.62     50.94     70.15
Credit union N
 1               19.83     23.84     58.62      0.00     33.96     59.48     13.04     28.30
 2               19.40     19.83     51.29      0.00     17.39     61.50      8.89     41.30
 3               25.86     21.12     53.88      0.00     38.78     57.33      6.67     41.86
 4               25.00     19.83     56.03      0.00     43.48     61.50     17.24     51.17
 5               28.45     23.71     49.57      1.79     33.33     56.90      7.14     41.22
 6               23.71     20.69     52.59      0.00     14.58     58.62     16.36     14.58
 7               24.14     20.26     51.72     10.71      2.13     53.88      7.14     38.30
 8               19.40     21.12     60.34     24.44     12.24     61.21     20.00     28.57
 9               22.84     21.12     53.88      0.00     32.65     61.64      5.66     36.73
10               20.69     20.69     60.78      4.17     14.58     60.34     16.67     14.58

average          22.64     22.68     63.81     32.93     44.72     66.39     37.00     49.29

p value*                             0.000                         0.020

* The p values are for a one-tailed paired t test comparing the lr results with the other five methods.

or bad loans. These results are also consistent with previous studies (e.g. Overstreet & Bradley 1996).
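
For reference, the misclassification percentages quoted in this paragraph are simply per-class error rates, which can be computed from actual and predicted labels as in the following sketch (illustrative only; the label values and names are hypothetical):

import numpy as np

def per_class_misclassification(actual, predicted, classes=("good", "poor", "bad")):
    """Percentage of each class's loans assigned to any other class."""
    actual = np.asarray(actual)
    predicted = np.asarray(predicted)
    rates = {}
    for c in classes:
        mask = actual == c
        # Share of class-c loans whose predicted label differs from c.
        rates[c] = 100.0 * np.mean(predicted[mask] != c) if mask.any() else float("nan")
    return rates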
4.2 Comparing modelling techniques
As Tables 5-7 indicate, logistic regression classifies 67.30% of the loans correctly, a higher proportion than any of the other models. This superiority is confirmed by a more
formal comparison using a paired t test. Since the proportions of poor and bad loans differ across the ten data sets, a paired t test was used to account for this difference. As the p values indicate, logistic regression is clearly better than linear discriminant analysis and genetic algorithms, and the difference between logistic regression and multilayer perceptrons is significant at the 0.05 level but not at the 0.01 level. The finding that logistic regression outperforms discriminant analysis is consistent with results reported elsewhere (e.g. Harrell & Lee 1985), and is perhaps due to the presence of categorical variables, which violates the assumption of multivariate normality required for linear discriminant analysis.
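
The test itself can be reproduced along the following lines, assuming two equal-length arrays holding the percentage correctly classified by logistic regression and by a competing method on the same hold-out samples. This is a sketch of the procedure described above rather than the code actually used; scipy.stats.ttest_rel returns a two-sided p value, which is halved here for the one-tailed comparison (newer SciPy versions also accept alternative='greater').

import numpy as np
from scipy.stats import ttest_rel

def one_tailed_paired_t(lr_pct, other_pct):
    """p value for H1: logistic regression's percentage correct exceeds the other method's."""
    lr = np.asarray(lr_pct, dtype=float)
    other = np.asarray(other_pct, dtype=float)
    t_stat, p_two_sided = ttest_rel(lr, other)
    # One-tailed p value: halve when the mean difference is in the hypothesized direction.
    p_one_sided = p_two_sided / 2.0 if t_stat > 0 else 1.0 - p_two_sided / 2.0
    return t_stat, p_one_sided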
The situation is quite different when it comes to identifying poor and bad loans
correctly. The combination of neural networks with the best-neuron rule (model
mlp-b) has the best performance when it comes to correctly identifying poor loans,
and the genetic-algorithm models outperformed the rest when it came to correctly identifying bad loans. This is an encouraging result for neural networks because, as
discussed before, poor loans are the most difficult to identify. It is worth noting
though that only linear discriminant analysis and the neural-network models used prior probabilities for the three loan types estimated from the training samples, whereas logistic regression and genetic algorithms used equal prior probabilities for all three loan types. This difference may be partially responsible for the higher percentage of bad loans correctly classified by logistic regression and
genetic algorithms. The fact that none of the methods are as good at identifying poor
or bad loans as they are at identifying good loans seems to confirm claims by
practitioners reported in Section 1.
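
The role of the priors can be illustrated with a small numerical sketch (the scores and priors below are purely hypothetical, not values from the study): the same class scores lead to different assignments depending on whether training-sample priors or equal priors are applied, with sample priors pulling decisions towards the majority class of good loans.

import numpy as np

def assign_class(likelihoods, priors):
    """Pick the class maximizing likelihood x prior (proportional to the posterior)."""
    classes = list(likelihoods)
    posterior = np.array([likelihoods[c] * priors[c] for c in classes])
    return classes[int(np.argmax(posterior))]

scores = {"good": 0.30, "poor": 0.40, "bad": 0.30}           # hypothetical class scores for one loan
sample_priors = {"good": 0.57, "poor": 0.22, "bad": 0.21}    # a training-sample-like mix
equal_priors = {"good": 1 / 3, "poor": 1 / 3, "bad": 1 / 3}

print(assign_class(scores, sample_priors))   # 'good' -- sample priors favour the majority class
print(assign_class(scores, equal_priors))    # 'poor' -- equal priors let the minority class win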
5. Conclusions
An attempt was made to investigate the predictive power of feedforward neural
networks and genetic algorithms in comparison to traditional techniques such as
linear discriminant analysis and logistic regression. A particular advantage offered
by the new techniques is that they can capture nonlinear relationships. Also, a
descriptive analysis of the data suggested that classifying loans into three types—
namely good, poor, and bad—might be preferable to classifying them into just good
and bad; hence the three-way classification was attempted.
Our results indicate that the traditional techniques compare very well with the
two new techniques studied. Interestingly, neural networks performed somewhat
better than the rest of the methods for classifying the most difficult group, namely
poor loans. The fact that the AI-based techniques did not significantly outperform the conventional techniques might suggest that the most appropriate variants of the techniques were not used. However, a post-experiment analysis suggests instead that the new techniques failed to outperform the traditional ones because the data sets examined contain no important, consistent nonlinear variables.
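
One simple way to probe for such nonlinear effects, offered only as an illustrative sketch and not as the post-experiment analysis actually performed, is to add squared terms to a binary logistic model (for example, good versus not good) and compare the fits with a likelihood-ratio test:

import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

def nonlinearity_check(X, y):
    """Likelihood-ratio test of a logit with quadratic terms against the purely linear logit."""
    # X: (n, k) array of continuous explanatory variables; y: binary outcome.
    X_lin = sm.add_constant(X)
    X_quad = sm.add_constant(np.hstack([X, X ** 2]))
    m_lin = sm.Logit(y, X_lin).fit(disp=0)
    m_quad = sm.Logit(y, X_quad).fit(disp=0)
    lr_stat = 2.0 * (m_quad.llf - m_lin.llf)   # twice the log-likelihood gain from the squared terms
    p_value = chi2.sf(lr_stat, df=X.shape[1])  # one extra parameter per squared term
    return lr_stat, p_value

A nonsignificant result for every variable set would be consistent with the absence of important nonlinear relationships noted above.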

Acknowledgements
Financial support from the McIntire Associates Program is gratefully acknowledged
by the first author. This paper was presented at the fourth Credit Scoring and Credit
Control Conference held at the University of Edinburgh in 1995, and the authors are grateful for the comments of participants at that conference.

REFERENCES

American Banker, 1993a (29 March), p. 15A; 1993b (25 June), p. 3; 1993c (14 July), p. 3; 1993d
(27 August), p. 14; 1993e (5 October), p. 14; 1994a (2 March), p. 15; 1994b (22 April), p. 17.
BOYLE, M., CROOK, J. N., HAMILTON, R., & THOMAS, L. C., 1992. Methods for credit scoring
applied to slow payers. Credit scoring and credit control (L. C. Thomas, J. N. Crook, &
D. E. Edelman, Eds). Oxford University Press, Oxford. Pp. 75-90.
BRENNAN, P. J., 1993a. Promise of artificial intelligence remains elusive in banking today. Bank
Management (July), pp. 49-53.
BRENNAN, P. J., 1993b. Profitability scoring comes of age. Bank Management (September),
pp. 58-62.
BRYSON, A. E., & HO, Y. C., 1969. Applied optimal control. Hemisphere Publishing, New York.
CONWAY, D. G., VENKATARAMANAN, M. A., & CABOT, A. V., forthcoming. A genetic algorithm for discriminant analysis. Annals of Operations Research.
DESAI, V. S., CROOK, J. N., & OVERSTREET JR, G. A., 1996. A comparison of neural networks
and linear scoring models in the credit union environment. European Journal of
Operational Research 95, 24-37.
DILLON, W. R., & GOLDSTEIN, M., 1984. Multivariate analysis methods and applications. Wiley,
New York.
EISENBEIS, R. A., 1977. Pitfalls in the application of discriminant analysis in business, finance,
and economics. Journal of Finance 32, 875-900.
EISENBEIS, R. A., 1978. Problems in applying discriminant analysis in credit scoring models. Journal of Banking & Finance 2, 205-19.
EISENBEIS, R. A., & AVERY, R. B., 1972. Discriminant analysis and classification trees. Lexington Books, Lexington, Massachusetts.
FAHLMAN, S. E., & LEBIERE, C., 1990. The cascade-correlation learning architecture. School
of Computer Science Report CMU-CS-90-100, Carnegie Mellon University.
FISHER, R. A., 1936. The use of multiple measurements in taxonomic problems. Annals of
Eugenics 7, 179-88.
FREED, N., & GLOVER, F., 1981. A linear programming approach to the discriminant problem.
Decision Sciences 12, 68-74.
GREENE, W. H., 1991. LIMDEP. Econometric Software Inc., New York.
GILBERT, E. S., 1968. On discrimination using qualitative variables. Journal of the American
Statistical Association 63, 1399-412.
GOLDBERG, D. E., 1989. Genetic algorithms in search, optimization, and machine learning.
Addison-Wesley, Reading, Massachusetts.
GOTHE, P., 1990. Credit bureau point scoring sheds light on shades of gray. The Credit World
(May-June), pp. 25-9.
HAIR, J. F., ANDERSON, R. E., TATHAM, R. L., & BLACK, W. C., 1992. Multivariate data analysis with readings. Macmillan, New York.
HANSON, S. J., & PRATT, L., 1988. A comparison of different biases for minimal network
construction with back-propagation. Advances in neural information processing systems
(D. S. Touretzky, Ed.). Morgan Kaufmann, San Mateo, California. Pp. 177-85.
HANSEN, L. K., & SALAMON, P., 1990. Neural network ensembles. IEEE Transactions on
Pattern Analysis and Machine Intelligence 12, 993-1001.
HARRELL, F. E., & LEE, K. L., 1985. A comparison of the discrimination of discriminant
analysis and logistic regression under multivariate normality. Biostatistics: statistics in
biomedical, public health, and environmental sciences (ed. P. K. Sen). North-Holland,
Amsterdam.
HAYKIN, S., 1994. Neural networks: a comprehensive foundation. Macmillan, New York.
JACOBS, R. A., JORDAN, M. I., NOWLAN, S. J., & HINTON, G. E., 1991. Adaptive mixtures of
local experts. Neural Computation 3, 79-87.
346 V. S. DESAI ET AL.

JOST, A., 1993. Neural networks: a logical progression in credit and marketing decision
systems. Credit World (March/April), pp. 26-33.
KNOKE, J. D., 1982. Discriminant analysis with discrete and continuous variables. Biometrics
38, 191-200.
KOEHLER, G. J., 1991. Linear discriminant function determined by genetic search. ORSA
Journal on Computing 3, 345-57.
KRZANOWSKI, W. J., 1977. The performance of Fisher's linear discriminant function under
non-optimal conditions. Technometrics 19, 191-200.
LACHENBRUCH, P. A., 1975. Discriminant analysis. Hafner, New York.
LAPEDES, A., & FARBER, R., 1987. Non-linear signal processing using neural networks:
prediction and system modeling. Los Alamos National Laboratory report LA-UR-87-
2662.
LUENBERGER, D. G., 1969. Optimization by vector space methods. Wiley, New York.
MANGASARIAN, O., 1965. Linear and nonlinear separation of patterns by linear programming.
Operations Research 13, 444-52.
MOORE, D. M., 1973. Evaluation of five discrimination procedures for binary variables. Journal of the American Statistical Association 68, 399.
NILSSON, N. J., 1965. Learning machines: foundations of trainable pattern-classifying systems.
McGraw-Hill, New York.
OVERSTREET JR, G. A., BRADLEY JR, E. L., & KEMP, R. S., 1992. The flat-maximum effect and
generic linear scoring model: a test. IMA Journal of Mathematics Applied in Business &
Industry 4, 97-109.
OVERSTREET JR, G. A., & BRADLEY JR, E. L., 1996. Applicability of generic linear scoring
models in the U.S. credit-union environment. IMA Journal of Mathematics Applied in
Business & Industry 7, 291-311.
PARKER, D. B., 1982. Learning logic. Invention Report 581-64 (File 1), Stanford University,
Office of Technology Licensing.
RAO, C. R., 1965. Linear statistical inference and its applications. Wiley, New York.
ROBBINS, H., & MONRO, S., 1951. A stochastic approximation method. The Annals of Mathematical Statistics 22, 400-7.
ROSENBLATT, F., 1962. Principles of neurodynamics. Spartan Books, Washington D.C.
RUMELHART, D. E., 1988. Learning and generalization. Proceedings of the IEEE International
Conference on Neural Networks, San Diego (Plenary address).
RUMELHART, D. E., HINTON, G. E., & WILLIAMS, R. J., 1986. Learning internal representations
by error propagation. In: Parallel distributed processing: explorations in the microstructures
of cognition (D. E. Rumelhart & J. L. McClelland, Eds). The MIT Press, Cambridge,
Massachusetts. Vol 1, pp. 318-62.
STEVENS, J., 1992. Applied multivariate statistics for the social sciences. Lawrence Erlbaum
Associates, New Jersey.
TAM, K. Y., & KIANG, M. Y., 1992. Managerial applications of neural networks: the case of
bank failure predictions. Management Science 38, 926-47.
TATSUOKA, M. M., 1975. Selected topics in advanced statistics, an elementary approach 6:
discriminant analysis. Institute for Personality and Ability Testing, Illinois.
WERBOS, P., 1974. Beyond regression: new tools for prediction and analysis in the behavioral
sciences. Unpublished Ph.D. dissertation, Harvard University, Dept. of Applied
Mathematics.
WIDROW, B., 1962. Generalization and information storage in networks of adaline "neurons".
In: Self-organizing systems (M. C. Yovitz, G. T. Jacobi, & G. D. Goldstein, Eds). Spartan
Books, Washington D.C. Pp. 435-61.
WEIGEND, A. S., HUBERMAN, B. A., & RUMELHART, D. E., 1990. Predicting the future: a
connectionist approach. Working Paper, Stanford University, California.
