
Improving prediction accuracy of loan default- A case in rural credit

M.J. Xavier, Sundaramurthy, P.K. Viswanathan, G. Balasubramanian

Abstract: The application of data mining techniques to predict loan default is of paramount importance in banking and financial services. The analytics involved pave the way for robust credit scoring models and automation of the lending process. They also help discern the pattern of relationship between the inputs [borrower characteristics] and the output [loan default status]. When the underlying relationship is not strictly linear, the routine use of either factor analysis or the non-linear oriented neural network on its own may not improve predictive accuracy because of the complexity present in the data. The question is: is it possible to improve predictive accuracy by judiciously combining these two algorithms? In this paper we demonstrate that it is, using the data set of a finance company lending small loans in rural areas. Factor analysis is used to generate the inputs for a neural network that predicts loan default, and the result is a substantial improvement in accuracy.

Key words: Predictive accuracy, algorithms, factor analysis, neural network, data mining

I Introduction

Lending institutions are exposed to the risk of default by borrowers, which affects their profitability and solvency. Persistent default by borrowers also affects the economic growth of a country and is therefore a major cause of concern for governments and banks. In managing loan default, prevention is better than cure. Hence, lenders exercise vigilance at the time of lending by carefully studying the characteristics of borrowers and making a judgment call about the relationship between those characteristics and the loan default event. Over a period of time, the database of lending and default history captures this relationship, which can then be modeled using appropriate algorithms. Using this relationship, lenders take credit decisions. Models, once tested for their reliability,

facilitate automation of the lending process, which helps lenders scale up their operations. Lenders here include banks, financial institutions and other agencies such as micro finance institutions engaged in lending.

Historical data relating to borrower characteristics and their default status is used to predict loan default. Past data is studied to discern patterns between certain characteristics and the loan default status. A very naive example would be a bank that found that people belonging to a particular community never default. Such a bank would look at no characteristic other than community; according to this bank, all the rest of the characteristics are irrelevant, because that is the bank's experience. In that case only a few data points were used to predict loan default. Today, with high speed computing power, much more data can be captured and analyzed to detect hidden patterns, and more sophisticated techniques are used to improve prediction accuracy. Multiple regression is widely used to express the mathematical relationship between dependent and independent variables in the form of coefficients that can be used for prediction. But multiple regression assumes that the relationship between the dependent and independent variables is linear. Moreover, when the number of independent variables is large, there is the problem of multicollinearity, which can affect the predictive accuracy of the equation: the independent variables themselves are correlated with one another. Factor analysis came in as a powerful

technique that can handle this issue. In factor analysis, all the correlated independent variables are reduced to factors which are then used to predict the future outcome.

The various techniques and tools that can be used to discern the underlying relationship between the dependent and independent variables are collectively known as data mining techniques. The conventional approach of regression and other multivariate techniques is based on the assumption of a linear relationship between the dependent and independent variables, which may not hold in many situations. Out of this limitation, non-linear analytical tools were developed which use algorithms that handle non-linearity. The neural network (NN) is one such technique, extensively used to predict outcomes where the underlying relationship between the dependent and independent variables is non-linear. In some situations NN fails miserably, particularly when there is a large number of independent variables. But is there a way of combining conventional factor analysis and NN to obtain better predictive accuracy? We show in this paper that there is, with the help of a real life case.

II Review of Literature

Factor analysis is one of the most widely used tools for prediction in fields such as psychology, sociology and market research. It is a statistical method for reducing the original set of variables into a smaller set of underlying factors in a manner that retains as much of the original information in the data as possible. In statistical terms this means finding factors that explain as much as possible of the variance in the data. Many statistical methods are used to study the relation between independent and dependent variables; factor analysis is used to study the patterns of relationship among many inter-dependent variables. It is mainly used for exploratory analysis of underlying dimensions and exploits the benefits of data reduction, replacing the original variables with factor scores for subsequent analyses such as multidimensional scaling, cluster analysis or neural network analysis. Salmi, Virtanen and Yli-Olli [1990] applied factor analysis and transformation analysis to a number of financial ratios of 32 Finnish firms for the period 1974-84; fifteen financial ratios, grouped under three broad categories, namely accrual, cash flow and market based ratios, were considered. In one of the earliest applications of factor analysis, Spearman, a noted psychologist, researched the issue of intelligence: eight measurable variables associated with intelligence were reduced to four dimensions [Spearman, 2005]. Colgate and Lang [2001] used exploratory factor analysis to assess the dimensionality of the reasons why customers do not switch banks and thus determine the relevance of categories unearthed in the literature. Minhas and Jacobs [1996] employed factor analysis to identify the most important attributes of new technology that have an impact on the marketing of financial services, reducing 33 variables to six underlying dimensions. In a study predicting industrial bond ratings, Pinches and Mingo [1973] used factor analysis to identify independent dimensions for further modeling; seven underlying factors were extracted, accounting for sixty three percent of the variation in the data. Scannell, Safdari and Newton [2003] present an extended application of factor analysis performed on a set of 17 banks and 13 financial variables, using the results to classify banks on the basis of common characteristics.

Neural networks [NN] represent a radically different form of computation from more common algorithmic approaches such as factor analysis. The unique learning capability of NN promises benefits in many aspects of default prediction that involve pattern recognition. The development of a robust and reliable model for default prediction is important, as it enables investors, auditors and others to independently evaluate the risk of an investment. The task of predicting default can be posed as a classification problem: given a set of classes (good and bad loans) and a set of input data vectors, the task is to assign each input data vector to one of the classes. What forms the input data vector is one of the key issues to be determined in the research design. Conventional statistical approaches are of limited use in deriving an appropriate prediction model in the absence of well-defined domain models: they all require the assumption of a certain functional form relating the independent and dependent variables, and generalization can be made only with caution. NN provides a more general framework for determining relationships in the data and does not require the specification of any functional form. Raghupathi et al. [1991] presented the results of exploratory research in which NN was applied to bankruptcy prediction and concluded that NN could provide a model for predicting bankruptcy. Rahimian et al. [1992] compared discriminant analysis and NN on a data set of 129 firms, of which 65 were bankrupt, and report that after normalizing the data the performance of NN improved significantly; it is therefore important to consider how the input data to NN is specified. Odom et al. [1991] also compared the predictive ability of NN and a multivariate discriminant analysis model in bankruptcy risk prediction; NN performed better on both the original data and the hold-out sample, showing promise for prediction purposes. Coleman et al. [1991] and Tam et al. [1992] extended the application of NN in bankruptcy prediction further into an expert system for prescribing remedial action to prevent bankruptcy. Salchenberger et al. [1992] used NN to train on 100 failed and 100 surviving thrift institutions between 1986 and December 1987 and compared its performance with a logit model. Their results show that NN performed better than the logit model, and they conclude that the costs of committing type I and type II errors are lower with NN than with the logit model.

III Case Study: Predicting credit default for a consumer financial services company

Company background: The company under study is a financial services company, part of a leading multi-business industrial group in South India. Its products include auto loans, personal loans and consumer durable loans. It has a customer base in all four states of South India, with customers in urban, semi-urban and rural areas. In fact its strength is its reach into rural areas, and it is expanding its base in the rural areas of South India, where there is a large untapped potential. The company has a very good brand image in South India and is known for its professionalism, integrity, customer focus and commitment to social responsibility. Its customers include the lower middle class, middle class and upper middle class in the salaried segments of the private, government and joint sectors, as well as traders, small farmers and others running small enterprises of their own in urban, semi-urban and rural areas.

Process of acquiring customers: Auto loans: The business group has an automobile company with a leading brand in the two wheeler market. The company has appointed authorized dealers who have showrooms in various places and also a marketing team. Customers walk into a showroom, and once they decide to buy a vehicle they are approached by the different financing companies present in the showroom, so this financial company has to compete with other companies to win a customer. Once a customer is convinced, his or her personal data is entered in the prescribed format and sent to the call centre. The call centre enters the data into the network and makes it available to the field investigation network.

Another way of acquiring customers is when the financial services company conducts loan melas at different locations, organized either by the company directly in alliance with the dealers or by a dealer himself. Here the number of customers acquired, i.e. the hit rate, is much higher than that obtained from walk-in customers.

The assessment of a prospective customer for a loan is done by a team of staff called field investigators, who visit the customer's workplace and residence and fill in a format which helps them arrive at a score. The score is arrived at by summing the marks given for various attributes such as income, ownership of a house, consumer durables, age, etc. If the score is more than a cutoff, the person is deemed eligible for a loan.
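The scoring rule described above amounts to a simple additive checklist with a cutoff. The short Python sketch below illustrates the idea; the attribute names, marks and cutoff are hypothetical illustrations, not the company's actual scoring sheet.

# Minimal sketch of the field-investigation scoring rule (hypothetical marks and cutoff).
ATTRIBUTE_MARKS = {
    "own_house": {True: 20, False: 5},
    "income_band": {"low": 5, "medium": 15, "high": 25},
    "has_consumer_durables": {True: 10, False: 0},
    "age_band": {"18-25": 5, "26-45": 15, "46-60": 10},
}
CUTOFF = 45  # hypothetical eligibility threshold

def field_score(applicant: dict) -> int:
    # Sum the marks awarded for each attribute recorded by the field investigator.
    return sum(ATTRIBUTE_MARKS[attr][value] for attr, value in applicant.items())

def is_eligible(applicant: dict) -> bool:
    return field_score(applicant) >= CUTOFF

applicant = {"own_house": True, "income_band": "medium",
             "has_consumer_durables": True, "age_band": "26-45"}
print(field_score(applicant), is_eligible(applicant))  # 60 True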

The score is sent to the call centre and from there to the concerned branch, which then handles the further formalities up to loan disbursement.

Process of acquiring customers: Consumer durables: The financial services company has tie-ups with dealers of consumer durables, and a process similar to that described above for auto loans is carried out for the assessment and disbursement of the loans. Another avenue for acquiring customers is cross-selling to customers who have already taken and repaid an auto loan or a personal loan and who exhibited good repayment behavior.

Process of acquiring customers: Personal loans: Here again the customers are existing auto loan or consumer durable loan customers; good customers who have not defaulted even once are offered these loans.

Problem definition: There is tremendous competition among the various banks, both Indian and foreign, and the financial services companies in the private sector. As already mentioned, there is a scramble for the low hanging fruit, which lies in the urban and semi-urban areas. All the players in this sector vie with each other, with the result that margins have become very thin. The players have been forced to look for newer markets and also to compete on the time taken to disburse loans. With this rush to sanction loans there is a lot of pressure on the assessment process that ascertains whether a customer is GOOD or BAD. Even though it is evident, for the sake of emphasis it is

better to state here that neither wrongly rejecting a good customer nor wrongly accepting a bad customer is acceptable, as both affect the bottom line; generally, however, accepting a bad customer wrongly is more harmful than rejecting a good customer wrongly. Unrecovered loans vary anywhere between 1% at best and 25% at worst; for the company under study the typical figure is around 2.5%. So there is every reason for every company to minimize unrecovered loans. The cost of collection is also important, particularly with customers who are not prompt in repayment. Finally, when there is a need to grow aggressively, loans must be disbursed as fast as possible while still being given only to the right customers. Hence it has become imperative for the company under study to develop a credit scoring model which should a) identify and predict a good customer and a bad customer, b) take all the relevant factors into consideration and assign weights according to their relative importance with respect to repayment behavior, and c) arrive at a final score based on the various parameters so that customers can be ranked and the pricing of the loan, in terms of interest, initial down payment, etc., and the collection mechanism can be decided.

As a first step of the study, detailed discussions were carried out with the President, to understand the business, and with the regional managers, branch managers, risk managers, call centre managers and call centre operators, covering the process of acquiring customers, assessment of customers, collection mechanisms, and challenges in terms of technology, people and processes. The objectives of the study were to a) understand the operational aspects related to risk, b) understand the aspects of customer data capture and the process of field investigation, c) understand risk assessment, d) ascertain the practical aspects of the quality of information obtained from customers and the process of validating that information, e) assess the attitudes of the field investigators with respect to credit scoring models and their willingness to gather information which may be more critical, f) understand the transaction between the DSA and the prospective customer and the process of conversion, g) ascertain the dealers' perspective on the scoring of customers, h) understand the process of getting customer information with respect to the credit scoring model and validating it, and i) check the data for completeness and consistency and make a preliminary study of any visible patterns.

Visits were made to dealers, master field investigators, field investigators, branches, direct selling associates and customers in rural, semi-urban and urban areas in different regions.

The company had a very good data warehouse of customer data, which was used to collect secondary data on the customers. The customers comprised people who had availed loans for buying a two wheeler. Most of them are from rural and semi-urban areas, and most are self-employed, in small trade, marginal farming or some other form of small business. The vehicles are used directly by traders and businessmen in their occupation, or indirectly to help in their business or farming, for example to transport farm inputs or produce.

A database of around 12,000 customers was taken. It was a mixture of good and bad customers. The database had 38 fields, with data such as age, qualification, profession, income, possession of consumer durables, house, number of dependents, down payment and advance equated monthly installments.

Data collection and sample size: The total customer database has around 150,000 records, from which a random sample of around 12,000 records was drawn. Good or bad customer is the dependent variable, represented as 1 or 0, and the following seventeen independent variables were used in the model: 1. Income, 2. Advance EMI, 3. Dependents, 4. Experience, 5. Rent, 6. Down payment, 7. Consumer durables, 8. Interest, 9. Vehicles, 10. Age, 11. Other income, 12. TV, 13. Music system, 14. Fridge, 15. Two wheeler, 16. Four wheeler, 17. Qualification.

Methodology for analysis: The data set of 12,000 customers with seventeen variables was subjected to factor analysis, which reduced the seventeen variables to the following five factors (some of the variables loaded as independent factors on their own): 1. Income and assets, 2. Consumer durables, 3. Initial payments (down payment, advance EMI), 4. Vehicles, 5. Dependents.
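For readers who wish to reproduce this step, the sketch below shows one way of carrying out the variable reduction in Python. The column names and file name are hypothetical; the study used SPSS-style principal components with varimax rotation, approximated here with scikit-learn's FactorAnalysis (rotation="varimax").

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import FactorAnalysis

# Seventeen input variables as listed in the data structure above (hypothetical column names).
FEATURES = ["income", "adv_emi", "dependents", "experience", "rent",
            "down_payment", "consumer_durables", "interest", "vehicles", "age",
            "other_income", "tv", "music_system", "fridge", "two_wheeler",
            "four_wheeler", "qualification"]

df = pd.read_csv("loan_customers.csv")            # hypothetical data file
X = StandardScaler().fit_transform(df[FEATURES])  # standardize before factoring

fa = FactorAnalysis(n_components=5, rotation="varimax", random_state=0)
factor_scores = fa.fit_transform(X)               # one row of five factor scores per customer

# Loadings, analogous to the rotated component matrix reported below.
loadings = pd.DataFrame(fa.components_.T, index=FEATURES,
                        columns=[f"Factor{i+1}" for i in range(5)])
print(loadings.round(3))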

The rotated factor matrix is shown below:


Rotated Component Matrix (a)

Variable     Comp. 1   Comp. 2   Comp. 3   Comp. 4   Comp. 5   Comp. 6
cfsan_emi       .198      .025      .037      .781     -.014     -.062
ADVEMI         -.108      .008     -.084      .814      .047      .049
dependent      -.005      .005      .453     -.007      .016     -.311
CHILDREN       -.089     -.005     -.053      .079      .040      .833
INCOME          .032      .970     -.006      .040      .026     -.004
experience      .035      .022      .468     -.009      .194      .353
RENT           -.116     -.001     -.018      .178      .206     -.305
AGE            -.137      .010      .638     -.035      .007      .020
othincome      -.014      .970      .034     -.005     -.015      .004
TV              .670      .000      .545     -.021     -.008     -.039
MS              .675      .004      .525     -.014     -.061     -.034
FRIDGE          .803      .003     -.126      .063      .091      .039
WM              .724      .016     -.289      .026      .099     -.001
TW              .066     -.024      .257      .045      .655      .071
FW              .080      .032     -.124     -.017      .770     -.080

Extraction Method: Principal Component Analysis. Rotation Method: Varimax with Kaiser Normalization. a. Rotation converged in 6 iterations.

The resultant factor scores were then applied to the past data to test their predictive accuracy, achieving about 40 percent accuracy in prediction. Though this is considered significant, further improvement of the prediction accuracy was explored with the NN methodology.

Methodology of NN: NN is an assumption-free, non-algorithmic approach to estimating the relationship between dependent and independent variables. That relationship can be explained with the help of algorithm-based techniques like regression, discriminant analysis or factor analysis, but the predictive performance of these techniques depends on the extent to which the dependent and independent variables are linearly related. When the underlying relationship is non-linear, a different approach is required for modeling it, namely the non-algorithmic approach of NN. The given dataset is divided into training and testing data sets. NN uses the training data set to model the relationship between the inputs (independent variables) and the output (dependent variable). Starting with a randomly assigned weight matrix, NN maps the inputs to the output with the help of this weight matrix and continuously refines the weight

matrix to get the best possible fit between the input variables and the output. A perfect fit will never be achieved, so a stopping rule has to be specified in the form of an error term; root mean squared error is normally used for this purpose. NN stops refining the weight matrix when the root mean squared error reaches a certain target value. The weight matrix is then applied to the test data to check its reliability. The refinement of the weight matrix is done with the help of various algorithms, of which the back propagation algorithm, a method of distributing the error within the network, is the most popular. The performance of NN can be improved through various ways of configuring the network and supplying proper inputs.
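The training loop just described, a randomly initialized weight matrix refined by back propagation until the root mean squared error falls below a target, can be illustrated with the toy Python sketch below. The layer sizes, learning rate and RMSE target are illustrative assumptions, not the configuration used in the study.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, hidden=8, lr=1.0, rmse_target=0.05, max_epochs=20000):
    n, d = X.shape
    W1 = rng.normal(scale=0.5, size=(d, hidden))   # randomly assigned weight matrices
    W2 = rng.normal(scale=0.5, size=(hidden, 1))
    rmse = np.inf
    for _ in range(max_epochs):
        h = sigmoid(X @ W1)                        # forward pass
        out = sigmoid(h @ W2)
        err = out - y
        rmse = np.sqrt(np.mean(err ** 2))
        if rmse < rmse_target:                     # stopping rule on root mean squared error
            break
        d_out = err * out * (1 - out)              # back propagation: distribute the error
        d_h = (d_out @ W2.T) * h * (1 - h)
        W2 -= lr * h.T @ d_out / n                 # refine the weight matrices
        W1 -= lr * X.T @ d_h / n
    return W1, W2, rmse

# Tiny non-linear (XOR-like) example just to exercise the loop.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)
_, _, final_rmse = train(X, y)
print("final RMSE:", round(float(final_rmse), 4))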

The same data was subjected to NN, which could not generalize from the raw data. So, instead of subjecting the raw data to NN, the extracted factor scores for each case were used as inputs to NN. This dramatically improved the performance of NN, which converged in three hours with 98% accuracy. The classification matrix is shown below:
Partition   Class   Total    Correct   Wrong   Unknown   Classified good   Classified bad
Test        Good     2673       2585      88         0              2585               88
Test        Bad      3354       3349       5         0                 5             3349
Verify      Good     1460       1409      51         0              1409               51
Verify      Bad      1553       1553       0         0                 0             1553
Validate    Good     1412       1364      48         0              1364               48
Validate    Bad      1601       1599       2         0                 2             1599
Total               12053      11859     194         0              5365             6688

Overall accuracy: 98.39%
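A rough end-to-end rendering of the combined approach, with factor scores rather than the seventeen raw variables fed to a neural network and the results tabulated over the partitions, is sketched below. The file and column names, split proportions and network settings are assumptions for illustration, not the study's actual configuration.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import FactorAnalysis
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

df = pd.read_csv("loan_customers.csv")                  # hypothetical data file
y = df["good_customer"]                                 # 1 = good, 0 = bad (hypothetical column)
X = StandardScaler().fit_transform(df.drop(columns=["good_customer"]))

# Step 1: reduce the correlated inputs to five factor scores.
scores = FactorAnalysis(n_components=5, rotation="varimax",
                        random_state=0).fit_transform(X)

# Step 2: split into train / verify / validate partitions (50/25/25 here).
S_train, S_rest, y_train, y_rest = train_test_split(
    scores, y, test_size=0.5, random_state=0, stratify=y)
S_verify, S_valid, y_verify, y_valid = train_test_split(
    S_rest, y_rest, test_size=0.5, random_state=0, stratify=y_rest)

# Step 3: train a network on the factor scores and tabulate the classification results.
net = MLPClassifier(hidden_layer_sizes=(15,), max_iter=2000, random_state=0)
net.fit(S_train, y_train)
for name, S, t in [("Train", S_train, y_train), ("Verify", S_verify, y_verify),
                   ("Validate", S_valid, y_valid)]:
    pred = net.predict(S)
    print(name, round(accuracy_score(t, pred), 4))
    print(confusion_matrix(t, pred))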

NN creates a test data set, tries various network topologies, selects the best performing network and weights, and applies those weights to the verification and validation data. Thus, we get the overall summary in the form of a misclassification table. The networks tried by NN are shown below:

Network   Topology   Error      Input   Hidden   Performance
1         RBF        0.496363       2        1     0.5599071
2         RBF        0.492150       2        2     0.5618984
3         Linear     0.342226       3        -     0.9784268
4         Linear     0.342162       4        -     0.9807501
5         Linear     0.342103       5        -     0.9824096
6         RBF        0.298466       2        4     0.9724527
7         MLP        0.083890       2        7     0.9923664
8         MLP        0.083850       3       17     0.9923664
9         MLP        0.081590       4       14     0.9926983
10 *      MLP        0.079790       4       15     0.9923664

Note: 1. The Topology column shows the algorithm chosen by NN; MLP refers to the multi-layer perceptron. 2. The reduction in error is accompanied by an improvement in performance. 3. The Input and Hidden columns show the number of input and hidden nodes used in the network.
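The model-selection step implied by the table, in which several candidate configurations are trained and the best performer on the verification data is retained, could be sketched as below. The candidate list is illustrative: scikit-learn has no radial basis function network estimator, so a logistic regression stands in for the linear networks, and the partitions (S_train, y_train, S_verify, y_verify) are reused from the pipeline sketch above.

from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# Hypothetical candidate configurations, loosely echoing the table above.
candidates = {
    "linear":  LogisticRegression(max_iter=1000),
    "mlp_7":   MLPClassifier(hidden_layer_sizes=(7,),  max_iter=2000, random_state=0),
    "mlp_14":  MLPClassifier(hidden_layer_sizes=(14,), max_iter=2000, random_state=0),
    "mlp_17":  MLPClassifier(hidden_layer_sizes=(17,), max_iter=2000, random_state=0),
}

results = {}
for name, model in candidates.items():
    model.fit(S_train, y_train)                         # factor-score partitions from the sketch above
    results[name] = model.score(S_verify, y_verify)     # proportion correctly classified

best = max(results, key=results.get)
print(results)
print("best network:", best)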

IV Summary and Conclusion

NN is a powerful technique for predicting the behavior of an output based on a given set of input values. It has very good potential for application in loan default prediction and lending automation through credit scoring models. But NN per se has limitations when there are a large number of input variables and the relationship among these variables is non-linear and complex. This is where a conventional algorithm-based technique like factor analysis lends a hand: it is able to reduce the complexity to a great extent by reducing the number of variables to a small number of dimensions. When these factor scores are given as input values to NN, it performs dramatically well. Factor analysis alone is not able to achieve 98 percent prediction accuracy, but factor analysis coupled with NN is able to deliver 98 percent prediction accuracy. This has been illustrated in this paper with the help of live data from a financial services company. The future holds promise for such combined approaches to prediction, using linear and non-linear techniques together.

References:
1. Colgate, Mark and Lang, Bodo [2001], Switching barriers in consumer markets: an investigation of the financial services industry, Journal of Consumer Marketing, Vol. 18, No. 4, pp. 332-347.
2. Eric Rahimian, Seema Singh, Thongchai Thammachote and Rajiv Virmani, Bankruptcy prediction by neural network.
3. Kevin G. Coleman, Timothy J. Graettinger and William F. Lawrence [1991], Neural networks for bankruptcy prediction: the power to solve financial problems, AI Review, July/August 1991, pp. 48-50.
4. Kar Yan Tam and Melody Y. Kiang [1992], Managerial applications of neural networks: the case of bank failure predictions, Management Science, Vol. 38, No. 7, July 1992, pp. 926-947.
5. Linda M. Salchenberger, E. Mine Cinar and Nicholas A. Lash [1992], Neural networks: a new tool for predicting thrift failures, Decision Sciences, Vol. 23, No. 4, July/August 1992, pp. 899-916.
6. Marcus D. Odom and Ramesh Sharda [1992], A neural network model for bankruptcy prediction, IEEE International Conference on Neural Networks, pp. II163-II168, San Diego, CA.
7. Minhas, R.S. and Jacobs, E.M. [1996], Benefit segmentation by factor analysis: an improved method of targeting customers for financial services, International Journal of Bank Marketing, 14/3, March, pp. 3-13.
8. Pinches, Mingo and J. Kent Caruthers [1973], The stability of financial patterns in industrial organizations, Journal of Finance, 28, pp. 389-396.
9. Scannell, Nancy J., Safdari, Cyrus and Newton, Judy [2003], An extended application of factor analysis in establishing peer groups among banks in Armenia, Journal of the Academy of Business and Economics, January.
10. Spearman, http://www.indiana.edu/~intell/spearman.shtml, accessed on October 22, 2005.
11. Timo Salmi, Ilkka Virtanen and Paavo Yli-Olli [1990], On the classification of financial ratios: a factor and transformation analysis of accrual, cash flow and market based ratios, No. 25, Business Administration No. 9, Accounting and Finance, Universitas Wasaensis.
12. Wullianallur Raghupathi, Lawrence L. Schkade and Bapi S. Raju [1991], A neural network approach to bankruptcy prediction, Proceedings of the IEEE 24th Annual Hawaii International Conference on System Sciences.

13. Yli-Olli, Paavo and Virtanen, Ilkka [1990], Transformation analysis applied to long-term stability and structural invariance of financial ratio patterns: US vs. Finnish firms, American Journal of Mathematical and Management Sciences, 10.

Details of Authors: 1. M.J. Xavier: 2. Sundharamurthy: 3. P.K. Viswanathan: Faculty, Institute for Financial Management and Research, IFMR, pkv@ifmr.ac.in 4. G. Balasubramanian: Faculty, Institute for Financial Management and Research, IFMR, bala@ifmr.ac.in
