
Business Analytics Assignment 2

Credit Scoring

Submitted to:

Dr. Abhinanda Sarkar

Submitted by

Siddhanth Hiremath
MYDM-2019-07
Credit Scoring Using Logistic Regression:

Breaking Data into Training and Test Sample:

The following code creates a training set from approximately 80% of the data (1,000 of the 1,225 observations, selected at random) and uses the remaining 225 observations (approximately 20%) as the test sample.

ind = sample(1225, 225)      # randomly pick 225 row indices for the test set
cred.test = cred[ind, ]      # 225-observation test sample
cred.train = cred[-ind, ]    # remaining 1,000 observations form the training set
dim(cred.train)

Build model and Calculate ROC Curve:

There is a strong literature base showing that optimal credit-scoring cut-off decisions can be made using ROC curves, which plot the business implications of the model's true positive rate against its false positive rate at each score cut-off point.

library(ROCR)   # for prediction() and performance()
log.train = glm(BAD ~ DOB + SINC + DAINC + RES + DOUTCC, family = binomial, data = cred.train)
cred.test.p = predict(log.train, newdata = cred.test, type = 'response')   # predicted default probabilities
pred = prediction(cred.test.p, cred.test$BAD)
perf <- performance(pred, "tpr", "fpr")   # ROC: true positive rate vs. false positive rate
plot(perf)

Calculating KS Statistic:

The KS statistic is the maximum difference between the cumulative true positive rate and the cumulative false positive rate across all score cut-off points.

max(attr(perf,'y.values')[[1]] - attr(perf,'x.values')[[1]])   # max(TPR - FPR) over all cut-offs

0.2637025

The cut-off at which this maximum separation occurs is the natural candidate for the credit-granting decision; if a different cut-off is used, the KS value by itself says little about the separation actually achieved at the chosen cut-off.
- Accuracy on the training dataset is 59.1% using the model built on it; misclassification is 40.9%.
- Accuracy on the test dataset is 63.55% using the model built on the training dataset, with the cut-off set to 0.26; misclassification is 36.45% (see the sketch below).
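
These figures can be reproduced directly from the fitted model. A minimal sketch, assuming BAD is coded 0/1 and reusing the objects defined above (exact values vary with the random split):

cutoff = 0.26
train.p = predict(log.train, type = 'response')                        # probabilities on the training set
test.p  = predict(log.train, newdata = cred.test, type = 'response')   # probabilities on the test set
mean(ifelse(train.p > cutoff, 1, 0) == cred.train$BAD)   # training accuracy (approx. 0.59)
mean(ifelse(test.p  > cutoff, 1, 0) == cred.test$BAD)    # test accuracy (approx. 0.64)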

Confusion Matrix and Statistics:

          Reference
Prediction   0   1
         0 100  69
         1  16  40

Accuracy : 0.6222
95% CI : (0.5554, 0.6858)
No Information Rate : 0.5156
P-Value [Acc > NIR] : 0.000812
Mcnemar's Test P-Value : 1.699e-08
Sensitivity : 0.8621
Specificity : 0.3670
Pos Pred Value : 0.5917
Neg Pred Value : 0.7143
Prevalence : 0.5156
Detection Rate : 0.4444
Detection Prevalence : 0.7511
Balanced Accuracy : 0.6145

The accuracy of the above model is 62.22% (95% CI: 0.5554 to 0.6858), with 86.21% sensitivity and 36.70% specificity at a cut-off of 0.26.
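
The matrix and statistics above can be reproduced with caret's confusionMatrix(). A minimal sketch, assuming a 0.26 cut-off applied to the test-set probabilities computed earlier:

library(caret)   # for confusionMatrix()
pred.class = factor(ifelse(cred.test.p > 0.26, 1, 0), levels = c(0, 1))   # classify at the 0.26 cut-off
confusionMatrix(table(pred.class, cred.test$BAD))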

Linear Discriminant Analysis:

library(MASS)   # for lda()
index = sample(1225, 1000)
cred.train = cred[index, ]    # 1,000-observation training set
cred.test = cred[-index, ]    # remaining 225 observations as the test set
lda_mod = lda(BAD ~ ., data = cred.train)
plot(lda_mod)
lda_pred = predict(lda_mod, cred.test)   # predictions on the test set
mean(lda_pred$class == cred.test$BAD)

0.71

It can be seen that our model correctly classified 71% of the test observations, which is good.

Variable selection:

Note that if the predictor variables are standardized before computing the LDA, the discriminant weights (the coefficients of the linear discriminants) can be used as measures of variable importance for feature selection, as sketched below.
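
A minimal sketch of this idea, assuming DOB, SINC, DAINC, RES and DOUTCC are all numeric (a categorical predictor would need to be excluded from the scaling step):

num.vars = c("DOB", "SINC", "DAINC", "RES", "DOUTCC")
cred.train.sc = cred.train
cred.train.sc[num.vars] = scale(cred.train.sc[num.vars])   # centre and scale to unit variance
lda_sc = lda(BAD ~ DOB + SINC + DAINC + RES + DOUTCC, data = cred.train.sc)
lda_sc$scaling   # discriminant weights; larger |weight| suggests a more influential variable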

lda_mod = lda(BAD ~ DOB + SINC + DAINC + RES + DOUTCC, data = cred.train)   # LDA with the selected predictors
plot(lda_mod)

lda_pred = predict(lda_mod, cred.test)   # predict on the test set
names(lda_pred)
[1] "class"     "posterior" "x"

Model accuracy:

mean(lda_pred$class==cred.test$BAD)

0.723

It can be seen that our model correctly classified 72.3% of the test observations, which is good.

summary(lda_pred$class)
  0   1
213  12

The model predicts 213 non-defaulters and 12 defaulters in the 225-observation test sample.

table(lda_pred$class, cred.test$BAD)

      0   1
  0 163  50
  1   6   6

xtab = table(lda_pred$class, cred.test$BAD)
confusionMatrix(xtab)
Confusion Matrix and Statistics

      0   1
  0 163  50
  1   6   6

Accuracy : 0.7511 (i.e. 75.11%)


95% CI : (0.6893, 0.8062)
No Information Rate : 0.7511
P-Value [Acc > NIR] : 0.5358
Mcnemar's Test P-Value : 9.132e-09
Sensitivity : 0.9645 (i.e. 96.45%)
Specificity : 0.1071 (i.e. 10.71%)
Pos Pred Value : 0.7653
Neg Pred Value : 0.5000
Prevalence : 0.7511
Detection Rate : 0.7244
Detection Prevalence : 0.9467
Balanced Accuracy : 0.5358
'Positive' Class : 0

The model's accuracy is 75.11%, with a 95% confidence interval of (0.6893, 0.8062).

Quadratic Discriminant Analysis:

qda_mod = qda(BAD ~ DOB + SINC + DAINC + RES + DOUTCC, data = cred.train)   # qda() is also from MASS
qda_pred = predict(qda_mod, cred.test)   # predict on the test set

QDA model accuracy:

mean(qda_pred$class==cred.test$BAD)
0.583

It can be seen that the QDA model correctly classifies 58.3% of the test observations, which is not good enough.
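
For a closer look at where the QDA model goes wrong, its test-set confusion matrix can be tabulated in the same way as for the LDA model; a minimal sketch (output not reproduced here):

table(qda_pred$class, cred.test$BAD)   # rows: predicted class, columns: actual BAD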

Observation and Comparison:

Accuracy Comparison:

All figures are percentages.

Model (cut-off)                   Train Accuracy   Train Misclassification   Test Accuracy   Test Misclassification
Logistic Regression (0.26)             59.1                40.9                  63.55               36.45
Logistic Regression (0.4)              72.2                27.8                  78.22               21.78
Linear Discriminant Analysis           73.5                26.5                  72.3                27.7
Quadratic Discriminant Analysis        65.1                34.9                  58.3                41.7


For a cut-off of 0.4 or higher, the logistic model's accuracy (above 72%) is better than that of the other two models (LDA and QDA); for cut-offs below 0.4, LDA's accuracy is better.
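
A minimal sketch of how the logistic accuracies at different cut-offs can be tabulated, scoring the current test set with the fitted logistic model (exact values depend on the random split):

test.p = predict(log.train, newdata = cred.test, type = 'response')   # logistic probabilities on the test set
for (cutoff in c(0.26, 0.4)) {
  acc = mean(ifelse(test.p > cutoff, 1, 0) == cred.test$BAD)
  cat(sprintf("cut-off %.2f: accuracy %.2f%%, misclassification %.2f%%\n",
              cutoff, 100 * acc, 100 * (1 - acc)))
}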

Sensitivity and specificity Comparison:

Logistic Model:

From the ROC curve above, we can conclude that the given model predicts 95.41% of bad loans correctly.

LDA Model:

Sensitivity : 96.45% (i.e. a true positive rate of 96.45%)
Specificity : 10.71%

The false positive rate is therefore 1 - 0.1071 = 89.29%.

Conclusion:

In general, the logistic model and the linear discriminant model produced similar results. Both methods estimated more or less the same statistically significant coefficients, with similar effect sizes and directions, although logistic regression estimated slightly larger coefficients than the LDA and QDA models. The overall classification rate for both the logistic model (cut-off >= 0.26) and the linear discriminant model was good, and either can be helpful for prediction. Logistic regression slightly exceeds the discriminant function in correct classification rate, but the difference in AUC was not large (a quick AUC check is sketched below), indicating no real discriminating difference between the models. Ultimately, the choice of analysis method will depend on the particular characteristics and requirements of the application, including the plausibility of the required assumptions and computational convenience.
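
A minimal sketch of computing the AUC for the logistic model with ROCR, reusing the prediction object pred from above; the same approach applies to the LDA or QDA posterior probabilities if they are wrapped in a prediction object:

auc = performance(pred, "auc")   # area under the ROC curve
auc@y.values[[1]]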