
Partial Least Squares Discriminant Analysis
Classifier to Treat Unbalanced Data:
A Case Study on Bankrupt Malaysian SMEs

Sallehuddin Hussin
2013597357

Problem Statement
Problems with bankruptcy studies on SMEs:
Unbalanced dataset - Bankruptcy is a very rare event. The number of bankrupt firms is very small compared with non-bankrupt firms, so the dataset is highly imbalanced.
A minority class of less than 5% is known as a rare event (Au et al. 2010).
The model is not meaningful because it lacks information to learn from the rare event (Bee Wah Yap et al. 2014).

Problem Statement
Multicollinearity - Financial ratios are ratios of two financial items (such as total assets, total liabilities and current assets). The same item may be used to make up several ratios, which can cause multicollinearity.
Multicollinearity is a major problem when building models based on financial data (Serrano et al. 2013).
An advantage of PLS is its ability to deal with multicollinearity and its robustness to missing data and skewed distributions (Cassel et al. 1999).

Research Question
How efficient are the re-sampling techniques SMOTE and One-Sided Selection (OSS) in classifying bankrupt SMEs from non-bankrupt ones?
How efficient is the PLSDA classifier in classifying bankrupt SMEs from non-bankrupt ones?

Objectives
To compare the performance of the re-sampling techniques SMOTE and One-Sided Selection (OSS) on the bankruptcy dataset.
To determine the efficiency of the PLSDA classifier in classifying bankrupt and non-bankrupt Malaysian SMEs.

Framework
MODEL CONSTRUCTION
Resampling: SMOTE; One-Sided Selection (OSS)
Classifier: PLSDA (Original PLSDA; PLSDA + exponential; PLSDA + Nearest Neighbors)
MODEL EVALUATION
Evaluation: AUC; Accuracy; Sensitivity; Specificity

Resampling-SMOTE
Synthetic Minority Over-sampling Technique
Over-sampling technique proposed by Chawla et al. (2002).
The idea is to form new minority examples by interpolating between examples of the same class.
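
The interpolation idea can be sketched in base R as follows. This is only an illustrative sketch, not the code used in the study: the function name `smote_sketch` and the parameters `n_new` and `k` are assumptions introduced here, and the toy minority data are random numbers.

```r
# SMOTE-style sketch (illustrative): each synthetic point lies on the line
# segment between a minority case and one of its k nearest minority neighbours.
# `smote_sketch`, `n_new` and `k` are hypothetical names, not from the slides.
smote_sketch <- function(X_min, n_new, k = 5) {
  X_min <- as.matrix(X_min)
  n <- nrow(X_min)
  k <- min(k, n - 1)
  D <- as.matrix(dist(X_min))      # pairwise Euclidean distances
  diag(D) <- Inf                   # a case is not its own neighbour
  synth <- matrix(NA_real_, nrow = n_new, ncol = ncol(X_min))
  for (i in seq_len(n_new)) {
    base <- sample.int(n, 1)                 # pick a minority case
    nn <- order(D[base, ])[seq_len(k)]       # its k nearest minority neighbours
    nb <- nn[sample.int(k, 1)]               # pick one neighbour at random
    gap <- runif(1)                          # position along the segment
    synth[i, ] <- X_min[base, ] + gap * (X_min[nb, ] - X_min[base, ])
  }
  synth
}

set.seed(1)
minority <- matrix(rnorm(20), ncol = 2)      # 10 toy minority cases
new_points <- smote_sketch(minority, n_new = 30)
dim(new_points)
```

Because every synthetic case is a convex combination of two real minority cases, it always stays within the range spanned by the observed minority data.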


Resampling-OSS
One-Sided Selection
Under-sampling technique proposed by Kubat et al. (1997).
The idea is to reduce the majority class by considering only its important observations near the class border, while keeping the minority group.
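
A rough base R sketch of this kind of under-sampling is given below. It is an assumption-laden illustration rather than the study's procedure: it keeps all minority cases, adds the majority cases that a 1-nearest-neighbour rule misclassifies, and then drops majority cases that form Tomek links (mutual nearest neighbours of opposite classes). The function name `oss_sketch`, the label values and the toy data are hypothetical.

```r
# One-sided selection sketch (illustrative, not the study's code).
# `oss_sketch` and the label "bankrupt" are hypothetical names.
oss_sketch <- function(X, y, minority = "bankrupt") {
  D <- as.matrix(dist(as.matrix(X)))
  diag(D) <- Inf
  is_min <- y == minority
  maj_idx <- which(!is_min)
  # Step 1: start with every minority case plus one random majority case
  keep <- c(which(is_min), maj_idx[sample.int(length(maj_idx), 1)])
  # Step 2: 1-NN against `keep`; retain majority cases that are misclassified
  rest <- setdiff(maj_idx, keep)
  nn_in_keep <- vapply(rest, function(i) keep[which.min(D[i, keep])], integer(1))
  keep <- c(keep, rest[is_min[nn_in_keep]])
  # Step 3: remove majority cases taking part in Tomek links
  Dk <- D[keep, keep, drop = FALSE]
  nn <- apply(Dk, 1, which.min)
  mutual <- nn[nn] == seq_along(keep)
  tomek <- mutual & (is_min[keep] != is_min[keep[nn]])
  sort(keep[!(tomek & !is_min[keep])])
}

set.seed(1)
X <- rbind(matrix(rnorm(200), ncol = 2),             # 100 majority cases
           matrix(rnorm(10, mean = 2), ncol = 2))    # 5 minority cases
y <- c(rep("nonbankrupt", 100), rep("bankrupt", 5))
kept <- oss_sketch(X, y)
table(y[kept])
```

Only the majority class is ever reduced, which is why the selection is called one-sided.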


Classifier: Partial Least Squares
Not originally designed for discrimination, but empirical studies have shown that it performs well for classification.
PLS transforms the original variables into orthogonal components (latent variables) by taking into account both the independent and dependent variables.
It is a dimension-reduction method that extracts the score vectors as new independent variables.
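
How score vectors are extracted can be illustrated with a small base R sketch for a single response, using NIPALS-style deflation. It is a simplified sketch under stated assumptions, not the algorithm as implemented in any particular package: the function name `pls1_scores`, the toy predictors and the 0/1 coding of the class are all introduced here for illustration.

```r
# PLS sketch for one response (illustrative): each component is the direction
# in X with maximal covariance with y; X is deflated after each component.
# `pls1_scores` is a hypothetical name.
pls1_scores <- function(X, y, ncomp = 2) {
  X <- scale(as.matrix(X), center = TRUE, scale = FALSE)
  y <- y - mean(y)
  scores <- matrix(0, nrow(X), ncomp)
  for (a in seq_len(ncomp)) {
    w <- crossprod(X, y)                    # weight vector, proportional to X'y
    w <- w / sqrt(sum(w^2))
    t_a <- X %*% w                          # score vector (latent variable)
    p <- crossprod(X, t_a) / sum(t_a^2)     # X loadings
    q <- sum(y * t_a) / sum(t_a^2)          # y loading
    X <- X - t_a %*% t(p)                   # deflate X
    y <- y - as.vector(t_a) * q             # deflate y
    scores[, a] <- t_a
  }
  scores
}

set.seed(1)
x1 <- rnorm(50)
X <- cbind(x1, x2 = x1 + rnorm(50, sd = 0.01), x3 = rnorm(50))  # x1 and x2 nearly collinear
y <- as.numeric(x1 + X[, "x3"] > 0)                             # 0/1 dummy-coded class
T_scores <- pls1_scores(X, y)
round(crossprod(T_scores), 6)   # off-diagonal entries are zero: scores are orthogonal
```

Even though two of the toy predictors are almost collinear, the extracted scores are mutually orthogonal, which is what lets them serve as new independent variables.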

Classifier: Partial Least Squares
PLS was introduced by Wold (1966), who created the Non-linear Iterative Partial Least Squares (NIPALS) algorithm.
PLS was then improved by de Jong (1993) with SIMPLS, which simplifies the algorithm.
Barker (2003) formally discussed PLS for discrimination by relating the PLS algorithm to Linear Discriminant Analysis (LDA). Partial Least Squares Discriminant Analysis (PLS-DA) is a variant used when the dependent variable is categorical.


Model Performance measures

                   Predicted Positive   Predicted Negative
Actual Positive    TP                   FN
Actual Negative    FP                   TN

Accuracy = (TP + TN) / (TP + FN + FP + TN)
Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)

Area under the ROC curve (AUC): a ROC graph is a two-dimensional graph in which Sensitivity is plotted on the Y axis and 1 - Specificity is plotted on the X axis.
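
As a small illustration of the AUC, the sketch below computes it from ranks: the AUC equals the proportion of (positive, negative) pairs in which the positive case receives the higher score. The function name `auc_rank` and the toy scores are assumptions made for this example only.

```r
# Rank-based AUC sketch (illustrative): equivalent to the area under the
# ROC curve. `auc_rank` is a hypothetical name.
auc_rank <- function(score, is_pos) {
  r <- rank(score)                       # average ranks handle ties
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

score  <- c(0.9, 0.8, 0.35, 0.6, 0.2, 0.1)          # toy classifier scores
is_pos <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)  # TRUE = positive class
auc <- auc_rank(score, is_pos)
auc   # 8 of the 9 positive-negative pairs are ordered correctly
```

An AUC of 0.5 corresponds to random ordering and 1 to perfect separation of the two classes.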

Methodology-Data
This research uses secondary data (financial statements) obtained from Suruhanjaya Syarikat Malaysia (SSM), the Companies Commission of Malaysia.
This study focuses only on firms categorised under the TRANSPORTATION & STORAGE service sector of Malaysian SMEs.

Methodology-Data
                                        2006   2007   2008   2009   2010   Total
One year before firm goes bankrupt         9      9      7      6      9
Two years before firm goes bankrupt        5      6      1      7      1
Three years before firm goes bankrupt      5      4      7      1      -
Bankrupt                                  19     19     15     14     10   77 (0.8%)
NonBankrupt                             1548   1706   1883   1925   2051   9113 (99.2%)
Total                                   1567   1725   1898   1939   2061   9190

The table shows the number of bankrupt and non-bankrupt firms by financial year-end.
This study uses the financial statements from the three years before a firm went bankrupt as the bankruptcy data.
All data for firms still existing during the study period are selected as the non-bankruptcy data.
The total number of observations from 2006 to 2010 is 9190, of which 0.8% are bankruptcy cases and the rest are non-bankruptcy cases. This shows that the dataset is highly unbalanced.

Methodology-Data
The same yearly table is split as follows:
Training: 2006 to 2008 (60%)
Testing: 2009 to 2010 (40%)

For the classification model, the data are split into a model construction (training) set and a model evaluation (testing) set.
For this study, the training data cover the periods 2006 to 2008 inclusive, while the testing data cover 2009 to 2010. The ratio is approximately 60:40.
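
The split sizes follow directly from the yearly totals in the data table. The short R sketch below redoes that arithmetic; the variable names are chosen here for illustration.

```r
# Yearly counts taken from the data table (2006 to 2010)
years       <- 2006:2010
nonbankrupt <- c(1548, 1706, 1883, 1925, 2051)
total       <- c(1567, 1725, 1898, 1939, 2061)
bankrupt    <- total - nonbankrupt        # bankrupt observations per year

is_train <- years <= 2008                 # 2006-2008 train, 2009-2010 test
n_train  <- sum(total[is_train])          # 5190 observations
n_test   <- sum(total[!is_train])         # 4000 observations

# Share of each set, in percent
split_pct <- round(100 * c(train = n_train, test = n_test) / sum(total), 1)
split_pct
# Share of bankrupt observations in the whole dataset, in percent
bankrupt_pct <- round(100 * sum(bankrupt) / sum(total), 1)
bankrupt_pct
```

The exact shares are about 56.5% and 43.5%, which the slides round to 60:40.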

Methodology-Variables
The output variable is the status of the company: bankrupt or non-bankrupt.
The input variables are the financial ratios of the company.
Adopted from Hossari & Rahman (2006), who ranked 48 financial ratios by popularity.
Used 22 out of the 48 financial ratios from the list.
According to Ivo (2011), a ratio whose denominator can be zero or negative does not make sense. She suggests winsorising* the denominator to a small positive value.
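
One possible reading of this treatment is sketched below in R: any denominator at or below a small positive floor is replaced by that floor before the ratio is computed. The function name `safe_ratio`, the parameter `floor_value` and the example figures are assumptions; the slides do not state which floor was actually used.

```r
# Denominator-flooring sketch (illustrative). `safe_ratio` and `floor_value`
# are hypothetical; the actual floor used in the study is not given.
safe_ratio <- function(numerator, denominator, floor_value = 0.001) {
  numerator / pmax(denominator, floor_value)   # zero/negative denominators become floor_value
}

total_liabilities <- c(120, 80, 50)
total_equity      <- c(60, 0, -10)             # zero and negative equity
tl_te <- safe_ratio(total_liabilities, total_equity, floor_value = 1)
tl_te
```

With the floor in place the ratio stays finite and keeps its sign, at the cost of very large values for firms with near-zero denominators.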

Methodology-Variables
No   Variable   Detail
1    NI/TA      Net Income/Total Assets
2    CA/CL*     Current Assets/Current Liabilities
3    TL/TA      Total Liabilities/Total Assets
4    WC/TA      Working Capital/Total Assets
5    TL/TE*     Total Liabilities/Total Equity
6    S/TA       Sales/Total Assets
7    CA/S       Current Assets/Sales
8    CA/TA      Current Assets/Total Assets
9    NI/S       Net Income/Sales
10   NI/TE*     Net Income/Total Equity
11   TE/TA      Total Equity/Total Assets
12   WC/S       Working Capital/Sales
13   S/FA*      Sales/Fixed Assets
14   TE/TL      Total Equity/Total Liabilities
15   FA/TA      Fixed Assets/Total Assets
16   FA/TE*     Fixed Assets/Total Equity
17   LTL/TA     Long-Term Liabilities/Total Assets
18   CL/TA      Current Liabilities/Total Assets
19   CL/TE*     Current Liabilities/Total Equity
20   EBT/TA     Earnings Before Taxes/Total Assets
21   LTL/TE*    Long-Term Liabilities/Total Equity
22   S/TE*      Sales/Total Equity
23   TE/LTL*    Total Equity/Long-Term Liabilities

Preliminary Study-Descriptive

       Skew                             Kurtosis                         se
       Bankrupt  Nonbankrupt  All       Bankrupt  Nonbankrupt  All       Bankrupt  Nonbankrupt  All
F1     -1.21     67.37        67.66     4.68      5009.88      5052.06   0.01      0.03         0.03
F2     8.44      57.2         57.42     70.07     3781         3811.32   32.53     18.39        18.24
F3     1.74      56.62        56.86     3.7       3334.42      3362.54   0.05      0.12         0.12
F4     -1.31     -56.97       -57.21    2.88      3364.21      3392.57   0.05      0.12         0.12
F5     5.79      73.49        73.62     39.36     6221.63      6252.71   1024.38   212.61       211.01
F6     1.88      95.43        95.83     5.37      9105.96      9182.96   0.18      54.83        54.37
F7     2.75      18.55        18.55     7.42      588.06       589.54    0.06      0.01         0.01
F8     -0.6      -0.82        -0.81     -0.86     1.35         1.33      0.03      0            0
F9     -2.93     0.02         0.02      13.62     868.96       873.54    0.01      0            0
F10    -7.08     -21.96       -21.79    54.69     870.26       859.66    136.17    11.15        11.11
F11    -1.54     -56.61       -56.85    2.54      3334.13      3362.24   0.05      0.12         0.12
F12    -6.68     -4.81        -4.82     50.52     374.18       373.74    0.09      0.01         0.01
F13    3.55      4.7          4.7       11.75     27.99        27.92     18.56     1.87         1.87
F14    6.99      35.87        36.02     53.81     1534.99      1547.93   0.14      0.36         0.36
F15    0.76      0.16         0.17      -0.55     8.85         8.78      0.03      0            0
F16    5.34      50.59        50.78     31.38     3078.55      3102.74   190.71    92.64        91.88
F17    1.47      51.46        51.62     2.39      3213.93      3236.38   0.02      0.01         0.01
F18    1.62      57           57.24     3.44      3366.66      3395.04   0.05      0.12         0.12
F19    5.96      20.87        20.39     41        664.74       636.28    943.81    47.19        47.48
F20    -0.73     81.95        82.29     4.53      7185.99      7246.62   0.02      0.05         0.05
F21    3.23      80.25        80.58     9.78      7049.44      7108.66   116.52    184.58       183.04
F22    1.67      3.62         3.59      1.81      14.77        14.48     563.38    28.14        28.36
F23    -0.07     59.87        60.12     7.54      4664.15      4703.16   39.46     57.01        56.54

Preliminary Study-PLSDA
The mixOmics package in R was used to produce the Original PLSDA result.

                        Predicted Bankruptcy   Predicted Nonbankruptcy   Total
Actual Bankruptcy                          7                        17      24
Actual Nonbankruptcy                     520                      3456    3976
Total                                    527                      3473    4000

Accuracy = 0.86575
Specificity = 0.869215
Sensitivity = 0.291667
AUC = 0.58
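
The reported accuracy, sensitivity and specificity can be checked from the confusion matrix with a few lines of R. In this sketch the actual-bankruptcy row is taken as 7 correctly predicted and 17 missed, which is the only split consistent with the row total of 24 and the column totals of 527 and 3473; the object names are chosen here for illustration.

```r
# Confusion matrix of the preliminary PLSDA result (rows = actual class)
conf <- matrix(c(7, 17,
                 520, 3456),
               nrow = 2, byrow = TRUE,
               dimnames = list(Actual = c("Bankruptcy", "Nonbankruptcy"),
                               Prediction = c("Bankruptcy", "Nonbankruptcy")))

TP <- conf[1, 1]; FN <- conf[1, 2]   # bankruptcy is the positive class
FP <- conf[2, 1]; TN <- conf[2, 2]

accuracy    <- (TP + TN) / sum(conf)   # share of all firms classified correctly
sensitivity <- TP / (TP + FN)          # share of bankrupt firms detected
specificity <- TN / (TN + FP)          # share of non-bankrupt firms detected
round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity), 6)
```

The high accuracy comes almost entirely from the non-bankrupt class; fewer than a third of the bankrupt firms are detected, which is the motivation for the resampling step.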

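R package coding
## Load mixOmics, which provides plsda() and its predict() method
library(mixOmics)
## Set training and testing samples by year, then drop the Year column
train <- subset(mydata, Year < 2009, select = -Year)
test <- subset(mydata, Year >= 2009, select = -Year)
## Columns 1 to 23 hold the financial ratios; Status is the class label
X <- train[, 1:23]
Y <- as.factor(train$Status)
X.test <- test[, 1:23]
Y.test <- test$Status
## Fit PLS-DA with two components
plsda.train <- plsda(X, Y, ncomp = 2)
## Classify the test set by Mahalanobis distance
## (newer mixOmics releases may name this argument `dist` instead of `method`)
test.predict <- predict(plsda.train, X.test, method = "mahalanobis.dist")
## Map the class indices from the 2-component model back to class labels
Prediction <- levels(Y)[test.predict$class$mahalanobis.dist[, 2]]
## Show actual and predicted status side by side
cbind(Y = as.character(Y.test), Prediction)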

Thank you
