Abstract—Education makes a person steady, stable, and prosperous in life. Likewise, the number of highly educated people in a country contributes to that country's development. However, this number decreases when students drop out at an early stage of their education. Furthermore, when a student cannot continue or drops out, the nation's resources are diminished. Although the dropout rate is declining, it remains a huge challenge for an educational institution to identify dropout students early. To address this issue, several approaches for identifying dropout students have been discussed in educational data mining. Following this line of work, this paper proposes a threshold-based approach for identifying dropout students that outperforms the existing approaches.
Index Terms—Educational Data Mining, threshold, classification, drop out, important features.

I. INTRODUCTION
Education is a key factor in achieving long-term improvement in any sector of a country. It enlightens individuals and develops their capabilities to the fullest. The main purpose of the education system is to build the skills and knowledge that every student needs for a successful career, and this is the main goal of most educational institutions. The task is a huge challenge, however, because a large number of students drop out every year for various reasons. Vast resources are wasted when students drop out, which holds the nation back: as the dropout rate increases, the economic growth and development of a nation decrease. It is very difficult for educational institutions to analyze and find the key reasons behind dropout by monitoring each student individually; it is a costly procedure that needs a lot of manpower. In this case, data mining techniques can be used to predict which students are at risk of dropping out.
Data mining aims at discovering previously unknown, potentially useful, and non-trivial knowledge from a huge amount of data [1]. Data mining techniques are both predictive and descriptive in nature. Usually, supervised learning methods are predictive and unsupervised learning methods are descriptive [2]. By using the predictive nature of data mining techniques, one can identify the students who are at risk of dropping out. In a supervised learning method, both the input and the target class are given, and an algorithm is used to learn the function so that the target variable can easily be predicted when a new input is given. In this paper, a threshold-based approach is presented in which the threshold value is calculated from the attributes of the datasets and their corresponding information gain values. Using the threshold value, it can easily be identified whether a student is at risk of dropping out.
The rest of the paper is organized as follows. Section II discusses related work. Section III describes the proposed method. Section IV gives a brief introduction to the datasets and the metrics used for performance analysis. Section V analyzes the performance of the proposed approach, and Section VI concludes our work.

II. RELATED WORK
Identifying dropout students is a major task because a developed country cannot be built without educating its people. For this reason, the education departments and governments of countries are attentive to the problem. Researchers have tried to devise new models for identifying students who are at risk of dropping out. Predictive, descriptive, and subgroup discovery algorithms have been used in this field to identify dropout students and the key factors responsible for dropout. P. Cortez and A. Silva implemented different types of predictive and descriptive models and compared them [3]. In [4], A. Tamhane, S. Ikbal, B. Sengupta, M. Duggirala, and J. Appleton reported on a large-scale study and used several classification algorithms to identify at-risk students, such as Decision Tree, Decision Table, Logistic Regression, and Naive Bayes. In a recent work [5], the authors tried to find out who is at risk of not graduating on time, when, and why. They used Decision Tree, Logistic Regression, Stepwise Regression, and Cox Regression models for this task. In another work [6], the authors proposed new evaluation metrics to measure the goodness of machine learning algorithms from an educator's point of view. In that paper, the authors used different classification algorithms: Random Forest, AdaBoost, Linear Regression, Support Vector Machine, and Decision Tree. Some traditional evaluation metrics such as
$$\text{Probability of continuing study} \propto \frac{1}{\text{decreasing factors}} \tag{2}$$

Consequently, eq. 1 and eq. 2 can be merged into eq. 3 as follows:

$$\text{Probability of continuing study} \propto \frac{\text{increasing factors}}{\text{decreasing factors}} \tag{3}$$

In Dataset A, the number of positive class instances is 293 and the number of negative class instances is 102. In Dataset B, there are 584 positive instances and 65 negative instances. The Imbalance Ratio (IR) is 2.87 for Dataset A and 8.98 for Dataset B. IR is calculated as in eq. 6 [15]:

$$IR = \frac{T_{maj}}{T_{min}} \tag{6}$$

where $IR$ is the Imbalance Ratio, $T_{maj}$ is the number of positive class instances, and $T_{min}$ is the number of negative class instances.

B. Feature Importance
To identify dropout students, the most important features should be considered. Once the most important features have been extracted, the model can be built using those features. In our proposed approach, Information Gain (IG) has been used to extract the features. It is based on Claude Shannon's entropy [16]. The attribute with the highest information gain is regarded as the most important feature. The information gain can be calculated as [14]:

$$\begin{aligned}
IG(t) = {} & -\sum_{i=1}^{|c|} P(c_i)\log P(c_i) \\
& + P(t)\sum_{i=1}^{|c|} P(c_i \mid t)\log P(c_i \mid t) \\
& + P(1-t)\sum_{i=1}^{|c|} P(c_i \mid (1-t))\log P(c_i \mid (1-t))
\end{aligned} \tag{7}$$

where $c_i$ represents the $i$th category, $P(c_i)$ is the probability of the $i$th category, $P(t)$ is the probability that $t$ appears in the documents, $P(1-t)$ is the probability that $t$ does not appear in the documents, $P(c_i \mid t)$ is the conditional probability of $c_i$ given that $t$ appears, and $P(c_i \mid (1-t))$ is the conditional probability of $c_i$ given that $t$ does not appear.
Using eq. 7, the most important features can be found. By using these features and ignoring the less important ones, the dimensionality of the datasets can be reduced. This matters because high-dimensional data is one of the most important concerns in classification problems, owing to its space and time complexity [20].

C. Outlier Detection
An outlier is a data object that deviates from the other data objects. Outliers originate from mechanical errors, changes in system activity, fraudulent behavior, human error, instrument error, or simply natural deviations in populations. Outlier detection is the process of finding the patterns in data whose behavior differs from the expected data patterns. Outliers cause misclassification and create errors in identifying the required class. That is why detecting outliers is important for proper classification. In this paper, Cook's distance [21] has been used for outlier detection.

E. Evaluation Metrics
To assess the performance of our work, different metrics such as accuracy, precision [17], recall [17], F1-score [18], and AUC [19] have been used. Accuracy, precision, recall, and F1-score can be represented as follows:

$$Accuracy = \frac{TP + TN}{All} \tag{8}$$

$$Precision = \frac{TP}{TP + FP} \tag{9}$$

$$Recall = \frac{TP}{TP + FN} \tag{10}$$

$$F1\text{-}score = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} \tag{11}$$

where TP (True Positive) is the number of students who can continue studying and are correctly classified, FP (False Positive) is the number of dropout students wrongly classified as students who can continue studying, FN (False Negative) is the number of students who can continue studying but are wrongly classified as dropout students, TN (True Negative) is the number of dropout students who are correctly classified, and All is the total number of classified instances (TP + TN + FP + FN).

V. PERFORMANCE ANALYSIS
Evaluation criteria are a key factor in measuring classification performance. Our goal is to assess the performance of the proposed approach on the task of predicting whether a student can graduate on time. Binary classification has been considered in our work. Fig. 1 shows a bar graph of classification accuracy comparing the four methods: Logistic Regression, Naive Bayes Classifier, Support Vector Machine, and the proposed approach. The accuracy of the proposed approach is better than that of the other classification algorithms.

Fig. 1. Accuracy for implemented methods (Original Datasets)
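As a minimal sketch of how accuracy, precision, recall, and F1-score (eqs. 8-11) follow from confusion-matrix counts, the Python snippet below may be helpful. The class name `ConfusionCounts`, the function names, and the example counts are illustrative assumptions; none of them are taken from the paper or its datasets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Confusion-matrix counts; 'positive' means the student continues studying."""
    tp: int  # continuing students correctly classified as continuing
    fp: int  # dropout students wrongly classified as continuing
    fn: int  # continuing students wrongly classified as dropouts
    tn: int  # dropout students correctly classified as dropouts


def accuracy(c: ConfusionCounts) -> float:
    """Eq. 8: correctly classified instances over all instances."""
    total = c.tp + c.tn + c.fp + c.fn
    return (c.tp + c.tn) / total if total else 0.0


def precision(c: ConfusionCounts) -> float:
    """Eq. 9: TP / (TP + FP)."""
    denom = c.tp + c.fp
    return c.tp / denom if denom else 0.0


def recall(c: ConfusionCounts) -> float:
    """Eq. 10: TP / (TP + FN)."""
    denom = c.tp + c.fn
    return c.tp / denom if denom else 0.0


def f1_score(c: ConfusionCounts) -> float:
    """Eq. 11: harmonic mean of precision and recall."""
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts, chosen only for illustration (not from the paper).
example = ConfusionCounts(tp=90, fp=10, fn=5, tn=20)

if __name__ == "__main__":
    print(f"accuracy={accuracy(example):.4f}")
    print(f"precision={precision(example):.4f}")
    print(f"recall={recall(example):.4f}")
    print(f"f1={f1_score(example):.4f}")
```

The zero-denominator guards are a design choice of this sketch, returning 0.0 when a metric is undefined; the paper does not say how such cases are handled.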
Dataset A
Evaluation Metrics | Threshold | Logistic Regression | Naive Bayes | Support Vector Machine
Precision          | 0.9514    | 0.9420              | 0.8927      | 0.9450
Recall             | 0.9444    | 0.9366              | 0.9087      | 0.9274
F1-score           | 0.9476    | 0.9383              | 0.9005      | 0.9357

Dataset B
Evaluation Metrics | Threshold | Logistic Regression | Naive Bayes | Support Vector Machine
Precision          | 0.9646    | 0.9628              | 0.8841      | 0.9753
Recall             | 0.9662    | 0.9588              | 0.9590      | 0.9137
F1-score           | 0.9651    | 0.9607              | 0.9186      | 0.9541

Fig. 3. Accuracy for implemented method (After Detecting Outliers)
TABLE III
EVALUATING PERFORMANCE FOR THE IMPLEMENTED METHODS (AFTER DETECTING OUTLIERS)
Dataset A
Evaluation Metrics | Threshold | Logistic Regression | Naive Bayes | Support Vector Machine
Precision          | 0.9645    | 0.9639              | 0.9358      | 0.9788
Recall             | 0.9527    | 0.9570              | 0.9366      | 0.8894
F1-score           | 0.9595    | 0.9604              | 0.9360      | 0.9313

Dataset B
Evaluation Metrics | Threshold | Logistic Regression | Naive Bayes | Support Vector Machine
Precision          | 0.9845    | 1.0000              | 0.9763      | 1.0000
Recall             | 0.9931    | 0.9894              | 1.0000      | 0.9845
F1-score           | 0.9885    | 0.9946              | 0.9879      | 0.9920
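
As a quick consistency check on Table III, the F1-score can be recomputed as the harmonic mean of the tabulated precision and recall (eq. 11). The Python sketch below does this; the dictionary layout and the function names are assumptions made for illustration. Small gaps are to be expected from rounding, or if the reported scores are averages over several runs, which this excerpt does not state.

```python
def harmonic_f1(precision: float, recall: float) -> float:
    """Eq. 11: F1 as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from Table III.
TABLE_III = {
    ("A", "Threshold"): (0.9645, 0.9527, 0.9595),
    ("A", "Logistic Regression"): (0.9639, 0.9570, 0.9604),
    ("A", "Naive Bayes"): (0.9358, 0.9366, 0.9360),
    ("A", "Support Vector Machine"): (0.9788, 0.8894, 0.9313),
    ("B", "Threshold"): (0.9845, 0.9931, 0.9885),
    ("B", "Logistic Regression"): (1.0000, 0.9894, 0.9946),
    ("B", "Naive Bayes"): (0.9763, 1.0000, 0.9879),
    ("B", "Support Vector Machine"): (1.0000, 0.9845, 0.9920),
}


def max_gap(rows: dict) -> float:
    """Largest absolute difference between recomputed and reported F1."""
    return max(abs(harmonic_f1(p, r) - f1) for p, r, f1 in rows.values())


if __name__ == "__main__":
    for (dataset, method), (p, r, f1) in TABLE_III.items():
        print(f"Dataset {dataset}, {method}: "
              f"recomputed={harmonic_f1(p, r):.4f}, reported={f1:.4f}")
    print(f"largest gap: {max_gap(TABLE_III):.4f}")
```

For the values in Table III, the recomputed and reported F1-scores agree to within 0.001.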