
2019 International Conference on Electrical, Computer and Communication Engineering (ECCE), 7-9 February, 2019

Identification of Drop Out Students Using Educational Data Mining

Nafisa Tasnim, Dept. of Computer Science & Engineering, Rajshahi University of Engineering & Technology, Rajshahi, Bangladesh, nafisa.ruetcse16@gmail.com
Mahit Kumar Paul, Dept. of Computer Science & Engineering, Rajshahi University of Engineering & Technology, Rajshahi, Bangladesh, mahit.cse@gmail.com
A. H. M. Sarowar Sattar, Dept. of Computer Science & Engineering, Rajshahi University of Engineering & Technology, Rajshahi, Bangladesh, sarowar@gmail.com

Abstract—Education makes a human being steady, stable and prosperous in his way of leading life. In the same way, the number of higher educated persons in a country can contribute to the development of the country. However, this number decreases due to the dropout of students at an early stage of education. Furthermore, if a student cannot continue and drops out, the resources of a nation are attenuated. Although the rate of student dropout is diminishing nowadays, it is still a huge challenge for an educational institution to identify dropout students at the beginning. To address this issue, several approaches have been discussed in educational data mining to identify the rate of dropout students. Following this line, in this paper a threshold based approach has been proposed to identify dropout students that outperforms the existing approaches.

Index Terms—Educational Data Mining, threshold, classification, drop out, important features.
I. INTRODUCTION
Education is the most important factor for achieving long-term improvement in any sector of a country. It enlightens the individual and develops his or her capability to the limit. The main purpose of the education system is to build up each and every student's skills and the knowledge needed to reach a successful career pathway. This is the main target of most educational institutions. But the task is a huge challenge because a large number of students drop out every year due to various reasons. A vast amount of resources is wasted due to the dropping out of students, and it pulls the nation backward. As the rate of dropout students increases, the economic growth and development of a nation decreases. It is very difficult for educational institutions to analyze and find out the key reasons behind the dropping out of students by looking after each and every student; it is a costly procedure and needs a lot of manpower. In this case, data mining techniques can be used to predict the students who are at risk of dropout.

Data mining aims at discovering previously unknown, potentially useful and non-trivial knowledge from a huge amount of data [1]. Data mining techniques have both predictive and descriptive natures. Usually, supervised learning methods have a predictive nature and unsupervised learning methods have a descriptive nature [2]. By using the predictive nature of data mining techniques, one can identify the students who are at risk of dropping out. In a supervised learning method both the input and the target class are given, and an algorithm is used to learn the function so that one can easily predict the target variable when a new input is given. In this paper, a threshold value based approach has been presented where the threshold value is calculated by using the attributes of the datasets and their corresponding information gain values. By using the threshold value it can easily be identified whether a student is at risk of dropping out.

The remainder of the paper is organized as follows: in section II, the related work by many researchers is discussed. The proposed method is described in section III. In section IV, a brief introduction to the datasets and the metrics for performance analysis is given. The performance of the proposed approach is analyzed in section V, and in section VI the conclusion of our work is provided.

II. RELATED WORK

Identification of drop out students is a major task because without educating people, building up a developed country is not possible, and the educational departments and governments of countries are aware of this. Researchers have tried to invent new models for identifying students who are at risk of dropping out. Predictive, descriptive and also subgroup discovery algorithms have been utilized in this field to identify drop out students and the key factors responsible for dropping out. P. Cortez and A. Silva implemented different types of predictive and descriptive models and compared them [3]. In [4], A. Tamhane, S. Ikbal, B. Sengupta, M. Duggirala and J. Appleton reported on a large-scale study and used several classification algorithms for identifying at-risk students, such as Decision Tree, Decision Table, Logistic Regression, and Naive Bayes. In a recent work [5] the authors tried to find out who is at risk of not graduating on time, when, and why. They used Decision Tree, Logistic Regression, Stepwise Regression, and Cox Regression models for this task. In another work [6], the authors proposed new evaluation metrics to measure the goodness of machine learning algorithms from an educator's point of view. In that paper, the authors used different classification algorithms: Random Forest, AdaBoost, Linear Regression, Support Vector Machine, and Decision Tree.



Some traditional evaluation metrics such as ROC curves, precision and recall were used for evaluating the model. They identified the most heavily used features using Information Gain and the Gini Index, and Jaccard similarity metrics were used for comparing the various algorithms. In 2017, D. B. Fernandez and S. Lujan-Mora used three data mining tools (RapidMiner, Knime and Weka) for this task [7]. Some well-known machine learning techniques such as Random Forests, Logistic Regression, and Decision Trees have also been used to predict the performance of students [9]–[12].

Subgroup discovery is a significant data mining technique which has been widely used in educational data mining. The subgroup discovery technique discovers interesting associations among different variables [8]. Using this technique, the authors of [13] identified key factors with the methods SNS, DSSD, NMEEF-SD, BSD, SD-Map and Apriori-SD.
III. PROPOSED METHOD

In this paper, to predict drop out students, we have proposed a novel approach based on threshold value calculation. Threshold values are calculated using important features. These features are selected using Information Gain [14]: the higher the information gain, the more important the feature. According to the information gain, we have taken some features to calculate the threshold value.

For calculating the threshold value we have considered two properties of the dataset attributes: increasing and decreasing factors. An increasing factor refers to the ability of an attribute to increase the probability of a student graduating. For example, if a student's previous class test marks are better, then we can assume that s/he will do better or can continue the study. On the other hand, a decreasing factor is the opposite of an increasing factor: it decreases the probability of a student graduating. For example, if the absence rate of a student is higher, then the probability of drop out is also higher. Concretely, consider a feature whose values are split into two portions, one containing the lower values and the other containing the higher values. When the instances with lower values are the ones likely to drop out, the feature is considered an increasing factor; when the instances with higher values are the ones likely to drop out, the feature is considered a decreasing factor.
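As a minimal illustration of this labelling rule, the following Python sketch splits a feature at its median and checks which half carries the higher drop-out rate. The DataFrame and column names are hypothetical and not taken from the paper; the authors' experiments were carried out in MATLAB.

```python
import pandas as pd

def factor_type(df: pd.DataFrame, feature: str, dropout_col: str = "dropout") -> str:
    """Label a feature as an 'increasing' or 'decreasing' factor.

    The feature values are split at the median into a lower and an upper
    portion; the portion with the higher drop-out rate decides the label,
    following the splitting rule described above.
    """
    median = df[feature].median()
    lower = df[df[feature] <= median]
    upper = df[df[feature] > median]
    # dropout_col holds 1 for a drop-out student and 0 otherwise
    lower_rate = lower[dropout_col].mean()
    upper_rate = upper[dropout_col].mean() if len(upper) else 0.0
    # Lower values linked with drop out -> higher values favour graduation
    return "increasing" if lower_rate >= upper_rate else "decreasing"
```

Previous class test marks would typically come out "increasing" under this rule, while an absence count would come out "decreasing".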
Thus, considering the above two factors, the following equations can be formulated:

$\text{Probability of continuing study} \propto \text{increasing factors}$  (1)

$\text{Probability of continuing study} \propto \dfrac{1}{\text{decreasing factors}}$  (2)

Consequently, eq. 1 and eq. 2 can be merged into eq. 3 as follows:

$\text{Probability of continuing study} \propto \dfrac{\text{increasing factors}}{\text{decreasing factors}}$  (3)

If we substitute Probability of continuing study with Threshold, increasing factors with $a_i$, which represents the attributes having the increasing factor property, and decreasing factors with $b_i$, which represents the attributes having the decreasing factor property, then the equation takes the following form:

$\text{Threshold} \propto \dfrac{a_i}{b_i}$  (4)

Furthermore, eq. 4 can be written as eq. 5:

$\text{Threshold} = \dfrac{a_i \cdot w_{a_i}}{b_i \cdot w_{b_i}}$  (5)

where $w_{a_i}$ is the weight value for increasing factors and $w_{b_i}$ is the weight value for decreasing factors; $w_{a_i}$ is calculated as $a_i \cdot IG(a)$ and $w_{b_i}$ is calculated as $IG(b)$.

The threshold value is calculated in such a way that increasing factors are used as the numerator and decreasing factors as the denominator. When the numerator increases and the denominator decreases, the threshold value increases; the weight values are calculated so that this is the case. For students who are at risk of dropping out, however, the values of the increasing factors are smaller and the values of the decreasing factors are larger, and a larger denominator combined with a smaller numerator creates a smaller threshold value. This results in a smaller threshold value for a drop out student and a larger threshold value for a student who can continue his or her study. The method is self-organised: it can adjust the threshold value by updating it. After calculating the threshold value, if the threshold of a new pattern is less than the given threshold, it is classified as drop out; otherwise, it is classified as continuing study.
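A minimal Python sketch of this scoring and classification step is given below. It is only an illustration under stated assumptions: eq. (5) is written per attribute pair, so summing over the selected attributes is an assumption, and the midpoint rule used to learn the decision threshold is also an assumption, since the paper only states that the threshold is adjusted in a self-organised way.

```python
import numpy as np

def student_score(x, inc_idx, dec_idx, ig, eps=1e-9):
    """Threshold value of one student in the spirit of eq. (5):
    information-gain weighted increasing factors over weighted
    decreasing factors.  Summation over attributes is assumed."""
    num = sum(x[i] * ig[i] for i in inc_idx)
    den = sum(x[j] * ig[j] for j in dec_idx)
    return num / (den + eps)

def fit_threshold(X, y, inc_idx, dec_idx, ig):
    """Learn a decision threshold from training data (y == 1 for
    continuing students, y == 0 for drop out).  The midpoint between
    the two class means is an assumed update rule."""
    scores = np.array([student_score(x, inc_idx, dec_idx, ig) for x in X])
    return 0.5 * (scores[y == 1].mean() + scores[y == 0].mean())

def predict(x, threshold, inc_idx, dec_idx, ig):
    """A new pattern scoring below the learned threshold is flagged as drop out."""
    return "drop out" if student_score(x, inc_idx, dec_idx, ig) < threshold else "continue"
```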
IV. IMPLEMENTATION

A. Dataset Description

The dataset "Student Performance Analysis" has been taken from the renowned UCI Machine Learning Repository [3]. Two datasets are integrated in the "Student Performance Analysis" dataset: one has 395 instances and the other has 649 instances. Both datasets have 33 attributes.

TABLE I
DATASET DESCRIPTION

Dataset   Dataset Characteristics   Attribute Characteristics   Number of Attributes   Number of Instances
A         Multivariate              Integer                     33                     395
B         Multivariate              Integer                     33                     649

In Dataset A, the number of positive class instances is 293 and the number of negative class instances is 102. In Dataset B, there are 584 positive instances and 65 negative instances. The Imbalance Ratio (IR) is calculated as in eq. 6 [15]:

$IR = \dfrac{T_{maj}}{T_{min}}$  (6)

where $IR$ is the Imbalance Ratio, $T_{maj}$ is the number of positive class instances and $T_{min}$ is the number of negative class instances. For Dataset A the IR is 2.87 and for Dataset B the IR is 8.98.
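The reported ratios follow directly from the class counts above:

```python
# Imbalance Ratio (eq. 6): majority (continuing-study) instances
# over minority (drop-out) instances.
ir_a = 293 / 102   # Dataset A
ir_b = 584 / 65    # Dataset B
print(f"IR(A) = {ir_a:.2f}, IR(B) = {ir_b:.2f}")   # IR(A) = 2.87, IR(B) = 8.98
```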
B. Feature Importance

To identify the drop out students, the most informative features should be considered. By extracting the most important features, the model can be built utilizing only those features. In our proposed approach, Information Gain (IG) has been used to extract the features. It is based on Claude Shannon's entropy [16]. The attribute with the highest information gain is referred to as the most important feature. The information gain can be calculated as [14]:

$IG(t) = -\sum_{i=1}^{|c|} P(c_i)\log P(c_i) + P(t)\,P(c_i \mid t)\log P(c_i \mid t) + P(1-t)\,P(c_i \mid (1-t))\log P(c_i \mid (1-t))$  (7)

where $c_i$ represents the i-th category, $P(c_i)$ is the probability of the i-th category, $P(t)$ is the probability that term $t$ appears in the documents, $P(1-t)$ is the probability that $t$ does not appear in the documents, $P(c_i \mid t)$ is the conditional probability of $c_i$ given that $t$ appears, and $P(c_i \mid (1-t))$ is the conditional probability of $c_i$ given that $t$ does not appear.

Using eq. 7, the most important features can be found. Using these most important features and ignoring the less important ones, the dimensionality of the datasets can be reduced. This is worth addressing, as high dimensional data is one of the most important concerns in classification problems because of space and time complexity [20].
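To make the feature-ranking step concrete, the following numpy sketch computes the information gain of a binarised attribute against the class labels. It is a minimal illustration in the spirit of eq. (7), not the authors' implementation.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def information_gain(t, c):
    """Information gain of a binary feature t (1 = present/high, 0 = absent/low)
    with respect to class labels c:  IG = H(C) - P(t) H(C|t) - P(1-t) H(C|1-t)."""
    t, c = np.asarray(t), np.asarray(c)
    classes = np.unique(c)
    ig = entropy(np.array([(c == k).mean() for k in classes]))  # H(C)
    for t_val in (1, 0):
        mask = (t == t_val)
        if not mask.any():
            continue
        p_tv = mask.mean()                                       # P(t) or P(1-t)
        p_c_given_t = np.array([(c[mask] == k).mean() for k in classes])
        ig -= p_tv * entropy(p_c_given_t)                        # subtract P(.) H(C|.)
    return ig
```

A continuous attribute such as an absence count would first be discretised, for example split at its median as in the earlier sketch, before its information gain is computed this way.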
C. Outlier Detection

An outlier is a data object that deviates from the other data objects. It can originate from mechanical errors, changes in system behavior, fraudulent behavior, human error, instrument error, or simply from natural deviations in a population. Outlier detection is the method of finding the patterns in data whose behavior differs from the expected data patterns. Misclassification occurs in the presence of outliers and creates errors in identifying the required class. That is why detecting outliers is important for proper classification. In this paper, Cook's distance [21] has been used for outlier detection.
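As an illustration of how Cook's distance can be obtained, the following numpy sketch computes it from an ordinary least-squares fit. The choice of regression model and the cut-off are assumptions; the paper does not specify either.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for each observation of an ordinary least-squares fit [21].
    X: (n, p) design matrix (include a column of ones for the intercept),
    y: (n,) response."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    # Leverage values h_ii = diag( X (X'X)^{-1} X' )
    h = np.einsum("ij,jk,ik->i", X, np.linalg.pinv(X.T @ X), X)
    s2 = resid @ resid / (n - p)                        # residual variance
    return (resid ** 2 / (p * s2)) * h / (1.0 - h) ** 2

# A common rule of thumb flags observations with D_i > 4 / n as potential outliers.
```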

D. Experimental Setup

In our work, k-fold cross validation, specifically 5-fold cross validation, is used for evaluating the performance. First, the dataset is divided into 5 splits. Then, among the 5 splits, 4 splits are used as training data and 1 split is used as test data, and the method is run 5 times so that no data remains untested.

The algorithms are executed in MATLAB R2017a. We have performed all the experiments on a computer with an Intel(R) Core(TM) i5-3230M CPU @ 2.60GHz processor, 4 GB of RAM and a 64-bit operating system.
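A minimal sketch of this protocol in Python (the paper's experiments were in MATLAB, so this is illustrative only; `evaluate` stands for any hypothetical function that fits on the training part and returns a score on the test part, such as the threshold sketch given earlier):

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(X, y, evaluate, n_splits=5, seed=0):
    """5-fold protocol described above: each split serves exactly once as
    the test set while the remaining four splits form the training set."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = [evaluate(X[tr], y[tr], X[te], y[te]) for tr, te in kf.split(X)]
    return float(np.mean(scores)), scores
```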
E. Evaluation Metrics

To assess the performance of our work, different metrics such as accuracy, precision [17], recall [17], F1-score [18] and AUC [19] have been used. Accuracy, precision, recall and F1-score can be represented as follows:

$Accuracy = \dfrac{TP + TN}{All}$  (8)

$Precision = \dfrac{TP}{TP + FP}$  (9)

$Recall = \dfrac{TP}{TP + FN}$  (10)

$F1\text{-}score = \dfrac{2 \cdot Precision \cdot Recall}{Precision + Recall}$  (11)

where TP (True Positive) is the number of students who can continue their study and are correctly classified, FP (False Positive) is the number of drop out students who are wrongly classified as students who can continue studying, FN (False Negative) is the number of students who can continue their study but are wrongly classified as drop out students, and TN (True Negative) is the number of drop out students who are correctly classified.
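The four metrics follow directly from the confusion counts, as in this short sketch (the example counts are illustrative, not taken from the paper):

```python
def scores_from_counts(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1-score from the confusion counts,
    following eqs. (8)-(11)."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Example with illustrative counts:
print(scores_from_counts(tp=280, fp=15, fn=13, tn=90))
```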

V. PERFORMANCE ANALYSIS

Evaluation criteria are a key factor in measuring classification performance. Our goal is to assess the proposed approach on the task of predicting whether a student can complete his or her graduation on time. Binary classification has been considered in our work. Fig. 1 presents a bar graph of classification accuracy comparing the four methods: Logistic Regression, the Naive Bayes classifier, Support Vector Machine and the proposed approach. The accuracy of the proposed approach is better than that of the classifier algorithms.

Fig. 1. Accuracy for implemented methods (Original Datasets)

On an imbalanced dataset, accuracy does not distinguish between the numbers of appropriately classified patterns of the dissimilar classes. For this reason, other well-known metrics are also used for evaluating the performance.

Table II summarizes the precision, recall and F1-score for the methods. The table shows that the proposed approach has better performance in terms of precision, recall and F1-score. Fig. 2 shows the area under the ROC curves (original datasets); the AUC for the proposed approach is greater than the AUC for the other classifiers.

TABLE II
EVALUATING PERFORMANCE FOR THE IMPLEMENTED METHODS (ORIGINAL DATASETS)

Dataset A
Metric     Threshold  Logistic Regression  Naive Bayes  Support Vector Machine
Precision  0.9514     0.9420               0.8927       0.9450
Recall     0.9444     0.9366               0.9087       0.9274
F1-score   0.9476     0.9383               0.9005       0.9357

Dataset B
Metric     Threshold  Logistic Regression  Naive Bayes  Support Vector Machine
Precision  0.9646     0.9628               0.8841       0.9753
Recall     0.9662     0.9588               0.9590       0.9137
F1-score   0.9651     0.9607               0.9186       0.9541

Fig. 2. ROC Curves (Original Datasets)

For the datasets obtained after outlier detection, it is also observed that the proposed approach performs well in terms of accuracy. Fig. 3 presents a bar graph comparing accuracy among the implemented methods after detecting outliers.

Fig. 3. Accuracy for implemented methods (After Detecting Outliers)

Table III presents the precision, recall and F1-score after detecting outliers in both datasets. From the table it is observed that the precision and F1-score of the proposed approach are nearly the same as those of Logistic Regression, the Naive Bayes classifier and Support Vector Machine.

TABLE III
EVALUATING PERFORMANCE FOR THE IMPLEMENTED METHODS (AFTER DETECTING OUTLIERS)

Dataset A
Metric     Threshold  Logistic Regression  Naive Bayes  Support Vector Machine
Precision  0.9645     0.9639               0.9358       0.9788
Recall     0.9527     0.9570               0.9366       0.8894
F1-score   0.9595     0.9604               0.9360       0.9313

Dataset B
Metric     Threshold  Logistic Regression  Naive Bayes  Support Vector Machine
Precision  0.9845     1.0000               0.9763       1.0000
Recall     0.9931     0.9894               1.0000       0.9845
F1-score   0.9885     0.9946               0.9879       0.9920

Fig. 4 shows the area under the ROC curves for the datasets after outlier detection. From Fig. 4, it is found that the area under the curve for the proposed approach is also larger for these datasets.

Fig. 4. ROC Curves (After Detecting Outliers)

Here, outlier detection is used to show that the other classifiers perform better after removing outliers. The proposed method, however, performs well even when outliers remain: it can handle outliers and still identify whether a student is at risk of dropping out.
VI. CONCLUSION

In this paper, a threshold based approach has been described. To extract the important features, one only needs the attribute values and their corresponding information gain. From the extracted features the threshold value can be calculated. Once the threshold value is calculated, no classifier is needed for classifying a new pattern: only the threshold value of the new pattern has to be calculated and compared with the previously calculated threshold value. If it is less than the calculated threshold value, the student is at risk of dropping out. In the performance analysis section, performance has been shown in two stages: for the original datasets and after detecting outliers. Both for the original datasets and for the datasets after outlier detection, our proposed approach works better. The work in this paper is limited to applying the method to the original datasets and to the datasets after outlier detection. In future, the imbalance of the datasets can be considered, and removal of the imbalance can enhance the performance of the proposed method.

REFERENCES

[1] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, "Knowledge discovery and data mining: Towards a unifying framework," in Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD), Aug. 1996, pp. 82-88.
[2] C. J. Carmona, P. Gonzales, M. J. Jesus, and F. Herrera, "NMEEF-SD: Non-dominated multi-objective evolutionary algorithm for extracting fuzzy rules in subgroup discovery," in IEEE International Conference on Fuzzy Systems, pp. 1706-1711, 2010.
[3] P. Cortez and A. Silva, "Using data mining to predict secondary school student performance," in Proceedings of the 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008), pp. 5-12, Porto, Portugal, EUROSIS, ISBN 978-9077381-39-7, April 2008.
[4] A. Tamhane, S. Ikbal, B. Sengupta, M. Duggirala, and J. Appleton, "Predicting student risks through longitudinal analysis," in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1544-1552, New York, NY, USA, August 24-27, 2014.
[5] E. Aguiar, H. Lakkaraju, N. Bhanpuri, D. Miller, B. Yuhas, and K. L. Addison, "Who, when, and why: A machine learning approach to prioritizing students at risk of not graduating high school on time," in Proceedings of the Fifth International Conference on Learning Analytics and Knowledge, pp. 93-102, Poughkeepsie, New York, March 16-20, 2015.
[6] H. Lakkaraju, E. Aguiar, C. Shan, D. Miller, N. Bhanpuri, R. Ghani, and K. L. Addison, "A machine learning framework to identify students at risk of adverse academic outcomes," in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1909-1918, Sydney, NSW, Australia, August 10-13, 2015.
[7] D. B. Fernandez and S. Lujan-Mora, "Comparison of applications for educational data mining in engineering education," in IEEE World Engineering Education Conference (EDUNINE), March 19-22, 2017.
[8] S. Helal, "Subgroup discovery algorithms: A survey and empirical evaluation," Journal of Computer Science and Technology, vol. 31, no. 3, pp. 561-576, May 2016.
[9] E. Aguiar, G. A. Ambrose, N. V. Chawla, V. Goodrich, and J. Brockman, "Engagement vs performance: Using electronic portfolios to predict first semester engineering student persistence," Journal of Learning Analytics, vol. 1, no. 3, 2014.
[10] G. M. Dekker, M. Pechenizkiy, and J. M. Vleeshouwers, "Predicting students drop out: A case study," in International Working Group on Educational Data Mining, 2009.
[11] K. Pittman, "Comparison of data mining techniques used to predict student retention," ProQuest, 2008.
[12] S. K. Yadav, B. Bharadwaj, and S. Pal, "Data mining applications: A comparative study for predicting student's performance," arXiv preprint arXiv:1202.4815, 2012.
[13] S. Helal, J. Li, L. Liu, E. Ebrahime, S. Dawson, and D. J. Murray, "Identifying key factors of student academic performance by subgroup discovery," Journal of Computer Science and Technology, vol. 31, no. 3, pp. 561-576, May 2016.
[14] H. Uguz, "A two-stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm," Knowledge-Based Systems, vol. 24, no. 7, pp. 1024-1032, October 2011.
[15] B. Pal and M. K. Paul, "A Gaussian mixture based boosted classification scheme for imbalanced and oversampled data," in 1st International Conference on Electrical, Computer and Communication Engineering (ECCE), 2017.
[16] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, 3rd ed., Elsevier, 2012, pp. 336-340.
[17] T. Fawcett, "An introduction to ROC analysis," Pattern Recognition Letters, vol. 27, no. 8, pp. 861-874, 2006.
[18] Y. Yang and X. Liu, "A re-examination of text categorization methods," in Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 42-49, 1999.
[19] A. P. Bradley, "The use of the area under the ROC curve in the evaluation of machine learning algorithms," Pattern Recognition, vol. 30, no. 7, pp. 1145-1159, 1997.
[20] A. K. Uysal, "A novel probabilistic feature selection method for text classification," Knowledge-Based Systems, vol. 36, pp. 226-235, December 2012.
[21] J. A. Diaz-Garcia and G. Gonzalez-Farias, "A note on the Cook's distance," Journal of Statistical Planning and Inference, vol. 120, no. 1-2, pp. 119-136.
