Professional Documents
Culture Documents
doi: http://dx.doi.org/10.16925/in.v13i21.1729
E-mail: smerchanr@unbosque.edu.co
2
M.Sc. Information and Communication Sciences. Teaching Researcher, gitis Research Group
3
Student, osiris Research Group
Universidad El Bosque, Bogotá, Colombia
Received date: August 16, 2016 Accepted date: December 10, 2017
How to cite this article: S. Merchan-Rubiano, A. Beltrán-Gómez, J. Duarte-García, “Engineering Students Academic
Performance Prediction usign icfes Test Scores and Demografic Data”, Ingeniería Solidaría, vol. 13, n.° 21, pp. 53-61, January
2017. doi: http://dx.doi.org/10.16925/in.v13i21.1729
Abstract. Introduction: This paper is part of a research project that aims to cons-
truct a predictive model for students’ academic performance, as result of an ite-
rative process of experimentation and evaluation of the pertinence of some data
mining techniques. Methodology: This paper was written in 2016 in the Univer-
sidad El Bosque, Bogotá, Colombia, and presents a comparative analysis of the
performance and relevance of the J48 and Random Forest algorithms, in order to
identify the most influential demographic and icfes score variables, as well as the
classification rules, to predict the first year academic performance of the Enginee-
ring Faculty students, in Universidad El Bosque, Bogotá, Colombia. Results: The
analysis process was carried out on 7,644 students’ records, and it was developed
in two phases. Firstly, the data needed to feed the mining process was extracted
and prepared. Secondly, the data mining process itself was implemented through
preprocessing data and executing the classification algorithms available in Weka.
Some significant variables and rules to predict academic performance are found,
according to the studied population characteristics. Conclusions: The academic risk
seen as the cause of the desertion phenomenon must be studied as a phenomenon
itself. Establishing its causes facilitates the creation of preventive strategies for the
accompaniment of students through their process, aimed to mitigate the risk of
both phenomena.
BY NC ND
BY NC ND
Engineering Students’ Academic Performance Prediction 55
In the Colombian university context, student deser- All the available Engineering Programs students’
tion, understood as an educative phenomenon, has demographic and icfes scores data were extracted
been studied in order to identify which academic from the University’s information systems data-
and demographic variables are the most influencial. bases, from 2005 to 2015. A database with 7,644
National government studies [1] have found that the students’ individual records was obtained. Table
most influential demographic and social variables 1 shows the extracted attributes directly from the
are owning a home, the family’s income level, num- database, sorted by category.
ber of siblings, the student’s mother education, age,
gender and job status. The most influential acade- Table 1. Defined instances and attributes
mic variables are repetition rate (as evidence of low Students’ academic data Students’ demographic
performance), career choice and the icfes test sco- attributes data attributes
res1 [2]. In those studies and in others conducted by Average icfes test score,
some Colombian universities [3], it has been found icfes subject test scores
Date of birth, gender,
that low academic performance in the first year of (biology, math, philosophy,
residence distance to
physics, history, chemistry,
studies is one of the most important factors that language, geography and
University (calculated from
impacts student desertion [4-7]. Nonetheless, aca- home address), city of
English –Test Type 1–,
origin, social class (ranked
demic performance has not been studied as a phe- critical lecture, social and
1-6 according to national
nomenon itself. Taking into account other research civic competences, natural
legislation), marital status,
science, quantitative
projects in the international context [8-11], this reasoning –Test Type 2–),
job status, whether the
paper is part of a research project [12-14] where data student registers a mother,
students’ high school, year
mother’s education, moth-
mining techniques are used in order to predict aca- and period of enrollment,
er’s job, mother’s city of
demic performance, identifying its causes based on whether the student has re-
origin, whether the student
ceived academic incentives,
the demographic and icfes test scores information registers a father, father’s
class schedule type (day
collected during each student’s enrollment. This job, father’s education,
or evening classes), first
father’s city of origin, num-
work presents the predictive models for academic semester grades (6 sub-
ber of siblings.
performance resulted from J48 and Random Forest jects) and second semester
classification algorithms, and a comparative analy- grades (6 subjects)
sis of their performance. Compiled by the authors
• All attributes with empty or “dirty” records were Table 2. Final data subsets to construct the models
removed, including whether the student recei- Type 1 Test Type 2 Test
ved academic incentives and the name of their Valid instances 4,274 386
high school, among others. Definitive attributes 29 27
• All records with no icfes test score data were
Demographic attributes 15 15
removed.
Academic attributes 14 12
• The total dataset was split in two subsets accor-
ding to the type of icfes test taken by the stu- Source: Compiled by the authors
dents at the end of their basic education (the first
was composed by subjects and applied before the In the last step of this phase, the prediction
second semester of 2014, and the second is com- problem to be solved was modeled, identifying the
posed by competencies and has been applied af- input variables, the data mining algorithms and
ter the second semester of 2014). Table 2 shows the output variables. Figure 1 shows de prediction
the final data subsets. problem to resolve:
To select the data mining algorithms, a pre- [15] could have a better accuracy for the model
liminary experiment was perfomed, randomly construction. Table 3 summarizes the results of
splitting the total dataset into four subsets and iden- this experiment.
tifying which of the algorithms provided by Weka
The RemoveMisClassified preprocessing fil- performance class), the Re-sampling filter was
ter was applied for removing attributes that don’t applied, which oversamples the minority class and
have relevant influence in the prediction, bearing undersamples the majority class, obtaining a more
in mind that the purpose of the study is to iden- balanced distribution of data.
tify the most influencing attributes and the rela-
tionships between them. Taking into account that
the obtained accuracy rates for predicting acade- 3. Results and Analysis
mic performance vary significantly between output
classes, as a result of the imbalanced dataset (most Finally, the algorithms were executed resulting in
of the instances belonged to the medium academic the classification accuracy rates shown in table 5.
Observing the accuracy rates obtained by both Regarding the academic performance predic-
classifier algorithms after preprocessing, it is pos- tion for the group of students who took the icfes
sible to state that academic risk performance (or test after 2014’s second semester, although the
low performance) and average performance (me- model was constructed with a relatively small num-
dium performance) prediction for those students ber of instances, high accuracy rates were obtained
who took the Type 1 icfes test is more accurate using for the three types of performance with the J48
the J48 algorithm and the RemoveMisClassified filter. algorithm. This allows for a high confidence in the
However, the results obtained by the resampled data- rules obtained after preprocessing. Table 6 shows
set cannot be completely discarded, because when predictions made by both algorithms for the Type
data is collected (as a student enrolls), the removed 1 icfes test dataset, and table 7 shows predictions
attributes remain and the accuracy rates are accep- made for the Type 2 icfes test dataset, in the form
table. Regarding high academic performance, the of confusion matrices.
algorithm with the best results was Random Forest,
which could be a consequence of the level of cleaning
achieved in the data subset for this class.
From the above confusion matrices and accu- algorithm, after removing the misclassified instances,
racy rates, it could be said that reliable predictive for the Type 1 icfes test dataset (first row) and for the
models were reached for both data subsets. Although Type 2 icfes test dataset (second row). The most rele-
good accuracy rates were obtained applying the res- vant rules to predict student’s academic performance
ampling filter, the best results were obtained just by for the Engineering Faculty at Universidad El Bosque
removing the misclassified instances. were determined, based on demographic and aca-
Table 8 shows the most important rules obtained demic enrollment data.
with the best accuracy rates, corresponding to the J48
Engineering Students’ Academic Performance Prediction 59
Table 8. Rules obtained by J48 algorithm for the Type 1 icfes test dataset
In relation with academic risk prediction for favored. The rules found to predict academic per-
the first dataset, this one impacts more on students formance for those students that took the Type 2
with a icfes test general score below 46.8, physics test, indicate that the most influential variables are
and chemistry scores below 44 and 54 respectively. the icfes test general score, gender, daytime/eve-
This impacts more on students with a medium ning classes and the quantitative reasoning sco-
social stratum who are not working at the time of res obtained on the test. The low performance is
enrollment. When the icfes test general score rises mainly influenced in the general test score, male
and the job status is not working, performance is gender, and math score below 56.
60 Research Ingeniería Solidaria / Volumen 13, Número 21 / enero 2017
[14] S. M. Merchán Rubiano & J. A. Duarte García, [15] Machine Learning Group at the University of Wai-
“Analysis of data mining techniques for construc- kato, Weka 3: Data Mining Software in Java, The
ting a predictive model for academic performance”, University of Waikato, Hamilton, New Zealand.
in unesco-unir ict & Education Latam Congress Available: http://www.cs.waikato.ac.nz/ml/weka
2016, Online, June, 22-24, 2016, pp. 39-48.