
Using Data Mining to Predict Student Performance

Luis Pineda

The aim of our project was threefold: first, to find the most significant attribute(s) related to a student’s
academic performance in math; second, to dispel any prior notions of what makes a student successful;
and third, to identify harmful traits and practices in the academic context. To accomplish this, we used
data on 395 Portuguese students enrolled in a mathematics class, provided by the UC Irvine Machine
Learning Repository. Because there were 33 attributes, we first divided the dataset into separate potential
contributors to student success: demographics, student life, and parental influences. On each of these
categories, we used Decision Trees and K-Nearest Neighbor to classify students on a binary pass or fail
basis on both the balanced and unbalanced datasets. It is important to note that, unlike the study from
which we derived the data set, we did not include students’ first and second semester grades in our
classification techniques because of their high correlation with the final grade. As a result, our accuracies
were lower, in the 60-70% range. Nonetheless, our results were quite telling. Of the three categories, we
found Student Life to have the most important attributes in predicting whether a student will pass: namely,
whether or not the student had previously failed any classes, the amount of free time the student had, and
the amount of alcohol consumed on weekdays. Running these techniques on the unbalanced dataset, we also
found that the amount of family support a student received was a strong contributor to their success.

Student success is not genetic; although we are raised differently, every child has the potential to be
academically successful. It is often a troubling combination of external circumstances -- a lack of
money for extra help and resources, problems at home, little or no help from family members, few family
role models, and other external responsibilities -- that truly hinders a student’s ability to succeed, let alone
pass a single class. And yet, these endangered students are often the ones most ignored or stripped of
resources. We did not ignore these students; they were our main concern in doing this project.

While the technical goal of our project was to classify successful students and identify the habits and
circumstances that lead to passing a class, mathematics in this case, we always intended to use these
findings to provide troubled students with the resources they need to succeed. If we can accurately say
that, for example, a student is more likely to pass a class if they were present for most of their class
sessions, we could use this information to focus on students with a large number of absences while at the
same time encouraging other students to continue their good attendance. To take this example further, a
teacher with this knowledge in hand could emphasize attendance by offering some kind of incentive.

Literature Review
Cortez and Alice Silva
In the study from which our dataset came, researchers also sought to classify student performance.
In that study, however, they included G1 and G2 (hence their high accuracies) and used different
classifying criteria: pass/fail and letter grade (F-A). Their analysis was also much more thorough than
ours, as they used Naive Bayes, Neural Networks, Decision Trees, Random Forests, and Support Vector
Machines algorithms. Overall, they were able to accomplish their best results with Naive Bayes and
Random Forests.

A chart of the results can be found in the appendix.


Jennifer Cleland, Mary Macleod
This paper aimed to examine the relationships between student debt, mental health, and academic
performance. Students' perceptions of their own levels of debt, rather than the level of debt per se, relate
to performance. Students who worry about money have higher debts and perform less well than their peers
in degree examinations. Students from lower socioeconomic backgrounds and postgraduate students had
higher debts. There was no direct correlation between debt and class ranking or General Health
Questionnaire (GHQ) score; however, a subgroup of 125 students (37.7%), who said that worrying about
money affected their studies, did have higher debt and were ranked lower in their classes.


Damush, Teresa M.; Hays, Ron D.; DiMatteo, M. Robin
Although emotional circumstances were largely ignored in our dataset, it is important to keep in mind
the effect that stressful events and emotional circumstances could have on a student’s academic success.
This study sought to measure how large an effect such events could have through a simple correlation
analysis, although in the context of a student’s health-related quality of life. Through their
analysis, it was discovered that illness, sexuality-related events, and deviance events had the largest

The correlation matrix produced by the study can be seen in the appendix.


ACHIEVEMENT OF STUDENTS IN ENGINEERING. Hackett, Gail; Betz, Nancy E.; Casas, J. Manuel;
Rocha-Singh, Indra A.
This paper examined the relationships of measures of occupational and academic self-efficacy; vocational
interests; outcome expectations; academic ability; and perceived stress, support, and coping to the
academic achievement of women and men enrolled in university-level engineering/science programs.
197 students from diverse racial/ethnic backgrounds responded to scales measuring the variables of
interest; high school and college academic data were obtained from university records. Self-efficacy for
academic milestones, in combination with other academic and support variables, was found to be the
strongest predictor of college academic achievement. Outcome expectations, vocational interests, and low
levels of stress were in turn the strongest predictors of academic self-efficacy.


AN EDUCATIONAL WEB-BASED SYSTEM: Behrouz Minaei-Bidgoli, Deborah A. Kashy, Gerd
Kortemeyer, William F. Punch
Data mining seemed, to us at least, a modern practice. This study, done in the early 2000s, proved us
wrong. Using the classification techniques Quadratic Bayesian classifier, k-nearest neighbor (k-NN),
Parzen-window, Multilayer perceptron (MLP), and Decision Trees, this study aimed at providing better
online courses for students by finding accurate classification models based on web features. Like our own
project, they also used a binary pass/fail classifying criterion, along with high, middle, and low grades, and


There were two separate groups: a policy group, in which students were required to attend class, and a
no-policy group, in which they were not. Classes were split into these two groups and given three exams
throughout a semester, as an experiment to see how the no-policy group would perform on the exams if
they had not been present during the previous classes. It was found that
the no attendance policy group was more likely to answer a question incorrectly by 9% on the first exam,

Data Description & Exploration
Our data consisted of thirty descriptive and three predictive attributes for three hundred and ninety-five
Portuguese students enrolled in a math class. These attributes were acquired from a combination of school
reports and questionnaires. Of these thirty-three attributes, the majority were either strings or numbers.
For attributes like Student’s Guardian and Extra School Support, for example, the values were categorical,
such as “mother/father” or “yes/no”. For attributes that documented things like the quality of a student’s
relationship with their family, the values were a range of numbers encoded with meaning by the
researchers, with 1 meaning “very bad” and 5 meaning “excellent”. Of the 395 students in this study, 208
were female and 187 were male. Of these students, 130 failed the class; however, we did not know this
until we had transformed the data.

Running a correlation analysis on our initial dataset, we found a very strong correlation between G1, G2,
and G3: the first semester, second semester, and final grades, respectively. This was expected, as
the final grade was simply an average of the two semester grades. Because of this obviously strong
correlation, and because more interesting knowledge was to be found elsewhere, we decided to exclude G1
and G2 from our data mining process. As for the other attributes, a correlation was also found between
passing/failing a class and the mother's and father’s education levels, the amount of time spent going out,
and the number of past class failures. Of these correlations, number of past failures was the strongest,
with a correlation value of -0.338.
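The paper does not say which correlation measure or software was used. As a minimal sketch, assuming a Pearson correlation between a numeric attribute and a 0/1 pass indicator, the calculation could look like this. The function name and the toy values are hypothetical and are not taken from the dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy values: past failures and a 0/1 pass indicator.
failures = [0, 0, 1, 2, 3, 0, 1, 0]
passed = [1, 1, 1, 0, 0, 1, 0, 1]

r = pearson(failures, passed)
print(f"correlation(failures, passed) = {r:.3f}")
```

A negative value, like the one reported for past failures, means that students with more past failures tended to pass less often.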

A full description of attributes and their values, as provided by the source, can be found in the appendix.

Data Transformation
Originally, the final grade attribute was an integer in the range 0 - 20, with 0 being the lowest possible
score and 20 the highest. Because we found that the Portuguese grading scale considers anything below
10 a failing grade, we decided to transform this range of values into a simple binary pass or fail. Since
the number of passing students was greater than the number who failed, we decided to create a separate
balanced dataset with an equal number of instances for both cases through random selection. Our data
mining techniques were run on both data sets for increased comprehensiveness.
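The transformation described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used: we assume that "random selection" means randomly undersampling the larger class down to the size of the smaller one, and the function names, seed, and toy grades are hypothetical.

```python
import random

PASS_MARK = 10  # grades below 10 are treated as failing, as stated in the text


def to_pass_fail(final_grade):
    """Map a 0-20 final grade to a binary label."""
    return "pass" if final_grade >= PASS_MARK else "fail"


def balance(records, label_of, seed=0):
    """Randomly undersample every class down to the size of the smallest one."""
    by_label = {}
    for rec in records:
        by_label.setdefault(label_of(rec), []).append(rec)
    smallest = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy grades: 3 failing and 7 passing students.
grades = [0, 5, 9, 10, 11, 12, 14, 15, 18, 20]
labels = [to_pass_fail(g) for g in grades]
balanced = balance(labels, label_of=lambda label: label)
print(labels.count("pass"), labels.count("fail"))
print(balanced.count("pass"), balanced.count("fail"))
```

Undersampling discards some passing students, so the balanced dataset is smaller than the original one.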

Because we wanted to understand what conditions proved most important for a student’s academic
success, we chose to divide our data by hand into different sections, a sort of manual stand-in for PCA:
Student Life, Demographics, and Parental Influences. The student life section’s purpose was to see just
how much of a student’s academic success was directly under their control, which was reflected in its
attributes: study time, extracurricular activities, and time spent going out, among others. The demographics
section was meant to observe how elements outside of a student’s direct control -- things like their sex,
location, and time spent traveling to school -- contributed, if at all, to a student passing a class. As for
parental influences, its purpose was to measure the direct effect parents had on their child’s academic
success.
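One way to express this manual grouping is as a mapping from section name to column names. The sketch below is illustrative only: the column names come from the attribute description in the appendix, the groupings follow the "Attributes of interest" lists given in each analysis section, and the parental list is cut off in this copy of the paper, so only the four attributes that are actually listed appear here. The helper `project` and the sample record are hypothetical.

```python
# Section -> dataset columns, following the "Attributes of interest" lists.
ATTRIBUTE_GROUPS = {
    "student_life": [
        "failures", "absences", "goout", "internet", "Dalc", "freetime",
        "higher", "traveltime", "Walc", "activities", "nursery",
        "romantic", "health", "studytime",
    ],
    "demographics": ["sex", "age", "address", "famsize", "Pstatus"],
    # Incomplete: the source list is truncated after these attributes.
    "parental_influences": ["Fjob", "Mjob", "famrel", "Medu"],
}


def project(record, group):
    """Keep only the columns of one section from a student record (a dict)."""
    return {name: record[name] for name in ATTRIBUTE_GROUPS[group]}


# Hypothetical student record containing a subset of columns.
student = {
    "sex": "F", "age": 17, "address": "U", "famsize": "GT3", "Pstatus": "T",
    "failures": 0, "absences": 4,
}
print(project(student, "demographics"))
```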

Analysis of General Data

Before dividing our data into subsections, we ran the Decision Tree algorithm on the raw,
unfiltered, unbalanced data. Surprisingly, we acquired the best accuracy with the decision trees on the
unbalanced data, with an average testing accuracy of 70%. Our best results were obtained with a maximum
tree depth of 5, a minimum of 16 cases per parent node, and a minimum of 8 cases per child node, giving a
testing accuracy of 73.9%. The three most significant attributes were failures, absences, and time spent
going out. A chart of our results can be seen below.
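The paper does not name the tree-growing algorithm or the software behind these settings. As a rough sketch, assuming a CART-style tree that splits on Gini impurity, and reading the parent and child case settings as minimum node sizes, the code below shows how those thresholds limit which splits are allowed. All names and toy values are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of binary 'pass'/'fail' labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(1 for label in labels if label == "pass") / n
    return 2 * p * (1 - p)


def best_split(values, labels, min_parent=16, min_child=8):
    """Best 'value <= threshold' split as (threshold, weighted Gini), or None.

    A node with fewer than min_parent cases is not split at all, and a split
    is rejected if either side would hold fewer than min_child cases.
    """
    n = len(values)
    if n < min_parent:
        return None
    best = None
    for threshold in sorted(set(values))[:-1]:
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        if len(left) < min_child or len(right) < min_child:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if best is None or score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical toy node: number of past failures and the pass/fail label.
failures = [0] * 10 + [1] * 4 + [2] * 4
labels = ["pass"] * 9 + ["fail"] + ["fail"] * 3 + ["pass"] + ["fail"] * 4
split = best_split(failures, labels)
print(split)
```

In this toy node, splitting at "failures <= 1" would leave only 4 cases on one side, so it is rejected and the split at "failures <= 0" is the only candidate. A maximum depth setting would additionally stop the recursion after a fixed number of levels.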

Analysis of Student Life

Attributes of interest: Failures, Absences, Going Out, Internet Access, Weekday Alcohol Consumption, Free Time,
Desires Higher Education, Travel Time, Weekend Alcohol Consumption, Extracurricular activities, Nursery School,
Romantic Relationships, Health, and Study time

The Decision Tree and KNN algorithms were run on the Student Life section as well. Because all of the
most significant attributes identified in the previous analysis are found in this section, we expected to
have accurate classification models in this subsection of our data. It was somewhat surprising, then, that
the best model we were able to find after running the decision tree algorithm multiple times had a testing
accuracy of 68.8% on the unbalanced set and 64.8% on the balanced set. Of the two classes, failing
students were classified with very low accuracy in both the testing and training sets. This difficulty in
classifying failing students was likely the source of our lower-than-expected overall accuracies. While
failures and time spent going out remained two of the most significant attributes, weekday alcohol
consumption also seemed to play a strong role in classifying a passing or failing student.
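To illustrate why a reasonable overall accuracy can hide poor performance on failing students, the sketch below computes accuracy separately for each actual class. The function name and the toy predictions are hypothetical and do not reproduce our results.

```python
def per_class_accuracy(actual, predicted):
    """Fraction of correctly classified cases within each actual class."""
    totals, correct = {}, {}
    for a, p in zip(actual, predicted):
        totals[a] = totals.get(a, 0) + 1
        if a == p:
            correct[a] = correct.get(a, 0) + 1
    return {label: correct.get(label, 0) / totals[label] for label in totals}


# Hypothetical toy predictions: every passing student is classified
# correctly, but only one of the three failing students is.
actual = ["pass"] * 7 + ["fail"] * 3
predicted = ["pass"] * 7 + ["pass", "pass", "fail"]

overall = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
by_class = per_class_accuracy(actual, predicted)
print(overall, by_class)
```

Here the overall accuracy is 80%, yet only a third of the failing students are identified, which is the pattern described above.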

KNN yielded mixed, yet somewhat better, results, especially when running feature selection. Again,
however, the best results were found when running the algorithm on the unbalanced data set. The KNN
algorithm was run multiple times with four different values for K: 1, 3, 6, and 9. Of these K values, we
found 3 and 6 neighbors to be optimal, with a 71.6% accuracy on the unbalanced dataset. Our results for
the balanced and unbalanced data, with and without feature selection, can be seen below.
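For readers unfamiliar with KNN, the following is a minimal sketch of the classification rule for a given K. It assumes Euclidean distance on numeric attributes and a simple majority vote; the paper does not state which distance measure or tie-breaking rule the software used. The training points are hypothetical (failures, going out) pairs.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k):
    """Label a query point by majority vote among its k nearest neighbors."""
    neighbors = sorted(
        zip(train_points, train_labels),
        key=lambda pair: dist(pair[0], query),  # Euclidean distance
    )[:k]
    votes = Counter(label for _, label in neighbors)
    # On a tied vote (possible for even k such as 6), the label that
    # appears first among the sorted neighbors wins.
    return votes.most_common(1)[0][0]


# Hypothetical toy training set: (past failures, going out) per student.
train_points = [(0, 1), (0, 2), (1, 2), (2, 4), (3, 5), (2, 5)]
train_labels = ["pass", "pass", "pass", "fail", "fail", "fail"]

for k in (1, 3):
    print(k, knn_predict(train_points, train_labels, (3, 4), k))
```

Because the vote depends on raw distances, attributes on larger numeric scales (absences, for example) would dominate unless the attributes are rescaled first.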

Analysis of Demographics
Attributes of interest: Sex, Age, Address, Family Size, and Parent Status

We looked at both the unbalanced and balanced data and analyzed the data sets using Decision Trees
and K-NN, running K-NN both with and without feature selection. Based on our results from the
unbalanced and balanced decision trees, we found that the unbalanced tree was more accurate at correctly
classifying students as passing or failing their class based on the above attributes. We did this
with a depth setting of 10, a parent node setting of 18, and a child node setting of 9, which yielded the
best results in terms of decision trees. As for KNN, the results were not as positive the first time
around (see charts below), so we decided to change the sample split from 66% Training and 33%
Testing to 50% Training and 50% Testing. Doing so increased testing accuracy by as much as 7-10%
per test, while training accuracy remained virtually the same. This tells us that the attributes chosen
for this category were largely irrelevant for predicting whether a given student passes or fails.
(Family size, sex, and address proved to be the most relevant attributes.)
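The change in split proportions can be sketched as a shuffled holdout split. This is an assumption about how the partition was made, since the paper does not describe the sampling procedure; the function name and seed are hypothetical, and the records are stand-ins for the 395 students.

```python
import random


def train_test_split(records, train_fraction, seed=0):
    """Shuffle the records and cut them into training and testing parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


records = list(range(395))  # stand-ins for the 395 student records

train_a, test_a = train_test_split(records, 0.66)
train_b, test_b = train_test_split(records, 0.50)
print(len(train_a), len(test_a))
print(len(train_b), len(test_b))
```

With only 395 records, results from a single split can vary noticeably with the random sample, so differences between two split proportions should be read with some caution.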


Feature Selection

K    Training    Test
1    70.9%       29.1%
3    72.9%       27.1%
6    71.1%       28.9%

No Feature Selection

K    Training    Test
1    70.1%       29.9%
3    65.8%       34.2%
6    71.1%       28.9%

Analysis of Parental Influences

Attributes of interest: Father’s Job, Mother’s Job, Quality of Family Relationship, Mother’s Education, Father’s

We looked at both the unbalanced and balanced data and analyzed the data sets using Decision Trees
and K-NN, running K-NN both with and without feature selection. Based on our results from the
unbalanced and balanced decision trees, we found that the balanced tree was more accurate at correctly
classifying students as passing or failing their class based on the above attributes. We did this
with a depth setting of 10, a parent node setting of 16, and a child node setting of 8, which yielded the
best results in terms of decision trees.

As for KNN, the results were again not very positive (see charts below). Just like before, we
decided to change the sample split from 66% Training and 33% Testing to 50% Training and 50%
Testing. Doing so increased testing accuracy by as much as 12-16% per test, while training accuracy
remained virtually the same. This tells us that most of these attributes are irrelevant to predicting
whether or not a student will pass, but are more relevant than the demographic attributes, which were
very general. (Father's job, mother's job, and mother's education proved to be the most relevant
attributes in this category.)


Feature Selection

K    Training    Test
1    67.8%       32.2%
3    71.2%       28.8%
6    69.6%       30.4%

No Feature Selection

K    Training    Test
1    73.9%       26.1%
3    72.7%       27.3%
6    68.9%       31.1%

Both group members collaborated on which classification models to use and settled on decision trees and
KNN to analyze our data. Luis focused more on the transformation of data along with the analysis of
student life, while Ben focused more on the demographics and parental influences to determine whether the
chosen attributes would allow for knowledge discovery. Ben found that the categories he analyzed were
largely irrelevant when trying to accurately predict whether a student would pass or fail, while Luis
found that absences and previous class failures were the attributes most indicative of whether a student
would pass or fail the class. Both members worked on the presentation and paper together and overall
collaborated very well.

Conclusions & Future Work

Though our accuracies were mixed, we are satisfied with our results considering our choice to exclude G1
and G2. As seen in each of our subsections, it is more difficult to classify a failing student. We do,
however, know that attributes like having multiple absences or having previously failed a class are
likely to affect a student’s chance of success. As such, we could use the information from this process to
encourage instructors to give increased attention to the students who fall under these circumstances.

In the future, we would like to transform this largely categorical data into a form that would allow us to
run a regression analysis as well as clustering techniques. We think these techniques could provide more
accurate, or at least more interesting, analysis. It would also be interesting to attempt to
use the same techniques used in the source study, especially Naive Bayes which yielded them the best




ACHIEVEMENT OF STUDENTS IN ENGINEERING. Hackett, Gail; Betz, Nancy E.; Casas, J. Manuel;
Rocha-Singh, Indra A.


AN EDUCATIONAL WEB-BASED SYSTEM: Behrouz Minaei-Bidgoli, Deborah A. Kashy, Gerd
Kortemeyer, William F. Punch


Jennifer Cleland, Mary Macleod


Damush, Teresa M.; Hays, Ron D.; DiMatteo, M. Robin


Cortez and Alice Silva
Attribute Description
1. school - student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)
2. sex - student's sex (binary: 'F' - female or 'M' - male)
3. age - student's age (numeric: from 15 to 22)
4. address - student's home address type (binary: 'U' - urban or 'R' - rural)
5. famsize - family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3)
6. Pstatus - parent's cohabitation status (binary: 'T' - living together or 'A' - apart)
7. Medu - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 -
secondary education or 4 - higher education)
8. Fedu - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 -
secondary education or 4 - higher education)
9. Mjob - mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police),
'at_home' or 'other')
10. Fjob - father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police),
'at_home' or 'other')
11. reason - reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or
12. guardian - student's guardian (nominal: 'mother', 'father' or 'other')
13. traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4
- >1 hour)
14. studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
15. failures - number of past class failures (numeric: n if 1<=n<3, else 4)
16. schoolsup - extra educational support (binary: yes or no)
17. famsup - family educational support (binary: yes or no)
18. paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
19. activities - extra-curricular activities (binary: yes or no)
20. nursery - attended nursery school (binary: yes or no)
21. higher - wants to take higher education (binary: yes or no)
22. internet - Internet access at home (binary: yes or no)
23. romantic - with a romantic relationship (binary: yes or no)
24. famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
25. freetime - free time after school (numeric: from 1 - very low to 5 - very high)
26. goout - going out with friends (numeric: from 1 - very low to 5 - very high)
27. Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
28. Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
29. health - current health status (numeric: from 1 - very bad to 5 - very good)
30. absences - number of school absences (numeric: from 0 to 93)
Descriptive Data Statistics
Using Data Mining to Predict Secondary School Performance

Stressful life events and health-related quality of life in college students
