Luis Pineda
Abstract
The aim of our project was threefold: first, to find the attributes most strongly related to a student's academic performance in math; second, to dispel prior notions of what makes a student successful; and third, to identify harmful traits and practices in the academic context. To accomplish this, we used data on 395 Portuguese students enrolled in a mathematics class, provided by the UC Irvine Machine Learning Repository. Because there were 33 attributes, we first divided the dataset into separate groups of potential contributors to student success: demographics, student life, and parental influences. On each of these categories we used Decision Trees and K-Nearest Neighbors to classify students on a binary pass-or-fail basis, on both balanced and unbalanced versions of the dataset. It is important to note that, unlike the study from which we derived the dataset, we did not include students' first and second semester grades in our classification techniques because of their high correlation with the final grade. As a result, our accuracies were lower, in the 60-70% range. Nonetheless, our results were quite telling. Of the three categories, we found Student Life to contain the attributes most important for predicting whether a given student will pass: namely, whether a student had failed any classes before, the amount of free time a student had, and the amount of alcohol consumed during the weekdays. Running these techniques on the unbalanced dataset, we also found that the amount of family support a student received was a strong contributor to their success.
Introduction
Student success is not genetic; although we are raised differently, every child has the potential to be academically successful. It is often a troubling combination of external circumstances (a lack of money for extra help and resources, problems at home, little or no help from family members, few family role models, and other outside responsibilities) that truly hinders a student's ability to succeed, let alone pass a single class. And yet, these endangered students are often the ones most ignored or stripped of resources. We did not ignore these students; they were our main concern in doing this project.

While the technical goal of our project was to classify successful students and identify the habits and circumstances that lead to passing a class, mathematics in this case, we always intended to use these positive habits to provide troubled students with the resources they need to succeed. If we can accurately say, for example, that a student is more likely to pass a class if they were present for most of their class sessions, we could turn this information around to focus on students with a large number of absences while at the same time encouraging students to continue their good attendance. To take this example further, a teacher with this knowledge in hand could put emphasis on attendance by offering some kind of incentive.
Literature Review
USING DATA MINING TO PREDICT SECONDARY SCHOOL STUDENT PERFORMANCE: Paulo
Cortez and Alice Silva
In the study from which our dataset came, the researchers also sought to classify student performance. Unlike us, however, they included G1 and G2 (hence their high accuracies) and used different classification targets: binary pass/fail and a five-level letter grade (A-F). Their analysis was also much more thorough than ours, as they used Naive Bayes, Neural Network, Decision Tree, Random Forest, and Support Vector Machine algorithms. Overall, they achieved their best results with Naive Bayes and Random Forests.
The correlation matrix produced by the study can be seen in the appendix.
Methodology
Data Description & Exploration
Our data consisted of thirty descriptive and three predictive attributes for three hundred ninety-five Portuguese students enrolled in a math class. These attributes were acquired from a combination of school reports and questionnaires. Of these thirty-three attributes, the majority were either categorical strings or numbers. Attributes like Student's Guardian and Extra School Support, for example, took values such as "mother/father" or "yes/no". Attributes documenting things like the quality of a student's relationship with their family took a range of numbers encoded with meaning by the researchers, with 1 meaning "very bad" and 5 meaning "excellent". Of the 395 students in this study, 208 were female and 187 were male. Of these students, 130 failed the class; however, we did not know this until first transforming the data.
Running a correlation analysis on our initial dataset, we found a very strong correlation between G1, G2, and G3 (the first-semester, second-semester, and final grades, respectively). This was expected, as the final grade was essentially an average of the two semester grades. Because of this obviously strong correlation, and because more interesting knowledge was to be found elsewhere, we decided to exclude G1 and G2 from our data mining process. Among the other attributes, a correlation was also found between passing/failing the class and the mother's and father's education levels, the amount of time spent going out, and the number of past class failures. Of these, the number of past failures was the most significant, with a correlation value of -0.338.
A full description of the attributes and their values, as provided by the source, can be found in the appendix.
Data Transformation
Originally, the final grade attribute was an integer in the range 0-20, with 0 being the lowest possible score and 20 the highest. Because the Portuguese grading scale considers anything below 10 a failing grade, we decided to transform this range of values into a simple binary pass or fail. Since the number of passing students was greater than the number who failed, we also created a separate balanced dataset with an equal number of instances of each class through random selection. Our data mining techniques were run on both datasets for increased comprehensiveness.
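The binarization and undersampling steps above can be sketched as follows. This assumes pandas; the tiny DataFrame is a stand-in for the real 395-row dataset.

```python
import pandas as pd

# Toy stand-in for the dataset: more passing rows than failing rows.
df = pd.DataFrame({"G3": [15, 12, 11, 14, 10, 13, 16, 11, 12, 10, 4, 7, 9, 3]})

# Portuguese scale: anything below 10 counts as a fail.
df["outcome"] = df["G3"].apply(lambda g: "pass" if g >= 10 else "fail")

# Balanced set: randomly undersample so each class has as many rows
# as the minority (failing) class.
n_minority = df["outcome"].value_counts().min()
balanced = df.groupby("outcome").sample(n=n_minority, random_state=0)
print(balanced["outcome"].value_counts())
```

The unbalanced experiments simply use `df` as-is, while the balanced ones use `balanced`.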
Because we wanted to understand which conditions proved most important for a student's academic success, we chose to do a manual partitioning of the features (a rough, hand-made analogue of dimensionality reduction) and divide our data into different sections: Student Life, Demographics, and Parental Influences. The Student Life section's purpose was to see just how much of a student's academic success was directly under their control, which was reflected in its attributes: study time, extracurricular activities, and time spent going out, among others. The Demographics section observed how elements outside of a student's direct control (things like sex, home location, and time spent traveling to school) contributed, if at all, to a student passing the class. As for Parental Influences, its purpose was to measure the direct effect parents had on their child's academic success.
Analysis of Student Life
Decision Tree and KNN algorithms were run on the Student Life section. Because all of the most significant attributes identified in the correlation analysis fall in this section, we expected accurate classification models on this subset of the data. It was somewhat surprising, then, that the best model we found by running the decision tree algorithm multiple times had a testing accuracy of only 68.8% on the unbalanced set and 64.8% on the balanced set. In both the testing and training sets, accuracy was very low for correctly classifying failing students; this difficulty was likely the source of our lower-than-expected overall accuracies. While past failures and time spent going out remained two of the most significant attributes, weekday alcohol consumption also seemed to play a strong role in classifying a passing or failing student.
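A decision-tree run of this kind might look like the following scikit-learn sketch. The data is synthetic (a toy rule in which students with no prior failures pass), so it will not reproduce the 68.8%/64.8% accuracies reported above; it only shows the shape of the experiment.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 395

# Synthetic stand-ins for three Student Life attributes (the real UCI
# columns are named failures, freetime, and Dalc).
failures = rng.integers(0, 4, n)   # number of past class failures
freetime = rng.integers(1, 6, n)   # free time after school, 1-5
dalc = rng.integers(1, 6, n)       # weekday alcohol consumption, 1-5
X = np.column_stack([failures, freetime, dalc])

# Toy labeling rule for the sketch: students with no prior failures pass.
y = (failures == 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)
tree = DecisionTreeClassifier(max_depth=10, random_state=0).fit(X_tr, y_tr)
print(f"test accuracy: {tree.score(X_te, y_te):.3f}")
```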
KNN yielded mixed yet somewhat better results, especially when running feature selection. Again, however, the best results came from the unbalanced dataset. The KNN algorithm was run multiple times with four different values of K: 1, 3, 6, and 9. Of these, we found 3 and 6 neighbors to be optimal, with a 71.6% accuracy on the unbalanced dataset. Our results for the balanced/unbalanced and feature-selection/no-feature-selection combinations can be seen below.
Analysis of Demographics
Attributes of interest: Sex, Age, Address, Family Size, and Parent Status
We analyzed both the unbalanced and balanced datasets using Decision Trees and K-NN, running the K-NN both with and without feature selection. Based on our results, the unbalanced decision tree was more accurate at classifying students as passing or failing on the attributes above. The best decision-tree results came from a depth setting of 10, a parent node size of 18, and a child node size of 9. As for KNN, we did not get as positive a response on the first attempt (see the charts below), so we changed the split from 66% training and 33% testing to 50% training and 50% testing. Doing this increased testing accuracy by up to 7-10% per test while training accuracy remained virtually the same. This tells us that the attributes chosen for this category were largely irrelevant for predicting whether a given student would pass or fail, although family size, sex, and address proved to be the most relevant of them.
With feature selection:
K    Training    Test
1    70.9%       29.1%
3    72.9%       27.1%
6    71.1%       28.9%
Without feature selection:
K    Training    Test
1    70.1%       29.9%
3    65.8%       34.2%
6    71.1%       28.9%
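One way to express the feature-selection variant of the KNN runs is a scikit-learn pipeline; this is an assumption about tooling (not what we actually ran), and the attribute stand-ins and toy labels below are invented for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(2)
n = 395
# Five stand-ins for the demographic attributes (sex, age, address,
# family size, parent status); only the first two drive the toy label.
X = rng.integers(0, 5, size=(n, 5)).astype(float)
y = (X[:, 0] + X[:, 1] > 4).astype(int)

# 50/50 split, as in our second round of experiments.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=2)

# Keep the 2 attributes with the highest ANOVA F-score, then run 3-NN.
model = make_pipeline(SelectKBest(f_classif, k=2),
                      KNeighborsClassifier(n_neighbors=3))
model.fit(X_tr, y_tr)
print(f"test accuracy: {model.score(X_te, y_te):.3f}")
```

The "no feature selection" runs simply drop the SelectKBest step from the pipeline.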
Analysis of Parental Influences
We again analyzed both the unbalanced and balanced datasets using Decision Trees and K-NN, running the K-NN both with and without feature selection. Based on our results, the balanced decision tree was more accurate at classifying students as passing or failing on these attributes. The best decision-tree results came from a depth setting of 10, a parent node size of 16, and a child node size of 8.
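The depth and node-size settings above use our mining tool's vocabulary. If one were to reproduce the run in scikit-learn (an assumption; it is not the tool we used), the closest analogues are max_depth, min_samples_split ("parent node"), and min_samples_leaf ("child node"), sketched here on synthetic data with invented parental attributes.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
n = 395
# Stand-ins for parental attributes (e.g. Mjob, Fjob, Medu, Fedu).
X = rng.integers(0, 5, size=(n, 4)).astype(float)
y = (X[:, 2] >= 3).astype(int)   # toy rule: mother's education drives passing

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=3)

# Depth 10, parent node 16, child node 8, mapped to scikit-learn terms:
tree = DecisionTreeClassifier(max_depth=10,
                              min_samples_split=16,  # "parent node" size
                              min_samples_leaf=8,    # "child node" size
                              random_state=3).fit(X_tr, y_tr)
print(f"test accuracy: {tree.score(X_te, y_te):.3f}")
```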
As for KNN, we again did not see much of a positive response (see the charts below). Just as before, we changed the split from 66% training and 33% testing to 50% training and 50% testing. Doing this increased testing accuracy by up to 12-16% per test while training accuracy remained virtually the same. This tells us that most of these attributes are irrelevant to predicting whether or not a student will pass, though they are more relevant than the demographic attributes, which were very general and generic to say the least. Father's job, mother's job, and mother's education proved to be the most relevant attributes in this category.
With feature selection:
K    Training    Test
1    67.8%       32.2%
3    71.2%       28.8%
6    69.6%       30.4%
Without feature selection:
K    Training    Test
1    73.9%       26.1%
3    72.7%       27.3%
6    68.9%       31.1%
Contributions
Both group members collaborated on which classification models to pursue, and came to the conclusion of using decision trees and KNN to analyze our data. Luis focused on the transformation of the data and the analysis of student life, while Ben focused on the demographics and parental influences, to determine whether the chosen attributes would allow for knowledge discovery. Ben found that the categories he analyzed were largely irrelevant for accurately predicting whether a student would pass or fail, while Luis found that absences, and whether a student had previously failed a class, made a student more vulnerable to failing the class. Both members worked on the presentation and paper together and collaborated very well overall.
In the future, we would like to transform this largely categorical data into a form that would allow us to do a regression analysis, as well as apply clustering techniques. We think these techniques could provide a more accurate, or at least more interesting, analysis. It would also be interesting to attempt the same techniques used in the source study, especially Naive Bayes, which yielded the best results for its authors.
References
Cortez, P., & Silva, A. (2008). Using data mining to predict secondary school student performance. In Proceedings of the 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) (pp. 5-12). EUROSIS.