Abstract—An experimental study is conducted to solve a binary classification problem on the large dataset of the European Survey of Schools: ICT in Education (known as ESSIE) using IBM Modeler version 18.1. The survey was conducted by ESSIE at various levels [1-3] of schools under ISCED (International Standard Classification of Education). To predict the gender of teachers based on their answers, the authors applied 4 supervised machine learning algorithms, filtered out of 12 classifiers by the auto classifier, to ISCED-1 and ISCED-2 level schools. Out of a total of 158 attributes, self-reduction and the auto classifier stabilized only 134 attributes for the Bayesian Network (BN) and Random Tree (RT) at level-1, and 134 attributes for logistic regression and 41 attributes for the Decision Tree (C5) at level-2. The MissingValue filter of the Weka 3.8.1 tool handled 55641 missing values at the ISCED-2 level and 19415 at the ISCED-1 level, and normalization was also applied. The outcomes of the study reveal that the decision tree (C5) classifier outperformed logistic regression (LR) after feature extraction at ISCED-2 level schools, and that the Random Tree classifier predicted the gender of the teacher more accurately than the Bayesian Network at level-1 schools. Further, the presented predictive models stabilized 134 attributes with 2926 instances for predicting the gender of teachers at level-1 schools and 134 attributes with 7542 instances at level-2 schools.
Index Terms—Classification, Supervised Machine learning, Sensitivity, Teacher gender prediction.
978-1-5386-6678-4/18/$31.00 © 2018 IEEE
I. INTRODUCTION
In 2011, the European Commission conducted a mega-survey, collecting over 190,000 completed questionnaires from students, teachers and head teachers in 27 countries, to analyze Information and Communication Technology (ICT) in ISCED level-1 (primary level of education), ISCED level-2 (lower secondary level of education) and ISCED level-3 (upper secondary level of education), distinguishing level-3A academic and level-3B vocational [1]. The authors considered the teacher datasets belonging to ISCED level-1 and ISCED level-2 to predict gender based on the responses provided.
Data mining is the process of sorting through large data sets to identify patterns and establish relationships in order to solve a variety of problems through data analysis. Every year the data mining research community addresses new open problems and new problem areas, for many of which data mining can provide value-added answers and results [2]. Machine learning research aims to automatically learn to recognize complex patterns and make intelligent decisions based on data [3]. Supervised learning assumes that training examples are classified (labeled by class labels), and predictive modeling is widely used to forecast a target or dependent attribute based on the values of other attributes [2]. The C5.0 classifier produces a decision tree with rules that provide the maximum information gain at each level, and the response variable must be categorical [4, 5, 6]. A Bayesian network encodes probabilistic associations among different variables [7]; a BN consists of several variables and a set of edges between the variables, resulting in an acyclic graph [8]. Depending on the training dataset, it can give the highest accuracy in classification, and more discussion is available in [9, 10, 11]. Logistic regression was applied to develop a model for the early and reliable prediction of the pass or fail status of undergraduate students [12]. The random tree classifier is an improvement over the classification and regression tree; it comes with a bagging feature and is better suited to binary classification problems [13]. According to [14], the Auto Classifier estimates and compares models for either nominal (set) or binary (yes/no) targets, using many different methods.
Classification is an important task of data mining; it is a supervised class prediction technique. The main goal is to accurately predict the class for each data item [15], provided that sufficient numbers of classes are available. In the past, classification has been applied in several fields of research such as terrorism prediction, finance, weather prediction and medicine. Classification can be binary or multiclass, where binary classification is the task of classifying items into two groups based on a classification rule
[16]. Binary logistic regression has been applied to predict the gender of European school students with 62% accuracy [17]. Various supervised machine learning classifiers such as SVM, RF and decision tree have also been applied to predict the gender and residence state of students [18], [19]. IBM SPSS Modeler is a set of data mining tools that enable you to quickly develop predictive models using business expertise and deploy them into business operations to improve decision making [5]. In the present paper, the authors solved a binary classification problem on the European school dataset to predict the gender of teachers using the supervised machine learning classifiers BN, C5 tree, Random Tree and Logistic Regression. The present study may help to predict further demographic features of teachers, such as age, locality and expertise, based on the answers they provide in any ICT survey held in an educational institution. The experimental study is organized into the following sections.
II. METHODS AND TECHNIQUES FOR STABILIZATION OF ATTRIBUTES
A. Dataset
More than 2500 schools from 27 European countries participated in the survey held in 2011 (primary and lower secondary level of education). An experimental study is conducted to predict the gender of the teacher based on the answers they provided during the survey on information and communication technology, using IBM Modeler 18.1. The authors trained two different datasets collected by ESSIE and funded by the European Commission Information Society and Media Directorate General, which are also available online [1]. The ISCED-1 level dataset has a total of 3088 instances, 158 attributes and 19415 missing values, and the ISCED-2 level dataset has a total of 7897 instances, 158 attributes and 55641 missing values. The Gender attribute has three classes: 1-Female, 2-Male, X-Misplaced. Hence, 355 instances belonging to the X category are removed manually. The authors considered only gender as the target attribute. The attributes of the dataset cover Experience with ICT, ICT access, Support to ICT use, ICT based activities, ICT material and obstacles in ICT use, Learning activities, Teacher skills, and Teacher opinions and attitudes. The scale of measurement was of mixed type, such as nominal, ordinal, interval and categorical. The answers given by teachers were numeric.
B. Preprocessing
Before using a dataset, it is essential to improve the data quality [20]. Several techniques are used for data preprocessing [21], such as aggregation, sampling, dimension reduction, variable transformation, and dealing with missing values. To obtain a quality, stabilized dataset, the Weka version 3.8.1 tool is applied to two different files belonging to the ISCED-1 level and the ISCED-2 level. The MissingValue filter counted 55641 missing values in the ISCED-2 level dataset and 19415 missing values in the ISCED-1 level dataset. According to [21, 22], missing data values need to be handled. The missing values are handled with the ReplaceMissingValues filter, which replaces missing values with the mean and mode values of the whole dataset. The dataset is rescaled to the range 0 to 1 using the Normalize filter, which normalized all data except the target attribute gender. In both datasets, attributes numbered from 1 to 11 and from 36 to 147 are removed using the Self Reduction method, as they are index and mean values. Further, the auto classifier in IBM Modeler version 18.1 is used to select significant variables and the best classifiers for the training and testing datasets. Hence, after feature reduction, 134 attributes with 2926 instances are selected for the Bayesian network and random tree at ISCED level-1, and 134 attributes with 7542 instances are selected for logistic regression and 41 attributes stabilized for the decision tree at ISCED level-2. Fig. 1 shows the schematic diagram of the research after feature extraction, with the classifiers used to predict the gender of the teacher.
C. Classifiers
The filtered dataset is trained using the auto classifier algorithm with 10-fold cross-validation. The discard policy of the auto classifier is set to less than 80% accuracy with 0.8 AUC (Area Under Curve). Out of a total of 12 machine learning classifiers, the auto classifier algorithm suggested the two best models, C5 and LR, for the ISCED-2 level school dataset and the two best models, BN and RT, for the ISCED-1 level school dataset. Therefore, to predict the target variable based on 134 predictors, four optimal supervised machine learning algorithms, C5, RT, BN and LR, are selected by applying the auto classifier algorithm.
D. Performance Evaluation
Experimental results of the presented models are evaluated using the following six major performance measures:
(a) Classification Accuracy: the number of correct predictions of teacher gender out of all predictions.
(b) Error: the misclassification of target classes.
(c) AUC: the area under the ROC curve, which is also appropriate for showing the accuracy of the models.
(d) ROC: the receiver operating characteristic curve presents a graphical evaluation of the models, showing the true positive rate (Sensitivity) on the y-axis and the false positive rate (1-Specificity) on the x-axis at various thresholds.
(e) Coincidence matrix: reflects the performance of a classification model on a set of test data for which the true values are known.
(f) Gini: calculated by subtracting the sum of the squared probabilities of each class from one, which measures the inequality among the values of a frequency distribution.
To evaluate the results, the IBM Modeler Analysis algorithm (node) is applied to find the accuracy, Gini, AUC and coincidence matrix, and the evaluation algorithm is also applied to produce ROC curves.
E. Optimal Frequency Distribution
The auto classifier algorithm is a powerful technique which selects the optimal classifier for precision based on the dataset. The authors trained all datasets with cross-validation by testing 12 different supervised machine learning classifiers using the auto classifier. The threshold limit for overall accuracy is set to 80%.
It can also be seen that the auto classifier algorithm selected the two best classifiers, C5 tree and logistic regression,
out of 12 machine learning classifiers to predict the gender of the teacher at the ISCED-2 level. Fig. 2 (a) shows the best predictive distribution of gender based on the Random tree and the Bayesian network, which show 96.27% and 89.71% accuracy over 134 attributes respectively. Fig. 2 (b) shows the optimal distribution of the gender of the teacher at the ISCED-2 level based on the C5 tree, with 82.76% accuracy over 41 attributes after feature reduction, and logistic regression, with 81.65% accuracy over 134 attributes. On the one hand, the distribution models show that at the ISCED-1 level the maximum prediction accuracy is achieved by the Random tree as compared to the Bayesian network; on the other hand, the C5 tree outperformed logistic regression in gender prediction at the ISCED-2 level.
F. Experimental Results, Analysis, and Evaluation for ISCED Level-1 Schools
Based on the optimal distribution, the 2926 instances of the ISCED-1 level dataset are tested through the Bayesian network, including one target and 134 predictors. Fig. 3 shows the significant role of each predictor in predicting the gender of the teacher. It can be seen from the network that the dark blue circles have the highest importance values, such as 1.0, 0.8 and 0.6. All 134 predictors have more than 0.4 percent contribution in the Bayesian network.
It is found that the Bayesian network performed well in the classification of gender based on the dataset. The Graphboard node is a very powerful algorithm which takes an executable predictive model and produces classification tables. Fig. 4 shows the classification table of teacher gender based on the Bayesian network for ISCED level-1. It is clear from the classification table that the Bayesian network accurately classified 2302 female teachers, and only a small number of females went to the male category. In the case of male prediction, a smaller count of teachers, 181, falls into the female category.
Fig. 5 shows the classification table of the random tree for ISCED level-1 teachers. It is also visible that the random tree classified female teachers much more accurately (2402), and only a tiny number of males (28) are misclassified. This classifier also performed well for male gender prediction. Further, to validate the presented models and classification tables, ROC curves are built using the evaluation node of IBM Modeler. The ROC curve plots the true positive (TP) rate, which is the sensitivity of the model, on the Y-axis and the false positive (FP) rate, which equals 1-specificity, on the X-axis. The sensitivity of the model is the proportion of actual females who are correctly predicted as female, and the specificity is the proportion of actual males who are correctly predicted as male.
It can be seen from Fig. 6 that the Bayesian network shows an increasing TP rate for females, which starts from 0.36 and reaches a maximum of 0.99 with varying thresholds. A TP rate (sensitivity) of 0.90 with an FP rate of 0.19 is found at threshold 0.2, and a TP rate of 0.99 with an FP rate of 0.7 at threshold 0.8, which reveals the significance of the predictive model.
Further, Fig. 7 shows the random tree model validation using ROC, which reflects a significant TP rate that starts from 0.54 and ends at 0.99 as the threshold changes. It can also be seen that at threshold 0.1 the sensitivity is high (0.97) and the FP rate is 0.07, which confirms the significance of this model too. As the threshold reaches 0.4, the sensitivity is 0.98 and the FP rate is 0.18. Therefore, the random tree outperformed the Bayesian network in predicting male teachers at the ISCED-1 level.
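The TP rate and FP rate read off the ROC curves above can be derived from a coincidence matrix at a chosen threshold. The following Python sketch illustrates that calculation only; it is not the IBM Modeler evaluation node. The function names, the "F"/"M" label coding, and the toy scores are hypothetical, and the sketch assumes the common convention that a teacher is predicted female when the model score is at or above the threshold. Under that convention both rates fall as the threshold rises, so the threshold axis in Figs. 6 and 7 may follow a different convention.

```python
from typing import Sequence, Tuple


def confusion_counts(
    y_true: Sequence[str],
    scores: Sequence[float],
    threshold: float,
    positive: str = "F",
) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn), treating `positive` as the positive class.

    A case is predicted positive when its score is >= threshold
    (an assumed convention, not taken from the paper).
    """
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, scores):
        predicted_positive = score >= threshold
        if label == positive:
            if predicted_positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_positive:
                fp += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def roc_point(
    y_true: Sequence[str],
    scores: Sequence[float],
    threshold: float,
    positive: str = "F",
) -> Tuple[float, float]:
    """Return (fp_rate, tp_rate): 1 - specificity and sensitivity."""
    tp, fp, tn, fn = confusion_counts(y_true, scores, threshold, positive)
    tp_rate = tp / (tp + fn) if (tp + fn) else 0.0
    fp_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return fp_rate, tp_rate


def accuracy(
    y_true: Sequence[str],
    scores: Sequence[float],
    threshold: float,
    positive: str = "F",
) -> float:
    """Share of correct predictions out of all predictions."""
    tp, fp, tn, fn = confusion_counts(y_true, scores, threshold, positive)
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0


# Toy data (hypothetical, not from the ESSIE dataset):
# true gender and the model's score for the class "F".
toy_labels = ["F", "F", "F", "F", "M", "M", "M", "F"]
toy_scores = [0.9, 0.8, 0.7, 0.35, 0.6, 0.3, 0.1, 0.55]

if __name__ == "__main__":
    # Sweep a few thresholds to trace points of the ROC curve.
    for t in (0.2, 0.5, 0.8):
        fpr, tpr = roc_point(toy_labels, toy_scores, t)
        print(f"threshold={t:.1f}  FP rate={fpr:.2f}  TP rate={tpr:.2f}")
```

Sweeping the threshold over a fine grid and plotting the resulting (FP rate, TP rate) pairs gives an ROC curve of the kind shown in Figs. 6 and 7.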