
Gender Prediction of the European School’s Teachers Using Machine Learning: Preliminary Results

1st Chaman Verma
Department of Media and Educational Informatics
Eötvös Loránd University
Budapest, Hungary

2nd Ahmad S. Tarawneh
Department of Algorithm and Their Application
Eötvös Loránd University
Budapest, Hungary

3rd Zoltán Illés
Department of Media and Educational Informatics
Eötvös Loránd University
Budapest, Hungary

4th Veronika Stoffová
Faculty of Education
Trnava University
Trnava, Slovakia

5th Sanjay Dahiya
Ch. Devi Lal State Institute of Engineering & Technology
Sirsa, India

chaman@inf.elte.hu, illes@inf.elte.hu

8th International Advance Computing Conference (IACC).
978-1-5386-6678-4/18/$31.00 ©2018 IEEE

Abstract: An experimental study is conducted to solve a binary classification problem on a big dataset from the European Survey of Schools: ICT in Education (known as ESSIE) using IBM Modeler version 18.1. The survey was conducted by ESSIE at various levels [1-3] of schools under ISCED (International Standard Classification of Education). To predict the gender of teachers based on their answers, the authors applied 4 supervised machine learning algorithms, filtered out of 12 classifiers by the auto classifier, on ISCED-1 and ISCED-2 level schools. Out of a total of 158 attributes, self-reduction and the auto classifier stabilized only 134 attributes for the Bayesian Network (BN) and Random Tree (RT) at level-1, and 134 attributes for logistic regression and 41 attributes for the Decision Tree (C5) at level-2. The MissingValue filter of the Weka 3.8.1 tool handled 55641 missing values at the ISCED-2 level and 19415 at the ISCED-1 level, and normalization is applied as well. The outcomes of the study reveal that the decision tree (C5) classifier outperformed logistic regression (LR) after feature extraction at ISCED-2 level schools, and the Random Tree classifier predicted the gender of the teacher more accurately than the Bayesian Network at level-1 schools. Further, the presented predictive models stabilized 134 attributes with 2926 instances to predict the gender of teachers of level-1 schools and 134 attributes with 7542 instances for level-2 schools.

Index Terms: Classification, Supervised Machine learning, Sensitivity, Teacher gender prediction.

I. INTRODUCTION

In 2011, the European Commission conducted a mega-survey, collecting over 190,000 filled questionnaires from students, teachers and head teachers in 27 countries, to analyze Information and Communication Technology (ICT) at ISCED level-1 (primary level of education), ISCED level-2 (lower secondary level of education) and ISCED level-3 (upper secondary level of education), distinguishing level-3A academic and level-3B vocational [1]. The authors considered the teacher datasets belonging to ISCED level-1 and ISCED level-2 to predict the gender of teachers based on the responses they provided.

Data mining is the process of sorting through large data sets to identify patterns and establish relationships in order to solve a variety of problems through data analysis. Every year the data mining research community addresses new open problems and new problem areas, for many of which data mining can provide value-added answers and results [2]. Machine learning research aims to automatically learn to recognize complex patterns and make intelligent decisions based on data [3]. Supervised learning assumes that training examples are classified (labeled by class labels), and predictive modeling is most commonly used to forecast a target or dependent attribute based on the values of other attributes [2].

The C5.0 classifier produces a decision tree with rules that provide the maximum information gain at each level, and the response variable must be categorical [4, 5, 6]. A Bayesian Network encodes probabilistic associations among different variables [7]; a BN consists of several variables and a set of edges between the variables, resulting in an acyclic graph [8]. Depending on the training dataset, it gives the highest accuracy in classification, and more discussion is available in [9, 10, 11]. Logistic regression has been applied to develop a model for the early and reliable prediction of the pass or fail status of undergraduate students [12]. The random tree classifier is an improvement over the classification and regression tree; it comes with a bagging feature and is better suited to binary classification problems [13]. According to [14], the Auto Classifier estimates and compares models for either nominal (set) or binary (yes/no) targets, using many different methods.

Classification is an important task of data mining; it is a supervised class prediction technique. The main goal is to accurately predict the class for each data item [15], provided that sufficient numbers of classes are available. In the past, classification has been applied in several fields of research such as terrorism prediction, finance, weather prediction and medicine. A classification can be binary or multiclass, where binary classification is the task of classifying the two groups based on a classification rule [16]. Binary logistic regression has been applied to predict the gender of European school students with 62% accuracy [17]. Various supervised machine learning classifiers such as SVM, RF and decision tree have also been applied to predict the gender and residence state of students [18], [19]. IBM SPSS Modeler is a set of data mining tools that enable you to quickly develop predictive models using business expertise and deploy them into business operations to improve decision making [5].

In the present paper, the authors solve a binary classification problem on the European school dataset to predict the gender of teachers using the supervised machine learning classifiers BN, C5 tree, Random Tree and Logistic Regression. The present study may help to predict more demographic features of teachers, such as age, locality and expertise, based on their answers provided in any ICT survey held in an educational institution. The experimental study is organized into the following sections.

II. METHODS AND TECHNIQUES FOR STABILIZATION OF ATTRIBUTES

A. Dataset

More than 2500 schools from 27 European countries participated in the survey held in 2011 (primary and lower secondary level of education). An experimental study is conducted to predict the gender of the teacher based on the answers they provided during the survey on information and communication technology, using IBM Modeler 18.1. The authors trained on two different datasets collected by ESSIE and funded by the European Commission Information Society and Media Directorate General, which are also available online [1]. The ISCED-1 level dataset has a total of 3088 instances, 158 attributes and 19415 missing values, and the ISCED-2 level dataset has a total of 7897 instances, 158 attributes and 55641 missing values. The Gender attribute has three classes: 1-Female, 2-Male, X-Misplaced. Hence, the 355 instances belonging to the X category are removed manually. The authors considered only gender as the target attribute. The attributes of the dataset relate to experience with ICT, ICT access, support for ICT use, ICT-based activities, ICT material and obstacles in ICT use, learning activities, teacher skills, and teacher opinions and attitudes. The scale of measurement was of mixed type, such as nominal, ordinal, interval and categorical. The answers given by teachers were numeric.

B. Preprocessing

Before using a dataset, it is essential to improve the data quality [20]. A number of techniques are used for data preprocessing [21], such as aggregation, sampling, dimension reduction, variable transformation, and dealing with missing values. To make a quality and stabilized dataset, the Weka version 3.8.1 tool is applied to the two different files belonging to the ISCED-1 level and ISCED-2 level. The MissingValue filter counted 55641 missing values in the ISCED-2 level dataset and 19415 missing values in the ISCED-1 level dataset. According to [21, 22], there is a requirement to handle the missing data values. The missing values are handled with the ReplaceMissingValues filter, which replaces missing values with the mean and mode values of the whole dataset. The rescaling of the dataset to the range of 0 to 1 is achieved using the Normalize filter, which normalized all data except the target attribute gender. In both datasets, attributes numbered from 1 to 11 and from 36 to 147 are removed using the Self Reduction method as they are index and mean values. Further, the auto classifier in IBM Modeler version 18.1 is used to select significant variables and the best classifiers for the training and testing datasets. Hence, after applying feature reduction, 134 attributes with 2926 instances are selected for the Bayesian network and random tree at ISCED level-1, and 134 attributes with 7542 instances are selected for logistic regression and 41 attributes stabilized for the decision tree at ISCED level-2. Fig. 1 shows the schematic diagram of the research after feature extraction, with the classifiers used to predict the gender of the teacher.

Fig. 1: Stabilizing Features

C. Classifiers

The filtered dataset is trained using the auto classifier algorithm with 10-fold cross-validation. The discard policy of the auto classifier is set to less than 80% accuracy with 0.8 AUC (Area Under Curve). Out of a total of 12 machine learning classifiers, the auto classifier algorithm suggested two best models, C5 and LR, for the ISCED-2 level school dataset and two best models, BN and RT, for the ISCED-1 level school dataset. Therefore, to predict the target variable based on 134 predictors, four optimal supervised machine learning algorithms, C5, RT, BN and LR, are selected by applying the auto classifier algorithm.

D. Performance Evaluation

Experimental results of the presented models are evaluated using the following six major performance measures:
(a) Classification accuracy: the number of correct predictions of teacher gender out of all predictions.
(b) Error: the misclassification of target classes.
(c) AUC: the area under the ROC curve, which is also appropriate to show the accuracy of the models.
(d) ROC: the receiver operating characteristic curve presents the graphical evaluation of models; it shows the true positive rate (sensitivity) on the y-axis and the false positive rate (1-specificity) on the x-axis at various thresholds.
(e) Coincidence matrix: reflects the performance of a classification model on a set of test data for which the true values are known.
(f) Gini: it is calculated by subtracting the sum of the squared probabilities of each class from one, and measures the inequality among values of a frequency distribution.

To evaluate the results, the IBM Modeler Analysis algorithm (node) is applied to find accuracy, Gini, AUC and the coincidence matrix, and the evaluation algorithm is also applied to produce ROC curves.

E. Optimal Frequency Distribution

The auto classifier algorithm is a powerful technique which selects the optimal classifier for precision based on the dataset. The authors have trained all datasets with cross-validation by testing 12 different supervised machine learning classifiers using the auto classifier. The threshold limit applied to overall accuracy is 80%.

It can also be seen that the auto classifier algorithm selected two best classifiers, the C5 tree and logistic regression, out of 12 machine learning classifiers to predict the gender of the teacher at the ISCED-2 level. Fig. 2 (a) shows the best predictive distribution of gender based on the Random tree and Bayesian network, which show 96.27% and 89.71% accuracy over 134 attributes, respectively. Fig. 2 (b) shows the optimal distribution of the gender of the teacher at the ISCED-2 level based on the C5 tree, having 82.76% accuracy over 41 attributes after feature reduction, and logistic regression, having 81.65% accuracy with 134 attributes. On the one hand, the distribution models infer that at the ISCED-1 level the maximum prediction is achieved by the Random tree as compared to the Bayesian network; on the other hand, the C5 tree outperformed logistic regression in gender prediction at the ISCED-2 level.

Fig. 2: (a) Best predictive distribution based on Random tree and Bayesian network of ISCED Level-1 using Auto Classifier (b) Best predictive distribution based on C5 tree and logistic regression of ISCED Level-2 using Auto Classifier

F. Experimental Results, Analysis, and Evaluation for ISCED Level-1 Schools

Based on the optimal distribution, the 2926 instances of the ISCED-1 level dataset are tested through the Bayesian network, including one target and 134 predictors. Fig. 3 shows the significant role of each predictor in predicting the gender of the teacher. It can be seen from the network that the dark blue circles have the highest importance values, such as 1.0, 0.8 and 0.6. All 134 predictors have more than 0.4 percent contribution in the Bayesian network.

Fig. 3: Significance of predictors towards teachers' gender

It is found that the Bayesian network performed well in the classification of gender based on the dataset. The graph board node is a very powerful algorithm which takes an executable predictive model and produces classification tables. Fig. 4 shows the classification table of teacher gender based on the Bayesian network for ISCED level-1. It is clear from the classification table that the Bayesian network accurately classified 2302 female teachers, while 181 female teachers went to the male category. In the case of male prediction, a smaller count of teachers, 120 according to Table I, falls into the female category.

Fig. 4: Gender classification by Bayesian Network of ISCED-1 level teachers

Fig. 5 shows the classification table of the random tree for ISCED level-1 teachers. It is also visible that the random tree classified female teachers much more accurately (2402) and only a tiny number of males (28) are misclassified. This classifier also performed well for male gender prediction. Further, to validate the presented models and classification tables, ROC curves are built using the evaluation node of IBM Modeler. The ROC curve plots the true positive (TP) rate, which is the sensitivity of the model, on the Y-axis and the false positive (FP) rate, which equals 1-specificity, on the X-axis. The sensitivity of the model signifies accurate prediction of females over actual females, and specificity signifies accurate prediction of males over actual males.

It can be seen from Fig. 6 that the Bayesian network shows an increasing TP rate for females, which starts from 0.36 and reaches a maximum of 0.99 with varying thresholds. A TP rate (sensitivity) of 0.90 with an FP rate of 0.19 is found at threshold 0.2, and a TP rate of 0.99 with an FP rate of 0.7 at threshold 0.8, which reveals the significance of the predictive model.

Further, Fig. 7 shows the random tree model validation using ROC, which reflects a significant TP rate that starts from 0.54 and ends at 0.99 with updating thresholds. It can also be seen that at threshold 0.1 the sensitivity is high (0.97) and the FP rate is 0.07, which indicates the significance of the model too. As the threshold reaches 0.4, the sensitivity is 0.98 and the FP rate is 0.18. Therefore, the random tree outperformed the Bayesian network in predicting male teachers at the ISCED-1 level.

G. Experimental Results, Analysis, and Evaluation for ISCED Level-2 Schools

To predict gender at the ISCED-2 level, the auto classifier performed feature extraction and selected the decision tree (C5) with 41 attributes and 7542 instances and logistic regression (LR) with 134 attributes, with 10-fold cross-validation. Fig. 8 shows the classification table of teacher gender based on logistic regression for ISCED level-2 teachers. It can be seen from the classification table that the LR classifier correctly classified 5649 female teachers and misclassified 270 female teachers as male, while 1114 male teachers wrongly went to the female category. In total, 509 males are correctly classified.

The accuracy of the C5 tree classifier is improved by applying feature extraction using the auto classifier. At the ISCED-2 level, the C5 classifier is tested with 10-fold cross-validation with 41 attributes and 7542 instances. Fig. 9 shows the classification table of teacher gender based on the C5 tree for ISCED level-2 teachers. According to Table I, C5 correctly predicted 5752 female and 490 male teachers, compared with 5649 female and 509 male teachers for LR.

Data from Fig. 10 (a) show that the sensitivity of the LR model is directly proportional to the threshold. The TP rate varies from 0.19 to 0.98. The LR produced a significant TP rate (sensitivity) of 0.97 at threshold 0.85, which proved the significance of the predictive model. From Fig. 10 (b), the C5 model built using 41 attributes and validated using ROC reflects a significant TP rate of 0.84 at cutoff 0.8. Therefore, the C5 tree with feature extraction outperformed the LR in gender prediction at the ISCED-2 level.
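The TP rate and FP rate readings quoted from the ROC curves can be illustrated with a minimal sketch. It assumes that a classifier outputs one score per teacher and that a teacher is predicted female when the score is at least the cutoff. This convention, the function names `roc_points` and `auc_trapezoid`, and the toy labels and scores are hypothetical; they are not taken from the ESSIE data or from the evaluation node of IBM Modeler, whose threshold axis may be defined differently.

```python
from typing import List, Sequence, Tuple


def roc_points(
    y_true: Sequence[int], scores: Sequence[float], thresholds: Sequence[float]
) -> List[Tuple[float, float, float]]:
    """Return (cutoff, TP rate, FP rate) for each cutoff.

    y_true uses 1 for the positive class (female) and 0 for male.
    Assumption: an instance is predicted positive when score >= cutoff.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = []
    for cutoff in thresholds:
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= cutoff)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= cutoff)
        # Sensitivity = TP / actual positives; 1 - specificity = FP / actual negatives.
        points.append((cutoff, tp / positives, fp / negatives))
    return points


def auc_trapezoid(points: Sequence[Tuple[float, float, float]]) -> float:
    """Approximate the area under the ROC curve with the trapezoid rule."""
    # Add the trivial end points (0, 0) and (1, 1) and order by FP rate.
    curve = sorted({(fpr, tpr) for _, tpr, fpr in points} | {(0.0, 0.0), (1.0, 1.0)})
    area = 0.0
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical toy data: 5 female (1) and 3 male (0) teachers with made-up scores.
y_true = [1, 1, 1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]
points = roc_points(y_true, scores, thresholds=[0.2, 0.5, 0.8])
for cutoff, tpr, fpr in points:
    print(f"cutoff={cutoff:.1f}  TP rate={tpr:.2f}  FP rate={fpr:.2f}")
print(f"approximate AUC={auc_trapezoid(points):.3f}")
```

With only three cutoffs the area is a coarse approximation; a finer grid of cutoffs traces the curve more closely.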



Fig. 5: Gender classification by Random Tree of ISCED-1 level teachers

Fig. 6: ROC of the Bayesian network at ISCED level-1

Fig. 7: ROC of the Random tree at ISCED level-1

Fig. 8: Gender classification by LR of ISCED level-2 teachers

III. COINCIDENCE MATRICES AND EVALUATION MATRICES

To produce significant coincidence matrices and evaluation matrices, the analysis algorithm of IBM Modeler also played a vital role in evaluating results. The results of the experiment using 10-fold cross-validation are presented in Table I, which is a combination of the four coincidence matrices generated by the respective models. At ISCED level-2 schools, the maximum number of correct predictions for female teachers (5752) is provided by the C5 tree with 41 attributes, with only 167 female teachers misclassified, but it fails to correctly classify most male teachers. Subsequently, for LR with 134 attributes the prediction ratio (correctly predicted/actual) for female teachers is 5649/5919 and for male teachers is 509/1623. After feature extraction, the C5 tree outperformed the LR in the prediction of gender. At ISCED level-1 schools, according to Table I, RT with 134 attributes correctly predicted 2402 females out of a total of 2483 and correctly predicted 415 males out of 443, whereas the BN classifier with 134 attributes correctly classified 2302 females and 323 males. Therefore, RT performed excellently in the prediction of the gender of teachers at ISCED level-1 schools.

Data from Table II show that the maximum accuracy, 96.27%, is achieved by the random tree classifier with 134 attributes in predicting the gender of teachers at ISCED level-1 schools, and the C5 tree with 41 attributes gained the highest accuracy, 82.76%, for gender prediction at ISCED level-2 schools. The maximum area under the curve (AUC), 0.988, is obtained by RT with 134 attributes, which supports the relevance of the overall accuracy of the model for prediction, and it is above the benchmark of the ROC curve. Further, LR has a higher AUC (0.804) than C5 (0.697); however, this does not mean that LR predicts better than C5. After feature reduction, C5 provided higher accuracy than LR at ISCED level-2 schools. RT has the highest Gini value, 0.976, indicating inequality between male and female that supports accurate prediction of gender, and a low misclassification error of 3.73%.

Fig. 9: Gender classification by C5 tree of ISCED level-2 teachers

Fig. 10: (a) ROC of Logistic regression at ISCED Level-2 (b) ROC of C5 tree at ISCED Level-2 with 10-fold

IV. CONCLUSION

The presented paper has stabilized the attributes of the European Commission dataset to predict the gender of teachers at both levels of education, ISCED level-1 and ISCED level-2, using supervised machine learning classifiers. The experimental study is carried out with the help of the popular data mining tools Weka and IBM Modeler on a big dataset available online [1]. The auto classifier of Modeler suggested the 4 best classifiers to apply to both types of dataset. During the first phase of the experimental study, RT and BN are trained and tested on the ISCED level-1 dataset with k-fold cross-validation to predict the gender of the teacher. During the second phase of the study, C5 and LR are applied to the ISCED level-2 dataset. The maximum accuracy is achieved with 134 attributes by RT (96.27%) as compared to BN (89.71%) in predicting teacher gender at ISCED level-1. The C5 tree classifier obtained the highest accuracy (82.76%) with 41 attributes, after feature extraction by the auto classifier algorithm, in predicting the gender of ISCED level-2 school teachers. Binary logistic regression also attained good accuracy (81.65%) with 134 attributes, presenting a significant predictive model for ISCED level-2 schools.

ACKNOWLEDGMENT

The authors would like to thank the European Commission for providing the dataset online. The present study is funded by Eötvös Loránd University and sponsored by Tempus Public Foundation, Budapest, Hungary.



Model                Bayesian network   Random tree      C5 tree          Logistic regression
                     ISCED-1 level      ISCED-1 level    ISCED-2 level    ISCED-2 level
Predicted gender     F       M          F       M        F       M        F       M
Actual F             2302    181        2402    81       5752    167      5649    270
Actual M             120     323        28      415      1133    490      1114    509
TABLE I: Coincidence matrices at 10-fold cross-validation
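As a worked check, the accuracy, sensitivity and specificity of each model can be recomputed from the counts in Table I. The sketch below follows the paper's definitions (female as the positive class, so sensitivity is correct females over actual females and specificity is correct males over actual males). The class name `Coincidence` and its field names are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class Coincidence:
    """2x2 coincidence matrix with rows = actual gender, columns = predicted gender."""

    ff: int  # actual female, predicted female
    fm: int  # actual female, predicted male
    mf: int  # actual male, predicted female
    mm: int  # actual male, predicted male

    def accuracy(self) -> float:
        return (self.ff + self.mm) / (self.ff + self.fm + self.mf + self.mm)

    def sensitivity(self) -> float:
        # Correctly predicted females over actual females.
        return self.ff / (self.ff + self.fm)

    def specificity(self) -> float:
        # Correctly predicted males over actual males.
        return self.mm / (self.mf + self.mm)


# Counts copied from Table I.
TABLE_I = {
    "BN (ISCED-1)": Coincidence(2302, 181, 120, 323),
    "RT (ISCED-1)": Coincidence(2402, 81, 28, 415),
    "C5 (ISCED-2)": Coincidence(5752, 167, 1133, 490),
    "LR (ISCED-2)": Coincidence(5649, 270, 1114, 509),
}

for name, m in TABLE_I.items():
    print(
        f"{name}: accuracy={m.accuracy():.2%}  "
        f"sensitivity={m.sensitivity():.3f}  specificity={m.specificity():.3f}"
    )
```

Under these definitions the recomputed accuracies agree with the values reported for the four models (89.71%, 96.27%, 82.76% and 81.65%), while the specificity figures make visible how many male teachers the two ISCED-2 models misclassify.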

Model       BN-134           RT-134           C5 tree-41       LR-134
            ISCED level-1    ISCED level-1    ISCED level-2    ISCED level-2
AUC         0.937            0.988            0.697            0.804
Gini        0.874            0.976            0.393            0.607
Accuracy    89.71%           96.27%           82.76%           81.65%
Error       10.29%           3.73%            17.24%           18.35%
TABLE II: Evaluation matrices
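The rows of Table II can be cross-checked against each other with a short sketch. It relies on two observations about the reported numbers rather than on any documented IBM Modeler formula: the error is 100% minus the accuracy, and the Gini values appear to agree, to within rounding, with the common relation Gini = 2·AUC − 1. For comparison, the sketch also evaluates the "one minus the sum of squared class probabilities" definition on the ISCED-1 class counts from Table I. The function names `gini_from_auc` and `gini_impurity` are illustrative.

```python
from typing import Dict, Sequence

# Values copied from Table II (accuracy and error in percent).
TABLE_II: Dict[str, Dict[str, float]] = {
    "BN-134": {"auc": 0.937, "gini": 0.874, "accuracy": 89.71, "error": 10.29},
    "RT-134": {"auc": 0.988, "gini": 0.976, "accuracy": 96.27, "error": 3.73},
    "C5 tree-41": {"auc": 0.697, "gini": 0.393, "accuracy": 82.76, "error": 17.24},
    "LR-134": {"auc": 0.804, "gini": 0.607, "accuracy": 81.65, "error": 18.35},
}


def gini_from_auc(auc: float) -> float:
    """Gini coefficient derived from the area under the ROC curve (assumed relation)."""
    return 2.0 * auc - 1.0


def gini_impurity(class_counts: Sequence[int]) -> float:
    """One minus the sum of squared class probabilities."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


for name, row in TABLE_II.items():
    print(
        f"{name}: 2*AUC-1={gini_from_auc(row['auc']):.3f} (reported Gini {row['gini']:.3f}), "
        f"100-accuracy={100 - row['accuracy']:.2f} (reported error {row['error']:.2f})"
    )

# Impurity of the ISCED-1 class distribution: 2483 actual females, 443 actual males (Table I).
isced1_impurity = gini_impurity([2483, 443])
print(f"Gini impurity of ISCED-1 class counts: {isced1_impurity:.3f}")
```

The impurity of the class distribution comes out far below the Gini values in Table II, which suggests the reported figures are AUC-based; this is an inference from the numbers, not something the paper states.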

REFERENCES

[1] European Commission, https://ec.europa.eu/digital-single-market/news/ict-education-essie-survey-smart-20100039, Accessed: 2018-02-14.
[2] Johannes Fürnkranz, Dragan Gamberger, and Nada Lavrač. Foundations of rule learning. Springer Science & Business Media, 2012.
[3] G MeeraGandhi. Machine learning approach for attack prediction and classification using supervised learning algorithms. Int. J. Comput. Sci. Commun, 1(2), 2010.
[4] IBM Knowledge center, https://www.ibm.com/support/knowledgecenter/en/ss3ra7 15.0.0/com.ibm.spss.modeler.help/nodes treebuilding.htm, Accessed: 2018-02-14.
[5] IBM Modeler Users Guide, ftp://public.dhe.ibm.com/software/analytics/spss/documentation/modeler/18.1/en/modelerusersguide.pdf, Accessed: 2018-08-18.
[6] Kemal Polat and Salih Güneş. A novel hybrid intelligent method based on C4.5 decision tree classifier and one-against-all approach for multi-class classification problems. Expert Systems with Applications, 36(2):1587–1592, 2009.
[7] Javed Ashraf and Seemab Latif. Handling intrusion and ddos attacks in software defined networks using machine learning techniques. In Software Engineering Conference (NSEC), 2014 National, pages 55–60. IEEE, 2014.
[8] V Muralidharan and V Sugumaran. A comparative study of naïve bayes classifier and bayes net classifier for fault diagnosis of monoblock centrifugal pump using wavelet analysis. Applied Soft Computing, 12(8):2023–2029, 2012.
[9] Mark A Friedl and Carla E Brodley. Decision tree classification of land cover from remotely sensed data. Remote sensing of environment, 61(3):399–409, 1997.
[10] Jie Cheng and Russell Greiner. Comparing bayesian network classifiers. In Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence, pages 101–108. Morgan Kaufmann Publishers Inc., 1999.
[11] Finn V Jensen. An introduction to Bayesian networks, volume 210. UCL press London, 1996.
[12] Gerard JA Baars, Theo Stijnen, and Ted AW Splinter. A model to predict student failure in the first year of the undergraduate medical curriculum. Health Professions Education, 3(1):5–14, 2017.
[13] IBM, https://www.ibm.com/support/knowledgecenter/en/ss3ra7 17.1.0/modeler mainhelp client ddita/clementine/rf general.html, Accessed: 2018-05-06.
[14] IBM Watson, https://dataplatform.ibm.com/docs/content/spss-visualization/spss-viz-auto-classifier-overview.html?audience=dr&context=refinery, Accessed: 2018-05-06.
[15] Tulips Angel Thankachan and Kumudha Raimond. A survey on classification and rule extraction techniques for datamining. IOSR Journal of Computer Engineering, 8(5):75–78, 2013.
[16] Ghada M Tolan and Omar S Soliman. An experimental study of classification algorithms for terrorism prediction. International Journal of Knowledge Engineering-IACSIT, 1(2):107–112, 2015.
[17] Chaman Verma, Veronika Stoffová, Zoltán Illés, and Sanjay Dahiya. Binary logistic regression classifying the gender of student towards computer learning in european schools. In The 11th Conference of PhD Students in Computer Science, page 45, 2018.
[18] Chaman Verma, Ahmed S. Tarawneh, Veronika Stoffová, and Zoltán Illés. Forecasting residence state of indian student based on responses towards information and communication technology awareness: A primarily outcomes using machine learning. In International Conference on Innovations in Engineering, Technology and Sciences. IEEE, In Press, 2018.
[19] Chaman Verma, Zoltán Illés, and Veronika Stoffová. An ensemble approach to identifying the student gender towards information and communication technology awareness in european schools using machine learning. International Journal of Engineering and Technology, 2018, in Press.
[20] What Is Data Mining. Data mining: Concepts and techniques. Morgan Kaufmann, 2006.
[21] Hongbo Du. Data mining techniques and applications: An introduction. Cengage Learning EMEA, 2010.
[22] MDRV Gimpy. Missing value imputation in multi attribute data set. International Journal of Computer Science and Information Technologies, 5(4):1–7, 2014.

