1 Introduction
There are many important real-life classification problems in which a data point
can be a member of more than one class simultaneously [9]. For example, a gene
sequence can be a member of multiple functional classes, or a piece of music can
be tagged with multiple genres. These types of problems are known as multi-label
classification problems [23]. In multi-label problems there is typically a finite set of potential labels that can be applied to data points. The labels that are applicable to a specific data point are known as its relevant labels, while those that are not applicable are known as irrelevant labels.
Early, naïve approaches to the multi-label problem (e.g. [1]) consider each label independently, using a one-versus-all binary classification approach to predict the relevance of each individual label to a data point. The outputs of a set of these individual classifiers are then aggregated into a set of relevant labels. Although these approaches can work well [11], their performance tends to degrade significantly as the number of potential labels increases. Predicting a group of relevant labels effectively involves finding a point in a multi-dimensional label space, and this becomes more challenging as the number of labels grows and the label space becomes increasingly sparse. An added challenge is that multi-label problems can suffer from a very high degree of label imbalance. To address these challenges, more sophisticated multi-label classification algorithms [9] attempt to exploit the associations between labels, and use ensemble approaches to break the problem into a series of less complex problems (e.g. [1, 20, 17, 14]).
We describe an experiment to benchmark the performance of eleven of the
most widely-cited approaches to multi-label classification on a set of eleven multi-
label classification datasets. While there are existing benchmarks of this type (e.g. [14, 15]), they do not sufficiently tune the hyper-parameters of each algorithm, and so do not compare approaches fairly. In this experiment extensive hyper-parameter tuning is performed. The paper also presents the results of an
initial experiment to investigate how the performance of different multi-label
classification algorithms changes as the characteristics of datasets (e.g. the size
of the set of potential labels) change.
The remainder of the paper is structured as follows. Section 2 provides a brief
survey of existing multi-label classification algorithms and previous benchmark
studies. Section 3 describes the benchmark experiment, along with an analysis of
the results of this experiment. Section 4 describes the experiment performed to
explore the performance of multi-label classification algorithms as the character-
istics of the dataset change. Section 5 draws conclusions from the experimental
results and outlines a path for future work.
2 Related Work
A number of papers that describe new multi-label classification approaches [3, 14, 15] benchmark different multi-label classification algorithms against their newly proposed methods. One limitation of these studies, however, is a lack of hyper-parameter tuning and a reliance on default hyper-parameter settings.
Rather than proposing a new algorithm, Madjarov et al. [13] describe a benchmark study of several multi-label classification algorithms across several datasets. Hyper-parameter tuning is performed in this study; there is, however, a mismatch between the Hamming loss measure used to select hyper-parameters and the measures used to evaluate performance in the benchmark. The study identifies HOMER, binary relevance, and classifier chains as promising approaches.
To perform a fair comparison of algorithms, the benchmark experiment de-
scribed in this paper uses extensive parameter tuning. For consistency, the mea-
sure used to guide this parameter tuning—label based macro averaged F-Score
(see Section 3.2)—is the same as the measure used to compare algorithms in
the benchmark. The set of algorithms used overlaps with, but differs from, that used by Madjarov et al. [13].
3 Benchmark Experiment
3.1 Datasets
Table 1 describes the eleven datasets used in this experiment. The datasets chosen are widely used in the multi-label literature and have a diverse set of properties, listed in Table 1. Instances, Inputs and Labels indicate the total number of data points, the number of predictor variables, and the number of potential labels, respectively. Total Labelsets indicates the number of unique combinations of relevant labels in the dataset, where each such unique label combination is a labelset. Single Labelsets indicates the number of data points having a unique combination of relevant labels. Cardinality indicates the average number of labels assigned per data point. Density is a normalised, dimensionless version of cardinality, computed by dividing the cardinality by the number of labels. MeanIR [2] indicates the average degree of label imbalance in the multi-label dataset; a higher value indicates more imbalance.
Table 1: Datasets
Dataset   Instances  Inputs  Labels  Total Labelsets  Single Labelsets  Cardinality  Density  MeanIR
yeast 2417 103 14 198 77 4.237 0.303 7.197
scene 2407 294 6 15 3 1.074 0.179 1.254
emotions 593 72 6 27 4 1.869 0.311 1.478
medical 978 1449 45 94 33 1.245 0.028 89.501
enron 1702 1001 53 753 573 3.378 0.064 73.953
birds 322 260 20 89 55 1.503 0.075 13.004
genbase 662 1186 27 32 10 1.252 0.046 37.315
cal500 502 68 174 502 502 26.044 0.150 20.578
llog 1460 1004 75 304 189 1.180 0.016 39.267
slashdot 3782 1079 22 156 56 1.181 0.054 17.693
corel5k 5000 499 374 3175 2523 3.522 0.009 189.568
These label parameters together describe the properties of the datasets that may influence the performance of the algorithms. Collectively, these properties are referred to as label complexity in the remainder of this text.
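For reference, a standard formulation of these measures (following Charte et al. [2] for MeanIR), where n is the number of data points, q the number of potential labels, and Y_i the set of relevant labels of data point i:

\[ \mathrm{Cardinality} = \frac{1}{n}\sum_{i=1}^{n} |Y_i|, \qquad \mathrm{Density} = \frac{\mathrm{Cardinality}}{q} \]

\[ \mathrm{MeanIR} = \frac{1}{q}\sum_{\lambda=1}^{q} \mathrm{IRLbl}(\lambda), \qquad \mathrm{IRLbl}(\lambda) = \frac{\max_{\lambda'} \left|\{i : \lambda' \in Y_i\}\right|}{\left|\{i : \lambda \in Y_i\}\right|} \]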
All datasets were acquired from the MULAN repository [18]. In the birds dataset, several data points have no assigned labels. To avoid problems when computing performance scores, an extra other label was added to this dataset, assigned to each data point that has no other labels.
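As an illustration, a minimal sketch of one way this fallback label could be added, using the Weka Add filter, is shown below (the class name and file paths are assumptions, and the exact preprocessing used here may differ):

import mulan.data.MultiLabelInstances;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Add;

public class AddOtherLabel {
    public static void main(String[] args) throws Exception {
        // Load the birds dataset (ARFF file plus MULAN label-definition XML).
        MultiLabelInstances birds = new MultiLabelInstances("birds.arff", "birds.xml");
        Instances data = birds.getDataSet();

        // Append a binary attribute named "other" at the end of the dataset.
        Add add = new Add();
        add.setAttributeName("other");
        add.setNominalLabels("0,1");
        add.setInputFormat(data);
        Instances withOther = Filter.useFilter(data, add);
        int otherIdx = withOther.numAttributes() - 1;

        // Set "other" to 1 for every data point that has no relevant labels.
        for (int i = 0; i < withOther.numInstances(); i++) {
            boolean hasLabel = false;
            for (int j : birds.getLabelIndices()) {
                if (withOther.instance(i).stringValue(j).equals("1")) {
                    hasLabel = true;
                    break;
                }
            }
            withOther.instance(i).setValue(otherIdx, hasLabel ? "0" : "1");
        }
        // The label-definition XML must also be extended to list "other"
        // before the data is reloaded as a MultiLabelInstances object.
    }
}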
3.2 Evaluation Measure
In this study the label based macro averaged F-measure [23] is used for both hyper-parameter selection and performance comparison; higher values indicate better performance. This measure was selected because it captures the performance of algorithms on minority labels, and balances precision and recall for each label [10].
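Concretely, with q labels and per-label true positive, false positive and false negative counts TP_j, FP_j and FN_j, the measure is the unweighted mean of the per-label F1 scores:

\[ P_j = \frac{TP_j}{TP_j + FP_j}, \qquad R_j = \frac{TP_j}{TP_j + FN_j}, \qquad F_{\mathrm{macro}} = \frac{1}{q}\sum_{j=1}^{q} \frac{2\, P_j R_j}{P_j + R_j} \]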
3.3 Algorithms and Hyper-parameter Tuning
The algorithms used in this experiment are: binary relevance (BR) [1], classifier chains (CC) [14], label powerset (LP) [1], RAkEL-d [20], HOMER [17], CLR [8], BRkNN [16], MLkNN [25], DMLkNN [22], IBLR-ML [4] and BPMLL [24]. All algorithm implementations come from the Java library MULAN [19]. For each algorithm-dataset pair a grid search over hyper-parameter combinations was performed: for each parameter combination selected from the grid, a 2 × 5-fold cross-validation run was performed and the F-measure was recorded. When the grid search was complete, the parameter combination with the highest F-measure was selected. These selected scores are shown in Table 2 and used to compare the classifiers.
For each problem transformation method (CC, BR, LP and CLR) a support vector machine with a radial basis kernel (SVM-RBK) was used as the base classifier. The SVM models were tuned over the 12 combinations of the regularisation parameter (from the set {1, 10, 100}) and the kernel spread parameter (from the set {0.01, 0.05, 0.001, 0.005}).
Table 2: Best mean Label Based Macro Averaged F-Measure (DNF: did not finish)
Dataset CC RAkEL-d BPMLL LP HOMER BR CLR IBLR-ML MLkNN BRkNN DMLkNN
yeast 0.451 0.437 0.436 0.451 0.448 0.387 0.399 0.394 0.377 0.392 0.380
scene 0.804 0.802 0.778 0.802 0.800 0.799 0.793 0.749 0.742 0.695 0.750
emotions 0.624 0.628 0.690 0.596 0.621 0.604 0.616 0.658 0.629 0.633 0.634
medical 0.692 0.697 0.558 0.659 0.611 0.676 0.520 0.434 0.540 0.474 0.505
enron 0.289 0.288 0.281 0.278 0.281 0.284 0.286 0.153 0.177 0.169 0.163
birds 0.158 0.181 0.343 0.181 0.155 0.157 0.156 0.255 0.226 0.273 0.216
genbase 0.944 0.943 0.815 0.941 0.939 0.941 0.931 0.910 0.850 0.837 0.821
cal500 0.185 0.179 0.237 0.178 0.199 0.181 0.169 0.178 0.101 0.124 0.107
llog 0.292 0.300 0.295 0.297 0.256 0.296 0.281 0.110 0.263 0.255 0.248
slashdot 0.469 0.472 0.209 0.474 0.477 0.466 0.151 0.214 0.194 0.164 0.200
corel5k 0.222 0.217 0.219 0.210 0.197 0.213 DNF 0.084 0.190 0.186 0.181
Average Rank 3.364 3.455 4.818 4.909 5.455 5.546 7.300 7.909 8.091 8.364 8.546
For RAkEL-d the subset size was varied between 3 and 6, and for HOMER the cluster size was varied between 3 and 6. For both RAkEL-d and HOMER the base classifiers were label powerset models, using SVM-RBK models tuned as outlined above. The BRkNN, MLkNN, DMLkNN and IBLR-ML models were tuned over 4 to 26 nearest neighbours, with a step size of 2. For BPMLL the tuning was performed in two steps to make it computationally feasible. First, a grid of 120 different parameter combinations of the regularisation weight, learning rate, number of iterations and number of hidden units was created, and the best combination was found using only the yeast dataset. Next, using this best combination of hyper-parameters, the other algorithm-dataset pairs were tuned over hidden layers containing units equal to 20%, 40%, 60%, 80% and 100% of the number of inputs for each dataset, as recommended by Zhang and Zhou [24].
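To make this protocol concrete, the following is a minimal sketch of the tuning loop for one problem transformation method, using the MULAN and Weka APIs (the class name, file paths and the measure-name string passed to getMean are assumptions rather than the exact experimental code):

import mulan.classifier.transformation.BinaryRelevance;
import mulan.data.MultiLabelInstances;
import mulan.evaluation.Evaluator;
import mulan.evaluation.MultipleEvaluation;
import weka.classifiers.functions.SMO;
import weka.classifiers.functions.supportVector.RBFKernel;

public class GridSearchSketch {
    public static void main(String[] args) throws Exception {
        // Load a MULAN dataset (ARFF file plus label-definition XML).
        MultiLabelInstances data = new MultiLabelInstances("yeast.arff", "yeast.xml");

        double[] cGrid = {1, 10, 100};                    // regularisation parameter C
        double[] gammaGrid = {0.01, 0.05, 0.001, 0.005};  // RBF kernel spread
        double bestScore = Double.NEGATIVE_INFINITY;
        double bestC = Double.NaN, bestGamma = Double.NaN;

        for (double c : cGrid) {
            for (double gamma : gammaGrid) {
                // SVM with an RBF kernel as the base classifier.
                SMO svm = new SMO();
                RBFKernel kernel = new RBFKernel();
                kernel.setGamma(gamma);
                svm.setKernel(kernel);
                svm.setC(c);

                // Binary relevance shown here; the same loop applies to CC, LP and CLR.
                BinaryRelevance learner = new BinaryRelevance(svm);

                // One 5-fold cross-validation run; the experiment averages two such runs.
                MultipleEvaluation results =
                        new Evaluator().crossValidate(learner, data, 5);

                // Measure name is an assumption; check MacroFMeasure.getName()
                // in the installed MULAN version.
                double score = results.getMean("Macro-averaged F-Measure");
                if (score > bestScore) {
                    bestScore = score;
                    bestC = c;
                    bestGamma = gamma;
                }
            }
        }
        System.out.printf("Best C=%.0f gamma=%.3f macro F=%.3f%n",
                bestC, bestGamma, bestScore);
    }
}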
4 Label Analysis
A preliminary experiment was also performed to understand how multi-label
classification approaches perform when the number of labels is increased, while
the input space is kept the same. Section 4.1 describes the experimental setup
and Section 4.2 discusses the results.
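One plausible way to construct such datasets, in which the number of labels grows over a fixed input space, is sketched below using MULAN and Weka; the helper keepFirstKLabels, the output file stem, and the choice of keeping the first k labels are assumptions, and the generation procedure used in the experiment may differ:

import java.io.File;
import java.io.PrintWriter;
import mulan.data.MultiLabelInstances;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class LabelSubsetSketch {

    // Builds a reduced dataset that keeps only the first k labels,
    // leaving the input space untouched.
    static MultiLabelInstances keepFirstKLabels(MultiLabelInstances full, int k,
                                                String stem) throws Exception {
        int[] labelIdx = full.getLabelIndices();

        // Remove the label attributes beyond the first k.
        int[] toDrop = new int[labelIdx.length - k];
        for (int j = k; j < labelIdx.length; j++) {
            toDrop[j - k] = labelIdx[j];
        }
        Remove remove = new Remove();
        remove.setAttributeIndicesArray(toDrop);
        remove.setInputFormat(full.getDataSet());
        Instances reduced = Filter.useFilter(full.getDataSet(), remove);

        // Save the reduced ARFF file.
        ArffSaver saver = new ArffSaver();
        saver.setInstances(reduced);
        saver.setFile(new File(stem + ".arff"));
        saver.writeBatch();

        // Write a matching MULAN label-definition XML for the first k labels.
        try (PrintWriter out = new PrintWriter(stem + ".xml")) {
            out.println("<labels xmlns=\"http://mulan.sourceforge.net/labels\">");
            for (int j = 0; j < k; j++) {
                String name = full.getDataSet().attribute(labelIdx[j]).name();
                out.println("  <label name=\"" + name + "\"></label>");
            }
            out.println("</labels>");
        }
        return new MultiLabelInstances(stem + ".arff", stem + ".xml");
    }

    public static void main(String[] args) throws Exception {
        MultiLabelInstances yeast = new MultiLabelInstances("yeast.arff", "yeast.xml");
        // Datasets with 2, 3, ..., 14 labels over the same input space.
        for (int k = 2; k <= yeast.getNumLabels(); k++) {
            MultiLabelInstances subset = keepFirstKLabels(yeast, k, "yeast-" + k);
            System.out.println(k + " labels: " + subset.getNumInstances() + " instances");
        }
    }
}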
Fig. 2: (a) Macro averaged F-Measure performance changes, yeast. (b) Macro averaged F-Measure performance changes, emotions. (c) Relative rank changes, yeast. (d) Relative rank changes, emotions. (Line plots of macro averaged F-Measure and relative algorithm rankings against the number of labels: 2 to 14 for yeast, 2 to 6 for emotions.)
In Figure 2c, for the yeast dataset, the instance-based approaches remained in the bottom positions, though IBLR-ML achieved a better rank than BRkNN most of the time. In Figure 2d, for the emotions dataset, the rankings of BPMLL and CC continued to rise, CLR and BR drifted down, and IBLR-ML and BRkNN were relatively flat, with IBLR-ML achieving the better ranking most of the time.
This preliminary study indicates that LP, CC and BPMLL performed comparatively better than the other approaches, while BR showed a consistent decrease in rank. To establish a definite relationship, a more detailed study should be performed.
Figure 3 shows how the label complexity parameters for the yeast and emotions datasets change as the number of labels is varied in the synthetically generated datasets.
Fig. 3: Change of label complexity parameters (Cardinality, Density and MeanIR) as the number of labels changes, for the yeast (2 to 14 labels) and emotions (2 to 6 labels) datasets.
Although there appears to be some relationship between the change in Density in Figure 3 and the changes in performance in Figures 2a and 2b, such a conclusion from this experiment may be misleading, and hence requires further study.
References
1. Boutell, M.R., Luo, J., Shen, X., Brown, C.M.: Learning multi-label scene classification. Pattern Recognition 37(9), 1757–1771 (2004)
2. Charte, F., Rivera, A.J., del Jesus, M.J., Herrera, F.: Addressing imbalance in multilabel classification: Measures and random resampling algorithms. Neurocomputing 163, 3–16 (2015)
3. Chen, W.J., Shao, Y.H., Li, C.N., Deng, N.Y.: MLTSVM: A novel twin support vector machine to multi-label learning. Pattern Recognition 52, 61–74 (2016)
4. Cheng, W., Hüllermeier, E.: Combining instance-based learning and logistic regression for multilabel classification. Machine Learning 76(2), 211–225 (2009)
5. Clare, A., King, R.D.: Knowledge discovery in multi-label phenotype data. In: Lecture Notes in Computer Science, pp. 42–53. Springer (2001)
6. Demšar, J.: Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7, 1–30 (Dec 2006)
7. Elisseeff, A., Weston, J.: A kernel method for multi-labelled classification. In: Advances in Neural Information Processing Systems 14, pp. 681–687. MIT Press (2001)
8. Fürnkranz, J., Hüllermeier, E., Loza Mencía, E., Brinker, K.: Multilabel classification via calibrated label ranking. Machine Learning 73(2), 133–153 (2008)
9. Gibaja, E., Ventura, S.: Multi-label learning: a review of the state of the art and ongoing research. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 4(6), 411–444 (2014)
10. Kelleher, J.D., Mac Namee, B., D'Arcy, A.: Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies. The MIT Press (2015)
11. Luaces, O., Díez, J., Barranquero, J., del Coz, J.J., Bahamonde, A.: Binary relevance efficacy for multilabel classification. Progress in Artificial Intelligence 1(4), 303–313 (2012)
12. Madjarov, G., Gjorgjevikj, D., Džeroski, S.: Two stage architecture for multi-label learning. Pattern Recognition 45(3), 1019–1034 (2012)
13. Madjarov, G., Kocev, D., Gjorgjevikj, D., Džeroski, S.: An extensive experimental comparison of methods for multi-label learning. Pattern Recognition 45(9), 3084–3104 (2012). Best Papers of the Iberian Conference on Pattern Recognition and Image Analysis (IbPRIA'2011)
14. Read, J., Pfahringer, B., Holmes, G., Frank, E.: Classifier chains for multi-label classification. Machine Learning 85(3), 333–359 (2011)
15. Shi, C., Kong, X., Fu, D., Yu, P.S., Wu, B.: Multi-label classification based on multi-objective optimization. ACM Transactions on Intelligent Systems and Technology 5(2), 35:1–35:22 (Apr 2014)
16. Spyromitros, E., Tsoumakas, G., Vlahavas, I.: An empirical study of lazy multi-label classification algorithms. In: Proceedings of the 5th Hellenic Conference on Artificial Intelligence: Theories, Models and Applications (SETN '08), pp. 401–406. Springer-Verlag, Berlin, Heidelberg (2008)
17. Tsoumakas, G., Katakis, I., Vlahavas, I.: Effective and Efficient Multilabel Classification in Domains with Large Number of Labels. In: Proc. ECML/PKDD 2008 Workshop on Mining Multidimensional Data (MMD'08) (2008)
18. Tsoumakas, G., Xioufis, E.S., Vilcek, J., Vlahavas, I.: MULAN multi-label dataset repository
19. Tsoumakas, G., Spyromitros-Xioufis, E., Vilcek, J., Vlahavas, I.: MULAN: A Java library for multi-label learning. Journal of Machine Learning Research 12, 2411–2414 (2011)
20. Tsoumakas, G., Vlahavas, I.: Random k-labelsets: An ensemble method for multi-label classification. In: Machine Learning: ECML 2007, pp. 406–417. Springer Berlin Heidelberg (2007)
21. Wolpert, D.H., Macready, W.G.: No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation 1(1), 67–82 (Apr 1997)
22. Younes, Z., Abdallah, F., Denoeux, T.: Multi-label classification algorithm derived from k-nearest neighbor rule with label dependencies. In: 16th European Signal Processing Conference, pp. 1–5 (Aug 2008)
23. Zhang, M.L., Zhou, Z.H.: A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering 26(8), 1819–1837 (2014)
24. Zhang, M.L., Zhou, Z.H.: Multilabel neural networks with applications to functional genomics and text categorization. IEEE Transactions on Knowledge and Data Engineering 18(10), 1338–1351 (Oct 2006)
25. Zhang, M.L., Zhou, Z.H.: ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognition 40(7), 2038–2048 (2007)