
Multi-label Classification of Twitter Data

Using Modified ML-KNN

Saurabh Kumar Srivastava and Sandeep Kumar Singh

Abstract Social media has become a very rich source of information. Labeling
unstructured social media text is a critical task, as features can belong to multiple
labels, and without appropriate labels raw data does not make sense. In this work,
we propose a modified multi-label k-nearest neighbor algorithm (modified ML-KNN)
for generating multiple labels for tweets, which, when configured with a suitable
distance measure and number of nearest neighbors, performs better than conventional
ML-KNN. To validate the proposed approach, we use two different Twitter datasets:
a disease-related tweet set prepared by us using five different disease keywords, and
the benchmark Seattle dataset consisting of incident-related tweets. Modified ML-KNN
improves the performance of conventional ML-KNN by a minimum of 5% on both
datasets.

Keywords Twitter · Multi-label classification · Disease dataset · Seattle dataset

1 Introduction and Related Work

Social media is a place where people post a large amount of text. Social media text
classification systems retrieve such posts and summarize them according to users'
interests and views. Textual data on social media is either unstructured or
semi-structured. With the emergence of Web 3.0, online information in particular is
growing enormously, so automatic tools are required for analyzing such large
collections of textual data. In this regard, the work in [1] proposed an architecture
to track real-time disease-related postings for early prediction of disease outbreaks.
A support vector machine (SVM) was used for classifying postings and achieved up to 88%
accuracy. The increasing volume of data demands automated classification of postings
into one or more concrete categories. Unstructured text is found to have multiple
labels, and due to overlapping terms it is a very challenging task to assign multiple
labels to tweets. Multi-label classification has been reported
by many researchers using Twitter data. Health-related discussions are very common
on social media: people frequently share their experiences related to diseases and
their diagnosis, which can be used to capture health-related insights from social
media. The authors in [2] used semi-structured data for multi-label classification;
both problem transformation and algorithm adaptation methods are used in the reported
literature. Their experiments concluded that Binary Relevance (BR) outperforms both
Label Powerset (LP) and ML-KNN. The author in [3] proposed an annotation tool to
collect and annotate Twitter messages related to diseases; the tool automatically
builds a feature set for relevance filtering. The authors in [4] proposed a
methodology to identify incident-related information on Twitter and observed that
assigning a single label to a text may lose important situational information for
decision-making. In that paper, the problem transformation algorithms BR, LP, and
Classifier Chain (CC) are used with Support Vector Machine (SVM) as a base classifier,
and results are compared using precision and recall values. The above work illustrates
that text data is highly sparse, and current research tries to utilize this real-time
data to build expert systems that use tweets/postings for surveillance over social
media. Twitter can be a source of real-time surveillance using spatial and temporal
text mining. In the context of real-time text mining, we address an initial-level task
whose output can be further utilized by a surveillance module for better insights. In
our work, we introduce a modified ML-KNN for health-related surveillance over social
media.

2 Algorithms Used

In multi-label classification, reported work has been done in two well-known cate-
gories of algorithms.

2.1 Problem Transformation Methods

Problem transformation methods [5] are multi-label learning algorithms that transform
the learning problem into one or more single-label classification problems. The main
problem transformation methods are binary relevance, label powerset, and classifier
chains.
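
As an illustration of the transformation idea, the following is a minimal sketch of
binary relevance (not code from this paper): an assumed base classifier, here
scikit-learn's logistic regression, is cloned and trained independently for each of
the q labels.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

def binary_relevance_fit(X, Y, base=LogisticRegression(max_iter=1000)):
    """Train one independent binary classifier per label column of Y."""
    return [clone(base).fit(X, Y[:, j]) for j in range(Y.shape[1])]

def binary_relevance_predict(models, X):
    """Stack the per-label predictions back into an (N, q) label matrix."""
    return np.column_stack([m.predict(X) for m in models])
```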

2.2 Algorithm Adaptation Methods

Algorithm adaptation methods [5] adapt machine learning algorithms to the task of
multi-label classification. Popular machine learning algorithms such as boosting,
k-nearest neighbors, decision trees, and neural networks have been adapted in the
literature. The adapted methods can directly handle multi-label data. In this research
work, we present a modified multi-label k-nearest neighbor method that tunes the
nearest neighbor family using appropriate similarity measures and numbers of nearest
neighbors.

3 Conventional Versus Modified ML-KNN

3.1 Conventional ML-KNN

ML-KNN is derived from the popular k-nearest neighbor (KNN) algorithm [6]. It works
in two phases. First, the k nearest neighbors of each test instance in the training
set are identified. Then, according to the number of neighboring instances belonging
to each possible class, the maximum a posteriori (MAP) principle is used to determine
the label set for the test instance. The original ML-KNN uses the Euclidean distance
measure with a default of eight nearest neighbors. In our work, the effectiveness of
ML-KNN is evaluated using four similarity measures of the Minkowski family mentioned
in [7], varied together with the number of nearest neighbors.
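
For concreteness, the following is a minimal NumPy/SciPy sketch of ML-KNN (the
experiments in this paper use the MULAN implementation, so this is an illustrative
reconstruction): smoothed label priors and neighbor-count likelihoods are estimated
on the training set, and the MAP rule decides each label. The `metric` parameter is
passed to SciPy's `cdist`, so 'cityblock' (Manhattan), 'euclidean', 'minkowski', and
'chebyshev' are all available.

```python
import numpy as np
from scipy.spatial.distance import cdist

class MLKNN:
    """Illustrative ML-KNN: smoothed priors + neighbor-count likelihoods + MAP."""

    def __init__(self, k=8, s=1.0, metric="euclidean"):
        self.k, self.s, self.metric = k, s, metric  # s = Laplace smoothing

    def fit(self, X, Y):
        self.X = np.asarray(X, dtype=float)
        self.Y = np.asarray(Y, dtype=int)           # (m, q) binary label matrix
        m, q = self.Y.shape
        k, s = self.k, self.s
        # Smoothed prior probability that each label is relevant.
        self.prior = (s + self.Y.sum(axis=0)) / (2 * s + m)
        # k nearest training neighbors of each training instance (not itself).
        D = cdist(self.X, self.X, metric=self.metric)
        np.fill_diagonal(D, np.inf)
        nn = np.argsort(D, axis=1)[:, :k]
        C = self.Y[nn].sum(axis=1)                  # C[i, l]: neighbors of i with label l
        # Likelihoods P(C_l = j | label present) and P(C_l = j | label absent).
        self.like1 = np.zeros((q, k + 1))
        self.like0 = np.zeros((q, k + 1))
        for l in range(q):
            pos, neg = self.Y[:, l] == 1, self.Y[:, l] == 0
            for j in range(k + 1):
                self.like1[l, j] = (s + np.sum(pos & (C[:, l] == j))) / (s * (k + 1) + pos.sum())
                self.like0[l, j] = (s + np.sum(neg & (C[:, l] == j))) / (s * (k + 1) + neg.sum())
        return self

    def predict(self, Xt):
        D = cdist(np.asarray(Xt, dtype=float), self.X, metric=self.metric)
        nn = np.argsort(D, axis=1)[:, :self.k]
        C = self.Y[nn].sum(axis=1)                  # neighbor label counts for test data
        out = np.zeros(C.shape, dtype=int)
        for l in range(self.Y.shape[1]):
            p1 = self.prior[l] * self.like1[l, C[:, l]]        # evidence for the label
            p0 = (1 - self.prior[l]) * self.like0[l, C[:, l]]  # evidence against it
            out[:, l] = (p1 > p0).astype(int)                  # MAP decision per label
        return out
```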

3.2 Modified ML-KNN

In modified ML-KNN, we use four similarity measures, namely Manhattan, Euclidean,
Minkowski, and Chebyshev, each combined with different values of the nearest neighbors
parameter (5, 8, 11, 14) for the evaluation of ML-KNN. The experiments show that the
performance of ML-KNN can be improved by selecting a well-experimented similarity
measure and an appropriate number of nearest neighbors.
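
The configuration search itself is then a small grid over the four distance measures
and the four NN values. The sketch below reuses the `MLKNN` class from the previous
sketch and selects the pair with the best subset accuracy on held-out data; the
function and variable names are our own, not the paper's.

```python
import itertools
import numpy as np

METRICS = ["cityblock", "euclidean", "minkowski", "chebyshev"]  # Manhattan first
NEIGHBORS = [5, 8, 11, 14]

def subset_accuracy(y_true, y_pred):
    """Fraction of examples whose full label set is predicted exactly."""
    return float(np.mean(np.all(np.asarray(y_true) == np.asarray(y_pred), axis=1)))

def select_configuration(X_tr, Y_tr, X_va, Y_va):
    """Grid search over (metric, k); returns the best pair and its score."""
    best_cfg, best_acc = None, -1.0
    for metric, k in itertools.product(METRICS, NEIGHBORS):
        pred = MLKNN(k=k, metric=metric).fit(X_tr, Y_tr).predict(X_va)
        acc = subset_accuracy(Y_va, pred)
        if acc > best_acc:
            best_cfg, best_acc = (metric, k), acc
    return best_cfg, best_acc
```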

4 Architecture Used for Result Evaluation

Real-time filtering of relevant postings with their unique labels is an important task
on social media. Informative postings can be further used for effective surveillance,
and the filtering task can improve system performance, as it yields unique and
noise-free data. Generally, we assume that a posting belongs to only one category, but
in a real-world scenario each tweet is associated with multiple labels. The
architecture in Fig. 1 shows our methodology for result evaluation.

Fig. 1 Framework for empirical analysis
We have considered two different configurations of each dataset. The first is the raw
category (C0), defined by removing links, special symbols, and duplicate tweets from
the corpus. The second is the processed category (C3), in which stop words are
additionally removed and all text is stemmed. For both datasets, we identified the
similarity measure and number of nearest neighbors (NN) that give the best
performance. We used these configurations in ML-KNN to improve the multi-label
algorithm, and the MULAN library [8] for result evaluation.
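
A minimal sketch of the two configurations follows; the regular expressions, the tiny
stop-word list, and the crude suffix stripper are illustrative assumptions, since the
exact cleaning tools are not named in the paper.

```python
import re

# Tiny illustrative stop-word list; the actual list used is not specified.
STOP = {"the", "a", "an", "is", "are", "to", "and", "of", "in", "so", "my"}

def to_c0(tweets):
    """C0 (raw category): strip links and special symbols, drop duplicates."""
    seen, out = set(), []
    for t in tweets:
        t = re.sub(r"https?://\S+", " ", t)            # remove links
        t = re.sub(r"[^A-Za-z0-9\s]", " ", t).lower()  # remove special symbols
        t = " ".join(t.split())
        if t and t not in seen:
            seen.add(t)
            out.append(t)
    return out

def stem(word):
    """Crude suffix stripper standing in for a real stemmer (e.g., Porter)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def to_c3(tweets):
    """C3 (processed category): C0 plus stop-word removal and stemming."""
    return [" ".join(stem(w) for w in t.split() if w not in STOP)
            for t in to_c0(tweets)]
```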

5 Data Sets Description and ML-KNN

In our research work, we have created our own disease corpus and found some motivating
examples that belong to multiple disease categories. The tweet dataset was manually
annotated with the help of a medical domain expert, and the prepared corpus is used
for result evaluation. Some motivating examples that belong to multiple categories
are listed in Table 1.

Table 1 Tweets belonging to multiple disease categories

Tweets | Label sets
Close youre mouth when youre sneezing and coughing its not that difficult its actually pretty simple | Cold, Cough, Congestion
My stomach hurts so much | Stomachache, Abdominal Cramps
Knocked out at bc my stomach hurt so bad woke up rn and Im about to go back to sleep | Stomachache, Abdominal Cramps
Dear Mr brain eye headaches each day is not fun Its tough to look around nn yours truly a salty irritated person | Conjunctivitis, Headache
Keen yes that one K eyes are watery, inflammation | Conjunctivitis, Inflammation

We have used two different datasets: (1) the Disease corpus and (2) the Seattle
dataset; both are based on Twitter data. Seattle is a standard dataset described in
[9]. We prepared our own dataset based on disease keywords suggested in [10]. The
disease data preparation phases are as follows.

5.1 Data Collection Phase

In the data collection phase, raw tweets are collected to build a corpus. Twitter is
the source of information, and disease keywords are used to capture relevant disease
tweets from social media. The disease corpus is built by collecting tweets for five
different diseases (D-1 to D-5): abdominal pain, conjunctivitis, cough, diarrhea, and
nausea. The keywords used to search for tweets related to these diseases are taken
from the classical work in [10]. We used the Tweepy streaming API [11] for tweet
collection and collected only the textual content of tweets in the five categories.
All tweets were processed to remove duplicates as well as URLs. A total of 2009 unique
tweets across the five disease categories were used in the final disease corpus.
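
A minimal collection sketch against the Tweepy 3.x streaming API documented in [11]
is shown below; the credential placeholders are assumptions, and the track list uses
the five disease names rather than the full keyword set of [10].

```python
import tweepy  # Tweepy 3.x, matching the documentation cited in [11]

# Five disease categories (D-1 to D-5); the full keyword set comes from [10].
KEYWORDS = ["abdominal pain", "conjunctivitis", "cough", "diarrhea", "nausea"]

class DiseaseListener(tweepy.StreamListener):
    def on_status(self, status):
        print(status.text)                  # keep only the textual content

    def on_error(self, status_code):
        return status_code != 420           # disconnect on rate limiting

# Placeholder credentials from a Twitter developer account.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
stream = tweepy.Stream(auth=auth, listener=DiseaseListener())
stream.filter(track=KEYWORDS)
```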

5.2 Data Cleaning Phase

In the cleaning phase, raw tweets are first cleaned before they are subjected to the
different preprocessing configurations. Cleaning is generally undertaken to reduce
noise and thereby improve the quality of the training model. The idea behind these
steps is to remove noise from the dataset: special symbols, special syntax,
duplicates, and stop words are viewed as noise and are not beneficial as input to any
model.

6 Measures Used for Result Evaluation

The following measures are used for performance evaluation of modified ML-KNN and
conventional ML-KNN.

6.1 Subset Accuracy

Subset accuracy [5] evaluates the fraction of samples whose predicted label set
exactly matches the ground-truth label set. It is the multi-label counterpart of the
traditional accuracy metric.
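
In symbols, a standard formulation over p examples, with h(x_i) the predicted label
set, y_i the ground-truth label set, and [[·]] the indicator function, is:

$$\text{SubsetAcc}(h) = \frac{1}{p} \sum_{i=1}^{p} \big[\!\big[\, h(x_i) = y_i \,\big]\!\big]$$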

6.2 Hamming Loss

The hamming loss evaluates the fraction of misclassified instance-label pairs,
counting both the cases where a relevant label is missed and where an irrelevant label
is predicted. Note that when each example in S is associated with only one label,
hloss_S(h) will be 2/q times the traditional misclassification rate. Following Zhang
and Zhou [5], over p examples and q labels:

$$\mathrm{hloss}_S(h) = \frac{1}{p} \sum_{i=1}^{p} \frac{1}{q}\, \big|\, h(x_i)\, \Delta\, y_i \,\big|$$

where $\Delta$ stands for the symmetric difference between two sets.

6.3 Example-Based Precision


Example-based precision is defined as

$$\mathrm{Precision}(h) = \frac{1}{N} \sum_{i=1}^{N} \frac{|h(x_i) \cap y_i|}{|h(x_i)|}$$

6.4 Example-Based Recall


Example-based recall is defined as

$$\mathrm{Recall}(h) = \frac{1}{N} \sum_{i=1}^{N} \frac{|h(x_i) \cap y_i|}{|y_i|}$$

6.5 Example-Based F Measure

The F measure is the harmonic mean of precision and recall and is defined as

$$F_1 = \frac{1}{N} \sum_{i=1}^{N} \frac{2\,|h(x_i) \cap y_i|}{|h(x_i)| + |y_i|}$$

The F measure is an example-based metric whose value is averaged over all examples in
the dataset. It reaches its best value at 1 and its worst at 0.
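
The following is a minimal NumPy sketch (an illustration, not the MULAN code used for
the reported results) of the example-based measures above, operating on binary
indicator matrices:

```python
import numpy as np

def example_based_metrics(y_true, y_pred):
    """Example-based measures for (N, q) binary indicator matrices."""
    Y = np.asarray(y_true, dtype=bool)
    P = np.asarray(y_pred, dtype=bool)
    inter = (Y & P).sum(axis=1).astype(float)    # |h(x_i) ∩ y_i| per example
    eps = 1e-12                                  # guards empty label sets
    return {
        "subset_accuracy": float(np.mean(np.all(Y == P, axis=1))),
        "hamming_loss":    float(np.mean(Y != P)),  # wrong instance-label pairs
        "precision":       float(np.mean(inter / np.maximum(P.sum(axis=1), eps))),
        "recall":          float(np.mean(inter / np.maximum(Y.sum(axis=1), eps))),
        "f_measure":       float(np.mean(2 * inter /
                                 np.maximum(P.sum(axis=1) + Y.sum(axis=1), eps))),
    }
```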

6.6 Micro-Averaged Precision

Micro-precision (precision averaged over all example/label pairs) is defined as

$$\text{Micro-precision} = \frac{\sum_{j=1}^{Q} tp_j}{\sum_{j=1}^{Q} tp_j + \sum_{j=1}^{Q} fp_j}$$

where $tp_j$ and $fp_j$ are defined as for macro-precision.

6.7 Micro-Averaged Recall

Micro-recall (recall averaged over all example/label pairs) is defined as

$$\text{Micro-recall} = \frac{\sum_{j=1}^{Q} tp_j}{\sum_{j=1}^{Q} tp_j + \sum_{j=1}^{Q} fn_j}$$

where $tp_j$ and $fn_j$ are defined as for macro-recall.

6.8 Micro-Averaged F Measure

Micro-averaged F measure is the harmonic mean of micro-precision and micro-recall:

$$\text{Micro-}F = \frac{2 \times \text{micro-precision} \times \text{micro-recall}}{\text{micro-precision} + \text{micro-recall}}$$
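
A matching sketch of the micro-averaged measures, pooling true positives, false
positives, and false negatives over all labels before computing the ratios:

```python
import numpy as np

def micro_metrics(y_true, y_pred):
    """Micro-averaged precision, recall, and F for (N, q) binary matrices."""
    Y = np.asarray(y_true, dtype=int)
    P = np.asarray(y_pred, dtype=int)
    tp = float(np.sum((Y == 1) & (P == 1)))      # pooled over all labels
    fp = float(np.sum((Y == 0) & (P == 1)))
    fn = float(np.sum((Y == 1) & (P == 0)))
    micro_p = tp / (tp + fp) if tp + fp else 0.0
    micro_r = tp / (tp + fn) if tp + fn else 0.0
    micro_f = (2 * micro_p * micro_r / (micro_p + micro_r)
               if micro_p + micro_r else 0.0)
    return micro_p, micro_r, micro_f
```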

7 Result Evaluation and Discussion

7.1 Result Discussion

The experiments show that performance varies between C0 and C3: the two configurations
yield different results when variations in distance measure and number of neighbors
are applied to the original ML-KNN. For both datasets, the C3 configuration (stop
words removed and terms stemmed) gives the best subset accuracy and the minimum
hamming loss.

7.1.1 C0 Configuration

Under the C0 configuration (Tables 2 and 4), it is clearly visible that the Euclidean
and Minkowski distance measures with five nearest neighbors perform best on the
Disease dataset, with a subset accuracy of 84.72%. On the Seattle dataset, Euclidean
and Minkowski likewise perform best when configured with a nearest neighbor value of
5; the subset accuracy in this case is 48.49%. The Chebyshev distance measure shows
the poorest performance among all the considered distance measures on both datasets.

Table 2 Disease dataset with C0 configuration

Algorithm | Subset accuracy | Hamming loss | Configuration | Distance measure | NN value
M-ML-KNN | 83.33 | 4.24 | C0 | Manhattan | 5
M-ML-KNN | 83.33 | 4.24 | C0 | Manhattan | 8
M-ML-KNN | 83.73 | 4.16 | C0 | Manhattan | 11
M-ML-KNN | 82.98 | 4.39 | C0 | Manhattan | 14
M-ML-KNN | 84.72 | 4.48 | C0 | Euclidean | 5
M-ML-KNN | 83.33 | 4.24 | C0 | Euclidean | 8
M-ML-KNN | 83.73 | 4.16 | C0 | Euclidean | 11
M-ML-KNN | 82.98 | 4.39 | C0 | Euclidean | 14
M-ML-KNN | 84.72 | 4.48 | C0 | Minkowski | 5
M-ML-KNN | 83.33 | 4.24 | C0 | Minkowski | 8
M-ML-KNN | 83.73 | 4.16 | C0 | Minkowski | 11
M-ML-KNN | 82.98 | 4.39 | C0 | Minkowski | 14
M-ML-KNN | 5.08 | 19.02 | C0 | Chebyshev | 5
M-ML-KNN | 5.03 | 19.02 | C0 | Chebyshev | 8
M-ML-KNN | 5.13 | 19.03 | C0 | Chebyshev | 11
M-ML-KNN | 4.33 | 19.18 | C0 | Chebyshev | 14

Table 3 Disease dataset with C3 configuration

Algorithm | Subset accuracy | Hamming loss | Configuration | Distance measure | NN value
M-ML-KNN | 90.74 | 2.51 | C3 | Manhattan | 5
M-ML-KNN | 91.44 | 2.34 | C3 | Manhattan | 8
M-ML-KNN | 89.50 | 2.54 | C3 | Manhattan | 11
M-ML-KNN | 89.15 | 2.53 | C3 | Manhattan | 14
M-ML-KNN | 90.74 | 2.51 | C3 | Euclidean | 5
M-ML-KNN | 91.44 | 2.34 | C3 | Euclidean | 8
M-ML-KNN | 89.50 | 2.54 | C3 | Euclidean | 11
M-ML-KNN | 89.15 | 2.53 | C3 | Euclidean | 14
M-ML-KNN | 90.74 | 2.51 | C3 | Minkowski | 5
M-ML-KNN | 91.44 | 2.34 | C3 | Minkowski | 8
M-ML-KNN | 89.50 | 2.54 | C3 | Minkowski | 11
M-ML-KNN | 89.15 | 2.53 | C3 | Minkowski | 14
M-ML-KNN | 11.45 | 17.93 | C3 | Chebyshev | 5
M-ML-KNN | 9.86 | 18.15 | C3 | Chebyshev | 8
M-ML-KNN | 9.90 | 18.14 | C3 | Chebyshev | 11
M-ML-KNN | 9.51 | 18.14 | C3 | Chebyshev | 14

Table 4 Seattle dataset with C0 configuration

Algorithm | Subset accuracy | Hamming loss | Configuration | Distance measure | NN value
M-ML-KNN | 2.25 | 35.15 | C0 | Manhattan | 5
M-ML-KNN | 2.25 | 35.19 | C0 | Manhattan | 8
M-ML-KNN | 1.70 | 35.31 | C0 | Manhattan | 11
M-ML-KNN | 1.98 | 35.23 | C0 | Manhattan | 14
M-ML-KNN | 48.49 | 26.85 | C0 | Euclidean | 5
M-ML-KNN | 45.09 | 26.78 | C0 | Euclidean | 8
M-ML-KNN | 47.45 | 26.07 | C0 | Euclidean | 11
M-ML-KNN | 45.80 | 26.37 | C0 | Euclidean | 14
M-ML-KNN | 48.49 | 26.85 | C0 | Minkowski | 5
M-ML-KNN | 45.09 | 26.78 | C0 | Minkowski | 8
M-ML-KNN | 47.45 | 26.07 | C0 | Minkowski | 11
M-ML-KNN | 45.80 | 26.37 | C0 | Minkowski | 14
M-ML-KNN | 2.25 | 35.15 | C0 | Chebyshev | 5
M-ML-KNN | 2.25 | 35.19 | C0 | Chebyshev | 8
M-ML-KNN | 1.70 | 35.31 | C0 | Chebyshev | 11
M-ML-KNN | 1.98 | 35.23 | C0 | Chebyshev | 14

7.1.2 C3 Configuration

Under the C3 configuration (Tables 3 and 5), a concrete feature set is used for the
classification task. Manhattan, Euclidean, and Minkowski with eight nearest neighbors
perform best on the Disease dataset, with 91.44% overall subset accuracy. For the
Seattle dataset, Manhattan, Euclidean, and Minkowski with 14 nearest neighbors perform
best, with 53.15% overall subset accuracy.
The experiments clearly show around 7% higher subset accuracy on the Disease dataset
and around 5% higher subset accuracy on the Seattle dataset than under C0. This
indicates that concrete features play an important role in the classification task,
irrespective of whether it is single-label, multi-class, or multi-label.

Table 5 Seattle dataset with C3 configuration

Algorithm | Subset accuracy | Hamming loss | Configuration | Distance measure | NN value
M-ML-KNN | 52.72 | 26.00 | C3 | Manhattan | 5
M-ML-KNN | 52.83 | 25.33 | C3 | Manhattan | 8
M-ML-KNN | 52.77 | 25.34 | C3 | Manhattan | 11
M-ML-KNN | 53.15 | 25.27 | C3 | Manhattan | 14
M-ML-KNN | 52.72 | 26.00 | C3 | Euclidean | 5
M-ML-KNN | 52.83 | 25.33 | C3 | Euclidean | 8
M-ML-KNN | 52.77 | 25.34 | C3 | Euclidean | 11
M-ML-KNN | 53.15 | 25.27 | C3 | Euclidean | 14
M-ML-KNN | 52.72 | 26.00 | C3 | Minkowski | 5
M-ML-KNN | 52.83 | 25.33 | C3 | Minkowski | 8
M-ML-KNN | 52.77 | 25.34 | C3 | Minkowski | 11
M-ML-KNN | 53.15 | 25.27 | C3 | Minkowski | 14
M-ML-KNN | 3.84 | 34.74 | C3 | Chebyshev | 5
M-ML-KNN | 3.73 | 34.81 | C3 | Chebyshev | 8
M-ML-KNN | 3.46 | 34.82 | C3 | Chebyshev | 11
M-ML-KNN | 3.62 | 34.78 | C3 | Chebyshev | 14

8 Conclusion

In this paper, the performance of the conventional ML-KNN algorithm is evaluated by
varying the similarity measure and the number of nearest neighbors. Based on the label
information of nearest neighboring instances and the distances between instance
features, modified ML-KNN utilizes the maximum a posteriori principle to determine the
label set of unseen instances. Experiments on two real-world multi-label datasets show
that the performance of modified ML-KNN improves with suitable distance measures and
numbers of nearest neighbors. With the Manhattan, Euclidean, and Minkowski measures,
modified ML-KNN performs best under the C3 configuration, with a gain of around 5-7%
in subset accuracy. In this paper, the distance between instances is measured by four
distance metrics: Manhattan, Euclidean, Minkowski, and Chebyshev. The experiments show
that the Chebyshev distance metric performs worst among all.

9 Future Work

Complex statistical information beyond the membership counting statistics could
further facilitate the use of the maximum a posteriori principle. This is an
interesting issue for future work.

References

1. Sofean M, Smith M (2012) A real-time disease surveillance architecture using social networks.
Stud Health Technol Inf 180:823–827
2. Guo J, Zhang P, Guo L (2012) Mining hot topics from twitter streams. Procedia Comput Sci
9:2008–2011
3. Rui W, Xing K, Jia Y (2016) BOWL: Bag of word clusters text representation using word
embeddings. In: International conference on knowledge science, engineering and management.
Springer International Publishing
4. Ding W et al (2008) LRLW-LSI: an improved latent semantic indexing (LSI) text classifier.
Lect Note Comput Sci 5009:483
5. Zhang ML, Zhou ZH (2014) A review on multi-label learning algorithms. IEEE Trans Knowl
Data Eng 26(8):1819–1837
6. Aha DW (1991) Incremental constructive induction: an instance-based approach. In: Proceed-
ings of the eighth international workshop on machine learning
7. Cha SH (2007) Comprehensive survey on distance/similarity measures between probability
density functions. City 1(2):1
8. Tsoumakas G et al (2011) MULAN: a Java library for multi-label learning. J Mach
Learn Res 12:2411–2414
9. Schulz A et al (2014) Evaluating multi-label classification of incident-related tweets. In: Making
Sense of Microposts (Microposts2014), vol 7
10. Velardi P et al (2014) Twitter mining for fine-grained syndromic surveillance. Artif Intell Med
61(3):153–163
11. Roesslein J (2009) Tweepy documentation. http://tweepy.readthedocs.io/en/v3.5
