Knowledge-Based Systems
journal homepage: www.elsevier.com/locate/knosys
Article info

Article history:
Received 18 November 2015
Revised 2 May 2016
Accepted 8 June 2016
Available online 17 June 2016

Keywords:
Sentiment analysis
Aspect extraction
Opinion mining
CNN
RBM
DNN

Abstract

In this paper, we present the first deep learning approach to aspect extraction in opinion mining. Aspect extraction is a subtask of sentiment analysis that consists in identifying opinion targets in opinionated text, i.e., in detecting the specific aspects of a product or service the opinion holder is either praising or complaining about. We used a 7-layer deep convolutional neural network to tag each word in opinionated sentences as either an aspect or a non-aspect word. We also developed a set of linguistic patterns for the same purpose and combined them with the neural network. The resulting ensemble classifier, coupled with a word-embedding model for sentiment analysis, allowed our approach to obtain significantly better accuracy than state-of-the-art methods.

© 2016 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.knosys.2016.06.009
S. Poria et al. / Knowledge-Based Systems 108 (2016) 42–49 43
Most of the previous works in aspect term extraction have either used conditional random fields (CRFs) [6,7] or linguistic patterns [4,8]. Both of these approaches have their own limitations: CRF is a linear model, so it needs a large number of features to work well; linguistic patterns need to be crafted by hand, and they crucially depend on the grammatical accuracy of the sentences. In this paper, we overcome both limitations by using a convolutional neural network (CNN), a non-linear supervised classifier that can more easily fit the data. Previously, [9] used such a network to solve a range of tasks (though not aspect extraction), on which it outperformed other state-of-the-art NLP methods. In addition, we use linguistic patterns to further improve the performance of the method, though in this case the above-mentioned issues inherent in linguistic patterns affect the framework.

This paper is the first to introduce the application of a deep learning approach to the task of aspect extraction. Our experimental results show that a deep CNN is more efficient for aspect extraction than existing approaches. We also introduce specific linguistic patterns and combine the linguistic pattern approach with the deep learning approach for the aspect extraction task.

2. Related work

Aspect extraction from opinions was first studied by Hu and Liu [4]. They introduced the distinction between explicit and implicit aspects. However, the authors only dealt with explicit aspects and used a set of rules based on statistical observations. Hu and Liu's method was later improved by Popescu and Etzioni [10] and by Blair-Goldensohn et al. [11]. Popescu and Etzioni [10] assumed the product class is known in advance. Their algorithm detects whether a noun or noun phrase is a product feature by computing the pointwise mutual information between the noun phrase and the product class.

Scaffidi et al. [12] presented a method that uses a language model to identify product features. They assumed that product features are more frequent in product reviews than in general natural-language text. However, their method seems to have low precision, since the retrieved aspects are affected by noise. Some methods treated aspect term extraction as sequence labeling and used CRFs for it. Such methods have performed very well on the datasets, even in cross-domain experiments [6,7].

Topic modeling has been widely used as a basis to perform extraction and grouping of aspects [13,14]. Two models were considered: pLSA [15] and LDA [16]. Both models introduce a latent variable "topic" between the observable variables "document" and "word" to analyze the semantic topic distribution of documents. In topic models, each document is represented as a random mixture over latent topics, where each topic is characterized by a distribution over words. Such methods have been gaining popularity in social media analysis, e.g., for emerging political topic detection on Twitter [17]. The LDA model defines a Dirichlet probabilistic generative process for the document-topic distribution: in each document, a latent aspect is chosen according to a multinomial distribution controlled by a Dirichlet prior α; then, given an aspect, a word is extracted according to another multinomial distribution, controlled by another Dirichlet prior β. Among existing works employing these models are the extraction of global aspects (such as the brand of a product) and local aspects (such as the properties of a product) [18], the extraction of key phrases [19], the rating of multi-aspects [20], and the summarization of aspects and sentiments [21]. [22] employed the maximum entropy method to train a switch variable based on the POS tags of words and used it to separate aspect and sentiment words.

Mcauliffe and Blei [23] added user feedback to LDA as a response variable related to each document. Lu and Zhai [24] proposed a semi-supervised model. DF-LDA [25] also represents a semi-supervised model, which allows the user to set must-link and cannot-link constraints. A must-link constraint means that two terms must be in the same topic, while a cannot-link constraint means that two terms cannot be in the same topic. Poria et al. [26] integrated common-sense computing [27] in the calculation of word distributions in the LDA algorithm, thus enabling the shift from syntax to semantics in aspect-based sentiment analysis.

Wang et al. [28] proposed two semi-supervised models for product aspect extraction based on the use of seeding aspects. In the category of supervised methods, [29] employed seed words to guide topic models to learn topics of specific interest to a user, while [20] and [30] employed seeding words to extract related product aspects from product reviews.

On the other hand, recent approaches using deep CNNs [9,31] showed significant performance improvement over state-of-the-art methods on a range of natural language processing (NLP) tasks. Collobert et al. [9] fed word embeddings into a CNN to solve standard NLP problems such as named entity recognition (NER), part-of-speech (POS) tagging, and semantic role labeling.

3. Some background on deep CNN

A deep neural network (DNN) can be viewed as a composite of simple, unsupervised models such as restricted Boltzmann machines (RBMs), where each RBM's hidden layer serves as the visible layer for the next RBM. An RBM is a bipartite graph comprising two layers of neurons, a visible and a hidden layer; connections between neurons in the same layer are not allowed.

To train such a multi-layer system, one needs to compute the gradient of the total energy function E with respect to the weights in all the layers. To learn these weights and maximize the global energy function, the approximate maximum-likelihood contrastive divergence approach can be used. This method employs each training sample to initialize the visible layer. Next, it uses the Gibbs sampling algorithm to update the hidden layer and then reconstruct the visible layer consecutively, until convergence occurs [32]. As an example, consider a logistic regression model to learn the binary hidden neurons, where each visible neuron is assumed to be a sample from a normal distribution [33]. The continuous state \hat{h}_j of the hidden neuron j, with bias b_j, is a weighted sum over all continuous visible neurons v_i:

\hat{h}_j = b_j + \sum_i v_i w_{ij},   (1)

where w_{ij} is the weight of the connection from the visible neuron v_i to the hidden neuron j. The binary state h_j of the hidden neuron can then be defined by a sigmoid activation function:

h_j = \frac{1}{1 + e^{-\hat{h}_j}}.   (2)

Similarly, at the next iteration, the continuous state of each visible neuron v_i is reconstructed. Here, we determine the state of the visible neuron i, with bias c_i, as a random sample from the normal distribution whose mean is a weighted sum over all binary hidden neurons:

v_i = c_i + \sum_j h_j w_{ij},   (3)

where w_{ij} is the weight of the connection from the visible neuron i to the hidden neuron j. This continuous state is a random sample from a normal distribution N(v_i, σ), where σ is the variance of all visible neurons. Unlike the hidden neurons, in a Gaussian RBM the visible ones can take continuous values.

Then, the weights are updated using the difference between the original data v_{data} and the reconstructed visible layer v_{recon}:

\Delta w_{ij} = \alpha \left( \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{recon} \right),   (4)
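As a concrete illustration of Eqs. (1)–(4), the following minimal NumPy sketch performs a single contrastive-divergence (CD-1) update for a Gaussian-Bernoulli RBM. The layer sizes, learning rate, and unit visible variance are illustrative assumptions for this sketch only; this is not the configuration evaluated in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

n_visible, n_hidden, lr = 6, 4, 0.1           # illustrative sizes and learning rate (alpha)
W = 0.01 * rng.standard_normal((n_visible, n_hidden))
b = np.zeros(n_hidden)                        # hidden biases b_j
c = np.zeros(n_visible)                       # visible biases c_i

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v_data):
    """One CD-1 weight update, assuming unit visible variance."""
    # Eqs. (1)-(2): hidden pre-activation, then Bernoulli sampling of binary states
    h_prob = sigmoid(b + v_data @ W)
    h_sample = (rng.random(n_hidden) < h_prob).astype(float)
    # Eq. (3): reconstruct continuous visibles as the mean of the Gaussian
    v_recon = c + h_sample @ W.T
    h_recon = sigmoid(b + v_recon @ W)
    # Eq. (4): difference between data-driven and reconstruction-driven statistics
    return lr * (np.outer(v_data, h_prob) - np.outer(v_recon, h_recon))

v = rng.standard_normal(n_visible)            # one toy training sample
dW = cd1_step(v)
print(dW.shape)                               # (6, 4)
```

In a full training loop this update would be applied per sample (or mini-batch) until the reconstruction error stops decreasing.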
where α is the learning rate and \langle v_i h_j \rangle is the expected frequency with which the visible neuron i and the hidden neuron j are active together when the visible vectors are sampled from the training set and the hidden neurons are calculated according to (1)–(3), after some k iterations.

Finally, the energy of a DNN can be determined from the final layer (the one before the output layer) as

E = - \sum_{i,j} v_i h_j w_{ij}.   (5)

To extend the deep neural network to a deep CNN, one simply partitions the hidden layer into Z groups. Each of the Z groups is associated with an n_x × n_y filter, where n_x is the height of the kernel and n_y is the width of the kernel. Assume that the input has dimensions L_x × L_y, which in our case is given by the L_x words in the sentence and the L_y features, such as the word embedding, of each word. Then the convolution results in a hidden layer of Z groups, each of dimension (L_x - n_x + 1) × (L_y - n_y + 1). The learned weights of these kernels are shared among all hidden neurons in a particular group. The energy function of the layer l is now a sum over the energy of the individual blocks:

E^l = - \sum_{z=1}^{Z} \sum_{i,j}^{(L_x-n_x+1),(L_y-n_y+1)} \sum_{r,s}^{n_x,n_y} v_{i+r-1,\,j+s-1}\, h^z_{ij}\, w^l_{rs}.   (6)

4. Training CNN for sequential data

We used a special training algorithm suitable for sequential data, proposed by Collobert et al. [9]. We summarize it here, mainly following [34].

The algorithm trains the neural network by back-propagation in order to maximize the likelihood over the training sentences. Consider the network parameters θ. We say that h_y is the output score for the likelihood of an input x to have the tag y. Then, the probability of assigning the label y to x is calculated as

p(y|x, \theta) = \frac{\exp(h_y)}{\sum_j \exp(h_j)}.   (7)

Define the logadd operation as

\mathrm{logadd}_i\, h_i = \log \sum_i \exp(h_i);   (8)

then, for a training example, the log-likelihood becomes

\log p(y|x, \theta) = h_y - \mathrm{logadd}_i\, h_i.   (9)

In aspect term extraction, the terms can be organized as chunks and are also often surrounded by opinion terms. Hence, it is important to consider the sentence structure as a whole in order to obtain additional clues. Let it be given that there are T tokens in a sentence, y is the tag sequence, and h_{t,i} is the network score for the t-th token having the i-th tag. We introduce the transition score A_{i,j} for moving from tag i to tag j. Then, the score for the sentence s to have the tag path y is defined by

s(x, y, \theta) = \sum_{t=1}^{T} \left( h_{t,y_t} + A_{y_{t-1},y_t} \right).   (10)

This formula represents the tag path probability over all possible paths. Now, from (8) we can write the log-likelihood

\log p(y|x, \theta) = s(x, y, \theta) - \mathrm{logadd}_{\forall p}\, s(x, p, \theta).   (11)

The number of tag paths grows exponentially. However, using dynamic programming techniques, one can compute in polynomial time the score for all paths that end in a given tag [9]. Let y_t^k denote all paths that end with the tag k at the token t. Then, using recursion, we obtain

\delta_t(k) = \mathrm{logadd}_{p \in y_t^k}\, s(x, p, \theta) = h_{t,k} + \mathrm{logadd}_j \left( \delta_{t-1}(j) + A_{j,k} \right).   (12)

For the sake of brevity, we shall not delve into the details of the recursive procedure, which can be found in [9]. The next equation gives the logadd over all the paths to the token T:

\mathrm{logadd}_{\forall p}\, s(x, p, \theta) = \mathrm{logadd}_i\, \delta_T(i).   (13)

Using these equations, we can maximize the likelihood of (11) over all training pairs. For inference, we need to find the best tag path using the Viterbi algorithm, i.e., the tag path that maximizes the sentence score (10).

5. Our network architecture

The features of an aspect term depend on its surrounding words. Thus, we used a window of 5 words around each word in a sentence, i.e., ±2 words. We formed the local features of that window and considered them to be the features of the middle word. Then, this feature vector was fed to the CNN.

The network contained one input layer, two convolution layers, two max-pool layers, and a fully connected layer with softmax output. The first convolution layer consisted of 100 feature maps with filter size 2. The second convolution layer had 50 feature maps with filter size 3. The stride in each convolution layer was 1, since we wanted to tag each word. A max-pooling layer followed each convolution layer; the pool size we used in the max-pool layers was 2. We used regularization with dropout on the penultimate layer, with a constraint on the L2-norms of the weight vectors, and trained for 30 epochs. The output of each convolution layer was computed using a non-linear function; in our case, the hyperbolic tangent.

As features, we used word embeddings trained on two different corpora. We also used some additional features and rules to boost the accuracy; see Section 7. The CNN produces local features around each word in a sentence and then combines these features into a global feature vector. Since the kernel sizes of the two convolution layers were different, the dimensionality L_x × L_y mentioned in Section 3 was 3 × 300 and 2 × 300, respectively. The input layer was 65 × 300, where 65 was the maximum number of words in a sentence and 300 the dimensionality of the word embeddings used for each word.

The process was performed for each word in a sentence. Unlike the traditional maximum-likelihood learning scheme, we trained the system using back-propagation after convolving all tokens in the sentence. Namely, we stored the weights, biases, and features for each token after convolution and back-propagated the error in order to correct them only once all tokens were processed, using the training scheme from Section 4.

If a training instance s had n words, then we represented the input vector for that instance as s_{1:n} = s_1 s_2 \dots s_n. Here, s_i ∈ R^k is a k-dimensional feature vector for the word s_i. We found that this network architecture produced good results on both of our benchmark datasets. Adding extra layers or changing the pooling size and window size did not contribute much to the accuracy, but only increased the computational cost.

6. Datasets used

In this section, we present the data used in our experiments. Word embeddings are distributed representations of text, which encode semantic and syntactic properties of words. Usually they
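The dynamic-programming recursion of Eqs. (12)–(13), together with the Viterbi decoding used at inference time, can be sketched as follows. The network scores h and the transition matrix A are random stand-ins with illustrative dimensions, not trained values.

```python
import numpy as np

rng = np.random.default_rng(1)
T, K = 5, 3                        # illustrative: 5 tokens, 3 tags
h = rng.standard_normal((T, K))    # network score h[t, k] for tag k at token t
A = rng.standard_normal((K, K))    # transition score A[j, k] from tag j to tag k

def logadd(x, axis=None):
    """Numerically stable log-sum-exp, i.e., the logadd of Eq. (8)."""
    m = np.max(x, axis=axis, keepdims=True)
    return (m + np.log(np.sum(np.exp(x - m), axis=axis, keepdims=True))).squeeze()

# Forward recursion, Eq. (12): delta_t(k) = h[t, k] + logadd_j(delta_{t-1}(j) + A[j, k])
delta = h[0].copy()
for t in range(1, T):
    delta = h[t] + logadd(delta[:, None] + A, axis=0)
log_Z = logadd(delta)              # Eq. (13): logadd over all tag paths

# Viterbi decoding: replace logadd with max and keep back-pointers
score, back = h[0].copy(), []
for t in range(1, T):
    back.append(np.argmax(score[:, None] + A, axis=0))
    score = h[t] + np.max(score[:, None] + A, axis=0)
path = [int(np.argmax(score))]
for bp in reversed(back):
    path.append(int(bp[path[-1]]))
path.reverse()                     # best tag sequence, one tag per token
print(path)
```

Since log-sum-exp upper-bounds the maximum, log_Z is always at least the score of the Viterbi path, which is a useful sanity check when implementing the recursion.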
Table 1
Characteristics of the dataset developed by Qiu et al. [37] and comparison of our approach with the state of the art on it. Popescu stands for [10] and Prop-dep for [37]; P stands for precision and R for recall.
Table 3
Random features vs. Google embeddings vs. Amazon embeddings on the SemEval 2014 dataset.

Domain      Feature             F-Score
Laptop      Random              71.21%
Laptop      Google embeddings   77.32%
Laptop      Amazon embeddings   80.68%
Restaurant  Random              77.05%
Restaurant  Google embeddings   83.50%
Restaurant  Amazon embeddings   85.70%

Table 5
Impact of linguistic patterns on the SemEval 2014 dataset.

Domain      Classifier  Recall   Precision  F-Score
Laptop      LP          62.39%   57.20%     59.68%
Laptop      CNN         76.31%   86.46%     81.06%
Laptop      CNN+LP      78.35%   86.72%     82.32%
Restaurant  LP          65.41%   60.50%     62.86%
Restaurant  CNN         85.01%   87.42%     86.20%
Restaurant  CNN+LP      86.10%   88.27%     87.17%
Table 4
Feature analysis for the CNN classifier.

Domain      Features  Recall   Precision  F-Score
Laptop      WE        75.20%   86.05%     80.68%
Laptop      WE+POS    76.31%   86.46%     81.06%
Restaurant  WE        84.11%   87.35%     85.70%
Restaurant  WE+POS    85.01%   87.42%     86.20%

Table 6
Comparison with the state of the art. ZW stands for [7]; LP stands for linguistic patterns.

Domain      Framework     Recall   Precision  F-Score
Laptop      ZW            66.51%   84.80%     74.55%
Laptop      CNN+LP (our)  78.35%   86.72%     82.32%
Restaurant  ZW            82.72%   85.35%     84.01%
Restaurant  CNN+LP (our)  86.10%   88.27%     87.17%
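As a shape-only sketch of the layer stack described in Section 5 (valid convolutions with filter sizes 2 and 3, each followed by max-pooling of size 2, over a 65-word input at stride 1), the helper below tracks only the sequence lengths at each stage; it is an illustrative aid, not the implementation used in the experiments.

```python
def conv_len(n, filt, stride=1):
    """Output length of a 1-D 'valid' convolution: (n - filt) // stride + 1."""
    return (n - filt) // stride + 1

def pool_len(n, pool):
    """Output length of non-overlapping max-pooling with the given pool size."""
    return n // pool

n = 65                      # maximum words per sentence (input is 65 x 300)
n = conv_len(n, filt=2)     # conv layer 1: 100 feature maps, filter size 2 -> 64
n = pool_len(n, pool=2)     # max-pool 1, pool size 2                      -> 32
n = conv_len(n, filt=3)     # conv layer 2: 50 feature maps, filter size 3 -> 30
n = pool_len(n, pool=2)     # max-pool 2, pool size 2                      -> 15
print(n)                    # positions feeding the fully connected softmax layer
```

Tracking lengths this way makes it easy to check that a proposed change to the filter or pool sizes still leaves a non-empty feature map for the fully connected layer.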
Fig. 1. Comparison of the performance of CNN, CNN+LP and LP (precision vs. recall; Laptop and Restaurant domains).
Fig. 2. Comparison of the performance with the state of the art (precision vs. recall; ZW vs. CNN+LP (our); Laptop and Restaurant domains).
Table 8
Impact of linguistic patterns on the dataset by Qiu et al. [37].

Domain     Classifier  Precision  Recall   F-Score
Canon      CNN         85.42%     77.21%   81.10%
Canon      CNN+LP      92.59%     85.02%   88.64%
Nikon      CNN         77.65%     82.30%   79.90%
Nikon      CNN+LP      82.65%     87.23%   84.87%
DVD        CNN         85.48%     79.25%   82.24%
DVD        CNN+LP      90.29%     84.30%   87.19%
Mp3        CNN         89.40%     83.77%   86.49%
Mp3        CNN+LP      92.75%     86.05%   89.27%
Cellphone  CNN         90.15%     83.47%   86.68%
Cellphone  CNN+LP      92.67%     88.32%   90.44%

Fig. 4 compares the proposed method with the state of the art. We believe there are two key reasons for our framework to outperform state-of-the-art approaches. First, a deep CNN, which is non-linear in nature, fits the data better than linear models such as CRF. Second, the pre-trained word embedding features help our framework to outperform state-of-the-art methods that do not use word embeddings. The main advantage of our framework is that it does not need any feature engineering; this minimizes development cost and time.
9. Conclusion
Fig. 3. Comparison of the performance of CNN and CNN+LP (precision vs. recall; Canon, Nikon, DVD, Mp3 and Cellphone domains).
Fig. 4. Comparison of the performance with the state of the art on the Bing Liu dataset (precision vs. recall; Popescu, Prop-dep and our method; Canon, Nikon, DVD, Mp3 and Cellphone domains).
…Methods in Natural Language Processing, Association for Computational Linguistics, 2010, pp. 56–65.
[23] J.D. Mcauliffe, D.M. Blei, Supervised topic models, in: Advances in Neural Information Processing Systems, 2008, pp. 121–128.
[24] Y. Lu, C. Zhai, Opinion integration through semi-supervised topic modeling, in: Proceedings of the 17th International Conference on World Wide Web, ACM, 2008, pp. 121–130.
[25] D. Andrzejewski, X. Zhu, M. Craven, Incorporating domain knowledge into topic modeling via Dirichlet forest priors, in: Proceedings of the 26th Annual International Conference on Machine Learning, ACM, 2009, pp. 25–32.
[26] S. Poria, I. Chaturvedi, E. Cambria, F. Bisio, Sentic LDA: Improving on LDA with semantic similarity for aspect-based sentiment analysis, in: IJCNN, 2016.
[27] E. Cambria, A. Hussain, C. Havasi, C. Eckl, Common sense computing: From the society of mind to digital intuition and beyond, in: J. Fierrez, J. Ortega, A. Esposito, A. Drygajlo, M. Faundez-Zanuy (Eds.), Biometric ID Management and Multimodal Communication, Lecture Notes in Computer Science, 5707, Springer, Berlin Heidelberg, 2009, pp. 252–259.
[28] T. Wang, Y. Cai, H.-f. Leung, R.Y. Lau, Q. Li, H. Min, Product aspect extraction supervised with online domain knowledge, Knowl.-Based Syst. 71 (2014) 86–100.
[29] J. Jagarlamudi, H. Daumé III, R. Udupa, Incorporating lexical priors into topic models, in: Proceedings of the 13th EACL Conference, Association for Computational Linguistics, 2012, pp. 204–213.
[30] A. Mukherjee, B. Liu, Aspect extraction through semi-supervised modeling, in: Proceedings of the 50th Annual Meeting of the ACL: Long Papers, Volume 1, ACL, 2012, pp. 339–348.
[31] S. Poria, E. Cambria, A. Gelbukh, Deep convolutional neural network textual features and multiple kernel learning for utterance-level multimodal sentiment analysis, in: EMNLP, 2015, pp. 2539–2544.
[32] G.E. Hinton, Training products of experts by minimizing contrastive divergence, Neural Comput. 14 (8) (2002) 1771–1800.
[33] G.W. Taylor, G.E. Hinton, S.T. Roweis, Modeling human motion using binary latent variables, in: Advances in Neural Information Processing Systems, 19, MIT Press, 2007, pp. 1345–1352.
[34] E.R. Fonseca, J.L.G. Rosa, A two-step convolutional neural network approach for semantic role labeling, in: The 2013 International Joint Conference on Neural Networks (IJCNN), IEEE, Dallas, TX, 2013, pp. 1–7.
[35] T. Mikolov, W.-t. Yih, G. Zweig, Linguistic regularities in continuous space word representations, in: HLT-NAACL, 2013, pp. 746–751.
[36] J. McAuley, J. Leskovec, Hidden factors and hidden topics: Understanding rating dimensions with review text, in: Proceedings of RecSys'13, Hong Kong, China, 2013.
[37] G. Qiu, B. Liu, J. Bu, C. Chen, Opinion word expansion and target extraction through double propagation, Comput. Linguist. 37 (1) (2011) 9–27.
[38] E. Cambria, D. Olsher, D. Rajagopal, SenticNet 3: a common and common-sense knowledge base for cognition-driven sentiment analysis, in: AAAI, Quebec City, 2014, pp. 1515–1521.
[39] S. Poria, A. Gelbukh, E. Cambria, P. Yang, A. Hussain, T. Durrani, Merging SenticNet and WordNet-Affect emotion lists for sentiment analysis, in: Signal Processing (ICSP), 2012 IEEE 11th International Conference on, vol. 2, IEEE, 2012, pp. 1251–1255.
[40] S. Poria, A. Gelbukh, E. Cambria, D. Das, S. Bandyopadhyay, Enriching SenticNet polarity scores through semi-supervised fuzzy clustering, in: IEEE ICDM, Brussels, 2012, pp. 709–716.
[41] E. Cambria, A. Hussain, Sentic Computing: A Common-Sense-Based Framework for Concept-Level Sentiment Analysis, Springer, Cham, Switzerland, 2015.