
Counter-fitting Word Vectors to Linguistic Constraints

Nikola Mrkšić¹, Diarmuid Ó Séaghdha², Blaise Thomson², Milica Gašić¹,
Lina Rojas-Barahona¹, Pei-Hao Su¹, David Vandyke¹, Tsung-Hsien Wen¹, Steve Young¹

¹ Department of Engineering, University of Cambridge, UK
² Apple Inc.

{nm480,mg436,phs26,djv27,thw28,sjy}@cam.ac.uk
{doseaghdha, blaisethom}@apple.com

arXiv:1603.00892v1 [cs.CL] 2 Mar 2016

Abstract

In this work, we present a novel counter-fitting method which injects antonymy and synonymy constraints into vector space representations in order to improve the vectors' capability for judging semantic similarity. Applying this method to publicly available pre-trained word vectors leads to a new state of the art performance on the SimLex-999 dataset. We also show how the method can be used to tailor the word vector space for the downstream task of dialogue state tracking, resulting in robust improvements across different dialogue domains.

            east         expensive     British
  Before    west         pricey        American
            north        cheaper       Australian
            south        costly        Britain
            southeast    overpriced    European
            northeast    inexpensive   England
  After     eastward     costly        Brits
            eastern      pricy         London
            easterly     overpriced    BBC
            -            pricey        UK
            -            afford        Britain

Table 1: Nearest neighbours for target words using GloVe vectors before and after counter-fitting

1 Introduction
Many popular methods that induce representations for words rely on the distributional hypothesis – the assumption that semantically similar or related words appear in similar contexts. This hypothesis supports unsupervised learning of meaningful word representations from large corpora (Curran, 2003; Ó Séaghdha and Korhonen, 2014; Mikolov et al., 2013; Pennington et al., 2014). Word vectors trained using these methods have proven useful for many downstream tasks including machine translation (Zou et al., 2013) and dependency parsing (Bansal et al., 2014).

One drawback of learning word embeddings from co-occurrence information in corpora is that it tends to coalesce the notions of semantic similarity and conceptual association (Hill et al., 2014b). Furthermore, even methods that can distinguish similarity from association (e.g., based on syntactic co-occurrences) will generally fail to tell synonyms from antonyms (Mohammad et al., 2008). For example, words such as east and west or expensive and inexpensive appear in near-identical contexts, which means that distributional models produce very similar word vectors for such words. Examples of such anomalies in GloVe vectors can be seen in Table 1, where words such as cheaper and inexpensive are deemed similar to (their antonym) expensive.

A second drawback is that similarity and antonymy can be application- or domain-specific. In our case, we are interested in exploiting distributional knowledge for the dialogue state tracking (DST) task. The DST component of a dialogue system is responsible for interpreting users' utterances and updating the system's belief state – a probability distribution over all possible states of the dialogue. For example, a DST for the restaurant domain needs to detect whether the user wants a cheap or expensive restaurant. Being able to generalise using distributional information while still distinguishing between semantically different yet conceptually related words
(e.g. cheaper and pricey) is critical for the performance of dialogue systems. In particular, a dialogue system can be led seriously astray by false synonyms.

We propose a method that addresses these two drawbacks by using synonymy and antonymy relations drawn from either a general lexical resource or an application-specific ontology to fine-tune distributional word vectors. Our method, which we term counter-fitting, is a lightweight post-processing procedure in the spirit of retrofitting (Faruqui et al., 2015). The second row of Table 1 illustrates the results of counter-fitting: the nearest neighbours capture true similarity much more intuitively than the original GloVe vectors. The procedure improves word vector quality regardless of the initial word vectors provided as input.¹ By applying counter-fitting to the Paragram-SL999 word vectors provided by Wieting et al. (2015), we achieve new state-of-the-art performance on SimLex-999, a dataset designed to measure how well different models judge semantic similarity between words (Hill et al., 2014b). We also show that the counter-fitting method can inject knowledge of dialogue domain ontologies into word vector space representations to facilitate the construction of semantic dictionaries which improve DST performance across two different dialogue domains. Our tool and word vectors are available at github.com/nmrksic/counter-fitting.

¹ When we write "improve", we refer to improving the vector space for a specific purpose. We do not expect that a vector space fine-tuned for semantic similarity will give better results on semantic relatedness. As Mohammad et al. (2008) observe, antonymous concepts are related but not similar.

2 Related Work

Most work on improving word vector representations using lexical resources has focused on bringing words which are known to be semantically related closer together in the vector space. Some methods modify the prior or the regularization of the original training procedure (Yu and Dredze, 2014; Bian et al., 2014; Kiela et al., 2015). Wieting et al. (2015) use the Paraphrase Database (Ganitkevitch et al., 2013) to train word vectors which emphasise word similarity over word relatedness. These word vectors achieve the current state-of-the-art performance on the SimLex-999 dataset and are used as input for counter-fitting in our experiments.

Recently, there has been interest in lightweight post-processing procedures that use lexical knowledge to refine off-the-shelf word vectors without requiring large corpora for (re-)training as the aforementioned "heavyweight" procedures do. Faruqui et al.'s (2015) retrofitting approach uses similarity constraints from WordNet and other resources to pull similar words closer together.

The complications caused by antonymy for distributional methods are well-known in the semantics community. Most prior work focuses on extracting antonym pairs from text rather than exploiting them (Lin et al., 2003; Mohammad et al., 2008; Turney, 2008; Hashimoto et al., 2012; Mohammad et al., 2013). The most common use of antonymy information is to provide features for systems that detect contradictions or logical entailment (Marcu and Echihabi, 2002; de Marneffe et al., 2008; Zanzotto et al., 2009). As far as we are aware, there is no previous work on exploiting antonymy in dialogue systems. The modelling work closest to ours is that of Liu et al. (2015), who use antonymy and WordNet hierarchy information to modify the heavyweight Word2Vec training objective; Yih et al. (2012), who use a Siamese neural network to improve the quality of Latent Semantic Analysis vectors; Schwartz et al. (2015), who build a standard distributional model from co-occurrences based on symmetric patterns, with specified antonymy patterns counted as negative co-occurrences; and Ono et al. (2015), who use thesauri and distributional data to train word embeddings specialised for capturing antonymy.

3 Counter-fitting Word Vectors to Linguistic Constraints

Our starting point is an indexed set of word vectors V = {v_1, v_2, ..., v_N} with one vector for each word in the vocabulary. We will inject semantic relations into this vector space to produce new word vectors V' = {v'_1, v'_2, ..., v'_N}. For antonymy and synonymy we have a set of constraints A and S, respectively. The elements of each set are pairs of word indices; for example, each pair (i, j) in S is such that the i-th and j-th words in the vocabulary are synonyms. The objective function used to counter-fit the pre-trained word vectors V to the sets of linguistic constraints A and S contains three different terms:
1. Antonym Repel (AR): This term serves to push antonymous words' vectors away from each other in the transformed vector space V':

    AR(V') = Σ_{(u,w) ∈ A} τ( δ − d(v'_u, v'_w) )

where d(v_i, v_j) = 1 − cos(v_i, v_j) is a distance derived from cosine similarity and τ(x) ≜ max(0, x) imposes a margin on the cost. Intuitively, δ is the "ideal" minimum distance between antonymous words; in our experiments we set δ = 1.0 as it corresponds to vector orthogonality.

2. Synonym Attract (SA): The counter-fitting procedure should seek to bring the word vectors of known synonymous word pairs closer together:

    SA(V') = Σ_{(u,w) ∈ S} τ( d(v'_u, v'_w) − γ )

where γ is the "ideal" maximum distance between synonymous words; we use γ = 0.

3. Vector Space Preservation (VSP): the topology of the original vector space describes relationships between words in the vocabulary captured using distributional information from very large textual corpora. The VSP term bends the transformed vector space towards the original one as much as possible in order to preserve the semantic information contained in the original vectors:

    VSP(V, V') = Σ_{i=1}^{N} Σ_{j ∈ N(i)} τ( d(v'_i, v'_j) − d(v_i, v_j) )

For computational efficiency, we do not calculate distances for every pair of words in the vocabulary. Instead, we focus on the (pre-computed) neighbourhood N(i), which denotes the set of words within a certain radius ρ around the i-th word's vector in the original vector space V. Our experiments indicate that counter-fitting is relatively insensitive to the choice of ρ, with values between 0.2 and 0.4 showing little difference in quality; here we use ρ = 0.2.

The objective function for the training procedure is given by a weighted sum of the three terms:

    C(V, V') = k1 AR(V') + k2 SA(V') + k3 VSP(V, V')

where k1, k2, k3 ≥ 0 are hyperparameters that control the relative importance of each term. In our experiments we set them to be equal: k1 = k2 = k3. To minimise the cost function for a set of starting vectors V and produce counter-fitted vectors V', we run stochastic gradient descent (SGD) for 20 epochs. An end-to-end run of counter-fitting takes less than two minutes on a laptop with four CPUs.

3.1 Injecting Dialogue Domain Ontologies into Vector Space Representations

Dialogue state tracking (DST) models capture users' goals given their utterances. Goals are represented as sets of constraints expressed by slot-value pairs such as [food: Indian] or [parking: allowed]. The set of slots S and the set of values V_s for each slot make up the ontology of a dialogue domain.

In this paper we adopt the recurrent neural network (RNN) framework for tracking suggested in (Henderson et al., 2014d; Henderson et al., 2014c; Mrkšić et al., 2015). Rather than using a spoken language understanding (SLU) decoder to convert user utterances into meaning representations, this model operates directly on the n-gram features extracted from the automated speech recognition (ASR) hypotheses. A drawback of this approach is that the RNN model can only perform exact string matching to detect the slot names and values mentioned by the user. It cannot recognise synonymous words such as pricey and expensive, or even subtle morphological variations such as moderate and moderately. A simple way to mitigate this problem is to use semantic dictionaries: lists of rephrasings for the values in the ontology. Manual construction of dictionaries is highly labour-intensive; however, if one could automatically detect high-quality rephrasings, then this capability would come at no extra cost to the system designer.

To obtain a set of word vectors which can be used for creating a semantic dictionary, we need to inject the domain ontology into the vector space. This can be achieved by introducing antonymy constraints between all the possible values of each slot (i.e. Chinese and Indian, expensive and cheap, etc.). The remaining linguistic constraints can come from semantic lexicons: the richer the sets of injected synonyms and antonyms are, the better the resulting word representations will become.
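Putting the three terms together, the full cost C(V, V') can be sketched in a few lines of NumPy. This is an illustrative restatement of the equations above, not the released implementation; the dictionary-based data layout is our own choice for exposition.

```python
import numpy as np

def distance(x, y):
    # Cosine-derived distance: d(v_i, v_j) = 1 - cos(v_i, v_j)
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def tau(x):
    # Hinge function tau(x) = max(0, x), imposing a margin on each term
    return max(0.0, x)

def counter_fit_cost(V_orig, V, antonyms, synonyms, neighbourhoods,
                     delta=1.0, gamma=0.0, k1=1.0, k2=1.0, k3=1.0):
    """C(V, V') = k1*AR(V') + k2*SA(V') + k3*VSP(V, V').

    V_orig and V map word indices to vectors (original and transformed);
    antonyms and synonyms are sets of index pairs; neighbourhoods maps
    each index i to its pre-computed radius-rho neighbourhood N(i).
    """
    ar = sum(tau(delta - distance(V[u], V[w])) for u, w in antonyms)
    sa = sum(tau(distance(V[u], V[w]) - gamma) for u, w in synonyms)
    vsp = sum(tau(distance(V[i], V[j]) - distance(V_orig[i], V_orig[j]))
              for i, neighbours in neighbourhoods.items() for j in neighbours)
    return k1 * ar + k2 * sa + k3 * vsp
```

In practice this cost is minimised with SGD for 20 epochs, as described above; the sketch only evaluates the objective for a given vector assignment.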
  Model / Word Vectors                                  ρ
  Neural MT Model (Hill et al., 2014a)                  0.52
  Symmetric Patterns (Schwartz et al., 2015)            0.56
  Non-distributional Vectors (Faruqui and Dyer, 2015)   0.58
  GloVe vectors (Pennington et al., 2014)               0.41
  GloVe vectors + Retrofitting                          0.53
  GloVe + Counter-fitting                               0.58
  Paragram-SL999 (Wieting et al., 2015)                 0.69
  Paragram-SL999 + Retrofitting                         0.68
  Paragram-SL999 + Counter-fitting                      0.74
  Inter-annotator agreement                             0.67
  Annotator/gold standard agreement                     0.78

Table 2: Performance on SimLex-999. Retrofitting uses the code and (PPDB) data provided by the authors

4 Experiments

4.1 Word Vectors and Semantic Lexicons

Two different collections of pre-trained word vectors were used as input to the counter-fitting procedure:

1. GloVe Common Crawl 300-dimensional vectors made available by Pennington et al. (2014).

2. Paragram-SL999 300-dimensional vectors made available by Wieting et al. (2015).

The synonymy and antonymy constraints were obtained from two semantic lexicons:

1. PPDB 2.0 (Pavlick et al., 2015): the latest release of the Paraphrase Database. A new feature of this version is that it assigns relation types to its word pairs. We identify the Equivalence relation with synonymy and Exclusion with antonymy. We used the largest available (XXXL) version of the database and only considered single-token terms.

2. WordNet (Miller, 1995): a well known semantic lexicon which contains vast amounts of high quality human-annotated synonym and antonym pairs. Any two words in our vocabulary which had antonymous word senses were considered antonyms; WordNet synonyms were not used.

In total, the lexicons yielded 12,802 antonymy and 31,828 synonymy pairs for our vocabulary, which consisted of the 76,427 most frequent words in OpenSubtitles, obtained from invokeit.wordpress.com/frequency-word-lists/.

4.2 Improving Lexical Similarity Predictions

In this section, we show that counter-fitting pre-trained word vectors with linguistic constraints improves their usefulness for judging semantic similarity. We use Spearman's rank correlation coefficient with the SimLex-999 dataset, which contains word pairs ranked by a large number of annotators instructed to consider only semantic similarity.

  Semantic Resource                       GloVe   Paragram
  Baseline (no linguistic constraints)    0.41    0.69
  PPDB− (PPDB antonyms)                   0.43    0.69
  PPDB+ (PPDB synonyms)                   0.46    0.68
  WordNet− (WordNet antonyms)             0.52    0.74
  PPDB− and PPDB+                         0.50    0.69
  WordNet− and PPDB−                      0.53    0.74
  WordNet− and PPDB+                      0.58    0.74
  WordNet− and PPDB− and PPDB+            0.58    0.74

Table 3: SimLex-999 performance when different sets of linguistic constraints are used for counter-fitting

Table 2 contains a summary of recently reported competitive scores for SimLex-999, as well as the performance of the unaltered, retrofitted and counter-fitted GloVe and Paragram-SL999 word vectors. To the best of our knowledge, the 0.685 figure reported for the latter represents the current high score. This figure is above the average inter-annotator agreement of 0.67, which has been referred to as the ceiling performance in most work up to now.

In our opinion, the average inter-annotator agreement is not the only meaningful measure of ceiling performance. We believe it also makes sense to compare: a) the model ranking's correlation with the gold standard ranking to: b) the average rank correlation that individual human annotators' rankings achieved with the gold standard ranking. The SimLex-999 authors have informed us that the average annotator agreement with the gold standard is 0.78.² As shown in Table 2, the reported performance of all the models and word vectors falls well below this figure.

² This figure is now reported as a potentially fairer ceiling performance on the SimLex-999 website: http://www.cl.cam.ac.uk/~fh295/simlex.html.

Retrofitting pre-trained word vectors improves GloVe vectors, but not the already semantically specialised Paragram-SL999 vectors. Counter-fitting substantially improves both sets of vectors, showing that injecting antonymy relations goes a long way
towards improving word vectors for the purpose of making semantic similarity judgements.

Table 3 shows the effect of injecting different categories of linguistic constraints. GloVe vectors benefit from all three sets of constraints, whereas the quality of Paragram vectors, already exposed to PPDB, only improves with the injection of WordNet antonyms.

Table 4 illustrates how incorrect similarity predictions based on the original (Paragram) vectors can be fixed through counter-fitting. The table presents eight false synonyms and nine false antonyms: word pairs with predicted rank in the top (bottom) 200 word pairs and gold standard rank 500 or more positions lower (higher). Eight of these errors are fixed by counter-fitting: the difference between predicted and gold-standard ranks is now 100 or less. Interestingly, five of the eight corrected word pairs do not appear in the sets of linguistic constraints; these are indicated by double ticks in the table. This shows that secondary (i.e. indirect) interactions through the three terms of the cost function do contribute to the semantic content of the transformed vector space.

  False Synonyms      Fixed    False Antonyms     Fixed
  sunset, sunrise     ✓        dumb, dense
  forget, ignore               adult, guardian
  girl, maid                   polite, proper     ✓✓
  happiness, luck     ✓✓       strength, might
  south, north        ✓        water, ice
  go, come            ✓        violent, angry     ✓✓
  groom, bride                 cat, lion          ✓✓
  dinner, breakfast            laden, heavy       ✓✓
  -                            engage, marry

Table 4: Highest-error SimLex-999 word pairs using Paragram vectors (before counter-fitting)

4.3 Improving Dialogue State Tracking

Table 5 shows the dialogue state tracking datasets used for evaluation. These datasets come from the Dialogue State Tracking Challenges 2 and 3 (Henderson et al., 2014a; Henderson et al., 2014b).

  Dataset               Train   Dev   Test   #Slots
  Restaurants           1612    506   1117   4
  Tourist Information   1600    439   225    9

Table 5: Number of dialogues in the dataset splits used for the Dialogue State Tracking experiments

We used four different sets of word vectors to construct semantic dictionaries: the original GloVe and Paragram-SL999 vectors, as well as versions counter-fitted to each domain ontology. The constraints used for counter-fitting were all those from the previous section as well as antonymy constraints among the set of values for each slot. We treated all vocabulary words within some radius t of a slot value as rephrasings of that value. The optimal value of t was determined using a grid search: we generated a dictionary and trained a model for each potential t, then evaluated on the development set.

  Word Vector Space                  Restaurants   Tourist Info
  Baseline (no dictionary)           68.6          60.5
  GloVe                              72.5          60.9
  GloVe + Counter-fitting            73.4          62.8
  Paragram-SL999                     73.2          61.5
  Paragram-SL999 + Counter-fitting   73.5          61.9

Table 6: Performance of RNN belief trackers (ensembles of four models) with different semantic dictionaries

Table 6 shows the performance of RNN models which used the constructed dictionaries. The dictionaries induced from the pre-trained vectors substantially improved tracking performance over the baselines (which used no semantic dictionaries). The dictionaries created using the counter-fitted vectors improved performance even further. Contrary to the SimLex-999 experiments, starting from the Paragram vectors did not lead to superior performance, which shows that injecting the application-specific ontology is at least as important as the quality of the initial word vectors.

5 Conclusion

We have presented a novel counter-fitting method for injecting linguistic constraints into word vector space representations. The method efficiently post-processes word vectors to improve their usefulness for tasks which involve making semantic similarity judgements. Its focus on separating vector representations of antonymous word pairs led to substantial improvements on genuine similarity estimation tasks. We have also shown that counter-fitting can tailor word vectors for downstream tasks by using it to inject domain ontologies into word vectors used to construct semantic dictionaries for dialogue systems.

Acknowledgements

We would like to thank Felix Hill for help with the SimLex-999 evaluation. We also thank the anonymous reviewers for their helpful suggestions.
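The ontology-injection step of Section 3.1 and the radius-t dictionary construction of Section 4.3 can be sketched as follows. The toy ontology, function names and data layout are invented for illustration; this is a sketch of the two steps described in the text, not the released tool.

```python
import itertools

import numpy as np

def ontology_antonym_constraints(ontology):
    # Every pair of values of the same slot becomes an antonymy constraint
    constraints = set()
    for slot, values in ontology.items():
        constraints.update(itertools.combinations(sorted(values), 2))
    return constraints

def cosine_distance(x, y):
    # d(v_i, v_j) = 1 - cos(v_i, v_j)
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def semantic_dictionary(vectors, slot_values, t):
    # Rephrasings of each slot value: all vocabulary words within radius t
    return {value: [word for word, vec in vectors.items()
                    if word != value
                    and cosine_distance(vectors[value], vec) <= t]
            for value in slot_values}
```

In the paper's setting, the constraints produced by the first function would be added to the lexicon-derived constraints before counter-fitting, and t would be chosen by grid search on the development set.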
References

[Bansal et al.2014] Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2014. Tailoring continuous word representations for dependency parsing. In Proceedings of ACL.

[Bian et al.2014] Jiang Bian, Bin Gao, and Tie-Yan Liu. 2014. Knowledge-powered deep learning for word embedding. In Machine Learning and Knowledge Discovery in Databases.

[Curran2003] James Curran. 2003. From Distributional to Semantic Similarity. Ph.D. thesis, School of Informatics, University of Edinburgh.

[de Marneffe et al.2008] Marie-Catherine de Marneffe, Anna N. Rafferty, and Christopher D. Manning. 2008. Finding contradictions in text. In Proceedings of ACL.

[Faruqui and Dyer2015] Manaal Faruqui and Chris Dyer. 2015. Non-distributional word vector representations. In Proceedings of ACL.

[Faruqui et al.2015] Manaal Faruqui, Jesse Dodge, Sujay K. Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. 2015. Retrofitting Word Vectors to Semantic Lexicons. In Proceedings of NAACL HLT.

[Ganitkevitch et al.2013] Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The Paraphrase Database. In Proceedings of NAACL HLT.

[Hashimoto et al.2012] Chikara Hashimoto, Kentaro Torisawa, Stijn De Saeger, Jong-Hoon Oh, and Junichi Kazama. 2012. Excitatory or inhibitory: A new semantic orientation extracts contradiction and causality from the Web. In Proceedings of EMNLP-CoNLL.

[Henderson et al.2014a] Matthew Henderson, Blaise Thomson, and Jason D. Williams. 2014a. The Second Dialog State Tracking Challenge. In Proceedings of SIGDIAL.

[Henderson et al.2014b] Matthew Henderson, Blaise Thomson, and Jason D. Williams. 2014b. The Third Dialog State Tracking Challenge. In Proceedings of IEEE SLT.

[Henderson et al.2014c] Matthew Henderson, Blaise Thomson, and Steve Young. 2014c. Robust Dialog State Tracking using Delexicalised Recurrent Neural Networks and Unsupervised Adaptation. In Proceedings of IEEE SLT.

[Henderson et al.2014d] Matthew Henderson, Blaise Thomson, and Steve Young. 2014d. Word-Based Dialog State Tracking with Recurrent Neural Networks. In Proceedings of SIGDIAL.

[Hill et al.2014a] Felix Hill, Kyunghyun Cho, Sébastien Jean, Coline Devin, and Yoshua Bengio. 2014a. Embedding word similarity with neural machine translation. Computing Research Repository.

[Hill et al.2014b] Felix Hill, Roi Reichart, and Anna Korhonen. 2014b. SimLex-999: Evaluating Semantic Models with (Genuine) Similarity Estimation. Computing Research Repository.

[Kiela et al.2015] Douwe Kiela, Felix Hill, and Stephen Clark. 2015. Specializing word embeddings for similarity or relatedness. In Proceedings of EMNLP.

[Lin et al.2003] Dekang Lin, Shaojun Zhao, Lijuan Qin, and Ming Zhou. 2003. Identifying synonyms among distributionally similar words. In Proceedings of IJCAI.

[Liu et al.2015] Quan Liu, Hui Jiang, Si Wei, Zhen-Hua Ling, and Yu Hu. 2015. Learning semantic word embeddings based on ordinal knowledge constraints. In Proceedings of ACL.

[Marcu and Echihabi2002] Daniel Marcu and Abdessamad Echihabi. 2002. An unsupervised approach to recognizing discourse relations. In Proceedings of ACL.

[Mikolov et al.2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS.

[Miller1995] George A. Miller. 1995. WordNet: A Lexical Database for English. Communications of the ACM.

[Mohammad et al.2008] Saif Mohammad, Bonnie Dorr, and Graeme Hirst. 2008. Computing word-pair antonymy. In Proceedings of EMNLP.

[Mohammad et al.2013] Saif M. Mohammad, Bonnie J. Dorr, Graeme Hirst, and Peter D. Turney. 2013. Computing lexical contrast. Computational Linguistics, 39(3):555–590.

[Mrkšić et al.2015] Nikola Mrkšić, Diarmuid Ó Séaghdha, Blaise Thomson, Milica Gašić, Pei-Hao Su, David Vandyke, Tsung-Hsien Wen, and Steve Young. 2015. Multi-domain Dialog State Tracking using Recurrent Neural Networks. In Proceedings of ACL.

[Ono et al.2015] Masataka Ono, Makoto Miwa, and Yutaka Sasaki. 2015. Word Embedding-based Antonym Detection using Thesauri and Distributional Information. In Proceedings of NAACL HLT.

[Ó Séaghdha and Korhonen2014] Diarmuid Ó Séaghdha and Anna Korhonen. 2014. Probabilistic distributional semantics. Computational Linguistics, 40(3):587–631.

[Pavlick et al.2015] Ellie Pavlick, Pushpendre Rastogi, Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2015. PPDB 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification. In Proceedings of ACL.

[Pennington et al.2014] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of EMNLP.

[Schwartz et al.2015] Roy Schwartz, Roi Reichart, and Ari Rappoport. 2015. Symmetric pattern based word embeddings for improved word similarity prediction. In Proceedings of CoNLL.

[Turney2008] Peter D. Turney. 2008. A uniform approach to analogies, synonyms, antonyms, and associations. In Proceedings of COLING.

[Wieting et al.2015] John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2015. From paraphrase database to compositional paraphrase model and back. Transactions of the Association for Computational Linguistics.

[Yih et al.2012] Wen-Tau Yih, Geoffrey Zweig, and John C. Platt. 2012. Polarity inducing Latent Semantic Analysis. In Proceedings of ACL.

[Yu and Dredze2014] Mo Yu and Mark Dredze. 2014. Improving lexical embeddings with semantic knowledge. In Proceedings of ACL.

[Zanzotto et al.2009] Fabio Massimo Zanzotto, Marco Pennacchiotti, and Alessandro Moschitti. 2009. A machine learning approach to textual entailment recognition. Journal of Natural Language Engineering, 15(4):551–582.

[Zou et al.2013] Will Y. Zou, Richard Socher, Daniel M. Cer, and Christopher D. Manning. 2013. Bilingual word embeddings for phrase-based machine translation. In Proceedings of EMNLP.
