
IJIRST –International Journal for Innovative Research in Science & Technology| Volume 3 | Issue 12 | May 2017

ISSN (online): 2349-6010

A Survey on Semantic Similarity Measures


Aditi Gupta
PG Student
Department of Computer Science & Engineering
JSS Academy of Technical Education, Noida

Mr. Ajay Kumar
Assistant Professor
Department of Computer Science & Engineering
JSS Academy of Technical Education, Noida

Dr. Jyoti Gautam
Head of Department & Associate Professor
Department of Computer Science & Engineering

Abstract
Measuring the semantic similarity between words, sentences and concepts is an important task in information retrieval, document
clustering, web mining and word sense disambiguation. Semantic similarity is a measure of the extent to which two concepts
resemble each other in meaning. This survey discusses the existing similarity measures by partitioning them into two approaches:
corpus-based and knowledge-based. The features, performance, advantages and disadvantages of various semantic similarity
measures are discussed. The aim of this paper is to provide a concise evaluation of these measures and to help researchers select
the measure best suited to their requirements.
Keywords: Semantic similarity, Corpus-based similarity, Knowledge-based similarity, Semantic relatedness
_______________________________________________________________________________________________________

I. INTRODUCTION

Semantic similarity plays an important role in data processing, artificial intelligence and data mining. It is also useful in
information management, especially in environments such as the Semantic Web, where data may originate from different
sources and has to be integrated in a flexible way. Semantic similarity is a measure of the extent to which two concepts resemble
each other in meaning; the concepts can be words, sentences or paragraphs. It places concepts at distances from one another in a
semantic space such that the smaller the distance, the greater the similarity. In other words, semantic similarity identifies the
characteristics that concepts have in common. Concepts can be similar in two ways: lexically or semantically. Words are
lexically similar if they have similar character sequences, whereas semantic similarity is concerned with their meaning: different
words or concepts are semantically similar if they mean the same thing even when their lexical forms differ. The techniques used
to find the semantic similarity between words can be extended to find the similarity between phrases, sentences or paragraphs.
Lexical similarity is measured using string-based algorithms, and semantic similarity is measured using corpus-based and
knowledge-based algorithms.
Semantic similarity measures are used intensively in knowledge-based and semantic information retrieval systems to identify an
optimal match between user query terms and documents. They are also used in word sense disambiguation to identify the correct
sense of a term in a given context. Semantic similarity and semantic relatedness are distinct but related notions: for example,
"mother" and "child" are related terms but are not similar, since they have different meanings.
The survey is structured as follows: Section II highlights related work in the field of semantic similarity measures.
Section III describes corpus-based similarity measures and Section IV describes knowledge-based similarity measures.
Section V summarizes the various semantic similarity measures in tabular form. Section VI presents the conclusions of the survey.

II. RELATED WORK

This section presents a survey of various semantic similarity measures that have already been proposed. Table 1 summarizes the
related work in the field of semantic similarity measures.
Table – 1
Related work in semantic similarity measures

S. No | Title | Authors | Year | Objective
1 | Assessment of semantic similarity of concepts defined in ontology [23] | Parisa D. and Marek Reformat | 2013 | A method is proposed to determine the similarity between concepts defined in an ontology, focusing on the concepts and their semantic relations.
2 | An ontology-based semantic similarity measure for biomedical data – application to radiology reports [18] | Thusitha Mabotuwana, Michael C. Lee and Eric V. | 2013 | A notion of semantic similarity is used to overcome the limitation of direct concept matching.
3 | A hybrid knowledge-based and data-driven approach to identifying semantically similar concepts [19] | Rimma Pivovarov and Noémie Elhadad | 2012 | A comprehensive method is proposed which computes a similarity score for a concept pair by combining data-driven and ontology-driven knowledge.
4 | Ontology-based semantic similarity: a new feature-based approach [24] | David Sanchez, Montserrat Batet, David Isern and Aida Valls | 2012 | Ontology-based approaches such as edge counting, feature-based approaches and measures based on information content are classified, and a new ontology-based measure relying on the exploitation of taxonomical features is proposed.
5 | Unsupervised semantic similarity computation between terms using web documents [25] | Elias Iosif and Alexandros Potamianos | 2012 | The proposed method requires only a context-based metric computed over web documents and does not require other resources such as page counts, ontologies or snippets.
6 | Semantic similarity estimation in the biomedical domain: an ontology-based information-theoretic perspective [20] | David Sanchez and Montserrat Batet | 2011 | Along with the information-content-based semantic measure, ontology-based edge-counting measures are redefined in terms of information content.
7 | Measuring semantic similarity between biomedical concepts within multiple ontologies [22] | Hisham Al-Mubaid and Hoa A. Nguyen | 2009 | The proposed algorithm is based on three features: cross-modified path length between concepts, a new feature for the common specificity of concepts in the ontology, and the local granularity of ontology clusters.
8 | An approach for measuring semantic similarity between words using multiple information sources [21] | Y. Li, Z. A. Bandar and D. McLean | 2003 | A new measure is proposed which combines multiple information sources non-linearly to measure semantic similarity.

III. CORPUS BASED SIMILARITY

A corpus is a large collection of written material and transcribed speech used to study and describe a language. A huge
collection of documents and other information is available online and can be used to determine the semantic relatedness among
different words or concepts. Figure 1 lists the corpus-based similarity measures discussed below.

Fig. 1: Corpus-based similarity measures

Latent Semantic Analysis (LSA):


Latent Semantic Analysis is one of the earliest semantic similarity methods and was proposed by Landauer and Dumais [26]. LSA
is based on the assumption that words with similar meanings occur in similar pieces of text. The LSA model analyses the text,
calculates term frequencies and creates a weighted term-document matrix: rows represent the unique terms in the collection,
columns represent the text samples, and each cell holds the weighted frequency of a term in the corresponding text sample. A
mathematical technique called singular value decomposition (SVD) is then used to reduce the dimensionality, reducing the
number of columns while preserving the number of rows. Terms are compared by evaluating the cosine of the angle between the
vectors formed by any two rows. The LSA model overcomes limitations of the traditional vector space model such as high
dimensionality and sparseness.
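To make the pipeline concrete, the following minimal sketch builds a term-document matrix, reduces it with truncated SVD and compares two terms with cosine similarity. The toy documents, the raw-count weighting and the choice of two latent dimensions are assumptions for illustration only; a realistic setup would use a large corpus and a weighting scheme such as log-entropy or TF-IDF.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus standing in for a large document collection (assumption).
docs = [
    "the car is driven on the road",
    "the truck is driven on the highway",
    "the chef cooked dinner in the kitchen",
]

# Term-document matrix: one row per unique term, one column per document.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)           # shape: (documents, terms)
term_doc = X.T                               # shape: (terms, documents)

# Reduce the number of columns (document dimensions) while keeping one row per term.
svd = TruncatedSVD(n_components=2, random_state=0)
term_vectors = svd.fit_transform(term_doc)   # shape: (terms, 2)

vocab = vectorizer.vocabulary_               # term -> row index in term_doc

def term_similarity(t1, t2):
    """Cosine of the angle between the reduced vectors of two terms."""
    v1 = term_vectors[vocab[t1]].reshape(1, -1)
    v2 = term_vectors[vocab[t2]].reshape(1, -1)
    return cosine_similarity(v1, v2)[0, 0]

print(term_similarity("car", "truck"))       # related terms: higher score expected
print(term_similarity("car", "kitchen"))     # unrelated terms: lower score expected
```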
Generalized Latent Semantic Analysis (GLSA):
Generalized Latent Semantic Analysis extends LSA by focusing on term vectors instead of the document-term representation. In
GLSA [3], the weighted matrix is constructed in the last step to provide weights for the term vectors, unlike in LSA. GLSA first
measures the semantic relatedness between terms and then applies a dimensionality reduction technique. Whereas LSA uses only
the cosine function to compare term vectors, GLSA can combine any measure of semantic association between terms with any
method of dimensionality reduction, which makes it a more general approach than LSA.


Hyper-space Analogue to Language (HAL):


HAL uses word co-occurrences to form a semantic space. A matrix is created in which both rows and columns represent words,
and each matrix element represents the strength of association between the corresponding row and column words [1, 2]. As the
text is analysed, a focus word is selected and compared with its neighbouring words, which are termed co-occurring words. The
co-occurrence weight is inversely proportional to the distance from the focus word, and these weights are accumulated in the
matrix. Dimensionality reduction can be performed by dropping columns with low entropy. Moreover, HAL captures the pattern
of word occurrences by treating co-occurrences differently depending on whether the neighbour appears before or after the focus
word.
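A minimal sketch of the matrix construction is shown below, assuming a toy text and a small window (HAL originally used a ten-word window over a large corpus):

```python
from collections import defaultdict

def build_hal_matrix(tokens, window=5):
    """Build a HAL-style co-occurrence matrix.

    The row of each word accumulates weights for the words that PRECEDE it,
    with closer neighbours weighted more heavily (window - distance + 1).
    """
    matrix = defaultdict(lambda: defaultdict(float))
    for i, focus in enumerate(tokens):
        for d in range(1, window + 1):
            if i - d < 0:
                break
            matrix[focus][tokens[i - d]] += window - d + 1
    return matrix

text = "the horse raced past the barn fell".split()   # toy text (assumption)
hal = build_hal_matrix(text, window=2)
print(dict(hal["barn"]))    # {'the': 2.0, 'past': 1.0}: left-context weights for "barn"
```

A word's full HAL vector is then the concatenation of its row (left-context co-occurrences) and its column (right-context co-occurrences), which is what lets the model distinguish neighbours that appear before the focus word from those that appear after it.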
Semantic Similarity using Web Search Engines:
This approach [16] uses web search results to find the semantic relatedness between two words, based on the page counts and
snippets returned by a web search engine. Suppose the similarity between two words A and B is to be computed. The method
issues queries for A and for B separately and records their page counts; in the same way, the conjunctive query containing both
A and B is issued and its page count is recorded. The snippets are used to extract lexical patterns in which A and B co-occur, and
the observed patterns are ranked according to their ability to semantically relate different word pairs. The page-count-based
scores and the snippet-based scores are then integrated using a support vector machine (SVM) to evaluate the semantic similarity
between the words.
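A hedged sketch of the page-count part of such a measure is given below. `page_count` is a hypothetical helper standing in for a real search engine API, and the snippet-pattern extraction and SVM integration steps are omitted; the two scores are co-occurrence measures in the style of a web-based Jaccard coefficient and a PMI-style variant.

```python
import math

def page_count(query):
    """Hypothetical helper: number of hits a web search engine reports for `query`.
    A real implementation would call a search engine API here."""
    raise NotImplementedError

def web_jaccard(a, b, threshold=5):
    """Jaccard-style co-occurrence score computed from page counts."""
    h_a, h_b = page_count(a), page_count(b)
    h_ab = page_count(f'"{a}" "{b}"')        # pages containing both words
    if h_ab <= threshold:                     # too rare to be meaningful
        return 0.0
    return h_ab / (h_a + h_b - h_ab)

def web_pmi(a, b, total_pages=10**10, threshold=5):
    """PMI-style score from page counts; `total_pages` approximates the
    number of indexed pages (assumption)."""
    h_a, h_b = page_count(a), page_count(b)
    h_ab = page_count(f'"{a}" "{b}"')
    if h_ab <= threshold:
        return 0.0
    return math.log2((h_ab / total_pages) / ((h_a / total_pages) * (h_b / total_pages)))
```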
Similarity based on Corpus Statistics and Lexical Taxonomy:
In this approach [9], a lexical taxonomy is integrated with statistical information derived from a corpus. The method enhances
the edge-based approach to similarity calculation by adding the information used by node-based methods, i.e. corpus-derived
information content. Because both edge-based and node-based evidence is used in computing the similarity, the algorithm is
claimed to outperform purely edge-based or purely node-based computational methods.
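In its commonly used simplified form, the resulting distance combines the corpus-derived information content IC(c) = -log p(c) of each concept with the taxonomy-derived lowest common subsumer (LCS):

```latex
\mathrm{dist}_{jcn}(c_1, c_2) = IC(c_1) + IC(c_2) - 2\, IC\!\left(\mathrm{LCS}(c_1, c_2)\right),
\qquad IC(c) = -\log p(c)
```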
Explicit Semantic Analysis:
In this method [4], the semantic relatedness between any two arbitrary texts is computed. The meaning of a word is represented
as a weighted vector over Wikipedia concepts: words are mapped to high-dimensional vectors in which each component is the
TF-IDF weight of the word in the corresponding Wikipedia article. The semantic similarity between the resulting vectors is
computed using the cosine similarity measure.
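A minimal sketch of the idea, assuming a toy collection of three "articles" in place of Wikipedia:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy articles standing in for the Wikipedia concept space (assumption).
articles = {
    "Automobile": "a car is a wheeled motor vehicle used for transport",
    "Cooking":    "cooking is the art of preparing food for consumption",
    "Road":       "a road is a way leading from one place to another",
}

vectorizer = TfidfVectorizer()
doc_term = vectorizer.fit_transform(articles.values())   # articles x terms
vocab = vectorizer.vocabulary_

def concept_vector(word):
    """Represent a word by its TF-IDF weight in every article (its ESA vector)."""
    col = vocab.get(word)
    if col is None:
        raise KeyError(f"'{word}' does not occur in the article collection")
    return doc_term[:, col].T.toarray()        # shape: (1, number of articles)

print(cosine_similarity(concept_vector("car"), concept_vector("vehicle"))[0, 0])
print(cosine_similarity(concept_vector("car"), concept_vector("food"))[0, 0])
```

For longer texts, ESA combines the concept vectors of the constituent words (e.g. as a weighted centroid) before applying the cosine measure; the sketch above compares single words only.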
The Cross Language Explicit Semantic Analysis:
This is a multilingual generalization of ESA. In CL-ESA [5], the semantic similarity of two documents written in different
languages is computed by mapping both into a shared, language-independent concept space and comparing them with the cosine
similarity function.

IV. KNOWLEDGE BASED SIMILARITY

Ontologies, taxonomies and semantic networks are knowledge representation forms used in information retrieval, and various
methods use these representations to find the semantic similarity between terms or concepts. Knowledge-based similarity uses
information derived from semantic networks to identify the degree of similarity between words. WordNet is the most widely used
semantic network: it is a large lexical database of English in which nouns, verbs, adjectives and adverbs are grouped into sets of
synonyms known as synsets, and these synsets are linked to each other by conceptual-semantic and lexical relations.
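For illustration, the WordNet interface bundled with NLTK (one possible tool; any WordNet API would serve) can be used to look up the synsets of a word and browse their relations:

```python
import nltk
nltk.download("wordnet", quiet=True)            # fetch the WordNet data the first time
from nltk.corpus import wordnet as wn

# Each synset groups synonymous senses and carries a gloss (definition).
for synset in wn.synsets("car")[:3]:
    print(synset.name(), "-", synset.definition())

car = wn.synset("car.n.01")
print([lemma.name() for lemma in car.lemmas()])   # synonyms in the synset
print([h.name() for h in car.hypernyms()])        # more general concepts (is-a links)
```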
Knowledge-based measures can be classified into measures of semantic similarity and measures of semantic relatedness.
Semantic similarity approaches use information content for the evaluation of similarity and are also termed node-based
approaches. Semantic relatedness approaches deal with the more general notion of relatedness rather than with strict similarity;
they are also termed edge-based approaches and use conceptual distance for the evaluation of relatedness. Semantic similarity
can be viewed as a special case of relatedness between words.
There are various measures of semantic similarity: some are based on information content and the rest on path length. These
measures are shown in Figure 2.


Fig. 2: Knowledge-based semantic measures

Resnik (res) [8] proposed quantifying the information content shared by different concepts in order to measure the similarity
between them: the more information two concepts share, the more similar they are. The measure is therefore the information
content of the Least Common Subsumer, i.e. the most informative concept that subsumes both, and its value is always
non-negative. The Lin (lin) [6] and Jiang and Conrath (jcn) [9] measures extend the information content of the Least Common
Subsumer with the information content of the compared concepts themselves: lin scales twice the information content of the
Least Common Subsumer by the sum of the information contents of the two concepts, whereas jcn computes a distance equal to
the difference between that sum and twice the information content of the Least Common Subsumer.
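Written out, with IC(c) = -log p(c) estimated from a corpus and LCS the Least Common Subsumer, the two similarity scores are:

```latex
\mathrm{sim}_{res}(c_1, c_2) = IC\!\left(\mathrm{LCS}(c_1, c_2)\right)
\qquad
\mathrm{sim}_{lin}(c_1, c_2) = \frac{2\, IC\!\left(\mathrm{LCS}(c_1, c_2)\right)}{IC(c_1) + IC(c_2)}
```

The jcn distance is the formula already given in Section III: the sum of the two concepts' information content minus twice that of their Least Common Subsumer.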
Leacock and Chodorow (lch) [10] measure similarity using path length: they compute the shortest path connecting the two word
senses, scale it by the maximum depth of the taxonomy, and return a score denoting the similarity of the two senses. Wu and
Palmer (wup) [11] measure similarity by returning a score based on the depths of the two senses in the taxonomy and the depth of
their Least Common Subsumer. The path length measure (path) returns a score based on the shortest path between the word senses
in the is-a taxonomy.
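All of the path-based and information-content measures above are exposed through NLTK's WordNet interface; a short sketch, assuming the `wordnet` and `wordnet_ic` data packages can be downloaded:

```python
import nltk
for pkg in ("wordnet", "wordnet_ic"):
    nltk.download(pkg, quiet=True)

from nltk.corpus import wordnet as wn, wordnet_ic

dog, cat = wn.synset("dog.n.01"), wn.synset("cat.n.01")
brown_ic = wordnet_ic.ic("ic-brown.dat")    # information content estimated from the Brown corpus

print("path:", dog.path_similarity(cat))            # shortest is-a path
print("lch: ", dog.lch_similarity(cat))             # Leacock & Chodorow
print("wup: ", dog.wup_similarity(cat))             # Wu & Palmer
print("res: ", dog.res_similarity(cat, brown_ic))   # Resnik
print("lin: ", dog.lin_similarity(cat, brown_ic))   # Lin
print("jcn: ", dog.jcn_similarity(cat, brown_ic))   # Jiang & Conrath
```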
Hirst and St-Onge (hso) [12] measure semantic relatedness by finding lexical chains connecting the word senses. Lesk (lesk) [13]
measures semantic relatedness by finding overlaps between the glosses of the synsets of the two terms; the semantic relatedness
score is the sum of the squares of the overlap lengths. The gloss vector measure (vector) [14] measures semantic relatedness by
building a co-occurrence vector for the gloss of each word sense and comparing the resulting vectors.
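A much-simplified sketch of the gloss-overlap scoring is given below; the adapted Lesk measure additionally considers the glosses of related synsets and ignores overlaps made up solely of function words, refinements omitted here:

```python
def gloss_overlap_score(gloss_a, gloss_b):
    """Greedily find the longest contiguous word sequence shared by two glosses,
    add the square of its length to the score, remove it, and repeat until no
    shared sequence remains."""
    a, b = gloss_a.lower().split(), gloss_b.lower().split()
    score = 0
    while True:
        best = None                              # (start_a, start_b, length)
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                    k += 1
                if k and (best is None or k > best[2]):
                    best = (i, j, k)
        if best is None:
            return score
        i, j, k = best
        score += k * k                           # longer overlaps count quadratically
        del a[i:i + k]                           # remove the matched phrase so it
        del b[j:j + k]                           # is not counted twice

# Toy glosses (assumption); real glosses would come from WordNet synset definitions.
print(gloss_overlap_score(
    "a financial institution that accepts deposits",
    "an institution that accepts deposits and makes loans"))   # prints 16 (4 shared words squared)
```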

V. SEMANTIC SIMILARITY MEASURES

Table 2 gives a review of various semantic similarity measures.


Table – 2
Semantic Similarity Measures
Method | Principle | Measure | Feature | Advantage | Disadvantage
Path based | Length of the path linking the two word senses | Shortest Path | Number of edges between the concepts | Simple measure | Different pairs with the same shortest path length get the same similarity
Path based | Length of the path linking the two word senses | Wu & Palmer | Path length augmented by the path from the subsumer to the root | Simple measure | Different pairs with the same lowest common subsumer and the same path length get the same similarity
Path based | Length of the path linking the two word senses | L&C | Number of edges between the concepts | Simple measure | Different pairs with the same shortest path length get the same similarity
Content based | Concepts sharing common information are similar | Resnik | Information content of the lowest common subsumer | Simple measure | Different pairs with the same lowest common subsumer get the same similarity
Content based | Concepts sharing common information are similar | Lin | Information content of the lowest common subsumer and of the compared concepts | Considers the information content of the compared concepts | Different pairs with the same sum of information content get the same similarity
Feature based | Concepts having common features are similar | Tversky | Compares the features of the concepts | Considers features while computing similarity | Computationally complex


VI. CONCLUSION

Semantic similarity plays a vital role in information retrieval, natural language processing and artificial intelligence. Based on a
thorough literature study, this paper gives a brief review of various semantic similarity measures. Two types of semantic
similarity measure are discussed: corpus-based and knowledge-based. Corpus-based similarity measures determine the similarity
between concepts using information obtained from large corpora; algorithms such as LSA, GLSA, HAL, ESA, CL-ESA, web
search engine based measures and the combination of corpus statistics with a lexical taxonomy are discussed. Knowledge-based
measures use information gained from semantic networks to compute the similarity among concepts; semantic similarity
algorithms such as res, lin, jcn, lch, wup and path, and semantic relatedness algorithms such as hso, lesk and vector are discussed.
This paper provides a concise evaluation of all these measures and helps researchers select the measure best suited to their
requirements.

VII. REFERENCES
[1] Lund, K., Burgess, C. & Atchley, R. A. (1995). Semantic and associative priming in a high-dimensional semantic space. Cognitive Science Proceedings
(LEA), 660-665.
[2] Lund, K. & Burgess, C. (1996). Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments &
Computers, 28(2),203-208.
[3] Matveeva, I., Levow, G., Farahat, A. & Royer, C. (2005). Generalized latent semantic analysis for term representation. In Proc. of RANLP.
[4] Gabrilovich E. & Markovitch, S. (2007). Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis, Proceedings of the 20th
International Joint Conference on Artificial Intelligence, pages 6–12.
[5] Martin, P., Benno, S. & Maik, A.(2008). A Wikipedia-based multilingual retrieval model. Proceedings of the 30th European Conference on IR Research
(ECIR), pp. 522-530.
[6] Lin, D. (1998). An information-theoretic definition of similarity. In Proceedings of the 15th International Conference on Machine Learning, Madison, WI.
[7] Mihalcea, R., Corley, C. & Strapparava, C. (2006). Corpus based and knowledge-based measures of text semantic similarity. In Proceedings of the
American Association for Artificial Intelligence.(Boston, MA).
[8] Resnik, P. (1995). Using information content to evaluate semantic similarity. In Proceedings of the 14th International Joint Conference on Artificial
Intelligence, Montreal, Canada.
[9] Jiang, J. & Conrath, D. (1997). Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of the International Conference on
Research in Computational Linguistics, Taiwan.
[10] Leacock, C. & Chodorow, M. (1998). Combining local context and WordNet sense similarity for word sense identification. In WordNet, An Electronic
Lexical Database. The MIT Press.
[11] Wu, Z.& Palmer, M. (1994). Verb semantics and lexical selection. In Proceedings of the 32nd Annual Meeting of the Association for Computational
Linguistics, Las Cruces, New Mexico.
[12] Hirst, G. & St-Onge, D. (1998). Lexical chains as representations of context for the detection and correction of malapropisms. In C. Fellbaum, editor,
WordNet: An electronic lexical database , pp 305–332. MIT Press.
[13] Banerjee ,S. & Pedersen, T.(2002). An adapted Lesk algorithm for word sense disambiguation using WordNet. In Proceedings of the Third International
Conference on Intelligent Text Processing and Computational Linguistics, , Mexico City, pp 136–145.
[14] Patwardhan, S. (2003). Incorporating dictionary and corpus information into a context vector measure of semantic relatedness. Master's thesis, University
of Minnesota, Duluth.
[15] Tversky, A. (1977). Features of Similarity. Psychological Review, 84(4):327-352.
[16] Danushka Bollegala, Yutaka Matsuo, Mitsuru Ishizuka, “Measuring semantic similarity between words using web search engines”, WWW, vol 7, pp.
757—766, 2007.
[17] Thabet Slimani. Description and evaluation of semantic similarity measures approaches. International Journal of Computer Applications (0975-8887).
Volume 80- No.10, October 2013.
[18] Thusitha Mabotuwana et al. An ontology based similarity measure for biomedical data- Application to radiology reports. Journal of Biomedical
Informatics; 2013. http://dx.doi.org/10.1016 /j.jbi.2013.06.013
[19] Pivovarov R and Elhadad N. A hybrid knowledge-based and data-driven approach to identifying semantically similar concepts. Journal of Biomedical
Informatics. 2012; 45(3):471–81.
[20] David Sanchez and Montserrat Batet. Semantic similarity estimation in the biomedical domain: An ontology-based information theoretic perspective.
Journal of Biomedical Informatics 44 (2011) 749–759. doi:10.1016/j.jbi.2011.03.013
[21] Yuhua Li, et al. An approach for measuring semantic similarity measure between words using multiple information sources. IEEE transactions on
knowledge and data engineering, vol.15, no.4, july/august 2003.
[22] Hisham Al-Mubaid and Hoa A.Nguyen. Measuring semantic similarity between biomedical concepts within multiple ontologies. IEEE transactions on
systems, man and cybernetics-part c: applications and reviews, vol.39, no.4, july 2009.
[23] Parisa D, et al. Assessment of semantic similarity of concepts defined in ontology. Journal of Information Sciences (2013). Doi: 10.1016/j.ins.2013.06.056.
[24] David Sanchez, et al. Ontology based semantic similarity: A new feature-based approach. Journal of expert systems with applications 39(2012) 7718-7728.
[25] Elias Iosif and Alexandros Potamianos. Unsupervised semantic similarity computation between terms using web documents. IEEE Transactions on
Knowledge and Data Engineering, vol. 22, no. 11, November 2012.
[26] Landauer, T.K. & Dumais, S.T. (1997). A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of
knowledge. Psychological Review, 104.
