Uploaded by AngelRibeiro10

Properties of Levenshtein, N-Gram, Cosine and Jaccard Distance Coefficients - In Sentence Matching - Cross Validated


string B: 'I heard the cafeteria is serving roast-beef sandwiches today'.

Formulas:

Levenshtein distance: the minimum number of insertions, deletions, or substitutions needed to convert string a into string b.
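The definition above can be sketched with the standard dynamic-programming recurrence. This is a minimal illustration, not code from the question: the function name `levenshtein` and the example inputs are hypothetical, and the function accepts any sequence, so it works on characters (a string) or on words (a list of tokens).

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of tokens)."""
    # prev[j] = distance between the first i-1 items of a and the first j items of b
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # turning i items of a into an empty sequence takes i deletions
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


# Hypothetical inputs, chosen only for illustration.
print(levenshtein("kitten", "sitting"))                                # 3
print(levenshtein("the cat sat".split(), "the dog sat down".split()))  # 2
```

Whether the unit of comparison is a character or a word changes the result substantially, so it is worth stating which one is used when reporting a Levenshtein distance for sentences.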

N-gram distance: the sum of absolute differences between the n-gram count vectors of the two strings. As an example, the first 3 elements of the bi-gram vectors for strings A and B would be (1, 1, 1) and (0, 0, 0), respectively.
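A minimal sketch of this n-gram distance follows. The names `ngrams` and `ngram_distance` and the example inputs are hypothetical. The question does not say whether its bi-grams are built from words or from letters, so the sketch takes any sequence: pass a string for character n-grams or a list of tokens for word n-grams.

```python
from collections import Counter


def ngrams(seq, n):
    """All contiguous n-grams of a sequence, as tuples."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def ngram_distance(a, b, n=2):
    """Sum of absolute differences between the n-gram count vectors of a and b."""
    ca, cb = Counter(ngrams(a, n)), Counter(ngrams(b, n))
    # A Counter returns 0 for an n-gram it has never seen,
    # so n-grams present in only one input contribute their full count.
    return sum(abs(ca[g] - cb[g]) for g in ca.keys() | cb.keys())


# Hypothetical inputs, chosen only for illustration.
print(ngram_distance("abcd", "abce"))                                # 2
print(ngram_distance("the cat sat".split(), "the cat ran".split()))  # 2
```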

Cosine similarity:

$$\frac{a \cdot b}{\sqrt{a \cdot a}\,\sqrt{b \cdot b}}$$

Jaccard similarity:

$$\frac{\mathrm{length}(a \cap b)}{\mathrm{length}(a \cup b)}$$
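The cosine and Jaccard formulas can be sketched as follows, under the assumption that each string is split into word tokens: cosine is computed on word-count vectors and Jaccard on sets of distinct words. The function names and the two example sentences are hypothetical and are not the strings from the question.

```python
import math
from collections import Counter


def cosine_similarity(a_tokens, b_tokens):
    """Cosine of the angle between two token-count vectors."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty input
    return dot / (norm_a * norm_b)


def jaccard_similarity(a_tokens, b_tokens):
    """Size of the intersection over size of the union of the token sets."""
    a, b = set(a_tokens), set(b_tokens)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)


# Hypothetical inputs, chosen only for illustration.
x = "the cat sat on the mat".split()
y = "the cat lay on the rug".split()
print(round(cosine_similarity(x, y), 2))   # 0.75
print(round(jaccard_similarity(x, y), 2))  # 0.43
```

Because Jaccard works on sets, the repeated word "the" counts once, whereas cosine weights it by its count of 2 in each sentence.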

word)

Bigram distance = 14

Cosine similarity = 0.33

Jaccard similarity = 0.2

I would like to understand the pros and cons of using each of these (dis)similarity measures. If possible, it would be nice to understand these pros and cons in the example sentences, but if you have an example that better illustrates the differences, please let me know. Also, I realize that I can scale Levenshtein distance by the number of words in the text, but that wouldn't work for the bigram distance, since it would be greater than 1.

To start, it seems that cosine and Jaccard provide similar results. Jaccard is actually much less computationally intensive and is also (a little bit) easier to explain to a layman.
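On the scaling point, one possible normalization is sketched below. It is an assumption of this sketch, not something proposed in the thread: Levenshtein distance is divided by the length of the longer sequence, and the n-gram distance is divided by the total number of n-grams in both sequences. Both divisors are upper bounds on the corresponding distance, so each result lies between 0 and 1. All names and inputs are hypothetical.

```python
from collections import Counter


def levenshtein(a, b):
    """Edit distance between two sequences (dynamic programming, one row at a time)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


def normalized_levenshtein(a, b):
    """Edit distance divided by the longer length, which it can never exceed."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def normalized_ngram_distance(a, b, n=2):
    """N-gram distance divided by the total n-gram count of both inputs."""
    ca = Counter(tuple(a[i:i + n]) for i in range(len(a) - n + 1))
    cb = Counter(tuple(b[i:i + n]) for i in range(len(b) - n + 1))
    total = sum(ca.values()) + sum(cb.values())
    diff = sum(abs(ca[g] - cb[g]) for g in ca.keys() | cb.keys())
    return diff / total if total else 0.0


# Hypothetical inputs, chosen only for illustration.
s = "the cat sat".split()
t = "the dog sat down".split()
print(normalized_levenshtein(s, t))     # 0.5
print(normalized_ngram_distance(s, t))  # 1.0
```

In this example the two sentences share no word bigram, so the normalized bigram distance reaches its maximum of 1.0 even though half of the words line up under Levenshtein.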

matsuo_basho


A good way to start understanding the differences is to dig up their formulas, all expressed in a single a-b-c-d "binary data form", such as used in this answer, for example. ttnphns Jun 20 '16 at 18:10

@ttnphns, I understand the differences between the algorithms; I'm just not clear on a situation where, for example, cosine similarity would be superior to Jaccard similarity. matsuo_basho Jun 20 '16 at 18:14

They are not "algorithms". They are alternative proximity measures. ttnphns Jun 20 '16 at 18:15

If you know the formulas of them all, why not show them in your question and then ask "how are their properties different given these formulas? I expect this (something) but I don't understand that (something)". That would make your question specific and show your effort. So far, the question is too broad. ttnphns Jun 20 '16 at 18:19

@ttnphns, you're right, added the formulas now. matsuo_basho Jun 20 '16 at 20:53

Levenshtein is a specific form of "alignment" distance, and it compares sequences of elements, i.e. both the content of elements and their order. Cosine and Jaccard compare only content (an element is, say, a letter). Bi-gram distance compares the content of elements, but an element is defined specifically as a 2-letter chunk. ttnphns Jun 21 '16 at 9:52
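The order-versus-content distinction in the last comment can be illustrated with a small sketch. It uses words rather than letters as elements, which is an assumption of this example, and the two sentences are hypothetical. Reordering the words leaves the bag of words unchanged, so Jaccard (and cosine on counts) sees the sentences as identical, while Levenshtein does not.

```python
from collections import Counter


def levenshtein(a, b):
    """Edit distance between two sequences; sensitive to element order."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


def jaccard_similarity(a, b):
    """Set overlap; ignores both order and repetition."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if (sa or sb) else 1.0


# Hypothetical sentences with the same words in a different order.
original = "the dog bit the man".split()
shuffled = "the man bit the dog".split()

print(Counter(original) == Counter(shuffled))  # True: identical count vectors
print(jaccard_similarity(original, shuffled))  # 1.0
print(levenshtein(original, shuffled))         # 2
```

Identical count vectors also imply a cosine similarity of 1, so only the sequence-based measure registers that the meaning of the sentence has changed.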

https://stats.stackexchange.com/questions/219243/properties-of-levenshtein-n-gram-cosine-and-jaccard-distance-coefficients-in
