Ispm 3

Finding Similar Items
A Common Metaphor
Many problems can be expressed as finding "similar" sets:
Find near-neighbors in high-dimensional space
Examples:
Pages with similar words
For duplicate detection, classification by topic
Customers who purchased similar products
Products with similar customer sets
Images with similar features
Users who visited similar websites
Slides by Jure Leskovec: Mining Massive Datasets
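Distance Measures
We formally define "near neighbors" as points that are a "small distance" apart
For each use case, we need to define what "distance" means
Today: Jaccard similarity/distance
The Jaccard similarity of two sets is the size of their intersection divided by the size of their union:
$\mathrm{sim}(C_1, C_2) = |C_1 \cap C_2| / |C_1 \cup C_2|$
The Jaccard distance is 1 minus the Jaccard similarity:
$d(C_1, C_2) = 1 - |C_1 \cap C_2| / |C_1 \cup C_2|$
Example (figure): 3 elements in the intersection, 8 in the union
Jaccard similarity = 3/8
Jaccard distance = 5/8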
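A minimal Python sketch of the two definitions above, assuming sets are held as Python `set` objects. The function names and the example sets are illustrative and not from the slides; treating two empty sets as identical is also an assumption.

```python
def jaccard_similarity(a: set, b: set) -> float:
    """Return |a ∩ b| / |a ∪ b|."""
    union = a | b
    if not union:
        # Assumption: two empty sets count as identical.
        return 1.0
    return len(a & b) / len(union)


def jaccard_distance(a: set, b: set) -> float:
    """Return 1 minus the Jaccard similarity."""
    return 1.0 - jaccard_similarity(a, b)


# Hypothetical sets with 3 shared elements and 8 elements overall.
c1 = {1, 2, 3, 4, 5}
c2 = {3, 4, 5, 6, 7, 8}
print(jaccard_similarity(c1, c2))  # 0.375 (= 3/8)
print(jaccard_distance(c1, c2))    # 0.625 (= 5/8)
```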
Distance Measures
We formally define "near neighbors" as points that are a "small distance" apart
For each use case, we need to define what "distance" means
Two major classes of distance measures:
A Euclidean distance is based on the locations of points in such a space
A non-Euclidean distance is based on properties of points, but not their "location" in a space
Some Euclidean Distances
L2 norm: d(p, q) = square root of the sum of the squares of the differences between p and q in each dimension:
$d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}$
The most common notion of "distance"
L1 norm: sum of the absolute differences in each dimension
Manhattan distance = distance if you had to travel along coordinates only
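The two norms can be sketched in a few lines of Python. The function names and the sample points are assumptions for illustration; points are taken to be equal-length sequences of numbers.

```python
import math


def l2_distance(p, q):
    """Euclidean (L2) distance: square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def l1_distance(p, q):
    """Manhattan (L1) distance: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical points: a 3-4-5 right triangle.
p, q = (0, 0), (3, 4)
print(l2_distance(p, q))  # 5.0
print(l1_distance(p, q))  # 7
```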
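Non-Euclidean Distances: Cosine
Think of a point as a vector from the origin (0, 0, ..., 0) to its location
Two vectors make an angle, whose cosine is the normalized dot product of the vectors:
$\cos\theta = \dfrac{A \cdot B}{\|A\| \, \|B\|}$
Example: A = 00111; B = 10011
$A \cdot B = 2$; $\|A\| = \|B\| = \sqrt{3}$
$\cos\theta = 2/3$; $\theta$ is about 48 degrees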
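The worked example can be checked with a short Python sketch. The function name is an assumption, and the code assumes both vectors are non-zero; the clamp only guards against floating-point rounding.

```python
import math


def cosine_angle_degrees(a, b):
    """Angle between two non-zero vectors, from the normalized dot product."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cos_theta = dot / (norm_a * norm_b)
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))


A = [0, 0, 1, 1, 1]
B = [1, 0, 0, 1, 1]
print(round(cosine_angle_degrees(A, B), 1))  # 48.2
```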
Non-Euclidean Distances: Jaccard
The Jaccard similarity of two sets is the size of their intersection divided by the size of their union:
$\mathrm{sim}(C_1, C_2) = |C_1 \cap C_2| / |C_1 \cup C_2|$
The Jaccard distance between sets is 1 minus their Jaccard similarity:
$d(C_1, C_2) = 1 - |C_1 \cap C_2| / |C_1 \cup C_2|$
Example (figure): 3 elements in the intersection, 8 in the union
Jaccard similarity = 3/8
Jaccard distance = 5/8
Finding Similar Items
Finding Similar Documents
Goal: Given a large number (N in the millions or billions) of text documents, find pairs that are "near duplicates"
Applications:
Mirror websites, or approximate mirrors
Don't want to show both in a search
Similar news articles at many news sites
Cluster articles by "same story"
Problems:
Many small pieces of one doc can appear out of order in another
Too many docs to compare all pairs
Docs are so large or so many that they cannot fit in main memory
3 Essential Steps for Similar Docs
1. Shingling: Convert documents, emails, etc., to sets
2. Minhashing: Convert large sets to short signatures, while preserving similarity
3. Locality-sensitive hashing: Focus on pairs of signatures likely to be from similar documents
The Big Picture
[Figure: pipeline. Document -> the set of strings of length k that appear in the document -> Signatures: short integer vectors that represent the sets, and reflect their similarity -> Locality-sensitive hashing -> Candidate pairs: those pairs of signatures that we need to test for similarity]
Documents as High-Dim Data
Step 1: Shingling: Convert documents, emails, etc., to sets
Simple approaches:
Document = set of words appearing in doc
Document = set of "important" words
Don't work well for this application. Why?
Need to account for ordering of words
A different way: Shingles
Define: Shingles
A k-shingle (or k-gram) for a document is a sequence of k tokens that appears in the doc
Tokens can be characters, words or something else, depending on application
Assume tokens = characters for examples
Example: k = 2; D1 = abcab
Set of 2-shingles: S(D1) = {ab, bc, ca}
Option: Shingles as a bag, count ab twice
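A short Python sketch of character-level shingling, reproducing the D1 = abcab example. The function names are assumptions; the bag variant uses `collections.Counter` to keep multiplicities.

```python
from collections import Counter


def shingles(doc: str, k: int) -> set:
    """Set of all length-k character substrings of doc."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}


def shingle_bag(doc: str, k: int) -> Counter:
    """Bag (multiset) of k-shingles: repeated shingles are counted."""
    return Counter(doc[i:i + k] for i in range(len(doc) - k + 1))


print(sorted(shingles("abcab", 2)))  # ['ab', 'bc', 'ca']
print(shingle_bag("abcab", 2)["ab"])  # 2
```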
Compressing Shingles
To compress long shingles, we can hash them to (say) 4 bytes
Represent a doc by the set of hash values of its k-shingles
Idea: Two documents could (rarely) appear to have shingles in common, when in fact only the hash values were shared
Example: k = 2; D1 = abcab
Set of 2-shingles: S(D1) = {ab, bc, ca}
Hash the shingles: h(D1) = {1, 5, 7}
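One way to sketch the 4-byte compression in Python is with `zlib.crc32`, which returns an unsigned 32-bit integer. CRC32 is only a stand-in chosen for this illustration; the slides do not name a hash function, and the values {1, 5, 7} above are illustrative rather than real hash outputs.

```python
import zlib


def hashed_shingles(doc: str, k: int) -> set:
    """Set of 32-bit hash values of the k-shingles of doc."""
    return {
        zlib.crc32(doc[i:i + k].encode("utf-8"))  # 4-byte value per shingle
        for i in range(len(doc) - k + 1)
    }


values = hashed_shingles("abcab", 2)
print(len(values))  # 3 distinct shingles -> 3 hash values
```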
Working Assumption
Documents that have lots of shingles in common have similar text, even if the text appears in different order
Careful: You must pick k large enough, or most documents will have most shingles
k = 5 is OK for short documents
k = 10 is better for long documents
Motivation for Minhash/LSH
Suppose we need to find near-duplicate documents among N = 1 million documents
Naively, we'd have to compute pairwise Jaccard similarities for every pair of docs
i.e., $N(N-1)/2 \approx 5 \times 10^{11}$ comparisons
At $10^5$ secs/day and $10^6$ comparisons/sec, it would take 5 days
For N = 10 million, it takes more than a year
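The back-of-the-envelope numbers can be reproduced with a small Python sketch. The function name is an assumption; the default rates are the rounded figures from the slide ($10^6$ comparisons/sec, $10^5$ secs/day).

```python
def days_for_all_pairs(n_docs: int,
                       comparisons_per_sec: int = 10 ** 6,
                       secs_per_day: int = 10 ** 5) -> float:
    """Days needed to compare every pair of documents once."""
    pairs = n_docs * (n_docs - 1) // 2
    return pairs / comparisons_per_sec / secs_per_day


print(days_for_all_pairs(10 ** 6))  # roughly 5 days
print(days_for_all_pairs(10 ** 7))  # roughly 500 days, i.e. more than a year
```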
[Figure: pipeline. Document -> the set of strings of length k that appear in the document -> Signatures: short integer vectors that represent the sets, and reflect their similarity]
MinHashing
Step 2: Minhashing: Convert large sets to short signatures, while preserving similarity
Encoding Sets as Bit Vectors
Many similarity problems can be formalized as finding subsets that have significant intersection
Encode sets using 0/1 (bit, boolean) vectors
One dimension per element in the universal set
Interpret set intersection as bitwise AND, and set union as bitwise OR
Example: C1 = 10111; C2 = 10011
Size of intersection = 3; size of union = 4
Jaccard similarity (not distance) = 3/4
d(C1, C2) = 1 - (Jaccard similarity) = 1/4
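The AND/OR interpretation can be sketched in Python by parsing each bit string as an integer. The function name is an assumption, as is the convention that two all-zero vectors have similarity 1.

```python
def jaccard_bits(c1: str, c2: str) -> float:
    """Jaccard similarity of two equal-length 0/1 strings via bitwise ops."""
    a, b = int(c1, 2), int(c2, 2)
    intersection = bin(a & b).count("1")  # bitwise AND = set intersection
    union = bin(a | b).count("1")         # bitwise OR = set union
    if union == 0:
        return 1.0  # assumption: two empty sets count as identical
    return intersection / union


print(jaccard_bits("10111", "10011"))      # 0.75
print(1 - jaccard_bits("10111", "10011"))  # 0.25
```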
From Sets to Boolean Matrices
Rows = elements of the universal set
Columns = sets
1 in row e and column s if and only if e is a member of s
Column similarity is the Jaccard similarity of the sets of their rows with 1
Typical matrix is sparse
[Figure: example boolean matrix]
Example: Jaccard of Columns
Each document is a column:
Example: C1 = 1100011; C2 = 0110010
Size of intersection = 2; size of union = 5
Jaccard similarity (not distance) = 2/5
d(C1, C2) = 1 - (Jaccard similarity) = 3/5
Note:
We might not really represent the data by a boolean matrix
Sparse matrices are usually better represented by the list of places where there is a non-zero value
[Figure: boolean matrix with rows = shingles and columns = documents]
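The sparse representation mentioned in the note can be sketched by storing, for each column, only the indices of rows that hold a 1. The helper name and the 0-based row indexing are assumptions for illustration.

```python
def column_to_sparse(bits: str) -> set:
    """Keep only the (0-based) row indices where the column has a 1."""
    return {i for i, bit in enumerate(bits) if bit == "1"}


c1 = column_to_sparse("1100011")  # rows 0, 1, 5, 6
c2 = column_to_sparse("0110010")  # rows 1, 2, 5
print(len(c1 & c2), len(c1 | c2))    # 2 5
print(len(c1 & c2) / len(c1 | c2))   # 0.4
```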
Outline: Finding Similar Columns
So far:
Documents -> Sets of shingles
Represent sets as boolean vectors in a matrix
Next goal: Find similar columns, small signatures
Approach:
1) Signatures of columns: small summaries of columns
2) Examine pairs of signatures to find similar columns
Essential: Similarities of signatures & columns are related
3) Optional: check that columns with similar sigs. are really similar
Warnings:
Comparing all pairs may take too much time: job for LSH
These methods can produce false negatives, and even false positives (if the optional check is not made)
Hashing Columns (Signatures)
Key idea: "hash" each column C to a small signature h(C), such that:
(1) h(C) is small enough that the signature fits in RAM
(2) sim(C1, C2) is the same as the "similarity" of signatures h(C1) and h(C2)
Goal: Find a hash function h() such that:
if sim(C1, C2) is high, then with high prob. $h(C_1) = h(C_2)$
if sim(C1, C2) is low, then with high prob. $h(C_1) \neq h(C_2)$
Hash docs into buckets, and expect that "most" pairs of near-duplicate docs hash into the same bucket
MinHashing
Goal: Find a hash function h() such that:
if sim(C1, C2) is high, then with high prob. $h(C_1) = h(C_2)$
if sim(C1, C2) is low, then with high prob. $h(C_1) \neq h(C_2)$
Clearly, the hash function depends on the similarity metric:
Not all similarity metrics have a suitable hash function
There is a suitable hash function for Jaccard similarity: Minhashing
MinHashing
Imagine the rows of the boolean matrix permuted under random permutation $\pi$
Define a "hash" function $h_\pi(C)$ = the number of the first (in the permuted order $\pi$) row in which column C has value 1:
$h_\pi(C) = \min \pi(C)$
Use several (e.g., 100) independent hash functions to create a signature of a column
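A minimal Python sketch of a single minhash value, assuming a column is stored as the set of 0-based row indices that hold a 1 and a permutation is a list whose entry r gives the position of row r in the permuted order. Both the representation and the sample column are assumptions for illustration.

```python
def minhash(column: set, permutation: list) -> int:
    """Smallest permuted position among the rows where the column has a 1."""
    return min(permutation[row] for row in column)


# Permutation of 7 rows: row 0 moves to position 1, row 1 to position 3, ...
perm = [1, 3, 7, 6, 2, 5, 4]
# Hypothetical column with 1s in rows 1 and 4 (0-based).
column = {1, 4}
print(minhash(column, perm))  # 2 (row 4 comes first in the permuted order)
```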
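MinHashing Example
[Figure: three row permutations applied to the input matrix (shingles x documents), giving signature matrix M; the matrix entries did not survive extraction]
Permutations (one per column, rows listed top to bottom):
1 4 3
3 2 4
7 1 7
6 3 6
2 6 1
5 7 2
4 5 5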
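Surprising Property
Choose a random permutation $\pi$
then $\Pr[h_\pi(C_1) = h_\pi(C_2)] = \mathrm{sim}(C_1, C_2)$
Why?
Let X be a set of shingles, $X \subseteq [2^{64}]$, $y \in X$
Then: $\Pr[\pi(y) = \min(\pi(X))] = 1/|X|$
It is equally likely that any $y \in X$ is mapped to the min element
Let x be s.t. $\pi(x) = \min(\pi(C_1 \cup C_2))$
Then either: $\pi(x) = \min(\pi(C_1))$ if $x \in C_1$, or
$\pi(x) = \min(\pi(C_2))$ if $x \in C_2$
So the prob. that both are true is the prob. $x \in C_1 \cap C_2$
$\Pr[\min(\pi(C_1)) = \min(\pi(C_2))] = |C_1 \cap C_2| / |C_1 \cup C_2| = \mathrm{sim}(C_1, C_2)$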
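The property can be checked empirically with a small Monte Carlo sketch in Python. The function name, trial count, and seed are assumptions; columns are sets of 0-based row indices, and the two sample columns are C1 = 1100011 and C2 = 0110010, whose Jaccard similarity is 2/5.

```python
import random


def estimate_collision_probability(c1: set, c2: set, n_rows: int,
                                   trials: int = 20000, seed: int = 0) -> float:
    """Fraction of random row permutations under which both minhashes agree."""
    rng = random.Random(seed)
    positions = list(range(n_rows))  # positions[r] = permuted position of row r
    agree = 0
    for _ in range(trials):
        rng.shuffle(positions)  # draw a fresh uniform random permutation
        if min(positions[r] for r in c1) == min(positions[r] for r in c2):
            agree += 1
    return agree / trials


c1 = {0, 1, 5, 6}
c2 = {1, 2, 5}
print(estimate_collision_probability(c1, c2, n_rows=7))  # close to 0.4
```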
Similarity for Signatures
We know: $\Pr[h_\pi(C_1) = h_\pi(C_2)] = \mathrm{sim}(C_1, C_2)$
Now generalize to multiple hash functions
The similarity of two signatures is the fraction of the hash functions in which they agree
Note: Because of the minhash property, the similarity of columns is the same as the expected similarity of their signatures
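Signature similarity as defined above is a one-line computation. In this Python sketch the function name and the sample signatures are assumptions; signatures are equal-length lists of minhash values.

```python
def signature_similarity(sig1: list, sig2: list) -> float:
    """Fraction of positions (hash functions) on which two signatures agree."""
    if len(sig1) != len(sig2) or not sig1:
        raise ValueError("signatures must be non-empty and of equal length")
    return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)


# Hypothetical 3-value signatures that agree in 2 of 3 positions.
print(signature_similarity([2, 2, 1], [2, 4, 1]))
```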
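MinHashing Example
[Figure: row permutations, input matrix, and signature matrix M; the matrix entries did not survive extraction]
Permutations (one per column, rows listed top to bottom):
1 4 3
3 2 4
7 1 7
6 3 6
2 6 1
5 7 2
4 5 5
Similarities:
Columns   1-3    2-4    1-2    3-4
Col/Col   0.75   0.75   0      0
Sig/Sig   0.67   1.00   0      0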
MinHash Signatures
Pick 100 random permutations of the rows
Think of sig(C) as a column vector
Let sig(C)[i] = according to the i-th permutation, the index of the first row that has a 1 in column C
$\mathrm{sig}(C)[i] = \min(\pi_i(C))$
Note: The sketch (signature) of document C is small: ~100 bytes!
We achieved our goal! We "compressed" long bit vectors into short signatures
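A sketch of building a full signature with explicit random permutations, in Python. The function name, seed handling, and the set-of-row-indices representation are assumptions; the same seed must be used for every column so that all signatures share the same permutations, and each column must be non-empty. Materializing permutations is only practical for small row counts.

```python
import random


def minhash_signature(column: set, n_rows: int,
                      n_perms: int = 100, seed: int = 0) -> list:
    """sig[i] = smallest permuted position, under permutation i, of a row in column."""
    rng = random.Random(seed)  # same seed -> same permutations for every column
    positions = list(range(n_rows))
    signature = []
    for _ in range(n_perms):
        rng.shuffle(positions)  # positions[r] = position of row r in this permutation
        signature.append(min(positions[r] for r in column))
    return signature


sig1 = minhash_signature({0, 1, 5, 6}, n_rows=7)
sig2 = minhash_signature({1, 2, 5}, n_rows=7)
agreement = sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)
print(len(sig1), agreement)  # 100 values; agreement estimates the Jaccard similarity 0.4
```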
[Figure: pipeline. Document -> the set of strings of length k that appear in the document -> Signatures: short integer vectors that represent the sets, and reflect their similarity -> Locality-sensitive hashing -> Candidate pairs: those pairs of signatures that we need to test for similarity]
Locality-Sensitive Hashing
Step 3: Locality-sensitive hashing: Focus on pairs of signatures likely to be from similar documents
LSH: First Cut
Goal: Find documents with Jaccard similarity at least s (for some similarity threshold, e.g., s = 0.8)
LSH general idea: Use a function f(x, y) that tells whether x and y are a candidate pair: a pair of elements whose similarity must be evaluated
For minhash matrices:
Hash columns of signature matrix M to many buckets
Each pair of documents that hashes into the same bucket is a candidate pair
Candidates from Minhash
Pick a similarity threshold s, a fraction < 1
Columns x and y of M are a candidate pair if their signatures agree on at least fraction s of their rows:
M(i, x) = M(i, y) for at least frac. s values of i
We expect documents x and y to have the same similarity as their signatures
LSH for Minhash
Big idea: Hash columns of signature matrix M several times
Arrange that (only) similar columns are likely to hash to the same bucket, with high probability
Candidate pairs are those that hash to the same bucket
Partition M into Bands
[Figure: signature matrix M divided into b bands of r rows per band; each column is one signature]
Partition M into Bands
Divide matrix M into b bands of r rows
For each band, hash its portion of each column to a hash table with k buckets
Make k as large as possible
Candidate column pairs are those that hash to the same bucket for at least 1 band
Tune b and r to catch most similar pairs, but few non-similar pairs
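The banding step can be sketched in Python as follows. The function name and the dict-of-signatures input are assumptions. Instead of hashing each band into k buckets, this sketch uses the band's exact tuple of values as the bucket key, which amounts to assuming there are enough buckets that only identical bands collide.

```python
from collections import defaultdict
from itertools import combinations


def lsh_candidate_pairs(signatures: dict, b: int, r: int) -> set:
    """Pairs of document ids whose signatures are identical in at least one band.

    signatures maps a document id to a list of b * r minhash values.
    """
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            # Bucket key = this band's slice of the signature (exact match).
            key = tuple(sig[band * r:(band + 1) * r])
            buckets[key].append(doc_id)
        for docs in buckets.values():
            for pair in combinations(sorted(docs), 2):
                candidates.add(pair)
    return candidates


# Hypothetical signatures of length 4, split into b = 2 bands of r = 2 rows.
sigs = {"a": [1, 2, 3, 4], "b": [1, 2, 9, 9], "c": [7, 7, 7, 7]}
print(lsh_candidate_pairs(sigs, b=2, r=2))  # {('a', 'b')}
```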
Hashing Bands
[Figure: matrix M split into b bands of r rows, with each band hashed to buckets. Columns 2 and 6 are probably identical (candidate pair); columns 6 and 7 are surely different.]
Simplifying Assumption
There are enough buckets that columns are unlikely to hash to the same bucket unless they are identical in a particular band
Hereafter, we assume that "same bucket" means "identical in that band"
Assumption needed only to simplify analysis, not for correctness of algorithm
Example of Bands
Assume the following case:
Suppose 100,000 columns of M (100k docs)
Signatures of 100 integers (rows)
Therefore, signatures take 40 MB (100,000 x 100 integers at 4 bytes each)
Choose 20 bands of 5 integers/band
Goal: Find pairs of documents that are at least s = 80% similar
C1, C2 are 80% Similar
Assume: C1, C2 are 80% similar
Since s = 80%, we want C1, C2 to hash to at least one common bucket (at least one band is identical)
Probability C1, C2 identical in one particular band: $(0.8)^5 = 0.328$
Probability C1, C2 are not identical in any of the 20 bands: $(1 - 0.328)^{20} = 0.00035$
i.e., about 1/3000th of the 80%-similar column pairs are false negatives
We would find 99.965% of pairs of truly similar documents
C1, C2 are 30% Similar
Assume: C1, C2 are 30% similar
Since s = 80%, we want C1, C2 to hash to NO common buckets (all bands should be different)
Probability C1, C2 identical in one particular band: $(0.3)^5 = 0.00243$
Probability C1, C2 identical in at least 1 of 20 bands: $1 - (1 - 0.00243)^{20} = 0.0474$
In other words, approximately 4.74% of pairs of docs with similarity 30% end up becoming candidate pairs: false positives
LSH Involves a Tradeoff
Pick:
the number of minhashes (rows of M)
the number of bands b, and
the number of rows r per band
to balance false positives/negatives
Example: if we had only 15 bands of 5 rows, the number of false positives would go down, but the number of false negatives would go up
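The 15-band versus 20-band claim can be checked numerically. This Python sketch assumes the same two reference similarities used in the preceding examples (30% for false positives, 80% for false negatives) and the same-bucket-means-identical-band simplification; the function names are illustrative.

```python
def prob_candidate(s: float, r: int, b: int) -> float:
    """Probability that two columns of similarity s agree in at least one band."""
    return 1.0 - (1.0 - s ** r) ** b


def error_rates(r: int, b: int, s_low: float = 0.3, s_high: float = 0.8):
    """(false-positive rate at s_low, false-negative rate at s_high)."""
    false_positive = prob_candidate(s_low, r, b)
    false_negative = 1.0 - prob_candidate(s_high, r, b)
    return false_positive, false_negative


for bands in (20, 15):
    fp, fn = error_rates(r=5, b=bands)
    print(f"b={bands}: false positives {fp:.4f}, false negatives {fn:.5f}")
```

With fewer bands a pair has fewer chances to collide, so the false-positive rate at 30% similarity drops while the false-negative rate at 80% similarity rises.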
Analysis of LSH: What We Want
[Figure: ideal step function. x-axis: similarity s of two sets; y-axis: probability of sharing a bucket. No chance if s < t; probability = 1 if s > t, where t is the similarity threshold]
What 1 Band of 1 Row Gives You
[Figure: probability of sharing a bucket plotted against similarity, with threshold t marked. Remember: probability of equal hash-values = similarity]
What b Bands of r Rows Gives You
[Figure: S-curve of the probability of sharing a bucket, with threshold $t \approx (1/b)^{1/r}$]
$s^r$: all rows of a band are equal
$1 - s^r$: some row of a band unequal
$(1 - s^r)^b$: no bands identical
$1 - (1 - s^r)^b$: at least one band identical
Example: b = 20; r = 5
Similarity threshold s
Prob. that at least 1 band is identical:
s      $1 - (1 - s^r)^b$
0.2    0.006
0.3    0.047
0.4    0.186
0.5    0.470
0.6    0.802
0.7    0.975
0.8    0.9996
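The table and the approximate threshold $t \approx (1/b)^{1/r}$ can be regenerated with a few lines of Python. The function names are assumptions for illustration.

```python
def prob_candidate(s: float, r: int, b: int) -> float:
    """1 - (1 - s^r)^b: probability that at least one band is identical."""
    return 1.0 - (1.0 - s ** r) ** b


def approx_threshold(r: int, b: int) -> float:
    """Approximate similarity at which the S-curve rises most steeply."""
    return (1.0 / b) ** (1.0 / r)


r, b = 5, 20
for s in (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8):
    print(f"{s:.1f}  {prob_candidate(s, r, b):.4f}")
print(f"threshold t is about {approx_threshold(r, b):.3f}")
```

For b = 20 and r = 5 the approximate threshold comes out near 0.55, which matches where the table climbs from 0.470 to 0.802.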
Picking r and b: The S-curve
Picking r and b to get the best S-curve
50 hash functions (r = 5, b = 10)
[Figure: S-curve. x-axis: similarity (0 to 1); y-axis: prob. sharing a bucket (0 to 1). Blue area: false negative rate; green area: false positive rate]
LSH Summary
Tune b and r to get almost all pairs with similar signatures, but eliminate most pairs that do not have similar signatures
Check in main memory that candidate pairs really do have similar signatures
Optional: In another pass through data, check that the remaining candidate pairs really represent similar documents
Summary: 3 Steps
1. Shingling: Convert documents, emails, etc., to sets
2. Minhashing: Convert large sets to short signatures, while preserving similarity
3. Locality-sensitive hashing: Focus on pairs of signatures likely to be from similar documents
