LEARNING
Jasmeet Singh
Thapar University
INTRODUCTION
In the average linkage method, the distance between two clusters is the
average of the pairwise distances between every object in one cluster and
every object in the other cluster.
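The definition above can be written compactly as a formula. The symbols A, B (the two clusters) and d (the underlying object-to-object distance) are notation introduced here for illustration and do not appear on the slides.

```latex
d_{\text{avg}}(A, B) \;=\; \frac{1}{|A|\,|B|} \sum_{a \in A} \sum_{b \in B} d(a, b)
```

For two singleton clusters this reduces to the plain distance between the two objects.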
METHOD I : FEATURES AND CLASSES
CONTD….
Example
Consider the following words,
condition
conditional
conditions
conditioned
condense
condensed
condenser
Use the Jaccard distance and the average linkage method to create
clusters of morphologically related words.
The Jaccard distance between two strings x and y is given
by 1 − |X ∩ Y| / |X ∪ Y|, where X and Y are the sets
of unique q-grams of x and y, respectively. The distances in this
example correspond to q = 1, i.e. sets of single characters.
So, according to the Jaccard distance, the distance matrix is:

             condition  conditional  conditions  conditioned  condense  condensed  condenser
condition    0
conditional  0.25       0
conditions   0.1428     0.3333       0
conditioned  0.1428     0.3333       0.25        0
condense     0.5        0.6          0.375       0.375        0
condensed    0.5        0.6          0.375       0.375        0         0
condenser    0.5555     0.6363       0.4444      0.4444       0.1428    0.1428     0
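The matrix above can be recomputed with a few lines of Python. This is a minimal sketch, not the author's code: the function names (`qgrams`, `jaccard_distance`, `distance_matrix`) are hypothetical, and q = 1 is an assumption chosen because it matches the slide's values.

```python
WORDS = ["condition", "conditional", "conditions", "conditioned",
         "condense", "condensed", "condenser"]

def qgrams(s, q=1):
    """Return the set of unique q-grams of s (q=1 gives single characters)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def jaccard_distance(x, y, q=1):
    """1 - |X ∩ Y| / |X ∪ Y| over the q-gram sets of x and y."""
    X, Y = qgrams(x, q), qgrams(y, q)
    union = X | Y
    if not union:  # both strings shorter than q: treat as identical
        return 0.0
    return 1 - len(X & Y) / len(union)

def distance_matrix(words, q=1):
    """Lower-triangular matrix: row i holds distances to words[0..i]."""
    return [[jaccard_distance(w, v, q) for v in words[:i + 1]]
            for i, w in enumerate(words)]

if __name__ == "__main__":
    for word, row in zip(WORDS, distance_matrix(WORDS)):
        print(f"{word:<12}", "  ".join(f"{d:.4f}" for d in row))
```

The printed values are rounded to four decimals, whereas the slide's values appear to be truncated (for example 0.1428 for 1/7), so the last digit can differ.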
After merging condense and condensed (distance 0):

             condition  conditional  conditions  conditioned  condense, condensed  condenser
condition    0
conditional  0.25       0
After merging condition with conditions, and condenser with {condense, condensed}:

                       condition, conditions  conditional  conditioned  condense, condensed, condenser
condition, conditions  0
conditional            0.2916                 0
After merging conditioned with {condition, conditions}:

                                    condition, conditions, conditioned  conditional  condense, condensed, condenser
condition, conditions, conditioned  0
conditional                         0.3055                              0
condense, condensed, condenser      0.4382                              0.6121       0
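The merge sequence worked through above can be reproduced with a small agglomerative loop. This is a sketch under stated assumptions: character-level (q = 1) Jaccard distance, a naive recomputation of average linkage at every step, and a stopping rule based on a target number of clusters `k`. The slides do not state a stopping rule, and all function names here are hypothetical.

```python
from itertools import combinations

WORDS = ["condition", "conditional", "conditions", "conditioned",
         "condense", "condensed", "condenser"]

def jaccard_distance(x, y):
    """Jaccard distance over the sets of characters (q = 1, an assumption)."""
    a, b = set(x), set(y)
    return 1 - len(a & b) / len(a | b)

def average_linkage(c1, c2):
    """Mean of all pairwise distances between members of c1 and c2."""
    total = sum(jaccard_distance(a, b) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))

def agglomerate(words, k):
    """Merge the two closest clusters repeatedly until k clusters remain."""
    clusters = [(w,) for w in words]
    while len(clusters) > k:
        # Pick the pair of clusters with the smallest average-linkage distance.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: average_linkage(clusters[p[0]], clusters[p[1]]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters

if __name__ == "__main__":
    for cluster in agglomerate(WORDS, k=3):
        print(", ".join(cluster))
```

Several candidate merges tie at 1/7, so the order of the intermediate merges depends on tie-breaking, but the three-cluster result is the same whichever tied pair is merged first. In practice a distance threshold is often used instead of a fixed `k`.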
YASS (Yet Another Suffix Stripper) clusters words using the complete
linkage method on the basis of string distance measures proposed by
its authors.
The method uses string distances that require no knowledge of the
morphological variants of the language and that assign high
similarity to words sharing a long common prefix.
For any two strings X and Y of length n (null characters are
appended to the shorter string to make the lengths equal), the authors
proposed four string distance measures {D1, D2, D3, D4} and
reported that D3 is the most effective and is quite insensitive to
changes in the threshold value.
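The slides do not give the formula for D3. The sketch below follows the definition as recalled from the YASS paper, so it should be checked against that paper before use: with the padded strings indexed from 0 to n and m the position of the first mismatch, D3 = ((n − m + 1) / m) · Σ_{i=m..n} 1/2^(i−m) when m > 0, and infinity when the strings differ at the first character. The function name `yass_d3` and the `pad` parameter are hypothetical.

```python
import math

def yass_d3(x, y, pad="\0"):
    """Sketch of the YASS D3 distance (formula recalled from the paper, not from the slides)."""
    length = max(len(x), len(y))
    # Pad the shorter string with null characters so both have equal length.
    x, y = x.ljust(length, pad), y.ljust(length, pad)
    n = length - 1  # index of the last character
    # m = position of the first mismatch; equals `length` if the strings are identical.
    m = next((i for i in range(length) if x[i] != y[i]), length)
    if m == 0:
        return math.inf  # no common prefix at all
    tail = sum(1 / 2 ** (i - m) for i in range(m, n + 1))
    return (n - m + 1) / m * tail

if __name__ == "__main__":
    print(round(yass_d3("astronomer", "astronomically"), 4))  # 1.4766
```

A long shared prefix gives a large m, which shrinks the distance; this is how the measure rewards words with a long common prefix without using any language-specific rules.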