
AUTOMATIC MORPHOLOGY LEARNING

Jasmeet Singh
Thapar University
INTRODUCTION

Rule-based methods of learning morphology require:
• Linguistic resources like suffix lists, rule lists, dictionaries, tagged data, etc.
• Language expertise to encode the rules of the language.
But for many languages of the world, these lexical resources are either incomplete or unavailable.
So, for resource-scarce languages, rule-based methods are not preferred.
AUTOMATIC MORPHOLOGY LEARNING
Automatic morphology learning involves unsupervised or semi-supervised learning of morphologically related words or morpheme boundaries from a raw corpus.
So, these methods are also called corpus-based, language-independent, or statistical methods.
The major advantage of corpus-based methods is that the resulting stemmers can be applied to a new language with very little effort, provided the language satisfies the basic assumptions of the method (e.g., that variant words are formed by adding affixes only).
Moreover, these techniques can deal with languages that have complex morphology and with sparse data.
Statistical techniques do not require any prior knowledge of the language or any language resources, which makes them useful for the many languages whose resources are either unavailable or too incomplete to provide effective results.
METHODS FOR AUTOMATIC MORPHOLOGY LEARNING

Methods for automatic learning of morphology are broadly divided into the following three categories:
• Features and Classes.
• Border Determination Methods.
• Frequency Based Methods (frequency of substrings or suffixes).
METHOD I : FEATURES AND CLASSES
In this family of methods, a word is seen as made up of a set of features.
After computation of the features, the words are clustered using various clustering methods.
The most common feature used is the lexical similarity or distance between the words.
String comparison methods like Hamming distance, longest common subsequence, Jaccard distance, cosine distance, q-gram distance, Levenshtein distance, Jaro distance, etc. are frequently used for computing the distance/similarity between words acquired from the corpus.
METHOD I : FEATURES AND CLASSES
CONTD….
Among clustering-based techniques, hierarchical agglomerative clustering is commonly used due to:
• its ability to create natural clusters,
• the possibility to view the data at different threshold levels,
• and no prior knowledge of the number of clusters being required.
The hierarchical agglomerative clustering algorithm begins by considering each word as a separate cluster, so for N words in a class, there are N clusters.
At each step, the algorithm merges the two most similar clusters and updates the distance between the new cluster and each old cluster.
The process continues as long as the distance between the closest clusters remains below a pre-defined threshold value.
METHOD I : FEATURES AND CLASSES
CONTD….
At each fusion stage, a linkage method decides how the inter-cluster distances are calculated.
A number of linkage rules, like single linkage, complete linkage, and average linkage, are described in the literature.
• In the single linkage method, the distance between two clusters is equal to the least distance between an object in one cluster and an object in the other cluster.
• In the complete linkage method, the distance between two clusters is equal to the maximum distance between an object in one cluster and an object in the other cluster.
• In the average linkage method, the distance between two clusters is equal to the average distance from each object in one cluster to every object in the other cluster.
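The three rules can be stated compactly in code. A minimal Python sketch, where d is any pairwise word distance (such as the Jaccard distance used in the example that follows):

# Sketch: the three linkage rules, for any pairwise distance function d(x, y).

def single_linkage(c1, c2, d):
    # Least distance between an object in c1 and an object in c2.
    return min(d(a, b) for a in c1 for b in c2)

def complete_linkage(c1, c2, d):
    # Maximum distance between an object in c1 and an object in c2.
    return max(d(a, b) for a in c1 for b in c2)

def average_linkage(c1, c2, d):
    # Average distance from each object in c1 to every object in c2.
    return sum(d(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))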
METHOD I : FEATURES AND CLASSES
CONTD….
Example
Consider the following words,
condition
conditional
conditions
conditioned
condense
condensed
condenser
Use Jaccard distance and average linkage method to create
clusters of morphologically related words.
METHOD I : FEATURES AND CLASSES
CONTD….
The Jaccard distance between two strings x and y is given by 1 − |X ∩ Y| / |X ∪ Y|, where X and Y represent the sets of unique q-grams of strings x and y respectively.
So according to the Jaccard distance, the distance matrix is:

             condition  conditional  conditions  conditioned  condense  condensed  condenser
condition    0
conditional  0.25       0
conditions   0.1428     0.3333       0
conditioned  0.1428     0.3333       0.25        0
condense     0.5        0.6          0.375       0.375        0
condensed    0.5        0.6          0.375       0.375        0         0
condenser    0.5555     0.6363       0.4444      0.4444       0.1428    0.1428     0
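This matrix can be reproduced with a short script. A sketch in Python; the values above match when the q-grams are taken as single characters (q = 1), with the printed figures truncated (e.g. 0.1428 for 1/7):

# Sketch: reproduce the Jaccard distance matrix above (q = 1, i.e. character sets).

def qgrams(s, q=1):
    """Set of unique q-grams of string s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def jaccard_distance(x, y, q=1):
    X, Y = qgrams(x, q), qgrams(y, q)
    return 1 - len(X & Y) / len(X | Y)

words = ["condition", "conditional", "conditions", "conditioned",
         "condense", "condensed", "condenser"]

for i, w in enumerate(words):
    row = "  ".join(f"{jaccard_distance(w, v):.4f}" for v in words[:i + 1])
    print(f"{w:12s}{row}")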
METHOD I : FEATURES AND CLASSES CONTD….

Now, the words condense and condensed will be grouped together as they have the minimum distance.
The distance of this new cluster to each old cluster is then updated. For example:

                     condition  conditional  conditions  conditioned  condense,  condenser
                                                                      condensed
condition            0
conditional          0.25       0
conditions           0.1428     0.3333       0
conditioned          0.1428     0.3333       0.25        0
condense, condensed  0.5        0.6          0.375       0.375        0
condenser            0.5555     0.6363       0.4444      0.4444       0.1428     0
METHOD I : FEATURES AND CLASSES CONTD….

Now, two new clusters will be formed:
• condition and conditions
• condense, condensed and condenser

                                condition,   conditional  conditioned  condense, condensed,
                                conditions                             condenser
condition, conditions           0
conditional                     0.2916       0
conditioned                     0.1964       0.3333       0
condense, condensed, condenser  0.4583       0.6121       0.3981       0
METHOD I : FEATURES AND CLASSES CONTD….

Now, the following new cluster will be formed:
• condition, conditions and conditioned

                                condition, conditions,  conditional  condense, condensed,
                                conditioned                          condenser
condition, conditions,
conditioned                     0
conditional                     0.3055                  0
condense, condensed, condenser  0.4382                  0.6121       0
METHOD I : FEATURES AND CLASSES CONTD….

In the next iteration, the cluster condition, conditions, conditioned will be grouped with conditional.

                                      condition, conditions,     condense, condensed,
                                      conditioned, conditional   condenser
condition, conditions,
conditioned, conditional              0
condense, condensed, condenser        0.4817                     0

Finally, the two remaining clusters will be merged into one cluster at distance 0.4817.
The final clustering dendrogram is shown on the next slide.
METHOD I : FEATURES AND CLASSES CONTD….
[Dendrogram of the final clustering of the seven words]
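The whole clustering can be reproduced with SciPy's hierarchical clustering routines. A sketch, assuming the q = 1 Jaccard distance from the earlier slides; cutting the tree at a threshold of 0.4 separates the two morphological families:

# Sketch: reproduce the clustering with SciPy (assumes the q = 1 Jaccard distance).
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def jaccard_distance(x, y):
    X, Y = set(x), set(y)          # q = 1: sets of characters
    return 1 - len(X & Y) / len(X | Y)

words = ["condition", "conditional", "conditions", "conditioned",
         "condense", "condensed", "condenser"]
D = np.array([[jaccard_distance(a, b) for b in words] for a in words])

# SciPy expects a condensed distance vector; method="average" = average linkage.
Z = linkage(squareform(D), method="average")

# Cutting the dendrogram at 0.4 yields the two morphological families.
for word, label in zip(words, fcluster(Z, t=0.4, criterion="distance")):
    print(label, word)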
METHOD I : FEATURES AND CLASSES CONTD….

YASS (Yet Another Suffix Stripper) has been proposed as a stemmer that clusters words using the complete linkage method on the basis of string distances proposed by the authors.
The method makes use of string distances that require no knowledge of the morphological variants of the language and that assign high similarity to words having a long common prefix.
For any two strings X and Y of length n (null characters are appended to the shorter string to make the lengths equal), the authors proposed four string distance measures {D1, D2, D3, D4}, each defined in terms of m, the location of the first mismatch between the strings X and Y. They reported that D3 is the most effective and is quite insensitive to changes in the threshold value.
After calculating the string distances, the clusters are obtained using a graph-based complete linkage technique.
The software code of the method is available at
http://fire.irsi.res.in/fire/static/resources
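The formulas for D1–D4 do not survive on this slide. As an illustration, here is a sketch of a D3-style distance in the form it is commonly cited; the exact expression should be treated as an assumption, not as the slide's own definition:

# Sketch of the D3 distance. The slide does not reproduce the formulas, so the
# exact form below is an assumption based on how the measure is usually cited:
#   D3(X, Y) = ((n - m + 1) / m) * sum_{i=m..n} 2^-(i-m)   for m > 0,
#   D3(X, Y) = infinity                                    for m = 0,
# where the strings are indexed 0..n after padding and m is the first mismatch.

def d3(x, y):
    length = max(len(x), len(y))
    x = x.ljust(length, "\0")      # pad the shorter string with null characters
    y = y.ljust(length, "\0")
    m = next((i for i in range(length) if x[i] != y[i]), None)
    if m is None:                  # identical strings
        return 0.0
    if m == 0:                     # no common prefix at all
        return float("inf")
    # (n - m + 1) equals (length - m) with 0-based Python indexing.
    return ((length - m) / m) * sum(2.0 ** -(i - m) for i in range(m, length))

print(d3("astronomer", "astronomically"))   # -> 1.4766 (long common prefix)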
METHOD I : FEATURES AND CLASSES CONTD….

Besides lexical similarity, many other features are used to group morphologically related words.
Suffix pair frequency is one useful feature suggested in the literature.
A pair of suffixes (s1, s2) is said to be co-occurring if there exists a pair of distinct words wi, wj in a lexicon L such that wi = rs1 and wj = rs2, where r is the longest common prefix of wi and wj, and |r| > 0.
Informally, s1 and s2 are regarded as a co-occurring pair if each of s1 and s2 may be added as a suffix to the end of a common “root” r to form valid words wi and wj.
The pair (wi, wj) is said to induce the suffix pair (s1, s2), and the number of such word pairs is termed the frequency of the suffix pair (s1, s2).
For example, if L = {activate, activation, educate, education}, the suffix pair (e, ion) is said to co-occur with frequency 2.
All the suffix pairs whose frequency is greater than some pre-decided threshold are considered valid suffix pairs.
The suffix pairs and their frequencies are then used by the clustering algorithm to group words.
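A minimal sketch of the counting step (threshold filtering omitted):

# Sketch: count co-occurring suffix pairs in a lexicon.
import os
from collections import Counter
from itertools import combinations

lexicon = ["activate", "activation", "educate", "education"]
pair_freq = Counter()

for wi, wj in combinations(sorted(lexicon), 2):
    r = os.path.commonprefix([wi, wj])    # longest common prefix
    if r:                                 # require |r| > 0
        pair_freq[(wi[len(r):], wj[len(r):])] += 1

print(pair_freq[("e", "ion")])            # -> 2, as in the example above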
METHOD I : FEATURES AND CLASSES CONTD….

Co-occurrence of words with the other words of the corpus is also a useful feature for grouping morphologically related words.
Methods that use only lexicon statistics sometimes conflate many unrelated words by treating valid word endings as suffixes (as in the case of shin and shining). Corpus analysis helps in reducing such erroneous conflation.
Two important metrics based on co-occurrence frequency used in morphological analysis are defined in terms of the following quantities:
• nab is the number of co-occurrences of words a and b, na and nb are the numbers of times words a and b respectively occur within a 100-word window, and k = nab/(na + nb);
• fwa,d is the frequency of occurrence of word wa in document d.
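The metric formulas themselves are not reproduced on the slide, so the sketch below only computes the counts they are built on; reading "co-occurrence" as "b appears within a 100-word window of an occurrence of a" is an assumption:

# Sketch: the raw counts behind the co-occurrence metrics.

def cooccurrence_counts(tokens, a, b, window=100):
    n_a = n_b = n_ab = 0
    for i, tok in enumerate(tokens):
        if tok == a:
            n_a += 1
            # b occurring within `window` words of this occurrence of a
            if b in tokens[max(0, i - window):i + window + 1]:
                n_ab += 1
        elif tok == b:
            n_b += 1
    return n_a, n_b, n_ab

tokens = "the shin of the runner was shining in the sun".split()
n_a, n_b, n_ab = cooccurrence_counts(tokens, "shin", "shining", window=100)
print(n_a, n_b, n_ab, n_ab / (n_a + n_b))   # k = nab / (na + nb)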
METHOD II: BORDER DETERMINATION METHODS

In this family of methods, the optimal split point between the potential stems and suffixes is determined using different techniques.
The most common border segmentation method is letter successor variety.
The successor variety of a string is the number of different characters that follow it in the words of some body of text.
The successor variety of the substrings of a term will decrease as more characters are added, until a segment boundary is reached; at that point the successor variety sharply increases.
This information is used to find the stems of words.
METHOD II: BORDER DETERMINATION METHODS

After deriving the successor varieties for a given word, one of the following four approaches is used:
1) The cut-off method: some cut-off value, i.e. a threshold, is selected for the successor varieties, and a boundary is identified whenever the threshold is reached.
The problem with this method is that if the threshold is too small, incorrect cuts will be made, and if it is too large, correct cuts will be missed.
2) Peak and plateau method: in this method, a segment break is made after a character whose successor variety exceeds that of the character immediately preceding it and the character immediately succeeding it.
This method does not suffer from the threshold problem of the cut-off method.
METHOD II: BORDER DETERMINATION METHODS

3) Complete word method: a break is made after a segment if the segment is a complete word in the corpus.
4) Entropy method: this method works as follows:
• Let |Dαi| be the number of words in the text beginning with the i-length sequence of letters α.
• Let |Dαij| be the number of words in Dαi with successor letter j.
• The entropy Hαi is given by:
  Hαi = − Σj (|Dαij| / |Dαi|) · log2 (|Dαij| / |Dαi|)
• A cut-off value is selected, and a boundary is identified whenever the entropy reaches the cut-off value.
METHOD II: BORDER DETERMINATION METHODS
Example: Letter Successor Variety
Test word: READABLE
Corpus: ABLE, APE, BEATABLE, FIXABLE, READ, READABLE, READING, READS, RED, ROPE, RIPE
The letter successor variety of each prefix of the input word READABLE is:

Prefix     Successor Variety   Letters
R          3                   E, I, O
RE         2                   A, D
REA        1                   D
READ       3                   A, I, S
READA      1                   B
READAB     1                   L
READABL    1                   E
READABLE   1                   Blank
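A short sketch that reproduces this table (following the table's convention that the end of the test word counts as the successor "Blank"):

# Sketch: reproduce the successor variety table above.
corpus = ["ABLE", "APE", "BEATABLE", "FIXABLE", "READ", "READABLE",
          "READING", "READS", "RED", "ROPE", "RIPE"]

def successor_variety(prefix, corpus):
    """Distinct letters that follow `prefix` in the corpus words."""
    return {w[len(prefix)] for w in corpus
            if w.startswith(prefix) and len(w) > len(prefix)}

word = "READABLE"
for i in range(1, len(word) + 1):
    prefix = word[:i]
    # Following the table, the end of the test word counts as successor "Blank".
    s = successor_variety(prefix, corpus) or {"Blank"}
    print(f"{prefix:9s} {len(s)}   {', '.join(sorted(s))}")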
METHOD II: BORDER DETERMINATION METHODS
o Cut-off method: segment when successor variety >= threshold.
o If the threshold is 2, then: R|E|AD|ABLE
o Peak and plateau method: break after the character whose successor variety is greater than that of its preceding and following characters: READ|ABLE
o Complete word method: a break is made if the segment is a complete word in the corpus: READ
o Entropy method:
o Let i = 2, α = RE, |Dαi| = 5
  For j = A, |Dαij| = 4
  For j = D, |Dαij| = 1
  Therefore, Hαi = − (1/5) log2 (1/5) − (4/5) log2 (4/5) = 0.72
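Continuing the previous sketch, the four boundary decisions can be computed as follows (the entropy value matches the 0.72 above):

# Continuing the previous sketch: the four boundary decisions for READABLE.
import math

threshold = 2
sv = [len(successor_variety(word[:i], corpus)) for i in range(1, len(word) + 1)]

# Cut-off: break after every prefix whose successor variety >= threshold.
cuts = [i for i, v in enumerate(sv, start=1) if v >= threshold]        # [1, 2, 4]

# Peak and plateau: break where the variety exceeds both of its neighbours.
peaks = [i + 1 for i in range(1, len(sv) - 1)
         if sv[i] > sv[i - 1] and sv[i] > sv[i + 1]]                   # [4] -> READ|ABLE

# Complete word: break after a prefix that is itself a word in the corpus.
complete = [i for i in range(1, len(word) + 1) if word[:i] in corpus]  # [4, 8]

# Entropy for alpha = "RE" (i = 2): H = -sum_j p_j * log2(p_j).
alpha = "RE"
succ = [w[len(alpha)] for w in corpus
        if w.startswith(alpha) and len(w) > len(alpha)]                # A, A, A, A, D
H = -sum((succ.count(j) / len(succ)) * math.log2(succ.count(j) / len(succ))
         for j in set(succ))
print(cuts, peaks, complete, round(H, 2))                              # ... 0.72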
METHOD III: FREQUENCY BASED METHODS
Another approach used in unsupervised learning of morphology is based on identifying suffixes from the corpus on the basis of their frequency.
The identified suffixes are then used in the process of suffix stripping to obtain the stems of words.
Oard et al. [2001] developed a method to identify suffixes based on their frequencies in the first 500,000 words of the corpus.
The frequencies of suffixes of length one to four characters are computed from these words.
In order to remove the effect of partial suffixes (‘-ng’ is part of ‘-ing’), the suffix frequencies are modified by subtracting from each ending the count of its most frequent subsuming ending of the next longer length.
In order to stem a word, the first matching suffix in the list (from the top) is removed.
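A toy sketch of this procedure; an illustrative word list stands in for the 500,000-word corpus, and the resulting ranked list is what suffix stripping would then consult from the top:

# Toy sketch of the frequency-based suffix discovery described above.
from collections import Counter

def suffix_counts(words, max_len=4):
    counts = Counter()
    for w in words:
        for k in range(1, min(max_len, len(w) - 1) + 1):
            counts[w[-k:]] += 1
    return counts

def adjusted_counts(counts, max_len=4):
    """Subtract from each ending the count of its most frequent subsuming
    ending of the next longer length (e.g. '-ing' is deducted from '-ng')."""
    adjusted = {}
    for s, c in counts.items():
        if len(s) < max_len:
            longer = [c2 for s2, c2 in counts.items()
                      if len(s2) == len(s) + 1 and s2.endswith(s)]
            c -= max(longer, default=0)
        adjusted[s] = c
    return adjusted

words = ["shining", "walking", "talking", "walked", "talked", "walks"]
ranked = sorted(adjusted_counts(suffix_counts(words)).items(),
                key=lambda kv: -kv[1])
print(ranked[:5])   # high-ranked endings; stemming strips the first match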
