Wondwossen Mulugeta (PhD), email: wondemule@yahoo.com
Topics

Topic 3: Text Processing

Subtopics:
1. Tokenization and Word Segmentation, Stemming, Lemmatization, Morphological Processing (Types of Morpheme, Morphological Types, Morphological Rules, Morphemes and Words, Inflectional and Derivational Morphology)
2. Part-of-Speech Tagging
3. Parsing (Introduction, Context-Free Grammar, Parsing)
What is Text Processing

The same term can surface in many variant forms that a text-processing system should treat as one word, e.g.:
DataBase
database
Data-Base
Data-base
data-base
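A minimal normalization sketch (an illustration, not from the original slides): lower-casing and removing hyphens maps every variant above to one canonical form.

def normalize(term):
    # collapse case and hyphenation variants to a single canonical form
    return term.replace('-', '').lower()

for t in ['DataBase', 'database', 'Data-Base', 'Data-base', 'data-base']:
    print(t, '->', normalize(t))   # every variant prints 'database'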
Regular Expressions

Pattern    Meaning                        Matches
colou?r    optional previous char         color colour
oo*h!      0 or more of previous char     oh! ooh! oooh! ooooh!
o+h!       1 or more of previous char     oh! ooh! oooh! ooooh!
baa+       1 or more of previous char     baa baaa baaaa baaaaa
beg.n      any single char                begin begun beg3n
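Each pattern above can be checked with Python's re module; a quick sketch:

import re

examples = [
    (r"colou?r", "color colour"),
    (r"oo*h!",   "oh! ooh! oooh! ooooh!"),
    (r"o+h!",    "oh! ooh! oooh! ooooh!"),
    (r"baa+",    "baa baaa baaaa baaaaa"),
    (r"beg.n",   "begin begun beg3n"),
]
for pattern, text in examples:
    print(pattern, re.findall(pattern, text))   # prints every match of the pattern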
Compiled Pattern Objects

Pattern objects have methods that parallel the re functions (e.g., match, search, split, findall, sub), e.g.:

>>> import re
>>> p1 = re.compile(r"\w+@\w+\.(?:com|org|net|edu)")   # simple email-address pattern
>>> p1.match("steve@apple.com").group(0)
'steve@apple.com'
>>> p1.search("Email steve@apple.com today.").group(0)
'steve@apple.com'
>>> p1.findall("Email steve@apple.com and bill@msft.com now.")
['steve@apple.com', 'bill@msft.com']
>>> p2 = re.compile(r"[.?!]+\s+")                      # sentence boundary
>>> p2.split("Tired? Go to bed! Now!! ")
['Tired', 'Go to bed', 'Now', '']
RE Errors cont.

Searching for the pattern /the/ incorrectly matches inside other words, e.g. inside they and their in:

they lay back on the San Francisco grass and looked at the stars and their
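This is easy to verify in Python; adding word boundaries (\b) removes the false positives:

>>> import re
>>> s = ("they lay back on the San Francisco grass "
...      "and looked at the stars and their")
>>> re.findall(r'the', s)          # also matches inside 'they' and 'their'
['the', 'the', 'the', 'the']
>>> re.findall(r'\bthe\b', s)      # word boundaries keep only whole words
['the', 'the']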
The Porter Stemmer: Step 1a (plurals)

Definitions:
CONSONANT: a letter other than A, E, I, O, U, and other than Y when preceded by a consonant
VOWEL: any other letter

Rules:
SSES -> SS    caresses -> caress
IES  -> I     ponies -> poni, ties -> ti
SS   -> SS    caress -> caress
S    -> Ø     cats -> cat
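These Step 1a rules are simple enough to express directly; a minimal sketch (not the full Porter algorithm):

def porter_step_1a(word):
    # rules are tried in order; the first matching suffix wins
    if word.endswith('sses'):
        return word[:-2]          # SSES -> SS: caresses -> caress
    if word.endswith('ies'):
        return word[:-2]          # IES -> I: ponies -> poni, ties -> ti
    if word.endswith('ss'):
        return word               # SS -> SS: caress -> caress
    if word.endswith('s'):
        return word[:-1]          # S -> null: cats -> cat
    return word

for w in ['caresses', 'ponies', 'ties', 'caress', 'cats']:
    print(w, '->', porter_step_1a(w))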
The Porter Stemmer: Step 2a (past tense, progressive)

(*V*) ED -> Ø   (drop a final -ed only if the remaining stem contains a vowel)
Condition verified:     plastered -> plaster
Condition not verified: bled -> bled
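The vowel condition can be checked the same way; a sketch that simplifies Porter's vowel definition to the five plain vowels:

import re

def porter_step_2a(word):
    # (*V*) ED -> null: drop final 'ed' only if the remaining stem contains a vowel
    if word.endswith('ed') and re.search(r'[aeiou]', word[:-2]):
        return word[:-2]
    return word

print(porter_step_2a('plastered'))   # plaster (the stem 'plaster' contains vowels)
print(porter_step_2a('bled'))        # bled (the stem 'bl' has no vowel)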
Worked examples:

computers
  Step 1, Rule 4:    -> computer
  Step 6, Rule 4:    -> compute
singing
  Step 2a, Rule 3:   -> sing
controlling
  Step 2a, Rule 3:   -> controll
  Step 7b:           -> control
generalizations
  Step 1, Rule 4:    -> generalization
  Step 4, Rule 11:   -> generalize
  Step 6, last rule: -> general
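NLTK ships an implementation of this algorithm; note that its output can differ in detail from the hand-traced steps above (e.g. it may leave 'comput' rather than 'compute'):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ['computers', 'singing', 'controlling', 'generalizations']:
    print(word, '->', stemmer.stem(word))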
Problems

• Conflation: distinct words collapse to the same stem: reply, rep -> rep
• Overstemming: too much is removed: wander -> wand; news -> new
• Misstemming: relativity -> relative
• Understemming: too little is removed: knavish -> knavish
Challenges for Local Languages

NLTK's RegexpStemmer strips matching affixes, which makes it a quick starting point for local languages such as Amharic:

>>> from nltk.stem import RegexpStemmer
>>> stemmer = RegexpStemmer('occh')
>>> stemmer.stem('sewocch')
'sew'
Stemming Example

>>> stemmer = RegexpStemmer('occh|achn|u|acchew')   # removes every occurrence of the listed affixes
>>> stemmer.stem('sewocch')
'sew'
>>> stemmer.stem('sewocchuacchew')
'sew'
>>> stemmer.stem('sewocchuachn')
'sew'
>>> stemmer.stem('sewuacchewachn')
'sew'
>>> stemmer.stem('occhsewuacchewocchachn')
'sew'
>>> stemmer.stem('occhseocchwuacchewocchachn')
'sew'
Stemming Example

>>> stemmer = RegexpStemmer('ku|k|sh|e|ech|u|n|acchu')   # strips affixes and the vowels e/u, leaving the consonantal root
>>> stemmer.stem('metacchu')
'mt'
>>> stemmer.stem('seberacchu')
'sbr'
Part-of-Speech Tagging

POS tags tell us a lot about a word (and the words near it). They tell us which words are likely to occur in the neighborhood:
• adjectives are often followed by nouns
• personal pronouns are often followed by verbs
• possessive pronouns are often followed by nouns
Part-of-Speech Tagging

Tagging: the process of associating labels with each token in a text
Tags: the labels
Tag set: the collection of tags used for a particular task
Part-of-Speech Tagging: Approaches

• Rule-Based Tagger
• Stochastic Tagger (HMM-based)
• Transformation-Based Tagger (Brill)

Assumption behind n-gram (Markov) taggers:
• Bigram tagger: makes predictions based on the preceding tag; the basic unit is the pair (preceding tag, current tag)
• Trigram tagger: prediction is based on the previous two tags, and is expected to be more accurate. How?
Finding the best tag sequence T for a word sequence W:

argmax_T P(T|W)                          the probability of the tags T given the words W
= argmax_T P(T) P(W|T)                   by Bayes' rule; P(W) is constant for a given W
= argmax_{t1..tn} P(t1..tn) P(w1..wn | t1..tn)
≈ argmax_{t1..tn} [P(t1) P(t2|t1) ... P(tn|tn-1)] [P(w1|t1) P(w2|t2) ... P(wn|tn)]
                                         bigram transition probabilities times per-word emission probabilities

The transition probabilities are estimated from corpus counts:

P(ti | ti-1) = C(ti-1 ti) / C(ti-1)
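The count-based estimate is easy to compute from any tagged corpus; a minimal sketch using the Brown news section (the tag names 'AT' and 'NN' are Brown-tag-set examples):

import nltk
from collections import Counter

tags = [tag for _, tag in nltk.corpus.brown.tagged_words(categories='news')]
bigram_counts = Counter(zip(tags, tags[1:]))   # C(t_{i-1} t_i)
unigram_counts = Counter(tags)                 # C(t_{i-1})

def p_transition(prev_tag, tag):
    return bigram_counts[(prev_tag, tag)] / unigram_counts[prev_tag]

print(p_transition('AT', 'NN'))   # estimate of P(NN | AT)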
Sample Markov Model for POS

[Figure: state-transition diagram over the states start, Det, Noun, PropNoun, Verb, and stop, with transition probabilities on the arcs (e.g. start -> PropNoun 0.4, PropNoun -> Verb 0.8, Verb -> Det 0.25, Det -> Noun 0.95, Noun -> stop 0.1).]
Sample HMM for POS

[Figure: the same state-transition diagram as above.]

P(PropNoun Verb Det Noun) = 0.4 * 0.8 * 0.25 * 0.95 * 0.1 = 0.0076
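The same computation as a sketch, with the arc probabilities from the figure written out as a transition table:

# transition probabilities read off the diagram above
trans = {('start', 'PropNoun'): 0.4, ('PropNoun', 'Verb'): 0.8,
         ('Verb', 'Det'): 0.25, ('Det', 'Noun'): 0.95, ('Noun', 'stop'): 0.1}

path = ['start', 'PropNoun', 'Verb', 'Det', 'Noun', 'stop']
p = 1.0
for prev, cur in zip(path, path[1:]):
    p *= trans[(prev, cur)]
print(p)   # 0.0076 (up to floating-point rounding)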
Sample HMM Generation

[Figure: HMM generation example.]
PoS Tagging Example

Applying NLTK's English POS tagger to transliterated Amharic text:

>>> import nltk
>>> text = 'zare sewochu begun yametutal'
>>> token = nltk.word_tokenize(text)
>>> token
['zare', 'sewochu', 'begun', 'yametutal']
>>> taggedtoken = nltk.pos_tag(token)
>>> taggedtoken
[('zare', 'NN'), ('sewochu', 'VBD'), ('begun', 'VBN'), ('yametutal', 'JJ')]
Tagging in NLTK: Default Tagger

A default tagger assigns the same tag to every token:

>>> token = nltk.word_tokenize('zare sewocch beserur betoch endesetalen')
>>> amh_default_tagger = nltk.DefaultTagger('VB')
>>> amh_default_tagger.tag(token)
[('zare', 'VB'), ('sewocch', 'VB'), ('beserur', 'VB'), ('betoch', 'VB'), ('endesetalen', 'VB')]
Tagging in NLTK: RE Tagger

A regular-expression tagger tries each (pattern, tag) pair in order; the catch-all pattern at the end acts as a default:

>>> amhpatt = [
...     (r'.*occh$', 'NN'),
...     (r'.*och$', 'NN'),
...     (r'.*', 'VB')
... ]
>>> amh_tagger = nltk.RegexpTagger(amhpatt)
>>> taggedtext = amh_tagger.tag(token)
>>> taggedtext
[('zare', 'VB'), ('sewocch', 'NN'), ('beserur', 'VB'), ('betoch', 'NN'), ('endesetalen', 'VB')]
Setting the Scene: Learning
N-gram model (2)

Unigram Tagger
• Finds the most frequent tag for each word in a training corpus.
• When it later sees that word, the tagger assigns it the tag observed most frequently with it.
E.g., in a tagged Amharic corpus:
a) the word 'sra-ስራ' is found 25 times
Unigram Tagger

>>> brown_a = nltk.corpus.brown.tagged_sents(categories='news')
>>> unigram_tagger = nltk.UnigramTagger(brown_a)
>>> sent = nltk.corpus.brown.sents(categories='news')[2007]
>>> unigram_tagger.tag(sent)
[('Various', None), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'), ('are', 'BER'),
 ('of', 'IN'), ('the', 'AT'), ('terrace', 'NN'), ('type', 'NN'), (',', ','),
 ('being', 'BEG'), ('on', 'IN'), ('the', 'AT'), ('ground', 'NN'), ('floor', 'NN'),
 ('so', 'QL'), ('that', 'CS'), ('entrance', 'NN'), ('is', 'BEZ'), ('direct', 'JJ'), ('.', '.')]
N-gram model (5)

A unigram tagger trained on English returns None for words it has never seen; Amharic words that happen to coincide with English forms ('and', 'new') still receive English tags:

>>> text = 'esu and gize new yemetaw'
>>> unigram_tagger.tag(nltk.word_tokenize(text))
[('esu', None),
 ('and', 'CC'),
 ('gize', None),
 ('new', 'JJ'),
 ('yemetaw', None)]
N-gram model (8)

A bigram tagger fails completely on the unseen text: the first unknown word gets None, and every following (tag, word) context is then also unseen:

>>> bigram_tagger = nltk.BigramTagger(brown_a)
>>> bigram_tagger.tag(nltk.word_tokenize(text))
[('esu', None),
 ('and', None),
 ('gize', None),
 ('new', None),
 ('yemetaw', None)]
N-gram model (10)

Combining Taggers
• Accuracy increases as we move from a simpler tagger to a more complicated one.
• How can we benefit from all of them? Combine them. How?
• Backing off: try the most specific tagger first, and fall back to a more general, catch-all tagger for the tokens it cannot handle.
N-gram model (11)

Combining Taggers
1) Try a BigramTagger
2) Try a UnigramTagger
3) Try a RegexpTagger (can be added before the default)
4) Get everything else with a DefaultTagger

>>> tagger1 = nltk.DefaultTagger('NN')
>>> tagger2 = nltk.UnigramTagger(brown_a, backoff=tagger1)
>>> tagger3 = nltk.BigramTagger(brown_a, backoff=tagger2)
>>> tagger1.tag(nltk.word_tokenize(text))
[('esu', 'NN'),
('and', 'NN'),
('gize', 'NN'),
('new', 'NN'),
('yemetaw', 'NN')]
>>> tagger2.tag(nltk.word_tokenize(text))
[('esu', 'NN'),
('and', 'CC'),
('gize', 'NN'),
('new', 'JJ'),
('yemetaw', 'NN')]
>>> tagger3.tag(nltk.word_tokenize(text))
[('esu', 'NN'),
('and', 'CC'),
('gize', 'NN'),
('new', 'JJ'),
('yemetaw', 'NN')]
Parsing

An example grammar and lexicon:

Grammar:             Lexicon:
S  -> NP VP          N -> people
VP -> V NP           N -> fish
VP -> V NP PP        N -> tanks
NP -> NP NP          N -> rods
NP -> NP PP          V -> people
NP -> N              V -> fish
NP -> e              V -> tanks
PP -> P NP           P -> with
Phrase structure grammars in NLP (CFGs)

G = (T, N, S, R)
• T is a set of terminal symbols (the words/lexicon)
• N is a set of nonterminal symbols (NP, VP, etc.)
• S is the start symbol (S ∈ N)
• R is a set of rules/productions of the form X -> γ, with X ∈ N and γ a sequence of terminals and nonterminals

The grammar and lexicon above form one such G; a runnable sketch of it follows.
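This grammar can be written down directly in NLTK and used to parse (the empty production NP -> e is omitted here to keep the example simple):

import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP | V NP PP
NP -> NP NP | NP PP | N
PP -> P NP
N -> 'people' | 'fish' | 'tanks' | 'rods'
V -> 'people' | 'fish' | 'tanks'
P -> 'with'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse('people fish tanks with rods'.split()):
    print(tree)   # prints every licensed parse; the PP can attach in more than one way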
Probabilistic (or stochastic) context-free grammars (PCFGs)

G = (T, N, S, R, P)
• T is a set of terminal symbols
• N is a set of nonterminal symbols
• S is the start symbol (S ∈ N)
• R is a set of rules/productions X -> γ
• P is a probability function P : R -> [0, 1] such that for every nonterminal X ∈ N, the probabilities of its expansions sum to one: Σ_{X -> γ ∈ R} P(X -> γ) = 1

Such a grammar defines a language model over the strings it generates:

Σ_{γ ∈ T*} P(γ) = 1
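With rule probabilities attached (the numbers below are made up for illustration; they sum to one for each left-hand side), NLTK's Viterbi parser returns the most probable parse:

import nltk

pcfg = nltk.PCFG.fromstring("""
S -> NP VP [1.0]
VP -> V NP [0.6] | V NP PP [0.4]
NP -> NP NP [0.1] | NP PP [0.2] | N [0.7]
PP -> P NP [1.0]
N -> 'people' [0.5] | 'fish' [0.2] | 'tanks' [0.2] | 'rods' [0.1]
V -> 'people' [0.1] | 'fish' [0.6] | 'tanks' [0.3]
P -> 'with' [1.0]
""")

parser = nltk.ViterbiParser(pcfg)
for tree in parser.parse('people fish tanks with rods'.split()):
    print(tree)   # the single most probable tree, annotated with its probability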