
TOPIC 3: TEXT PROCESSING


NATURAL LANGUAGE PROCESSING (NLP)
CS-724

Wondwossen Mulugeta (PhD), email: wondemule@yahoo.com
Topics

Topic                Subtopics
3: Text Processing   1. Tokenization and Word Segmentation; Stemming; Lemmatization;
                        Morphological Processing (Types of Morphemes, Morphological Types,
                        Morphological Rules, Morphemes and Words, Inflectional and
                        Derivational Morphology)
                     2. Part of Speech Tagging
                     3. Parsing (Introduction, Context Free Grammar, Parsing)
What is Text Processing

 Text processing is the manipulation of written text in a way that is useful for further processing or for a higher-level NLP application.
 The scope of text processing varies with the application domain and the type of NLP task.
 Understanding text, identifying the relevant elements, manipulating those elements, and analyzing the structure and semantics of text elements are all vital.
Text Processing

 Every NLP task needs to do text processing:
1. Segmenting/tokenizing words in running text
2. Normalizing word formats
3. Segmenting sentences in running text
 Performing these tasks requires robust text processing.
 Most important approach: regular expressions.
Regular expressions

 Regular Expression (RE) is a formal language for specifying text strings.
 How can we search for any of these?
 Database
 DataBase
 database
 Data-Base
 Data-base
 data-base
Regular Expressions

 Regular expressions are a powerful string manipulation tool.
 All modern programming languages have similar library packages for regular expressions.
 Use regular expressions to:
 Search a string (search and match)
 Replace parts of a string (sub)
 Break strings into smaller pieces (split)
Python's Regular Expression Syntax

 Most characters match themselves:
the regular expression "test" matches the string 'test', and only that string.
 [x] matches any one of a list of characters:
"[abc]" matches 'a', 'b', or 'c'.
 [^x] matches any one character that is not included in x:
"[^abc]" matches any single character except 'a', 'b', or 'c'.
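
These classes are easy to try in the Python REPL; a minimal sketch (the example strings are invented for illustration):

>>> import re
>>> re.findall("[abc]", "cabbage")     # one match per a/b/c character
['c', 'a', 'b', 'b', 'a']
>>> re.findall("[^abc]", "cabbage")    # everything except a, b, c
['g', 'e']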
Python's Regular Expression Syntax

 "." matches any single character.
 Parentheses can be used for grouping:
"(abc)+" matches 'abc', 'abcabc', 'abcabcabc', etc.
 x|y matches x or y:
"this|that" matches 'this' or 'that', but not 'thisthat'.
Python's Regular Expression Syntax

 x* matches zero or more x's:
"a*" matches '', 'a', 'aa', etc.
 x+ matches one or more x's:
"a+" matches 'a', 'aa', 'aaa', etc.
 x? matches zero or one x:
"a?" matches '' or 'a'
 x{m,n} matches i x's, where m ≤ i ≤ n:
"a{2,3}" matches 'aa' or 'aaa'
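
The quantifiers can be checked the same way; a small sketch with invented strings:

>>> import re
>>> re.findall("a+", "baaa aa b")      # one or more a's
['aaa', 'aa']
>>> re.fullmatch("a{2,3}", "aaaa")     # four a's: no match, returns None
>>> re.fullmatch("colou?r", "colour").group()
'colour'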
Regular Expression Syntax

 "\d" matches any digit; "\D" any non-digit.
 "\s" matches any whitespace character; "\S" any non-whitespace character.
 "\w" matches any alphanumeric character; "\W" any non-alphanumeric character.
 "^" matches the beginning of the string; "$" the end of the string.
 "\b" matches a word boundary; "\B" matches a position that is not a word boundary.
Search and Match

 The two basic functions are re.search and re.match.
 search looks for a pattern anywhere in a string.
 match looks for a match starting at the beginning.
 Both return None (logical false) if the pattern isn't found, and a "match object" instance if it is.

>>> import re
>>> pat = "a*b"
>>> re.search(pat, "fooaaabcde")
<_sre.SRE_Match object; span=(3, 7), match='aaab'>
>>> re.match(pat, "fooaaabcde")    # returns None: no match at the start
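
In the REPL a None result prints nothing, so code should test the match object before calling its methods; a minimal sketch:

>>> m = re.match(pat, "fooaaabcde")
>>> print(m)
None
>>> m = re.search(pat, "fooaaabcde")
>>> if m:
...     print(m.group())
...
aaab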
Regular Expressions: ? * + .

Pattern   Matches
colou?r   color, colour               (optional previous char)
oo*h!     oh!, ooh!, oooh!, ooooh!    (0 or more of previous char)
o+h!      oh!, ooh!, oooh!, ooooh!    (1 or more of previous char)
baa+      baa, baaa, baaaa, baaaaa
beg.n     begin, begun, beg3n
Errors

 The process of refining a pattern (e.g., one for the word "the") is based on fixing two kinds of errors:
 Matching strings that we should not have matched (there, then, other)
 False positives (Type I)
 Not matching things that we should have matched (The)
 False negatives (Type II)
What got matched?

 Here's a pattern to match simple email addresses:
\w+@(\w+\.)+(com|org|net|edu)

>>> pat1 = "\w+@(\w+\.)+(com|org|net|edu)"
>>> r1 = re.match(pat1, "finin@cs.umbc.edu")
>>> r1.group()
'finin@cs.umbc.edu'

 We might want to extract the pattern parts, like the email name and host.
What got matched?

 We can put parentheses around groups we want to be able to reference:
>>> pat2 = "(\w+)@((\w+\.)+(com|org|net|edu))"
>>> r2 = re.match(pat2, "finin@cs.umbc.edu")
>>> r2.group(1)
'finin'
>>> r2.group(2)
'cs.umbc.edu'
>>> r2.groups()
('finin', 'cs.umbc.edu', 'umbc.', 'edu')
Note that the groups are numbered in a preorder traversal of the forest.
What got matched?

 We can 'label' the groups as well…
>>> pat3 = "(?P<name>\w+)@(?P<host>(\w+\.)+(com|org|net|edu))"
>>> r3 = re.match(pat3, "finin@cs.umbc.edu")
>>> r3.group('name')
'finin'
>>> r3.group('host')
'cs.umbc.edu'
 … and reference the matching parts by the labels.
More re functions

 re.split() is like split but can use patterns:
>>> re.split("\W+", "This... is a test, short and sweet, of split().")
['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split', '']
 re.sub substitutes one string for a pattern:
>>> re.sub('(blue|white|red)', 'black', 'blue socks and red shoes')
'black socks and black shoes'
 re.findall() finds all matches:
>>> re.findall("\d+", "12 dogs, 11 cats, 1 egg")
['12', '11', '1']
Compiling regular expressions

 If you plan to use a re pattern more than once, compile it to a re object.
 Python produces a special data structure that speeds up matching.

>>> cpat3 = re.compile(pat3)
>>> cpat3
<_sre.SRE_Pattern object at 0x2d9c0>
>>> r3 = cpat3.search("finin@cs.umbc.edu")
>>> r3
<_sre.SRE_Match object at 0x895a0>
>>> r3.group()
'finin@cs.umbc.edu'
Pattern object methods

Pattern objects have methods that parallel the re functions (e.g., match, search, split, findall, sub), e.g.:

# matching email addresses
>>> p1 = re.compile("\w+@\w+\.(?:com|org|net|edu)")
>>> p1.match("steve@apple.com").group(0)
'steve@apple.com'
>>> p1.search("Email steve@apple.com today.").group(0)
'steve@apple.com'
>>> p1.findall("Email steve@apple.com and bill@msft.com now.")
['steve@apple.com', 'bill@msft.com']

# splitting at sentence boundaries
>>> p2 = re.compile("[.?!]+\s+")
>>> p2.split("Tired? Go to bed! Now!! ")
['Tired', 'Go to bed', 'Now', '']
RE Errors cont.

 In NLP we are always dealing with these kinds of errors.
 Reducing the error rate for an application often involves two antagonistic efforts:
 Increasing accuracy or precision (minimizing false positives)
 Increasing coverage or recall (minimizing false negatives)
Sentence Segmentation

 !, ? are relatively unambiguous.
 Period "." is quite ambiguous:
 Sentence boundary
 Abbreviations like Inc. or Dr.
 Numbers like .02% or 4.3
 Build a binary classifier that:
 Looks at a "."
 Decides EndOfSentence/NotEndOfSentence
 Classifiers: hand-written rules, regular expressions, or machine learning
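
A minimal regular-expression sketch of such a classifier: treat a "." as a sentence boundary only if it is not part of a known abbreviation or a number and is followed by whitespace and a capital letter. The abbreviation list here is hypothetical and far from complete.

import re

ABBREV = {"Dr", "Mr", "Mrs", "Inc", "etc"}   # hypothetical, incomplete list

def end_of_sentence(text, i):
    """Binary decision for the '.' at index i: EndOfSentence or not."""
    left = re.search(r"(\w+)\.$", text[:i + 1])
    if left and left.group(1) in ABBREV:
        return False                    # abbreviation such as Dr.
    if left and left.group(1).isdigit():
        return False                    # number such as 4.3
    return bool(re.match(r"\s+[A-Z]", text[i + 1:]))

# 'Dr.' is not a boundary; the '.' after 'left' is
text = "Dr. Smith left. He was tired."
print([i for i, c in enumerate(text) if c == "." and end_of_sentence(text, i)])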
Implementing Decision Trees

 A decision tree is just an if-then-else statement.
 The challenging part is choosing the features.
 Setting up the structure is often too hard to do by hand.
 Hand-building is only possible for very simple features and domains.
 Instead, the structure is usually learned by machine learning from a training corpus.
Tokenization…. How many words?

they lay back on the San Francisco grass and looked at the stars and their

 Type: an element of the vocabulary.
 Token: an instance of that type in running text.
 How many?
 15 tokens (or 14)
 13 types (or 12) (or 11?)
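
The counts are easy to reproduce (a sketch; treating San and Francisco as separate tokens and folding case gives the 15/13 figures):

>>> s = "they lay back on the San Francisco grass and looked at the stars and their"
>>> tokens = s.split()          # naive whitespace tokenization
>>> len(tokens)
15
>>> len(set(t.lower() for t in tokens))
13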
Tokenization: language issues

 Chinese and Japanese have no spaces between words:
 莎拉波娃现在居住在美国东南部的佛罗里达。
 莎拉波娃 现在 居住 在 美国 东南部 的 佛罗里达
 Sharapova now lives in US southeastern Florida
 Further complicated in Japanese, with multiple alphabets intermingled
 Dates/amounts in multiple formats:
 January 2, 2016 vs 02/01/2016 vs 01/02/2016 vs ……
 Amharic:
 ወደቤቱና vs ወደ ቤቱና vs ወደቤቱ እና vs ወደ : ቤቱ : እና ?
Tokenization: language issues

 Important questions to be answered:
1. Which element is required from the text?
2. How do we separate one item from the other?
3. How do we deal with irregularities?
4. For what kind of application is the text required?
5. What should the possible output of tokenization be?
 Words, characters, punctuation, numbers, etc.
Normalization

 Need to "normalize" terms.
 Some written words differ at the surface but are identical in meaning:
 ሃገር vs ሀገር vs ኅገር vs ሐገር vs ሓገር
 U.S.A. vs USA vs US
 We implicitly define equivalence classes of terms
 e.g., deleting periods in a term
 Alternative: asymmetric expansion:
 Enter: window    Search: window, windows
 Enter: windows   Search: Windows, windows, window
 Enter: Windows   Search: Windows
 Potentially more powerful, but less efficient
Case folding

 Applications like IR: reduce all letters to lower case
 Since users tend to use lower case
 Possible exception: upper case in mid-sentence?
 e.g., General Motors vs general motors
 Fed vs. fed
 SAIL vs. sail
 For sentiment analysis, MT, and information extraction, case is helpful (US versus us is important)
Stemming

 Reduce terms to their stems in information retrieval.
 Stemming is crude chopping of affixes
 language dependent
 e.g., automate(s), automatic, automation all reduced to automat.

Original:                                    Stemmed:
for example compressed and compression       for example compress and compress
are both accepted as equivalent to           are both accept as equivalent to
compress.                                    compress
Lemmatization and Stemming

 The goal of both stemming and lemmatization is to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form.
 Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
Lemmatization and Stemming

 Words can be viewed as consisting of:
 a Stem, and
 one or more Affixes
 Morphological analysis in its general form involves recovering the LEMMA of a word and all its affixes, together with their grammatical properties.
 Stemming is a simplified form of morphological analysis: simply find the stem.
 Lemmatization is the process of identifying the lexical/dictionary term after removing all affixes.
The Porter Stemmer (Porter, 1980)

 A simple rule-based algorithm for stemming
 An example of a HEURISTIC method
 Based on rules like:
 ATIONAL -> ATE (e.g., relational -> relate)
 The algorithm consists of seven sets of rules, applied in order
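
NLTK ships a Porter implementation, so the rules below can be checked directly; a minimal sketch:

>>> from nltk.stem import PorterStemmer
>>> porter = PorterStemmer()
>>> [porter.stem(w) for w in ["caresses", "ponies", "motoring", "happy"]]
['caress', 'poni', 'motor', 'happi']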
The Porter Stemmer: definitions

 Definitions:
 CONSONANT: a letter other than A, E, I, O, U, and other than Y when preceded by a consonant
 VOWEL: any other letter
 With these definitions, all words are of the form:
(C)(VC)^m(V)
where C is a string of one or more consonants, V is a string of one or more vowels, and m is the measure of the word.
The Porter Stemmer: rule format

 The rules are of the form:
(condition) S1 -> S2
where S1 and S2 are suffixes.
 Conditions:
 m     the measure of the stem
 *S    the stem ends with S
 *v*   the stem contains a vowel
 *d    the stem ends with a double consonant
 *o    the stem ends in CVC (second C not W, X, or Y)
The Porter Stemmer: Step 1

 SSES -> SS
 caresses -> caress
 IES -> I
 ponies -> poni
 ties -> ti
 SS -> SS
 caress -> caress
 S -> Ø
 cats -> cat
The Porter Stemmer: Step 2a (past tense, progressive)

 (m>0) EED -> EE
 Condition verified: agreed -> agree
 Condition not verified: feed -> feed
 (*v*) ED -> Ø
 Condition verified: plastered -> plaster
 Condition not verified: bled -> bled
 (*v*) ING -> Ø
 Condition verified: motoring -> motor
 Condition not verified: sing -> sing
The Porter Stemmer: Step 2b (cleanup)

 (These rules run only if the second or third rule in 2a applied)
 AT -> ATE
 conflat(ed) -> conflate
 BL -> BLE
 troubl(ing) -> trouble
 (*d & !(*L or *S or *Z)) -> single letter
 Condition verified: hopp(ing) -> hop, tann(ed) -> tan
 Condition not verified: fall(ing) -> fall
 (m=1 & *o) -> E
 Condition verified: fil(ing) -> file
 Condition not verified: fail(ing) -> fail
The Porter Stemmer: Steps 3 and 4

 Step 3: Y Elimination: (*v*) Y -> I
 Condition verified: happy -> happi
 Condition not verified: sky -> sky
 Step 4: Derivational Morphology, I
 (m>0) ATIONAL -> ATE
 relational -> relate
 (m>0) IZATION -> IZE
 generalization -> generalize
 (m>0) BILITI -> BLE
 sensibiliti -> sensible
The Porter Stemmer: Steps 5 and 6

 Step 5: Derivational Morphology, II
 (m>0) ICATE -> IC
 triplicate -> triplic
 (m>0) FUL -> Ø
 hopeful -> hope
 (m>0) NESS -> Ø
 goodness -> good
 Step 6: Derivational Morphology, III
 (m>1) ANCE -> Ø
 allowance -> allow
 (m>1) ENT -> Ø
 dependent -> depend
 (m>1) IVE -> Ø
 effective -> effect
The Porter Stemmer: Step 7 (cleanup)

 Step 7a
 (m>1) E -> Ø
 probate -> probat
 (m=1 & !*o) E -> Ø
 cease -> ceas
 Step 7b
 (m>1 & *d & *L) -> single letter
 Condition verified: controll -> control
 Condition not verified: roll -> roll
Examples

 computers
 Step 1, Rule 4: -> computer
 Step 6, Rule 4: -> compute
 singing
 Step 2a, Rule 3: -> sing
 controlling
 Step 2a, Rule 3: -> controll
 Step 7b: -> control
 generalizations
 Step 1, Rule 4: -> generalization
 Step 4, Rule 11: -> generalize
 Step 6, last rule: -> general
Problems

 elephants -> eleph
 Step 1, Rule 4: -> elephant
 Step 6, Rule 7: -> eleph
 doing -> doe
 Step 2a, Rule 3: -> do
Types of Stemming Errors

• Conflation: reply, rep -> rep (distinct words conflated to one stem)
• Overstemming: wander -> wand; news -> new
• Misstemming: relativity -> relative
• Understemming: knavish -> knavish
Challenges for Local Languages

 Ethiopian languages are tough for stemming.
 The challenge lies mainly in:
 the orthography,
 the morphology.
Stemming Example

>>> from nltk.stem import RegexpStemmer
>>> stemmer=RegexpStemmer('ing')
>>> stemmer.stem('cooking')
'cook'

NB: this will remove 'ing' from a word wherever it is found.

What comes first for local languages with non-Latin scripts? TRANSLITERATION

>>> stemmer=RegexpStemmer('occh')
>>> stemmer.stem('sewocch')
'sew'
Stemming Example
>>> stemmer=RegexpStemmer('occh|achn|u|acchew')
>>> stemmer.stem('sewocch')
'sew'
>>> stemmer.stem('sewocchuacchew')
'sew'
>>> stemmer.stem('sewocchuachn')
'sew'
>>> stemmer.stem('sewuacchewachn')
'sew'
>>> stemmer.stem('occhsewuacchewocchachn')
'sew'
>>> stemmer.stem('occhseocchwuacchewocchachn')
'sew'
Stemming Example
>>> stemmer=RegexpStemmer('ku|k|sh|e|ech|u|n|acchu')
>>> stemmer.stem('metacchu')
'mt'

>>> stemmer.stem('seberacchu')
'sbr'
Part Of Speech Tagging

 Words can be divided into classes that behave similarly.
 Traditionally eight parts of speech: noun, verb, pronoun, preposition, adverb, conjunction, adjective, and article.
 They tell us a lot about a word (and the words near it):
 Tell us what words are likely to occur in the neighborhood
 adjectives are often followed by nouns
 personal pronouns are often followed by verbs
 possessive pronouns by nouns
Part Of Speech Tagging

 PoS tagging is the process of annotating each word in a sentence with a part-of-speech marker.
 It is the lowest level of syntactic analysis.

John  saw  the  saw  and  decided  to  take  it   to  the  table.
NNP   VBD  DT   NN   CC   VBD      TO  VB    PRP  IN  DT   NN

 Useful for subsequent syntactic parsing and word sense disambiguation.
Tagging Terminology

 Tagging
 The process of associating labels with each token in a text
 Tags
 The labels
 Tag Set
 The collection of tags used for a particular task
Tagging Example

Typically a tagged text is a sequence of white-space separated base/tag tokens:

The/at Pantheon's/np interior/nn ,/, still/rb in/in its/pp original/jj form/nn ,/, is/bez truly/ql majestic/jj and/cc an/at architectural/jj triumph/nn ./. Its/pp rotunda/nn forms/vbz a/at perfect/jj circle/nn whose/wp diameter/nn is/bez equal/jj to/in the/at height/nn from/in the/at floor/nn to/in the/at ceiling/nn ./.
What does tagging do?

1. Collapses some distinctions
• Lexical identity may be discarded
• e.g., all personal pronouns tagged with PRP
2. … but introduces others
• Ambiguities may be removed
• e.g., deal tagged with NN or VB
• e.g., deal tagged with DEAL1 or DEAL2
3. Helps classification and prediction
Parts of Speech (POS)

 A word's POS tells us a lot about the word and its neighbors:
 Limits the range of meanings and pronunciations
 Helps in stemming
 Limits the range of following words for speech recognition
 Can help select nouns from a document for IR
 Basis for partial parsing (chunked parsing)
 Parsers can build trees directly on the POS tags instead of maintaining a lexicon
POS and Tagsets

 The choice of tagset greatly affects the difficulty of the problem.
 Need to strike a balance between:
 Getting better information about context (best: introduce more distinctions)
 Making it possible for classifiers to do their job (need to minimize distinctions)
Common Tagsets

 Brown corpus: 87 tags
 Penn Treebank: 45 tags
 Lancaster UCREL C5 (used to tag the British National Corpus - BNC): 61 tags
 Lancaster C7: 145 tags
The challenge is still there…

Which tag for which word?
If not done manually, then look for the most probable word class for any string found in the text: use an already tagged document as the source of tag patterns.

WORDS: the, girl, kissed, the, boy, on, the, cheek
TAGS: N, V, P, DET
Automatic Taggers

• The size of a tag set depends on:
 • the language,
 • the objectives, and
 • the purpose
 simple morphology = more ambiguity = fewer tags
• Part-of-Speech Tagging approaches:
 • Rule-Based Tagger
 • Stochastic Tagger: HMM-based
 • Transformation-Based Tagger (Brill)
 Some address the ambiguity problem; the probabilistic approach tries to find the most likely tag sequence.
POS Tagging Approaches

 Rule-Based: human-crafted rules based on lexical and other linguistic knowledge.
 Learning-Based: trained on human-annotated corpora like the Penn Treebank.
 Statistical models: Hidden Markov Model (HMM), Maximum Entropy Markov Model (MEMM), Conditional Random Field (CRF)
 Rule learning: Transformation Based Learning (TBL)
 Generally, learning-based approaches have been found to be more effective overall, taking into account the total amount of human expertise and effort involved.
Stochastic Tagging

• Based on the probability of a certain tag occurring, given various possibilities
• Requires a training corpus
 – No probabilities for words not in the corpus
 – The training corpus may be different from the test corpus
• Simple method: choose the most frequent tag in the training text for each word (a sketch follows)
 – Result: 90% accuracy
 – Unknown words never encountered before remain a problem
 – HMM is an example of stochastic tagging
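
The most-frequent-tag baseline is only a few lines over a tagged corpus; a sketch (the corpus and fallback tag are up to the application):

from collections import Counter, defaultdict

def train_most_frequent(tagged_sents):
    """For each word, remember the tag it carries most often in training."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_baseline(words, model, default="NN"):
    # unseen words get the default tag: this is the unknown-word problem above
    return [(w, model.get(w, default)) for w in words]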
HMM Tagger

• The whole idea: guess the most likely tag for a given word.
• The issue is maximization.
• Assumption: a word's tag only depends on the previous tag (limited horizon), and this dependency does not change over time (time invariance).
• An HMM tagger selects the tag sequence that maximizes:
P(word | tag) × P(tag | previous n tags)
Markov Model Taggers

 Bigram tagger
 Makes predictions based on the preceding tag
 The basic unit is the preceding tag and the current tag
 Trigram tagger
 Prediction based on the previous two tags
 Expected to make more accurate predictions … how?
 RB (adverb) VBD (past tense) vs RB VBN (past participle)?
E.g., "clearly marked":
 "Is clearly marked": P(BEZ RB VBN) > P(BEZ RB VBD)
 "He clearly marked": P(PN RB VBD) > P(PN RB VBN)
Ngram-HMM Tagger … the beginning

 argmax_T P(T|W)
 the probability of the tag sequence T given the word sequence W
 = argmax_T P(T) P(W|T)
 the probability of the tags times the probability of the words given the tags
 = argmax_{t1…tn} P(t1…tn) P(w1…wn | t1…tn)
 ≈ argmax [P(t1) P(t2|t1) … P(tn|tn-1)] × [P(w1|t1) P(w2|t2) … P(wn|tn)]

 To tag a single word: ti = argmax P(ti|ti-1) P(wi|ti)
 How do we compute P(ti|ti-1)? c(ti-1 ti) / c(ti-1)
 How do we compute P(wi|ti)? c(wi, ti) / c(ti)
 How do we compute the most probable tag sequence? With the Viterbi algorithm.
An Example Bigram

 Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN
 People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN

to/TO race/???
the/DT race/???

 ti = argmax P(ti|ti-1) P(wi|ti)
 max[ P(VB|TO) P(race|VB), P(NN|TO) P(race|NN) ]
 Brown corpus counts reveal that:
 P(NN|TO) × P(race|NN) = .021 × .00041 = .0000086
 P(VB|TO) × P(race|VB) = .34 × .00003 = .00001
 so race after to/TO is tagged VB.
Markov Model / Markov Chain

 A finite state machine with probabilistic state transitions.
 Makes the Markov assumption that the next state only depends on the current state and is independent of previous history.
Sample Markov Model for POS

[State-transition diagram over the tags start, Det, Noun, PropNoun, Verb, and stop, with transition probabilities on the arcs (e.g., start→PropNoun 0.4, PropNoun→Verb 0.8, Verb→Det 0.25, Det→Noun 0.95, Noun→stop 0.1).]

P(PropNoun Verb Det Noun) = 0.4 × 0.8 × 0.25 × 0.95 × 0.1 = 0.0076
Sample HMM for POS

[The same state-transition diagram, with each state now emitting words: Det emits the/a/that; Noun emits cat/dog/car/bed/pen/apple; PropNoun emits Tom/John/Mary/Alice/Jerry; Verb emits bit/ate/saw/played/hit/gave.]
Sample HMM Generation

[A sequence of slides steps through generation from the HMM above: starting in start, the model moves PropNoun → Verb → Det → Noun → stop, emitting the sentence "John bit the apple" one word per state.]
HMM Learning

 Supervised learning: all training sequences are completely labeled (tagged).
 Unsupervised learning: all training sequences are unlabeled (but we generally know the number of tags, i.e. states).
 Semi-supervised learning: some training sequences are labeled, most are unlabeled.
Supervised HMM Training

 If training sequences are labeled (tagged) with the underlying state sequences that generated them, then the parameters λ = {A, B} can all be estimated directly.

Training sequences (states: Det, Noun, PropNoun, Verb):
 John ate the apple
 A dog bit Mary
 Mary hit the dog
 John gave Mary the cat.
 …
→ Supervised HMM Training → estimated model
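
NLTK can do exactly this supervised estimation; a sketch using its HMM trainer on the Brown corpus (the corpus slice is arbitrary):

>>> from nltk.corpus import brown
>>> from nltk.tag.hmm import HiddenMarkovModelTrainer
>>> train = brown.tagged_sents(categories='news')[:2000]
>>> hmm_tagger = HiddenMarkovModelTrainer().train_supervised(train)
>>> hmm_tagger.tag("John ate the apple".split())   # tags with the estimated A and B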
PoS Tagging Example

>>> import nltk
>>> text="This is a text to test part of speech tagging in NLTK"
>>> token=nltk.word_tokenize(text)
>>> nltk.pos_tag(token)
[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('text', 'NN'), ('to', 'TO'),
('test', 'NN'), ('part', 'NN'), ('of', 'IN'), ('speech', 'NN'), ('tagging',
'VBG'), ('in', 'IN'), ('NLTK', 'NNP')]
PoS Tagging Example

>>> import nltk
>>> text="zare sewochu begun yametutal"
>>> text
'zare sewochu begun yametutal'

>>> token=nltk.word_tokenize(text)
>>> token
['zare', 'sewochu', 'begun', 'yametutal']

>>> taggedtoken=nltk.pos_tag(token)
>>> taggedtoken
[('zare', 'NN'), ('sewochu', 'VBD'), ('begun', 'VBN'), ('yametutal',
'JJ')]
PoS Tagging Example

>>> import nltk
>>> news='''
... bezare/PP elt/PP betederegew/VB mircha/VB
... bizu/ADJ sewoch/NN bemegegnet/VB
... meretu/VB
... '''
>>> tagged_news=[nltk.tag.str2tuple(t) for t in news.split()]
>>> tagged_news
[('bezare', 'PP'), ('elt', 'PP'), ('betederegew', 'VB'), ('mircha', 'VB'), ('bizu', 'ADJ'), ('sewoch', 'NN'), ('bemegegnet', 'VB'), ('meretu', 'VB')]
Tagging in NLTK (1)

NLTK provides several means of developing a tagger:
 Default Tagger: the NLTK default tagger works by assigning a default tag to all tokens.
 Regular Expression: regular expressions (REs) can be used to tag a string.
 The expression should use part of a string to guess its part of speech.
Tagging in NLTK (2)

NLTK provides several means of developing a tagger:
 Unigram tagging: assigning the most probable tag
 Bigram tagging: assigning the most probable tag given the left-adjacent PoS
 Brill tagging: transformation-based learning of tags
Tagging in NLTK (3)

Default Tagger:
>>> text="zare sewocch beserur betoch endesetalen"
>>> token=nltk.word_tokenize(text)
>>> amh_default_tagger=nltk.DefaultTagger('VB')
>>> amh_default_tagger.tag(token)
[('zare', 'VB'), ('sewocch', 'VB'), ('beserur', 'VB'), ('betoch', 'VB'), ('endesetalen', 'VB')]
Tagging in NLTK (4)

RE Tagger:
>>> amhpatt=[
... (r'.*occh$', 'NN'),
... (r'.*och$', 'NN'),
... (r'.*','VB')
... ]
>>> amh_tagger=nltk.RegexpTagger(amhpatt)
>>> text="zare sewocch beserur betoch endesetalen"
>>> token=nltk.word_tokenize(text)
Tagging in NLTK (5)

RE Tagger:
>>> taggedtext=amh_tagger.tag(token)
>>> taggedtext
[('zare', 'VB'), ('sewocch', 'NN'), ('beserur', 'VB'), ('betoch', 'NN'), ('endesetalen', 'VB')]
Setting the Scene

[Two figure-only slides illustrating the learning setup for n-gram tagging.]
N-gram model (2)

 An n-gram tagger is a generalization of a unigram tagger whose context is the current word together with the part-of-speech tags of the n-1 preceding tokens.
 [Figure: the trigram case.]
N-gram model (3)

 Unigram Tagger
 Finds the most frequent tag for each word in a training corpus
 When it sees that word, the tagger assigns it the tag observed most frequently
 E.g., in a tagged Amharic corpus:
a) the word 'sra-ስራ' is found 25 times
b) the word 'sra-ስራ' is tagged as Noun 5 times
c) the word 'sra-ስራ' is tagged as Verb 20 times
  the most probable tag for the word 'sra' under a unigram model is Verb.
N-gram model (4)

Unigram Tagger
>>> brown_a = nltk.corpus.brown.tagged_sents(categories='news')
>>> unigram_tagger = nltk.UnigramTagger(brown_a)
>>> sent = nltk.corpus.brown.sents(categories='news')[2007]
>>> unigram_tagger.tag(sent)
[('Various', None), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'), ('are', 'BER'), ('of', 'IN'), ('the', 'AT'), ('terrace', 'NN'), ('type', 'NN'), (',', ','), ('being', 'BEG'), ('on', 'IN'), ('the', 'AT'), ('ground', 'NN'), ('floor', 'NN'), ('so', 'QL'), ('that', 'CS'), ('entrance', 'NN'), ('is', 'BEZ'), ('direct', 'JJ'), ('.', '.')]
N-gram model (5)

 For n-gram taggers with more than one token of context, first we need to train on tagged data, then use the tagger to tag untagged sentences or words.
>>> bigram_tagger = nltk.BigramTagger(brown_a)
>>> bigram_tagger.tag(sent)
[('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'), ('are', 'BER'), ('of', 'IN'), ('the', 'AT'), ('terrace', 'NN'), ('type', 'NN'), (',', ','), ('being', 'BEG'), ('on', 'IN'), ('the', 'AT'), ('ground', 'NN'), ('floor', 'NN'), ('so', 'CS'), ('that', 'CS'), ('entrance', 'NN'), ('is', 'BEZ'), ('direct', 'JJ'), ('.', '.')]
N-gram model (6)

Bigram
If we see in training:
 "They are content" (are/VBP content/JJ)
 "The most important part is content" (is/VBZ content/NN)
From this data, "content" is more likely an adjective after a VBP verb and a noun after a VBZ verb.
Thus, the bigram tagger picks the tag for "content" that is most likely given the preceding tag.
N-gram model (7)

Unigram on local sentences
>>> brown_a=nltk.corpus.brown.tagged_sents()
>>> text='esu and gize new yemetaw'
>>> unigram_tagger=nltk.UnigramTagger(brown_a)
>>> unigram_tagger.tag(nltk.word_tokenize(text))
[('esu', None), ('and', 'CC'), ('gize', None), ('new', 'JJ'), ('yemetaw', None)]
N-gram model (8)

Bigram on local sentences
>>> brown_a=nltk.corpus.brown.tagged_sents()
>>> text='esu and gize new yemetaw'
>>> bigram_tagger=nltk.BigramTagger(brown_a)
>>> bigram_tagger.tag(nltk.word_tokenize(text))
[('esu', None), ('and', None), ('gize', None), ('new', None), ('yemetaw', None)]

Why do we get this result? The bigram tagger never saw these words in training, and once a token is tagged None, the context for the next token is also unseen, so every following tag falls to None as well.
N-gram model (9)

Unigram vs Bigram on local (Amharic) sentences
>>> text='esu and gize new yemetaw'
>>> unigram_tagger.tag(nltk.word_tokenize(text))
[('esu', None), ('and', 'CC'), ('gize', None), ('new', 'JJ'), ('yemetaw', None)]
>>> bigram_tagger.tag(nltk.word_tokenize(text))
[('esu', None), ('and', None), ('gize', None), ('new', None), ('yemetaw', None)]
N-gram model (10)

Combining Taggers
• Accuracy increases as we move from a simpler tagger to a more complicated one.
• How can we benefit from all of them? Combine them.
• Start with the most complex; if no tag is found for some of the words, fall back to the next simpler tagger; if a tag is still not found, fall back to the default tagger.
• HOW? Backing off: use a catch-all backoff tagger that handles what the others miss.
N-gram model (11)

Combining Taggers
1) Try a BigramTagger
2) Try a UnigramTagger
3) Try a RegexpTagger (can be added before the default)
4) Get everything else with a DefaultTagger
>>> tagger1 = nltk.DefaultTagger('NN')
>>> tagger2 = nltk.UnigramTagger(brown_a, backoff=tagger1)
>>> tagger3 = nltk.BigramTagger(brown_a, backoff=tagger2)

Test using the previous taggers…
N-gram model (12)
>>> tagger1.tag(nltk.word_tokenize(text))
[('esu', 'NN'),
('and', 'NN'),
('gize', 'NN'),
('new', 'NN'),
('yemetaw', 'NN')]
>>> tagger2.tag(nltk.word_tokenize(text))
[('esu', 'NN'),
('and', 'CC'),
('gize', 'NN'),
('new', 'JJ'),
('yemetaw', 'NN')]
>>> tagger3.tag(nltk.word_tokenize(text))
[('esu', 'NN'),
('and', 'CC'),
('gize', 'NN'),
('new', 'JJ'),
('yemetaw', 'NN')]

Compare this result with the independent taggings


Parsing

 Parsing is the process of recognizing and assigning STRUCTURE.
 Parsing a string with a CFG:
 Finding a derivation of the string consistent with the grammar
 The derivation gives us a parse tree
Parsing

 Phrase structure organizes words into nested constituents.
 How do we know what is a constituent?
 Distribution: a constituent behaves as a unit that can appear in different places:
 John talked [to the children] [about drugs].
 John talked [about drugs] [to the children].
 *John talked drugs to the children about
 Substitution/expansion:
 I sat [on the box/right on top of the box/there].
 Coordination, regular internal structure, no intrusion, fragments, semantics, …
CFGs and PCFGs: (Probabilistic) Context-Free Grammars

A phrase structure grammar

S → NP VP
VP → V NP
VP → V NP PP
NP → NP NP
NP → NP PP
NP → N
NP → e
PP → P NP

N → people | fish | tanks | rods
V → people | fish | tanks
P → with
Phrase structure grammars in NLP (CFGs)

 G = (T, N, S, R)
 T is a set of terminal symbols (words/lexicon)
 N is a set of nonterminal symbols (NP, VP, etc.)
 S is the start symbol (S ∈ N)
 R is a set of rules/productions of the form X → γ, where X ∈ N and γ ∈ (N ∪ T)*
 A grammar G generates a language L.
A phrase structure grammar (repeated)

S → NP VP
VP → V NP
VP → V NP PP
NP → NP NP
NP → NP PP
NP → N
NP → e
PP → P NP

N → people | fish | tanks | rods
V → people | fish | tanks
P → with
Probabilistic – or stochastic – context-free grammars (PCFGs)

 G = (T, N, S, R, P)
 T is a set of terminal symbols
 N is a set of nonterminal symbols
 S is the start symbol (S ∈ N)
 R is a set of rules/productions of the form X → γ
 P is a probability function, P: R → [0,1], with ∑_{γ ∈ T*} P(γ) = 1
 A grammar G generates a language model L.
A PCFG

S → NP VP     1.0
VP → V NP     0.6
VP → V NP PP  0.4
NP → NP NP    0.1
NP → NP PP    0.2
NP → N        0.7
PP → P NP     1.0

N → people 0.5   N → fish 0.2   N → tanks 0.2   N → rods 0.1
V → people 0.1   V → fish 0.6   V → tanks 0.3
P → with 1.0
A PCFG Tree

[Figure: an example parse tree annotated with rule probabilities.]
Parsing as Search

 Just as in the case of non-deterministic regular expressions, the main problem with parsing is the existence of CHOICE POINTS.
 There is a need for a SEARCH STRATEGY determining the order in which alternatives are considered.
Top Down vs Bottom Up Searching

 The search has to be guided by the INPUT and the GRAMMAR.
 TOP-DOWN search: the parse tree has to be rooted in the start symbol S
 EXPECTATION-DRIVEN parsing
 BOTTOM-UP search: the parse tree must be an analysis of the input
 DATA-DRIVEN parsing
Example: Top Down

[Figure: a worked top-down parse.]

Example: Bottom-Up

[Figure: a worked bottom-up parse.]
Applications of parsing

 Machine translation (Alshawi 1996, Wu 1997, …)
 [Diagram: English → tree operations → Chinese]
 Speech synthesis from parses (Prevost 1996)
 The government plans to raise income tax.
 The government plans to raise income tax the imagination.
 Speech recognition using parsing (Chelba et al. 1998)
 Put the file in the folder.
 Put the file and the folder.
Applications of parsing

 Grammar checking (Microsoft)
 Indexing for information retrieval (Woods 1997)
 … washing a car with a hose … → vehicle maintenance
 Information extraction (Hobbs 1996)
 [Diagram: NY Times archive → information extraction → database, answering queries]

End of Topic 3
