Professional Documents
Culture Documents
1. Lexical categories
2. Python data structure for storing words
3. Automatically tag each words
Introduction
Words can be grouped into classes, such as
nouns, verbs, adjectives, and adverbs.
classes are known as lexical categories or
parts-of-speech
Parts-of-speech are assigned short labels, or
tags, such as NN and VB.
Tagging
Process of automatically assigning parts-of-speech to words in
text after tokenization
Examples of parts-of-speech : nouns, verbs, adjectives, and
adverbs
Tagging methods: default tagger, regular expression tagger,
unigram tagger, and n-gram taggers
Application: Sequence classification task
1. Lexical Categories
Tagset
Tagset
Default Tagging
Use part of speech (POS) tagger
Step 1: Import nltk
>>> import nltk
Output
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different',
'JJ')]
CC=coordinating conjunction, RB= adverbs, IN=preposition, NN= noun, JJ=adjective
3. Automatic tagging
Automatic tagging
automatically add part-of-speech tags to text
tag of a word depends on the word and its
context within a sentence
Example 1
Step1: Get tag which appear most in corpora
>>> tags = [tag for (word, tag) in brown.tagged_words(categories='news')]
>>> nltk.FreqDist(tags).max()
'NN
Output
[('``', 'NN'), ('Only', 'NN'), ('a', 'NN'), ('relative', 'NN'), ('handful', 'NN'),
('of', 'NN'), ('such', 'NN'), ('reports', 'NNS'), ('was', 'NNS'), ('received',
'VBD'),
("''", 'NN'), (',', 'NN'), ('the', 'NN'), ('jury', 'NN'), ('said', 'NN'), (',', 'NN'),
('``', 'NN'), ('considering', 'VBG'), ('the', 'NN'), ('widespread', 'NN'), ...]