Ayu Purwarianti
I. INTRODUCTION
II. METHODS
A. Underlying Model
Both first-order (bigram) and second-order (trigram) Hidden Markov Models are used to develop the tagger system. Hidden states of the model represent tags, and the observations represent words.¹

¹ We provide an API (Application Programming Interface) for our tagger. This API is available at http://students.itb.ac.id/home/alfan_fw@students.itb.ac.id/IPOSTAgger or http://www.informatika.org/~ayu/IPOSTAgger
The probability of a tag sequence t_1 … t_n with a word sequence w_1 … w_n is computed in the bigram model as

P(t_1…t_n, w_1…w_n) = Π_{i=2..n} P(t_i | t_{i-1}) × Π_{i=1..n} P(w_i | t_i) .................. (1)

and in the trigram model as

P(t_1…t_n, w_1…w_n) = Π_{i=3..n} P(t_i | t_{i-1}, t_{i-2}) × Π_{i=1..n} P(w_i | t_i) ..... (2)

where P denotes probability. The weights of the linear interpolation used to smooth the transition probabilities satisfy

λ1 + λ2 + λ3 = 1 ..... (3)
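As a sketch of how equation (1) drives decoding, the following is a minimal Viterbi decoder. The two-tag tagset and the transition and emission values are illustrative assumptions, not the paper's trained model, which is estimated from the hand-tagged corpus.

```python
import math

# Toy probabilities (illustrative assumptions only).
TAGS = ["NN", "VBT"]
TRANS = {("<s>", "NN"): 0.5, ("<s>", "VBT"): 0.5,
         ("NN", "NN"): 0.4, ("NN", "VBT"): 0.6,
         ("VBT", "NN"): 0.7, ("VBT", "VBT"): 0.3}
EMIT = {("NN", "mobil"): 0.8, ("NN", "membeli"): 0.2,
        ("VBT", "membeli"): 0.9, ("VBT", "mobil"): 0.1}

def logp(table, key):
    # Log-probability with a small floor for unseen events.
    return math.log(table.get(key, 1e-12))

def viterbi(words):
    """Maximize equation (1): prod P(t_i|t_{i-1}) * prod P(w_i|t_i)."""
    best = {t: (logp(TRANS, ("<s>", t)) + logp(EMIT, (t, words[0])), [t])
            for t in TAGS}
    for w in words[1:]:
        step = {}
        for t in TAGS:
            # Best previous tag for the transition into t.
            p = max(TAGS, key=lambda q: best[q][0] + logp(TRANS, (q, t)))
            score = best[p][0] + logp(TRANS, (p, t)) + logp(EMIT, (t, w))
            step[t] = (score, best[p][1] + [t])
        best = step
    return max(best.values(), key=lambda sp: sp[0])[1]

print(viterbi(["membeli", "mobil"]))  # → ['VBT', 'NN']
```

The same dynamic program extends to equation (2) by keeping the two previous tags in the state.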
G(aS) = F(aS) × (I(S) − I(aS)) ............ (6)

where F(aS) is the frequency of the node for affix aS and I(·) is the information (entropy) of a node's tag distribution.
tag       | VBT | VBI | NN | total
Prefix m  |  75 |  20 |  7 |   102
Prefix me |  75 |  19 |  5 |    99
Prefix mi |   0 |   1 |  2 |     3
I(m) = −(75/102) log2(75/102) − (20/102) log2(20/102) − (7/102) log2(7/102) ≈ 1.0523 .......... (7)
The corresponding values for the child nodes are 0.97 for me and 0.91 for mi. We then determine the weighted information gain at each child node, as in equations (8) and (9).
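The entropy and gain computations above can be reproduced with a short script. The counts are those of the prefix table for m, me, and mi; treating F(aS) as the child node's relative frequency is our assumption about equation (6).

```python
import math

def entropy(counts):
    """Information I(S) of a node's tag distribution, as in equation (7);
    zero counts contribute nothing."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# Tag counts (VBT, VBI, NN) from the prefix table above.
node_m  = [75, 20, 7]   # prefix "m",  102 words
node_me = [75, 19, 5]   # prefix "me",  99 words
node_mi = [0, 1, 2]     # prefix "mi",   3 words

def gain(parent_counts, child_counts):
    """Weighted information gain G(aS) = F(aS) * (I(S) - I(aS)) of
    equation (6), with F(aS) taken as the child's relative frequency
    (our assumption)."""
    f = sum(child_counts) / sum(parent_counts)
    return f * (entropy(parent_counts) - entropy(child_counts))

print(round(entropy(node_m), 4))             # → 1.0523, as in equation (7)
print(entropy(node_me), entropy(node_mi))    # ~0.978 and ~0.918 for me and mi
print(gain(node_m, node_me), gain(node_m, node_mi))
```

Because mi is both rarer and lower-entropy than me here, its weighted gain is much smaller, which is exactly what the frequency factor F(aS) is meant to capture.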
[Figure: Prefix tree]

1. First pass
In the first pass, POS tags of the input words are predicted ordinarily, without using information from the succeeding POS tag. This process uses equation (1) or (2). The affix tree is used to predict the emission probability vector of OOV words.

2. Second pass
In the second pass, the tagging process is repeated from the first input word. This time, the tagger uses information from the succeeding POS tag in the decoding process if the current word is OOV; it does not use the succeeding POS tag if the current word is known. The POS tag of the succeeding word (the succeeding POS tag) is obtained from the first pass.
For an OOV word, the tagger replaces P(t_i | t_{i-1}) with P(t_i | t_{i-1}, t_{i+1}) (or P(t_i | t_{i-1}, t_{i-2}) with P(t_i | t_{i-1}, t_{i-2}, t_{i+1})), where t_{i+1} is obtained from the first pass. This means that the transition probability is conditioned not only on the preceding POS tags but also on the succeeding POS tag. The tagger uses maximum likelihood estimation to determine the value of P(t_i | t_{i-1}, t_{i+1}) or P(t_i | t_{i-1}, t_{i-2}, t_{i+1}).
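The two-pass scheme can be sketched as follows. Here `decode` is a hypothetical callback standing in for the Viterbi decoder (not an API from the paper); only OOV positions receive a succeeding tag in the second pass.

```python
def two_pass_tag(words, decode, known_words):
    """Two-pass tagging sketch.  `decode(words, succeeding)` is assumed to
    run Viterbi decoding; where succeeding[i] is not None, the transition
    for word i may be conditioned on that succeeding tag as well."""
    # Pass 1: ordinary decoding via equation (1) or (2); the affix tree
    # supplies emission probabilities for OOV words.
    first = decode(words, succeeding=[None] * len(words))

    # Pass 2: restart from the first word.  For an OOV word w_i, the tag
    # t_{i+1} predicted in the first pass is supplied so the decoder can
    # use P(t_i | t_{i-1}, t_{i+1}) instead of P(t_i | t_{i-1}).
    succeeding = [first[i + 1]
                  if (i + 1 < len(words) and words[i] not in known_words)
                  else None
                  for i in range(len(words))]
    return decode(words, succeeding=succeeding)

# Demonstration with a stub decoder that tags every word NN:
stub = lambda words, succeeding: ["NN"] * len(words)
print(two_pass_tag(["saya", "qwerty"], stub, known_words={"saya"}))  # → ['NN', 'NN']
```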
D. Lexicon from KBBI-Kateglo
We use a lexicon derived from KBBI-Kateglo [9, 10] to reduce the number of entries in the emission probability vector produced by the affix tree. The main problem is that the corpus and KBBI-Kateglo use different POS tag sets: the POS tag set of the tagged corpus is a specialized form of the POS tag set in KBBI and Kateglo. Therefore, we build a table, called the category table, that maps each POS tag in KBBI-Kateglo to the corresponding POS tags in the corpus. Table 2 shows an example of the category table for some POS tags in KBBI-Kateglo.
TABLE 2. SAMPLE OF CATEGORY TABLE

NO | POS tag in KBBI-Kateglo | POS tag in corpus (specialized)
1  | NN                      | NN, NNP, NNG
2  | CD                      | …
3  | PR                      | …
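The table's role can be illustrated with a small sketch. The function name and the fallback behavior are our assumptions; only the NN row of the mapping is taken from Table 2.

```python
# Sketch of the category table: maps a KBBI-Kateglo POS tag to the set of
# specialized corpus tags it may correspond to (NN row from Table 2;
# illustrative, not exhaustive).
category_table = {
    "NN": {"NN", "NNP", "NNG"},
}

def filter_emission_vector(emission, kbbi_tags):
    """Keep only emission entries whose corpus tag is licensed by the
    word's KBBI-Kateglo categories, reducing the affix tree's candidates."""
    allowed = set().union(*(category_table.get(t, set()) for t in kbbi_tags))
    filtered = {tag: p for tag, p in emission.items() if tag in allowed}
    return filtered or emission  # fall back if the lexicon rules out everything

# An OOV word whose affix tree suggests several tags, but KBBI lists it as NN:
emission = {"NN": 0.4, "VBT": 0.35, "NNP": 0.15, "JJ": 0.10}
print(filter_emission_vector(emission, ["NN"]))  # → {'NN': 0.4, 'NNP': 0.15}
```

The fallback on an empty intersection keeps the tagger robust when the lexicon and the affix tree disagree completely.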
III. EXPERIMENTS
A. Experimental Data
Experiments were performed using a small hand-tagged Indonesian corpus annotated with 35 POS tags (Table 3). The tagset is a modification of the tagset used in [23, 7]. The corpus was developed as a correction and modification of the PAN Localization corpus for Bahasa Indonesia [23, 15]. Our small tagged corpus was divided into two sub-corpora: one for training (~12,000 words) and one for testing (~3,000 words). We ran experiments on three different testing corpora (each containing ~3,000 words): the first contains 30% unknown words, the second 21%, and the last 15%.
TABLE 3. POS TAGSET

NO | POS | POS Name            | Example
1  | OP  | Open Parenthesis    | ( { [
2  | CP  | Close Parenthesis   | ) } ]
3  | GM  | Slash               | /
4  | ;   | Semicolon           | ;
5  | :   | Colon               | :
6  | "   | Quotation           | "
7  | .   | Sentence Terminator | . ! ?
8  | ,   | Comma               | ,
9  | -   | Dash                | -
10 | ... | Ellipsis            | ...
11 | JJ  | Adjective           | Kaya, Manis
12 | RB  | Adverb              | Sementara, Nanti
13 | NN  | Common Noun         | Mobil
14 | NNP | Proper Noun         | Bekasi, Indonesia
15 | NNG | Genitive Noun       | Bukunya
16 | VBI | Intransitive Verb   | Pergi
17 | VBT | Transitive Verb     | Membeli
18 | IN  | Preposition         | Di, Ke, Dari
19 | MD  | Modal               | Bisa
20 | CC  | Coor-Conjunction    | Dan, Atau, Tetapi
21 | SC  | Subor-Conjunction   | Jika, Ketika
22 | DT  | Determiner          | Para, Ini, Itu
23 | UH  | Interjection        | Wah, Aduh, Oi
24 | CDO | Ordinal Numerals    | Pertama, Kedua
25 | CDC | Collective Numerals | Bertiga
26 | CDP | Primary Numerals    | Satu, Dua
27 | CDI | Irregular Numerals  | Beberapa
28 | PRP | Personal Pronouns   | Saya, Kamu
29 | WP  | WH-Pronouns         | Apa, Siapa
30 | PRN | Number Pronouns     | Kedua-duanya
31 | PRL | Locative Pronouns   | Sini, Situ, Sana
32 | NEG | Negation            | Bukan, Tidak
33 | SYM | Symbols             | @ # $ % ^ &
34 | RP  | Particles           | Pun, Kah
35 | FW  | Foreign Words       | Foreign, Word
B. Experimental Results
Tables 4, 5, and 6 show the performance of the POS tagger on the three testing corpora. Performance is reported as Overall Accuracy (Known Word Accuracy / Unknown Word Accuracy).
TABLE 4. RESULTS ON THE TESTING CORPUS WITH 15% UNKNOWN WORDS

NO | Configuration          | PREFIX                 | …
1  | Bigram (baseline)      | 95.67% (99.43%/75.00%) | 95.36% (99.43%/72.97%)
2  | Trigram (baseline)     | 95.29% (99.18%/73.87%) | 95.01% (99.22%/71.85%)
3  | Bigram+succeeding POS  | 95.57% (99.43%/74.32%) | 95.36% (99.43%/72.97%)
4  | Trigram+succeeding POS | 94.94% (99.02%/72.52%) | 94.70% (99.02%/70.95%)
5  | Bigram+Lexicon         | 96.30% (99.43%/79.05%) | 96.23% (99.43%/78.60%)
6  | Trigram+Lexicon        | 95.98% (99.18%/78.38%) | 95.95% (99.26%/77.70%)
7  | Bigram+succ+Lexicon    | 96.36% (99.43%/79.50%) | 96.50% (99.43%/80.41%)
8  | Trigram+succ+Lexicon   | 95.78% (99.02%/77.93%) | 95.91% (99.06%/78.60%)
TABLE 5. RESULTS ON THE TESTING CORPUS WITH 21% UNKNOWN WORDS

NO | Configuration          | PREFIX                 | …                      | …
1  | Bigram (baseline)      | 93.27% (99.18%/71.64%) | 92.40% (99.15%/67.71%) | 93.01% (99.18%/70.42%)
2  | Trigram (baseline)     | 93.12% (99.29%/70.56%) | 92.31% (99.07%/67.57%) | 93.12% (99.22%/70.83%)
3  | Bigram+succeeding POS  | 93.41% (99.22%/72.18%) | 92.54% (99.15%/68.39%) | 93.36% (99.18%/72.05%)
4  | Trigram+succeeding POS | 93.04% (99.18%/70.56%) | 92.22% (99.15%/66.89%) | 93.07% (99.18%/70.69%)
5  | Bigram+Lexicon         | 93.82% (99.18%/74.22%) | 93.27% (99.18%/71.64%) | 93.97% (99.18%/74.90%)
6  | Trigram+Lexicon        | 93.59% (99.26%/72.86%) | 93.33% (99.18%/71.91%) | 94.00% (99.22%/74.90%)
7  | Bigram+succ+Lexicon    | 93.97% (99.18%/74.90%) | 93.56% (99.18%/73.00%) | 94.46% (99.18%/77.20%)
8  | Trigram+succ+Lexicon   | 93.59% (99.15%/73.27%) | 93.24% (99.11%/71.78%) | 94.06% (99.07%/75.71%)
TABLE 6. RESULTS ON THE TESTING CORPUS WITH 30% UNKNOWN WORDS

NO | Configuration          | PREFIX                 | …                      | …
1  | Bigram (baseline)      | 89.32% (99.05%/67.01%) | 88.98% (99.05%/65.88%) | 89.84% (99.10%/68.61%)
2  | Trigram (baseline)     | 89.32% (99.01%/67.11%) | 89.06% (98.97%/66.35%) | 89.78% (99.01%/68.61%)
3  | Bigram+succeeding POS  | 88.86% (99.01%/65.60%) | 88.35% (99.05%/63.81%) | 89.55% (99.10%/67.67%)
4  | Trigram+succeeding POS | 88.66% (98.89%/65.22%) | 88.09% (98.85%/63.43%) | 88.92% (98.97%/65.88%)
5  | Bigram+Lexicon         | 90.32% (99.01%/70.41%) | 90.15% (99.05%/69.75%) | 91.18% (99.01%/73.23%)
6  | Trigram+Lexicon        | 90.32% (99.05%/70.31%) | 90.12% (99.01%/69.75%) | 91.30% (99.05%/73.52%)
7  | Bigram+succ+Lexicon    | 90.04% (98.93%/69.65%) | 89.81% (99.05%/68.61%) | 91.01% (99.01%/72.67%)
8  | Trigram+succ+Lexicon   | 90.15% (98.85%/70.22%) | 89.87% (98.97%/68.99%) | 90.78% (98.93%/72.10%)
IV. CONCLUSIONS

ACKNOWLEDGMENT

REFERENCES