Professional Documents
Culture Documents
Otakar Smrz
Autumn 2005
Yet Another Introduction to Arabic NLP
Arabic Natural Language Processing
Otakar Smrz
Autumn 2005
This is the series of lecture notes to the course on Arabic Natu-
ral Language Processing taught in the Winter Term of 2005 at the
Department of Middle Eastern Studies, Faculty of Philosophy, Uni-
versity of West Bohemia in Pilsen.
We will not do much of the crucial theories behind all that theory
of computation, logic, theoretical linguistics. We will explore how
dealing with the natural language, esp. Arabic, is implemented today
and what other solutions we can expect and contribute to.
Lecture Topics
Otakar Smrz
October 6, 2005
Introduction Unicode and UTF-8 Buckwalter Transliteration ArabTEX Notation Resources
Where to Begin . . .
Where to Begin . . .
Where to Begin . . .
Where to Begin . . .
Bits of Information
.
Unicode
http://www.unicode.org/
Introduction Unicode and UTF-8 Buckwalter Transliteration ArabTEX Notation Resources
Unicode Arabic
The identifiers that are assigned to characters are called code points.
Arabic Ux0600Ux06FF
Arabic Supplement Ux0750Ux077F
Arabic Presentation Forms-A UxFB50UxFDFF
Arabic Presentation Forms-B UxFE70UxFEFF
Encodings of Unicode
Encodings of Unicode
Buckwalter Transliteration
Z d d X d. D
k k
b b H d * X
.t T l l
.
t t
H r r P z. Z
m m
t v
H z z P
E n n
g j h
.
s s g g h h
h. H h s $
f f w w
h x p s. S q q y y
a A p P H
v V W
O c J h
g G }
I
z R
P h p a Y
Introduction Unicode and UTF-8 Buckwalter Transliteration ArabTEX Notation Resources
Buckwalter Transliteration
Z d d X d. D
k k
b b H d * X
.t T l l
.
t t
H r r P z. Z
m m
t v
H z z P
E n n
g j h
.
s s g g h h
h. H h s $
f f w w
h x p s. S q q y y
a A p P H
v V W
O c J h
g G }
I
z R
P h p a Y
Introduction Unicode and UTF-8 Buckwalter Transliteration ArabTEX Notation Resources
Buckwalter Transliteration
Z d d X d. D
k k
b b H d * X
.t T l l
.
t t
H r r P z. Z
m m
t v
H z z P
E n n
g j h
.
s s g g h h
h. H h s $
f f w w
h x p s. S q q y y
a A p P H
v V W
O c J h
g G }
I
z R
P h p a Y
Introduction Unicode and UTF-8 Buckwalter Transliteration ArabTEX Notation Resources
Buckwalter Transliteration
Z d d X d. D
k k
b b H d * X
.t T l l
.
t t
H r r P z. Z
m m
t v
H z z P
E n n
g j h
.
s s g g h h
h. H h s $
f f w w
h x p s. S q q y y
a A p P H
v V W
O c J h
g G }
I
z R
P h p a Y
Introduction Unicode and UTF-8 Buckwalter Transliteration ArabTEX Notation Resources
Buckwalter Transliteration
Q g
C J Y . m ' Q K A P Qk AJ J YK
.
.
. Z Ag B h QK A K DK AK D
. . .
Buckwalter Transliteration
Q g
C J Y . m ' Q K A P Qk AJ J YK
.
.
. Z Ag B h QK AK DK AK D
. . .
Buckwalter Transliteration
Q g
C J Y . m ' Q K A P Qk AJ J YK
.
.
. Z Ag B h QK A K DK AK D
. . .
. Z Ag B h QK AK DK AK D
. . .
Notation of ArabTEX
Z d d X d. .d
k k
b b H d _d X
.t .t l l
.
t t
H r r P z. .z
m m
t _t
H z z P
n n
g ^g h
.
s s g .g h h
h. .h h s ^s
f f w w
h _h p s. .s q q y y
a A p p H
v v
c ^c h
g g
z ^z
P h T a Y
Introduction Unicode and UTF-8 Buckwalter Transliteration ArabTEX Notation Resources
Notation of ArabTEX
Z d d X d. .d
k k
b b H d _d X
.t .t l l
.
t t
H r r P z. .z
m m
t _t
H z z P
n n
g ^g h
.
s s g .g h h
h. .h h s ^s
f f w w
h _h p s. .s q q y y
a A p p H
v v
c ^c h
g g
z ^z
P h T a Y
Introduction Unicode and UTF-8 Buckwalter Transliteration ArabTEX Notation Resources
Notation of ArabTEX
| Z d d X d. .d
k k
b b H d _d X
.t .t l l
.
t t
H r r P z. .z
m m
t _t
H z z P
n n
g ^g h
.
s s g .g h h
h. .h h s ^s
f f w w
h _h p s. .s q q y y
a A p p H
v v w
a c ^c h
g g y
i
z ^z
P h T a Y
Introduction Unicode and UTF-8 Buckwalter Transliteration ArabTEX Notation Resources
Notation of ArabTEX
Q g
C J Y . m ' Q K A P Qk AJ J YK
.
.
. Z Ag B h QK A K DK AK D
. . .
Notation of ArabTEX
Q g
C J Y . m ' Q K A P Qk AJ J YK
.
.
. Z Ag B h QK AK DK AK D
. . .
Notation of ArabTEX
Notation of ArabTEX
Q g
C J Y . m ' Q K A P Qk AJ J YK
.
.
. Z Ag B h QK A K DK AK D
. . .
Q g
C J Y . m ' Q K A P Qk AJ J YK
.
.
. Z Ag B h QK AK DK AK D
. . .
Other Resources
Buckwalter Transliteration
http://www.qamus.org/transliteration.htm
Otakar Smrz
Preliminaries
Expressing Meaning
Writing in LATEX
Note the notation for the markup, esp. \ { }, and the pairing and
possible nesting of \begin{...} and \end{...}.
\begin{document}
\end{document}
Outline Simple Documents Flashcards
Including ArabTEX
\usepackage{arabtex}
\begin{document}
\end{document}
Outline Simple Documents Flashcards
Flashcards!
\begin{flashcard}{official measures}
\RL{i^grAAtuN rasmIyaTuN} \\ \arabfalse\transtrue
\RL{i^grAAtuN rasmIyaTuN}
\end{flashcard}
Funny Morphology and MorphoTrees
Arabic Natural Language Processing
Otakar Smrz
Morphology Disambiguation
Morphology Disambiguation
Functional Approximation
The underlying morphological engine for both the Penn Arabic Tree-
bank and the Prague Arabic Dependency Treebank is the Buckwalter
Arabic Morphological Analyzer. While PATB adopts the analyses in
their original format, the PADT annotations take place on approx-
imations of Functional Arabic Morphology, a novel theoretical and
computational model, organized into MorphoTrees.
Morphology Issues MorphoTrees vs. Lists Using Restrictions Tokenization Models
Functional Approximation
The underlying morphological engine for both the Penn Arabic Tree-
bank and the Prague Arabic Dependency Treebank is the Buckwalter
Arabic Morphological Analyzer. While PATB adopts the analyses in
their original format, the PADT annotations take place on approx-
imations of Functional Arabic Morphology, a novel theoretical and
computational model, organized into MorphoTrees.
With respect to the linguistic view and the architecture of the soft-
ware that we develop, we unify the format of the morphological data
by converting all the Parts of PATB into the approximation, which
is done in two steps: (a) the morphs of the original input strings are
re-grouped to form tokens (b) the corresponding sequences of tags
are mapped into the fixed-width positional notation of PADT.
Morphology Issues MorphoTrees vs. Lists Using Restrictions Tokenization Models
He will notify them about that through SMS messages, the Internet, and
other means. . A
Q
IK
K B
Q
K AQ K Q
YK
.
Q
jJ
.
Morphology Issues MorphoTrees vs. Lists Using Restrictions Tokenization Models
He will notify them about that through SMS messages, the Internet, and
other means. . A
Q
IK
K B
Q
K AQ K Q
YK
.
Q
jJ
.
String Token Token Tag Buckwalters M-Tags Token Form Token Gloss
F--------- FUT sa- will
Q
jJ
.
K Q
A-----FS2D +CASE_DEF_GEN
al-qas.r-at-i the-short
Q
C--------- CONJ wa- and
IK K B
DET+NOUN_PROP+
Z-------2D +CASE_DEF_GEN
al-internet-i the-internet
C--------- CONJ wa- and
A
Q
He will notify them about that through SMS messages, the Internet, and
other means. . A
Q
IK
K B
Q
K AQ K Q
YK
.
Q
jJ
.
String Token Token Tag Buckwalters M-Tags Token Form Token Gloss
F--------- FUT sa- will
Q
jJ
.
K Q
A-----FS2D +CASE_DEF_GEN
al-qas.r-at-i the-short
Q
C--------- CONJ wa- and
IK K B
DET+NOUN_PROP+
Z-------2D +CASE_DEF_GEN
al-internet-i the-internet
C--------- CONJ wa- and
A
Q
He will notify them about that through SMS messages, the Internet, and
other means. . A
Q
IK
K B
Q
K AQ K Q
YK
.
Q
jJ
.
String Token Token Tag Buckwalters M-Tags Token Form Token Gloss
F--------- FUT sa- will
Q
jJ
.
K Q
A-----FS2D +CASE_DEF_GEN
al-qas.r-at-i the-short
Q
C--------- CONJ wa- and
IK K B
DET+NOUN_PROP+
Z-------2D +CASE_DEF_GEN
al-internet-i the-internet
C--------- CONJ wa- and
A
Q
AlY
ala
aly al '
ila
ila
ya
waliya
u u u u u u u u u u
u
u
u
u
Morphology Issues MorphoTrees vs. Lists Using Restrictions Tokenization Models
. . . organize them into a hierarchy with the string as its root and
the full tokens as the leaves
AlY
ala
aly al '
ila
ila
ya
waliya
u u u u u u u u u u
u
u
u
u
Morphology Issues MorphoTrees vs. Lists Using Restrictions Tokenization Models
. . . organize them into a hierarchy with the string as its root and
the full tokens as the leaves, grouped by their lemmas
AlY
ala
aly al '
ila
ila
ya
waliya
u u u u u u u u u u
u
u
u
u
Morphology Issues MorphoTrees vs. Lists Using Restrictions Tokenization Models
. . . organize them into a hierarchy with the string as its root and
the full tokens as the leaves, grouped by their lemmas, canonical
forms
AlY
ala
aly al '
ila
ila
ya
waliya
u u u u u u u u u u
u
u
u
u
Morphology Issues MorphoTrees vs. Lists Using Restrictions Tokenization Models
. . . organize them into a hierarchy with the string as its root and
the full tokens as the leaves, grouped by their lemmas, canonical
forms and partitionings of the string into such forms:
AlY
ala
aly al '
ila
ila
ya
waliya
u u u u u u u u u u
u
u
u
u
Morphology Issues MorphoTrees vs. Lists Using Restrictions Tokenization Models
List of the morphological analyses of the input string AlY with the
V
String Tag A
VISA-1-S--------------------------------
N
P
Morphology Issues MorphoTrees vs. Lists Using Restrictions Tokenization Models
V
String Tag A
N--------RS----1-S2---------------------
N
P
Morphology Issues MorphoTrees vs. Lists Using Restrictions Tokenization Models
VP-A-3MS--------------------------------
Ambiguity Classes A-I- ---1RS 1 S2
NIS 1 2I
P 4
Morphology Issues MorphoTrees vs. Lists Using Restrictions Tokenization Models
V -
Ambiguity Classes AP- 3 1-
N-IA -MS2R- - --
PIS--1--4IS----1-S2---------------------
Morphology Issues MorphoTrees vs. Lists Using Restrictions Tokenization Models
C---------
fhm
f hm fhm
----------
f hm fhm
--------1-
fa and, so
C---------
VP---3MS--
N-------1R
N-------4R
N-------2R
N-------1I
N-------2I
S----3MP1-
hum they
fahima to understand
fa-
hamm-a
hamm-u
hamm-a
hamm-i
hamm-un
hamm-in
hum
fahm understanding
C---------
fhm
f hm fhm
----------
f hm fhm
--------1-
fa and, so
C---------
VP---3MS--
N-------1R
N-------4R
N-------2R
N-------1I
N-------2I
S----3MP1-
hum they
fahima to understand
fa-
hamm-a
hamm-u
hamm-a
hamm-i
hamm-un
hamm-in
hum
fahm understanding
AlY
lY
0 A 1 l 2 Y 3
AlY
Al
AlY Al Y AlY
AlY
lY
0 A 1 l 2 Y 3
AlY
Al
AlY Al Y AlY
AlY
lY
0 A 1 l 2 Y 3
AlY
Al
AlY Al Y AlY
AlY
lY
0 A 1 l 2 Y 3
AlY
Al
AlY Al Y AlY