
Design of a Lexical Database for Sanskrit

Gérard Huet
INRIA-Rocquencourt
BP 105, 78153 Le Chesnay CEDEX
France
Gerard.Huet@inria.fr

Abstract

We present the architectural design rationale of a Sanskrit computational linguistics platform, where the lexical database has a central role. We explain the structuring requirements that arise from the interlinking of grammatical tools through its hypertext rendition.

1 Introduction

Electronic dictionaries come in two distinct flavours: digital sources of dictionaries and encyclopedias, meant for human usage, and lexical databases, developed for computational linguistics needs. There is little interaction between the two forms, mostly for sociological reasons. We shall argue, in this communication, that a lexical database may be used for both purposes, to mutual advantage. We base our thesis on a concrete experiment in the design of linguistic resources for the Sanskrit language.

2 From book form to web site

2.1 A Sanskrit-French paper dictionary

The author started a Sanskrit to French dictionary from scratch in 1994, first as a personal project in indology, then as a more structured attempt at covering Sanskrit elementary vocabulary. A systematic policy was enforced along a number of successive invariants. For instance, etymology, when known, was followed recursively through relevant entries. Any word could then be broken into morphological constituents, down to verbal roots when known. This “etymological” completeness requirement was at first rather tedious, since entering a new word may require the acquisition of many ancestors, due to complex compounding. But it appeared that the acquisition of new roots slowed down considerably after an initial “bootstrap” phase. When the number of entries approached 10000, with 520 roots, the acquisition of new roots became quite exceptional. This phenomenon is similar to the classical “lexical saturation” effect witnessed when one builds a lexicon covering a given corpus (Polguère, 2003). Progressively, a number of other consistency constraints were identified and systematically enforced, which proved invaluable in the long run.

At this point the source of the lexicon was a plain ASCII file in the LaTeX format. However, a strict policy of systematic use of macros, not only for the structure of the grammatical information and for the polysemy representation, but also for internal quotations, ensured that the document had a strict logical structure, mechanically retrievable (and thus considerably easier to process without loss of information than an optically scanned paper dictionary (Ma et al., 2003)).

Indeed, around 2000, when the author became interested in adapting the data as a lexical database for linguistic processing, he was able without too much trouble to reverse engineer the dictionary into a structured lexical database (Huet, 2000; Huet, 2001). He then set to work to design a set of tools for further computer processing, as an experiment in the use of functional programming for a computational linguistics platform.

The first design decision was to avoid standard databases, for several reasons. The first one is portability. Many database formats are proprietary or specific to a particular product. The second reason is that the functionalities of database systems, such as query languages, are not well adapted to the management of lexical information, which is highly structured in a deep manner - in a nutshell, functional rather than predicative. Thirdly, it seemed best to keep the information in the concrete format in which it had been developed so far, with specific text editing tools, and various levels of annotation which could remain with the status of unanalysed comments, pending their possible later structuring. After all, ASCII is the most
portable format, large text files are no longer an issue, parsing technology is fast enough to make compilation times negligible, and the human ease of editing is the really crucial factor – any tool which the lexicographer has to fight to organise his data is counter-productive.

A detailed description of this abstract syntax is available as a research report (Huet, 2000), and will not be repeated here. We shall just point to salient points of this abstract structure when needed.

2.2 Grinding the abstract structure

The main tool used to extract information from this database is called the grinder (named after the corresponding core processor in the WordNet effort (Miller, 1990; Fellbaum, 1998)). The grinder is a parsing process, which recognizes successive items in the database, represents them as an abstract syntax value, and for each such item calls a process given as argument to the grinder. In other words, grind is a parametric module, or functor in ML terminology. Here is the exact module type, grind.mli, in the syntax of our pidgin ML:

    module Grind : functor
      (Process : Proc.Process_signature)
      -> sig end;

with interface module Proc specifying the expected signature of a Process:

    module type Process_signature = sig
      value process_header :
        (Sanskrit.skt * Sanskrit.skt) -> unit;
      value process_entry :
        Dictionary.entry -> unit;
      value prelude : unit -> unit;
      value postlude : unit -> unit;
    end;

That is, there are two sorts of items in the database, namely headers and entries. The grinder will start by calling the process prelude, will process every header with routine process_header and every entry with routine process_entry, and will conclude by calling the process postlude. Module interface Dictionary describes the dictionary data structures used for representing entries (i.e. its abstract syntax as a set of ML datatypes), whereas module Sanskrit holds the private representation structures of Sanskrit words (seen from Dictionary as an abstract type, ensuring that only the Sanskrit lexical analyser may construct values of type skt).

A typical process is the printing process Print_dict, itself a functor. Here is its interface:

    module Print_dict : functor
      (Printer : Print.Printer_signature)
      -> Proc.Process_signature;

It takes as argument a Printer module, which specifies low-level printing primitives for a given medium, and defines the printing of entries as a generic recursion over their abstract syntax. Thus we may define the typesetting primitives to generate a TeX source in module Print_tex, and obtain a TeX processor by a simple instantiation:

    module Process_tex = Print_dict Print_tex;

Similarly, we may define the primitives to generate the HTML sources of a Web site, and obtain an HTML processor Process_html as:

    module Process_html = Print_dict Print_html;

It is very satisfying indeed to have such sharing in the tools that build two ultimately very different objects, a book with professional typographical quality on one hand, and a Web site fit for hypertext navigation and search, as well as grammatical query, on the other hand¹.

¹ http://pauillac.inria.fr/~huet/SKT/

2.3 Structure of entries

Entries are of two kinds: cross-references and proper lexemes. Cross-references are used to list alternative spellings of words and some irregular but commonly occurring flexed forms (typically pronoun declensions). Lexeme entries consist of three components: syntax, usage, and an optional list of cognate etymologies in other Indo-European languages.

The syntax component itself consists of three sub-components: a heading, a list of variants, and an optional etymology. The heading spells the main stem (in our case, the so-called weak stem), together with a hierarchical level. At the top of the hierarchy, we find root verbs, non-compound nouns, suffixes, and occasional declined forms which do not reduce to just a cross-reference, but carry some usage information. Then we have subwords, and subsubwords, which may be derived forms obtained
by prefixing or suffixing their parent stem, or compound nouns.

Other subordinate entries are idiomatic locutions and citations. Thus we have a total of ten sorts of entries, classified into three hierarchical levels (to give a comparison, the much more exhaustive Monier-Williams Sanskrit-to-English dictionary has 4 hierarchical levels).

Let us now explain the structure of the usage component of our entries. We actually have three kinds of usage structure, one corresponding to nouns (substantives and adjectives), another one corresponding to verbs, and still another one for idiomatic locutions. We shall now describe the usage component of substantives, the one for verbs being not very different in spirit, and the one for idioms being a mere simplification of it.

The usage structure of a substantive entry is a list of meanings, where a meaning consists of a grammatical role and a sense component. A role is itself the notation for a part-of-speech tag and an optional positional indication (such as ‘enclitic’ for postfix particles, or ‘iic’ [in initio compositi] for prefix components). The part-of-speech tag is typically a gender (meaning substantive or adjective of this gender), or a pronominal or numeral classifier, or an undeclinable adverbial role, sometimes corresponding to a certain declension of the entry. The thematic role ‘agent’ is also available as tag, typically for nouns which may be used in the masculine or the feminine, but not in the neuter. This results in a fairly flexible concrete syntax at the disposal of the lexicographer, put into a rigid but rigorous structure for computational use by the database processors.

The sense component is itself a list of elementary semantic items (representing polysemy), each possibly decorated by a list of spelling variants. Elementary semantic items consist in their turn of an explanation labeled with a classifier. The classifier is either ‘Sem’, in which case the explanation is to be interpreted as a substitutive definition, or else it is a field label in some encyclopedic classification, such as ‘Myth’ for mythological entries, ‘Phil’ for philosophical entries, etc., in which case the explanation is merely a gloss in natural language. In every case the explanation component has type sentence, meaning in our case French sentence, since it applies to a Sanskrit-to-French bilingual dictionary, but here it is worth giving a few additional comments.

We remark that French is solely used as a semantic formalism, deep at the leaves of our entries. Thus there is a clear separation between a superstructure of the lexical database, which depends only on a generic dictionary structure and on the specific structure of the Sanskrit language, and terminal semantic values, which in our case point to French sentences, but could as well point to some other language within a multilingual context, or to a WordNet-like (Fellbaum, 1998) pivot structure.

The strings denoting Sanskrit references are treated in a special way, since they determine the hypertext links in the HTML version of the dictionary. There are two kinds of possible references: proper nouns starting with an upper case letter, and common nouns or other Sanskrit words. For both categories, we distinguish binding occurrences, which construct HTML anchors, and used occurrences, which construct the corresponding references. In order to discuss these notions more precisely, we need to consider the general notion of scoping. But before discussing this important notion, we need a little digression about homonymy.

2.4 Homonyms

First of all, there is no distinction in Sanskrit between homophones and homographs, since the written form reflects phonetics exactly. As in any language however, there are homonyms, which are words of different origin and unrelated meanings, but which happen to have the same representation as a string of phonemes (vocable). They may or may not have the same grammatical role. For such clearly unrelated words, we use the traditional solution of distinguishing the two entries by numbering them, in our case with a subscript index. Thus we distinguish entry aja₁ ‘he goat’, derived from root aj ‘to lead’, from entry aja₂ ‘unborn’, derived by privative prefix a from root jan ‘to be born’.

Actually, primary derived words, such as substantival root forms, are distinguished from the root itself, mostly for convenience reasons (the usage structure of verbs being superficially different from the one of substantives). Thus the root diś₁ ‘to show’ is distinguished from the substantive diś₂ ‘direction’, and root jñā₁ ‘to know’ is distinct from the feminine substantive jñā₂ ‘knowledge’.

2.5 Scoping

There are two notions of scoping, one global, and the other one local. First, every reference ought to point to some binding occurrence, somewhere in the database, so that a click on any used occurrence in the hypertext document ought to result in a successful context switch to the appropriate link. Ideally this link ought to be made to a unique binding occurrence. Such binding occurrences may be explicit in the document; typically, for proper nouns, this corresponds to a specific semantic item, which explains their denotation as the name of some human or mythological figure or geographical entity. For common nouns, the binding occurrence is usually implicit in the structure of the dictionary, corresponding either to the main stem of the entry, or to some auxiliary stem or flexed form listed as an orthographic variant. In this sense a binding occurrence has as scope the full dictionary, since it may be referred to from anywhere. In another sense it is itself within the scope of a specific entry, the one in which it appears as a stem or flexed form or proper name definition, and this entry is itself physically represented within one HTML document, to be loaded and indexed when the reference is activated. In order to resolve such references, the grinder builds a lexical tree (trie) of all binding occurrences in the database, kept in permanent storage. A cross-reference analysis checks that each used occurrence is bound somewhere.

Actually, things are still a bit more elaborate, since each stem is not only bound lexicographically in some precise entry of the lexicon, but is also within the scope of some grammatical role which uniquely determines its declension paradigm. Let us explain this by way of a representative example. Consider the following typical entry:

    कुमार kumāra m. garçon, jeune homme; fils | prince; page; cavalier | myth. np. de Kumāra ‘Prince’, épith. de Skanda — n. or pur — f. kumārī adolescente, jeune fille, vierge.

There are actually four binding occurrences in this entry. The stem kumāra is bound initially with masculine gender for the meaning ‘boy’, and rebound with neuter gender for the meaning ‘gold’. The stem kumārī is bound with feminine gender for the meaning ‘girl’. Finally the proper name Kumāra is bound in the mythological sememe, the text of which contains an explicit reference to proper name Skanda.

3 The grammatical engine

We are now ready to understand the second stage in our Sanskrit linguistic tools, namely the grammatical engine. This engine allows the computation of inflected forms, that is declensions of nouns and finite conjugated forms of verbs. For nouns, we observe that in Sanskrit, declension paradigms are determined by a suffix of the stem and its grammatical gender. Since we just indicated that all defined occurrences of substantive stems occurring in the dictionary are in the scope of a gender declaration, this means that we can compute all inflected forms of the words in the lexicon by iterating a grammatical engine which knows how to decline a stem, given its gender.

Similarly, for verbs, conjugation paradigms for the present system fall into 10 classes (and the aorist system has 7 classes). Every root entry mentions explicitly its (possibly many) present and aorist classes.

3.1 Sandhi

Given a stem and its gender, standard grammar paradigm tables give a suffix for each number and case. Glueing the suffix to the stem is computed by a phonetic euphony process known as sandhi (meaning ‘junction’ in Sanskrit). Actually there are two sandhi processes. One, called external sandhi, is a regular homomorphism operating on the two strings representing two contiguous words in the stream of speech. The end of the first string is modified according to the beginning of the second one, by a local euphony process. Since Sanskrit takes phonetics seriously, this euphony occurs not just orally, but in writing as well. This external sandhi is relevant to contiguous words, and to compound formation.

A more complex transformation, called internal sandhi, occurs for words derived by affixes and thus in particular for inflected forms in declension and conjugation. The two composed strings influence each other in a complex process which may affect non-local phonemes. Thus prefixing ni (down) to root sad (to sit) makes verb niṣad (to sit down) by retroflexion of s after i, and further suffixing it with na to form its past participle makes niṣaṇṇa (seated) by assimilation of d with n and further retroflexion of both occurrences of n.

While this process remains deterministic (except for occasional cases where some phonetic rules are optional), and thus is easily programmable for the synthesis of inflected forms, the analysis of such derivations is non-deterministic in a more complex way than the simple external sandhi, since it involves a complex cascading of rewrites.

3.2 Declensions

Using internal sandhi, systematic declension tables drive the declension engine. Here too the task is not trivial, given the large number of cases and exceptions. At present our nominal grammatical engine, deemed sufficient for the corpus of classical Sanskrit (that is, not attempting the treatment of complex Vedic forms), operates with no less than 86 tables (each describing 24 combinations of 8 cases and 3 numbers). This engine may generate all declensions of substantives, adjectives, pronouns and numerals. It is to be remarked that this grammatical engine, available as a stand-alone executable, is to a large extent independent of the lexicon, and thus may be used to give the declension of words belonging to a larger corpus. However, the only correctness claimed is that the words actually appearing in the lexicon get their correct declension patterns, including exceptions.

This grammatical engine is accessible online from the hypertext version of the lexicon, since the abstract structure of the lexicon ensures not only that every defined stem occurs within the range of a gender declaration, but conversely that every gender declaration is within the range of some defined stem. Thus we made the gender declarations (of non-compound entries) themselves mouse-sensitive, linked to the proper instantiation of the grammatical CGI program. Thus one may navigate with a Web browser not only within the dictionary as a hypertext document (thus jumping in the example above from the definition of Kumāra to the entry where the name Skanda is defined, and conversely), but also from the dictionary to the grammar, obtaining all relevant inflected forms.

Similarly for roots, the present class indicator is mouse-sensitive, and yields on demand the corresponding conjugation tables. This underlines a general requirement for the grammatical tools: each such process ought to be callable from a concrete point in the text, corresponding unambiguously to a node in the abstract syntax of the corresponding entry, with a scoping structure of the lexicon such that from this node all the relevant parameters may be computed unambiguously.

In order to compute conjugated forms of non-root verbs, the list of their relevant preverbs is available, each preverb being a link to the appropriate entry (from which the etymological link provides the return pointer). Other derived stems (causative, intensive and desiderative forms) also act as morphology generators.

3.3 Inflected forms management

One special pass of the grinder generates the trie of all declensions of the stems appearing in the dictionary. This trie may itself be pretty-printed as a document describing all such inflected forms. At present this represents about 2000 pages of double-column fine print, for a total of around 200 000 forms of 8200 stems (133655 noun forms and 55568 root finite verbal forms).

3.4 Index management

Another CGI auxiliary process is the index. It searches for a given string (in transliterated notation), first in the trie of defined stems, and if not found in the trie of all declined forms. It then proposes a dictionary entry, either the found stem (the closest stem the given string is an initial prefix of) or the stem (or stems) whose declension is the given string, or if both searches fail the closest entry in the lexicon in alphabetical order. This scheme is very effective, and the answer is given instantaneously.

An auxiliary search engine searches Sanskrit words with a naive transcription, without diacritics. Thus a request for panini will return the proper link to pāṇini.

3.5 Lemmatization

The basic data structures and algorithms developed in this Sanskrit processor have actually been abstracted as a generic Zen toolkit, available as free software (Huet, 2002; Huet, 2003b; Huet, 2003d).

One important data structure is the revmap, which allows inflected forms to be stored as an invertible morphological map from stems, with minimal storage. The Sanskrit platform uses this format to store its inflected forms in such a way that it may directly be used as a lemmatizer. Each form is tagged with a list of pairs (stem, features), where features gives all the morphological features used in the derivation of the form from root stem.

A lemmatization procedure, available as a CGI executable, searches this structure. For instance, for form devayos it lists:

    { loc. du. m. | gen. du. m. |
      loc. du. n. | gen. du. n. }[deva]
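As a rough mental model of what the lemmatizer computes, the sketch below indexes inflected forms by plain string and returns their (stem, features) tags. It is written in standard OCaml rather than the pidgin ML used above, and it is only an illustration: the names `tag`, `add_form`, `lemmatize` and `sample_index` are hypothetical, and an ordinary string map stands in for the revmap, whose compact representation in the Zen toolkit is not reproduced here.

```ocaml
(* Hypothetical sketch: a lemma index as a plain string map.
   This is NOT the Zen toolkit's revmap; it only mimics the
   form -> (stem, features) lookup described in the text. *)
module StringMap = Map.Make (String)

(* One analysis of a form: its stem and the morphological features used. *)
type tag = { stem : string; features : string list }

type lemma_index = tag list StringMap.t

(* Record that [form] derives from [stem] with the given [features]. *)
let add_form (form : string) (stem : string) (features : string list)
    (index : lemma_index) : lemma_index =
  let previous =
    match StringMap.find_opt form index with
    | Some tags -> tags
    | None -> []
  in
  StringMap.add form ({ stem; features } :: previous) index

(* All analyses of [form], in insertion order; [] when the form is unknown. *)
let lemmatize (index : lemma_index) (form : string) : tag list =
  match StringMap.find_opt form index with
  | Some tags -> List.rev tags
  | None -> []

(* Toy data: the devayos analysis quoted above, plus one verbal form. *)
let sample_index : lemma_index =
  StringMap.empty
  |> add_form "devayos" "deva"
       [ "loc. du. m."; "gen. du. m."; "loc. du. n."; "gen. du. n." ]
  |> add_form "pibati" "paa_1" [ "pr. a. sg. 3" ]

(* Print each analysis in the same shape as the listing above. *)
let () =
  lemmatize sample_index "devayos"
  |> List.iter (fun t ->
         Printf.printf "{ %s }[%s]\n" (String.concat " | " t.features) t.stem)
```

A form with several homonymous analyses would simply carry several tags in its list, which is what makes the resulting tagger non-deterministic.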
In the devayos listing, the stem deva is a hyperlink to the corresponding entry in the lexicon. Similarly for verbal forms. For pibati it lists: { pr. a. sg. 3 }[paa_1], indicating that it is the 3rd person singular present form of root pā₁ in the active voice.

We end this section by remarking that we did not attempt to automate derivational morphology, although some of it is fairly regular. Actually, compound formation is treated at the level of segmentation, since classical Sanskrit does not impose any bound on its recursion depth. Verb formation (which sequences of preverbs are allowed to prefix which root) is explicit in the dictionary structure, but it is also treated at the level of the segmentation algorithm, since this affix glueing obeys external sandhi and not internal sandhi, a peculiarity which may follow from the historical development of the language (preverbs derive from postpositions). At present, noun derivatives from verbal roots are explicit in the dictionary rather than being computed, but we envision making systematic, in some future edition, the derivation of participles, absolutives, infinitives, and periphrastic future and perfect.

4 Syntactic analysis

4.1 Segmentation and tagging

The segmenter takes a Sanskrit input as a stream of phonemes and returns a stream of solutions, where a solution is a list of (inflected) words and sandhi rules such that the input is obtainable by applying the sandhi rules to the successive pairs of words. It is presented, and its completeness is proved, in (Huet, 2004). Further details on Sanskrit segmentation are given in (Huet, 2003a; Huet, 2003c).

Combined with the lemmatizer, we thus obtain a (non-deterministic) tagger which returns all the (shallow) parses of an input sentence. Here is an easy example:

    # process "maarjaarodugdha.mpibati";

    Solution 1 :
    [ maarjaaras
      < { nom. sg. m. }[maarjaara] >
      with sandhi as|d -> od]
    [ dugdham
      < { acc. sg. m. | acc. sg. n. |
          nom. sg. n. }[dugdha] >
      with sandhi m|p -> .mp]
    [ pibati
      < { pr. a. sg. 3 }[paa#1] >
      with sandhi identity]

This output states that the sentence mārjārodugdhaṃ pibati (a cat drinks milk) has one possible segmentation, where maarjaras, nominative singular masculine of maarjara (and here the stem is a hyperlink to the entry in the lexicon glossing it as chat, i.e. cat) combines by external sandhi with the following word by rewriting into maarjaro, followed by dugdham which is the accusative singular masculine of dugdha (draught) or the accusative or nominative singular neuter of dugdha (milk - same vocable), which combines by external sandhi with the following word by rewriting into its nasalisation dugdhaṃ, followed by pibati ... (drinks).

4.2 Applications to philology

We are now at the stage where, after proper training of the tagger to curb its overgeneration, we shall be able to use it for scanning a simple corpus (i.e. a corpus built over the stem forms encompassed in the lexicon). The first level of interpretation of a Sanskrit text is its word-to-word segmentation, and our tagger will be able to assist a philology specialist to achieve complete morphological mark-up systematically. This will allow the development of concordance analysis tools recognizing morphological variants, a task which up to now has had to be performed manually.

At some point in the future, one may hope to develop for Sanskrit the same kind of informative repository that the Perseus web site provides for Latin and Classical Greek². Such resources are invaluable for the preservation of the cultural heritage of humanity. The considerable classical Sanskrit corpus, rich in philosophical texts but also in scientific, linguistic and medical knowledge, is an important challenge for computational linguistics.

Another kind of envisioned application is the mechanical preparation of students’ readers analysing a text at various levels of information, in the manner of Peter Scharf’s Sanskrit Reader³.

² http://www.perseus.tufts.edu/
³ http://cgi-user.brown.edu/Departments/Classics/Faculty/Scharf/

The next stage of analysis will group together tagged items, so as to fulfill constraints of subcategorization (accessible from the lexicon) and
agreement. The result ought to be a set of consistent dependency structures. We are currently working, in collaboration with Brendan Gillon, on the design of an abstract representation for Sanskrit syntax making explicit dislocations and anaphora antecedents, with the goal of building a consistent tree bank from his work on the analysis of the examples from Apte’s manual (Apte, 1885; Gillon, 1996).

An interesting piece of design is the interface between lexicon citations and the corpus. An intermediate structure is a virtual library, acting as a skeleton of the corpus used for indexation. This way citations in the lexicon are mere pointers in the virtual library, which acts as a citations repository, but also possibly as a citation server proxy to the actual corpus material when it is available as marked-up text. For lack of space, we omit this material here.

5 Conclusions

The computational linguistic tools should be modular, with an open-ended structure, and their evolution should proceed in a breadth-first manner, encompassing all aspects from phonetics to morphology to syntax to semantics to pragmatics to corpus acquisition, with the lexical database as a core switching structure. Proper tools have to be built, so that the analytic structure is confronted with the linguistic facts, and evolves through experimentally verifiable improvements. The interlinking of the lexicon, the grammatical tools and the marked-up corpus is essential to distill all linguistic information, so that it is explicit in the lexicon, while encoded in the minimal way which makes it non-redundant.

We have argued in this article that the design of a hypertext interface is useful to refine the structure of the lexicon in such a way as to enforce these requirements. However, such a linguistic platform must carefully distinguish between the external exchange formats (XML, Unicode) and the internal logical structure, where proper computational structures (inductive data types, parametric modules, powerful finite-state algorithms) may enforce the consistency invariants.

References

Vāman Shivarām Apte. 1885. The Student’s Guide to Sanskrit Composition. A Treatise on Sanskrit Syntax for Use of Schools and Colleges. Lokasamgraha Press, Poona, India.

Christiane Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press.

Brendan S. Gillon. 1996. Word order in classical Sanskrit. Indian Linguistics, 57,1:1–35.

Gérard Huet. 2000. Structure of a Sanskrit dictionary. Technical report, INRIA. http://pauillac.inria.fr/~huet/PUBLIC/Dicostruct.ps

Gérard Huet. 2001. From an informal textual lexicon to a well-structured lexical database: An experiment in data reverse engineering. In Working Conference on Reverse Engineering (WCRE’2001). IEEE.

Gérard Huet. 2002. The Zen computational linguistics toolkit. Technical report, ESSLLI Course Notes. http://pauillac.inria.fr/~huet/ZEN/zen.pdf

Gérard Huet. 2003a. Lexicon-directed segmentation and tagging of Sanskrit. In XIIth World Sanskrit Conference, Helsinki.

Gérard Huet. 2003b. Linear contexts and the sharing functor: Techniques for symbolic computation. In Fairouz Kamareddine, editor, Thirty Five Years of Automating Mathematics. Kluwer.

Gérard Huet. 2003c. Towards computational processing of Sanskrit. In International Conference on Natural Language Processing (ICON), Mysore, Karnataka.

Gérard Huet. 2003d. Zen and the art of symbolic computing: Light and fast applicative algorithms for computational linguistics. In Practical Aspects of Declarative Languages (PADL) symposium. http://pauillac.inria.fr/~huet/PUBLIC/padl.pdf

Gérard Huet. 2004. A functional toolkit for morphological and phonological processing, application to a Sanskrit tagger. Journal of Functional Programming, to appear. http://pauillac.inria.fr/~huet/PUBLIC/tagger.pdf

Huanfeng Ma, Burcu Karagol-Ayan, David Doermann, Doug Oard, and Jianqiang Wang. 2003. Parsing and tagging of bilingual dictionaries. Traitement Automatique des Langues, 44,2:125–149.

G. A. Miller. 1990. Wordnet: a lexical database for English. International Journal of Lexicography, 3,4.

Alain Polguère. 2003. Lexicologie et sémantique lexicale. Presses de l’Université de Montréal.
