
An Efficient Rule-Based System for Morphological Parsing of Tamil Language


தமிழ் உருபனியல் ஆய்வு (Tamil Morphological Analysis)

Final Semester Project


Department of Computer Science and Engineering
National Institute of Technology, Tiruchirappalli

May 2010

STUDENTS:
Karthik S 106106029
Praveen Kumar 106106045
Venkataraman GB 106106073

GUIDE:
Dr. V. Gopalakrishnan
Agenda
 Overview of the Project
 NLP Applications – The Stakeholders
 The problem at hand
 The proposed solution
◦ Rule-Based Morphological Analysis
◦ Machine Learning
 Where does it all fit in?
 Need for Tamil Morphological Analysis
 Resources Obtained
 Implementation Details
 Demonstration
 Future Scope

1 WHO WHAT WHERE WHY HOW


12/08/2021 National Institute of Technology, Tiruchirappalli
Overview of the Project
 Natural Language Processing
 Morphological Analysis
 Tamil Language

Morphing …
… And in Tamil: நடப்பான், நடக்கின்றான், நடக்கின்றாள், நடந்தான், நடந்தனர்

NLP Applications – The Stakeholders

WHO ARE THE STAKEHOLDERS?


Natural Language Processing Applications like:
 Stemming
 Machine Translation
 Speech Recognition
 Information Retrieval

WHY ARE THESE APPLICATIONS THE STAKEHOLDERS?

The problem at hand
Morphological Analysis of Tamil involves understanding the word structure and its
inflections
AGGLUTINATION IN TAMIL
 Agglutination is the morphological process of adding affixes to the base of a word
 A typical Tamil verb form carries a number of suffixes marking person, number,
mood, tense and voice.
INFLECTIONS IN TAMIL

 திணை - Class
 பால் - Gender
 எண் - Number
 இடம் - Person
 காலம் - Tense

 Example: vAḷntukkoṇṭiruntēṉ [வாழ்ந்துகொண்டிருந்தேன்]
◦ vAḷ (வாழ்): root, "live"
◦ intu (ந்து): voice marker, past tense, object voice
◦ koṇṭu (கொண்டு): tense marker, "during"
◦ irunta (இருந்த): aspect marker, past progressive
◦ ēn (ஏன்): person marker, first person singular

The proposed solution
A word is represented at two levels, the surface level and the lexical level.
At the surface level, a word appears in its original orthographic form. At the
lexical level, a word is written out as all of its functional components.

SURFACE LEVEL LEXICAL LEVEL

RULE-BASED MORPHOLOGICAL ANALYSIS


Analyzing word inflections using rules specified in Tamil Grammar

Grammar verse listing the verb suffixes (விகுதி):

அன் ஆன் அள் ஆள் அர் ஆர் பம்மார்
அஆ குடுதுறு என் ஏன் அல் அன்
அம் ஆம் எம் ஏம் ஓமொ டும்மூர்
கடதற ஐ ஆய் இம்மின் இர்ஈர்
ஈயர் கயவு மென்பவும் பிறவும்
வினையின் விகுதி பெயரினும் சிலவே

Grammar sources: நன்னூல் (Nannool), தொல்காப்பியம் (Tholkaapiyam)
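A minimal sketch of how such suffix rules might be applied by scanning a word from its end. It assumes the word is already written as pure consonants and independent vowels (so படித்தான் becomes ப்அட்இத்த்ஆன்). The suffix list is a small illustrative subset (அன், ஆன், அள், ஆள், அர், ஆர், என், ஏன், து, உ), not the project's actual rule set, and the class and method names are hypothetical. Tamil letters are written as Unicode escapes so that the file compiles whatever the source encoding.

```java
import java.util.ArrayList;
import java.util.List;

class SuffixScanner {
    // Illustrative subset of suffixes, in consonant + vowel (decomposed) form.
    static final List<String> SUFFIXES = List.of(
        "\u0B85\u0BA9\u0BCD",  // an
        "\u0B86\u0BA9\u0BCD",  // aan
        "\u0B85\u0BB3\u0BCD",  // aL
        "\u0B86\u0BB3\u0BCD",  // aaL
        "\u0B85\u0BB0\u0BCD",  // ar
        "\u0B86\u0BB0\u0BCD",  // aar
        "\u0B8E\u0BA9\u0BCD",  // en
        "\u0B8F\u0BA9\u0BCD",  // een
        "\u0BA4\u0BCD\u0B89",  // thu, written as th + u
        "\u0B89");             // u

    // Returns every listed suffix that matches the end of the word, longest first.
    static List<String> candidateSuffixes(String decomposedWord) {
        List<String> hits = new ArrayList<>();
        for (String suffix : SUFFIXES) {
            // Require something to be left over as a stem.
            if (decomposedWord.length() > suffix.length() && decomposedWord.endsWith(suffix)) {
                hits.add(suffix);
            }
        }
        hits.sort((a, b) -> Integer.compare(b.length(), a.length()));
        return hits;
    }
}
```

With this list, ப்அட்இத்த்ஆன் yields the single candidate ஆன், while the decomposed form of படித்தது yields two candidates (த்உ and உ), so a later step has to choose between them.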

The proposed solution
MACHINE LEARNING APPROACH
1. When the rules are applied strictly, more than one suffix may match a given
word, but only one of them is semantically valid.
Suffix (விகுதி): படித்து – “உ”; படித்தது – “து” or “உ”?
The machine learning approach lets the system “learn” the correct parse for the
word, so that the wrong candidates are eliminated automatically when the same
word is processed again.

2. Two words might share the same inflectional part.

நடக்கின்றான் படிக்கின்றான்
The system learns the inflectional part of every word. This saves work, because
the second word does not have to be analysed again from scratch.
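The slides do not say which learning technique was used. The sketch below shows only the simplest reading of the two points above: a lookup table that remembers confirmed parses. The class, its methods, and the romanized strings and tag names used with it are all hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ParseMemory {
    private final Map<String, String> suffixByWord = new HashMap<>();
    private final Map<String, List<String>> tagsByInflection = new HashMap<>();

    // Point 1: remember which candidate suffix was confirmed for a word.
    void confirmSuffix(String word, String suffix) {
        suffixByWord.put(word, suffix);
    }

    // Narrow the candidates to the confirmed suffix if this word was seen before;
    // otherwise return the candidates unchanged.
    List<String> resolve(String word, List<String> candidates) {
        String known = suffixByWord.get(word);
        return (known != null && candidates.contains(known)) ? List.of(known) : candidates;
    }

    // Point 2: remember the analysis of an inflectional part shared by many words.
    void learnInflection(String inflection, List<String> tags) {
        tagsByInflection.put(inflection, List.copyOf(tags));
    }

    // Returns the stored analysis, or null if this inflection has not been seen yet.
    List<String> recallInflection(String inflection) {
        return tagsByInflection.get(inflection);
    }
}
```

Once one word ending in a given inflection has been analysed and stored with `learnInflection`, a second word with the same ending can be answered from `recallInflection` without running the rules again.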

Where does it all fit in?

 Characters: ப டி த் தா ன்
 Word Tokenization: படித்தான்
 Morphological Analysis: படி - த்த் - ஆன்
 Sentence Syntax Analysis: அவன் புத்தகத்தைப் படித்தான் ("He read the book")
 Semantic Analysis: meaning of the sentence?

Need for Tamil Morphological Analysis
ENGLISH vs. TAMIL
I came        நான் வந்தேன்
You came      நீ வந்தாய்
They came     அவர்கள் வந்தனர்
He came       அவன் வந்தான்
She came      அவள் வந்தாள்

TRANSLATION AND SEMANTIC ANALYSIS

அவன் மதுரைக்கு வந்தாள் ("He came to Madurai", with a feminine verb ending) -- Semantically Wrong

Checking the semantic correctness of a sentence requires morphological analysis.

How should the above sentence be translated?
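A rough sketch of the agreement check this slide motivates, limited to the five pronouns and verb endings in the table above. The tag names other than 3SM are assumptions, the endings are matched on the surface spelling as a shortcut for a full morphological analysis, and the class name is hypothetical. Tamil strings are written as Unicode escapes with romanized comments.

```java
import java.util.Map;

class AgreementChecker {
    // Subject pronoun -> person/number/gender tag.
    static final Map<String, String> SUBJECT_TAG = Map.of(
        "\u0BA8\u0BBE\u0BA9\u0BCD", "1S",                    // naan, "I"
        "\u0BA8\u0BC0", "2S",                                // nee, "you"
        "\u0B85\u0BB5\u0BB0\u0BCD\u0B95\u0BB3\u0BCD", "3P",  // avargal, "they"
        "\u0B85\u0BB5\u0BA9\u0BCD", "3SM",                   // avan, "he"
        "\u0B85\u0BB5\u0BB3\u0BCD", "3SF");                  // aval, "she"

    // Surface verb ending (vowel sign + consonant) -> tag.
    static final Map<String, String> ENDING_TAG = Map.of(
        "\u0BC7\u0BA9\u0BCD", "1S",    // as in vantheen
        "\u0BBE\u0BAF\u0BCD", "2S",    // as in vanthaay
        "\u0BA9\u0BB0\u0BCD", "3P",    // as in vanthanar
        "\u0BBE\u0BA9\u0BCD", "3SM",   // as in vanthaan
        "\u0BBE\u0BB3\u0BCD", "3SF");  // as in vanthaal

    // Returns the tag of the first matching ending, or null if none matches.
    static String verbTag(String verb) {
        for (Map.Entry<String, String> e : ENDING_TAG.entrySet()) {
            if (verb.endsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return null;
    }

    // True only when the subject is known and its tag equals the verb's tag.
    static boolean agrees(String subject, String verb) {
        String subjectTag = SUBJECT_TAG.get(subject);
        return subjectTag != null && subjectTag.equals(verbTag(verb));
    }
}
```

For the sentence above, `agrees` returns false for அவன் with வந்தாள் and true for அவன் with வந்தான்.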

Resources Obtained
EMILLE – CIIL TAMIL MONOLINGUAL CORPUS
 Enabling Minority Language Engineering
 Collaborative Venture of
◦ Lancaster University, UK
◦ Central Institute of Indian Languages (CIIL), Mysore, India
 Distributed by the European Language Resources Association (ELRA)
TAMIL WORDNET
 The database is a semantic dictionary that is designed as a lexical network
 Developed by
◦ Department of Linguistics of Tamil University
◦ AU-KBC Research Centre, Chennai
 Tamil Wordnet resembles a traditional dictionary. It also contains valuable
information about morphologically related words

Implementation Details - 1
[Flowchart. Node labels: Input Tamil Word; Check in DB (Yes / No); C-V Segmentation; Classify and Remove Inflection; Backward Scanning of inflections; Root verb? (Yes / No); Conflict Resolution; Machine Learning; Output]

Implementation Details - 2
Input word: படித்தான்

Letters: ப டி த் தா ன்

C-V segmentation: ப்-அ ட்-இ த் த்-ஆ ன்

Consonant and vowel sequence: ப் அ ட் இ த் த் ஆ ன்

Analysis:
படி <VERB_ROOT>
த்த் <PAST TENSE>
ஆன் <3SM>
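The C-V segmentation step above can be sketched as follows. This is an illustration based on how Tamil is encoded in Unicode, not the project's code: each consonant letter is emitted with a pulli, and a following vowel sign (or the inherent அ) is emitted as an independent vowel. The class name is hypothetical, and the sketch assumes two-part vowel signs such as ொ arrive as single code points.

```java
import java.util.ArrayList;
import java.util.List;

class CvSegmenter {
    static final char PULLI = 0x0BCD;     // TAMIL SIGN VIRAMA
    static final char LETTER_A = 0x0B85;  // TAMIL LETTER A, the inherent vowel
    // Dependent vowel signs and, at the same index, the matching independent vowels.
    static final String SIGNS  = "\u0BBE\u0BBF\u0BC0\u0BC1\u0BC2\u0BC6\u0BC7\u0BC8\u0BCA\u0BCB\u0BCC";
    static final String VOWELS = "\u0B86\u0B87\u0B88\u0B89\u0B8A\u0B8E\u0B8F\u0B90\u0B92\u0B93\u0B94";

    static boolean isConsonant(char c) {
        return c >= 0x0B95 && c <= 0x0BB9;  // TAMIL LETTER KA .. TAMIL LETTER HA
    }

    // Splits a Tamil word into pure consonants and independent vowels.
    static List<String> segment(String word) {
        List<String> units = new ArrayList<>();
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            if (!isConsonant(c)) {
                units.add(String.valueOf(c));  // independent vowel or other symbol
                continue;
            }
            units.add("" + c + PULLI);
            char next = (i + 1 < word.length()) ? word.charAt(i + 1) : 0;
            int sign = SIGNS.indexOf(next);
            if (next == PULLI) {
                i++;                                           // pure consonant, no vowel
            } else if (sign >= 0) {
                units.add(String.valueOf(VOWELS.charAt(sign)));  // explicit vowel sign
                i++;
            } else {
                units.add(String.valueOf(LETTER_A));           // inherent vowel
            }
        }
        return units;
    }
}
```

Applied to படித்தான், `segment` returns the eight units ப், அ, ட், இ, த், த், ஆ, ன், matching the sequence on this slide.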

Implementation Details - 3
UNICODE SUPPORT FOR TAMIL
 U+0B80 – U+0BFF
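As a small illustration of working with this range in Java, the sketch below checks whether a string consists only of code points from the Tamil block, using the standard `Character.UnicodeBlock` class. The helper class and method names are hypothetical, and the slides do not say whether the project performed such a check.

```java
class TamilText {
    // True if the string is non-empty and every code point lies in U+0B80..U+0BFF.
    static boolean isTamilWord(String s) {
        return !s.isEmpty()
            && s.codePoints().allMatch(
                cp -> Character.UnicodeBlock.of(cp) == Character.UnicodeBlock.TAMIL);
    }
}
```

A check like this could be used to reject Roman or mixed-script input before analysis.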

GOOGLE TAMIL TRANSLITERATOR IME (Input Method)


 Google Transliteration IME is an input method editor that lets users
enter Tamil text using a Roman keyboard

PROGRAMMING LANGUAGE
 Java

DATABASES
 MySQL databases, accessed from Java through JDBC

Implementation Details - 4
TRANSLITERATION MODULE
 A simple transliterator module that converts Tamil script to Roman (English)
letters and vice versa
 Example:
◦ அ - a
◦ ஆ - aa
◦ க - ka
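A minimal sketch of such a table-driven transliterator, using only the three mappings listed above. The class and method names are hypothetical; a usable module would need the full letter table and a way to handle vowel signs, which this sketch leaves out. Tamil letters are written as Unicode escapes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class Transliterator {
    static final Map<String, String> TAMIL_TO_ROMAN = new LinkedHashMap<>();
    static {
        TAMIL_TO_ROMAN.put("\u0B85", "a");   // TAMIL LETTER A
        TAMIL_TO_ROMAN.put("\u0B86", "aa");  // TAMIL LETTER AA
        TAMIL_TO_ROMAN.put("\u0B95", "ka");  // TAMIL LETTER KA
    }

    // Tamil -> Roman, one character at a time; unknown characters pass through.
    static String toRoman(String tamil) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tamil.length(); i++) {
            String ch = String.valueOf(tamil.charAt(i));
            sb.append(TAMIL_TO_ROMAN.getOrDefault(ch, ch));
        }
        return sb.toString();
    }

    // Roman -> Tamil, always taking the longest Roman key that matches.
    static String toTamil(String roman) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < roman.length()) {
            String bestRoman = null;
            String bestTamil = null;
            for (Map.Entry<String, String> e : TAMIL_TO_ROMAN.entrySet()) {
                String r = e.getValue();
                if (roman.startsWith(r, i) && (bestRoman == null || r.length() > bestRoman.length())) {
                    bestRoman = r;
                    bestTamil = e.getKey();
                }
            }
            if (bestRoman == null) {
                sb.append(roman.charAt(i));  // no mapping: copy the character
                i++;
            } else {
                sb.append(bestTamil);
                i += bestRoman.length();
            }
        }
        return sb.toString();
    }
}
```

Longest-match is one possible way to resolve the overlap between "a" and "aa"; with it, "aaka" becomes ஆக rather than அஅக.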

HASH TABLE GENERATOR


 The application uses two data files, one listing the vigudhi (விகுதி, suffixes)
and one listing the idainilai (இடைநிலை, medial markers).
 The Java hash generator code loads the data from these workbooks, adds it
to a hash table, serializes the table, and writes it to an external data
file that can be loaded whenever the application needs it.
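The build, serialize and reload cycle described above might look like the sketch below. It assumes the input is a plain UTF-8 text file with one entry per line rather than a workbook, and it stores a caller-supplied tag as the value for every entry. The class name, method names and file layout are all assumptions; only standard `java.io` and `java.nio.file` calls are used.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;

class HashTableGenerator {
    // Reads one entry per line and maps each entry to the given tag.
    static HashMap<String, String> build(Path dataFile, String tag) throws IOException {
        HashMap<String, String> table = new HashMap<>();
        for (String line : Files.readAllLines(dataFile, StandardCharsets.UTF_8)) {
            String entry = line.trim();
            if (!entry.isEmpty()) {
                table.put(entry, tag);
            }
        }
        return table;
    }

    // Serializes the table to an external data file.
    static void save(HashMap<String, String> table, Path out) throws IOException {
        try (ObjectOutputStream oos = new ObjectOutputStream(Files.newOutputStream(out))) {
            oos.writeObject(table);
        }
    }

    // Loads a table previously written by save().
    @SuppressWarnings("unchecked")
    static HashMap<String, String> load(Path in) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(Files.newInputStream(in))) {
            return (HashMap<String, String>) ois.readObject();
        }
    }
}
```

Serializing once and reloading at start-up avoids re-parsing the source files on every run, which appears to be the motivation given on the slide.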

Future Scope
 The algorithm can be extended to cover nouns and noun forms too.

 The algorithm can be improved to incorporate stricter rules so as to reduce
conflicts that arise in the output generated by the current system.

 The algorithm can be extended for other agglutinative languages.

 The various resources obtained as a part of this project, including the
EMILLE-CIIL ELRA Corpus, the Tamil Wordnet Database and other tools, can
be used for further study, research and development in the field of Natural
Language Processing at our college in the years to come.

References
 A Novel Approach to Morphological Analysis for Tamil Language
◦ Anand Kumar M, Dhanalakshmi V, Rajendran S, Soman K P
 Nannool and Tholkaapiyam
◦ Tamil Grammar texts
 The Morphological Generator and Parsing Engine for Tamil Verb Forms.
◦ Ultimate Software Solution, Dindigul
 Morphological Analyzer for Tamil
◦ Anandan P, Ranjani Parthasarathy, Geetha T.V. [2002]
◦ ICON 2002, RCILTS-Tamil, Anna University, India.
 Morphology. A Handbook on Inflection and Word Formation
◦ Daelemans Walter, G. Booij, Ch. Lehmann, and J. Mugdan (eds.) [2004]
 Tamil Part-of-Speech tagger based on SVMTool
◦ Dhanalakshmi V, Anandkumar M, Vijaya M.S, Loganathan R, Soman K.P, Rajendran S [2008]
◦ Proceedings of the COLIPS International Conference on Asian Language Processing 2008 (IALP).
 Unsupervised Learning of the Morphology of a Natural Language.
◦ John Goldsmith. [2001]
◦ Computational Linguistics, 27(2):153–198.
 Computational morphology of verbal complex
◦ Rajendran, S., Arulmozi, S., Ramesh Kumar, Viswanathan, S. [2001]
◦ Paper read at a conference at Dravidian University, Kuppam, December 26-29, 2001.
Thank you
