Tamil Morphological Analysis

An Efficient Rule-Based System for Morphological
Parsing of Tamil Language

தமிழ் உருபனியல் ஆய்வு
Final Semester Project

Department of Computer Science and Engineering
National Institute of Technology, Tiruchirappalli
May 2010
STUDENTS:
Karthik S 106106029
Praveen Kumar 106106045
Venkataraman GB 106106073
GUIDE:
Dr. V. Gopalakrishnan
Agenda
 Overview of the Project
 NLP Applications – The Stakeholders
 The problem at hand
 The proposed solution
◦ Rule-Based Morphological Analysis
◦ Machine Learning
 Where does it all fit in?
 Need for Tamil Morphological Analysis
 Resources Obtained
 Implementation Details
 Demonstration
 Future Scope
1 WHO WHAT WHERE WHY HOW

12/08/2021 National Institute of Technology, Tiruchirappalli
Overview of the Project
 Natural Language Processing
 Morphological Analysis
 Tamil Language
Morphing …
… And in Tamil:
நடப்பான் (he will walk)
நடக்கின்றான் (he walks)
நடக்கின்றாள் (she walks)
நடந்தான் (he walked)
நடந்தனர் (they walked)

NLP Applications – The Stakeholders
WHO ARE THE STAKEHOLDERS?

Natural Language Processing applications such as:
 Stemming
 Machine Translation
 Speech Recognition
 Information Retrieval
WHY ARE THESE APPLICATIONS THE STAKEHOLDERS?

The problem at hand
Morphological analysis of Tamil involves understanding the structure of a word
and its inflections.
AGGLUTINATION IN TAMIL
 Agglutination is the morphological process of adding affixes to the base of a word.
 A typical Tamil verb form carries a number of suffixes marking person, number,
mood, tense and voice.
INFLECTIONS IN TAMIL
திணை - Class
பால் - Gender
எண் - Number
இடம் - Person
காலம் - Tense

 Example: vāḻntukkoṇṭiruntēṉ [வாழ்ந்துகொண்டிருந்தேன்]
◦ vāḻ - வாழ்: root (live)
◦ intu - ந்து: voice marker (past tense, object voice)
◦ koṇṭu - கொண்டு: tense marker (during)
◦ irunta - இருந்த: aspect marker (past progressive)
◦ ēn - ஏன்: person marker (first person, singular)

The proposed solution
A word is represented at two levels: the surface level and the lexical level.
At the surface level, the word appears in its original orthographic form. At
the lexical level, the word is represented by listing all of its functional
components.
For example:
SURFACE LEVEL: படித்தான்
LEXICAL LEVEL: படி <VERB_ROOT> + த்த் <PAST TENSE> + ஆன் <3SM>
RULE-BASED MORPHOLOGICAL ANALYSIS

Analyzing word inflections using rules specified in Tamil Grammar
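அன் ஆன் அள் ஆள் அர் ஆர் பம்மார்
அஆ குடுதுறு என் ஏன் அல் அன்
அம் ஆம் எம் ஏம் ஓமொ டும்மூர்
கடதற ஐ ஆய் இம்மின் இர்ஈர்
ஈயர் கயவு மென்பவும் பிறவும்
வினையின் விகுதி பெயரினும் சிலவே
(The verse enumerates the verb suffixes, or vigudhi, and notes that a few of
them also occur on nouns.)
Sources: நன்னூல் (Nannool), தொல்காப்பியம் (Tholkaapiyam)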
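
A minimal sketch of how such a suffix list might be applied, assuming the word
has already been split into pure consonants and vowels. It checks which entries
of a small sample table match the end of the word. The class name
SuffixScanner, the three sample suffixes (ஆன், ஆள், ஏன்) and the tag names 3SF
and 1S are illustrative assumptions, not the project's actual table. Tamil
letters are written as Unicode escapes so the file compiles under any source
encoding.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: finds the known suffixes that end a segmented word. */
final class SuffixScanner {
    // Sample suffix table: phoneme sequence -> tag (assumed, illustrative).
    private static final Map<List<String>, String> VIGUDHI = new LinkedHashMap<>();
    static {
        VIGUDHI.put(List.of("\u0B86", "\u0BA9\u0BCD"), "3SM"); // -aan
        VIGUDHI.put(List.of("\u0B86", "\u0BB3\u0BCD"), "3SF"); // -aaL
        VIGUDHI.put(List.of("\u0B8F", "\u0BA9\u0BCD"), "1S");  // -een
    }

    /** Returns the tag of every table suffix that matches the end of the word. */
    static List<String> matchingTags(List<String> phonemes) {
        List<String> tags = new ArrayList<>();
        for (Map.Entry<List<String>, String> entry : VIGUDHI.entrySet()) {
            List<String> suffix = entry.getKey();
            int start = phonemes.size() - suffix.size();
            // Compare the tail of the word with the suffix, unit by unit.
            if (start >= 0 && phonemes.subList(start, phonemes.size()).equals(suffix)) {
                tags.add(entry.getValue());
            }
        }
        return tags;
    }
}
```

For the segmented word ப் அ ட் இ த் த் ஆ ன், only the ஆன் entry matches. With a
full table, several entries can match the same ending, and that is where
conflict resolution would be needed.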

The proposed solution
MACHINE LEARNING APPROACH
1. While checking for suffixes in a given word, more than one suffix may be
possible if the rules are followed strictly, but only one of them is
semantically valid.
விகுதி (suffix): படித்து – “உ”; படித்தது – “து” or “உ” ???
The machine learning approach lets the system “learn” the correct parse for a
word, so that the wrong possibilities are eliminated automatically when the
same word is processed again.
2. Two words might share the same inflectional part.
நடக்கின்றான் படிக்கின்றான்
The system learns the inflectional part of every word it analyses. This avoids
analysing the second word again from scratch.
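
The second point can be pictured as a lookup table of inflectional parts that
have already been analysed. The sketch below is only an illustration under
that assumption. The class name InflectionMemo, its methods and the analysis
string "PRESENT+3SM" are hypothetical, and the slides do not say how the
project actually stores what it learns.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical cache of inflectional parts that were analysed before. */
final class InflectionMemo {
    private final Map<String, String> learnt = new HashMap<>();

    /** Remembers the analysis of an inflectional part. */
    void learn(String inflection, String analysis) {
        learnt.put(inflection, analysis);
    }

    /** Returns the analysis of the longest learnt inflection that ends the word. */
    Optional<String> lookup(String word) {
        // Start at 1 so that at least one character is left for the root.
        for (int start = 1; start < word.length(); start++) {
            String analysis = learnt.get(word.substring(start));
            if (analysis != null) {
                return Optional.of(analysis);
            }
        }
        return Optional.empty();
    }
}
```

Once the ending க்கின்றான் has been learnt from நடக்கின்றான், a lookup for
படிக்கின்றான் finds the same ending and returns the stored analysis without
running the rules again.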

Where does it all fit in?
Characters: ப டி த் தா ன்
Word Tokenization: படித்தான்
Morphological Analysis: படி - த்த் - ஆன்
Sentence Syntax Analysis: அவன் புத்தகத்தைப் படித்தான் (He read the book)
Semantic Analysis: Meaning of the sentence ???

Need for Tamil Morphological Analysis
ENGLISH vs. TAMIL
I came - நான் வந்தேன்
You came - நீ வந்தாய்
He came - அவன் வந்தான்
She came - அவள் வந்தாள்
They came - அவர்கள் வந்தனர்
TRANSLATION AND SEMANTIC ANALYSIS
அவன் மதுரைக்கு வந்தாள் -- Semantically wrong: the subject அவன் (he) is
masculine, but the verb ending ஆள் is feminine.
To check the semantic correctness of a sentence, morphological analysis is needed.

How should the above sentence be translated?
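
As a rough illustration of why the verb ending matters, the sketch below
compares the person and gender implied by a subject pronoun with the one
implied by the written ending of the verb. The class AgreementChecker, its two
small tables and the tag names are assumptions made for this example. A real
checker would use the output of the morphological analyser rather than raw
string endings.

```java
import java.util.Map;

/** Hypothetical subject-verb agreement check on three sample pronouns. */
final class AgreementChecker {
    // Subject pronouns: avan (he), avaL (she), naan (I).
    private static final Map<String, String> PRONOUN_TAG = Map.of(
        "\u0B85\u0BB5\u0BA9\u0BCD", "3SM",
        "\u0B85\u0BB5\u0BB3\u0BCD", "3SF",
        "\u0BA8\u0BBE\u0BA9\u0BCD", "1S");

    // Verb endings as written (vowel sign + consonant + pulli): -aan, -aaL, -een.
    private static final Map<String, String> ENDING_TAG = Map.of(
        "\u0BBE\u0BA9\u0BCD", "3SM",
        "\u0BBE\u0BB3\u0BCD", "3SF",
        "\u0BC7\u0BA9\u0BCD", "1S");

    /** True only when the verb ending is known and matches the subject. */
    static boolean agrees(String subject, String verb) {
        String expected = PRONOUN_TAG.get(subject);
        for (Map.Entry<String, String> entry : ENDING_TAG.entrySet()) {
            if (verb.endsWith(entry.getKey())) {
                return entry.getValue().equals(expected);
            }
        }
        return false; // unknown ending: agreement cannot be confirmed
    }
}
```

With these tables, அவன் with வந்தான் agrees, while அவன் with வந்தாள் does not.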

Resources Obtained
EMILLE – CIIL TAMIL MONOLINGUAL CORPUS
 Enabling Minority Language Engineering
 Collaborative Venture of
◦ Lancaster University, UK
◦ Central Institute of Indian Languages (CIIL), Mysore, India
 Distributed by European Language Resources Association [ELRA]
TAMIL WORDNET
 The database is a semantic dictionary that is designed as a lexical network
 Developed by
◦ Department of Linguistics of Tamil University
◦ AU-KBC Research Centre, Chennai
 Tamil Wordnet resembles a traditional dictionary. It also contains valuable
information about morphologically related words

Implementation Details - 1
[Flowchart with the steps: Input Tamil Word; Check in DB; C-V Segmentation;
Classify and Backward Scanning; Remove Inflection of inflections; Conflict
Resolution; Machine Learning; Output. Decision boxes: "Check in DB" (Yes / No)
and "Root verb?" (Yes / No).]

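Input word: படித்தான்
Characters: ப டி த் தா ன்
C-V segmentation: ப் - அ ட் - இ த் த் - ஆ ன்
Phoneme sequence: ப் அ ட் இ த் த் ஆ ன்
Output:
படி < VERB_ROOT >
த்த் < PAST TENSE >
ஆன் < 3SM > (third person, singular, masculine)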
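
The C-V segmentation step in the worked example can be sketched as follows.
This is a minimal illustration, not the project's code. The class name
CvSegmenter is hypothetical, and the sketch assumes that a consonant written
without a vowel sign or pulli carries the inherent vowel அ. Input is
normalised to NFC so that two-part vowel signs are handled as single
characters.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits a Tamil word into pure consonants and vowels. */
final class CvSegmenter {
    private static final char PULLI = '\u0BCD';   // marks a consonant without a vowel
    private static final char VOWEL_A = '\u0B85'; // inherent vowel "a"
    // Dependent vowel signs and, at the same index, the vowel each one stands for.
    private static final String SIGNS =
        "\u0BBE\u0BBF\u0BC0\u0BC1\u0BC2\u0BC6\u0BC7\u0BC8\u0BCA\u0BCB\u0BCC";
    private static final String VOWELS =
        "\u0B86\u0B87\u0B88\u0B89\u0B8A\u0B8E\u0B8F\u0B90\u0B92\u0B93\u0B94";

    private static boolean isConsonant(char c) {
        return c >= '\u0B95' && c <= '\u0BB9';
    }

    static List<String> segment(String word) {
        String s = Normalizer.normalize(word, Normalizer.Form.NFC);
        List<String> units = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (!isConsonant(c)) {
                units.add(String.valueOf(c)); // independent vowel or other symbol
                continue;
            }
            units.add("" + c + PULLI);        // the pure consonant
            char next = i + 1 < s.length() ? s.charAt(i + 1) : 0;
            int sign = SIGNS.indexOf(next);
            if (next == PULLI) {
                i++;                          // consonant had no vowel
            } else if (sign >= 0) {
                units.add(String.valueOf(VOWELS.charAt(sign)));
                i++;                          // vowel came from a vowel sign
            } else {
                units.add(String.valueOf(VOWEL_A)); // inherent vowel
            }
        }
        return units;
    }
}
```

Calling segment on படித்தான் yields ப், அ, ட், இ, த், த், ஆ, ன், which is the
phoneme sequence shown in the example.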

UNICODE SUPPORT FOR TAMIL
 U+0B80 – U+0BFF
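
In Java, the Tamil range can be tested with the standard
Character.UnicodeBlock class. The helper below is a small sketch. The class
name TamilScript and the method isTamil are hypothetical.

```java
/** Hypothetical helper: checks that a string contains only Tamil-block characters. */
final class TamilScript {
    static boolean isTamil(String text) {
        // Character.UnicodeBlock.TAMIL covers U+0B80 to U+0BFF.
        return !text.isEmpty() && text.codePoints()
            .allMatch(cp -> Character.UnicodeBlock.of(cp) == Character.UnicodeBlock.TAMIL);
    }
}
```

A check like this could be used to reject input that mixes Tamil with Roman
letters or digits before analysis begins.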
GOOGLE TAMIL TRANSLITERATOR IME (Input Method)

 Google Transliteration IME is an input method editor that lets users enter
Tamil text using a Roman keyboard.
PROGRAMMING LANGUAGE
 Java
DATABASES
 MySQL, accessed from Java through JDBC

TRANSLITERATION MODULE
 A simple transliteration module converts text from Tamil script to Roman
(English) letters and vice versa.
 Example:
◦ அ - a
◦ ஆ - aa
◦ க - ka
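
A minimal sketch of the Tamil-to-Roman direction, consistent with the three
examples above. Only a handful of letters are covered. The class name
TamilToRoman and every mapping other than அ, ஆ and க are assumptions made for
illustration, and the reverse direction is omitted.

```java
import java.util.Map;

/** Hypothetical Tamil-to-Roman transliterator covering a few sample letters. */
final class TamilToRoman {
    private static final char PULLI = '\u0BCD'; // removes the inherent vowel
    // Independent vowels: a, aa, i.
    private static final Map<Character, String> VOWELS = Map.of(
        '\u0B85', "a", '\u0B86', "aa", '\u0B87', "i");
    // Dependent vowel signs: aa, i.
    private static final Map<Character, String> VOWEL_SIGNS = Map.of(
        '\u0BBE', "aa", '\u0BBF', "i");
    // Consonant bases: k, p, m.
    private static final Map<Character, String> CONSONANTS = Map.of(
        '\u0B95', "k", '\u0BAA', "p", '\u0BAE', "m");

    static String transliterate(String tamil) {
        StringBuilder roman = new StringBuilder();
        for (int i = 0; i < tamil.length(); i++) {
            char c = tamil.charAt(i);
            if (VOWELS.containsKey(c)) {
                roman.append(VOWELS.get(c));
            } else if (CONSONANTS.containsKey(c)) {
                roman.append(CONSONANTS.get(c));
                char next = i + 1 < tamil.length() ? tamil.charAt(i + 1) : 0;
                if (next == PULLI) {
                    i++;                                  // pure consonant
                } else if (VOWEL_SIGNS.containsKey(next)) {
                    roman.append(VOWEL_SIGNS.get(next));  // explicit vowel sign
                    i++;
                } else {
                    roman.append("a");                    // inherent vowel
                }
            } else {
                roman.append(c);                          // unmapped: pass through
            }
        }
        return roman.toString();
    }
}
```

With this table, க becomes "ka" and அம்மா becomes "ammaa".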
HASH TABLE GENERATOR

 The application uses two data files containing lists of vigudhi (suffixes)
and idainilai (medial markers).
 The Java hash generator code loads the data from these workbooks, adds it to
a hash table, serializes the table, and writes it to an external data file
that can be loaded whenever the application needs it.
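
The serialize-and-reload step might look like the sketch below, which uses
standard Java object serialization. The class name SuffixTableStore, the use
of HashMap<String, String> and the file handling are assumptions. Reading the
workbooks is left out, so the table entries are expected to be supplied by the
caller.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;

/** Hypothetical store that serializes a suffix table and loads it back. */
final class SuffixTableStore {
    /** Writes the table to the given file using Java serialization. */
    static void save(HashMap<String, String> table, Path file) {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(table);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads a table previously written by save. */
    @SuppressWarnings("unchecked")
    static HashMap<String, String> load(Path file) {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (HashMap<String, String>) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("Serialized table is not readable", e);
        }
    }
}
```

Serializing once and loading the file at start-up avoids re-parsing the source
data each time the application runs.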

Future Scope
 The algorithm can be extended to cover nouns and noun forms too.
 The algorithm can be improved to incorporate stricter rules so as to reduce
conflicts that arise in the output generated by the current system.
 The algorithm can be extended for other agglutinative languages.
 The various resources obtained as a part of this project, including the
EMILLE-CIIL ELRA Corpus, the Tamil Wordnet Database and other tools can
be used for further study, research and development in the field of Natural
Language Processing at our college in the years to come.
References
 A Novel Approach to Morphological Analysis for Tamil Language
◦ Anand Kumar M, Dhanalakshmi V, Rajendran S, Soman K P
 Nannool and Tholkaapiyam
◦ Tamil Grammar texts
 The Morphological Generator and Parsing Engine for Tamil Verb Forms.
◦ Ultimate Software Solution, Dindigul
 Morphological Analyzer for Tamil
◦ Anandan. P, Ranjani Parthasarathy, Geetha T.V. [2002]
◦ ICON 2002, RCILTS-Tamil, Anna University, India.
 Morphology. A Handbook on Inflection and Word Formation
◦ Daelemans Walter, G. Booij, Ch. Lehmann, and J. Mugdan (eds.) [2004]
 Tamil Part-of-Speech tagger based on SVMTool
◦ Dhanalakshmi V, Anandkumar M, Vijaya M.S, Loganathan R, Soman K.P, Rajendran S [2008]
◦ Proceedings of the COLIPS International Conference on Asian Language Processing 2008 (IALP).
 Unsupervised Learning of the Morphology of a Natural Language.
◦ John Goldsmith. [2001]
◦ Computational Linguistics, 27(2):153–198.
 Computational morphology of verbal complex
◦ Rajendran, S., Arulmozi, S., Ramesh Kumar, Viswanathan, S. [2001]
◦ Paper read at a conference at Dravidian University, Kuppam, December 26-29, 2001.
Thank you
