
Project Report on Text Normalization Surya Teja I 26-11-2013

(I) Introduction and Background: Text normalization is the process of transforming text into a canonical form. In the context of text-to-speech conversion systems, this means the expansion of non-standard words, abbreviations, etc. Text normalization in the scope of this project is related to spell checking but goes beyond the usual notions of spell checking. Spell checking includes spelling error detection and correction. Usually detection is automatic but correction is semi-automatic: the machine provides options and the user selects a suggested word. The user may also ask the system to accept the word as it is, accept it in all occurrences within the current document, or add it to the personalized dictionary so that all occurrences in future documents will also be accepted.

(II) Aim, scope, objectives: The focus of this work is on Telugu; only modern Telugu prose is considered. The design, development and testing will all be based on a large text corpus. In our group, we define words in terms of meanings and universal word classes, not in terms of orthographic conventions such as spaces between words. Thus our system should be capable of dealing with all kinds of orthographic variation too. In other words, generating a canonical, normalized or standardized pada-paaTha is the main goal.

(III) Analysis of Existing system: The existing saara system generates a pada-paaTha in which each word is tagged and its root word is extracted. If a word is not analyzed properly, it is tagged as UNK (unknown). In this system, spaces are taken as the delimiters of words. In Telugu there are no standard rules for splitting words, i.e., a word like 'iMtipaina' can also be written as 'iMti paina', in which case the saara system takes 'iMti' and 'paina' as two separate words. Similarly, in a word like 'tinaleekapoovu', if there is a space between 'tina' and 'leekapoovu' then they are taken as two separate words, and as 'tina' cannot occur separately, it will be tagged as UNK.
This means that run-on errors and split errors are not being recognized. Standardizing the pada-paaTha includes checking for these types of errors and correcting them (here, by combining 'tina' and 'leekapoovu'). In the final pada-paaTha we can say that space is the actual delimiter of words (even compound words should be combined or separated by a hyphen, as in 'vaatar-miitar').

On observing 10,000 words that were tagged unknown by the saara system, we found that only 1,100 words are actual spelling errors; the remaining 8,900 are correct words which saara couldn't analyze. Of these 8,900 words, 4,600 are due to phonetic conflation, 1,150 are due to incomplete lexicon coverage, 450 are named entities, 2,400 are loan words, and 300 are inflected.

Sandhi (phonetic conflation): The majority of the UNK words that saara did not analyze are due to sandhi (phonetic conflation). Sandhi can be of two types: a) internal sandhi and b) external sandhi. Internal sandhi occurs when a single word is inflected or derived, for example 'ceTTu + lu = ceTTlu' or 'ceeyu + aDaM = ceeyaDaM'. External sandhi occurs when two different words are joined to become a single word, for example 'suurya + udayaM = suuryoodayaM'. The majority of the 4,600 words that were not analyzed by saara are due to external sandhi: saara is already capable of handling internal sandhi, whereas external sandhi handling has yet to be implemented.

Loan Words: After sandhi, the next category of words that the system couldn't handle is loan words. Loan words are words that are borrowed from other languages and used in the corpus. A word like computer in a Telugu corpus can be considered a loan word. Loan words can be written in different ways. A loan word may be written in its native script as 'computer' or ' '; if it is written in its native script then it can be easily recognized and analyzed using that specific language's dictionary. But if it is written in Telugu as '' or '' then detecting it will be difficult. The existing saara system is unable to detect loan words written in Telugu script.

Spelling Errors: 1,100 of the 10,000 UNK words are actual spelling mistakes, which are detected by the saara system but not corrected.
Many words in Telugu have variants, i.e., according to core grammar the word for 'write' in Telugu is 'vraayu', but in normal usage it is written as 'raayu'. Both forms are correct and have the same meaning; they are called variants. The saara system handles these variants either by using a variant table (all possible variants of each word are stored in a table) or by adding both words to the dictionary.
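A variant table of the kind described above can be sketched as a simple lookup. This is only an illustration, not the saara implementation: the table contents, the choice of 'vraayu' as the canonical form, and the names `VARIANT_TABLE`, `build_variant_index` and `canonical_form` are all assumptions made for the example.

```python
# Hypothetical variant table: canonical form -> accepted variants.
# Treating the core-grammar form as canonical is an assumption.
VARIANT_TABLE = {
    "vraayu": ["raayu"],
}


def build_variant_index(table):
    """Invert the table so every variant (and the canonical form) maps to the canonical form."""
    index = {}
    for canonical, variants in table.items():
        index[canonical] = canonical
        for variant in variants:
            index[variant] = canonical
    return index


VARIANT_INDEX = build_variant_index(VARIANT_TABLE)


def canonical_form(word):
    """Return the canonical form of a known variant, or the word unchanged."""
    return VARIANT_INDEX.get(word, word)


print(canonical_form("raayu"))
```

Compared with adding every variant to the dictionary, a table like this keeps one dictionary entry per word and makes the normalized form explicit.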

Block Diagram

(IV) Work To Be Done: Normalization should be done before the bridge module. This includes:

(a) Identifying words (space should be the actual delimiter of words):

Split errors and run-on errors should be identified and corrected. Ex: in 'pani ceeya leedu', 'ceeyaleedu' should be taken as a single word but it was wrongly split. These are split errors. In 'kottapaddhati', 'kotta' is one word and 'paddhati' is another, but they were wrongly merged. These are run-on errors. Compound words should be merged or hyphenated. Ex: 'cekka maMcamu' is a compound word (wooden-cot); it must be merged, or a hyphen should be included between its parts, as in 'cekka-maMcamu'.
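The two error types above can be illustrated with a minimal lexicon-based sketch. It assumes a plain word list is available; the toy `LEXICON` and the function names are hypothetical, and a real system would consult the morphological analyzer rather than a flat set.

```python
# Toy lexicon built from the examples in the text (an assumption for illustration).
LEXICON = {"pani", "ceeyaleedu", "kotta", "paddhati"}


def merge_split_errors(tokens, lexicon):
    """Join an unknown token with its right neighbour when the joined form is a known word."""
    out = []
    i = 0
    while i < len(tokens):
        joined = tokens[i] + tokens[i + 1] if i + 1 < len(tokens) else None
        if tokens[i] not in lexicon and joined in lexicon:
            out.append(joined)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def split_run_on(token, lexicon):
    """Return every way of cutting an unknown token into two known words."""
    if token in lexicon:
        return []
    return [
        (token[:k], token[k:])
        for k in range(1, len(token))
        if token[:k] in lexicon and token[k:] in lexicon
    ]


print(merge_split_errors(["pani", "ceeya", "leedu"], LEXICON))
print(split_run_on("kottapaddhati", LEXICON))
```

Both functions only propose corrections; whether a proposal is applied automatically or offered to the user is a separate design decision.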

(b) Sandhi splitter


Sandhi (euphony): Sandhi refers to a set of morpho-phonological processes that occur at either morpheme or word boundaries. Two types of sandhi are identified in a language: internal sandhi and external sandhi.

Internal sandhi refers to word-internal morphological changes that take place at morpheme boundaries during the process of word formation. Example: 'ceTTu + lu = ceTTlu' or 'ceeyu + aDaM = ceeyaDaM'.

External sandhi, on the other hand, refers to processes that apply word-externally, i.e., across word boundaries. Example: 'vaaru + ekkaDa = vaarekkaDa' or 'ataDu + oka = ataDoka'. A sandhi is formed from two words: the first word is called the puurvapada and the second word the uttarapada.

Almost 40% of the words saara couldn't analyze are due to sandhi. When the saara system cannot analyze a word, the word could be a loan word, a sandhi, a spelling error or a variant. To build a sandhi splitter we need sandhi splitting rules. We will then code these rules so that, when a word is given as input, the splitter applies the rules and outputs the two words whose sandhi resulted in the given word. For any given word, however, there could be more than one possible sandhi split. Ex:

vaarekkaDa = vaaru + ekkaDa (ukaara saMdhi)
vaarekkaDa = vaari + ekkaDa (ikaara saMdhi)

We can then ask the user to select the correct sandhi split, or we can use tag-level or word-level bigrams to choose the appropriate split. Contextual information can also be used if ambiguity still exists.
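The rule-based splitting and bigram-based disambiguation described above might look roughly like the following sketch. It covers only the two vowel-elision rules from the example; the toy lexicon, the bigram counts and the function names are hypothetical and are not taken from the saara system.

```python
# Each rule names the final vowel of the puurvapada that is dropped before a vowel.
ELISION_RULES = [("ukaara saMdhi", "u"), ("ikaara saMdhi", "i")]
VOWELS = "aeiou"
# Toy lexicon (an assumption for illustration).
LEXICON = {"vaaru", "vaari", "ekkaDa", "ataDu", "oka"}


def sandhi_splits(word, lexicon=LEXICON):
    """Return all (puurvapada, uttarapada, rule) candidates licensed by the rules."""
    candidates = []
    for k in range(1, len(word)):
        left, right = word[:k], word[k:]
        # The uttarapada must be a known, vowel-initial word.
        if right[0] not in VOWELS or right not in lexicon:
            continue
        for rule_name, dropped_vowel in ELISION_RULES:
            if left + dropped_vowel in lexicon:
                candidates.append((left + dropped_vowel, right, rule_name))
    return candidates


def pick_split(candidates, bigram_counts):
    """Choose the candidate whose word pair is most frequent in the bigram table."""
    return max(candidates, key=lambda c: bigram_counts.get((c[0], c[1]), 0))


splits = sandhi_splits("vaarekkaDa")
# Hypothetical word-level bigram counts, invented for the example.
counts = {("vaaru", "ekkaDa"): 12, ("vaari", "ekkaDa"): 1}
print(splits)
print(pick_split(splits, counts))
```

When the bigram table gives no preference, the candidates can still be shown to the user, which matches the semi-automatic mode described in the text.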

But the big question is: are all sandhi words to be split?
No. When sandhi is formed between two words, it can be Telugu sandhi or Sanskrit sandhi. Sanskrit sandhi is formed between two (Sanskrit) words, following Sanskrit sandhi rules. Ex: raama + aalayaM = raamaalayaM; bhaanu + udayaM = bhaanoodayaM. Normally when a Sanskrit sandhi is formed, the result is a compound word (samaasa), which is not to be split but added to the dictionary, as the meaning of the compound word differs from that of the two individual words. When a Telugu sandhi is formed, however, we can split the word, as the formation of compound words is not very common. Some Telugu sandhis are: ac saMdhi, druti prak tika saMdhi, aamredita saMdhi, ya ra na puMpvaa!ama saMdhulu.

But some words are not to be split. Ex: aidaaru = aidu + aaru. Here the meaning of aidaaru (approximately 5 or 6) is different from the meaning of the individual words. In Telugu the set of possible uttarapadas is closed, so by storing the list of all uttarapadas and checking for them in a word we can split the sandhi.
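The closed-list idea can be sketched as a longest-suffix match against the stored uttarapadas, with an exception list for words such as aidaaru that must stay whole. The sample lists and the name `find_uttarapada` are assumptions for illustration; the real uttarapada list would be far larger.

```python
# Hypothetical sample of the closed uttarapada list.
UTTARAPADAS = ["ekkaDa", "oka", "aaru"]
# Words whose meaning differs from their parts and which are therefore kept whole.
DO_NOT_SPLIT = {"aidaaru"}


def find_uttarapada(word):
    """Return (remainder, uttarapada) for the longest matching uttarapada, or None."""
    if word in DO_NOT_SPLIT:
        return None
    for candidate in sorted(UTTARAPADAS, key=len, reverse=True):
        if word.endswith(candidate) and len(word) > len(candidate):
            return word[: -len(candidate)], candidate
    return None


print(find_uttarapada("vaarekkaDa"))
print(find_uttarapada("aidaaru"))
```

The remainder returned here is the surface string before the uttarapada; recovering the actual puurvapada still requires applying the sandhi rules to that remainder.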

(c) handling spelling errors


A spell-checker is an application program that flags words in a document that are not spelled correctly and facilitates their correction. Correction can be automatic or semi-automatic. The complexity of spell-checker design for Indian languages varies in degree and is generally higher than that for English. One of the reasons for this complexity is the higher degree of inflection in words. A wide variety of suffixes can be attached to root words, and newer words can easily be formed through euphony and assimilation (saMdhi and samaasa). The number of valid suffixes in Hindi is the smallest while that in Telugu is the largest. Thus from a root verb in Hindi one can get on the order of ten words by adding suffixes, while in Telugu on the order of a thousand words can be formed [2]. Errors can be of two types: non-word errors and real-word errors. Non-word errors can be detected more easily than real-word errors; to detect real-word errors we have to use contextual information.

Two widely used approaches to spelling error detection are the n-gram approach and the dictionary-based approach. In general, although n-gram analysis may be useful for detecting machine-generated errors such as those produced by optical character recognizers, it has proven to be less accurate for detecting human-generated errors [].

Spelling error patterns: Issues related to spelling error patterns include such things as what the most common errors are, how many errors tend to occur within a word, whether errors tend to change word length, whether misspellings are typographically, cognitively, or phonetically based, and, in general, whether errors can be characterized by rules or by probabilistic tendencies. Spelling error pattern issues have had perhaps the greatest impact on the design of correction techniques.

1) Phonetically similar characters are often confused, which results in errors. Ex: ranam for raNam; shaMkara for ShaMkara; pelli for peLli, etc. In some cases they can also be taken as variants. Ex: saMbaMdaM and saMbaMdhaM. Short vowels and long vowels are also confused.

We can also identify non-word errors when impossible character n-grams are seen in a word, such as two dissimilar vowels, which will not appear side by side. Ex: ae, eu, etc.

Knowledge of how the corpus was typed can also be used in finding spelling patterns. If the corpus was typed using a QWERTY keyboard, motor coordination errors will differ from those made when typing on specialized Telugu keyboards. Ex: In the CIIL corpus, ee is often confused with ai. *0% of spelling errors which have 'ai' in them are due to this confusion, as in okaveela - okavaila, eerpaDavu - airpaDavu.

Spelling correction: When a spelling error is detected, proper suggestions should be given for the user to choose from. The most widely used way of getting those suggestions is the minimum-edit-distance algorithm, which computes the distance between two strings. The erroneous word is compared with all the words in the dictionary, and the word with the minimum distance is given as the suggestion. This works at the character level, but for Indian languages an akshara-level minimum-edit-distance algorithm is more beneficial.
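The difference between character-level and akshara-level distance can be seen in a short sketch. The edit distance below is the standard Levenshtein distance over any sequence; the `aksharas` segmenter is a rough assumption that only covers the romanization used in this report's examples (optional consonant cluster, a vowel, optional anusvara `M`), and the function names are hypothetical.

```python
import re

# Rough akshara pattern for the romanization used in the examples (an assumption):
# consonant cluster + vowel (+ anusvara M), or a stray consonant cluster.
AKSHARA = re.compile(
    r"[^\Waeiou\d_]*(?:aa|ii|uu|ee|oo|ai|au|[aeiou])(?:M(?![aeiou]))?|[^\Waeiou\d_]+"
)


def aksharas(word):
    """Split a romanized word into akshara-like units."""
    return AKSHARA.findall(word)


def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists of units)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def suggest(word, dictionary):
    """Return the dictionary word closest to `word` at the akshara level."""
    units = aksharas(word)
    return min(dictionary, key=lambda w: edit_distance(units, aksharas(w)))


print(aksharas("okaveela"))
print(edit_distance("okavaila", "okaveela"))
print(edit_distance(aksharas("okavaila"), aksharas("okaveela")))
```

For the ee/ai confusion, the character-level distance counts two substitutions while the akshara-level distance counts one, so the intended word ranks closer to the misspelling.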

(d) Non-standard words:
To talk about non-standard words, we should first define what standard words are. In a naïve way we can say that standard words are those which are found in the dictionary, or those which are formed according to the grammar. In real text, many non-standard representations of words appear, e.g., numbers (year, time, ordinal, cardinal, floating point), abbreviations, acronyms, currency, dates, URLs. All these non-standard representations must typically be normalized, or in other words converted to standard words, which would then be processed in various applications. Consider the sentence:

'ii kaMpeni 1990 aagasT 12na udayaM 10:30ki praaraMbhiMcaaru'

When we run saara on this sentence it generates the pada paaTha and morph as follows:

pada paaTha: ii kaMpeni 1990 aagasT 12na udayaM 10 30ki praaraMbhiMcaaru

Morph:
ii<ADJ-DEM> Tag: ii||ii||ADJ-DEM
kaMpeni<N-COM-COU-N.SL-NOM> Tag: kaMpeni||kaMpeni||N-COM-COU-N.SL-NOM
1990<NUM> Tag: 1990||1990||NUM
aagasT<N-LOC-TIM-UNC-ABS-N.SL-NOM> Tag: aagasT||aagasT||N-LOC-TIM-UNC-ABS-N.SL-NOM
12na<NUM> Tag: 12na||12na||NUM
udayaM<N-LOC-TIM-UNC-ABS-N.SL-NOM> Tag: udayaM||udayaM||N-LOC-TIM-UNC-ABS-N.SL-NOM
10<NUM> Tag: 10||10||NUM
30ki<NUM> Tag: 30ki||30ki||NUM
praaraMbhiMcaaru<praaraMbhiMcu:V-T .1:,v-ABS.PAST-P13.MF.PL-> Tag: praaraMbhiMcaaru||praaraMbhiMcu||V-T .1-ABS.PAST-P13.MF.PL

Here '1990 aagasT 12' should be considered a single word and tagged as <date>, and '10:30' should be detected as a time and tagged as <time>. This kind of tagging helps the parser to detect kaaraka roles easily. The ideal pada paaTha for that input should be like:

pada paaTha: ii kaMpeni <date>1990 aagasT 12na</date> udayaM <time>10:30ki</time> praaraMbhiMcaaru.
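Date and time tagging of this kind could be prototyped with regular expressions before tokenization. The sketch below is an assumption-laden illustration: the patterns, the one-entry `MONTHS` list, the name `tag_nsw` and the digits in the sample sentence are all chosen for the example and do not describe the saara system.

```python
import re

# Hypothetical month list; only the month from the sample sentence is included.
MONTHS = ["aagasT"]

# year + month name + day carrying the suffix "na"
DATE = re.compile(r"\b\d{4} (?:%s) \d{1,2}na\b" % "|".join(map(re.escape, MONTHS)))
# hours:minutes, optionally followed by a case suffix such as "ki"
TIME = re.compile(r"\b\d{1,2}:\d{2}[a-z]*")


def tag_nsw(sentence):
    """Wrap date and time expressions in <date> and <time> tags."""
    sentence = DATE.sub(lambda m: "<date>%s</date>" % m.group(0), sentence)
    sentence = TIME.sub(lambda m: "<time>%s</time>" % m.group(0), sentence)
    return sentence


# Illustrative input; the numeric values are assumptions.
print(tag_nsw("ii kaMpeni 1990 aagasT 12na udayaM 10:30ki praaraMbhiMcaaru"))
```

Running the tagger before whitespace tokenization keeps a multi-token date together as one unit, which is what allows it to be treated as a single word downstream.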

Other kinds of non-standard words that can be addressed are Roman numerals, fractions, ratios, telephone numbers, email addresses and URLs. Email addresses and URLs can be detected easily without much ambiguity, but ambiguity occurs in cases like years and cardinals: a token like ./B2 can be taken as a number, a year or an amount (money), so ambiguous tags will be generated, as in ./B2 /year/cardinal. For such cases, context-dependent rules must be applied.

(e) handling variations

Variants can be of two types. A word can be written in many forms (different spellings) but with the same kind of pronunciation. Ex: 'shaanti' / 'shaaMti': both are correct and are pronounced more or less the same way. The other kind of variants are those where two words mean the same but are pronounced and written differently. Ex: 'kriMdi' / 'kiMdi' and 'tarvaata' / 'taruvaata' / 'taravaata'. Keeping all the variants in the dictionary is one idea, but it makes the dictionary very large. Another idea is to maintain a variant table for each word.

(d) handling dialects

(f) adding case-markers (when needed): Case-markers play a vital role in parsing; they are used to identify kaaraka roles. Grammatically, case-markers must be used, but in usage they are usually dropped. Ex: 'raamuDu paMdunu tinnaDu': here the parser can recognize paMdu as karma, as karma always takes the accusative case. But if the case-marker is dropped, identifying the karma will be difficult. So before parsing, case-markers must be added where they are necessary.

References:
[1] Kavi Narayana Murthy. Issues in the Design of a Spell Checker for Morphologically Rich Languages. In ICOAL-03, 3rd International Conference on Asian Languages, University of Hyderabad, Hyderabad, India, 4-6 January 2001.
[2] Hindi Text Normalization. K. Panchapagesan, Partha Pratim Talukdar, N. Sridhar Krishna, Kalika Bali, A. G. Ramakrishnan.
[3] Towards Indian Language Spell-checker Design. Bidyut Baran Chaudhuri, Indian Statistical Institute <bbc@isical.ac.in>.
[4] Telugu Spell-checker. Uma Maheshwar Rao, G. Amba, P. Kulkarni, Christopher Mala and Parameshwari K.
[5] Techniques for Automatically Correcting Words in Text. Karen Kukich.
[6] On the Design of Computational Grammars: The Case of Telugu. Srinivas Badugu.
