
Introduction and background

Word segmentation is one of the most important tasks for NLP applications. For example, to suggest a
list of possible corrections, a spell checker requires word boundary information for the erroneous words.
English uses spaces and punctuation marks to identify word boundaries, but in some Asian
languages, such as Urdu, Chinese, and Japanese, spaces are not used to identify word boundaries;
the text is written as a continuous sequence.
Urdu is a distinctive language in which a space does not reliably identify a word boundary. This leads
to two main problems:
I) SPACE INSERTION
II) SPACE OMISSION
Segmentation Issues:
Urdu characters can be divided into two classes:
I) joiner characters
II) non-joiner characters
Joiners
A joiner character can acquire up to four shapes, i.e.
I) Initial
II) Medial
III) Final
IV) Isolated
For example, the Urdu letter yeh:
I) Initial: یـ
II) Medial: ـیـ
III) Final: ـی
IV) Isolated: ی
Non-joiners
A non-joiner character can acquire only two shapes, i.e.
I) Final
II) Isolated
For example, the Urdu letter daal:
I) Final: ـد
II) Isolated: د
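
The joiner and non-joiner classes described above can be sketched as a small lookup. This is only an illustration, not part of the proposed method: the function name `shapes_for` and the hard-coded set `NON_JOINERS` are assumptions introduced here, the set lists the commonly cited Urdu non-joiner letters and may not be exhaustive, and the function assumes its input is a single Urdu letter.

```python
# Assumed (possibly incomplete) set of Urdu non-joiner letters.
# These letters connect only to the preceding letter, never to the following one.
NON_JOINERS = {
    "\u0627",  # alef
    "\u0622",  # alef with madda
    "\u062F",  # daal
    "\u0688",  # ddaal
    "\u0630",  # zaal
    "\u0631",  # reh
    "\u0691",  # rreh
    "\u0632",  # zeh
    "\u0698",  # zheh
    "\u0648",  # wao
    "\u06D2",  # bari yeh
}


def shapes_for(letter):
    """Return the shapes an Urdu letter can take, based on its class."""
    if letter in NON_JOINERS:
        # Non-joiners have only two shapes.
        return ["final", "isolated"]
    # Every other Urdu letter is treated as a joiner with four shapes.
    return ["initial", "medial", "final", "isolated"]


print(shapes_for("\u062F"))  # daal, a non-joiner
print(shapes_for("\u06CC"))  # yeh, a joiner
```

Because a non-joiner never connects to the following letter, the letter after it always starts a new connected piece, whether or not a space was typed. This is one reason why visible gaps in Urdu text do not reliably mark word boundaries.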


We may use different algorithms in this research, such as statistical methods, maximum matching, and
longest matching, to resolve segmentation ambiguities.
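
As an illustration of one of the candidate techniques, the following is a minimal sketch of forward maximum matching, assuming a word list (lexicon) is available. The function name `max_match` and the toy lexicon are hypothetical, and a Latin-script string stands in for unspaced text to keep the example easy to read; an Urdu lexicon would be used in the same way.

```python
def max_match(text, lexicon):
    """Greedy forward maximum matching over text that has no spaces.

    At each position, take the longest prefix found in the lexicon.
    If no lexicon word matches, emit a single character and move on.
    """
    max_len = max((len(word) for word in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                break
        else:
            # No match at this position: fall back to one character.
            j = i + 1
        tokens.append(text[i:j])
        i = j
    return tokens


# Toy lexicon (an assumption for illustration only).
lexicon = {"the", "them", "theme", "me", "in"}
print(max_match("themein", lexicon))  # ['theme', 'in']
```

The greedy choice is simple and fast, but it can pick a long word that blocks a better overall segmentation. That is the kind of ambiguity a statistical method would be expected to help resolve.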
