
IDL - International Digital Library Of Technology & Research

Volume 1, Issue 4, April 2017 Available at: www.dbpublications.org

International e-Journal For Technology And Research-2017

Spell checker for Kannada OCR


Sharathkumar S
Assistant Professor
Department of Information Science and Engineering
Siddaganga Institute of Technology, Tumakuru
skumars@sit.ac.in

Suma S, Sneha N
UG scholars
Department of Information Science and Engineering
Siddaganga Institute of Technology, Tumakuru
suma.sinchu.ds@gmail.com
snehan769@gmail.com

Abstract: A spell checker is an application program that processes natural language in machine-readable form. Spell checking and correction is a basic necessity in any language and is tedious to do by hand, so spell checker software is required. A spell checker is a set of programs that analyzes a wrongly used word and replaces it with the most probable correct word. The challenge addressed here is doing this for the Kannada language. In software systems, Kannada words are typed in several formats, since Kannada has many fonts. In this paper, we describe some techniques used by a spell checker for the Kannada language. We use natural language processing (NLP), a field of computer science concerned with the relationship between human (i.e., natural) languages and computers. Modern NLP algorithms are usually based on machine learning.

Keywords: Spell checker, NLP, OCR, Dictionary Lookup

I. INTRODUCTION

Kannada is a Dravidian language spoken predominantly by the people of Karnataka and neighboring states. It has roughly forty million native speakers and a total of 50.8 million speakers according to the 2001 census. Spell checking is a critical problem in NLP. A spell checker is an important tool that is tightly coupled with the components of various software such as OCR systems, word processors and even translators.

1.1 Error Analyzer

A linguistic error analyzer is a tool that studies the types and causes of language errors. Errors may be classified as conceptualization errors (i.e., thinking), phoneme-to-grapheme mapping errors (i.e., writing), typing errors, OCR-generated errors, and errors generated by a speech recognizer.

Conceptualization errors
Errors that occur due to one's way of thinking.
Ex: may be wrongly written as .

Phoneme-to-grapheme mapping errors
Errors that occur while writing dictated words.
Ex: may be wrongly written as .

Typing errors
Errors that occur while typing, by pressing a wrong key.
Ex: may be wrongly typed as .

OCR-generated errors
Errors that occur when the OCR recognizes a character incorrectly.
Ex: may be wrongly recognized as .

Errors generated by a speech recognizer
Errors that occur due to wrong pronunciation of words or wrong recognition of words by the speech recognizer.
Ex: may be wrongly recognized as .

1.2 Optical Character Recognition (OCR)

Optical character recognition is a technique for moving text from paper form to electronic form. An OCR system is required to convert an image, written text or e-text into a machine-readable format; its input can be a plain document, an image, etc. Sources for OCR include bank statements, ATM transactions, e-statements and mailing documents.

For tasks such as speech-to-text and image-to-text conversion (and vice versa), the text is analyzed in digitized form so that it can be easily edited, stored, and accessed through an open-access system. OCR is a field of research spanning NLP, machine learning, artificial intelligence and computer vision.

Today there is a need for a flexible, accurate OCR system that can recognize any type of font across a variety of digital image inputs and produce accurate output for the inputs supplied.

IDL - International Digital Library 1 | Page Copyright@IDL-2017



OCR Errors

Due to noise, the following errors may occur:

Reject error
The machine reading process may not be able to recognize a character.

Substitution error
OCR may recognize a character incorrectly.

Character fusion
Two or more character images merge and appear as a single connected component.

Character fragmentation
A character image is fragmented into more than one sub-image.

1.3 Spell Checker

A spell checker is an application program required by machines to process natural languages effectively. Spell checkers can be used as independent tools or as part of larger applications such as search engines and translators. A simple spell checker performs the following tasks:
- Scanning the text and extracting the words it contains.
- Matching the typed words, including special symbols, hyphens, etc., against the correctly written words. This is the important step.
- Handling morphology, which requires a language-dependent algorithm. Even English requires a spell checker to handle related word forms such as plurals and verbal forms, so carrying out these steps for other languages is a complicated issue.

Related Work

The literature survey reveals that most research on Kannada spell checkers focuses on normal text, while some efforts have been made for OCR text in other languages such as Punjabi and Hindi. However, no work on OCR spell checkers has been reported for Kannada directly. Some works on OCR spell checkers in other languages, and on Kannada spell checkers, are reported here.

This review discusses common spell checking approaches and the problems that may occur during the spell checking process. There are two common approaches for implementing a spell checker: the dictionary lookup method and the N-gram approach.

In a system based on a morphological analyzer, the bulk text is first divided into a series of separate words. The morphological analyzer then uses separate dictionaries to access the root word and the suffix that follows it. A relationship must be established between the different varieties of root words and their suffixes, which requires a mapping function. The validity of a word is checked using the morphological analyzer. The type of error must be identified, i.e., whether the incorrect word has a correct root and incorrect suffix, an incorrect root and correct suffix, or a correct root and matching suffix. These error types are handled individually, and suitable corrections are provided for the incorrect words. Misinterpreted words are corrected with the help of the user by offering suitable suggestions. The drawback of this system is that it fails when applied to OCR output text: it cannot efficiently handle special cases in OCR such as character fusion and character fragmentation.

Dictionary lookup method
The dictionary lookup method compares the words in the input file with the correct words in a dictionary. The method is useful with OCR output for inspecting ambiguous letters, but at large scale it incurs a size overhead, the probability calculation becomes complex, and the cost of searching increases.

N-gram approach
An N-gram is a contiguous sequence of items from a text, where the items may be phonemes, graphemes, letters, words or even pairs of words. Unigrams, bigrams and trigrams are its varieties. In general, an N-gram model is a probabilistic language model, based on a Markov model of order n-1, that predicts the next item from the preceding n-1 items.

Design
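The dictionary lookup and N-gram approaches named above can be illustrated with a small sketch. It is not taken from any of the surveyed systems: the function names, the toy word list (Romanized words that appear elsewhere in this paper) and the choice of character-level bigrams are assumptions made purely for illustration.

```python
from collections import Counter

# Hypothetical toy dictionary of Romanized Kannada words (illustration only).
TOY_WORDS = ["avanu", "hodanu", "snEha"]


def is_valid(word, dictionary):
    """Dictionary lookup: a word is accepted only if it is in the dictionary."""
    return word in dictionary


def char_ngrams(word, n):
    """Return all contiguous character sequences of length n in `word`."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def next_char_probability(words, prev, nxt):
    """Estimate P(nxt | prev) from character bigram counts.

    This is a first-order Markov (bigram) estimate: the count of the
    bigram prev+nxt divided by the count of all bigrams starting with prev.
    """
    bigrams = Counter()
    for word in words:
        bigrams.update(char_ngrams(word, 2))
    starting_with_prev = sum(c for gram, c in bigrams.items() if gram[0] == prev)
    if starting_with_prev == 0:
        return 0.0
    return bigrams[prev + nxt] / starting_with_prev


print(is_valid("avanu", set(TOY_WORDS)))   # lookup of a correctly spelled word
print(char_ngrams("avanu", 2))             # character bigrams of one word
print(next_char_probability(TOY_WORDS, "a", "n"))
```

Storing the dictionary in a set keeps each lookup cheap, whereas the bigram counts grow with the corpus; this is one way to see the size and search-cost concerns mentioned for large dictionaries.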


Romanization

Romanization is the process of converting written text from a specific writing system to Roman script. Romanization includes the following methods:
- Transliteration, for representing written text
- Transcription, for representing the spoken word
- A combination of both transliteration and transcription
Ex: 1. in Romanized format is written as avanu.
2. in Romanized format is written as snEha.
In this tool we read an input file line by line, and each line is Romanized to English.
Ex: is Romanized to English as rAmanu kAdig.e hodanu.

Tokenization

Tokenization is the process of splitting text into a set of tokens, i.e., meaningful elements such as words, phrases and symbols.
Ex: Input: This is a spell checker for Kannada OCR.
Output: This, is, a, spell, checker, for, Kannada, OCR.
In our project we give Romanized text as the input to this module and get a list of tokenized words for comparison.
Ex: Input: rAmanu kAdig.e hodanu
Output: [rAmanu, kAdig.e, hodanu]

Comparison

Each word is compared with a standard dictionary, and the validity of the word is checked using the minimum edit distance algorithm.

Minimum Edit Distance

This is the Levenshtein distance, which measures how dissimilar two strings or words are. It is the minimum number of substitution, insertion and deletion operations needed to convert one string into the other. For spelling correction, each input word is automatically compared with the words of the standard dictionary, and the suitable correction is chosen by selecting the dictionary word with the lowest distance to the input word.
Ex: When the two strings INTENTION and EXECUTION are considered, the minimum edit distance between them is 5 (counting each operation as cost 1), i.e., a minimum of 5 operations is required to change INTENTION into EXECUTION.
The words in the dictionary whose edit distance is less than or equal to 3 are suggested for a given misspelled word.

Results
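The tokenization and comparison steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation and not an experimental result: the function names, whitespace tokenization, and the three-word toy dictionary are hypothetical, and every edit operation is given unit cost.

```python
# Hypothetical toy dictionary of Romanized words (illustration only).
TOY_DICTIONARY = ["rAmanu", "kAdig.e", "hodanu"]


def tokenize(line):
    """Split a Romanized line into words on whitespace (an assumed rule)."""
    return line.split()


def edit_distance(source, target):
    """Levenshtein distance with unit cost per insertion, deletion, substitution."""
    # previous[j] is the distance between the processed prefix of `source`
    # and the first j characters of `target`.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # i deletions turn the prefix into the empty string
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (0 if s_char == t_char else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def suggest(word, dictionary, max_distance=3):
    """Return dictionary words within max_distance edits, closest first."""
    scored = sorted((edit_distance(word, entry), entry) for entry in dictionary)
    return [entry for distance, entry in scored if distance <= max_distance]


def check_line(line, dictionary):
    """Map each word that is not in the dictionary to its suggestions."""
    return {
        word: suggest(word, dictionary)
        for word in tokenize(line)
        if word not in dictionary
    }


print(edit_distance("INTENTION", "EXECUTION"))  # 5, as in the example above
print(check_line("rAmnu kAdig.e hodanu", TOY_DICTIONARY))
```

Computing the distance against every dictionary entry is the simplest choice and is adequate for a small word list; a larger dictionary would likely need candidate filtering before the distance is computed.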


Conclusion

In this project we have implemented a spell checker for Kannada OCR. Through this project we learnt various tools for implementing a spell checker, came to understand the problems that occur during text processing, and learnt how to tackle them. Although there were many problems during Kannada text processing, we arrived at a workable way to implement a Kannada spell checker for Kannada OCR. As this project is a first attempt at implementing a spell checker for Kannada OCR, we hope it serves as a platform for beginners to understand the various aspects of spell checking for Kannada OCR.

Future Work

Several directions for future work are possible; some improve performance, while others build on top of the work done here. We believe the following can be pursued:
- The methods can be improved to achieve better efficiency.
- A larger dictionary containing many more words can be used.
- Methods that separate the root word from the affix can be used to improve performance.
- The work can be extended to handle semantic errors as well.
- The work can be extended by applying a multi-threaded approach in the spell checker tool.

