
International Journal of Information Technology (IJIT) Volume 3 Issue 2, Mar - Apr 2017

RESEARCH ARTICLE
OPEN ACCESS

An Analysis of Urdu Optical Character Recognition Processing


Dr. V.Ajantha Devi [1], J. Ashifa [2]
Assistant Professor [1], Research Scholar [2]
Department of Computer Science
Sri Adi Chunchanagiri Womens College, Cumbum
Tamil Nadu, India.

ABSTRACT
This paper presents a technique for recognizing Urdu words in the Nastaliq font using ligatures as the units of
recognition. In Nastaliq, overlapping words and characters make optical recognition more complex than for the
Latin script, which is relatively easy to recognize. Based on research on Nastaliq OCR, the paper discusses a
proposed finite state model for the optical recognition of printed Nastaliq text.
Keywords: Pattern Recognition, Optical Character Recognition, Urdu Text, Ligatures.

I. INTRODUCTION

Urdu is spoken by 490 million people around the world and is the
fourth most widely spoken and understood language. It is the
official language of Pakistan and of five Indian states. Urdu
developed under the strong influence of Arabic, Persian and
Turkish almost 900 years ago. Because of the significance of the
Urdu script, a number of researchers have focused on Optical
Character Recognition systems that can convert older Urdu
literature to digital format.

The most difficult case in character segmentation is cursive
script. The cursive nature of written Arabic poses considerable
challenges for automatic character segmentation and recognition.
Methods for recognizing such scripts need pre-processing and
normalization of the scanned word images before features can be
extracted. During pre-processing, as many individual
characteristics as possible are removed so that feature extraction
and recognition are as simple as possible.

Alphabet
The Urdu character set comprises 38 basic shapes. This count does
not include Aerab (diacritics used for pronunciation and vowel
sounds). The alphabet is shown in Figure 1.

Fig 1: Urdu Alphabet

Classification of Alphabet
In the Urdu alphabet, several characters share the same basic
shape and differ only in the diacritics (dots) attached to them.
Similar characters can therefore be grouped to reduce diversity.
All 38 characters of the Urdu alphabet can be classified into 21
classes. This classification is shown in Figure 2.
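The grouping idea described under Classification of Alphabet can be sketched as a lookup table. This is a minimal illustration, not the paper's 21-class table: the group names (`bey`, `jeem`, and so on) and the helper names are hypothetical, and only a subset of well-known letter families that differ by dots or similar marks is listed.

```python
# Illustrative subset only. The group names and the helper functions are
# assumptions for this sketch; the paper's full 21-class table is in Figure 2.
BASE_SHAPE_GROUPS = {
    "bey": ["ب", "پ", "ت", "ٹ", "ث"],
    "jeem": ["ج", "چ", "ح", "خ"],
    "dal": ["د", "ڈ", "ذ"],
    "rey": ["ر", "ڑ", "ز", "ژ"],
    "seen": ["س", "ش"],
    "suad": ["ص", "ض"],
    "toey": ["ط", "ظ"],
    "ain": ["ع", "غ"],
}

# Invert the table so that each character maps to its base-shape class.
CHAR_TO_CLASS = {
    ch: name for name, chars in BASE_SHAPE_GROUPS.items() for ch in chars
}


def base_class(ch):
    """Return the base-shape class of ch, or ch itself if it is not grouped here."""
    return CHAR_TO_CLASS.get(ch, ch)


def count_classes(chars):
    """Count how many distinct base-shape classes a sequence of characters uses."""
    return len({base_class(ch) for ch in chars})
```

With such a table, a recognizer could first identify the base shape and then use the attached marks to pick the exact character within the class.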

ISSN: 2454-5414 www.ijitjournal.org Page 18


Fig 2: Classification of Urdu Alphabet

II. GENERIC OCR MODULES

OCR systems proposed by different researchers comprise various
modules. Almost every one includes a pre-processing unit that
cleans the image and prepares it for the recognition process.

Figure 3: Different phases of OCR

OCR consists of several phases: Pre-processing, Segmentation,
Feature Extraction, Classification and Recognition. The output of
each phase is the input of the next. Pre-processing removes noise
and variation from the text document. In this paper, the proposed
OCR algorithm combines the word segmentation, character
segmentation and recognition steps in one coherent template
process step. The proposed Urdu OCR algorithm follows three main
steps.

Fig 4: Proposed Urdu OCR Algorithm

Step 1: Image acquisition

The proposed OCR algorithm starts with an image acquisition
process (see Figure 5) that scans the Urdu text using a 300 dpi
scanner; the scanned image is saved as a .bmp image file.

Fig 5: Original Image

Step 2: Image pre-processing

The image is filtered with a median filter to remove noise,
converted from RGB to grayscale, and then converted to a binary
image. Noise that is not removed may lead to incorrect results in
the recognition process. The area outside the text boundaries is
removed by clipping and, finally, the binary image is resized to a
common predefined size, as shown in Figure 6.

Fig 6: Binary Clipped Image
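The pre-processing chain in Step 2 (median filter, RGB to grayscale, binarization, clipping, resizing) can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the 3x3 window, the fixed threshold of 128, the BT.601 gray weights, nearest-neighbour resizing and all function names are choices made here, since the paper does not specify them. The image is assumed to be a list of rows of `(r, g, b)` tuples.

```python
from statistics import median


def median3x3(plane):
    """3x3 median filter on a 2-D list; border pixels reuse the nearest valid pixel."""
    h, w = len(plane), len(plane[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                plane[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            ]
            out[y][x] = median(window)
    return out


def to_gray(r, g, b):
    """Combine three channel planes into one gray plane (BT.601 weights, an assumption)."""
    return [
        [0.299 * rv + 0.587 * gv + 0.114 * bv for rv, gv, bv in zip(rr, gr, br)]
        for rr, gr, br in zip(r, g, b)
    ]


def binarize(gray, threshold=128):
    """Map dark pixels to 1 (ink) and light pixels to 0 (background)."""
    return [[1 if v < threshold else 0 for v in row] for row in gray]


def clip(binary):
    """Crop to the bounding box of ink pixels; a blank image is returned unchanged."""
    rows = [y for y, row in enumerate(binary) if any(row)]
    if not rows:
        return binary
    cols = [x for x in range(len(binary[0])) if any(row[x] for row in binary)]
    return [row[cols[0]:cols[-1] + 1] for row in binary[rows[0]:rows[-1] + 1]]


def resize(binary, out_h, out_w):
    """Nearest-neighbour resize to out_h x out_w."""
    h, w = len(binary), len(binary[0])
    return [
        [binary[y * h // out_h][x * w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


def preprocess(rgb, size=(32, 32), threshold=128):
    """Run the Step 2 chain: filter each channel, gray, binarize, clip, resize."""
    r, g, b = ([[px[c] for px in row] for row in rgb] for c in range(3))
    gray = to_gray(median3x3(r), median3x3(g), median3x3(b))
    return resize(clip(binarize(gray, threshold)), *size)
```

A fixed threshold only suits clean, evenly lit scans; an adaptive threshold would be a natural substitute, but that goes beyond what the paper describes.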



Step 3: Image recognition
This step corresponds to the line segmentation, word
segmentation, character segmentation and recognition steps
mentioned above. First, the binary image is segmented into lines
of text; Figure 7 shows the two lines of text segmented from the
binary image in Figure 6.
Figure 7: Line segmentation (first line and second line)
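The paper does not state how the lines in Step 3 are separated. One common approach, shown here as an assumption rather than as the authors' method, is a horizontal projection profile: count the ink pixels in each row and treat every run of non-empty rows as one text line. The function name `segment_lines` is hypothetical, and the input is assumed to be a binary image with 1 for ink.

```python
def segment_lines(binary):
    """Return (start_row, end_row) pairs, end exclusive, one per run of rows with ink."""
    profile = [sum(row) for row in binary]  # ink pixels per row
    lines, start = [], None
    for y, ink in enumerate(profile):
        if ink and start is None:
            start = y                    # a text line begins
        elif not ink and start is not None:
            lines.append((start, y))     # a text line ends at an empty row
            start = None
    if start is not None:                # the last line touches the bottom edge
        lines.append((start, len(profile)))
    return lines
```

This simple version assumes at least one fully empty row between lines. In Nastaliq, where diacritics and descending strokes can reach into neighbouring lines, a real system would likely need a tolerance on the row count instead of a strict zero test.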
Feature Extraction
In this phase, the features of each individual character are
extracted. The performance of a character recognition system
depends on the features that are extracted. The features
extracted from an input character should allow the character to
be classified unambiguously. We used diagonal features,
intersection and open end point features, transition features,
zoning features, directional features, parabola curve fitting
based features, and power curve fitting based features to build
the feature set for a given character.
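Of the feature types listed, zoning is the simplest to illustrate. The sketch below is one plausible reading of zoning features, not the authors' exact definition: the character image is split into a grid and the fraction of ink pixels in each zone becomes one feature. The function name, the default 2x2 grid and the density measure are assumptions.

```python
def zoning_features(binary, zones_y=2, zones_x=2):
    """Return the ink density of each zone, row by row, as a flat list of floats."""
    h, w = len(binary), len(binary[0])
    feats = []
    for zy in range(zones_y):
        y0, y1 = zy * h // zones_y, (zy + 1) * h // zones_y
        for zx in range(zones_x):
            x0, x1 = zx * w // zones_x, (zx + 1) * w // zones_x
            area = (y1 - y0) * (x1 - x0)
            ink = sum(binary[y][x] for y in range(y0, y1) for x in range(x0, x1))
            feats.append(ink / area if area else 0.0)
    return feats
```

Because every character image is resized to a common size during pre-processing, the resulting vectors have the same length and can be compared directly by a classifier.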
III. CONCLUSION
This paper describes an OCR system for online and offline
character recognition. The study is the next step toward
segmentation-based OCR for Noori Nastaleeq. The work continues
previously proposed techniques, with refinements in different
modules. There remains a need to develop a character recognition
system that achieves the highest possible accuracy. The
challenges are font dependent; hence, a detailed font study can
help in finding good solutions for these challenges.
