
JOURNAL OF COMPUTING, VOLUME 2, ISSUE 9, SEPTEMBER 2010, ISSN 2151-9617

HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING/
WWW.JOURNALOFCOMPUTING.ORG 38

Optical Character Recognition for Tamil using Tchebichef Moments
C.V.Subbulakshmi, N.Malathi, and S.Hemavathi

Abstract: This paper deals with Optical Character Recognition (OCR) for Tamil, one of the Indian languages, using Tchebichef moments. It aims at recognizing printed Tamil characters of fixed font and size with high accuracy. The document is scanned as a bmp image and, after segmentation, the Tchebichef moments of each character (32x32) are calculated. The calculated moments of each character are checked for a match using the least-square-difference method, and the corresponding Unicode value is obtained as output, which can be used to transform the printed text into editable text. The Unicode outputs are then converted into text using the Python programming language.

Index Terms: Optical Character Recognition, Python language, Tchebichef Moments, Unicode.


• C.V.Subbulakshmi is working as professor and head of the Electrical and Electronics Engineering department, Faculty of Engineering, Avinashilingam Deemed University for Women, Coimbatore.
• N.Malathi is working as a lecturer in the Electrical and Electronics Engineering department, Faculty of Engineering, Avinashilingam Deemed University for Women, Coimbatore.
• S.Hemavathi is working as a lecturer in the Electrical and Electronics Engineering department, Faculty of Engineering, Avinashilingam Deemed University for Women, Coimbatore.

1 INTRODUCTION

Optical Character Recognition (OCR) is the electronic translation of scanned images of handwritten, typewritten or printed text into machine-editable text. It is used to convert the printed text on paper, books and documents into electronic files.
Tamil is the official language of the Indian state of Tamil Nadu and of the union territories of Pondicherry and the Andaman & Nicobar Islands, and also of countries like Sri Lanka and Singapore. It is one of the 22 languages of India. In Malaysia, about 543 Tamil-medium government schools provide primary education. Tamil has the longest unbroken literary tradition, and there are many written works in Tamil. One example, Thirukkural, is the well-known literary composition which proclaims the basic principles for the moral and material life of the people. It has been translated into over 60 languages. To digitize the precious Tamil documents or to serve them on the web, OCR is needed. However, less attention has been given to Indian languages. The main reason for the slow development could be attributed to the complexity of the shapes of Indian scripts, and also to the large set of different patterns that exist in these languages, as opposed to English.

1.1 Tamil OCR

A few recent works have been done on Tamil OCR. Siromoney has described a method for recognition of machine-printed Tamil characters using an encoded character string dictionary. Shivsubramani [1] presents an efficient method for recognizing printed Tamil characters by exploring the interclass relationship between them, which is accomplished using multiclass hierarchical support vector machines (SVM). Chinnuswamy [2] proposed an approach for handwritten Tamil character recognition. The procedure consists of converting the input image into a labeled graph representing the input character and computing correlation coefficients with the labeled graphs stored for a set of basic symbols. This algorithm uses a topological matching procedure to compute the correlation coefficients and then maximizes the correlation coefficient.

1.2 Tchebichef Moments in Image Analysis

Tchebichef moments are new discrete orthogonal moments, and the simplest of those used in image analysis. Tchebichef moment invariants perform significantly better than Hu moment invariants and Zernike moment invariants [3]. Tchebichef moments are more efficient than, and superior to, other image moments.

2 TAMIL SCRIPT

Tamil is a South Indian language. It is the first legally recognized Classical language of India. The language has 31 basic letters (12 vowels, 18 consonants and a special consonant) and the written script comprises 247 characters.

Vowels: அ ஆ இ ஈ உ ஊ எ ஏ ஐ ஒ ஓ ஔ

Consonants: க ங ச ஞ ட ண த ந ப ம ய ர ல வ ழ ள ற ன

Special Consonant: ஃ

3 SCANNING & BINARIZATION

The printed documents are scanned using scanners to produce a bmp image of the document. Bmp files can be considerably compressed with lossless data compression. The scanned 8-bit monochrome bmp image has pixel values ranging from 0 to FF (hexadecimal). Binarizing the pixel values to either 0 or FF makes processing easier, and can be achieved through a thresholding technique. The individual pixel values are compared with a threshold value which is entered manually. If the pixel value is greater than the threshold value then it is made white (FF), else
black (0). The threshold value is set to 192 for the sample input by trial and error. This value is compared with every pixel value, and the pixel is then set to either FF or 0. The program is coded in the C language.

Figure 1 Scanned Document

Figure 2 After Binarization

4 SEGMENTATION
Segmentation is the process of separating the individual lines and characters from the scanned image file. The bmp image file is scanned from the bottom left for black pixels. When a black pixel is encountered it signifies the start of a printed line, and when an all-white row is encountered it signifies the end of that printed line, which gives the top and bottom boundaries of the line. The same process is repeated until the end of the document is reached.

Figure 3 Segmented Line

One character is stripped at a time. The individual characters are separated by ANDing the last row of pixels with all the other rows, which results in a single row of pixel values. The characters are represented as black pixels and the spaces between them remain white. The white space prior to the first character of any line is stripped off, so the first value of the resultant single row of pixels is always black. The single row of pixels obtained after ANDing has continuous black values where there is a character in the line that has been stripped. The number of continuous black pixels at the start of the row gives the width of the first character in the line. The white pixels before the next character are stripped, and the above process is repeated until there are no more characters in the line. The individual characters stripped are stored as separate bmp files.

Figure 4 Segmented Character

5 TCHEBICHEF MOMENT
Moments are statistical measures used to obtain relevant information about a certain object under study (e.g., signals, images or waveforms), i.e., to describe the shape of an object to be recognized by a pattern recognition system. Image moments can be classified into invariant moments, variant moments, orthogonal moments and non-orthogonal moments.
Invariant moments are a special kind of moments designed to remain constant even after some transformations, such as object rotation, scaling, translation, or image illumination changes, in order to improve the reliability of a system. Invariants are sensitive to any image change or perturbation for which they are not invariant, so any unexpected perturbation will affect the measurement. On the contrary, a variant moment is designed to be sensitive to a specific perturbation, i.e., to measure a transformation rather than be invariant to it; if the specific perturbation occurs it will be measured, hence any unexpected disturbance will not affect the objective of the measurement, thereby confronting uncertainty.
Orthogonal moments are moments from which the image can be reconstructed by applying the inverse function, while non-orthogonal moments are those from which the original image cannot be obtained by applying the reverse function. Orthogonal moments are divided into two categories, namely
i) Discrete orthogonal polynomial moments
ii) Continuous orthogonal polynomial moments
Tchebichef moments are a new set of discrete orthogonal moments widely used in 2D image analysis. They have a variety of applications in visual pattern recognition, object classification, template matching, robot vision and data compression. Tchebichef moments are the simplest of the discrete orthogonal moments, with low noise sensitivity and low computational complexity.
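To make the discrete formulation concrete, the following Python sketch computes scaled Tchebichef polynomials, their squared norms, and the moments of a small square image. It is an illustration only, not the authors' C program: the function names are hypothetical, and it assumes the commonly used scaled definition (t_0 = 1, t_1(x) = (2x + 1 - N)/N, followed by a three-term recurrence), which reproduces the numerical values quoted in the sample calculation of Section 5.1 for N = 5.

```python
def squared_norm(n, N):
    """Squared norm rho(n, N) = N * prod_{k=1..n}(1 - k^2/N^2) / (2n + 1)."""
    value = float(N)
    for k in range(1, n + 1):
        value *= 1.0 - (k * k) / (N * N)
    return value / (2 * n + 1)


def tchebichef_polynomials(N):
    """Return t[n][x] for n, x = 0..N-1 using the three-term recurrence."""
    t = [[0.0] * N for _ in range(N)]
    for x in range(N):
        t[0][x] = 1.0
        if N > 1:
            t[1][x] = (2 * x + 1 - N) / N
    for n in range(2, N):
        # Coefficient of t[n-2] in the recurrence (assumed scaled form).
        c = (n - 1) * (1.0 - ((n - 1) ** 2) / (N * N))
        for x in range(N):
            t[n][x] = ((2 * n - 1) * t[1][x] * t[n - 1][x] - c * t[n - 2][x]) / n
    return t


def tchebichef_moments(image, order):
    """Return T[p][q] for p, q < order; image is an N x N list indexed image[x][y]."""
    N = len(image)
    t = tchebichef_polynomials(N)
    rho = [squared_norm(n, N) for n in range(N)]
    T = [[0.0] * order for _ in range(order)]
    for p in range(order):
        for q in range(order):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += t[p][x] * t[q][y] * image[x][y]
            T[p][q] = total / (rho[p] * rho[q])
    return T
```

For N = 5, `squared_norm(1, 5)` evaluates to approximately 1.6 and `tchebichef_polynomials(5)[2]` to approximately [0.48, -0.24, -0.48, -0.24, 0.48], in agreement with the sample calculation. Calling `tchebichef_moments(char_image, 5)` on a 32x32 character image would give the 25 low-order moments that the paper uses for matching.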
Coordinate transformation and suitable approximation of continuous moment integrals are required during the computation of Zernike and Legendre moments, since Zernike polynomials are defined only inside the unit circle and Legendre polynomials only on the interval [-1, 1]. Tchebichef moments, in contrast, are defined directly in the image coordinate space and preserve the property of orthogonality in a moment set. There is no numerical approximation and the moments are orthogonal; this property makes them superior to other moments. They also have very high reconstruction accuracy.
Many earlier works have used Tchebichef moments, viz., a vehicle-logo recognition method [3], traffic sign classification, image super resolution [5], image moment problems [6], analysis of noise sensitivity and reconstruction accuracy of Tchebichef moments [7], and a watermarking scheme [8]. This work applies Tchebichef moments to character recognition.

Squared Norm:

$$\rho(n, N) = \frac{N \left(1 - \frac{1}{N^2}\right)\left(1 - \frac{2^2}{N^2}\right) \cdots \left(1 - \frac{n^2}{N^2}\right)}{2n + 1} \qquad (1)$$

Where,
ρ = squared norm
N = size of the image in pixels
n = 0 to N-1

Tchebichef Polynomials:

$$t_0(x) = 1 \qquad (2)$$

$$t_1(x) = \frac{2x + 1 - N}{N} \qquad (3)$$

$$t_n(x) = \frac{(2n - 1)\, t_1(x)\, t_{n-1}(x) - (n - 1)\left(1 - \frac{(n-1)^2}{N^2}\right) t_{n-2}(x)}{n} \qquad (4)$$

Where,
n = 2 to N-1
x = 0 to N-1
tn(x) = discrete Tchebichef polynomial of degree n

Moments:

$$T_{pq} = \frac{1}{\rho(p, N)\, \rho(q, N)} \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} t_p(x)\, t_q(y)\, f(x, y) \qquad (5)$$

Where,
x, y = 0 to N-1, specify the X-Y coordinates
p, q = 0 to N-1, order of the moments
f(x, y) is the pixel value of the image at (x, y)
Tpq is the Tchebichef moment
The Tchebichef moments for each character are calculated and are used to match the characters.

5.1 SAMPLE CALCULATION

The value of N is 5 (the size of the image) for Fig. 5, which is taken as an example.

Figure 5 Picture considered for sample calculation

Squared norm:
From equation (1),
ρ(0, 5) = 5 / (2(0) + 1) = 5, i.e. ρ(0) = 5
The squared norm values calculated using the above equation for n = 1 to 4 are as follows:
ρ(1) = 1.6
ρ(2) = 0.8064
ρ(3) = 0.36864
ρ(4) = 0.1032192

Tchebichef polynomial:
From equations (2), (3) and (4) the Tchebichef polynomial values are calculated and the results obtained are as below:
Tp[0][0] = 1
Tp[1][0] = -0.8     Tp[1][1] = -0.4     Tp[1][2] = 0        Tp[1][3] = 0.4      Tp[1][4] = 0.8
Tp[2][0] = 0.48     Tp[2][1] = -0.24    Tp[2][2] = -0.48    Tp[2][3] = -0.24    Tp[2][4] = 0.48
Tp[3][0] = -0.192   Tp[3][1] = 0.384    Tp[3][2] = 0        Tp[3][3] = -0.384   Tp[3][4] = 0.192
Tp[4][0] = 0.0384   Tp[4][1] = -0.1536  Tp[4][2] = 0.2304   Tp[4][3] = -0.1536  Tp[4][4] = 0.0384

Tchebichef moments:
From equation (5),
T[0][0] = 163.2
Since the image size is 5x5, the number of moments obtained is 25 (0 to 24). The sample calculation for the first moment is shown above. Similarly, the values of moments 1 to 24 are calculated and are shown below:
T[0][0] = 163.2        T[0][1] = 0   T[0][2] = -30.357142    T[0][3] = 0   T[0][4] = 113.839285
T[1][0] = 0            T[1][1] = 0   T[1][2] = 0             T[1][3] = 0   T[1][4] = 0
T[2][0] = -30.357142   T[2][1] = 0   T[2][2] = -542.091836   T[2][3] = 0   T[2][4] = -338.807397
T[3][0] = 0            T[3][1] = 0   T[3][2] = 0             T[3][3] = 0   T[3][4] = 0
T[4][0] = 113.839285   T[4][1] = 0   T[4][2] = -338.807397   T[4][3] = 0
T[4][4] = -3670.413477

6 MOMENTS MATCHING
There are 247 letters in the Tamil script, which are made up of basic symbols. The Tamil language contains combination letters which have two or three basic symbols. For the purpose of creating a database, 127 basic symbols are sufficient. Tchebichef moments for these 127 symbols are calculated and stored in a file with a unique index number for each symbol. The index numbers range from 0 to 126. The font size and style are to remain constant and are taken as 12 and Arial Unicode MS. The width of the images is 32 in this case. A 32x32 size results in 1024 moments. Calculating and using all 1024 moments for matching would be time consuming for printed characters. By trial and error it is found that 25 moment values are enough for successful pattern recognition. Each index value contains the 25 moment values of the corresponding symbol.

Table 1. Index Values of Characters

INDEX   CHARACTER
0
1
2
3
4
5
.       .
.       .
126

The matching is done by finding the least-square difference between the moments stored in the database and the calculated moments. The index value of the set of moments in the database for which the least-square difference is minimum is returned. The objective consists of adjusting the parameters of a model function to best fit a data set. A simple data set consists of n points (data pairs) (xi, yi), i = 1, ..., n, where xi is an independent variable and yi is a dependent variable whose value is found by observation.
The model function has the form f(x, β), where the m adjustable parameters are held in the vector β. The parameter values for which the model "best" fits the data need to be found. The least squares method finds its optimum when the sum, S, of squared residuals,

$$S = \sum_{i=1}^{n} r_i^2 \qquad (6)$$

is a minimum. A residual is defined as the difference between the value of the dependent variable and the model value

$$r_i = y_i - f(x_i, \beta) \qquad (7)$$

7 UNICODE
The Unicode Standard (http://www.unicode.org) is the universal character encoding scheme for written characters and text. It defines a uniform way of encoding multilingual text that enables the exchange of text data internationally and creates the foundation of global software. The Tamil Unicode range is U+0B80 to U+0BFF. Each character in this range is represented by a 2-byte value. Only the second byte of the Unicode value is given as the output, because the first byte is common (0B).
Based on the recognized index, the Unicode value of the corresponding character is returned by using the parsing technique. Parsing is nothing but reading a sequence of index values for patterns. The Tamil language contains combination letters which have two or three Unicode values. These Unicode values have to be fed in the right sequence to produce the Tamil characters; if not, the letter will not be represented correctly.

8 PYTHON
Python is an open source programming language which is simple and powerful. It is widely used for internet applications. It has efficient high-level data structures and a simple but effective approach to object-oriented programming. Features of Python: simple, easy to learn, free and open source, high-level language, portable, interpreted, object oriented, extensible, embeddable and extensive libraries.
The Unicode output of the program can be converted into text output by using Python. The Python code displays the corresponding Tamil character for the given Unicode value or combination of Unicode values.

9 RESULT

Figure 6 Sample Input

The program recognizes all the characters in a line at a time. The line number is given to the program and it returns each character's index value in the database and the corresponding Unicode value. The output of the program for the third text line is shown in Fig 7. Since the image is processed from the bottom, the bottom-most line of the page is the first strip.
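The matching and conversion steps described in Sections 6 to 8 can be sketched in Python as follows. This is a minimal illustration rather than the authors' program: the function names are hypothetical, the database is assumed to be a list of moment vectors whose position is the symbol index, and the common first byte 0B is assumed to be prepended to each second byte produced by the recognizer.

```python
TAMIL_HIGH_BYTE = 0x0B  # common first byte of the Tamil block U+0B80..U+0BFF


def squared_difference(a, b):
    """Sum of squared differences between two equal-length moment vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def match(moments, database):
    """Return (index, least difference) of the closest stored moment vector."""
    best_index, best_diff = None, float("inf")
    for index, stored in enumerate(database):
        diff = squared_difference(moments, stored)
        if diff < best_diff:
            best_index, best_diff = index, diff
    return best_index, best_diff


def low_bytes_to_text(low_bytes):
    """Convert a sequence of second bytes (e.g. 0xAA) into Tamil text."""
    return "".join(chr((TAMIL_HIGH_BYTE << 8) | b) for b in low_bytes)
```

For example, `low_bytes_to_text([0x95, 0xC1])` yields the code points U+0B95 and U+0BC1, which render together as the combination letter கு. The second bytes must be passed in the order in which they were produced, since a vowel sign only combines correctly when it follows its consonant.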
Figure 7 Unicode Output for the Second Line

Fig 7 shows the Unicode outputs generated for the sample input. The window display can be explained as follows.

Input the threshold value : 192  → Threshold value used in binarization.
input the strip no : 3           → Line number from the bottom of the sample input.
index 37                         → Index for the leftmost character.
3951.450009                      → Least difference.
aa                               → Unicode output for the character.

Table 2. Recognized Characters as per the Unicodes Generated as Output.

Index   Least Difference   Unicode   Character
37      3951.450009        AA        ப
12      1292.826638        95        க
69      3413.753990        B5        வ
75      3208.593787        A9        ன
84      8882.838978        AE C1     மு
30      1421.994058        A4        த
42      2238.347342        B1 CD     ற்
48      2265.960949        B1 C7     றே
4       1809.094817        89        உ
99      2291.189046        B2        ல
15      1776.052109        95 C1     கு

The Unicode outputs are fed to the Python programming language to obtain the corresponding Tamil characters. Fig 8 shows the first three Tamil characters of the sample input, which are obtained using Python.

Figure 8 Python output of first three characters

10 CONCLUSION
Tchebichef moments, which have been used for various image applications, are proposed for Tamil Optical Character Recognition in this work. For OCR, Tchebichef moments of minimum order 5 are sufficient to recognize the characters of a line with maximum efficiency. Since only 25 moment values are calculated and used for recognition, the execution time and memory space required for the program are reduced.

REFERENCES
[1] Shivsubramani K., Loganathan R., Srinivasan C.J., Ajay V., Soman K.P., "Multiclass Hierarchical SVM for Recognition of Printed Tamil Characters," Centre for Excellence in Computational Engineering, Amrita Vishwa Vidyapeetham,
India. www.citeseerx.ist.psu.edu
[2] Chinnuswamy P., S.G. Krishnamoorthy, "Recognition of Hand Printed Tamil," Pattern Recognition, Elsevier Ltd, Volume 12, Issue 3, pp. 115-217, 1980.
[3] Shijie Dai, He Huang, Zhangying Gao, Kai Li and Shumei Xiao, "Vehicle-logo Recognition Method Based on Tchebichef Moment Invariants and SVM," World Congress on Software Engineering, 2009, IEEE Computer Society, pp. 18-21.
[4] Nur Azman Abu, Wong Siaw Lang and Shahrin Sahib, "Image Super-Resolution via Discrete Tchebichef Moments," 2009 International Conference on Computer Technology and Development (ICCTD), vol. 2, pp. 315-319, 2009.
[5] Judit Martinez, Joseph M. Porta and Federico Thomas, "A Matrix-Based Approach to the Image Moment Problem," Journal of Mathematical Imaging and Vision, vol. 26, issue 1-2, November 2006, pp. 105-113.
[6] S. M. Elshoura and D. B. Megherbi, "Analysis of Noise Sensitivity and Reconstruction Accuracy of Tchebichef Moments," Southeastern, 2008 IEEE, pp. 521-526, April 2008.
[7] Wanli Lv, Yutang Guo, Jixin Ma, Bin Luo, "A Novel Watermarking Scheme Based on Relationship of Tchebichef Moments," IEEE Int. Conference Neural Networks & Signal Processing, Zhenjiang, China, June 8-10, 2008, pp. 146-150.

Ms. C.V.Subbulakshmi is working as professor and head of the Electrical and Electronics Engineering Department and has around 19 years of teaching experience. She has published 4 papers in national/international conferences.

Ms. N.Malathi is working as a lecturer in the Electrical and Electronics Engineering Department and has six and a half years of teaching experience. She has published 4 papers in national/international conferences.

Ms. S.Hemavathi is working as a lecturer in the Electrical and Electronics Engineering Department and has 5 years of teaching experience. She has published 4 papers in national/international conferences.