1 Introduction
Machine learning and human-computer interaction have been among the most challenging
research fields since the advent of digital computers. In Optical Character Recognition
(OCR), the text lines, words and symbols in a document must be segmented properly before
recognition. The correctness of text line segmentation directly affects the accuracy
of word and character segmentation, and consequently the accuracy of word and
character recognition [1]. Several techniques for text line segmentation are reported
in the literature [2-5]. These techniques may be categorized into three groups:
(i) projection profile based techniques, (ii) Hough transform based techniques, and
(iii) thinning based techniques. As a conventional technique for text line
segmentation, global horizontal projection analysis of black pixels has been utilized in
[4, 6]. Piece-wise horizontal projection analysis of black pixels is employed by many
researchers to segment text pages of different languages [2, 8]. In the piece-wise
horizontal projection technique, the text-page image is decomposed into horizontal
stripes. The positions of potential piece-wise separating lines are obtained for each
stripe by applying horizontal projection to that stripe. The potential separating lines
are then connected to obtain complete separating lines for all the text lines in the
text-page image.
D.C. Wyld et al. (Eds.): ACITY 2011, CCIS 198, pp. 211–218, 2011.
© Springer-Verlag Berlin Heidelberg 2011
212 V.J. Dongre and V.H. Mankar
Table 3. Half Form of Consonants with Vertical Bar
Table 4. Examples of Combination of Half-Consonant and Consonant
3 Image Preprocessing
We have collected printed document pages from different official printed letters in
the Marathi language. The document pages are scanned using a flatbed scanner at a
resolution of 300 dpi. The pixels of the scanned image may have values OFF (0) or ON (1)
for binary images, 0–255 for gray-scale images, and three channels of 0–255 colour
values for colour images. A colour image is converted to grayscale by eliminating the
hue and saturation information while retaining the luminance. The image is then
processed further to extract useful information. The processing steps are explained below.
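The colour-to-grayscale conversion described above can be sketched as a weighted sum of the three channels. This is only an illustration: the paper does not state which weights were used, so the BT.601 luma weights (0.299, 0.587, 0.114) and the function name rgb_to_gray are assumptions.

```python
def rgb_to_gray(rgb_image):
    """Convert rows of (R, G, B) tuples (each 0-255) to rows of gray levels (0-255).

    The weights are the common BT.601 luma weights; they are an assumption,
    not a value taken from the paper.
    """
    gray = []
    for row in rgb_image:
        gray.append([int(round(0.299 * r + 0.587 * g + 0.114 * b)) for r, g, b in row])
    return gray


if __name__ == "__main__":
    # One row with a white, a black and a pure red pixel.
    print(rgb_to_gray([[(255, 255, 255), (0, 0, 0), (255, 0, 0)]]))
```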
We have used histogram based properties to binarize the documents in the data set.
The digitized text images are converted into binary images by thresholding using
Otsu’s method [16]. After thresholding, the image contains 0 for object pixels and 1
for background pixels. The image is then inverted so that object pixels are
represented by 1 and background pixels by 0.
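A minimal sketch of this binarization step is given below, assuming the grayscale image is a list of rows of integers in 0–255. Otsu's method selects the threshold that maximises the between-class variance of the gray-level histogram. The function names are illustrative, and thresholding and inversion are merged into one step here, which may differ from the authors' Matlab code.

```python
def otsu_threshold(gray):
    """Return the gray level t that maximises between-class variance (Otsu)."""
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(level * n for level, n in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with level <= t
    sum0 = 0  # sum of gray levels of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance undefined
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize_and_invert(gray):
    """Dark ink (level <= threshold) becomes 1, bright background becomes 0."""
    t = otsu_threshold(gray)
    return [[1 if v <= t else 0 for v in row] for row in gray]


if __name__ == "__main__":
    page = [[250, 10, 250],
            [250, 20, 240]]
    print(binarize_and_invert(page))
```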
Noise introduced by the optical scanning device or the writing instrument causes
disconnected line segments, bumps and gaps in lines, filled loops, etc. Distortion,
including local variations, rounding of corners, dilation and erosion, is also a problem.
Prior to character recognition, it is necessary to eliminate these imperfections
[10-11]. This is carried out using various morphological processing techniques.
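The paper does not say which morphological operations were applied. As one simple illustration of this kind of clean-up, the sketch below removes isolated object pixels (object pixels with no object pixel among their 8 neighbours) from a non-empty binary image. The choice of operation and the name remove_isolated_pixels are assumptions.

```python
def remove_isolated_pixels(binary):
    """Set to 0 every 1-pixel that has no 1-pixel among its 8 neighbours.

    Assumes a non-empty rectangular image given as a list of rows of 0/1.
    """
    rows, cols = len(binary), len(binary[0])
    out = [row[:] for row in binary]  # work on a copy
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] != 1:
                continue
            has_neighbour = any(
                binary[rr][cc] == 1
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            )
            if not has_neighbour:
                out[r][c] = 0
    return out


if __name__ == "__main__":
    noisy = [[1, 0, 0, 0],
             [0, 0, 0, 0],
             [0, 0, 1, 1]]
    print(remove_isolated_pixels(noisy))
```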
Fig. 1. Preprocessed images: (a) original, (b) segmented, (c) Shirorekha removed, (d) thinned,
(e) image edging
by calculating skew angle and making proper correction in the raw image using Hu
moments and various transforms [12-14].
3.4 Thinning
After the image has been preprocessed using the methods discussed in section 3,
various techniques are applied to segment the document into lines, words and
characters. Segmentation proceeds in the following order:
1) Identify the page layout.
2) Identify the text lines in the page.
3) Identify the words in each line.
4) Finally, identify the individual characters in each word.
The global horizontal projection method computes the sum of all white pixels in
every row and constructs the corresponding histogram. The steps for line segmentation
are as follows:
• Construct the horizontal histogram for the image (fig. 2-b).
• Count the white pixels in each row.
• Using the histogram, find the rows containing no white pixels.
• Replace all such rows by 1 (fig. 2-c).
• Invert the image so that empty rows become 0 and text lines retain their original pixels.
• Mark the bounding box for each text line using the standard Matlab functions
regionprops and rectangle.
• Copy the pixels in each bounding box and save them in a separate file (separated
lines are shown in fig. 2-f).
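The line segmentation steps above can be sketched as follows, assuming a binary image stored as a list of rows with 1 for text (white) pixels and 0 for background. This is not the authors' Matlab implementation: instead of marking empty rows and calling regionprops, the sketch reads the line boundaries directly from the row histogram. All function names are illustrative.

```python
def row_profile(binary):
    """Horizontal histogram: number of text pixels in each row."""
    return [sum(row) for row in binary]


def find_runs(profile):
    """Return (start, end) pairs, end exclusive, of consecutive non-zero entries."""
    runs, start = [], None
    for i, count in enumerate(profile):
        if count > 0 and start is None:
            start = i                 # a text line begins
        elif count == 0 and start is not None:
            runs.append((start, i))   # an empty row ends the text line
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def segment_lines(binary):
    """Cut the page into one sub-image per text line."""
    return [binary[top:bottom] for top, bottom in find_runs(row_profile(binary))]


if __name__ == "__main__":
    page = [[0, 0, 0],
            [1, 1, 0],
            [0, 1, 0],
            [0, 0, 0],
            [1, 0, 1],
            [0, 0, 0]]
    for line in segment_lines(page):
        print(line)
```

Each returned sub-image corresponds to one bounding box spanning the full page width. Trimming empty columns at the left and right is left out for brevity.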
Segmentation of Printed Devnagari Documents 215
The global vertical projection method is used here to compute the sum of all white
pixels in every column and construct the corresponding histogram. The steps for word
segmentation are as follows:
• Construct the vertical histogram for the image (fig. 3-b).
• Count the white pixels in each column.
• Using the histogram, find the columns containing no white pixels.
• Replace all such columns by 1.
• Invert the image so that empty columns become 0 and words retain their original pixels.
• Mark the bounding box for each word (see fig. 3-c).
• Copy the pixels in each bounding box and save them in a separate file (see fig. 3-d).
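A minimal sketch of the word segmentation steps, assuming the input is one segmented text line stored as a list of rows with 1 for text pixels. As before, the boundaries are read directly from the column histogram rather than through the Matlab bounding-box functions, and the function names are illustrative.

```python
def column_profile(line):
    """Vertical histogram: number of text pixels in each column."""
    return [sum(col) for col in zip(*line)]


def segment_words(line):
    """Cut a text-line image at runs of empty columns, one sub-image per word."""
    profile = column_profile(line)
    words, start = [], None
    # The trailing 0 is a sentinel that closes a word touching the right edge.
    for x, count in enumerate(profile + [0]):
        if count > 0 and start is None:
            start = x
        elif count == 0 and start is not None:
            words.append([row[start:x] for row in line])
            start = None
    return words


if __name__ == "__main__":
    line = [[1, 1, 0, 0, 1],
            [0, 1, 0, 0, 1]]
    for word in segment_words(line):
        print(word)
```

Any empty column ends a word in this sketch. A real page may need a minimum gap width so that small gaps inside a word are not treated as word boundaries. The paper does not give such a threshold.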
A slight modification of the previous algorithm (section 4.2) is used here. The steps
for character segmentation are as follows:
• Get the thinned image using the Matlab bwmorph function (this normalizes the
image against the thickness of the characters).
• Count the white pixels in each column.
• Find the columns containing a single white pixel.
• Replace all such columns by 1.
• Invert the image so that such columns become 0 and characters retain their original
pixels.
• Mark the bounding box for each character using standard Matlab functions (see fig. 4-a).
• Copy the pixels in each bounding box and save them in a separate file (separated
characters are shown in fig. 4-b).
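The character segmentation steps can be sketched as below. The sketch assumes the word image has already been thinned to one-pixel-wide strokes (thinning itself is not implemented here). Columns with exactly one text pixel are taken to contain only the head line and are treated as separators. Treating fully empty columns as separators too is an added assumption, and the function name is illustrative.

```python
def segment_characters(thinned_word):
    """Cut a thinned word image at columns holding at most one text pixel.

    thinned_word: list of rows of 0/1, assumed already thinned.
    Returns one sub-image per character, in left-to-right order.
    """
    profile = [sum(col) for col in zip(*thinned_word)]
    chars, start = [], None
    # The trailing 0 is a sentinel that closes a character at the right edge.
    for x, count in enumerate(profile + [0]):
        if count > 1 and start is None:
            start = x
        elif count <= 1 and start is not None:
            chars.append([row[start:x] for row in thinned_word])
            start = None
    return chars


if __name__ == "__main__":
    # Top row is the head line; columns 1 and 2 contain only that line.
    word = [[1, 1, 1, 1, 1],
            [1, 0, 0, 0, 1],
            [1, 0, 0, 1, 0]]
    for ch in segment_characters(word):
        print(ch)
```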
References
[1] Priyanka, N., Pal, S., Mandal, R.: Line and Word Segmentation Approach for Printed
Documents. IJCA Special Issue on Recent Trends in Image Processing and Pattern
Recognition “RTIPPR”, 30–36 (2010)
[2] Wong, K., Casey, R., Wahl, F.: Document Analysis System. IBM J. Res. Dev. 26(6),
647–656 (1982)
[3] Nagy, G., Seth, S., Viswanathan, M.: A prototype document image analysis system for
technical journals. Computer 25, 10–22 (1992)
[4] Kumar, V., Senegar, P.K.: Segmentation of Printed Text in Devnagari Script and
Gurmukhi Script. IJCA: International Journal of Computer Applications 3, 24–29 (2010)
[5] Pal, U., Datta, S.: Segmentation of Bangla Unconstrained Handwritten Text. In: Proc. 7th
Int. Conf. on Document Analysis and Recognition, pp. 1128–1132 (2003)
[6] Dongre, V.J., Mankar, V.H.: A Review of Research on Devnagari Character Recognition.
International Journal of Computer Applications (0975 – 8887) 12(2), 8–15 (2010)
[7] Pal, U., Mitra, M., Chaudhuri, B.B.: Multi-skew detection of Indian script documents. In:
Proc. 6th Int. Conf. Document Analysis Recognition, pp. 292–296 (2001)
[8] Likforman-Sulem, L., Zahour, A., Taconet, B.: Text line Segmentation of Historical
Documents: a Survey. International Journal on Document Analysis and Recognition 9(2),
123–138 (2007)
[9] Nagy, G.: Twenty years of Document Analysis in PAMI. IEEE Trans. in PAMI 22,
38–61 (2000)
[10] Serra, J.: Morphological Filtering: An Overview. Signal Processing 38(1), 3–11 (1994)
[11] Arica, N., Yarman-Vural, F.T.: An Overview of Character Recognition Focused On Off-
line Handwriting. In: C99-06-C-203. IEEE, Los Alamitos (2000)
[12] Cheriet, M., Kharma, N., Liu, C.-L., Suen, C.Y.: Character Recognition Systems: A
Guide for Students and Practitioners. John Wiley & Sons, Inc., Hoboken (2007)
[13] Kapoor, R., Bagai, D., Kamal, T.S.: Skew angle detection of a cursive handwritten
Devnagari script character image. Journal of Indian Inst. Science, 161–175 (May-August
2002)
[14] Pal, U., Mitra, M., Chaudhuri, B.B.: Multi-Skew Detection of Indian Script Documents.
In: CVPRU IEEE, pp. 292–296 (2001)
[15] Mankar, V.H., et al.: Contour Detection and Recovery through Bio-Medical Watermarking
for Telediagnosis. International Journal of Tomography & Statistics 14(S10) (special
volume) (Summer 2010)
[16] Jing, G., Rajan, D., Siong, C.E.: Motion Detection with Adaptive Background and Dynamic
Thresholds. In: Fifth International Conference on Information, Communications and Signal
Processing, Bangkok, W B.4, pp. 41–45 (2005)