
An Enhanced Algorithm for Character Segmentation in Document Image Processing

1V. Manikandan, 1V. Venkatachalam, 1M. Kirthiga, 1K. Harini, 2N. Devarajan

1 The authors are with the Department of Electrical and Electronics
Engineering, Coimbatore Institute of Technology, Coimbatore - 641 014, India.
2 The author is with the Department of Electrical and Electronics
Engineering, Government College of Technology, Coimbatore - 641 013, India.
Corresponding author: manikandan@cit.edu.in

Abstract--- Optical Character Recognition consists of various steps like skew
detection, segmentation of columns, lines, words, and characters before
feeding the isolated character to an optical character recognition system.
Several methodologies are followed to perform these steps using the
conventional Hough Transformation. In this paper, a new algorithm is proposed
to perform all the steps involved in document image processing. The algorithm
is implemented for skew detection, column and line segmentation, and
character segmentation. It can be extended to other steps such as character
recognition. The novelty of this approach lies in “the consideration of any
image as one formed by several black and white lines of various lengths and
at various angles”. The pixel values of the binary image are stored in an
array. All the pixel values in the array are compared with their horizontally
adjacent pixel values, row by row, for the presence of collinear points
(i.e., a line). This is done by detecting the continuity of either the white
or the black pixels. Once the continuity is detected, the start and end
coordinates are displayed as an intermediate result. A new image is generated
as a result, which indicates the pixel area of each line identified in the
input image. The algorithm is applied to English and other regional
languages.

I. INTRODUCTION
Document image analysis is the analysis of text and graphics in an image, so
that they can be understood by a machine for further processing or for taking
some steps based on them. In the analysis of text regions of an image, there
are some tasks to be performed, namely, skew detection, which is necessary as
there are human errors in scanning a document, column segmentation of
documents with multiple columns, paragraph segmentation, line segmentation,
word segmentation and finally Optical Character Recognition [1].
A scanned digital document image can be considered as an image with collinear
black dots on a white background that form black and white digital lines of
various lengths. By proper interpretation of these digital lines, the
document image analysis can be done successfully. The preprocessing steps
involved in document image analysis are (i) rule line removal, (ii) text
segmentation and (iii) word segmentation.

II. PREVIOUS METHOD
Detecting straight lines in a document image involves detection of groups of
collinear black points. By proper interpretation of these digital lines, one
can successfully analyze the layout of a document image. Detection of the
skew angle as well as of individual text lines is also helped by this
process. P. V. C. Hough [3] proposed an interesting procedure for detecting
lines in an image, which is now popularly known as the Hough Transform. It
involves transformation of points in image space to the parameter space of a
straight line [4]. The Hough Transform transforms between the Cartesian space
and a parameter space in which a straight line (or other boundary
formulation) can be defined [11].

III. PROPOSED METHOD
The disadvantages of the Hough Transform can be eliminated by proposing a
Digital Straight Line (DSL) version of the Hough Transform, which is the main
contribution of this paper. This implementation can be done in two ways. One
is to generate digital straight lines between every two black pixels and pool
all black pixels into the bin for that line. However, for n points,
nC2 = n(n-1)/2 lines are to be examined, which may be prohibitive for large
n. The other way is to look for all possible digital straight lines generated
by pairs of pixels on the four borders of a document image. This second
approach is elaborated in this paper. The computations needed in calculating
all possible straight lines in the image are reduced by
employing the redundancy in representing parallel lines in an image.

A. Drawing Digital Straight Lines
The digital straight line has been rigorously characterized by Rosenfeld. A
digital space is represented by lattice points that may be considered as the
centres of the pixels. Consider a real line between any two lattice points
p, q belonging to a subset S of lattice points. If for any real point (x, y)
on this line, there is a lattice point (i, j) in S such that
max [| i-x |, | j-y |] < 1, then S is a digital straight line segment.

Figure 1. A Digital Straight Line

A document image is surrounded by four borders, ‘AB’, ‘CD’, ‘AD’, and ‘BC’,
as shown in Figure 1, and the pixels belonging to these borders are called
border pixels. The digital straight lines considered here are those passing
through two border pixels. A typical straight line obtained is illustrated in
Figure 1 as PQ.
Two parallel digital straight lines in an image follow the same pattern of
pixels, as shown in Figure 1.

IV. METHODOLOGY
In general, the images used for line detection are gray scale images, which
are available from common image sensors. A gray scale image is made up of
pixels of varying intensities. Processing these gray scale images is time
consuming, so in order to make the processing simple, time efficient and
effective, each gray scale image is initially converted to its equivalent
binary image. A binary image is one in which the pixel values are either 0 or
1 (in some cases 0 or 255).

A. Line Detection
The novelty of our approach to line detection lies in the fact that the
binarized image is considered to be formed by several lines of white and
black pixels of various lengths and at various angles. By detecting the
collinearity of white patterns along horizontal paths, the presence of a line
along the horizontal direction can be detected. The collinearity of the
pattern is checked by comparing adjacent pixels in a row for the same
pattern. This ultimately gives the solution for detection of a line within a
row [5].
The value of each and every pixel in the binarized image is extracted and
stored in a matrix format. In order to detect the presence of a line in a
row, two parameters are used, one to represent the start coordinate and the
second to represent the end coordinate of a line.

Steps to Perform Line Detection
1) Initially the start and end coordinates are set to a value -1, denoting
   that no address values are assigned to them.
2) Now, the value of the first pixel of the first row of the matrix is
   examined for the presence of a white pixel. Two cases occur:
   (a) Presence of white pixel: the address corresponding to the white pixel
       is stored in the start and end coordinates.
   (b) Absence of white pixel: the start and end coordinates are unaltered.
3) Then, the value of the right adjacent pixel in the matrix is examined.
   Two cases occur:
   (a) Presence of white pixel: here the start and end coordinates, which may
       or may not have been altered during step 2, are checked.
       If they are set: the start coordinate is left as such and the end
       coordinate is incremented.
       If they are not set: the start and end coordinates are set to
       represent the address of that particular pixel.
   (b) Absence of white pixel: here the start and end coordinates, which may
       or may not have been altered during step 2, are again checked.
       If they are set: the start coordinate is left as such and the end
       coordinate is set to the address of the previous pixel location.
       If they are not set: the start and end coordinates are left as such.
4) By repeating the above procedure the continuity of the line is detected
   and the corresponding start and end addresses are also stored.
5) Discontinuity occurs due to the presence of any black pixel; in such a
   case the start and end coordinates are reset and the above procedure is
   repeated.

Figure 2 represents the procedure for line detection, with (a) showing the
initial arrangement of the coordinates, (b) showing the presence of a white
pixel in the right adjacent position, (c) showing the condition of absence of
a white pixel and (d) the case where a discontinuity occurs in the
collinearity.

B. Column Segmentation
The next processing step is column segmentation, for which previous methods
exist [8]. A modification of the procedure used for line detection can be
extended to detect the presence of lines in the vertical direction, which
ultimately yields column segmentation, the next preprocessing step. The
procedure is similar to that of horizontal line detection.
Using the above method may lead to the detection of lines which are formed by
as few as two adjacent pixels. This disadvantage is overcome by introducing a
threshold value. The threshold determines the minimum number of collinear
points which make up a line. The threshold is applied to the difference
between the values of the end and start coordinates. The threshold value may
even be user defined, which makes the system more flexible and reliable.

C. Skew Angle Detection
When a document is imaged by a flatbed scanner or CCD camera, a few degrees
of skew is probable. To detect the skew angle, we convert the image into
two-tone [9] and consider the count of black pixels along digital lines at
various angles. It can be safely assumed that the skew angle of the image
will not exceed 15 (or -15) degrees. Hence, digital lines are considered in
this range only.
The procedure to perform skew detection is explained as follows.
The presence of a continuous straight line with its length equal to that of
the image size is to be detected. Once this line is detected, the angle by
which the image is skewed can be calculated using simple mathematical
equations [6]. Skew detection can be better explained with an example. Let us
consider an image of size 5 x 5. Initially the check for a continuous
straight line is done with the elements of the first row of the matrix, in
this case 5 elements. Now the same procedure is repeated with the elements of
the first row except the last value. The last value will be the last value of
the second row, i.e., the first four elements of row 1 and the fifth element
of row 2. The next step uses the first three values of row 1 and the last two
values of row 2. Thus the procedure is repeated until all combinations are
done.

Figure 3. Detecting skew angle (the line with the arrow corresponds to the
straight line whose length equals the size of the image)

D. Text Line Segmentation
The first step in any OCR system is to automatically segment each text line
from the document. Various methods are proposed for text
line segmentation, the most common one being
the projection profile method, which works well
if the document is skew corrected. In this
section, we show that our digital line segment
based method can detect text lines without skew
correction. In fact, both skew detection and line
identification can be done in one shot.
For documents with two, three or more
text columns, as in some magazine
and newspaper pages, the approach is slightly
different. While traversing along a candidate
line, we keep a local count of the number of
connected white pixels. This is equivalent to
finding white digital lines rather than black ones.
As soon as this count exceeds a threshold value t,
that is to say, as soon as we come across a
connected white line of length greater than t, we
color the connected segment of white pixels. The
entire line is scanned for such connected white
components with subsequent coloring as
explained. The process is repeated for all
candidate lines. Some of the results obtained are
as illustrated in Figures 4 and 5.
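
The white-run counting described in this section can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the candidate lines are simply the horizontal rows of the image (that is, zero skew), that white pixels are stored as 1 and black pixels as 0, and that "coloring" means writing a marker value into a copy of the image. The names white_runs, color_white_runs and MARK are hypothetical.

```python
from typing import List, Tuple

WHITE = 1  # assumed encoding: 1 = white (background), 0 = black (ink)
MARK = 2   # hypothetical label used to "color" long white runs


def white_runs(line: List[int], t: int) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs of white runs longer than t pixels."""
    runs = []
    start = -1  # -1 means no run is in progress, as in the line-detection steps
    for i, pixel in enumerate(line):
        if pixel == WHITE:
            if start == -1:
                start = i  # a new white run begins here
        else:
            # A black pixel breaks the run; keep it only if it is long enough.
            if start != -1 and i - start > t:
                runs.append((start, i - 1))
            start = -1
    # A run may extend to the last pixel of the line.
    if start != -1 and len(line) - start > t:
        runs.append((start, len(line) - 1))
    return runs


def color_white_runs(image: List[List[int]], t: int) -> List[List[int]]:
    """Copy the image and mark every horizontal white run longer than t."""
    out = [row[:] for row in image]
    for r, row in enumerate(image):
        for start, end in white_runs(row, t):
            for c in range(start, end + 1):
                out[r][c] = MARK
    return out
```

For a skewed page, the same run counter would be applied to the pixel sequence read along each candidate digital line instead of along a row; how those sequences are enumerated is left out of this sketch.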
Figure 4. Line segmentation of skewed English script image with two columns

The result thus obtained for an image with multiple columns is shown in
Figure 5, where row segmentation and column segmentation have been done
together.

Figure 5. Line and column segmentation in a document image with picture

E. Character Segmentation
Character segmentation is an operation that seeks to decompose an image of a
sequence of characters into sub-images of individual symbols. It is one of
the decision processes in a system for optical character recognition (OCR).
A character is a pattern that resembles one of the symbols the system is
designed to recognize [14]. But to determine such a resemblance the pattern
must be segmented from the document image. Segmentation is the initial step
in a three-step procedure. Given a starting point in a document image:
1) Find the next character image.
2) Extract distinguishing attributes of the character image.
3) Find the member of a given symbol set whose attributes best match those
   of the input, and output its identity.
This sequence is repeated until no additional character images are found.
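
Step 1 of the three-step procedure in the Character Segmentation section (finding the next character image) is not spelled out in the text. One plausible reading, in the spirit of the vertical line detection used for column segmentation, is to split an already isolated text line at columns that contain no black pixel. The sketch below rests on that assumption, and on the assumed encoding 0 = black and 1 = white; the function name character_spans is hypothetical.

```python
from typing import List, Tuple

BLACK = 0  # assumed encoding: 0 = black (ink), 1 = white (background)


def character_spans(line_image: List[List[int]]) -> List[Tuple[int, int]]:
    """Return (first_col, last_col) for each run of columns containing ink."""
    if not line_image:
        return []
    width = len(line_image[0])
    spans = []
    start = -1  # -1 means no character is currently open
    for c in range(width):
        has_ink = any(row[c] == BLACK for row in line_image)
        if has_ink and start == -1:
            start = c  # first inked column of a new character
        elif not has_ink and start != -1:
            spans.append((start, c - 1))  # an empty column closes the character
            start = -1
    if start != -1:
        spans.append((start, width - 1))  # character touching the right edge
    return spans
```

Touching or overlapping characters share inked columns and would be returned as a single span under this rule, so it should be read only as a starting point.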
Figure 6. Character segmentation (Tamil script)

V. CONCLUSION
In this paper, a new algorithm is proposed for the preprocessing of a
document image, which includes line segmentation, column segmentation, skew
detection, and line and column segmentation in skewed images, which leads to
the isolation of a particular word and character. The recognized characters
can then be used for automatic essay scoring.

REFERENCES
[1] Sargur Srihari, Jim Collins, Harish Srinivasan, Shravya Shetty, Janina
Brutt-Griffler, “Automatic scoring of short handwritten essays in reading
comprehension tests”, Center of Excellence for Document Analysis and
Recognition (CEDAR), University of Buffalo, State University of New York, USA.
[2] Azriel Rosenfeld, “Digital straight line segments”.
[3] P. V. C. Hough, “Method and means for recognizing complex patterns”,
U.S. Patent 3,069,654, December 18, 1962.
[4] R. O. Duda, P. E. Hart, “Use of the Hough transformation to detect lines
and curves in pictures”, Communications of the ACM, Vol. 15, No. 1,
pp. 11-15, January 1972.
[5] A. Rosenfeld, “Picture Processing by Computer”, Academic Press, New York,
1969.
[6] P. V. C. Hough, “Method and means for recognizing complex patterns”,
U.S. Patent 3,069,654, December 18, 1962.
[7] M. C. K. Yang, J. Lee, C. Lien, and C. Huang, “Hough Transform Modified
by Line Connectivity and Line Thickness”, IEEE Transactions on PAMI, Vol. 19,
No. 8, pp. 905-910, August 1997.
[8] A. Rosenfeld, “Digital straight line segments”, IEEE Transactions on
Computers, Vol. C-23, No. 12, pp. 1264-1269, December 1974.
[9] Richard O. Duda and Peter E. Hart, “Use of the Hough Transformation To
Detect Lines and Curves in Pictures”, Stanford Research Institute, Menlo
Park, California.
[10] E. R. Davies, “Application of the generalised Hough transform to corner
detection”.
[11] Ghassan Hamarneh, Karin Althoff, Rafeef Abu-Gharbieh, “Automatic Line
Detection”, Image Analysis Group, Department of Signals and Systems,
Chalmers University of Technology, September 1999.
[12] Seong-Whan Lee, Dong-June Lee, and Hee-Seon Park, “A New Methodology for
Gray Scale Character Segmentation and Recognition”, IEEE Transactions on
Pattern Analysis and Machine Intelligence, Vol. 18, No. 10, October 1996.
[13] V. N. Manjunath Aradhya, G. Hemantha Kumar, P. Shivakumar, “Skew
Detection Technique for Binary Document Images Based on Hough Transform”,
International Journal of Information Technology, 3:3, 2007.
[14] Richard G. Casey and Eric Lecolinet, “A Survey of Methods and Strategies
in Character Segmentation”, IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. 18, No. 7, July 1996.
[15] R. L. Hoffman and J. W. McCullough, “Segmentation Methods for
Recognition of Machine-Printed Characters”, IBM Journal of Research and
Development, pp. 153-165, March 1971.
