
Basic components in Optical Character Recognition Systems.

Experimental Analysis of Old Bulgarian Character Recognition

Rumiana Krasteva, Ani Boneva, Ditchko Butchvarov, Veselin Geortchev

Central Laboratory of Mechatronics and Instrumentation - BAS
Acad. G. Bontchev Str. Bl.2, 1113 Sofia, BULGARIA
Phone: 72 13 61; Fax: 72 35 71

E-mail: rumikristeva@hotmail.com

Abstract. A document image is a visual representation of a paper document, such as a journal article
page, a cover page of facsimile transmission, office correspondence, an application form, etc. Document
image understanding as a research endeavor consists of developing processes for taking a document
through various representations: from scanned image to semantic representation. This paper describes
the processes and subprocesses involved in document image understanding. The paper presents an
approach to Old Bulgarian character recognition and its software implementation. It describes the
input transformation, the recognition algorithm and the criteria for the recognition decision.

Keywords: Document image understanding (DIU), Optical character recognition (OCR),
Text Recognition, Word segmentation, Binary transformation.

1. INTRODUCTION

The need to process documents on paper by computer has led to an area of research
that may be referred to as document image understanding (DIU). The goal of a DIU
system is to convert a raster image representation of a document, e.g., a paper
document scanned by a flatbed document scanner, into an appropriate symbolic form
[1]. DIU as a research endeavor consists of studying all processes involved in taking a
document through various representations: from a scanned or facsimile multi-page
document to high-level semantic descriptions of the document. Thus it involves many
sub-disciplines of computer science including image processing, pattern recognition,
natural language processing, artificial intelligence and database systems.
The symbolic representation desired as output of a DIU system can take one of several
forms: an editable description, a representation from which the document can be
(exactly) reconstructed, a semantic description useful for document sorting/filing, etc.
Representation schemes that are useful for editing and exact reproduction are the
standards for electronic document description.
Developing a DIU system with performance comparable to that achieved by a human
expert is still decades from realization [4]. The state-of-the-art in DIU can be subdivided
into five areas as follows:
1. System architecture - The complexity of the DIU task leads to modularization into
manageable processes. Due to interdependency of processes, issues of how to
maintain communication and integrate results from each process arise.
2. Decomposition and Structural Analysis - Documents consist of text (machine-printed
and handwritten), line drawings, tables, maps, half-tone pictures, icons, etc. It is
necessary to decompose a document into its component parts in order to process these
individual components. Their structural analysis, in terms of spatial relationships and
logical ordering, is necessary to invoke modules in appropriate order and to integrate
the results of the appropriate modules.
3. Text recognition and interpretation - It is necessary to recognize words of text, often
using lexicons and higher level linguistic and statistical context. The necessity for
contextual analysis arises from the fact that it is often impossible to recognize
characters and words in isolation, particularly with handwriting and degraded print.
4. Tables, graphics and halftone recognition - Specialized subsystems are necessary for
processing a variety of non-text or mixed entities, such as recognizing tabular data,
converting graphical drawings into vector representation, and extracting objects from
half-tone photographs.
5. Databases and system performance evaluation - Methods for determining data sets
on which evaluation is based and the metrics for reporting performance.
Deriving a useful representation from a scanned document requires the development
and integration of many subsystems. The subsystems have to incorporate in
themselves the necessary image processing, pattern recognition and natural language
processing techniques so as to adequately bridge the gap from paper to electronic
media [5].
In discussing DIU it is useful to note that significant research is still required for
extracting descriptions at the level of detail needed for paper documents to be
replicated exactly, e.g., fonts are not typically recognized in today's OCR systems.

2. SYSTEM ARCHITECTURE

Figure 1 shows the organization of the DIU system developed at CEDAR [5]. The
architecture allows for parallel development of different subsystems. The DIU
architecture consists of three major components:
Fig. 1. Organization of DIU system

1. The tool box contains all the modules needed for document processing. Tools
developed for different conceptual levels are coordinated by the control.
2. The knowledge base consists of two sub-components: document models and
general knowledge. A document model describes the aspects of a document domain or
a group of documents that share similar layout structure. The expressive power of the
model representation dictates the capability of a DIU system to handle different types of
documents. General knowledge is shared by different document domains. It describes
the tasks that are needed to locate and identify document components, such as text
blocks and line segments. A task is carried out by one of the modules in the tool box.
The general knowledge can apply to objects of different domains since they share
similar structural information. Lexicons used by different tools such as for OCR and NLP
are stored in document models.
3. Control is the most critical issue in DIU system design. Its functions include: (1)
selective use of tools, and (2) intelligent combination of data extracted from document
sub-areas to generate a representation of the scanned document. It examines the
problem state in the working memory and uses the facts in the knowledge base to
determine which modules in the tool box should be used. Working memory is a
temporary storage where different levels of data will be stored during document
processing and will be updated after each module activation. The search process stops
when all the objects specified in the document model have been located.
Tool interaction is determined by the knowledge base. The general knowledge defines the
dependency or the activation order of tools, e.g., area-labeling can only be activated
after area-segmentation. A document model defines the tool interactions needed in
different document sub-areas since each sub-area may require a different level of
interpretation, e.g., recognizing the recipient (name and address) on a business letter
requires both OCR and NLP while reading the title of a technical document only needs
OCR.
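The activation-order rule described above can be illustrated with a minimal sketch. It
assumes that the general knowledge is stored as a table of prerequisites and that a
document model lists the tools each sub-area needs. The tool names, both dictionaries and
the function activation_order are hypothetical and are not taken from the CEDAR system.

```python
# Hypothetical general knowledge: each tool maps to the tools that must run first.
DEPENDENCIES = {
    "area_segmentation": [],
    "area_labeling": ["area_segmentation"],
    "ocr": ["area_labeling"],
    "nlp": ["ocr"],
}

# Hypothetical document model: tools required by each document sub-area.
DOCUMENT_MODEL = {
    "recipient": ["ocr", "nlp"],  # name and address on a business letter
    "title": ["ocr"],             # title of a technical document
}


def activation_order(targets, dependencies):
    """Return the tools needed for `targets`, each placed after its prerequisites."""
    order = []
    visiting = set()

    def visit(tool):
        if tool in order:
            return
        if tool in visiting:
            raise ValueError(f"cyclic dependency at {tool!r}")
        visiting.add(tool)
        for prerequisite in dependencies[tool]:
            visit(prerequisite)
        visiting.discard(tool)
        order.append(tool)

    for tool in targets:
        visit(tool)
    return order


title_plan = activation_order(DOCUMENT_MODEL["title"], DEPENDENCIES)
recipient_plan = activation_order(DOCUMENT_MODEL["recipient"], DEPENDENCIES)
print(title_plan)
print(recipient_plan)
```

In this sketch the title sub-area stops at OCR, while the recipient sub-area also
activates NLP, and area-labeling is never scheduled before area-segmentation.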
3. DECOMPOSITION AND STRUCTURAL ANALYSIS

A document image is a visual representation of a printed page such as a journal
article page, a facsimile cover page, a technical document, an office letter, etc. Typically,
it consists of blocks of text, i.e., letters, words, and sentences that are interspersed with
tables and figures. The figures can be symbolic icons, gray-level images, line drawings,
or maps. A digital document image is a two-dimensional representation of a document
image obtained by optically scanning and digitizing a hardcopy document. It may also
be an electronic version that was created with publishing or drawing applications
available for computers.
The document decomposition and structural analysis task can be divided into three
phases [1].
Phase 1 consists of block segmentation where the document is decomposed into
several rectangular blocks. Each block is a homogeneous entity containing one of the
following: text of a uniform font, a picture, a diagram, or a table. The result of phase 1 is a
set of blocks with the relevant properties. A textual block is associated with its font type,
style and size; a table might be associated with the number of columns and rows, etc.
Phase 2 consists of block classification. The result of phase 2 is an assignment of
labels (title, regular text, picture, table, etc.) to all the blocks using properties of individual
blocks from phase 1, as well as spatial layout rules.
Phase 3 consists of logical grouping and ordering of blocks. For OCR it is necessary to
order text blocks. In addition, the document blocks are grouped into items that "mean"
something to the human reader (author, abstract, date, etc.), which goes beyond the
physical decomposition of the document.
Approaches for segmenting document image components can be either top-down or
bottom-up. Top-down techniques divide the document into major regions which are
further divided into sub-regions based upon knowledge of the layout structure of the
document. Bottom-up methods progressively refine the data by layered grouping
operations.
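As an illustration of the top-down strategy, the following minimal sketch applies a
recursive X-Y cut: a region is split at its widest blank horizontal gap, then at its
widest blank vertical gap, until no gap remains. The text above does not name a specific
algorithm, so this choice, the function names widest_gap and xy_cut, the min_gap
parameter and the toy page are all assumptions made for illustration.

```python
def widest_gap(profile, min_gap):
    """Return (start, end) of the widest run of zeros, or None if shorter than min_gap."""
    best = None
    run_start = None
    for i, value in enumerate(profile + [1]):  # the sentinel closes a trailing run
        if value == 0:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            length = i - run_start
            if length >= min_gap and (best is None or length > best[1] - best[0]):
                best = (run_start, i)
            run_start = None
    return best


def xy_cut(image, top, bottom, left, right, min_gap=2):
    """Split image[top:bottom][left:right] into (top, bottom, left, right) blocks.

    `image` is a list of rows of 0/1 pixels (1 = ink); end indices are exclusive.
    """
    rows = [sum(image[y][left:right]) for y in range(top, bottom)]
    cols = [sum(image[y][x] for y in range(top, bottom)) for x in range(left, right)]
    if not any(rows):
        return []  # empty region
    # Trim the region to the bounding box of its ink.
    inked_rows = [i for i, v in enumerate(rows) if v]
    inked_cols = [i for i, v in enumerate(cols) if v]
    t, b = top + inked_rows[0], top + inked_rows[-1] + 1
    l, r = left + inked_cols[0], left + inked_cols[-1] + 1
    rows = rows[inked_rows[0]:inked_rows[-1] + 1]
    cols = cols[inked_cols[0]:inked_cols[-1] + 1]
    gap = widest_gap(rows, min_gap)
    if gap is not None:  # horizontal cut: upper and lower sub-regions
        return (xy_cut(image, t, t + gap[0], l, r, min_gap)
                + xy_cut(image, t + gap[1], b, l, r, min_gap))
    gap = widest_gap(cols, min_gap)
    if gap is not None:  # vertical cut: left and right sub-regions
        return (xy_cut(image, t, b, l, l + gap[0], min_gap)
                + xy_cut(image, t, b, l + gap[1], r, min_gap))
    return [(t, b, l, r)]


# Toy page: a title line above two text columns.
PAGE = [
    "0111111110",
    "0000000000",
    "0000000000",
    "1111001111",
    "1111001111",
    "1111001111",
    "1111001111",
    "0000000000",
]
image = [[int(c) for c in row] for row in PAGE]
blocks = xy_cut(image, 0, len(image), 0, len(image[0]))
print(blocks)
```

On the toy page the sketch first separates the title from the body and then splits the
body into its two columns. A bottom-up method would instead start from pixels or
connected components and merge them into larger units.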
Blocks determined by the segmentation process need to be classified into one of a
small set of predetermined document categories. Knowledge of the layout structure of a
document can aid the classification process. For instance, if it is known a priori that a
given document is a facsimile cover page, then inferences such as "the central block must
be labeled as the destination address" and "the top of the document must be labeled as
the name of the organization" are plausible. However, to ensure portability,
document-specific formatting rules should be avoided.
It is necessary to provide a logical grouping of blocks to process them for recognition
and understanding. Textual blocks corresponding to different columns have to be
ordered for performing OCR.
The layout structure of a document divides and subdivides the document into
physical rectangular units, whereas the logical structure divides and subdivides the
document into units that "mean" something to the reader.

4. TEXT RECOGNITION
Character Recognition, also known as Optical Character Recognition or OCR, is
concerned with the automatic conversion of scanned and digitized images of characters
in running text into their corresponding symbolic forms. The ability of humans to read
poor quality machine print as well as text with unusual fonts and handwriting is far from
matched by today's machines.
We have experimented with an approach [11] for character recognition of Old Bulgarian
text documents. Most OCR systems have binarization as a preprocessing step. This
approach uses the vertical projection onto the horizontal axis of text characters that
have been inclined in advance. Under this transformation the projection contour takes a
different shape from that of upright characters.
This considerably simplifies matching the image projection against the model
projection, and only a minimal number of parameters needs to be observed.
Figure 2 shows a scanned Old Bulgarian text document.

Fig. 2. Scanned text document (Old Bulgarian text)

Figure 3 shows the algorithm for vertical projection.

Fig. 3. Algorithm for vertical projection


The processing and analysis algorithm first applies a preliminary image transformation
to reduce the amount of input data. It converts the input image U{u(x,y)} into an internal
image W{w(x,y)} of better quality and with summarized data. Each pixel value w(x,y) of
the processed image W depends only on the corresponding pixel u(x,y) of the input image U.
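A point-wise transformation of this kind, followed by the vertical projection, can be
sketched as follows. The sketch assumes that the transformation is a fixed-threshold
binarization of a gray-level image; the threshold of 128, the function names and the tiny
sample image are assumptions for illustration and are not taken from CYR1.0.

```python
def binarize(u, threshold=128):
    """Point-wise transform: each w(x, y) depends only on u(x, y).

    `u` is a list of rows of gray levels (0 = black ink, 255 = white paper).
    The result W holds 1 for ink and 0 for background.
    """
    return [[1 if pixel < threshold else 0 for pixel in row] for row in u]


def vertical_projection(w):
    """Count the ink pixels in every column (projection onto the horizontal axis)."""
    return [sum(column) for column in zip(*w)]


# Assumed 3x4 gray-level input image U.
U = [
    [255, 255, 10, 255],
    [255, 20, 30, 255],
    [40, 50, 60, 255],
]
W = binarize(U)
projection = vertical_projection(W)
print(W)
print(projection)
```

The projection reduces the two-dimensional image W to one number per column, which is
the kind of data summarization described above.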
Methods for character recognition can be divided [7] into recognition without context
and recognition with context.
The next higher level of model knowledge useful in OCR is linguistic syntax. In such
cases, linguistic constraints may be used to select the best sentence candidate or at
least to reduce the number of possibilities. Methods can be syntactic, statistical or
hybrid.

5. PROGRAM FOR EXPERIMENTAL ANALYSIS OF OLD BULGARIAN TEXT RECOGNITION - CYR1.0

This section presents an approach to character recognition that is well suited to Old
Bulgarian text. Old Bulgarian texts occupy a special place because the characters were
drawn by hand, and the scribe aimed to render identical characters as uniformly as
possible. The spacing between characters was carefully observed, which reduces
character segmentation problems.
The section also describes the developed program CYR1.0, which is used for the
recognition and analysis of Old Bulgarian characters. Existing programs offer no
support for working with Old Bulgarian texts. Experiments were carried out only with the
OldCyr font, for recognition both without and after information loss.
Most OCR systems have binarization as a preprocessing step. The approach offered
in this paper [11,12] uses the vertical projection onto the horizontal axis of text
characters that have been inclined in advance. Under this transformation the projection
contour takes a different shape from that of upright characters.
This considerably simplifies matching the image projection against the model
projection. Only a minimal number of parameters is observed: the minimum value, the
maximum value and the width. Figure 4 shows the differences between the vertical
projections of upright and inclined characters.
Fig. 4. Vertical character projection (Old Cyr)

The projection of a character that has been inclined in advance carries more
information, which saves time when recognizing a single character.
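The effect can be illustrated with a minimal sketch on two synthetic shapes whose upright
projections coincide but whose inclined projections differ. The sketch assumes that the
inclination is a horizontal shear in which upper rows are shifted to the right, that the
shear factor is 0.3, and that the three parameters are measured over the occupied part of
the projection. The shapes and function names are hypothetical.

```python
def vertical_projection(w):
    """Count the ink pixels in every column of a 0/1 image."""
    return [sum(column) for column in zip(*w)]


def shear_right(w, factor):
    """Incline the image: a row k rows above the bottom moves round(factor * k) pixels."""
    height = len(w)
    offsets = [round(factor * (height - 1 - y)) for y in range(height)]
    pad = max(offsets)
    return [[0] * off + row + [0] * (pad - off) for row, off in zip(w, offsets)]


def projection_parameters(projection):
    """Minimum, maximum and width over the occupied span of the projection."""
    occupied = [i for i, count in enumerate(projection) if count > 0]
    if not occupied:
        return {"min": 0, "max": 0, "width": 0}
    span = projection[occupied[0]:occupied[-1] + 1]
    return {"min": min(span), "max": max(span), "width": len(span)}


# Two synthetic shapes: a vertical stroke with a bar at the bottom or at the top.
L_SHAPE = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 1, 1]]
GAMMA_SHAPE = [[1, 1, 1], [1, 0, 0], [1, 0, 0], [1, 0, 0]]

upright_l = vertical_projection(L_SHAPE)
upright_gamma = vertical_projection(GAMMA_SHAPE)
inclined_l = vertical_projection(shear_right(L_SHAPE, 0.3))
inclined_gamma = vertical_projection(shear_right(GAMMA_SHAPE, 0.3))

print(upright_l, upright_gamma)
print(inclined_l, projection_parameters(inclined_l))
print(inclined_gamma, projection_parameters(inclined_gamma))
```

Upright, the two shapes give the same projection and cannot be told apart. After the
shear their maximum and width values differ, so the same three parameters suffice to
separate them.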
Figure 5 shows the main menu.

Fig. 5. CYR1.0 - Main menu

For correct operation, the following steps are required [12]:

1. load the input image from the LOAD IMAGE menu;
2. binarize the input image from the PIXEL COUNT menu. This routine stores the pixel
counts along the X and Y axes that are needed for recognition; it is a pixel-level
operation;
3. display the histogram from the VIEW HISTOGRAM menu.
After that, the computation and comparison procedures needed for character
recognition are started.
For each character, tables are built that hold the maximum value on the x-axis and the
absolute maximum on the y-axis. After the input image has been processed, its values are
compared with these tables [11].
Preprocessing for Old Bulgarian character recognition includes two steps:

• 30% inclination of the input characters - Figure 6 (step one);
• image binarization - Figure 7 (step two).

Fig. 6. Step one

Fig. 7. Step two

There are two criteria for recognizing each character:

• absolute maximum value on the y-axis - Wmax(x,y);
• absolute maximum value on the x-axis (base width) - Wmax(x).
The recognition algorithm uses two tables of values: Table 1 for the absolute maximum
value Wmax(x,y) and Table 2 for the base width value Wmax(x). After binarization, each
input character is compared with the values in Table 1 and Table 2.
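The two-table comparison can be sketched as a simple lookup. The table contents below are
invented for illustration, the character keys use transliterated letter names, and the
tolerance parameter is an assumption; the paper does not state the actual table values or
whether the comparison is exact.

```python
# Hypothetical Table 1: absolute maximum of the projection, Wmax(x,y), per character.
TABLE_1_WMAX_XY = {"az": 14, "buki": 17, "vedi": 17}
# Hypothetical Table 2: base width of the projection, Wmax(x), per character.
TABLE_2_WMAX_X = {"az": 12, "buki": 10, "vedi": 11}


def recognize(wmax_xy, wmax_x, tolerance=0):
    """Return every character whose entries in both tables match the measurements."""
    return [
        name
        for name in TABLE_1_WMAX_XY
        if abs(TABLE_1_WMAX_XY[name] - wmax_xy) <= tolerance
        and abs(TABLE_2_WMAX_X[name] - wmax_x) <= tolerance
    ]


exact = recognize(17, 10)
unknown = recognize(20, 20)
ambiguous = recognize(17, 10, tolerance=1)
print(exact, unknown, ambiguous)
```

With an exact comparison the sample measurement matches a single character. Once a
tolerance is allowed, as degraded input might require, two candidates remain, which
suggests why two criteria alone may not be enough.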
In the case of information loss, the set of criteria has to be extended. The criteria
examined in this case are:

• absolute maximum value on the y-axis - Wmax(x,y);
• absolute maximum value on the x-axis (base width) - Wmax(x);
• absolute minimum value on the y-axis - Wmin(x,y);
• first local minimum on the y-axis - W1min(x,y);
• first local maximum on the y-axis - W1max(x,y);
• number of pixels in the columns.

All criteria are organized in tables. The recognition algorithm compares the values for
each input character (after the described transformations) with the values in the tables
and makes the recognition decision. Additionally, an OCR system may use spell checkers or
other lexical analyzers that make use of context information to correct recognition
errors and resolve ambiguities in the generated text.
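The extended criteria listed above can be derived from a single vertical projection, as
in the following minimal sketch. It assumes that the values are measured over the
occupied span of the projection and that a local extremum is an interior column compared
with its two neighbours. These definitions, the dictionary keys and the sample
projections are assumptions; the paper does not define them precisely.

```python
def extended_criteria(projection):
    """Derive the extended recognition criteria from a list of column pixel counts."""
    occupied = [i for i, count in enumerate(projection) if count > 0]
    if not occupied:
        raise ValueError("projection contains no ink")
    span = projection[occupied[0]:occupied[-1] + 1]
    interior = range(1, len(span) - 1)
    # First interior column that is higher than its left and not lower than its right.
    first_local_max = next(
        (span[i] for i in interior if span[i - 1] < span[i] >= span[i + 1]), None)
    # First interior column that is lower than its left and not higher than its right.
    first_local_min = next(
        (span[i] for i in interior if span[i - 1] > span[i] <= span[i + 1]), None)
    return {
        "wmax_xy": max(span),        # absolute maximum on the y-axis
        "wmax_x": len(span),         # base width on the x-axis
        "wmin_xy": min(span),        # absolute minimum on the y-axis
        "w1min_xy": first_local_min,
        "w1max_xy": first_local_max,
        "column_counts": span,       # number of pixels in each column
    }


# Two assumed projections with equal maximum, minimum and width.
first = extended_criteria([2, 5, 3, 1, 4, 6, 2])
second = extended_criteria([2, 3, 6, 4, 1, 5, 2])
print(first)
print(second)
```

The two sample projections agree on the absolute maximum, the absolute minimum and the
base width, and differ only in the first local maximum and the column counts, so the
additional criteria are what separates them in this sketch.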
The program CYR1.0 is structured as five separate modules. Each of them is a routine
with specific functions:
MEN1 - routine that implements the main menu and locates the input file to be
processed. It works only with files in .BMP format.
MIT - routine for reading and processing a single character. After loading from the input
file, it normalizes the coordinates. It contains separate procedures for the computing
operation and for computing all parameters.
HIST1 - routine that displays the histogram of each character and saves it in .BMP
format.
TT1 - routine containing all the required parameter tables.
TT2 - routine that forms the output. It makes the decision based on the values from TT1.

6. CONCLUSION

The major modules in a DIU system are: system architecture, decomposition and
structural analysis, text recognition and interpretation, table, diagram and image
understanding, and database and system performance evaluation.
The system architecture provides a computational framework to integrate and
regulate activities needed in document layout analysis and content interpretation.
Decomposition and structural analysis is responsible for decomposing a document into
several regions, each of which contains homogeneous entities. These regions are then
grouped into logical units to form a high-level interpretation of the document structure.
Current OCR technology has limited success in recognizing poor quality text.
The use of contextual information, such as lexicon and syntax, has shown promising
results in degraded text recognition. Evaluation of the performance of document
analysis systems was discussed. Meaningful performance evaluation should be related
directly to the goals of the system.
The presented approach uses the vertical projection onto the horizontal axis of text
characters that have been inclined in advance. This transformation opens the possibility
of applying additional recognition methods, such as fuzzy logic, neural networks and
others. A large volume of input information is reduced to a few basic criteria, which
considerably reduces and simplifies the comparison operation.
The program CYR1.0 for Old Bulgarian character recognition can be used for the analysis
of Old Bulgarian texts and as an additional tool in the humanities.

REFERENCES

1. Michael Garris, Darrin Dimmick, Form Design for High Accuracy Optical
Character Recognition, IEEE Transactions PAMI, June 1996
2. P.J. Grother, Handprinted Forms and Character Database, NIST Special
Database 19, Technical Report, National Institute of Standards and Technology, March
1995
3. S.N. Srihari and S.W. Hull. Character Recognition. Center of Excellence for
Document Analysis and Recognition (CEDAR), Technical Report, January 1995
4. M. Garris, J. Blue, G. Candela, D. Dimmick, J. Geist, P. Grother, S. Janet and C.
Wilson, NIST Form-Based Handprint Recognition System, Technical Report NISTIR
5469, National Institute of Standards and Technology, July 1994
5. R. Wilkinson, J. Geist, S. Janet, P. Grother, C. Burges, R. Creecy, B. Hammond, J.
Hull, N. Larsen, T. Vogl and C. Wilson, The First Census Optical Character
Recognition System Conference, Technical Report NISTIR 4912, National Institute of
Standards and Technology, July 1992
6. P. Grother, Karhunen Loeve feature extraction for neural handwritten
character recognition, Proc. Application of Artificial Neural Network III, vol 1709, pp.
155-166, SPIE, Orlando, April, 1992
7. S.N. Srihari. Document Image Understanding. Center of Excellence for
Document Analysis and Recognition (CEDAR), May, 1992
8. S.W. Lam, A.C. Girardin and S.N. Srihari. Gray-Scale Character Recognition
Using Boundary Features. SPIE/IS&T Symposium on Electronic Imaging Science
& Technology, San Jose, California, 1992.
9. J.J. Hull, S. Khoubyari, T.K. Ho, Visual Global Context: Word Image Matching in a
Methodology for Degraded Text Recognition, Symposium on Document Analysis and
Information Retrieval Las Vegas, Nevada March, 1992.
10. C.L. Wilson, Evaluation of Character Recognition Systems, Neural Networks for
Signal Processing III, IEEE, pp.485-495, New York, 1992
11. Geortchev V., Krusteva R., Boneva A., Stanischev K., Experimentally analysis on
old Bulgarian text character recognition, MIM2000 IFAC Symposium on
Manufacturing, Modeling, Management and Control, University of Patras Rio, Greece,
(July 12-14, 2000), Proceedings (Editors: P. Groumpos & A. Tzes), ISBN 0 08043554 8,
Session WP1: Applications, WP1, pp. 124-127, 2000
12. Geortchev V., D. Butchvarov, A. Boneva, R. Krusteva and K. Stanischev (1999).
Letter characters recognition after information loss. In: Proceedings "Scientific
reports" (in Bulgarian): Section 3: Mechatronics, ISSN 1310-3946, Sofia, Bulgaria,
pp. 3.39-3.44, 1999

Technical College - Bourgas, All rights reserved, © March, 2000