Professional Documents
Culture Documents
=
B A
AB
C
Xc
m
,
The distance between the groups A an
AB
is equal to the average between
elements of group A and all elements
density of the group denoted by , Xc
between the distances of all elemen
group. The number of groups is m .
For all tests performed the best SOM n
one that presented a 4x5 mesh co
threshold of 0.3. The clusters formed fr
are presented in Table 02.
Table 02 The best clusters obtaine
Clusters Meta-c
G 01 m, n, r,
G 02 c, e, o
G 03 b, d, l
G 04 f, i, j
G 05 a, g,h
For the formation of the candida
observes for each character the two c
highest activation. This way it is pos
graph with which one generates
combinations of the characters in the d
as shown in the graph of Figure 7.
completion, one sweeps the graph
possible words. From those one sieve
by dictionary look-up
f classes. Such
complexity of
ase of this work,
ress the set of
from the block of
uch a purpose, a
with different
nce was used for
OM was trained
plied to find the
adjusted through
resholds in the
ers for a given
n this work took
6, 0.4, 0.3, 0.25
etwork builds a
them is the one
ion 01.
nd B, denoted by
the pairs of all
of group B. The
is the average
nts in the same
network was the
onfiguration and
rom this network
ed
classes
v, y, w
o, p, u
l, k, s
, t, z
h,q,x
ate words one
clusters with the
ssible to build a
s all possible
defined clusters,
After the graph
and forms all
s the valid ones
letter
G03;G02;G04;G04;G
G03;G02;G03;G04;G
Figure 7. Example of the
of graphs for the generatio
5. Experimental Results
The proposed method was tested
historical documents kept by the
Nabuco [18]. The documents wer
flatbed scanner with 200 dpi.
transcribed and corrected yielding
those, 3,814 present physical nois
off parts) that implied in informati
maximum of 70% of the original
result of the improvement in re
word correction the test documen
with the commercial OCR tool AB
Professional Editor. Figure 8 prese
results obtained. In part (c) of Fi
green the inserted characters, in r
models for substitution and blue is
the document area.
A quantitative analysis was perfor
improvement in transcription rat
work. The same one hundred
transcribed twice by the tool ABB
Professional Editor and the results
physical noises is shown in T
between Correct, Incorrect an
words.
Table 03 Transcription results i
FineReader
Correct 1949
Incorrect 895
Not-recognized 738
Accuracy 54.5%
(01)
02;G01
01;G05
generation
on of words.
with a file of 100
Fundao Joaquim
re digitized using a
Such letters were
13,833 words. From
es (holes and thorn-
ion losses reaching a
word. To access the
cognition rate after
nts were transcribed
BYY FineReader 10
ents a sample of the
igure 8 one sees in
red the ones used as
s no model found in
rmed to evaluate the
e presented in this
d documents were
BYY FineReader 10
s for the 3,582 with
Table 03, classified
nd Not-recognized
in degraded areas.
New Method
2752
646
184
78.2%
556 556
(a) Original text binar
(b) Transcription of (a) using ABBYY
(c) Text after processing with the pro
(d) Transcription of (c) using ABBYY
Figure 8. Transcription of the docum
Figure 4 without and with the scheme
6. Conclusions
This paper presents a scheme for impro
transcription rate of historical docume
physical noises such as thorn-off reg
originated by punches (i.e. filing an
worms, etc. The scheme automatica
damaged areas and when the text is seg
associated with blocks that are
characters of the same group of featur
the probability of being the origi
character. Such replacement yields a c
lexemes that stand as candidates
word. Dictionary look-up decides whic
chosen as the most likely transcription.
The processing strategy presented here
batch of one hundred typewritten histo
totaling over 12,000 words, of whi
damaged. The results obtained
FineReader 10 Professional Editor an
correct transcription rate of about 25%
rized
Y FineReader 10
oposed scheme.
Y FineReader 10
ment shown in
presented here.
oving the correct
ents damaged by
gions and holes
nd staples) and
ally detects the
gmented they are
replaced with
res that increase
inal damaged
class of possible
for the original
ch word is to be
e was tested in a
orical documents
ich 3,500 were
with ABBYY
nd a gain in the
was observed.
References
[1] R.D. Lins, G. F. P Silva. Assessing a
of Document Images Acquired with P
In: ICDAR 2007, Curitiba. IEEE Press
[2] R.D. Lins, G. F. P Silva, G. Torreao .
Indexing in the Livememory Platform
230, 2010.
[3] R. D. Lins. A Taxonomy for Noise
Paper Documents - The Physical N
Springer Verlag, 2009. v. 5627. p. 844
[4] R. D. Lins, et al. An Environment f
Historical Documents. Microprocessin
pp. 111-121, North-Holland, 1994.
[5] R. D. Lins, G. F. P. Silva, S. Baner
Thielo. Automatically Detecting an
Document Images. SAC 2010, ACM
39.
[6] R. Farrahi, R. Moghaddam and M.
multi-level classifiers and clusterin
spotting in historical document ima
Press, 2009. p. 511-515.
[7] S. Pletschacher, J. Hu and A. A
Framework for Recognition of Heavil
Historical Typewritten Documents B
Clustering. ICDAR 2009, IEEE Press,
[8] V. Kluzner, A. Tzadok, Y. Shimo
Antonacopoulos. Word-Based Adapt
Books. ICDAR 2009, IEEE Press, 200
[9] G. F. P. e Silva; R.D. Lins, S. Baner
Thielo. A Neural Classifier to
Interference in Paper Documents, ICP
[10] D. M. Oliveira, R. D. Lins, G. Torre
New Algorithm for Segmenting
Document Images, ACM-SAC 2011,
[11] C.Y. Suen, J. Guo, Z.C Li, Analy
Alphanumeric Handprints by parts,
Systems,Man, and Cybernetics, N. 24,
[12] Z.C. Li, C.Y. Suen, J. Guo, A Regiona
for Recognizing Handprinted Characte
Systems, Man, and Cybernetics, N. 25
[13] P. V. W. Radtke, L.S. Oliveira,
Intelligent Zoning Design Using Mul
Algorithms, ICDAR2003, p.824-828,
[14] C. O. A. Freitas, L.S. Oliveira, S.B
Zoning and metaclasses for character
2007. P. 632-636, 2007.
[15] T. Kohonen, Self-Organizing Map
Information Sciences,Springer, second
[16] E.V. Samsonova, J.N. Kok, A.P. IJze
analysis in the self-organizing map, N
935-949, 2006.
[17] NIST Scientific and Tech. Databases h
[18] Fundao Joaquim Nabuco Fundaj, ht
[19] ABBYY FineReader 10
http://finereader.abbyy.com/.
[20] C. Liu, K. Nakashima, H. Sako, and H
digit recognition: benchmarking of sta
Pattern Recognition, 36(10):2271-228
[21] C. Liu, Y. Liu, and R. Dai, ''Prepr
structural feature extraction for
recognition'', Progress of Handwri
Downton and S. Impedovo eds., World
[22] L. Oliveira, R. Sabourin, F. Bor
''Automatic recognition of handwritt
recognition and verification strategy''
Analysis and Machine Intelligence, 24
[23] M. Hu, ''Visual pattern recognition
IEEE Transactions on Information The
and Improving the Quality
Portable Digital Cameras.
s, 2007. v. 2. p. 569-573.
Content Recognition and
m. LNCS, v. 6020, p. 224-
e Detection in Images of
Noises. In: ICIAR 2009.
4-854.
for Processing Images of
ng & Microprogramming,
rgee, A. Kuchibhotla, M.
d Classifying Noises in
M Press, 2010. v. 1. p. 33-
Cheriet. Application of
ng for automatic word
ages.ICDAR 2009, IEEE
Antonacopoulos. A New
ly Degraded Characters in
ased on Semi-Supervised
, 2009. p. 506-510.
ony, E. Walach, and A.
tive OCR for Historical
09. p. 501-505.
rgee, A. Kuchibhotla, M.
Filter-out Back-to-Front
R 2010, Istanbul. 2010.
eo, J. Fan, M. Thielo. A
Warped Text-lines in
ACM Press, 2011.
ysis and Recognition Of
, IEEE Transactions on
, p. 614-631, 1994.
al Decomposition Method
ers, IEEE Transactions on
5, p. 998-1010, 1995.
R. Sabourin, T. Wong,
ti-Objective Evolutionary
2003.
.K. Aires, F. Bortolozzi,
r recognition. ACMSAC
ps, Springer Series in
d edition, vol. 30, 1997.
erman, TreeSOM: cluster
Neural Networks, N. 19, p.
http://www.nist.gov/data/.
ttp://www.fundaj.gov.br/.
Professional Editor,
H. Fujisawa, ''Handwritten
ate-of-the-art techniques'',
5, 2003.
rocessing and statistical/
handwritten numeral
iting Recognition, A.C.
d Scientific, 1997.
rtolozzi, and C. Suen,
ten numerical strings: A
, IEEE Trans. on Pattern
4(11):1438-1454, 2002.
by moment invariants'',
eory, 8(2):179-187, 1962.
557 557