I. INTRODUCTION
Fig. 1: Example characters taken from ICDAR2003 (first and second rows) and SVT (third and fourth rows) datasets. First row: 'E', 'S', 'S', 'N', 'N', 'G', 'G', 'R', 'A'. Second row: 'A', 'A', 'f', 'M', 'H', 'R', 'T', 'T'. Third row: 'E', 'S', 'L', 'M', 'b', 'J', 'o', 'R', 'R'. Fourth row: 'P', 'M', 'K', 'E', 'h', 'M', 'T', 'A', 'n'.

Recognition of text in natural scenes has attracted increasing research attention in recent years due to its crucial importance in scene understanding. It has become a very promising tool in different applications such as unmanned vehicle/robot navigation, living aids for visually impaired persons, content-based image retrieval, etc. Though optical character recognition (OCR) of scanned document images has achieved great success, recognition of scene text by existing OCR systems still leaves a large space for improvement due to a number of factors.

First, unlike scanned document texts, which usually lie over a "blank" document background with similar color, texture, and controlled lighting, scene texts often have a much more variable background that can have arbitrary color, texture, and lighting conditions, as illustrated in Fig. 1. Second, unlike scanned document texts, which are usually printed in some widely used text font and size, scene text can be captured in arbitrary size and printed in fancy but infrequently used fonts, as illustrated in Fig. 1. Even worse, the font of scene text may change within a single word for special visual effects or to attract human attention. Third, unlike scanned document texts, which usually have a front-parallel view, scene texts captured from arbitrary viewpoints often suffer from perspective distortion, as illustrated in Fig. 1. All these variations make OCR of scene texts a very challenging task. A robust OCR technique is urgently needed that is tolerant to the variations of scene texts as well as their background.

A number of scene text recognition techniques have been reported, and they can be classified into two categories. The traditional approach first performs certain preprocessing such as binarization, slant correction, and perspective rectification before passing scene texts to existing OCR engines. Chen et al. [1] performed a variant of Niblack's adaptive binarization algorithm [2] on the detected text region before feeding it to OCR for recognition. An iterative binarization method is proposed in [3] that uses k-means on a single character to produce a set of potential binarized characters; Support Vector Machines (SVM) then measure the degree of character-likeness, and the candidate with maximum character-likeness is selected as the optimal result. Recently, the Markov Random Field (MRF) model was adopted for binarization in [4], where an auto-seeding technique first determines certain foreground and background pixel seeds and MRF is then used to segment text and non-text regions.

Another way is to extract feature descriptors and then train classifiers for scene text recognition. This approach is studied extensively in [5], where scene text recognition performance is evaluated using different feature descriptors, including Shape Contexts, Scale Invariant Feature Transform (SIFT), Geometric Blur, Maximum Response of filters, patch descriptor, etc., in combination with a bag-of-words model. But the results are not satisfactory enough to serve as the basis for word recognition. In [6], the authors address the problem by employing Gabor filters and then building a similarity model to measure the distance between characters in their text recognition framework. Maximally Stable Extremal Regions (MSER) are used in [7] to obtain an MSER mask and extract orientation features along the MSER boundary. In addition, an unsupervised feature learning system is proposed in [8], which uses a variant of k-means clustering to first build a dictionary and then map all character images to a new representation using the dictionary.
Recently, the classical HOG feature [9] has also been widely used for scene text recognition. As studied in [10], [11], [12], HOG outperforms almost all other features due to its robustness to illumination variation and its invariance to local geometric and photometric transformations. However, HOG is merely a statistic of gradient orientation within each block and does not sufficiently capture the spatial relationship of neighboring pixels. For example, two image patches with similar HOG features may look very different when their pixel locations are rearranged.

Fig. 2: Illustration of HOG feature extraction: (a) shows a character sample which is divided into 4 blocks (the blocks overlap with neighboring blocks in implementation). (b) shows the corresponding gradient orientation of each block. (c) shows the histograms of gradient orientation, which are concatenated one after another to form a HOG feature vector.
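The block-wise pipeline the caption describes (per-block orientation histograms, normalization, concatenation) can be sketched in a few lines. This is a minimal illustration under assumed parameters (non-overlapping 2 × 2 blocks, 8 bins), not the paper's implementation; classical HOG additionally overlaps blocks as the caption notes.

```python
import numpy as np

def hog_descriptor(image, blocks=(2, 2), bins=8):
    """Block-wise histogram of gradient orientations (simplified HOG sketch)."""
    gy, gx = np.gradient(image.astype(float))
    magnitude = np.hypot(gx, gy)
    # Quantize each pixel's gradient orientation into `bins` bins over [0, pi).
    orientation = np.mod(np.arctan2(gy, gx), np.pi)
    bin_idx = np.minimum((orientation / np.pi * bins).astype(int), bins - 1)

    h, w = image.shape
    bh, bw = h // blocks[0], w // blocks[1]
    features = []
    for by in range(blocks[0]):
        for bx in range(blocks[1]):
            sl = (slice(by * bh, (by + 1) * bh), slice(bx * bw, (bx + 1) * bw))
            # Magnitude-weighted histogram of orientation bins in this block.
            hist = np.bincount(bin_idx[sl].ravel(),
                               weights=magnitude[sl].ravel(), minlength=bins)
            # L2-normalize each block histogram to reduce illumination effects.
            features.append(hist / (np.linalg.norm(hist) + 1e-6))
    # Concatenate block histograms into one descriptor, as in Fig. 2(c).
    return np.concatenate(features)
```

With these settings a 32 × 32 character yields a 2 × 2 × 8 = 32-dimensional descriptor.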
Therefore, we propose to recognize scene text using an extension of HOG, namely co-occurrence HOG (Co-HOG) [13], which captures the gradient orientations of neighboring pixel pairs instead of single image pixels. Co-HOG divides the image into blocks with no overlap, which is more efficient than HOG with overlapping blocks [13]. This is essential in a real-time text recognition system. More importantly, both the relative location and the orientation of each neighboring pixel are considered, which describes the character shape more precisely. In addition, Co-HOG keeps the advantages of HOG, i.e., the robustness to varying illumination and local geometric transformations. Extensive tests show that Co-HOG significantly outperforms other feature descriptors for scene text recognition.

II. CO-OCCURRENCE OF HISTOGRAM OF ORIENTED GRADIENTS

Co-HOG is an extended Histogram of Oriented Gradients; it reduces to HOG when the offset is (0, 0), as illustrated later in this section. In this section, we first explain the general idea of HOG and then show how to extend it to Co-HOG for the scene text recognition task.

A. Histogram of Oriented Gradients

The HOG feature [9] was first proposed for the human detection task and later became a very popular feature in object detection. When extracting HOG features, the orientations of gradients are usually quantized into histogram bins, each bin covering an orientation range. An image is divided into overlapping blocks and, in each block, a histogram of oriented gradients falling into each bin is computed and then normalized to overcome illumination variation. The features from all blocks are then concatenated together to form a feature descriptor of the whole image. Fig. 2 illustrates the extraction process of the classical HOG feature. Due to its robustness to illumination variation and invariance to local geometric and photometric transformations, many scene text recognition works employ HOG for the recognition of texts in scenes. On the other hand, HOG captures the orientation of only isolated pixels, while the spatial information of neighboring pixels is ignored. Co-HOG instead captures more spatial information and is more powerful for scene text recognition, as discussed in the ensuing subsection.

B. Co-occurrence of Histogram of Oriented Gradients

Co-HOG captures spatial information by counting the frequency of co-occurrence of oriented gradients between pixel pairs, so that relative locations are preserved. The relative locations are reflected by the offset between two pixels, as shown in Fig. 3(a). The yellow pixel in the center is the pixel under study and the neighboring blue ones are pixels at different offsets. Each neighboring pixel in blue forms an orientation pair with the center yellow pixel and accordingly votes to the co-occurrence matrix, as illustrated in Fig. 3(b). Therefore, HOG is just a special case of Co-HOG where the offset is set to (0, 0), i.e., only the pixel under study is counted.

The frequency of co-occurrence of oriented gradients is captured at each offset via a co-occurrence matrix, as shown in Fig. 3(b). The co-occurrence matrix at a specific offset (x, y) accumulates, over all pixels, the pair of quantized gradient orientations at a pixel and at its (x, y)-offset neighbor.
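The counting scheme just described, with orientation pairs at each offset voting into a co-occurrence matrix and HOG recovered as the (0, 0) special case, can be sketched as follows. The block size, bin count, and offset set here are placeholders, not the paper's exact configuration (the actual offsets are those of Fig. 3(a)).

```python
import numpy as np

def cohog_block(bin_idx, offsets, bins=8):
    """Co-occurrence matrices of quantized gradient orientations for one block.

    `bin_idx` holds each pixel's quantized orientation (0..bins-1). For every
    offset (dx, dy), a bins x bins matrix counts how often the orientation
    pair (i, j) occurs at a pixel and its offset neighbor. With offset (0, 0)
    only the pixel itself is counted, i.e., the HOG special case.
    """
    h, w = bin_idx.shape
    feature = []
    for dx, dy in offsets:
        co = np.zeros((bins, bins), dtype=float)
        for y in range(h):
            for x in range(w):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w:
                    co[bin_idx[y, x], bin_idx[ny, nx]] += 1
        feature.append(co.ravel())  # vectorize, as in Fig. 3(c)
    # Concatenate the vectorized matrices of all offsets.
    return np.concatenate(feature)
```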
Fig. 3: Illustration of Co-HOG feature extraction: (a) illustrates the offsets used in Co-HOG. (b) shows the co-occurrence matrix of one block in Fig. 2(a). (c) shows the vectorization of the co-occurrence matrices, which are concatenated one after another to form the Co-HOG feature vector.

H(θ1, θ3) ← H(θ1, θ3) + M1 (1 − (α − θ1)/(θ2 − θ1)) + M2 (1 − (β − θ3)/(θ4 − θ3))
H(θ1, θ4) ← H(θ1, θ4) + M1 (1 − (α − θ1)/(θ2 − θ1)) + M2 (β − θ3)/(θ4 − θ3)
H(θ2, θ3) ← H(θ2, θ3) + M1 (α − θ1)/(θ2 − θ1) + M2 (1 − (β − θ3)/(θ4 − θ3))
H(θ2, θ4) ← H(θ2, θ4) + M1 (α − θ1)/(θ2 − θ1) + M2 (β − θ3)/(θ4 − θ3)    (2)
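A sketch of the voting rule in Eq. (2): an orientation α with magnitude M1 lying between bin centers θ1 and θ2, paired with an orientation β with magnitude M2 lying between θ3 and θ4, contributes to the four surrounding co-occurrence entries with linear weights. Variable names mirror the equation; evenly spaced, sorted bin centers and orientations strictly inside the center range are assumptions of this illustration.

```python
import numpy as np

def weighted_vote(H, alpha, M1, beta, M2, centers):
    """Distribute one pixel pair's vote over four co-occurrence entries (Eq. 2).

    `centers` are the sorted histogram-bin center angles. Each of the four
    entries receives M1 and M2 weighted linearly by how close alpha and beta
    are to that entry's bins, as in the update rules above.
    """
    def bracket(angle):
        # Index of the bin center just below `angle` (angle assumed strictly
        # between centers[0] and centers[-1]).
        i = int(np.searchsorted(centers, angle)) - 1
        return i, i + 1

    i1, i2 = bracket(alpha)
    j3, j4 = bracket(beta)
    wa = (alpha - centers[i1]) / (centers[i2] - centers[i1])  # (α−θ1)/(θ2−θ1)
    wb = (beta - centers[j3]) / (centers[j4] - centers[j3])   # (β−θ3)/(θ4−θ3)
    H[i1, j3] += M1 * (1 - wa) + M2 * (1 - wb)
    H[i1, j4] += M1 * (1 - wa) + M2 * wb
    H[i2, j3] += M1 * wa + M2 * (1 - wb)
    H[i2, j4] += M1 * wa + M2 * wb
    return H
```

Note that each vote adds a total of 2(M1 + M2) to the matrix, consistent with the additive form of the four update rules.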
TABLE I: Character recognition accuracy on the ICDAR and SVT datasets

...like book covers, road signs, brand logos and other texts randomly selected from various objects. The text font, size, illumination, color and texture correspondingly vary greatly. Take character size for example: the character width ranges from 1 to 589 pixels and the character height ranges from 10 to 898 pixels.

For the SVT dataset, only the testing part is annotated in [12] for character recognition, which contains about 3796 samples. Compared with ICDAR 2003, this dataset is more challenging: most of the characters are cropped from business boards and brand names taken from Google Street View. They usually have fancy fonts and low resolution, and often suffer from bad illumination, as illustrated in Fig. 1. In addition, we add the Chars74K dataset [5] for training. Thus the training dataset consists of the ICDAR 2003 training dataset and the Chars74K dataset, which have roughly 18,500 characters in total.

In the experiment, we resize each character to 32 × 32 pixels and then divide each into 4 × 4 blocks before feature extraction. After we get the Co-HOG features, a linear SVM classifier is trained with LIBLINEAR [14] and evaluated on the testing datasets.

Fig. 5: Some samples of the character recognition results. (a) Successfully recognized characters. (b) Wrongly recognized characters. First row: '5'(r), 'V'(N), 'P'(R), 'E'(f), 'P'(e), 'D'(A). Second row: 'T'(f), '4'(A), 'r'(I), 'n'(R), 'E'(L). Third row: 'R'(0), 'g'(q), 'l'(I), 'e'(l), 'A'(G), 'U'(i), 'G'(E). The characters in single quotation marks are our predicted ones while those in parentheses are the ground truths.
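The preprocessing described above, resizing each character to 32 × 32 pixels and splitting it into 4 × 4 non-overlapping blocks, can be sketched as follows. Nearest-neighbour resampling is an assumption made here for brevity; any standard resize would do, and the subsequent linear SVM training with LIBLINEAR [14] is not reproduced.

```python
import numpy as np

def prepare_character(image, size=32, grid=4):
    """Resize a character image to size x size (nearest-neighbour sampling)
    and split it into a grid x grid array of non-overlapping blocks."""
    h, w = image.shape
    # Nearest-neighbour index maps from target to source coordinates.
    ys = np.arange(size) * h // size
    xs = np.arange(size) * w // size
    resized = image[np.ix_(ys, xs)]
    b = size // grid
    # Reshape into (grid, grid, b, b): one b x b patch per block.
    blocks = resized.reshape(grid, b, grid, b).swapaxes(1, 2)
    return blocks
```

Feature extraction then runs per block, and the block features are concatenated before classification.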
...little character recognition accuracy has been reported on this dataset. The most recent work in [12] achieves an accuracy of 61.9%, which is much lower than our proposed method. If we ignore letter cases, the accuracy of our proposed method goes up to 80.6%.

Fig. 5 shows some challenging examples that are correctly recognized by our proposed method, together with some failure cases. As Fig. 5a shows, the proposed technique is capable of recognizing many challenging characters in scenes. At the same time, many failed cases, as illustrated in Fig. 5b, are difficult to read even for humans.

C. Discussion

We compute the confusion matrix on the two datasets and add them together, as shown in Fig. 6. As Fig. 6 shows, the mistakes concentrate on the confusing letter cases discussed earlier. Besides, the two most obvious mistakes are the misclassification between 'I' and 'l', and between '0' and 'O', which even human beings often fail to differentiate correctly.

Fig. 6: Confusion matrix of character recognition on the SVT and ICDAR datasets. There are 62 classes indicated by the numbers on the coordinates, representing '0-9a-zA-Z' respectively.

Certain recognition failures can be explained by several other factors. For example, some characters are mistakenly annotated in both the ICDAR and SVT datasets. Besides, there even exists a character of size 1 × 135 pixels in the ICDAR 2003 dataset because some characters are not cropped carefully. Scene text recognition could be greatly improved without these interfering factors.

V. CONCLUSION AND FUTURE WORK

Character recognition plays a crucial role in text recognition in scene images. We propose to use the co-occurrence histogram of oriented gradients (Co-HOG) with a weighted voting scheme for scene character recognition. Compared with the histogram of oriented gradients (HOG), Co-HOG captures more local spatial information while keeping the advantages of HOG, i.e., robustness to illumination variation and invariance to local geometric transformation. The results on both the ICDAR 2003 and SVT datasets greatly outperform all previous feature-descriptor-based methods. The small accuracy gap between those two very different datasets shows the power of Co-HOG in capturing the shape information of characters under different scenes.

In the future, we will investigate some global features and combine them with Co-HOG to formulate a more accurate scene character recognition technique.

REFERENCES

[1] X. Chen and A. Yuille, "Detecting and reading text in natural scenes," in Computer Vision and Pattern Recognition (CVPR), 2004 IEEE Computer Society Conference on, vol. 2. IEEE, 2004, pp. II-366.
[2] W. Niblack, An Introduction to Digital Image Processing. Strandberg Publishing Company, 1985.
[3] K. Kita and T. Wakahara, "Binarization of color characters in scene images using k-means clustering and support vector machines," in Proceedings of the 2010 20th International Conference on Pattern Recognition (ICPR '10). Washington, DC, USA: IEEE Computer Society, 2010, pp. 3183-3186. [Online]. Available: http://dx.doi.org/10.1109/ICPR.2010.779
[4] A. Mishra, K. Alahari, and C. Jawahar, "An MRF model for binarization of natural scene text," in Document Analysis and Recognition (ICDAR), 2011 International Conference on. IEEE, 2011, pp. 11-16.
[5] T. E. de Campos, B. R. Babu, and M. Varma, "Character recognition in natural images," in VISAPP (2), 2009, pp. 273-280.
[6] J. Weinman, E. Learned-Miller, and A. Hanson, "Scene text recognition using similarity and a lexicon with sparse belief propagation," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 31, no. 10, pp. 1733-1746, 2009.
[7] L. Neumann and J. Matas, "A method for text localization and recognition in real-world images," Computer Vision--ACCV 2010, pp. 770-783, 2011.
[8] A. Coates, B. Carpenter, C. Case, S. Satheesh, B. Suresh, T. Wang, D. Wu, and A. Ng, "Text detection and character recognition in scene images with unsupervised feature learning," in Document Analysis and Recognition (ICDAR), 2011 International Conference on. IEEE, 2011, pp. 440-445.
[9] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Computer Vision and Pattern Recognition (CVPR), 2005 IEEE Computer Society Conference on, vol. 1. IEEE, 2005, pp. 886-893.
[10] K. Wang and S. Belongie, "Word spotting in the wild," Computer Vision--ECCV 2010, pp. 591-604, 2010.
[11] K. Wang, B. Babenko, and S. Belongie, "End-to-end scene text recognition," in Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011, pp. 1457-1464.
[12] A. Mishra, K. Alahari, and C. Jawahar, "Top-down and bottom-up cues for scene text recognition," in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 2687-2694.
[13] T. Watanabe, S. Ito, and K. Yokoi, "Co-occurrence histograms of oriented gradients for human detection," Information and Media Technologies, vol. 5, no. 2, pp. 659-667, 2010.
[14] C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan, "A dual coordinate descent method for large-scale linear SVM," in Proceedings of the 25th International Conference on Machine Learning, 2008, pp. 408-415.
[15] M. A. Hearst, S. Dumais, E. Osman, J. Platt, and B. Scholkopf, "Support vector machines," Intelligent Systems and their Applications, IEEE, vol. 13, no. 4, pp. 18-28, 1998.
[16] T. Joachims, "Training linear SVMs in linear time," in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2006, pp. 217-226.
[17] S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, and R. Young, "ICDAR 2003 robust reading competitions," in Proceedings of the Seventh International Conference on Document Analysis and Recognition, vol. 2, 2003, pp. 682-687.
[18] ABBYY FineReader 10, http://www.abbyy.com/.