
2013 12th International Conference on Document Analysis and Recognition

Scene Text Recognition using Co-occurrence of Histogram of Oriented Gradients

Shangxuan Tian∗, Shijian Lu†, Bolan Su∗ and Chew Lim Tan∗


∗ Department of Computer Science, School of Computing, National University of Singapore
Computing 1, 13 Computing Drive, Singapore 117417
Email: {tians, subolan, tancl}@comp.nus.edu.sg
† Visual Computing Department, Institute for Infocomm Research
1 Fusionopolis Way, #21-01 Connexis, Singapore 138632
Email: slu@i2r.a-star.edu.sg

Abstract—Scene text recognition is a fundamental step in end-to-end applications where traditional optical character recognition (OCR) systems often fail to produce satisfactory results. This paper proposes a technique that uses the co-occurrence histogram of oriented gradients (Co-HOG) to recognize text in scenes. Compared with the histogram of oriented gradients (HOG), Co-HOG is a more powerful tool that captures the spatial distribution of neighboring orientation pairs instead of just a single gradient orientation. At the same time, it is more efficient than HOG and therefore more suitable for real-time applications. The proposed scene text recognition technique is evaluated on the ICDAR2003 character dataset and the Street View Text (SVT) dataset. Experiments show that the Co-HOG based technique clearly outperforms state-of-the-art techniques that use HOG, Scale Invariant Feature Transform (SIFT), and Maximally Stable Extremal Regions (MSER).

I. INTRODUCTION

Recognition of the text in natural scenes has attracted increasing research attention in recent years due to its crucial importance in scene understanding. It has become a very promising tool in different applications such as unmanned vehicle/robot navigation, living aids for visually impaired persons, content-based image retrieval, etc. Though optical character recognition (OCR) of scanned document images has achieved great success, recognition of scene text using existing OCR systems still leaves a large space for improvement due to a number of factors.

Fig. 1: Example characters taken from the ICDAR2003 (first and second rows) and SVT (third and fourth rows) datasets. First row: 'E', 'S', 'S', 'N', 'N', 'G', 'G', 'R', 'A'. Second row: 'A', 'A', 'f', 'M', 'H', 'R', 'T', 'T'. Third row: 'E', 'S', 'L', 'M', 'b', 'J', 'o', 'R', 'R'. Fourth row: 'P', 'M', 'K', 'E', 'h', 'M', 'T', 'A', 'n'.

First, unlike scanned document texts that usually lie over a "blank" document background with similar color, texture, and controlled lighting, scene texts often have a much more variable background that could have arbitrary color, texture, and lighting conditions as illustrated in Fig. 1. Second, unlike scanned document texts that are usually printed in some widely used text font and size, scene text could be captured in arbitrary size and printed in some fancy but infrequently used fonts as illustrated in Fig. 1. Even worse, the font of the scene text may change within a single word for the purpose of special visual effects or to attract human attention. Third, unlike scanned document texts that usually have a front-parallel view, scene texts captured from arbitrary viewpoints often suffer from perspective distortion as illustrated in Fig. 1. All these variations make OCR of scene texts a very challenging task. A robust OCR technique is urgently needed that is tolerant to the variations of scene texts as well as their background.

A number of scene text recognition techniques have been reported. The traditional approach first performs certain preprocessing such as binarization, slant correction, and perspective rectification before passing scene texts to existing OCR engines. Chen et al. [1] performed a variant of Niblack's adaptive binarization algorithm [2] on the detected text region before feeding it to OCR for recognition. An iterative binarization method is proposed in [3] that applies k-means clustering to a single character to produce a set of potential binarized characters; Support Vector Machines (SVM) are then used to measure the degree of character-likeness, and the candidate with the maximum character-likeness is selected as the optimal result. Recently, the Markov Random Field (MRF) model is adopted for binarization in [4], where an auto-seeding technique is proposed to first determine certain foreground and background pixel seeds and then use MRF to segment text and non-text regions.

Another way is to extract feature descriptors and then use classifiers for scene text recognition. The recent approaches first extract certain features from gray/color images and then train classifiers for scene text recognition. This approach is studied extensively in [5], where the scene text recognition performance is evaluated using different feature descriptors including Shape Contexts, Scale Invariant Feature Transform (SIFT), Geometric Blur, Maximum Response of filters, a patch descriptor, etc., in combination with a bag-of-words model. But the results are not satisfactory enough to serve as the basis for word recognition. In [6], the authors address the problem by employing Gabor filters and then building a similarity model to measure the distance between characters in their text recognition framework. Maximally Stable Extremal Regions (MSER) are used in [7] to get an MSER mask and extract orientation features along the MSER boundary. In addition, an unsupervised feature learning system is proposed in [8] that uses a variant of k-means clustering to first build a dictionary and then map all character images to a new representation using the dictionary.

Recently, the classical HOG feature [9] has also been widely used for scene text recognition. As studied in [10], [11], [12], HOG outperforms almost all other features due to its robustness to illumination variation and its invariance to local geometric and photometric transformations. However, HOG is just a statistic of gradient orientations in each block, which does not sufficiently capture the spatial relationship of neighboring pixels. For example, two image patches having similar HOG features may look very different when their pixel locations are rearranged.

Therefore, we propose to recognize scene text using an extension of HOG, namely co-occurrence HOG (Co-HOG) [13], which captures the gradient orientations of neighboring pixel pairs instead of a single image pixel. Co-HOG divides the image into blocks with no overlap, which is more efficient than HOG with overlapped blocks [13]. This is essential in a real-time text recognition system. More importantly, the relative location and orientation of each neighboring pixel are considered, which describes the character shape more precisely. In addition, Co-HOG keeps the advantages of HOG, i.e., robustness to varying illumination and local geometric transformations. Extensive tests show that Co-HOG significantly outperforms other feature descriptors for scene text recognition.

II. CO-OCCURRENCE OF HISTOGRAM OF ORIENTED GRADIENTS

Co-HOG is an extension of the Histogram of Oriented Gradients; it reduces to HOG when the offset is (0, 0), as illustrated later in this section. In this section, we first explain the general idea of HOG and then show how to extend it to Co-HOG for the scene text recognition task.

A. Histogram of Oriented Gradients

The HOG feature [9] was first proposed for the human detection task and later became a very popular feature in the object detection area. When extracting HOG features, the orientations of gradients are usually quantized into histogram bins and each bin covers an orientation range. An image is divided into overlapping blocks and, in each block, a histogram of oriented gradients falling into each bin is computed and then normalized to overcome illumination variation. The features from all blocks are then concatenated together to form a feature descriptor of the whole image. Fig. 2 illustrates the extraction process of the classical HOG feature. Due to its robustness to illumination variation and invariance to local geometric and photometric transformations, many scene text recognition works employ HOG for the recognition of texts in scenes. On the other hand, HOG captures the orientation of only isolated pixels, whereas the spatial information of neighboring pixels is ignored. Co-HOG instead captures more spatial information and is more powerful for scene text recognition, as discussed in the ensuing subsection.

Fig. 2: Illustration of HOG feature extraction: (a) shows a character sample which is divided into 4 blocks (the blocks overlap with neighboring blocks in implementation). (b) shows the corresponding gradient orientation of each block. (c) shows the histogram of gradient orientation of each block, concatenated one after another to form a HOG feature vector.
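To make the HOG baseline concrete, the following is a minimal sketch of the per-block histogram and concatenation described above (simplified with non-overlapping blocks and plain L2 normalization; all function and variable names are ours, not from the paper).

```python
import numpy as np

def hog_descriptor(magnitude, orientation, grid=(4, 4), n_bins=9):
    """Simplified HOG: per-block orientation histograms, L2-normalized and concatenated.

    magnitude, orientation : per-pixel gradient magnitude and unsigned orientation
                             (0-180 degrees), computed e.g. with a Sobel filter.
    """
    h, w = magnitude.shape
    bh, bw = h // grid[0], w // grid[1]
    bin_width = 180.0 / n_bins
    blocks = []
    for by in range(grid[0]):
        for bx in range(grid[1]):
            m = magnitude[by * bh:(by + 1) * bh, bx * bw:(bx + 1) * bw]
            o = orientation[by * bh:(by + 1) * bh, bx * bw:(bx + 1) * bw]
            bins = np.minimum((o / bin_width).astype(int), n_bins - 1)
            hist = np.bincount(bins.ravel(), weights=m.ravel(), minlength=n_bins)
            hist = hist / (np.linalg.norm(hist) + 1e-12)  # normalize against illumination
            blocks.append(hist)
    return np.concatenate(blocks)  # 4*4*9 = 144 dimensions for a 32x32 character
```

The classical HOG of [9] additionally uses overlapping blocks and block-level contrast normalization; those details are omitted in this sketch.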
B. Co-occurrence of Histogram of Oriented Gradients

Co-HOG captures spatial information by counting the frequency of co-occurrence of oriented gradients between pixel pairs, so relative locations are preserved. The relative locations are reflected by the offset between two pixels as shown in Fig. 3(a). The yellow pixel in the center is the pixel under study and the neighboring blue ones are pixels with different offsets. Each neighboring pixel in blue forms an orientation pair with the center yellow pixel and accordingly votes to the co-occurrence matrix as illustrated in Fig. 3(b). Therefore, HOG is just a special case of Co-HOG in which the offset is set to (0, 0), i.e., only the pixel under study is counted.

The frequency of co-occurrence of oriented gradients is captured at each offset via a co-occurrence matrix as shown in Fig. 3(b).
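As one concrete reading of the offset layout (Fig. 3(a) is not reproduced here), the sketch below enumerates 24 offsets covering half of a 7 × 7 neighborhood, which matches the 24 co-occurrence matrices mentioned below; the exact offset pattern used in the paper may differ.

```python
# Hypothetical offset set: half of a 7x7 neighborhood around the center pixel,
# excluding (0, 0).  Using only one half avoids counting each pixel pair twice
# and yields exactly 24 offsets, matching the 24 co-occurrence matrices in the
# text.  The true layout is defined by Fig. 3(a) of the paper.
def cohog_offsets(radius=3):
    offsets = []
    for dy in range(0, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx <= 0:
                continue  # skip (0, 0) and the mirrored half of the row dy == 0
            offsets.append((dx, dy))
    return offsets

assert len(cohog_offsets()) == 24  # 3 + 3 * 7 = 24
```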

Fig. 3: Illustration of Co-HOG feature extraction: (a) illustrates the offsets used in Co-HOG. (b) shows the co-occurrence matrix of one block in Fig. 2(a). (c) shows the vectorization of the co-occurrence matrices, concatenated one after another to form the Co-HOG feature vector.

The co-occurrence matrix at a specific offset (x, y) is given by:

\[
H_{x,y}(i, j) = \sum_{(p,q) \in B} \begin{cases} 1 & \text{if } O(p, q) = i \text{ and } O(p+x, q+y) = j \\ 0 & \text{otherwise} \end{cases} \tag{1}
\]

where H_{x,y} is the co-occurrence matrix at offset (x, y), which is a square matrix whose dimension is decided by the number of orientation bins. Therefore, we will have 24 co-occurrence matrices for the offsets illustrated in Fig. 3(a). O is the gradient orientation of the input image I and B is a block in the image. Equation 1 thus computes the co-occurrence matrix in a block, and Fig. 3(b) shows an example. The Co-HOG feature descriptor of an image can then be constructed by vectorizing and concatenating the co-occurrence matrices of all blocks of the image under study.
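A direct NumPy transcription of Equation 1 for a single block and a single offset might look as follows (the names and the boundary handling are our own, since the paper does not spell them out).

```python
import numpy as np

def cooccurrence_matrix(orientation_bins, offset, n_bins=9):
    """Unweighted co-occurrence matrix (Equation 1) for one block.

    orientation_bins : 2-D integer array of quantized gradient orientations
                       (values in 0..n_bins-1) for a single block B.
    offset           : (x, y) offset between the two pixels of a pair.
    """
    x, y = offset
    h, w = orientation_bins.shape
    H = np.zeros((n_bins, n_bins), dtype=np.int32)
    for q in range(h):
        for p in range(w):
            p2, q2 = p + x, q + y
            # One possible boundary handling: ignore pairs falling outside the block.
            if 0 <= p2 < w and 0 <= q2 < h:
                H[orientation_bins[q, p], orientation_bins[q2, p2]] += 1
    return H
```

With 9 orientation bins, every offset contributes a 9 × 9 matrix, i.e. 81 values per block; if all 24 offsets are used in each of the 16 blocks of Section IV, vectorizing and concatenating them yields a 16 × 24 × 81 = 31,104-dimensional descriptor.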
The Co-HOG feature extraction process can be summarized in the following three steps.

1) Gradient Magnitude and Orientation Computation: The gradient magnitude is computed as the L2 norm of the horizontal and vertical gradients computed by the Sobel filter. For color images, the gradient is computed separately for each color channel and the one with the maximum magnitude is used. The gradient orientation ranges between 0° and 180° (unsigned gradient) and is quantized into 9 orientation bins.
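A sketch of this step follows, using OpenCV's Sobel operator; the function and variable names are ours rather than the paper's.

```python
import cv2
import numpy as np

def gradient_orientation_and_magnitude(image, n_bins=9):
    """Step 1: per-pixel gradient magnitude, unsigned orientation and orientation bin.

    For color images the gradient is computed per channel and, at every pixel,
    the channel with the largest magnitude is kept, as described in the paper.
    """
    img = image.astype(np.float32)
    if img.ndim == 2:
        img = img[:, :, None]
    best_mag = np.zeros(img.shape[:2], dtype=np.float32)
    best_ori = np.zeros(img.shape[:2], dtype=np.float32)
    for c in range(img.shape[2]):
        gx = cv2.Sobel(img[:, :, c], cv2.CV_32F, 1, 0, ksize=3)
        gy = cv2.Sobel(img[:, :, c], cv2.CV_32F, 0, 1, ksize=3)
        mag = np.sqrt(gx ** 2 + gy ** 2)              # L2 norm of the two Sobel responses
        ori = np.rad2deg(np.arctan2(gy, gx)) % 180.0  # unsigned gradient orientation
        stronger = mag > best_mag
        best_mag[stronger] = mag[stronger]
        best_ori[stronger] = ori[stronger]
    bin_width = 180.0 / n_bins
    bins = np.minimum((best_ori / bin_width).astype(np.int32), n_bins - 1)
    return best_mag, best_ori, bins
```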
2) Weighted Voting: The original Co-HOG is computed without weighting, as specified in Equation 1 [13], which by itself cannot reflect the difference between strong-gradient and weak-gradient pixels. We propose to add a weighting mechanism based on the gradient magnitude, where bi-linear interpolation is employed to vote between the two neighboring orientation bins. Equation 2 shows how the weighting of gradient magnitude and orientation bin is combined, and Fig. 4 gives a simple illustration.

Fig. 4: Bi-linear interpolation of the weighted magnitude

\[
\begin{aligned}
H(\theta_1, \theta_3) &\leftarrow H(\theta_1, \theta_3) + M_1\left(1 - \frac{\alpha - \theta_1}{\theta_2 - \theta_1}\right) + M_2\left(1 - \frac{\beta - \theta_3}{\theta_4 - \theta_3}\right) \\
H(\theta_1, \theta_4) &\leftarrow H(\theta_1, \theta_4) + M_1\left(1 - \frac{\alpha - \theta_1}{\theta_2 - \theta_1}\right) + M_2\,\frac{\beta - \theta_3}{\theta_4 - \theta_3} \\
H(\theta_2, \theta_3) &\leftarrow H(\theta_2, \theta_3) + M_1\,\frac{\alpha - \theta_1}{\theta_2 - \theta_1} + M_2\left(1 - \frac{\beta - \theta_3}{\theta_4 - \theta_3}\right) \\
H(\theta_2, \theta_4) &\leftarrow H(\theta_2, \theta_4) + M_1\,\frac{\alpha - \theta_1}{\theta_2 - \theta_1} + M_2\,\frac{\beta - \theta_3}{\theta_4 - \theta_3}
\end{aligned} \tag{2}
\]

where H is the co-occurrence matrix at a specific offset as defined in Equation 1, M1 is the gradient magnitude at location (p, q) and α is its corresponding gradient orientation, and M2 is the gradient magnitude at location (p + x, q + y) with corresponding gradient orientation β. θ1 and θ2 denote the neighboring orientation bin centers of α, and similarly θ3 and θ4 for β.

In this weighting scheme, a pixel with a very small gradient value could still receive a fairly large weight if its pair pixel has a large gradient value. To avoid such situations, we do not count pixel pairs when at least one of them has a very small gradient value.
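The sketch below implements the weighted vote of Equation 2 for a single pixel pair; the bin centers, the small-gradient threshold and all names are our own choices, since the paper does not specify them.

```python
import numpy as np

def neighboring_bin_centers(angle, n_bins=9):
    """Return the two neighboring bin centers (lower, upper) of an unsigned angle."""
    bin_width = 180.0 / n_bins                      # 20 degrees for 9 bins
    lower = np.floor((angle - bin_width / 2.0) / bin_width) * bin_width + bin_width / 2.0
    return lower, lower + bin_width

def weighted_vote(H, m1, alpha, m2, beta, n_bins=9, min_magnitude=1e-3):
    """Add one pixel pair's weighted vote (Equation 2) into co-occurrence matrix H.

    m1, alpha : magnitude and orientation of the pixel at (p, q)
    m2, beta  : magnitude and orientation of the pixel at (p + x, q + y)
    Pairs where either gradient is very small are skipped, as in the paper.
    """
    if m1 < min_magnitude or m2 < min_magnitude:
        return
    bin_width = 180.0 / n_bins
    t1, t2 = neighboring_bin_centers(alpha, n_bins)
    t3, t4 = neighboring_bin_centers(beta, n_bins)
    wa = (alpha - t1) / (t2 - t1)                   # fraction of alpha towards its upper bin
    wb = (beta - t3) / (t4 - t3)
    def bin_index(center):
        # map a bin center back to its bin index, wrapping at 180 degrees
        return int(round((center - bin_width / 2.0) / bin_width)) % n_bins
    i1, i2 = bin_index(t1), bin_index(t2)
    j1, j2 = bin_index(t3), bin_index(t4)
    H[i1, j1] += m1 * (1.0 - wa) + m2 * (1.0 - wb)
    H[i1, j2] += m1 * (1.0 - wa) + m2 * wb
    H[i2, j1] += m1 * wa + m2 * (1.0 - wb)
    H[i2, j2] += m1 * wa + m2 * wb
```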
3) Feature Vector Construction: The obtained block features are first normalized with the L2 normalization method. The Co-HOG feature descriptor of the whole image under study can then be constructed by concatenating all the normalized block features.

III. SCENE TEXT RECOGNITION

Characters in scenes can thus be recognized by training a classifier based on the Co-HOG descriptors described in the last section. In our implemented system, a linear SVM classifier is trained using LIBLINEAR [14], which is much faster but with similar performance compared with LIBSVM [15] and SVMLight [16]. We train the SVM classifier using up to 18,500 character images, as discussed in the next section.
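As an illustration, the sketch below assembles the per-block Co-HOG descriptor and trains a linear classifier; we use scikit-learn's LinearSVC (which wraps LIBLINEAR) as a stand-in, and the helper functions are the hypothetical ones sketched earlier.

```python
import numpy as np
from sklearn.svm import LinearSVC

def cohog_descriptor(image, grid=(4, 4), n_bins=9):
    """Concatenate L2-normalized Co-HOG block features for one character image.

    Relies on the earlier sketches: gradient_orientation_and_magnitude(),
    cohog_offsets() and the per-pair weighted_vote().
    """
    mag, ori, _ = gradient_orientation_and_magnitude(image, n_bins)
    h, w = mag.shape
    bh, bw = h // grid[0], w // grid[1]
    offsets = cohog_offsets()
    blocks = []
    for by in range(grid[0]):
        for bx in range(grid[1]):
            ys, xs = by * bh, bx * bw
            feat = []
            for off in offsets:
                H = np.zeros((n_bins, n_bins), dtype=np.float32)
                for q in range(ys, ys + bh):
                    for p in range(xs, xs + bw):
                        p2, q2 = p + off[0], q + off[1]
                        if xs <= p2 < xs + bw and ys <= q2 < ys + bh:
                            weighted_vote(H, mag[q, p], ori[q, p],
                                          mag[q2, p2], ori[q2, p2], n_bins)
                feat.append(H.ravel())
            block = np.concatenate(feat)
            block /= (np.linalg.norm(block) + 1e-12)   # L2 normalization per block
            blocks.append(block)
    return np.concatenate(blocks)

# Training sketch: X is a list of 32x32 character crops, y their class labels.
# clf = LinearSVC(C=1.0).fit(np.stack([cohog_descriptor(im) for im in X]), y)
```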

IV. EXPERIMENTAL RESULTS

A. Datasets

We evaluate our method on the ICDAR 2003 [17] and SVT [11] datasets. The ICDAR 2003 character dataset has about 6100 characters for training and 5400 characters for testing. The characters are collected from a wide range of scenes, such as book covers, road signs, brand logos and other texts randomly selected from various objects. The text font, size, illumination, color and texture accordingly vary greatly. Take character size for example: the character width ranges from 1 to 589 pixels and the character height ranges from 10 to 898 pixels.

For the SVT dataset, only the testing part is annotated in [12] for character recognition, which contains about 3796 samples. Compared with ICDAR 2003, this dataset is more challenging, as most of the characters are cropped from business boards and brand names taken from Google Street View. They are usually of fancy fonts, low resolution and often suffer from bad illumination as illustrated in Fig. 1. In addition, we add the Chars74K dataset [5] for training. Thus the training dataset consists of the ICDAR 2003 training dataset and the Chars74K dataset, which have roughly 18,500 characters in total.

TABLE I: Character recognition accuracy on the ICDAR and SVT datasets

Method                              | ICDAR | SVT
ABBYY FineReader 10 [18]            | 26.6% | 15.4%
GB+NN [5]                           | 41.0% | -
HOG+NN [10]                         | 51.5% | -
NATIVE+FERNS [11]                   | 64.0% | -
MSER [7]                            | 67.0% | -
HOG+SVM [12]                        | -     | 61.9%
Proposed Co-HOG                     | 79.4% | 75.4%
Proposed Co-HOG (Case Insensitive)  | 83.6% | 80.6%

Fig. 5: Some samples of the character recognition results. (a) Successfully recognized characters. (b) Wrongly recognized characters. First row: '5'(r), 'V'(N), 'P'(R), 'E'(f), 'P'(e), 'D'(A). Second row: 'T'(f), '4'(A), 'r'(I), 'n'(R), 'E'(L). Third row: 'R'(0), 'g'(q), 'l'(I), 'e'(l), 'A'(G), 'U'(i), 'G'(E). The characters in single quotation marks are our predicted ones while those in parentheses are the ground truths.

In the experiment, we resize each character to 32 × 32 pixels and then divide each into 4 × 4 blocks before feature extraction. After we get the Co-HOG features, a linear SVM classifier is trained with LIBLINEAR [14] and evaluated on the testing datasets.
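For concreteness, a minimal evaluation loop matching the setup above might look as follows; the dataset-loading helper is hypothetical, and cohog_descriptor is the sketch from Section III.

```python
import cv2
import numpy as np
from sklearn.svm import LinearSVC

# load_characters() is a hypothetical helper returning (images, labels);
# in practice the ICDAR 2003 / Chars74K / SVT annotation files must be parsed.
train_imgs, train_labels = load_characters("icdar2003_train+chars74k")
test_imgs, test_labels = load_characters("icdar2003_test")

def features(imgs):
    # Resize every character crop to 32x32 before Co-HOG extraction (4x4 blocks of 8x8).
    return np.stack([cohog_descriptor(cv2.resize(im, (32, 32))) for im in imgs])

clf = LinearSVC(C=1.0).fit(features(train_imgs), train_labels)
accuracy = np.mean(clf.predict(features(test_imgs)) == test_labels)
print(f"character recognition accuracy: {accuracy:.1%}")
```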

B. Scene Text Recognition Accuracy


The ICDAR 2003 test dataset altogether has 5430 characters. We exclude those that do not belong to the 62 classes (52 upper- and lower-case English letters plus 10 digits) and thus have 5379 characters left. The trained linear SVM is first tested on this dataset.

Experimental results in Table I show that our proposed method outperforms all previous feature descriptors with an accuracy of 79.4%, while FineReader 10 [18] gets the worst accuracy at 26.6%, largely due to the fact that FineReader was designed for document text recognition. The result of GB+NN (41.0%), trained on the Chars74K dataset, is the best-performing one reported in [5]. The character recognition accuracy on the ICDAR dataset is not given for the HOG+SVM method reported in [12]. The method in [8] reports an accuracy of 81.7%; on the other hand, that method uses a huge amount of training data that is not available to the public. More importantly, the testing dataset in [8] only consists of 5198 instead of 5379 characters, because it re-crops square character patches from the original dataset and ignores those characters along the boundary that are impossible to fit in a square bounding box.

Text case identification is a very challenging task for scene text recognition because scene texts usually lack specific document layout information. For scene texts, some upper-case and lower-case letters like 'C' and 'c', 'S' and 's', 'O' and 'o' are extremely difficult to distinguish, even for human beings. In fact, letter cases are often annotated incorrectly in the ground truth dataset. Therefore, we also show the case-insensitive result, which further increases the scene text recognition accuracy of the proposed technique to 83.6%.

In addition, the accuracy of our proposed method on the SVT dataset is 75.4%, which is only 4% lower than that on ICDAR. This gap may be further reduced by retraining the SVM with 52 classes, because there are no digits in the SVT dataset. The above comparison to some degree shows the superiority of our proposed method, which works comparatively well even on a very different dataset.

Currently, little character recognition accuracy has been reported on this dataset. The most recent work in [12] achieves an accuracy of 61.9%, which is much lower compared with our proposed method. If we ignore letter cases, the accuracy of our proposed method goes up to 80.6%.

Fig. 5 shows some challenging examples that are correctly recognized by our proposed method as well as some failure cases. As Fig. 5a shows, the proposed technique is capable of recognizing many challenging characters in scenes. At the same time, many failed cases as illustrated in Fig. 5b are difficult to read even for humans.

C. Discussion

We compute the confusion matrices on the two datasets and add them together as shown in Fig. 6. As Fig. 6 shows, the mistakes concentrate on those confusing letter cases discussed earlier. Besides, the two most obvious mistakes are the misclassification between 'I' and 'l', and between '0' and 'O', which even human beings often fail to differentiate correctly.

Fig. 6: Confusion matrix of character recognition on the SVT and ICDAR datasets. There are 62 classes indicated by the numbers on the coordinates, which represent '0-9a-zA-Z' respectively.
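A sketch of how such a combined confusion matrix can be produced (using scikit-learn; the classifier and feature code are the earlier hypothetical sketches):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Class order '0-9a-zA-Z' as in the caption of Fig. 6.
CLASSES = list("0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")

# pred_icdar / pred_svt are label predictions from the trained LinearSVC,
# gt_icdar / gt_svt the corresponding ground-truth labels.
cm_icdar = confusion_matrix(gt_icdar, pred_icdar, labels=CLASSES)
cm_svt = confusion_matrix(gt_svt, pred_svt, labels=CLASSES)
combined = cm_icdar + cm_svt        # summed 62x62 confusion matrix as in Fig. 6
```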
Certain recognition failures can be explained by several other factors. For example, some characters are mistakenly annotated in both the ICDAR and SVT datasets. Besides, there even exists a character of size 1 × 135 pixels in the ICDAR 2003 dataset because some characters are not cropped carefully. The scene text recognition could be greatly improved without these interfering factors.

V. CONCLUSION AND FUTURE WORK

Character recognition plays a crucial role in text recognition in scene images. We propose to use the co-occurrence histogram of oriented gradients (Co-HOG) with a weighted voting scheme for scene character recognition. Compared with the histogram of oriented gradients (HOG), Co-HOG captures more local spatial information but at the same time keeps the advantages of HOG, i.e., robustness to illumination variation and invariance to local geometric transformations. The results on both the ICDAR 2003 and SVT datasets greatly outperform all the previous feature descriptor based methods. The small accuracy gap between those two very different datasets shows the power of Co-HOG in capturing the shape information of characters under different scenes.

In the future, we will investigate some global features and combine them with Co-HOG to formulate a more accurate scene character recognition technique.

REFERENCES

[1] X. Chen and A. Yuille, "Detecting and reading text in natural scenes," in Computer Vision and Pattern Recognition (CVPR), 2004 IEEE Computer Society Conference on, vol. 2. IEEE, 2004, pp. II-366.
[2] W. Niblack, An introduction to digital image processing. Strandberg Publishing Company, 1985.
[3] K. Kita and T. Wakahara, "Binarization of color characters in scene images using k-means clustering and support vector machines," in Proceedings of the 20th International Conference on Pattern Recognition (ICPR), 2010, pp. 3183-3186. [Online]. Available: http://dx.doi.org/10.1109/ICPR.2010.779
[4] A. Mishra, K. Alahari, and C. Jawahar, "An MRF model for binarization of natural scene text," in Document Analysis and Recognition (ICDAR), 2011 International Conference on. IEEE, 2011, pp. 11-16.
[5] T. E. de Campos, B. R. Babu, and M. Varma, "Character recognition in natural images," in VISAPP (2), 2009, pp. 273-280.
[6] J. Weinman, E. Learned-Miller, and A. Hanson, "Scene text recognition using similarity and a lexicon with sparse belief propagation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 10, pp. 1733-1746, 2009.
[7] L. Neumann and J. Matas, "A method for text localization and recognition in real-world images," Computer Vision - ACCV 2010, pp. 770-783, 2011.
[8] A. Coates, B. Carpenter, C. Case, S. Satheesh, B. Suresh, T. Wang, D. Wu, and A. Ng, "Text detection and character recognition in scene images with unsupervised feature learning," in Document Analysis and Recognition (ICDAR), 2011 International Conference on. IEEE, 2011, pp. 440-445.
[9] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Computer Vision and Pattern Recognition (CVPR), 2005 IEEE Computer Society Conference on, vol. 1. IEEE, 2005, pp. 886-893.
[10] K. Wang and S. Belongie, "Word spotting in the wild," Computer Vision - ECCV 2010, pp. 591-604, 2010.
[11] K. Wang, B. Babenko, and S. Belongie, "End-to-end scene text recognition," in Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011, pp. 1457-1464.
[12] A. Mishra, K. Alahari, and C. Jawahar, "Top-down and bottom-up cues for scene text recognition," in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 2687-2694.
[13] T. Watanabe, S. Ito, and K. Yokoi, "Co-occurrence histograms of oriented gradients for human detection," Information and Media Technologies, vol. 5, no. 2, pp. 659-667, 2010.
[14] C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan, "A dual coordinate descent method for large-scale linear SVM," in Proceedings of the 25th International Conference on Machine Learning, 2008, pp. 408-415.
[15] M. A. Hearst, S. Dumais, E. Osman, J. Platt, and B. Scholkopf, "Support vector machines," IEEE Intelligent Systems and their Applications, vol. 13, no. 4, pp. 18-28, 1998.
[16] T. Joachims, "Training linear SVMs in linear time," in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2006, pp. 217-226.
[17] S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, and R. Young, "ICDAR 2003 robust reading competitions," in Proceedings of the Seventh International Conference on Document Analysis and Recognition, vol. 2, 2003, pp. 682-687.
[18] ABBYY FineReader 10, http://www.abbyy.com/.
