
Scene Text Extraction and Translation for Handheld Devices

Ismail Haritaoglu
IBM Almaden Research Center, San Jose, CA 95120, USA
(ismailh@almaden.ibm.com)

Abstract
We describe a scene text extraction system for handheld devices that provides enhanced information perception services to the user. It uses a color camera attached to a personal digital assistant as an input device to capture scene images from the real world, and it employs image enhancement and segmentation methods to extract written information from the scene, convert it to text, and show it to the user, so that the user can see both the real world and the information together. We implemented a prototype application: automatic sign/text language translation for foreign travelers, with which people can see text or signs in their own language whenever they want, even when the text is originally written in a foreign language in the scene.

Figure 1: Examples of application for scene text extraction: foreign text translation

1 Introduction

The real physical world and the digital information space are coming closer and closer as wireless communication, mobile display, and camera technologies continue to advance, introducing new research areas such as augmented reality and location-aware pervasive computing. An exciting array of new opportunities and applications emerges as small and inexpensive wearable computers and pervasive devices become widely available and accepted [1, 2, 3, 13, 9, 8, 12]. These devices introduce new concepts of how users interact with the real world, such as interaction with objects and buildings. Recent trends in multimedia communication and pervasive computing are pushing mobile devices, such as personal digital assistants (PDAs) and cellular phones, to have built-in cameras and global positioning devices. Those trends show that in the near future such devices will have built-in camera, GPS, and wireless communication modules. We are conducting exploratory research to provide enhanced information perception services, which is becoming a desirable and profitable way to interact with information in the real world as wearable and handheld devices become a part of daily life. We implemented an information augmentation system for handheld/wearable devices that extracts information from images captured by the PDA's built-in camera and augments 2D information onto the real scene; it does not require very precise registration between the real world and the virtual world. The prototype application is an automatic sign/text translation for foreign travelers.

In this paper, we describe the computer vision techniques used to extract scene text information from images captured by a camera attached to a PDA. We developed a prototype with which people can interact with text in the scene. The color camera captures a snapshot of the real world as the user sees it and brings it into the digital world, so that people can see texts or signs in their own language even when they are written in a foreign language.

The system integrates a symmetric neighborhood filter (SNF) to enhance the image, hierarchical connected components to segment the characters from the background in the scene, and a wavelet transform to verify the text region and obtain a binary image which contains the characters only. The system then uses off-the-shelf OCR to recognize the characters written in the foreign language. After that, it translates them into the desired language, and the translated text is superimposed (augmented) back onto the same location in the scene image where the original foreign text is, so that the user can see the translated text as shown in Figure 1 (top row).
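For orientation, the stages just listed can be approximated today with off-the-shelf components. The sketch below is not the authors' system: it substitutes OpenCV's mean-shift smoothing and adaptive thresholding for the SNF enhancement and hierarchical segmentation described later, and the open-source Tesseract engine (via pytesseract) for the IBM recognition engines; the parameter values are arbitrary placeholders.

    import cv2
    import pytesseract

    def rough_pipeline(region_bgr):
        """Stand-in pipeline: smooth, binarize, and OCR a user-selected region."""
        # Edge-preserving smoothing as a stand-in for the SNF enhancement step.
        smoothed = cv2.pyrMeanShiftFiltering(region_bgr, 10, 30)
        gray = cv2.cvtColor(smoothed, cv2.COLOR_BGR2GRAY)
        # Local thresholding as a stand-in for the hierarchical segmentation.
        binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                                       cv2.THRESH_BINARY, 31, 10)
        # Off-the-shelf OCR in place of the recognition engines used in the paper.
        return pytesseract.image_to_string(binary)

A translation step (a dictionary lookup or an external translation service) would then be applied to the recognized string before it is overlaid on the scene image.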



There has been much research on text detection and recognition over a long period in the OCR area, where the text appears in printed documents with a uniform background and uniform characters. However, detection and recognition of text in video/scene images is a relatively new area in computer science, explored mainly in content-based retrieval research [4, 6, 11, 7, 5]. It is a different domain than document image analysis and requires different approaches because of the non-uniform background and the varied colors, sizes, and fonts of the characters in the text. The text detection and recognition problem in video/scene images can be divided into three main sub-problems: locating text in video images, segmenting the characters in the text area, and converting the segmented characters into ASCII text (recognition). Each sub-problem is a difficult computer vision problem, and research is ongoing to develop new heuristics and algorithms to solve them. In this paper, we focus on extracting the text from images given that the approximate location of the text in the image is known. Text in images appears as either scene text [7, 5] or superimposed text. Most previous content-based image retrieval systems have been designed to extract superimposed text, which is relatively easier to detect, such as captions, commercials, and tickers in news segments, where it can be used to index into video for searching [4, 6, 11]. In our system, we extract the text information from scene images where the text is a part of the real-world scene, such as warning or information signs, price tags, and advertisements on billboards. The assumption we make is that text regions appear in the scene sufficiently large that we can recognize the characters with commercially available OCR engines. As the system is interactive, the allowed response and processing time is limited, and the processing power of a PDA is very limited, the user selects the region of the scene image which contains the text; the characters are then extracted automatically using image enhancement and segmentation methods. Each selected region is also verified as to whether it contains text, using a wavelet-transform-based edge density computation. That allows us to ease the text detection problem and to segment the characters in the user-selected box into a black/white binary image which we can send to character recognition engines.

1.1 System Architecture

As commercially available PDA devices have memory and processing power limitations, the current implementation of the system works on a client/server architecture. The client side consists of a color camera attached to a personal digital organizer (a Cassiopeia E-125 with 32 MB of memory and a 150 MHz MIPS processor, running the Windows Pocket PC 3.0 OS, with a Casio JK-710DC color digital camera), which people point at the scene to capture the image; a wireless connection through a GSM modem (Nokia 8190 GSM phone), which allows the client to send the partial image and other information to the server; and a Global Positioning System receiver (Pharos Nav-180 GPS), which provides accurate location information. The server side consists of a server that gives the system processing power and Internet connectivity.
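The client/server split can be illustrated with a minimal sketch. The paper does not specify the wire protocol used over the GSM link, so the framing below (a 4-byte length prefix around a JPEG payload and a UTF-8 reply), the host name, and the port are assumptions made only for illustration.

    import socket
    import struct

    def recv_exact(sock, n: int) -> bytes:
        """Read exactly n bytes from the socket."""
        buf = b""
        while len(buf) < n:
            chunk = sock.recv(n - len(buf))
            if not chunk:
                raise ConnectionError("socket closed early")
            buf += chunk
        return buf

    def send_msg(sock, data: bytes):
        # Hypothetical framing: 4-byte big-endian length prefix, then the payload.
        sock.sendall(struct.pack(">I", len(data)) + data)

    def recv_msg(sock) -> bytes:
        size = struct.unpack(">I", recv_exact(sock, 4))[0]
        return recv_exact(sock, size)

    def translate_region(jpeg_bytes: bytes, host="server.example.com", port=9000) -> str:
        """Client side: ship the cropped scene region, get the translated text back."""
        with socket.create_connection((host, port)) as sock:
            send_msg(sock, jpeg_bytes)
            return recv_msg(sock).decode("utf-8")

The server counterpart would accept the connection, run the extraction and translation steps on its side, and return the resulting text with the same framing.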
Figure 2: Image enhancement and segmentation flow

2 Scene Text Extraction

After the user selects the rectangular region, the sub-image is enhanced by an SNF filter, and hierarchical connected components are applied to the enhanced image to segment the characters from the background. Each segment is then verified as to whether it contains a character or not, using a wavelet-based edge density measure. The following subsections explain the computer vision techniques used in the system.

2.1 Character Segmentation

Scene images have significant variability in color. Noise and illumination changes can cause a gradient of color levels in pixels across the same region of a character, which makes character segmentation difficult. Also, because of the relative motion of the camera while capturing scene images, and aperture effects, edges and regions are usually blurred, so that the transition in color levels between regions is not a perfect step over a single pixel but changes gradually from one region to the other over several pixels. Therefore, we first enhance the scene images using a filter, the Symmetric Neighborhood Filter (SNF) [15], that smoothes out the interior pixels of a region. The SNF is an edge-preserving filter which can detect blurred transitions and sharpen them while preserving the true border location. Most preprocessing filters either smooth the interior of regions at the cost of degrading the edges, or conversely, sharpen the edges while introducing interior error in previously homogeneous regions.

Figure 3: Segmentation results: original image (top left), SNF-enhanced image (top right), connected components (left), and the border of each region.

However, the SNF is an edge-preserving smoothing filter which performs well for both edge sharpening and region smoothing. After the image is enhanced by the SNF, we use a hierarchical connected components (HCC) algorithm to combine similar pixels into homogeneously labeled regions so that we can segment the characters in the real scene.

2.1.1 Symmetric Neighborhood Filter based Image Enhancement

Due to noise and blur, the color regions of the characters in scene images are not homogeneous and sharp along their edges; enhancing the scene images with a filter which reduces those problems can produce better character segmentation results. The Symmetric Neighborhood Filter (SNF) was introduced by Harwood [15] and has been applied in many low-level image segmentation tasks, such as automatic target and road detection in aerial images and people detection. The SNF compares each pixel to its 8-connected neighbors. The neighbors are compared in symmetric pairs around the center: North-South, East-West, NW-SE, and NE-SW. The pixel of each pair closest to the center in color is selected if its color is within a threshold ε of the center pixel; otherwise, the center pixel's value is used. If the center pixel is equidistant from the pair, or is a local minimum or maximum, its value is selected instead. The four selected pixels are averaged together, and finally the center pixel is replaced by the mean of this average and the center pixel's current color.

The first phase of segmentation is a combination of five iterative SNF stages. The first stage runs for a small number of iterations with ε = 0 and is used to preserve edges. We define σ to be the median of the standard deviations of all 3x3 neighborhoods centered around each non-border pixel in the image. To flatten the interior of regions, the SNF is then iterated with ε = kσ, where k is typically set to 2.0; when the percentage of fixed pixels exceeds 90%, that stage is stopped. Then ε = 0 is used again for a couple of iterations in order to sharpen the edges between the regions. After that, we apply a nearest neighbor filter to the SNF-enhanced image, which cleans single-pixel outliers by replacing each pixel with the mean of its value and the color value of the adjacent pixel closest to its current value. An example of an original and SNF-enhanced image is shown in Figure 4 (a) and (b).
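The pair-selection rule can be made concrete with a minimal sketch of one SNF iteration on a single-channel image. This is our reading of the description above (grayscale rather than color, with the local-minimum/maximum case folded into the equidistant case), not the authors' implementation; iterating snf_pass with ε = 0, then with ε = kσ (k about 2.0) until roughly 90% of pixels stop changing, and then with ε = 0 again, mirrors the schedule described above.

    import numpy as np

    def neighborhood_sigma(img: np.ndarray) -> float:
        """Sigma as defined above: median std. dev. of all interior 3x3 neighborhoods."""
        h, w = img.shape
        stds = [img[y - 1:y + 2, x - 1:x + 2].std()
                for y in range(1, h - 1) for x in range(1, w - 1)]
        return float(np.median(stds))

    def snf_pass(img: np.ndarray, eps: float) -> np.ndarray:
        """One Symmetric Neighborhood Filter iteration on a grayscale image."""
        src = img.astype(np.float64)
        out = src.copy()
        h, w = src.shape
        pairs = [((-1, 0), (1, 0)),    # North-South
                 ((0, -1), (0, 1)),    # West-East
                 ((-1, -1), (1, 1)),   # NW-SE
                 ((-1, 1), (1, -1))]   # NE-SW
        for y in range(1, h - 1):
            for x in range(1, w - 1):
                c = src[y, x]
                picks = []
                for (dy1, dx1), (dy2, dx2) in pairs:
                    a = src[y + dy1, x + dx1]
                    b = src[y + dy2, x + dx2]
                    if abs(a - c) == abs(b - c):   # equidistant pair: keep the centre value
                        picks.append(c)
                        continue
                    near = a if abs(a - c) < abs(b - c) else b
                    # accept the nearer neighbor only if it is within eps of the centre
                    picks.append(near if abs(near - c) <= eps else c)
                # centre becomes the mean of its own value and the four selections' average
                out[y, x] = 0.5 * (c + sum(picks) / 4.0)
        return out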

Figure 4: Text character extraction steps: original image (a), SNF-enhanced image (b), pixel-level connected components (level 1) (c), final result of hierarchical connected components after region merging (d), foreground character regions in different colors (e), and final extracted characters with their bounding boxes and computed text lines (f).

2.1.2 Hierarchical Connected Component Analysis

In many previous systems, connected component analysis has been applied to binary images obtained by simple thresholding in order to produce a final segmentation. However, finding a good threshold value that yields clean binary images is not easy [10]. In our system, we apply a hierarchical connected components (HCC) analysis to the SNF-enhanced images to segment them into regions, where pixel-level connectivity is considered in the first level of segmentation and region-level connectivity is considered in the other levels.

Let x and y be two adjacent pixels with RGB colors (x_r, x_g, x_b) and (y_r, y_g, y_b). Their absolute color difference γ is calculated as

γ(x, y) = max{ |x_r - y_r|, |x_g - y_g|, |x_b - y_b| }

In the first level (level 1) of HCC, two adjacent pixels x and y are connected if γ(x, y) < σ. In the other levels we compare the color similarity of the border pixels of each pair of adjacent regions: two adjacent regions R_i and R_j are connected if the average color difference over all pixels on the border between these two regions is less than a threshold ε = kσ, where k is the iteration number (level number) and σ is the standard deviation used in the SNF. After HCC, two pixels carry the same label if and only if they belong to the same connected region.
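A minimal sketch of the first (pixel) level of the HCC analysis just described: adjacent pixels are merged when their color difference γ is below σ, with a union-find structure keeping the labels. The choice of 4-adjacency and the union-find bookkeeping are our assumptions; the higher levels, which merge whole regions when the average color difference along their common border stays below ε = kσ, follow the same pattern and are omitted here.

    import numpy as np

    def gamma(p, q) -> int:
        """Absolute color difference: maximum over the R, G, B channels."""
        return int(max(abs(int(a) - int(b)) for a, b in zip(p, q)))

    def level1_labels(img: np.ndarray, sigma: float) -> np.ndarray:
        """Pixel-level HCC stage: connect 4-adjacent pixels with gamma < sigma."""
        h, w, _ = img.shape
        parent = np.arange(h * w)

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]   # path halving
                i = parent[i]
            return i

        def union(i, j):
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[rj] = ri

        for y in range(h):
            for x in range(w):
                if x + 1 < w and gamma(img[y, x], img[y, x + 1]) < sigma:
                    union(y * w + x, y * w + x + 1)
                if y + 1 < h and gamma(img[y, x], img[y + 1, x]) < sigma:
                    union(y * w + x, (y + 1) * w + x)

        # each pixel is labeled by the root index of its connected region
        return np.array([find(i) for i in range(h * w)]).reshape(h, w)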

After hierarchical connected components, we convert the segmented images to binary black/white images in which background regions are colored white and regions which may contain text are colored black, so that we can use that binary segmentation for character recognition. We use two main heuristics, based on observations about the appearance of background regions, for the final segmentation: background regions have low texture, and most of the time they are bigger than text regions and encapsulate other regions.

2.1.3 Text Verification

We use observations about the appearance of text in scene images to verify text regions, such as high contrast with the background, sharp horizontal or vertical edges and shapes, and close proximity to the other characters in a word. Those properties of scene text can be observed easily in discrete wavelet transformed images. Recently, discrete wavelet transform (DWT) based techniques have become popular for indexing and image retrieval, as the edges and shapes of objects can be estimated easily in the DWT domain. We define a text-similarity measure E(x, y) for each pixel location in the scene image based on the discrete wavelet coefficients. We use a one-level non-standard decomposition of the two-dimensional wavelet transform, which alternates between operating on the rows and the columns of each color band of the image, using the two-dimensional Daubechies wavelet (Haar basis). First we perform one step of horizontal pairwise averaging and differencing on the pixel values in each row of the image. Next, we apply vertical pairwise averaging and differencing to each column of the result. The one-level decomposition yields four coefficients: the approximation a(x, y), the horizontal detail w^h(x, y), the vertical detail w^v(x, y), and the diagonal detail w^d(x, y). A one-level wavelet decomposition of an example scene image is shown in Figure 5 (a, b, c, d). The normalized edge density measure E(x, y) is computed as

E(x, y) = Σ_{i=1..N} Σ_{j=1..N} s_{i,j} ( w^h(x, y) + w^v(x, y) + w^d(x, y) )    (1)

where s is an N x N (typically 5x5) Gaussian mask. Figure 5(e) shows a graphical representation of the edge densities; the peaks of the graph indicate high probability values, i.e., a high probability that the pixel belongs to a text region.
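The one-level Haar decomposition and the edge density measure of Equation (1) can be sketched as follows, for a single channel with even dimensions. The absolute values on the detail coefficients, the normalization of the Gaussian mask, and any final scaling of the map to [0, 1] (needed for the 0.5 threshold used below) are our assumptions; this is an illustration of the measure, not the authors' code.

    import numpy as np

    def haar_level1(img: np.ndarray):
        """Non-standard one-level 2D Haar decomposition of a grayscale image."""
        img = img.astype(np.float64)
        lo = (img[:, 0::2] + img[:, 1::2]) / 2.0   # pairwise row averaging
        hi = (img[:, 0::2] - img[:, 1::2]) / 2.0   # pairwise row differencing
        a  = (lo[0::2, :] + lo[1::2, :]) / 2.0     # approximation
        wv = (lo[0::2, :] - lo[1::2, :]) / 2.0     # vertical detail
        wh = (hi[0::2, :] + hi[1::2, :]) / 2.0     # horizontal detail
        wd = (hi[0::2, :] - hi[1::2, :]) / 2.0     # diagonal detail
        return a, wh, wv, wd

    def gaussian_mask(n: int = 5, sig: float = 1.0) -> np.ndarray:
        ax = np.arange(n) - (n - 1) / 2.0
        g = np.exp(-(ax[:, None] ** 2 + ax[None, :] ** 2) / (2 * sig ** 2))
        return g / g.sum()

    def edge_density(img: np.ndarray, n: int = 5) -> np.ndarray:
        """Equation (1): Gaussian-weighted local sum of the detail coefficients."""
        _, wh, wv, wd = haar_level1(img)
        detail = np.abs(wh) + np.abs(wv) + np.abs(wd)
        mask = gaussian_mask(n)
        pad = n // 2
        padded = np.pad(detail, pad, mode="edge")
        out = np.zeros_like(detail)
        for dy in range(n):
            for dx in range(n):
                out += mask[dy, dx] * padded[dy:dy + detail.shape[0],
                                             dx:dx + detail.shape[1]]
        return out   # edge density map at the half resolution of the decomposition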

Figure 5: An example of a one-level non-standard discrete wavelet transform: approximation (a), horizontal details (b), vertical details (c), and diagonal details (d), and the edge density probability distribution of an image (e), where high peaks indicate a high probability of a text region.

Once the edge densities are computed, initial text-region bounding boxes are detected by simple thresholding of the edge densities, E(x, y) > 0.5, to estimate regions with a high probability of containing scene text. After the edge densities are computed for every pixel in the image, the edge density measure for each connected component T is computed as

tE(T) = ( Σ_{x ∈ T} E(x) ) / N_T    (2)

where N_T is the number of pixels in T and E(x) is the edge density. If tE(T) is bigger than a predetermined threshold value (0.5 in the current implementation), the region is classified as character; otherwise, it is classified as background. Figure 4(e) shows the final background region segmentation: each character region is drawn in a different color and background regions are colored white.

After character segmentation, text-line boundaries are computed in order to group the regions into text lines so that each text line can be sent to the character recognition engine. The pixels of each character region are projected onto the y-axis, and the histogram of the projection h_P(y) for each vertical position y is computed:

h_P(y) = Σ_{x < N} δ(x, y)    (3)

where δ(x, y) is 1 if there is a character pixel at location (x, y) and 0 otherwise. Each text line appears as a peak in the horizontal projection histogram, and the locations of the peaks give the text lines. Character regions are grouped together when they lie on the same text line. Figure 4(f) shows the bounding boxes of the character regions and the text lines detected from the projection histogram.

There have been many well-established research efforts on character recognition (OCR) for document analysis and on handwriting recognition in different languages, and there are commercial OCR engines for general-purpose use that can be applied easily if the characters are segmented well and cleanly and their size is big enough. Most of the text in scene images is either similar to a known font or in the form of handwriting. We are using character recognition and handwriting recognition engines which have been developed by the IBM China Research Lab. Once the segmentation step is done, we send the black/white segmentation of each text line to the character recognition engine to convert it to ASCII characters. As we only need simple word translation, once the characters are in ASCII form one can easily use any of the commercially available language translation engines. In our prototype, we are going to integrate the Chinese-to-English and Japanese-to-English translators which have been developed by the IBM Tokyo Research Lab. Three examples of text extraction, translation, and augmentation are shown in Figure 6.

Figure 6: Examples of text translation and augmentation: the left images show what the user sees, the middle ones show the extracted text, and the right images show the final version of the image that the user sees on the PDA.

3 Discussion and Future Work

We described a scene text extraction system, with an example application, which combines pervasive devices with augmented reality and computer vision techniques to provide functionality with which users can interact with text in the real world through a camera-attached PDA.

One of the crucial problems in augmented reality research is 3D registration between the real and virtual worlds, which requires precise camera calibration. Here, however, the registration problem is relaxed by computer vision techniques and user interaction, because the virtual information is overlaid onto the PDA's 2D scene image, which is a closed-loop AR setting where image-based registration can be used. We encountered two major technical problems while developing the system: the speed of wireless communications and the processing power limitations of the available PDAs. The current implementation requires a wireless connection so that the client can send data to the server, where the heavy image processing is carried out. We are using GSM wireless networks, which provide 9600 bps. On average, sending a portion of the scene image takes around 10 seconds; as only text information is sent back to the PDA, the return trip takes less than 1 second, and the image processing takes less than 3 seconds on the server side. Therefore, the augmented text information becomes available about 15 seconds after the user selects the region. That is still slow for an interactive application; when new-generation PDAs with more than two CF slots on board become available, we will replace the GSM modem with an IEEE 802.11 wireless connection, or with 3G modems once they are available, which provide much faster connections, so that in future prototypes the user will get the information in less than 5 seconds.

We tested our text/sign augmentation system in Chinatown in San Francisco, where much of the text is handwritten in Chinese characters (Figure 6); the system worked successfully inside a Chinese store, where the labels of the items and their prices were translated to English. We used 36 images containing 47 words to evaluate the performance of the system. The system correctly segmented 41 words, which were then successfully recognized. In 5 words there were false negatives where some of the characters could not be segmented, and in 2 words our system over-segmented the region, causing misrecognition.

References
[1] R. Azuma, Tracking Requirements for Augmented Reality, Communications of the ACM 36, No. 7, 50-51 (July 1993).
[2] S. Feiner, B. MacIntyre, and D. Seligmann, Knowledge-Based Augmented Reality, Communications of the ACM 36, No. 7, 53-62 (July 1993).
[3] S. Feiner, B. MacIntyre, T. Hollerer, and A. Webster, A Touring Machine: Prototyping 3D Mobile Augmented Reality Systems for Exploring Urban Environments, Personal Technologies 1, No. 4, 208-217 (1997).
[4] A. Jain and B. Yu, Automatic Text Location in Images and Video Frames, Pattern Recognition, Vol. 31, 1998.
[5] H. Li and D. Doermann, A Video Text Detection System Based on Automated Training, In Proc. ICPR, pages 223-226, 2000.
[6] R. Lienhart, Automatic Text Recognition for Video Indexing, In Proc. ACM Multimedia 96, pp. 11-20, Nov. 1996.
[7] J. Ohya, A. Shio, and S. Akamatsu, Recognizing Characters in Scene Images, IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 16, Feb. 1994.
[8] J. Rekimoto, NaviCam: A Magnifying Glass Approach to Augmented Reality, Presence (MIT Press), August 1997.
[9] J. Rekimoto, Y. Ayatsuka, and K. Hayashi, Augment-able Reality: Situated Communication Through Physical and Digital Spaces, Wearable Computers: The Second International Symposium on Wearable Computers, Pittsburgh, PA (October 19-20, 1998), pp. 68-75.
[10] T. Ridler and S. Calvard, Picture Thresholding Using an Iterative Selection Method, IEEE Transactions on Systems, Man and Cybernetics 8(8), 1978.
[11] J. C. Shim, C. Dorai, and R. Bolle, Automatic Text Extraction from Video for Content-Based Annotation and Retrieval, In Proc. of ICPR, 1998.
[12] J. Spohrer, Information in Places, IBM Systems Journal, Vol. 38, No. 4, Pervasive Computing.
[13] T. Starner, B. Schiele, and A. Pentland, Visual Contextual Awareness in Wearable Computing, Wearable Computers: The Second International Symposium on Wearable Computers, Pittsburgh, PA (October 19-20, 1998), pp. 50-57.
[14] M. Swain and D. Ballard, Color Indexing, International Journal of Computer Vision, 7(1), 1991.
[15] T. Westman, D. Harwood, T. Laitinen, and M. Pietikainen, Color Segmentation by Hierarchical Connected Components Analysis with Image Enhancement by Symmetric Neighborhood Filters, In Proceedings of the 10th ICPR, pp. 796, June 1990.

