Ismail Haritaoglu
IBM Almaden Research, San Jose, CA 95120, USA
(ismailh@almaden.ibm.com)
Abstract
We describe a scene text extraction system for handheld devices that provides enhanced information perception services to the user. It uses a color camera attached to a personal digital assistant as an input device to capture scene images from the real world, and it employs image enhancement and segmentation methods to extract the written information from the scene, convert it to text, and show it to the user, so that the user can see both the real world and the information together. We implemented a prototype application: automatic sign/text language translation for foreign travelers, with which people can see text or signs that are originally written in a foreign language in the scene in their own language whenever they want.
Figure 3: Segmentation results: original image (top-left), SNF-enhanced image (top-right), connected components (left), and the borders of each region.
However, the SNF is an edge-preserving smoothing filter which performs well for both edge sharpening and region smoothing. After the image is enhanced by the SNF, we use a hierarchical connected components algorithm (HCC) to combine similar pixels into homogeneously labeled regions so that we can segment the characters in the real scene.

2.1.1 Symmetric Neighborhood Filter based Image Enhancement

Due to noise and blur, the color regions of the characters in scene images are neither homogeneous nor sharp along their edges; enhancing the scene images with a filter that reduces these problems can produce better character segmentation results. The Symmetric Neighborhood Filter (SNF) was introduced by Harwood [15] and has been applied to many low-level image segmentation tasks, such as automatic target and road detection in aerial images and people detection. The SNF compares each pixel to its 8-connected neighbors. The neighbors are compared in symmetric pairs around the center: North-South, East-West, NW-SE, and NE-SW. The pixel in each pair closest to the center in color is selected if its color is within a threshold ε of the center pixel; otherwise, the center pixel's value is used. If the center pixel is equidistant from the pair, or is a local minimum or maximum, its own value is selected instead. The four selected pixels are averaged together, and, finally, the center pixel is replaced by the mean of this average and the center pixel's current color.

The first phase of segmentation is a combination of five iterative SNF stages. The first stage runs for a small number of iterations with ε = 0 and is used to preserve edges. We define σ to be the median of the standard deviations of all 3x3 neighborhoods centered around each non-border pixel in the image. To flatten the interiors of regions, the SNF is iterated with ε = kσ, where k is typically set to 2.0. When the percentage of fixed pixels exceeds 90%, that stage is stopped. Then, the SNF is run again with ε = 0 for a couple of iterations in order to sharpen the edges between the regions. After that, we apply a Nearest Neighbor filter to the SNF-enhanced image, which cleans single-pixel outliers by replacing each pixel with the mean of its value and the color value of the adjacent pixel closest to its current value. An example of an original and an SNF-enhanced image is shown in Figure 4 (top-left) and (top-right).
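As a concrete illustration, one SNF iteration as described above can be sketched as follows. This is a minimal grayscale sketch that assumes absolute intensity difference as the color distance and leaves border pixels untouched; it is not the paper's implementation.

```python
# Sketch of one Symmetric Neighborhood Filter (SNF) iteration on a grayscale
# image (list of lists of numbers). For each of the four symmetric pairs, the
# neighbor closer to the center is selected if within eps; the center value is
# kept when the pair is equidistant or the center is a pairwise extremum.

def snf_iteration(img, eps):
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    # Symmetric pairs around the center: N-S, E-W, NW-SE, NE-SW.
    pairs = [((-1, 0), (1, 0)), ((0, -1), (0, 1)),
             ((-1, -1), (1, 1)), ((-1, 1), (1, -1))]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            c = img[y][x]
            picks = []
            for (dy1, dx1), (dy2, dx2) in pairs:
                a = img[y + dy1][x + dx1]
                b = img[y + dy2][x + dx2]
                da, db = abs(a - c), abs(b - c)
                # equidistant pair or local extremum: keep the center value
                if da == db or (c < a and c < b) or (c > a and c > b):
                    picks.append(c)
                else:
                    near = a if da < db else b
                    # nearer neighbor is used only if within the threshold eps
                    picks.append(near if abs(near - c) <= eps else c)
            avg = sum(picks) / 4.0
            # center is replaced by the mean of this average and its own value
            out[y][x] = (avg + c) / 2.0
    return out
```

Iterating this with ε = 0 fixes pixels at edges, while a larger ε lets region interiors flatten, matching the staged schedule described above.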
At the first level of the hierarchy, two adjacent pixels x and y are connected if their color difference d(x, y) < σ. At the other levels, we compare the color similarities of the border pixels of each pair of adjacent regions: two adjacent regions R_i and R_j are connected if the average color difference over all pixels on the border between these two regions is less than a threshold ε = kσ, where k is the iteration number (level number) and σ is the standard deviation used in the SNF. After HCC, two pixels have the same label if and only if they belong to the same connected region.
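The pixel-level stage of this grouping can be sketched with a union-find structure. The 4-connectivity, scalar color distance, and function names here are illustrative assumptions, not the paper's implementation.

```python
# Sketch of connected-component labeling for the first HCC level: adjacent
# pixels are merged when their color difference is below sigma. A union-find
# structure with path compression keeps the merging near-linear.

def hcc_level(img, sigma):
    h, w = len(img), len(img[0])
    parent = list(range(h * w))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    # merge 4-connected neighbors whose colors are within sigma
    for y in range(h):
        for x in range(w):
            i = y * w + x
            if x + 1 < w and abs(img[y][x] - img[y][x + 1]) < sigma:
                union(i, i + 1)
            if y + 1 < h and abs(img[y][x] - img[y + 1][x]) < sigma:
                union(i, i + w)

    # relabel the union-find roots to compact region ids
    labels = {}
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            r = find(y * w + x)
            out[y][x] = labels.setdefault(r, len(labels))
    return out
```

Higher levels of the hierarchy would repeat the same merge on region border pixels with a progressively relaxed threshold.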
After hierarchical connected components, we convert the segmented image to a binary black/white image in which background regions are colored white and regions that may contain text are colored black, so that this binary segmentation can be used for character recognition. We used two main heuristics, based on observations about the appearance of background regions, for the final segmentation: background regions have low texture, and most of the time they are bigger than text regions and encapsulate other regions.
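These two heuristics can be sketched as follows; the area and variance thresholds, and the use of intensity variance as the texture measure, are assumptions for illustration only, not the paper's values.

```python
# Illustrative sketch of the background heuristics above: a labeled region is
# treated as background when it is large and has low texture (here measured as
# the variance of its pixel values). Thresholds are assumed for illustration.

def classify_regions(img, labels, min_bg_area=20, max_bg_var=4.0):
    # gather the pixel values belonging to each region label
    regions = {}
    for row_img, row_lab in zip(img, labels):
        for v, lab in zip(row_img, row_lab):
            regions.setdefault(lab, []).append(v)

    background = set()
    for lab, vals in regions.items():
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        if len(vals) >= min_bg_area and var <= max_bg_var:
            background.add(lab)

    # binary output: white (255) background, black (0) candidate text regions
    return [[255 if lab in background else 0 for lab in row] for row in labels]
```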
Figure 5: An example of a one-level non-standard discrete wavelet transform: approximation (a), horizontal details (b), vertical details (c), and diagonal details (d), and the edge density probability distribution of an image (e), where high peaks indicate a high probability of a text region.
probability that there is scene text. After the edge density is computed for every pixel in the image, the edge density measure of each connected component T is computed as

tE(T) = (1/N_T) Σ_{x∈T} E(x)    (2)

where N_T is the number of pixels in T and E(x) is the edge density at pixel x. If tE(T) is bigger than a predetermined threshold (0.5 in the current implementation), the region is classified as character; otherwise, it is classified as background. Figure 4 (e) shows the final background region segmentation: each character region is colored a different color and background regions are colored white.

After character segmentation, text-line boundaries are computed in order to group the regions into text-lines, so that each text-line can be sent to the character recognition engine. The pixels of each character region are projected onto the y-axis, and the histogram of the projection h_P(y) at each horizontal value y is computed:

h_P(y) = Σ_{x<N} b(x, y)    (3)

where b(x, y) is 1 if there is a character pixel at location (x, y) and 0 otherwise. Each text line appears as a peak in the horizontal projection histogram, and the locations of the peaks give the text lines. Character regions are grouped together when they lie in the same text-line. Figure 4(f) shows the bounding boxes of the character regions and the text-lines detected by the projection histogram scheme.

There have been many well-established research efforts on character recognition (OCR) for document analysis and on handwriting recognition in different languages, and there are some commercial general-purpose OCR engines that one can use easily if the characters are segmented well and cleanly and their size is big enough. Most of the text in scene images is either similar to a known font or in the form of handwriting. We are using character recognition and handwriting recognition engines which have been developed by the IBM China Research Lab. Once the segmentation step is done, we send the black/white segmentation of each text line to the character recognition engine to convert it to ASCII characters. As we only need simple word translation, once the characters are in ASCII form, one can easily use any of the commercially available language translation engines. In our prototype, we integrate a Chinese- and Japanese-to-English translator which has been developed by the IBM Tokyo Lab. Three examples of text extraction, translation and augmentation are shown in Figure 6.

3 Discussion and Future Work

We described a scene text extraction system, with an example application, which combines pervasive devices with augmented reality and computer vision techniques so that users can interact with the real world through camera-attached PDAs.

One of the crucial problems in augmented reality research is 3D registration between the real and virtual worlds, which requires precise calibration between the camera and the world. However, the registration problem is relaxed here by using computer vision techniques and user interaction, as the virtual information is overlaid onto the PDA's 2D scene image.
This is a closed-loop AR system in which image-based registration can be used. We encountered two major technical problems while developing the system: the speed of wireless communications and the processor power limitation of available PDAs. The current implementation of the application requires wireless communication so that it can send data to a server, where the heavy image processing methods are computed. We are using GSM wireless networks, which provide 9600 bps. On average, sending a portion of a scene image takes around 10 sec and, as only text information is sent back to the PDA, the return takes less than 1 sec. The image processing computations take less than 3 sec on the server side. Therefore, the augmented text information becomes available about 15 sec after the user selects the region. This is still slow for an interactive application; however, when the new generation of PDAs with more than two CF slots on board becomes available, we will replace the GSM modem with IEEE 802.11 wireless connections or, when available, 3G modems that provide very fast connections, so that users will get information in less than 5 seconds in future prototypes.

We tested our text/sign augmentation system in Chinatown in San Francisco, where much of the text is handwritten in Chinese characters (Figure 6); our system worked successfully inside a Chinese store, where the labels describing the items and their prices were translated to English. We used 36 images containing 47 words to evaluate the performance of the system. The system correctly segmented 41 words (about 87%), which were then successfully recognized. In 5 words there were false negatives where some of the characters could not be segmented, and in 2 words our system over-segmented the region, causing misrecognition.

Figure 6: Examples of text translation and augmentation: the left images show what the user sees, the middle ones are the extracted text, and the right images show the final version of the image that the user sees on the PDA.

References

[1] R. Azuma, Tracking Requirements for Augmented Reality, Communications of the ACM 36, No. 7, 50-51 (July 1993).
[2] S. Feiner, B. MacIntyre, and D. Seligmann, Knowledge-Based Augmented Reality, Communications of the ACM 36, No. 7, 53-62 (July 1993).
[3] S. Feiner, B. MacIntyre, T. Hollerer, and A. Webster, A Touring Machine: Prototyping 3D Mobile Augmented Reality Systems for Exploring Urban Environments, Personal Technologies 1, No. 4, 208-217 (1997).
[4] A. Jain and B. Yu, Automatic Text Location in Images and Video Frames, Pattern Recognition, Vol. 31, 1998.
[5] H. Li and D. Doermann, A Video Text Detection System Based on Automated Training, ICPR, pages 223-226, 2000.
[6] R. Lienhart, Automatic Text Recognition for Video Indexing, in Proc. ACM Multimedia 96, pp. 11-20, Nov. 1996.
[7] J. Ohya, A. Shio, and S. Akamatsu, Recognizing Characters in Scene Images, IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 16, Feb. 1994.
[8] J. Rekimoto, NaviCam: A Magnifying Glass Approach to Augmented Reality, MIT Presence (August 1997).
[9] J. Rekimoto, Y. Ayatsuka, and K. Hayashi, Augment-able Reality: Situated Communication Through Physical and Digital Spaces, Wearable Computers: The Second International Symposium on Wearable Computers, Pittsburgh, PA (October 19-20, 1998), pp. 68-75.
[10] T. Ridler and S. Calvard, Picture Thresholding Using an Iterative Selection Method, IEEE Transactions on Systems, Man and Cybernetics 8(8), 1978.
[11] J.C. Shim, C. Dorai, and R. Bolle, Automatic Text Extraction from Video for Content-Based Annotation and Retrieval, in Proc. of ICPR, 1998.
[12] J. Spohrer, Information in Places, IBM Systems Journal, Vol. 38, No. 4 - Pervasive Computing, 1999.
[13] T. Starner, B. Schiele, and A. Pentland, Visual Contextual Awareness in Wearable Computing, Wearable Computers: The Second International Symposium on Wearable Computers, Pittsburgh, PA (October 19-20, 1998), pp. 50-57.
[14] M. Swain and D. Ballard, Color Indexing, International Journal of Computer Vision, 7(1), 1991.
[15] T. Westman, D. Harwood, T. Laitinen, and M. Pietikainen, Color Segmentation by Hierarchical Connected Components Analysis with Image Enhancement by Symmetric Neighborhood Filters, in Proceedings of the 10th ICPR, pp. 796, June 1990.