Steps in OCR
Conclusion
PIXELS
Pixels (picture elements), also called pels, are image
sample areas that are almost always square.
All pixels are identical in size and arrangement.
All pixels are processed the same way.
All pixels are scanned, displayed, and printed in the
same way.
Each pixel has a location and a colour, both given as
numbers.
Location: by coordinates.
Colour: amounts of red, green, and blue.
Maximum on all three is white; minimum on all three is black.
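As a small illustration of "location and colour, both given as numbers", a pixel can be sketched as a record of two coordinates and three colour amounts. The `Pixel` type, its field names, and the 0 to 255 range per channel (8 bits) are assumptions made for this sketch; the slides do not fix a bit depth for colour.

```python
from typing import NamedTuple

CHANNEL_MAX = 255  # assumption: 8 bits per colour channel


class Pixel(NamedTuple):
    x: int  # column coordinate
    y: int  # row coordinate
    r: int  # amount of red, 0..CHANNEL_MAX
    g: int  # amount of green, 0..CHANNEL_MAX
    b: int  # amount of blue, 0..CHANNEL_MAX


def is_white(p: Pixel) -> bool:
    # Maximum on all three channels is white.
    return (p.r, p.g, p.b) == (CHANNEL_MAX, CHANNEL_MAX, CHANNEL_MAX)


def is_black(p: Pixel) -> bool:
    # Minimum on all three channels is black.
    return (p.r, p.g, p.b) == (0, 0, 0)
```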
WHAT AND WHY OF OCR…
Optical Character Recognition (OCR) is the process of
translating scanned images of typewritten text into
machine-editable information.
Brightness
Contrast
Distance
Type of Scanner
(Courtesy: scanned from G K Today)
BINARISATION
Converts a gray-level (8-bit)
TIFF (Tagged Image File
Format) image to a binary
image.
Uses a histogram-based global
threshold.
Each pixel is stored as one bit:
one value is black, the other
white.
Helps separate the text from
the background.
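The slides do not name a specific thresholding method. As a hedged sketch, Otsu's method is swapped in here as one well-known histogram-based global threshold: it picks the gray level that best separates the histogram into a dark class and a bright class. The function names and the convention that 1 means black (text) and 0 means white (background) are assumptions of this example.

```python
def otsu_threshold(gray):
    """Return the gray level t (0..255) that maximises between-class variance.

    `gray` is a list of rows of 8-bit gray values. Pixels <= t form the
    dark class, pixels > t the bright class.
    """
    hist = [0] * 256
    for row in gray:
        for value in row:
            hist[value] += 1

    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for t in range(256):
        weight_dark += hist[t]
        sum_dark += t * hist[t]
        weight_bright = total - weight_dark
        if weight_dark == 0:
            continue  # no dark pixels yet
        if weight_bright == 0:
            break  # every pixel is already in the dark class
        mean_dark = sum_dark / weight_dark
        mean_bright = (sum_all - sum_dark) / weight_bright
        between_var = weight_dark * weight_bright * (mean_dark - mean_bright) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(gray, threshold):
    # Assumed convention: 1 = black (text), 0 = white (background).
    return [[1 if value <= threshold else 0 for value in row] for row in gray]
```

A threshold set too low or too high is what later shows up as fragmented or merged characters, so the choice of threshold matters for segmentation.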
SKEW CORRECTION
Determine the skew angle.
Use the HEADLINE or the page edges as a reference, and correct
the skew by rotating the image.
BACKGROUND NOISE
Textual Noise
• Extraneous symbols from the neighboring page.
• Handwritten material.
Non-Textual Noise
• Black borders.
• Speckles.
TEXTUAL NOISE
Top – Down
Mixed
BOTTOM UP APPROACH
Segmentation starts with the
individual letters on a
page and then, based on
text-layout conventions, groups
letters into words, words
into paragraphs, and so on.
Line art and half-tones are
often detected by their
size or their non-text
layout.
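The first grouping step of the bottom-up approach, letters into words, can be sketched as follows. The example assumes each letter is already available as a bounding box `(x0, y0, x1, y1)` on a single text line, and that letters within `max_gap` pixels of each other belong to the same word; the function name, the box format, and the gap value are hypothetical choices for illustration, not taken from the slides.

```python
def group_into_words(char_boxes, max_gap=3):
    """Group letter boxes (x0, y0, x1, y1) on one text line into words.

    A letter joins the current word when the horizontal gap to the
    previous letter is at most `max_gap` pixels (an assumed layout rule).
    """
    words = []
    for box in sorted(char_boxes):  # left to right by x0
        if words and box[0] - words[-1][-1][2] <= max_gap:
            words[-1].append(box)
        else:
            words.append([box])
    return words


def bounding_box(boxes):
    # Smallest box that encloses every box in the group.
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )
```

The same idea can be repeated one level up, with word boxes grouped by a larger gap threshold.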
TOP DOWN APPROACH
Method 1
Top-down approaches
take advantage of the fact
that formatted documents
usually have margins
surrounding each region.
The page can be
subdivided into different
regions by examining the
white-space in the
document.
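A minimal sketch of subdividing a page by its white space: rows that contain no ink are treated as gaps, and any gap of at least `min_gap` blank rows separates two regions. The function name and the `min_gap` value are assumptions for illustration; applying the same cut to columns, and alternating between the two, would give a recursive subdivision.

```python
def split_on_blank_rows(binary, min_gap=2):
    """Return (first_row, last_row) pairs for regions of a binary page.

    `binary` is a list of rows with 1 = ink. Regions are separated by
    runs of at least `min_gap` completely blank rows.
    """
    regions = []
    start = None     # first inked row of the current region
    last_ink = None  # most recent inked row seen
    for y, row in enumerate(binary):
        if not any(row):
            continue  # blank row: part of a possible gap
        if start is None:
            start = y
        elif y - last_ink - 1 >= min_gap:
            regions.append((start, last_ink))
            start = y
        last_ink = y
    if start is not None:
        regions.append((start, last_ink))
    return regions
```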
Method 2
Top-down methods also use the bit density or texture
of the document to identify and classify regions.
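Bit density can be read as the fraction of black pixels inside a region. The sketch below computes it and applies a simple cutoff rule. The function names, the labels, and the cutoff values `blank_max` and `text_max` are hypothetical and chosen only for illustration; the slides give no thresholds, and real values would have to be tuned on actual pages.

```python
def bit_density(binary, top, left, bottom, right):
    """Fraction of ink pixels (1s) in rows top..bottom-1, cols left..right-1."""
    area = (bottom - top) * (right - left)
    if area <= 0:
        return 0.0
    ink = sum(binary[y][x] for y in range(top, bottom) for x in range(left, right))
    return ink / area


def classify_region(density, blank_max=0.02, text_max=0.4):
    # Illustrative rule with assumed cutoffs: very sparse regions are
    # blank, moderately dense ones are text, dense ones are graphics.
    if density <= blank_max:
        return "blank"
    if density <= text_max:
        return "text"
    return "graphic"
```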
OVER SEGMENTATION
Dot matrix printing or insufficient ink
Characters tend to be fragmented
UNDER SEGMENTATION
Ink smudging
Small fonts
Signatures
SEGMENTATION PROBLEMS
Character Merging
• Characters like “ry”, “%l”, “m” appear connected.
• Under-threshold binarization.
• Ink smudging.
Character Fragmentation
• Over-threshold binarization.
• Thin strokes.
• Dot matrix or ink jet printing.
• Poor quality printing.
OVERCOMING SEGMENTATION PROBLEMS
Separation by Valley of Vertical Projection
Separation by Connected White Path
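Separation by the valley of the vertical projection can be sketched as follows: count the ink pixels in each column of a merged-character image, then cut at the interior column with the smallest count. The function names and the `margin` parameter, which keeps the cut away from the image edges, are assumptions of this example; ties are resolved in favour of the leftmost valley.

```python
def vertical_projection(binary):
    # Number of ink pixels (1s) in each column.
    return [sum(column) for column in zip(*binary)]


def split_at_valley(binary, margin=1):
    """Split a merged-character image at the lowest interior column.

    Returns (cut, left, right), where `left` holds columns before `cut`
    and `right` holds columns from `cut` onward.
    """
    projection = vertical_projection(binary)
    candidates = range(margin, len(projection) - margin)
    if len(candidates) == 0:
        raise ValueError("image too narrow to split")
    cut = min(candidates, key=lambda x: projection[x])
    left = [row[:cut] for row in binary]
    right = [row[cut:] for row in binary]
    return cut, left, right
```

This works when the touching characters are joined by only a few pixels; a connected white path can follow a curved gap that no single straight column cut would find.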