
Urdu Optical Character

Recognition Using Feedforward


Neural Network

By
Zaheer Ahmad
MS-IT
Institute of Management Science, Peshawar, Pakistan
6th February, 2009
Optical Character Recognition
• Optical Character Recognition (OCR) is the
mechanical or electronic translation / reading
of images of handwritten, typewritten or
printed text (usually captured by a scanner)
into machine-editable text

• OCR is a branch of
– Pattern Recognition and Machine Vision
Optical Character Recognition
Main Three Steps

• Scanning -------- with new technology, not difficult
• Analyzing the image ---- script/style dependent
• Classification ----------- script dependent
OCR History
• Research commenced in the 1950s
• In the 1960s and 1970s, new OCR applications were
developed for businesses, banks, hospitals, post
offices; insurance, railroad, and aircraft
companies; and newspaper publishers
• In the 1960s, OCR machines tended to make many
errors when the print quality was poor, caused
either by wide variations in type fonts and
roughness of the paper surface, or by the
cotton ribbons of typewriters
• To reduce errors: standardization of print
fonts, paper, and ink qualities
• In the 1970s, new fonts such as OCR-A and
OCR-B were designed
• These efforts revolutionized the data entry
process ….….. at the cost of jobs
Urdu and Arabic Script OCR History
• Little work in the field, mostly on standalone
alphabets … India … and Pakistan
BUT
• Some work on Arabic and Farsi, but still……..
• Different styles are used for Arabic, Farsi and
Urdu, i.e. Nastaliq …. Naskh ….
• Some work on Pashto
Some Applications of Urdu OCR
• It will expand and multiply the knowledge already
available in hard copies, i.e.
– Centuries-old rare scripts in Arabic, Urdu and
Persian will become available to the common man
– Improve the interaction between man and
machine in many applications, including
• office automation,
• cheque verification in banking,
• business and data entry applications,
Some Applications of OCR
• library archives,
• document identification,
• e-book production,
• invoice and shipping receipt processing,
• subscription collections,
• questionnaire processing,
• exam paper processing, and
• online address and signboard reading
Pattern Recognition
• The act of taking in raw data and taking an action
based on the category of the data (also known as
classification or pattern classification)
• It uses methods from statistics, machine learning
and other areas.
• Some popular techniques for pattern recognition
include:
• Neural Networks
• Hidden Markov Models ---- probability-based
• Bayesian Networks ….
Urdu Script (اردو رسم الخط)
• National language of Pakistan, and
• One of the popular scripts of the Indian
subcontinent, which evolved there from a mixture
of the Arabic, Turkish, Farsi and Hindi languages
• Spoken by more than 60 million speakers in over
20 countries
• 58-character set defined by the National Language
Authority (NLA) of Pakistan
• 40 basic characters plus do-chashmi-hey ( ھ ), which
is used to form all composite alphabets; the
working set therefore consists of 41 alphabets.
Character Set (58 alphabets) of Urdu Script

• Urdu is a modification of the Persian alphabet,
which is itself a derivative of the Arabic
alphabet,
• and has adopted some characters, like rhe ( ڑ ),
from the Hindi script.
• Urdu is a right-to-left script written in the
calligraphic Nasta'liq style, whereas the Naskh
style is used for Arabic.
• Most joined characters form an angle of
about 45° with the horizontal line,
• because of which Urdu script is faster to read
than Roman script, but
• it is harder for machines to recognize a word
or to segment one character from the rest,
• and for novice readers as well.
• No capital or small characters, but the last character of
a word is considered "capital" as it appears in its full form.
• Stand-alone and joining forms ---- a character changes not only
its shape but also its size.
• This increases the number of classes to be recognized. In
our experiments we have used 54 different classes for 41
different Urdu characters.

‫ث‬ ‫ث‍ث‍‍ث‬ sē
‫ج‬ ‫ج‍ج‍‍ج‬ jīm

• The word Urdu ( ‫ )اردو‬and words of the similar category
contain characters that are not joinable / cannot be connected.
Problems of Urdu Script
• A large character set, but most of the characters
are similar:

‫ب‬ ‫ببب‬    ‫ج‬ ‫ججج‬
‫پ‬ ‫پپپ‬    ‫چ‬ ‫چچچ‬
‫ت‬ ‫تتت‬    ‫ح‬ ‫ححح‬
‫ٹ‬ ‫ٹٹٹ‬    ‫خ‬ ‫خخخ‬
‫ث‬ ‫ثثث‬
• The other forms (initial, middle) of these characters
reveal that ein ( ‫ )ع‬is similar to hamza ( ‫;)ء‬
• waw ( ‫ )و‬may be confused with yee ( ‫;)ی‬
• ze ( ‫ )ز‬resembles noon ( ‫;)ن‬
• zal ( ‫ )ذ‬and dhal ( ‫ )ڈ‬are close matches to the initial
form of tay ( ‫;)ٹ‬
• mem ( ‫ )ممم‬in the middle of a word can be confused
with the middle form of ein ( ‫ )ععع‬and with the
stand-alone goal-he ( ‫.)ہ‬
• Some characters contain closed loops; the
character ‫ ھ‬contains two loops.
• The open portion of the characters jim ‫ ـﺝ‬, hey
‫ ـﺣ‬and khe ‫ ـﺥ‬forms a triangle.
• The loop of the characters mem ‫ م‬, waw ‫ و‬and ein ‫ـﻌـ‬
sometimes becomes so small that the internal
opening disappears.
• Hamza ( ‫)ء‬, a zigzag shape, is not really a letter, but it
can cause difficulty in the segmentation process as it
resembles the segmented middle form of ein ( ‫) ع‬.
• Dots may appear as two separated dots,
touching dots, a hat, or a stroke.
• Another style of Urdu handwriting is
artistic or decorative calligraphy,
• which is usually full of overlapping, making
the recognition process difficult even for
human beings, let alone computers.
What Has Been Done
--------------------NOTHING------------
• No text databases or dictionaries are available, except
the one under preparation by the Urdu Language
Authority, but their website shows slow progress so far.
• Even no standard keyboard exists. The National
Language Authority of Pakistan has devised a
keyboard in which the most used characters are
placed under the main fingers, but it is very different
from the one already in use (the phonetic keyboard
of InPage).
• Moreover, it is still to be adopted by software vendors,
as even Windows Vista uses its own version of
the Urdu keyboard.
• The research carried out on the Urdu language is
mostly scattered and conducted outside the Urdu-speaking
world.
• No specialized conferences or symposia have been
conducted so far.
• There is no financial support from the government.
Neural Networks

NN A Brain-Inspired Model

[Diagram: inputs flow through a network of interconnected cells to the outputs.]
NN A Brain-Inspired Model
• A neural network acquires knowledge through
learning.
• A neural network's knowledge is stored within
inter-neuron connection strengths known as
synaptic weights.
• The largest modern neural networks achieve
complexity comparable to the nervous system
of a fly.
Historical Background
• 1943: McCulloch and Pitts proposed the first
computational model of the neuron.
• 1949: Hebb proposed the first learning rule.
• 1958: Rosenblatt's work on perceptrons.
• 1969: Minsky and Papert exposed limitations of the
theory.
• 1970s: decade of dormancy for neural networks.
• 1980s–90s: neural networks return (self-organization,
back-propagation algorithms, etc.)
NN Applications
• Process Modeling and Control- Creating a neural network model for a physical
plant then using that model to determine the best control settings for the plant.
• Machine Diagnosis- Detect when a machine has failed so that the system can
automatically shut down the machine when this occurs.
• Target Recognition- Military application which uses video and/or infrared image data to
determine if an enemy target is present.
• Medical Diagnosis- Assisting doctors with their diagnosis by analyzing the reported
symptoms and/or image data such as MRIs or X-rays.
• Target Marketing- Finding the set of demographics which have the highest response
rate for a particular marketing campaign.
• Voice Recognition - Transcribing spoken words into ASCII text.
• Financial Forecasting (stock prediction) - Using the historical data of a security to
predict the future movement of that security.
• Quality Control - Attaching a camera or sensor to the end of a production process to
automatically inspect for defects.
• Intelligent Search - An internet search engine that provides the most relevant content
and banner ads based on the users' past behavior.
• Fraud Detection - Detect fraudulent credit card transactions and automatically decline
the charge.
How NNs Work (Mathematically)
• Linear and non-linear pattern classification
• Regression / function estimation
• Curve fitting

Why Use NNs
• Parallel processing
• Fault tolerance
• Self-organization
• Generalization ability
• Continuous adaptivity
Artificial Neurons
• Neural networks are made up of nodes which have
– Input edges, each with some weight
– Output edges (with weights)
– An activation level (a function of the inputs)
• Weights of edges can be positive or negative and may change
over time (learning)
• Each node computes the weighted sum of the activation levels
of its inputs,
• and its activation level is a linear or non-linear transfer
function "a" of that weighted sum.
• Some nodes are inputs, some are outputs.
A Model of Artificial Neuron

[Diagram: inputs x1, x2, …, xm, weighted by wi1, wi2, …, wim, feed a summation f(.) followed by an activation a(.) that produces the output yi; the last input xm = 1 with weight wim = θi is the bias.]
A Model of Artificial Neuron

f(i) = Σ (j = 1 … m) wij · xj

yi(t + 1) = a(f), where a(f) = 1 if f ≥ 0, and 0 otherwise

(with xm = 1 and wim = θi acting as the bias)
Structural Types of NN
• Un-weighted -- McCulloch–Pitts (1943)
• Weighted -- introduced by Hebb
• Supervised
• Perceptron -- by Frank Rosenblatt -- the foundation
• ADALINE and MADALINE
• FFNN

• Unsupervised
• ART1 and ART2
• Kohonen's Self-Organizing Maps (SOM), etc.
The Perceptron

[Diagram: inputs x1 … xn with weights w1 … wn, plus a bias input xn+1 = -1 with weight wn+1 = θ, feed a summation unit.]

a = bias + Σ wi xi
y = 1 if a ≥ 0, and 0 if a < 0

• Bias: the extra weight connected to a constant input is called the bias of
the element.
• It enables the threshold to be set equal to zero, which helps in
calculation.
• It also gives an extra dimension of representation: every point in
(n + 1)-dimensional weight space can be associated with a hyperplane
in (n + 1)-dimensional extended input space.
Linear Separability Problem
• If two classes of patterns can be separated by a decision boundary,
represented by the linear equation

b + Σ (i = 1 … n) xi wi = 0

then they are said to be linearly separable, and the simple network can
correctly classify any of their patterns.
• The decision boundary of linearly separable classes can be determined
either by some learning procedure or by solving linear equation
systems based on representative patterns of each class.
• If such a decision boundary does not exist, then the two classes are
said to be linearly inseparable.
• Linearly inseparable problems cannot be solved by the simple
network; a more sophisticated architecture is needed.
• Examples of linearly separable classes

- Logical AND function (bipolar patterns):
  x1  x2  y
  -1  -1  -1
  -1   1  -1
   1  -1  -1
   1   1   1
  Decision boundary: w1 = 1, w2 = 1, b = -1,
  so -1 + x1 + x2 = 0 (equation of the line / decision boundary)
  x: class I (y = 1), o: class II (y = -1)

- Logical OR function (bipolar patterns):
  x1  x2  y
  -1  -1  -1
  -1   1   1
   1  -1   1
   1   1   1
  Decision boundary: w1 = 1, w2 = 1, b = 1,
  so 1 + x1 + x2 = 0
  x: class I (y = 1), o: class II (y = -1)
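The two decision boundaries above can be checked numerically (an illustrative sketch; the weights w1 = w2 = 1 with b = -1 and b = 1 are the ones from the slide):

```python
# Verify that the slide's weights separate the bipolar AND and OR patterns.
def classify(x1, x2, b):
    # Bipolar threshold unit with w1 = w2 = 1 and bias b.
    return 1 if b + x1 + x2 >= 0 else -1

patterns = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
and_targets = [-1, -1, -1, 1]
or_targets = [-1, 1, 1, 1]

print(all(classify(x1, x2, -1) == t
          for (x1, x2), t in zip(patterns, and_targets)))  # True
print(all(classify(x1, x2, 1) == t
          for (x1, x2), t in zip(patterns, or_targets)))   # True
```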
• Example of linearly inseparable classes

- Logical XOR (exclusive OR) function (bipolar patterns):
  x1  x2  y
  -1  -1  -1
  -1   1   1
   1  -1   1
   1   1  -1
  x: class I (y = 1), o: class II (y = -1)
  (no single straight decision boundary separates the two classes)
Multilayer NN
• Neural net for non-linear classification
• Combination of perceptrons
• Back-propagation learning
Multilayer FFNN
A NN with one or more hidden layers.
What does each of the layers do?

1st layer draws linear boundaries; 2nd layer combines
the boundaries; 3rd layer can generate arbitrarily complex boundaries.
Back-propagation Algorithm
• Multiple outputs.
• Forward pass
• Error calculation
• Backward propagation of the error
• No guarantee of getting the best possible
weights after correction.
• Classifies inputs into multiple classes.
• Can be modified to represent any function.
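The three passes listed above can be sketched for a tiny one-hidden-layer network (a minimal illustration with made-up sizes and data, not the thesis network, trained here on XOR-style targets):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny network: 2 inputs -> 3 hidden -> 1 output.
W1 = rng.normal(0, 1, (2, 3)); b1 = np.zeros(3)
W2 = rng.normal(0, 1, (3, 1)); b2 = np.zeros(1)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

lr = 0.5
for epoch in range(20000):
    # Forward pass
    H = sigmoid(X @ W1 + b1)
    Y = sigmoid(H @ W2 + b2)
    # Error calculation (squared-error gradient at the output)
    dY = (Y - T) * Y * (1 - Y)
    # Backward propagation of the error to the hidden layer
    dH = (dY @ W2.T) * H * (1 - H)
    # Gradient-descent weight correction
    W2 -= lr * H.T @ dY; b2 -= lr * dY.sum(axis=0)
    W1 -= lr * X.T @ dH; b1 -= lr * dH.sum(axis=0)

print(Y.ravel())  # outputs in [0, 1], ideally close to 0, 1, 1, 0
```

Note the slide's caveat in action: gradient descent gives no guarantee of the best possible weights, so convergence depends on the random initialization.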
MATLAB and NN Toolbox
• The name MATLAB stands for "matrix
laboratory". MATLAB is a high-performance language for
technical computing. It integrates computation,
visualization, and programming in an easy-to-use
environment where problems and solutions are
expressed in familiar mathematical notation. Typical
uses include:
• Math and computation
• Algorithm development
• Modeling, simulation, and prototyping
• Data analysis, exploration, and visualization
• Scientific and engineering graphics
• Application development, including Graphical User
Interface building
Urdu Optical Character
Recognition Using Feedforward
Neural Networks

The Proposed System


Introduction
• UOCR developed for Urdu script
• Arial 36 font
• Single line of Urdu text image
• Segmentation part
• Neural network / classification part

[Flowchart: Input Urdu Text Image → Preprocessing → Segmentation → Segmented Character → Binary Character (Resized) → Character Code (Results)]


Feature Extraction and Segmentation
Algorithm Developed
• Pixel strength is measured, where the pixel strength
or energy is the number of black pixels in a specific
direction: bottom-up, right-left, or at any angle.
• A search is made for a path, bottom to top, right to
left, or at any angle, during which black pixels are
counted; the path on which the minimum number of
black pixels is encountered is selected.
      i  ii iii iv  v  vi vii viii
i     0  0  0  0  1  0  1  0
ii    0  0  1  0  0  0  1  0
iii   0  1  1  1  1  0  1  0
iv    0  1  0  1  0  0  1  0
v     0  1  0  1  0  1  1  0
vi    0  0  0  1  0  1  1  0
vii   0  0  1  1  0  1  1  0
viii  0  0  1  1  0  1  1  0
ix    0  0  1  1  0  1  1  0
• Energy level
– For word segmentation, zero-level energy is
selected
– For character segmentation, the energy of each seam
is calculated and compared with the average
energy of all the seams of the image
• Character size and threshold values
• Large segments are further segmented
• Small segments are merged
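A simplified version of this minimum-energy path search can be sketched as follows (an illustration of the idea only: it considers straight vertical cuts through the pixel grid shown above, while the actual algorithm also allows angled paths):

```python
# Find the vertical cut (column) crossing the fewest black pixels in a
# binary image, where 1 = black. Such a cut is a candidate segmentation seam.
def min_energy_column(image):
    n_cols = len(image[0])
    # Energy of a column = number of black pixels along it.
    energies = [sum(row[c] for row in image) for c in range(n_cols)]
    best = min(range(n_cols), key=lambda c: energies[c])
    return best, energies[best]

# The 9x8 grid from the slide: rows i..ix, columns i..viii.
grid = [
    [0, 0, 0, 0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0, 0, 1, 0],
    [0, 1, 1, 1, 1, 0, 1, 0],
    [0, 1, 0, 1, 0, 0, 1, 0],
    [0, 1, 0, 1, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 1, 1, 0],
    [0, 0, 1, 1, 0, 1, 1, 0],
    [0, 0, 1, 1, 0, 1, 1, 0],
    [0, 0, 1, 1, 0, 1, 1, 0],
]
col, energy = min_energy_column(grid)
print(col, energy)  # the first and last columns contain no black pixels; min() picks column 0
```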
Garbage Characters
The main problem in the algorithm.
Training Patterns
• Arial font of size 36 was resized
– enlarged in some cases and reduced in others,
using the 'imresize' function with the 'nearest' parameter

• 54 different classes
• 100 samples
• MS Paint, Photoshop and MS Word
Neural Network Training and Sim
• A multilayer feedforward neural network (FFNN)
• Input layer: 21×15 (315 nodes) ---- size of a character
• Hidden layer: 2000 nodes ------ trial and error
• Output layer: 6 nodes ---- enough binary codes to cover the 54 classes
• Activation functions: tansig and logsig
• Epochs = 2000 to get trained / meet the goal of
0.0005 ---- goal selected from results
• Time: 5-7 hours on a 2 GHz Dual Core with 2 GB of
RAM
• Segmented characters are resized again before
being fed to the NN
• The 'sim' function returns a 6-digit binary number
• The number is matched against the 54-character
set (used as targets during the training).

• If no match is found, a 'No Character Found' message is produced.
Simulation Results
• Recognition of the character family of bee ( ‫) ب‬, pee ( ‫)پ‬,
tee ( ‫) ت‬, tay ( ‫)ٹ‬, cee ( ‫ )ث‬and fee ( ‫ )ف‬is around
80%.
• The same is the case for the character family of kaf ( ‫)ک‬
and gaf ( ‫)گ‬, as these are the simplest
characters and, despite their similarity with each
other, they are totally different from the other
characters.
• The character lam ( ‫) ل‬, when used in the middle of a
word, behaves like an alif ( ‫) ا‬, which decreases its
recognition percentage, but alif is not
mistaken for lam ( ‫ ) ل‬in most cases.
Simulation Results
• The characters waw ( ‫ )و‬and choti-yee ( ‫ ) ی‬are
difficult for the NN to differentiate, as the
segment of choti-yee ( ‫ )ی‬left after it produces
garbage is very similar to waw ( ‫)و‬.
• The characters fee ( ‫)ف‬, mem ( ‫ ) م‬and ein ( ‫) ع‬,
when used in the middle form of a word, can
deceive the neural network into confusing them with
each other during recognition, which leads to a low
recognition percentage.
• The character noon ( ‫) ن‬, when used at the
beginning of a word, looks like ze ( ‫ )ز‬and zal ( ‫)ذ‬, and thus
produces low results.
Simulation Results
• The combination of lam ( ‫ )ل‬and alif ( ‫) ا‬, when
used in words like ( ‫) اسلام‬, makes somewhat of a
new character in the segmentation phase, as
shown in the figure below.
Conclusions
• The results show 70% success at the neural network output,
but the algorithm developed shows about 85% success when
judged by the human eye.
• Most of the errors (garbage characters) are produced at the
final character of a word, when the word ends in noon or
a character with a shape similar to noon.
• It is hard to detect which character is the final one;
therefore the problem cannot be overcome easily.
• A large percentage of the errors is produced by the characters
seen ( ‫)س‬, sheen ( ‫)ش‬, swad ( ‫)ص‬, dwad ( ‫)ض‬, noon ( ‫ )ن‬and noon
ghunna ( ‫)ں‬, which in most cases pass the character
test during segmentation,
• whereas bee ( ‫) ب‬, pee ( ‫)پ‬, tee ( ‫) ت‬, tay ( ‫)ٹ‬, cee ( ‫ )ث‬and
fee ( ‫ )ف‬also produce garbage characters in some cases.
Thanks

The Role of a Bias Weight

[Diagram: inputs x1 … xn with weights W1 … Wn, plus a fixed input -1 with weight W0, feed the activation g.]

Without the bias weight, the decision boundary is
W1 a1 + W2 a2 = 0
(a line through the origin of the (a1, a2) plane).

With the bias weight, the decision boundary is
W1 a1 + W2 a2 + W0 = 0
(a line that can be shifted away from the origin).
Binary
AND, OR, NOT
AND

[Perceptron: inputs x1, x2 with weights w1, w2, threshold θ = 0.5, output y = f(x1 w1 + x2 w2), where f(a) = 1 for a > θ and 0 for a ≤ θ.]

input   output
0 0     0       f(0·w1 + 0·w2) = 0
0 1     0       f(0·w1 + 1·w2) = 0
1 0     0       f(1·w1 + 0·w2) = 0
1 1     1       f(1·w1 + 1·w2) = 1

Some possible values for w1 and w2:
w1    w2
0.20  0.35
0.20  0.40
0.25  0.30
0.40  0.20
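Each weight pair in the table can be checked against the AND truth table (a quick verification sketch of the slide's values):

```python
# Check that each (w1, w2) pair from the table implements AND with
# threshold theta = 0.5: f(a) = 1 for a > theta, else 0.
def f(a, theta=0.5):
    return 1 if a > theta else 0

pairs = [(0.20, 0.35), (0.20, 0.40), (0.25, 0.30), (0.40, 0.20)]
truth = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}

for w1, w2 in pairs:
    ok = all(f(x1 * w1 + x2 * w2) == y for (x1, x2), y in truth.items())
    print(w1, w2, ok)  # True for every pair
```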
XOR

[Same single-layer perceptron: inputs x1, x2 with weights w1, w2, threshold θ = 0.5, output y = f(x1 w1 + x2 w2).]

input   output
0 0     0       f(0·w1 + 0·w2) = 0
0 1     1       f(0·w1 + 1·w2) = 1
1 0     1       f(1·w1 + 0·w2) = 1
1 1     0       f(1·w1 + 1·w2) = 0

Some possible values for w1 and w2:
w1    w2
(none -- no single pair of weights satisfies all four rows)
XOR

[Two-layer network: inputs x1, x2 feed a hidden unit z (weights w1, w2) and also feed the output unit y directly (weights w3, w4); z feeds y with weight w5; θ = 0.5 for both units.]

input   output
0 0     0
0 1     1
1 0     1
1 1     0

y = f(w1, w2, w3, w4, w5), where f(a) = 1 for a > θ, 0 for a ≤ θ

A possible set of values for the ws:
(w1, w2, w3, w4, w5) = (0.3, 0.3, 1, 1, -2)
XOR

[Two-layer network: inputs x1, x2 feed two hidden units (weights w1 … w4), which feed the output unit (weights w5, w6); θ = 0.5 for all units.]

input   output
0 0     0
0 1     1
1 0     1
1 1     0

y = f(w1, w2, w3, w4, w5, w6), where f(a) = 1 for a > θ, 0 for a ≤ θ

A possible set of values for the ws:
(w1, w2, w3, w4, w5, w6) = (0.6, -0.6, -0.7, 0.8, 1, 1)
XOR

XOR can be solved by a more complex network with hidden units.

bipolar input    target
(-1, -1)         -1
(-1, 1)           1
(1, -1)           1
(1, 1)           -1

[Network diagram: x1 and x2 feed hidden units z1 and z2 through weights of magnitude 2 with mixed signs, and z1, z2 feed the output Y.]
Linear Separation

Linear Discriminant:   Y = a(X) + b
Logistic Regression:   Y = 1 / (1 + e^-(a(X) + b))
AND, OR, NOT

[AND unit: inputs x1, x2 with weights 1.0, 1.0 feed an integrate-and-threshold unit with threshold 1.5.]

AND, OR, NOT

[OR unit: inputs x1, x2 with weights 1.0, 1.0 feed an integrate-and-threshold unit with threshold .9.]

AND, OR, NOT

[NOT unit: input x1 with weight -1.0 feeds an integrate-and-threshold unit with threshold .5.]
Perceptron Learning Algorithm:

i. Initialise weights and threshold.
Set wi(t), (0 <= i <= n), to be the weight i at time t, and ø to be the
threshold value in the output node. Set w0 to be -ø, the bias, and x0
to be always 1. Set wi(0) to small random values, thus initialising the
weights and threshold.
ii. Present input and desired output.
Present input x0, x1, x2, ..., xn and desired output d(t).
iii. Calculate the actual output.
y(t) = fh[w0(t)x0(t) + w1(t)x1(t) + .... + wn(t)xn(t)]
iv. Adapt weights.
wi(t+1) = wi(t) + ñ[d(t) - y(t)]xi(t), where 0 <= ñ <= 1 is a positive
gain term that controls the adaptation rate.
Steps iii. and iv. are repeated until the iteration error is less than a
user-specified error threshold or a predetermined number of
iterations have been completed.

Perceptron Learning Algorithm (formal version):
start: The weight vector w0 is generated randomly; set t := 0.
test: A vector x ∈ P ∪ N is selected randomly;
if x ∈ P and wt · x > 0, go to test;
if x ∈ P and wt · x ≤ 0, go to add;
if x ∈ N and wt · x < 0, go to test;
if x ∈ N and wt · x ≥ 0, go to subtract.
add: set wt+1 = wt + x and t := t + 1; go to test.
subtract: set wt+1 = wt − x and t := t + 1; go to test.
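The update rule in step iv can be sketched directly (a minimal illustration; the gain term, epoch count and AND training data are made-up examples, not from the slides):

```python
# Perceptron learning: w_i(t+1) = w_i(t) + eta * (d - y) * x_i, with x0 = 1
# acting as the constant bias input. Trained on the linearly separable AND.
def step(a):
    return 1 if a > 0 else 0

def train_perceptron(samples, eta=0.1, epochs=50):
    w = [0.0, 0.0, 0.0]  # [bias weight, w1, w2]
    for _ in range(epochs):
        for (x1, x2), d in samples:
            x = [1, x1, x2]  # x0 = 1 is the constant bias input
            y = step(sum(wi * xi for wi, xi in zip(w, x)))
            # Weights change only when the output is wrong (d - y != 0).
            w = [wi + eta * (d - y) * xi for wi, xi in zip(w, x)]
    return w

and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train_perceptron(and_data)
preds = [step(w[0] + w[1] * x1 + w[2] * x2) for (x1, x2), _ in and_data]
print(preds)  # [0, 0, 0, 1]
```

For linearly separable data such as AND, the perceptron convergence theorem guarantees this loop reaches a correct weight vector in a finite number of updates; for XOR it would cycle forever, as the earlier slides show.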
Neural Networks – Training
Backpropagation training cycle

[Diagram of the backpropagation training cycle.]
