
Recognize Assyrian Cuneiform Characters by a Virtual Dataset

Abdul Monem S. Rahma #1, Ali Adel Saeid #2, Muhsen J. Abdul Hussien #3
# Department of Computer Science, University of Technology, Baghdad, Iraq
1 110003@uotechnology.edu.iq

Abstract— Cuneiform symbols represent a complex problem in pattern recognition, in particular for OCR (optical character recognition), due to challenges related to character distortion and font heterogeneity. This paper proposes a new approach to recognising Assyrian cuneiform characters, using OCR to classify the symbols and thereby recognise Assyrian letters built from complex combinations of symbols. The dataset utilised consists of 16 patterns that reflect all probabilities associated with each cuneiform symbol in terms of shape and direction, assuming each character consists of a set of symbols. Polygon approximation techniques are used to generate the feature vectors for the classification tasks. The proposed method obtains classification ratios of up to 91%, depending on the algorithm used to construct the feature vector.

Keywords— Cuneiform, polygon approximation, dataset, pattern recognition, OCR.
I. INTRODUCTION
Cuneiform writing is one of the oldest written systems, invented in the land of Mesopotamia during the third millennium BC, around 3200 BC [1]. The system began as a collection of symbols depicting images of things and appeared in the ancient Sumerian language. This language underwent stages of evolution that transformed the symbols into the cuneiform patterns used in the Babylonian and Assyrian languages. The cuneiform system differs from the hieroglyphic visual language in that it is a vocal and expressive language, formed in different terms to express certain meanings. Around 100,000 cuneiform tablets have been discovered and are now held in museums around the world [2], especially the Iraqi Museum in Baghdad, which contains nearly 20,000 cuneiform tablets representing different civilisations. Because of the small number of translations that deal with the cuneiform language, it is necessary to use information technology, especially the areas dealing with the interpretation of patterns and symbols, to address the problem of translation. The field was therefore opened for researchers to adopt different concepts and approaches to achieve efficient translations.

From a recognition perspective, Hilal Yousif [1] adopted a density curve of cuneiform symbols to create feature vectors and classified the symbols with a KNN classifier. Fahimeh Mostofi [3] suggested a character recognition system for Old Persian cuneiform based on a neural network methodology for the classification tasks. Leonard Rothacker [4] proposed a new approach for identifying cuneiform symbols by describing each symbol with statistical models (e.g., Bag-of-Features and Hidden Markov models) together with SIFT descriptors. For documenting and presenting cuneiform tablet images, Jonathan Cohen [5] adopted an internet deployment platform that supports searches of a digital archive of cuneiform tablet images with different 2D views, based on a scanning technique supported by Java programming tools.

This paper proposes a new method to recognise Assyrian cuneiform characters using OCR (optical character recognition) techniques. The process analyses the segmented symbols of a cuneiform character to determine their quantity and directions (horizontal, vertical, or diagonal), which are used as classification characteristics through a feature vector containing appropriate boundary features. A polygon approximation technique creates the feature vector.

II. ASSYRIAN CUNEIFORM LANGUAGE

The Assyrian cuneiform language represents one stage in the development of cuneiform writing in Mesopotamia, which continued from the beginning of the first millennium BC to 600 BC. Its method relies on drilling symbols into clay or stone tablets from left to right to form groups that reflect basic language meanings. The cuneiform language includes a set of about 600 letters, each of which consists of one or more symbols. These symbols, or wedges, are organised in horizontal, vertical, or diagonal directions [1]. The letters and their corresponding symbols vary from one character to another in the number of symbols, their direction, and their location, as seen in Figure (1).

Fig. (1). Models of Assyrian cuneiform writings.

Many challenges associated with cuneiform writing create obstacles for analysis and recognition. One issue is the distortion of characters and the heterogeneity of fonts and patterns, as compared in Figure (2). Another complication results from the shadows of symbols, which may change from one image to another (of the same character) because the angle of the reflected light interacts with the three-dimensional geometry of the cuneiform symbol [6].

Fig. (2). Different fonts can be distinguished between the first letter (a-c) and the second letter (d-e).

Figure (3) illustrates how the locations of dark areas vary from one image to another depending on the angle of the illuminating light. This issue affects the character recognition analysis and is evident when two images of the same character are subjected to the segmentation processes and generate different results.

Fig. (3). The effect of the direction of reflected light.
III. OPTICAL CHARACTER RECOGNITION (OCR)

OCR has been considered a main branch of pattern recognition since the middle of the twentieth century. Since 1950, the field has been the subject of much research and development, as it supports institutional and government applications and is widely used in financial, banking, and archiving applications. There are several definitions of OCR, such as the process of selecting an image segment and determining the corresponding text character, or the process of choosing the correct pattern for an image segment [7]. Others define OCR as the process of reading written text and converting it to an ASCII representation [8].

A general model for OCR follows these stages:
1. pre-processing
2. segmentation
3. feature extraction
4. classification
5. post-processing

A. Pre-processing

This first step consists of sequential processes that deal with the raw image data. The aim is to remove noise and enhance the image data to the level of quality required by the subsequent stages. The pre-processing utilised in this paper follows these steps (an illustrative code sketch covering them appears after the labelling algorithm below):

1. Image enhancement: Remove noise from the image, which originates from scanning or photographing. The median filter is used,

   Y(n) = med[X(n - k), ..., X(n), ..., X(n + k)]    (1)

   where Y(n) is the output image and [X(n - k), ..., X(n + k)] are the ranked pixel values within a specific window size.

2. Image binarization: Convert the grey-level intensities to a binary image with only two colour tones, black and white, representing the background and foreground regions. This paper adopts Otsu's method [9, 10] for global binarisation of the cuneiform images, which selects the threshold value that minimises the weighted sum of the variances of the background, b, and the foreground, f. Starting by separating the image intensities into two intervals, dark and light, the first interval is V0 = {0, 1, ..., v} and the second is V1 = {v + 1, ..., l - 1, l}; the threshold is then evaluated according to the following formulas for each interval:

   σ²w(v) = wb(v) · σ²b(v) + wf(v) · σ²f(v)    (2)

   where

   wb(v) = Σ_{i=1..v} p(i)    (3)

   wf(v) = Σ_{i=v+1..l} p(i)    (4)

   μb(v) = Σ_{i=1..v} i · p(i) / wb(v)    (5)

   μf(v) = Σ_{i=v+1..l} i · p(i) / wf(v)    (6)

   σ²b(v) = Σ_{i=1..v} (i - μb(v))² · p(i) / wb(v)    (7)

   σ²f(v) = Σ_{i=v+1..l} (i - μf(v))² · p(i) / wf(v)    (8)

   Next, the separation process is iterated to choose new interval boundaries (in each iteration the threshold is shifted by one intensity level) and equations (3)-(8) are recalculated until the value of (2) reaches its minimum, which is the selection criterion.
3. Removing spots: Cuneiform tablets suffer from distortion features, such as spots, because of their age and the nature of the materials used, such as stone and clay. These spots appear in the image and negatively affect the analytical results of the subsequent stages, as seen in Figure (4). To remove them, an image connected-component labelling technique is adopted.

   Fig. (4). Spots, unrelated to the symbols, appear in the two images.

   Image connected-component labelling is an important field of pattern recognition, computer vision, and machine intelligence. With this technique, each connected segment in a binary image is characterised by a unique label that distinguishes it from the other segments, as represented in Figure (5). The technique is also required in other applications, such as target identification, diagnostic applications, and biometric applications [11]. Many theories and algorithms, such as multi-scan, two-scan, hybrid, and tracing-type algorithms, have contributed to the evolution of this technique and have especially improved its speed in real-time applications [11]. This work adopts the multi-scan approach implemented with image morphology [12], which depends on the dilation principle with the following formula:

   A ⊕ B = { z | (B̂)z ∩ A ≠ ∅ }    (9)

   where A and B are sets in Z² and A ⊕ B denotes the dilation of A by B.

   Fig. (5). Image labelling.

   The following algorithm outlines the steps for connected-component labelling (CCL) based on the morphology concept.

   Image Labelling Algorithm
   Input: thresholded image. Output: labelled image.
   Start with the binary image whose foreground segments form the set A, let B be the structuring element, and set Label = 0.
   DO
     Step 1: Scan the binary image row by row to determine the first seed point, P, of the first remaining segment.
     Step 2: Label = Label + 1.
     Step 3: Let X0 = P.
     Step 4: Apply the following formula: Xk = (Xk-1 ⊕ B) ∩ A, k = 1, 2, 3, ...
     Step 5: If (Xk ≠ Xk-1), go to Step 4.
     Step 6: For the pixels of the converged component (where A(i, j) == Xk(i, j) == 1), set A(i, j) = 0 and y(i, j) = Label.
     Step 7: Go to Step 1.
   WHILE sum(A(i, j)) ≠ 0
   Step 8: Print y, the labelled image.
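To make the pre-processing stages concrete, the following Python sketch illustrates one possible implementation of the median filtering of eq. (1), the Otsu threshold search of eqs. (2)-(8), and a morphological labelling routine in the spirit of the algorithm above. It is an illustrative sketch only, not the authors' implementation: NumPy and SciPy are assumed to be available, and the 3x3 structuring element, the window size, and the 8-bit grey-level range are assumptions introduced here.

import numpy as np
from scipy import ndimage

def otsu_threshold(gray):
    # Exhaustive search for the level v minimising the weighted within-class
    # variance of eq. (2); p(i) is the normalised grey-level histogram.
    hist = np.bincount(gray.ravel().astype(np.uint8), minlength=256)
    p = hist / hist.sum()
    levels = np.arange(256)
    best_v, best_var = 0, np.inf
    for v in range(1, 256):
        wb, wf = p[:v].sum(), p[v:].sum()                      # eqs. (3)-(4)
        if wb == 0 or wf == 0:
            continue
        mb = (levels[:v] * p[:v]).sum() / wb                   # eq. (5)
        mf = (levels[v:] * p[v:]).sum() / wf                   # eq. (6)
        vb = (((levels[:v] - mb) ** 2) * p[:v]).sum() / wb     # eq. (7)
        vf = (((levels[v:] - mf) ** 2) * p[v:]).sum() / wf     # eq. (8)
        var_w = wb * vb + wf * vf                              # eq. (2)
        if var_w < best_var:
            best_v, best_var = v, var_w
    return best_v

def preprocess(gray, window=3):
    # Median filtering (eq. 1) followed by global Otsu binarisation.
    smoothed = ndimage.median_filter(gray, size=window)
    return (smoothed >= otsu_threshold(smoothed)).astype(np.uint8)

def label_by_dilation(binary):
    # Conditional dilation X_k = (X_{k-1} dilated by B) intersected with A,
    # grown from one seed pixel per component, as in the labelling algorithm.
    A = binary.astype(bool)
    y = np.zeros(A.shape, dtype=int)
    B = np.ones((3, 3), dtype=bool)                  # assumed structuring element
    label = 0
    while A.any():
        label += 1
        seed = np.zeros_like(A)
        seed[tuple(np.argwhere(A)[0])] = True        # first remaining foreground pixel
        comp, prev = seed, np.zeros_like(A)
        while (comp != prev).any():                  # stop once X_k == X_{k-1}
            prev = comp
            comp = ndimage.binary_dilation(comp, structure=B) & A
        y[comp] = label                              # record the label
        A &= ~comp                                   # remove the component from A
    return y

The dilation-based loop is simple but slow on large tablets; the multi-scan and two-scan algorithms cited in [11] are the faster alternatives noted above.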
B. Image segmentation

This stage of the OCR recognition system is the segmentation process, which separates the image components into distinct parts depending on the analysis methodology adopted by the recognition system. The separated parts may therefore be characters from a word or line segments from a paragraph. In this paper, the segmentation process is applied with the connected-component labelling algorithm previously described in Section III.A.3.
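As an illustration of this stage, the sketch below extracts one sub-image per labelled symbol. SciPy's labeller is used as a stand-in for the morphological labelling of Section III.A.3, and cropping by bounding boxes is an assumption about how the segments are handed to feature extraction, not a detail given in the paper.

import numpy as np
from scipy import ndimage

def segment_symbols(binary):
    # Label the connected components and crop one bounding box per symbol.
    labels, count = ndimage.label(binary)
    boxes = ndimage.find_objects(labels)           # one slice pair per label
    return [np.where(labels[box] == i + 1, binary[box], 0)
            for i, box in enumerate(boxes) if box is not None]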
C. Feature extraction

Feature extraction is an important step in character recognition because all successive analysis steps depend on its results. The step is defined by the process of extracting a minimum number of important points, or character features, that best represent the character. Feature extraction techniques in character recognition methodology are classified into four groups: statistical, global transformation, geometric, and approximation features [13, 14]. This work applies the approximation approach, which depends on selecting the optimal points that represent the boundary shape of a region. This selection procedure creates the feature vector of each class to represent each pattern. The polygonal approximation method is a contour approximation technique that reforms the closed curve of a polygon into a new shape with a minimum number of lines or polygonal segments [15]. Mathematically, the approach is defined by letting G = {g1, g2, ..., gn; g1 = gn} = {(x1, y1), ..., (xn, yn)} represent a polygon's vertices in 2D space, which are approximated by a polygon of m line segments with (m + 1) vertices, Q = {q1, ..., qm+1; q1 = qm+1}. In general, three strategies are adopted in the implementation of polygon approximation: the sequential method, the split-and-merge method, and the hybrid method. The method utilised in this research is based on an updated form of the sequential method [15], which is derived from the idea of dominant points (DP).

The algorithm starts by generating breakpoints from the shape's boundary after the boundary has been determined by digitising the points. Next, Freeman's chain code is applied, as shown in Figure (6), to determine the breakpoints, where a breakpoint is a point whose chain-code value differs from the values of the previous and subsequent points. The breakpoints at this stage represent the initial version of the dominant points, as presented in Figure (6).

Fig. (6). (a) Freeman's chain code. (b) Dominant points.

To apply the sequential principle, the associated error value (AVE) must be calculated for every dominant point DPj as the perpendicular distance between the boundary and the straight line joining its two neighbouring dominant points, and the point with the minimum value among all elements is then determined and eliminated. The AVE is recalculated for the two neighbours of the eliminated point, (DPj-1, DPj+1), and the maximum error is computed to check the termination condition. This iterative procedure for eliminating dominant points continues until the termination condition is satisfied. The following outlines the pseudocode for this algorithm [15].

Polygon Approximation Algorithm
Input: binarized image. Output: approximate polygon points.
Start with the binary image.
Step 1: Apply an edge detection method with a suitable filter.
Step 2: Apply a thinning technique.
Step 3: Compute Freeman's chain code for the boundary.
Step 4: Find the breakpoints.
Step 5: Compute the AVE for all DPs.
Step 6: Repeat.
Step 7: Determine the DP with the minimum value, DPmin.
Step 8: Remove DPmin from the dominant table.
Step 9: Recalculate the AVE for DPmin's adjacent neighbours.
Step 10: Compute maxerror.
Step 11: Repeat until (maxerror < threshold).
Step 12: The remaining DPs are the approximate polygon points.
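One possible reading of this elimination loop is sketched below. It assumes the boundary is an ordered, 8-connected closed pixel trace without repeated points; the chain-code table, the default tolerance, and the stopping rule (stop once the smallest removal error would exceed the tolerance) are interpretive assumptions rather than the authors' exact procedure.

import numpy as np

# 8-direction Freeman chain code, indexed by the (dx, dy) step between pixels.
FREEMAN = {(1, 0): 0, (1, 1): 1, (0, 1): 2, (-1, 1): 3,
           (-1, 0): 4, (-1, -1): 5, (0, -1): 6, (1, -1): 7}

def chain_code(boundary):
    # boundary: list of (x, y) pixels forming a closed, 8-connected trace.
    closed = boundary + boundary[:1]
    return [FREEMAN[(int(np.sign(x1 - x0)), int(np.sign(y1 - y0)))]
            for (x0, y0), (x1, y1) in zip(closed, closed[1:])]

def perpendicular_error(p, a, b):
    # Distance from dominant point p to the chord joining its neighbours a and b.
    a, b, p = (np.asarray(v, dtype=float) for v in (a, b, p))
    ab = b - a
    if not ab.any():
        return float(np.hypot(*(p - a)))
    return float(abs(ab[0] * (p - a)[1] - ab[1] * (p - a)[0]) / np.hypot(*ab))

def approximate_polygon(boundary, tolerance=2.0):
    codes = chain_code(boundary)
    # Breakpoints: points whose chain code differs from the previous one.
    dp = [boundary[i] for i in range(len(boundary)) if codes[i] != codes[i - 1]]
    while len(dp) > 3:
        errors = [perpendicular_error(dp[i], dp[i - 1], dp[(i + 1) % len(dp)])
                  for i in range(len(dp))]
        i_min = int(np.argmin(errors))
        if errors[i_min] > tolerance:      # removal would exceed the allowed deviation
            break
        dp.pop(i_min)                      # eliminate the least significant point
    return dp                              # surviving dominant points (polygon vertices)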
D. Classification method

The last stage is the classification task, for which the K-Nearest Neighbour (KNN) classifier is adopted. The classification decision uses distance as the decision metric, calculated as the Euclidean distance between the feature vector of the shape or symbol under consideration and all other feature vectors of the patterns in the learning set. The results are ranked to determine the minimum distance, which corresponds to the appropriate pattern.

d = √( Σ_{i=1..n} (li - t)(li - t) )    (10)

where
n: number of feature vectors in the dataset,
li: feature vectors in the dataset,
t: tested feature vector.
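A minimal sketch of this nearest-neighbour decision is given below, assuming the learning set is stored as rows of equal-length feature vectors with one class label per row; the names and the parameter k are illustrative and not taken from the paper.

import numpy as np

def knn_classify(test_vector, train_vectors, train_labels, k=1):
    # Euclidean distance between the tested vector and every stored pattern.
    train = np.asarray(train_vectors, dtype=float)
    t = np.asarray(test_vector, dtype=float)
    distances = np.sqrt(((train - t) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]            # indices of the k smallest distances
    votes = [train_labels[i] for i in nearest]
    return max(set(votes), key=votes.count)        # majority vote; with k == 1 this is
                                                   # simply the minimum-distance pattern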
IV. PROPOSED MODEL

The model proposed here to recognise Assyrian cuneiform characters includes two components: the first is the proposed dataset and the second is the applied OCR concepts, according to the following discussion.

A. Proposed dataset

The design of the dataset plays a major role in the classification process. This paper proposes a virtual dataset consisting of 16 patterns with triangular forms consistent with the geometric shapes of three-dimensional cuneiform symbols. The set covers all the possible configurations of cuneiform characters in any direction (horizontal, vertical, or diagonal) and is compatible with the patterns of cuneiform symbols that exhibit the shadow issues previously discussed. Figure (13) shows the classification of the triangular dataset patterns consistent with all probable cuneiform symbols. Distinguishing one pattern from another depends on two factors: the number of heads and the direction. This is adopted in the construction of the feature vectors, for which the designed dataset, rather than only test data, is used.

B. Feature vector design

The construction of a dataset containing feature vectors representing each pattern is shown in Figure (7). Each vector consists of the Cartesian coordinates of each head, corresponding to the shape of the pattern. For example, in Figures (7) and (8), the pattern with three heads has a vector of 14 cells, represented by the (X, Y) coordinates of seven vertices.

Fig. (7). The selected feature points for each pattern, where (a) has seven feature points and (b) has five feature points.

Fig. (8). Feature vector for the pattern in Figure (7a).
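A minimal sketch of how such a vector could be assembled from the approximated heads follows; ordering the heads by angle around their centroid is an assumption introduced here to make vectors comparable across samples, not a detail stated in the paper.

import numpy as np

def feature_vector(vertices):
    # vertices: list of (x, y) heads returned by the polygon approximation.
    pts = np.asarray(vertices, dtype=float)
    centre = pts.mean(axis=0)
    order = np.argsort(np.arctan2(pts[:, 1] - centre[1], pts[:, 0] - centre[0]))
    return pts[order].reshape(-1)                  # e.g. 7 heads -> 14-cell vector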
C. Proposed recognition system

The system in this research is consistent with the general model of OCR described previously. The pre-processing stage applies a median filter to the image, and binarization with Otsu's method provides the result in Figure (9).

Fig. (9). The output of the binarization technique.

The following connected-component algorithm solves the spots problem.

Spot Removal Algorithm
Input: binarized image. Output: spot-free image.
Start with a binarized image, B, containing foreground segments and spots.
Step 1: Apply the image labelling algorithm.
Step 2: Apply a histogram process to the labelled image; let the label-count vector be Hi.
Step 3: Calculate the total number of pixels in the image, T.
Step 4: Let R be the ratio vector, R = 0.
Step 5: Divide each label's count from Step 2 by T: R(i) = Hi / T, where i = 1, ..., number of labels.
Step 6: For each element of R, if R(i) < threshold, then R(i) = 0.
Step 7: Scan each pixel in B; if the corresponding label has R(i) == 0, then B(x, y) = 0.
Step 8: Print B, representing the spot-free image.
Step 9: End.

Figure (10) presents results from the proposed algorithm.

Fig. (10). (a) Binarized image, (b) spot-free image.
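The ratio test above can be sketched as follows; SciPy's labeller again stands in for the paper's labelling algorithm, and the ratio threshold value is an assumed figure.

import numpy as np
from scipy import ndimage

def remove_spots(binary, ratio_threshold=0.01):
    labels, count = ndimage.label(binary)
    counts = np.bincount(labels.ravel())           # H_i: pixel count per label
    ratios = counts / binary.size                  # R(i) = H_i / T
    keep = ratios >= ratio_threshold
    keep[0] = False                                # label 0 is the background
    return np.where(keep[labels], binary, 0)       # zero out the small (spot) labels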
By constructing the feature vector with the polygon approximation concept, the following algorithm builds on the previous dominant-point algorithm.

Polygon Approximation Algorithm (feature vector construction)
Input: binarized image. Output: approximate polygon points.
Start with the binary image.
Let z = 0, where z is the number of heads or vertices.
Set th = 1, where th is the threshold.
Step 1: Apply an edge detection method with a suitable filter.
Step 2: Apply a thinning technique.
Step 3: Compute Freeman's chain code for the boundary.
Step 4: Find the breakpoints.
Step 5: Compute the AVE for all DPs.
Step 6: Repeat.
Step 7: Determine the DP with the minimum value, DPmin.
Step 8: Remove DPmin from the dominant table.
Step 9: Recalculate the AVE for DPmin's adjacent neighbours.
Step 10: Compute maxerror.
Step 11: Repeat until maxerror < th.
Step 12: z = the number of remaining DPs, the approximate polygon points.
Step 13: Delete all DPs that form a straight angle with their neighbours.
Step 14: th = th - eps, where eps is an epsilon value, e.g., 0.009.
Step 15: Repeat until ((z == 3) or (z == 5) or (z == 7) or (z == 9)).
Step 16: End, where z = number of heads.

The results obtained from this algorithm on the cuneiform symbols are shown in Figure (11). The coordinates of the red dots form the basis for constructing the feature vector.

Fig. (11). Red circles indicate the candidate points from the dominant points.
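For illustration, the head-count termination of Steps 14-15 can be emulated with OpenCV's polygon approximation acting as a stand-in for the dominant-point routine above; the starting tolerance and the decrement are assumed values rather than the th = 1 and eps = 0.009 reported in the pseudocode.

import cv2
import numpy as np

def count_heads(contour, start_tolerance=15.0, step=0.5):
    # contour: an (N, 1, 2) int32 array, e.g. one element of
    # cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)[0].
    tolerance = start_tolerance
    approx = cv2.approxPolyDP(contour, tolerance, True)    # closed-curve approximation
    while len(approx) not in (3, 5, 7, 9) and tolerance > step:
        tolerance -= step                                  # th = th - eps in the pseudocode
        approx = cv2.approxPolyDP(contour, tolerance, True)
    return len(approx), approx.reshape(-1, 2)              # z and the head coordinates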

V. RESULTS AND DISCUSSION


This section reviews the results achieved by applying the proposed system (Figure (12)). To demonstrate the system's efficiency, test data consisting of 85 images taken from the Assyrian Hall of the Iraqi Museum, containing 350 cuneiform symbols, are used; the goal, as stated, is to evaluate the proposed recognition method on real cuneiform symbols together with the new dataset. Different methods are used to construct the feature vectors, and their results are compared with the proposed method. With feature vectors built from the Zernike moment, Hu's moment, and the projection histogram, the KNN classifier provides accuracies of 67%, 73%, and 81%, respectively, whereas applying the proposed system results in 91% accuracy, as shown in Table I. The drawback of classification with the Zernike moment and the projection histogram is that they require a high degree of similarity between the symbols and the corresponding pattern, which cannot be achieved with irregular symbols. The weakness of the remaining method, Hu's moments, centres on its invariance properties, which lead to incorrect classifications, especially with respect to the direction of a pattern. The KNN classifier was adopted in this paper to evaluate the reliability of the approximation process used to generate the feature vector. In general, KNN provides a fairly acceptable classification level; even so, the value achieved by the proposed method is high and would support adopting a stronger classification model such as SVM.

TABLE I
COMPARISON BETWEEN THE PROPOSED FEATURE VECTOR ALGORITHM AND OTHER ALGORITHMS

Method used to construct the feature vector    Classification result
Zernike moment                                 67%
Hu's moment                                    73%
Projection histogram                           81%
Polygon approximation (proposed)               91%

The following figures show some of the matching results obtained from the classification, the diagram for classifying the dataset patterns, and the framework of the proposed system.

Fig. (12). The proposed system output.

Fig. (13). The diagram for classifying the dataset patterns.

Fig. (14). The framework of the proposed system.


VI. CONCLUSIONS

A new recognition method is proposed for images of Assyrian cuneiform tablets, which depends on an OCR technique with polygon approximation to identify cuneiform symbol features. The proposed model was tested and compared with results from other approaches. Depending on the method used to construct the feature vectors, the best classification rate achieved here is 91%. The classification task is implemented with a newly proposed virtual dataset containing 16 patterns that reflect all possible configurations of cuneiform symbols. The spots problem associated with the binarization technique is solved by a connected-component labelling algorithm based on a morphology concept.

REFERENCES

[1] Y. Hilal, M. R. Abdul, and H. Alani, "Cuneiform Symbols Recognition Using Intensity Curves," The International Arab Journal of Information Technology, vol. 3, no. 3, July 2006.
[2] S. E. Anderson and M. Levoy, "Unwrapping and Visualizing Cuneiform Tablets," IEEE Computer Graphics and Applications, vol. 22, no. 6, pp. 82-88, November/December 2002.
[3] F. Mostofi and A. Khashman, "Intelligent Recognition of Ancient Persian Cuneiform Characters," Intelligent Systems Research Centre, Department of Computer Engineering, August 2014.
[4] L. Rothacker, D. Fisseler, and G. G. W. Müller, "Retrieving Cuneiform Structures in a Segmentation-free Word Spotting Framework," Proceedings of the 3rd International Workshop on Historical Document Imaging and Processing, August 2015.
[5] J. Cohen, D. Duncan, D. Snyder, and J. Cooper, "iClay: Digitizing Cuneiform," The 5th International Symposium on Virtual Reality, Archaeology and Cultural Heritage.
[6] D. Fisseler, F. Weichert, G. G. W. Müller, and M. Cammarosano, "Towards an Interactive and Automated Script Feature Analysis of 3D Scanned Cuneiform Tablets," Scientific Computing and Cultural Heritage, 2013.
[7] N. Venkata Rao, A. S. C. S. Sastry, A. S. N. Chakravarthy, and P. Kalyanchakravarthi, "Optical Character Recognition Technique Algorithms," Journal of Theoretical and Applied Information Technology, May 2008.
[8] S. M. Murtoza Habib, "Bangla Optical Character Recognition," Department of Computer Science and Engineering, BRAC University, December 2005.
[9] J. Yousefi, "Image Binarization Using Otsu Thresholding Algorithm," University of Guelph, Ontario, Canada, April 18, 2011.
[10] N. Bhargava, A. Kumawat, and R. Bhargava, "Threshold and Binarization for Document Image Analysis Using Otsu's Algorithm," International Journal of Computer Trends and Technology (IJCTT), vol. 17, no. 5, November 2014.
[11] L. He, Y. Chao, K. Suzuki, and K. Wu, "Fast Connected-Component Labeling," Pattern Recognition, vol. 42, 2009.
[12] R. C. Gonzalez, Digital Image Processing, Prentice-Hall, 2002.
[13] D. Nasien, "A Review on Feature Extraction and Feature Selection for Handwritten Character Recognition," (IJACSA) International Journal of Advanced Computer Science and Applications, vol. 6, no. 2, 2015.
[14] G. Y. Tawde and J. M. Kundargi, "An Overview of Feature Extraction Techniques in OCR for Indian Scripts Focused on Offline Handwriting," International Journal of Engineering Research and Applications, vol. 3, no. 1, pp. 919-926, January-February 2013.
[15] H. Abbasi, M. Olyaee, and H. Rezahafari, "Rectifying Reverse Polygonization of Digital Curves for Dominant Point Detection," IJCSI International Journal of Computer Science Issues, vol. 10, issue 3, no. 2, May 2013.
