
2010 The 3rd International Conference on Machine Vision (ICMV 2010)

Handwritten Uighur Character Segmentation and Performance Evaluation


Jing Li (1), Zhaoyang Lu (1), Adili Yimiti (1,2), Fuxiu Tan (1)
(1) School of Telecommunications Engineering, The State Key Laboratory of Integrated Services Networks, Xidian University, Xian, 710071, China
(2) XinJiang Normal University, Urumqi, 830054, China
jinglixd@mail.xidian.edu.cn

Abstract: Recognition of handwritten Uighur words is important for Uighur information automation and for the development of new-generation handwritten input systems on mobile platforms. A robust and accurate handwritten character segmentation algorithm is an important prerequisite for Uighur recognition. Taking into account computation cost, robustness and the characteristics of the text itself, we propose a simple but effective handwritten Uighur character segmentation algorithm. Furthermore, we develop a Uighur input system on an intelligent mobile platform and use it to construct a medium-scale database of handwritten Uighur words. The segmentation algorithm is evaluated in detail on this database, and extensive experiments demonstrate the robustness and efficiency of the proposed algorithm.

Keywords: Uighur recognition; character segmentation; mobile phone; performance evaluation

Figure 1. The 128 Uighur characters

I. INTRODUCTION

Uighur belongs to the Turkic branch of the Altaic language family, and modern Uighur is written in an Arabic-alphabet-based phonetic script. Uighur has 32 letters (24 consonants and 8 vowels), which form a total of 128 characters (see Figure 1). In the Xinjiang region alone, about 9 million people use the Uighur language. Handwriting is one of the most effective ways for people to exchange information, and automatic handwritten text recognition is a very natural means of human-computer interaction and communication. Owing to rapid advances in computer technology, handwritten character recognition has developed significantly [1,2]. Compared with recognition systems for Chinese, Latin, Japanese and other major scripts [3,4], however, handwritten Uighur recognition technology has lagged behind.

Our objective is to develop a handwritten Uighur word recognition system on the mobile phone. Broadly speaking, word recognition approaches fall into two classes [5]. The first performs character segmentation as a pre-processing step and feeds each character into the recognition model. The segmentation stage is clearly difficult, especially for handwritten script, and it has therefore attracted continuing research [6-13] covering both offline and online words. The second class, known as the global approach, performs recognition on the whole representation of a word in order to avoid the difficulties of the segmentation stage. This approach is outside the scope of this paper; for more details on the holistic approach see [14].

In this paper, we address the problem of handwritten Uighur character segmentation on the mobile phone (see Figure 2). Segmentation is the first and key step of handwritten Uighur recognition: the Uighur word must be segmented precisely into characters, each of which is then sent individually to the character recognition module. Taking into account the limited computation and storage of a smart phone, we develop an efficient character segmentation method that does not require complex calculations and can easily be ported to a mobile phone. Furthermore, in order to evaluate the proposed algorithm objectively, we construct a handwritten Uighur word database for the assessment.

Figure 2. Handwritten Uighur word on the mobile phone

The rest of the paper is structured as follows. Section II details our character segmentation algorithm. Section III, part A, describes the construction of the handwritten Uighur word dataset and the ground-truth labelling. Part B presents the segmentation results and the performance evaluation on the database. Section IV concludes the paper.

ISBN: 978-1-4244-8889-6

ICMV 2011


II. UIGHUR WORD SEGMENTATION ALGORITHM

In the Uighur word recognition system, the ultimate objective is to correctly segment and recognize the characters of a given word. Since Uighur words are cursive, as in Arabic, it is difficult to determine where one character ends and the next begins, especially in handwritten words. In fact, in handwriting a character may end after the beginning of the succeeding character, since adjacent characters can overlap as well as touch (as can be seen in Figure 2). Despite these difficulties, some characteristics of handwritten text remain invariant. Furthermore, because the approach will be applied on a mobile platform, we abandon complex features and choose simple but effective ones. For Uighur character segmentation, we identify four main invariant characteristics.

1. Connected component analysis does not always give the correct location of the characters within a handwritten Uighur word; it is nevertheless a relatively robust feature and helps us separate the word into strokes as precisely as possible.
2. A distinguishing feature of Uighur writing is the presence of the baseline. The baseline is a line that passes through most strokes of a word, and almost all segmentation points between characters fall on this line.
3. A correct segmentation point also lies between two local high points, and it must not lie on a closed curve.
4. The additional strokes do not contain segmentation points. However, they are an important aid for adjusting and correcting the character segmentation.

The detailed steps of the segmentation point extraction algorithm are listed in Table 1 and described below.

Table 1. Uighur word segmentation algorithm
Input: Handwritten Uighur word (see Figure 3)
Processing:
  1. Connected component analysis.
  2. Main and additional strokes analysis.
  3. Strokes assignment and baseline identification.
  4. Potential segmentation points selection.
  5. Wrong segmentation points removal.
     - Remove the points that do not belong to the baseline area.
     - Remove the points that pass through a closed area.
     - Remove the points whose continuity is not strong.
  6. Adjustment of the points according to the additional strokes.
Output: Segmentation result (see examples in Figure 7)

A. Input

Our algorithm takes as input a binary image of text obtained from a smart phone handwritten input system. We developed a handwritten Uighur input system on a Samsung device that provides both the online and the offline data of the text. We also use this system to build the handwritten Uighur database, which is introduced in detail in Section III. In this paper we use only the offline data, because we also plan to extend our algorithm to freehand text, such as handwritten drafts on paper. Figure 3 illustrates sample input images.

Figure 3. Image samples of handwritten Uighur words

B. Processing

1) Connected component analysis

As mentioned earlier, connected components are a relatively robust feature for separating a Uighur word into several strokes. The first step is therefore to detect the connected components of black pixels (see Figure 4, where each stroke is shown in a different color). Through the connected component analysis, we obtain the width, height, area and location of every stroke. In addition, any connected component whose area is lower than a threshold is considered noise and is removed. Noise removal can also be performed as preprocessing before the connected component analysis, in which case the input image is smoothed to reduce the effect of noise.

Figure 4. Connected component analysis. The first row displays the original input handwritten Uighur word. The second and third rows show the connected component analysis results with various bounding boxes.
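The connected component step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the 8-connectivity, the list-of-rows image format (1 for ink, 0 for background), the dictionary layout of a stroke and the `min_area` noise threshold are all assumptions made for illustration.

```python
from collections import deque

def connected_components(img, min_area=2):
    """Label 8-connected components of ink pixels (value 1).

    img is a list of rows of 0/1 values. Returns one dict per component
    with its bounding box, size and pixels. Components with fewer than
    min_area pixels (a hypothetical threshold) are dropped as noise.
    """
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    strokes = []
    for r in range(h):
        for c in range(w):
            if img[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from an unvisited ink pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            if len(pixels) < min_area:
                continue  # treated as noise
            ys = [p[0] for p in pixels]
            xs = [p[1] for p in pixels]
            strokes.append({
                "top": min(ys), "bottom": max(ys),
                "left": min(xs), "right": max(xs),
                "width": max(xs) - min(xs) + 1,
                "height": max(ys) - min(ys) + 1,
                "area": len(pixels),
                "pixels": pixels,
            })
    return strokes
```

Each returned dictionary carries the width, height, area and location that the later steps rely on. Components are reported in row-major order of their first pixel.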

2) Strokes analysis and subword assignment


In this step, the system determines which connected components are main strokes and which are additional strokes (upper or lower). In a Uighur word, the segmentation points must belong to a main stroke and fall near the baseline, so correctly distinguishing the main strokes from the other strokes is an important step.

Uighur characters are connected along an imaginary line called the baseline. There are many ways to find the baseline. In this system the input image is written on the phone screen, and the variation of the input image is limited by the size of the writing interface; hence we use the horizontal projection to detect the baseline of the Uighur word.

We observe that all main strokes pass through or connect with the baseline. Hence, we define two rules to identify the main strokes of a Uighur word:

- The bounding box of the stroke passes through the baseline, and the area of the stroke is not lower than a threshold T_L;
- The area of the stroke is higher than a threshold T_H.

Among the remaining strokes, a stroke located above the baseline is considered an upper additional stroke; otherwise it is considered a lower additional stroke.

After the stroke analysis, we group the strokes into subwords. Each subword contains at least one main stroke and may contain some additional strokes. Each additional stroke is assigned to the main stroke closest to it.

3) Potential segmentation points selection

In a Uighur word, the segmentation points always lie near the baseline, and their most significant feature is that they are located between two high points. This does not mean, however, that every pair of high points encloses a correct segmentation point. In this step, we therefore first find the high points in a Uighur word and then obtain over-segmentation points.
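The baseline detection, the two main-stroke rules and the subword assignment of step 2) can be sketched as follows. This is only an illustration under stated assumptions: a stroke is taken to be main if it satisfies either rule, "closest" is measured between horizontal bounding-box centres, and the stroke dictionaries with `top`, `bottom`, `left`, `right` and `area` keys are a hypothetical representation.

```python
def find_baseline(img):
    """Row with the largest horizontal projection (most ink pixels)."""
    projection = [sum(row) for row in img]
    return max(range(len(projection)), key=projection.__getitem__)

def classify_strokes(strokes, baseline, t_low, t_high):
    """Split strokes into main, upper additional and lower additional."""
    main, upper, lower = [], [], []
    for s in strokes:
        crosses = s["top"] <= baseline <= s["bottom"]
        # Assumption: satisfying either rule makes a stroke a main stroke.
        if (crosses and s["area"] >= t_low) or s["area"] > t_high:
            main.append(s)
        elif (s["top"] + s["bottom"]) / 2 < baseline:
            upper.append(s)  # rows grow downward, so a smaller row is higher
        else:
            lower.append(s)
    return main, upper, lower

def assign_to_subwords(main, additional):
    """Attach each additional stroke to the horizontally nearest main stroke."""
    def centre_x(s):
        return (s["left"] + s["right"]) / 2

    subwords = [{"main": m, "additional": []} for m in main]
    for a in additional:
        nearest = min(subwords,
                      key=lambda sw: abs(centre_x(sw["main"]) - centre_x(a)))
        nearest["additional"].append(a)
    return subwords
```

The thresholds `t_low` and `t_high` play the roles of T_L and T_H; the paper does not give their values, so they are left as parameters.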
In the following step, we remove false points based on rules derived from the characteristics of the Uighur language. Here, we combine two simple methods to detect the high points: Harris corner detection and local high point location. Together, these two methods describe the high points in a handwritten Uighur word and provide important information for finding the segmentation points. Figure 5 shows the high point detection results, which indicate that the detection method is effective.
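The local high point part of this step can be sketched in Python as below; the Harris corner detector is not reproduced here. The sketch assumes that high points are local maxima of the upper ink profile (one height per column) and that one candidate cut is placed at the lowest profile column between each pair of consecutive high points. Both choices are illustrative assumptions, not details given in the paper.

```python
def column_heights(img):
    """Height of the topmost ink pixel in each column (0 for empty columns)."""
    h = len(img)
    heights = []
    for x in range(len(img[0])):
        top = next((y for y in range(h) if img[y][x] == 1), None)
        heights.append(0 if top is None else h - top)
    return heights

def local_high_points(heights):
    """Indices of local maxima; a flat top is reported at its first column."""
    peaks = []
    n = len(heights)
    i = 1
    while i < n - 1:
        if heights[i] > heights[i - 1]:
            j = i
            # Walk across a possible plateau of equal heights.
            while j < n - 1 and heights[j + 1] == heights[i]:
                j += 1
            if j < n - 1 and heights[j + 1] < heights[i]:
                peaks.append(i)
            i = j + 1
        else:
            i += 1
    return peaks

def candidate_cuts(heights):
    """One candidate cut at the lowest column between consecutive high points."""
    peaks = local_high_points(heights)
    cuts = []
    for a, b in zip(peaks, peaks[1:]):
        cuts.append(min(range(a + 1, b), key=heights.__getitem__))
    return peaks, cuts
```

The candidates produced this way deliberately over-segment the word; the removal rules of the next step are what prune them.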

4) Removal of wrong segmentation points and adjustment

After the potential segmentation points are obtained, an important post-processing step removes wrong points and adjusts the segmentation lines based on the additional strokes. We define three rules, based on the features of the Uighur language, for removing wrong points:

- Remove the points that do not belong to the baseline area.
- Remove the points that pass through a closed area.
- Remove the points whose continuity is not strong.

The first rule is easy to understand: as pointed out above, the segmentation points in a Uighur word always belong to the baseline area. This rule effectively removes some wrong points, as Figure 6(a) shows, where the green line is a wrong segmentation line that has been removed. The second rule addresses wrong points that lie between two high points but are located on a closed loop and are therefore not correct segmentation points; these must also be removed (see Figure 6(b)). The last rule removes noise points whose continuity is poor.

Besides removing wrong segmentation points, our algorithm also adjusts the remaining points to more precise locations based on the additional strokes. In a Uighur word, every additional stroke must be contained within a single character and cannot be split into two parts. Hence, if a segmentation line passes through an additional stroke, it must be adjusted. We identify where the center of gravity of the stroke lies and then decide how to adjust the segmentation point (see Figure 6(c), where the green line has been adjusted to the position of the red line).
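The first two removal rules and the adjustment can be sketched as below. The sketch rests on several assumptions that the paper does not specify: a cut is a column index, the baseline area is a band of `tol` rows around the baseline, a closed area is background that cannot be reached from the image border, and a cut that crosses an additional stroke is moved to the side opposite the stroke's center of gravity. The continuity rule is omitted because its criterion is not defined in the text.

```python
from collections import deque

def hole_mask(img):
    """True for background pixels that are not reachable from the border."""
    h, w = len(img), len(img[0])
    outside = [[False] * w for _ in range(h)]
    queue = deque()
    for y in range(h):
        for x in range(w):
            on_border = y in (0, h - 1) or x in (0, w - 1)
            if on_border and img[y][x] == 0:
                outside[y][x] = True
                queue.append((y, x))
    while queue:
        y, x = queue.popleft()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if (0 <= ny < h and 0 <= nx < w
                    and img[ny][nx] == 0 and not outside[ny][nx]):
                outside[ny][nx] = True
                queue.append((ny, nx))
    return [[img[y][x] == 0 and not outside[y][x] for x in range(w)]
            for y in range(h)]

def filter_cuts(img, cuts, baseline, tol):
    """Keep cuts whose ink stays in the baseline band and that cross no hole."""
    holes = hole_mask(img)
    rows = range(len(img))
    kept = []
    for x in cuts:
        ink_rows = [y for y in rows if img[y][x] == 1]
        near_baseline = bool(ink_rows) and all(
            abs(y - baseline) <= tol for y in ink_rows)
        through_hole = any(holes[y][x] for y in rows)
        if near_baseline and not through_hole:
            kept.append(x)
    return kept

def adjust_cut(x, additional_strokes):
    """Move a cut out of an additional stroke, away from its center of gravity."""
    for s in additional_strokes:
        if s["left"] <= x <= s["right"]:
            cx = sum(px for _, px in s["pixels"]) / len(s["pixels"])
            return s["right"] + 1 if cx <= x else s["left"] - 1
    return x
```

Background pixels are flood-filled with 4-connectivity so that a loop drawn with diagonally touching ink pixels still encloses its interior.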

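Figure 5. High point detection. The red circles display the detection result.

Figure 6. Removal and adjustment: (a) removal of a point outside the baseline area; (b) removal of a point on a closed area; (c) adjustment according to an additional stroke.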

C. Output

After the adjustment, the remaining segmentation lines give the segmentation locations, and we separate the Uighur word into characters along these lines. These characters are sent to the character recognition model (see Figure 7). We also associate the following information with each character detected during segmentation:

- The location of the character in the whole Uighur word. The shape of a Uighur character depends on its position in the word, so this information can assist character recognition.
- Whether the character has additional strokes and how many, including both the upper and the lower additional strokes.

Figure 7 shows the character segmentation results of our algorithm. The first row shows the locations of the segmentation lines, and the second row presents every character separated from the input handwritten Uighur word. These characters are the input of the recognition model in the whole Uighur word recognition system.
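The per-character information described above could be carried in a small record such as the following Python sketch. The field names, the half-open column spans and the rule that an additional stroke belongs to the span containing its horizontal centre are hypothetical; the writing order relies only on the fact that Uighur is written from right to left.

```python
from dataclasses import dataclass

@dataclass
class SegmentedCharacter:
    left: int           # first column of the character (inclusive)
    right: int          # end column of the character (exclusive)
    order: int          # 0 = first written character (rightmost span)
    total: int          # number of characters in the word
    upper_strokes: int  # upper additional strokes inside the span
    lower_strokes: int  # lower additional strokes inside the span

def _count_inside(strokes, lo, hi):
    """Count strokes whose horizontal bounding-box centre falls in [lo, hi)."""
    return sum(1 for s in strokes if lo <= (s["left"] + s["right"]) / 2 < hi)

def build_characters(width, cuts, upper, lower):
    """Turn cut columns into character records, listed from left to right."""
    bounds = [0] + sorted(cuts) + [width]
    spans = list(zip(bounds, bounds[1:]))
    total = len(spans)
    return [
        SegmentedCharacter(lo, hi, total - 1 - i, total,
                           _count_inside(upper, lo, hi),
                           _count_inside(lower, lo, hi))
        for i, (lo, hi) in enumerate(spans)
    ]
```

A recognizer could use `order` and `total` to tell whether a character stands at the beginning, middle or end of the word, which is the positional hint the text refers to.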

Figure 7. Character segmentation result

III. EXPERIMENTS AND PERFORMANCE EVALUATION

There are usually two ways to evaluate a character segmentation algorithm. The first assesses the segmentation algorithm itself against the labelled true values of each word. The second evaluates the segmentation algorithm by recognizing the separated characters. The second assessment measures the overall performance of the two modules, segmentation and recognition, rather than that of the character segmentation algorithm alone. In this paper, we therefore adopt the first way and assess the segmentation algorithm separately. We calculate four indexes, as described in part B.

A. Database and ground truth labelling

We have collected 1,500 handwritten Uighur word samples on a mobile phone platform. All the Uighur words are freely written, so that the samples are as broad and representative as possible. Figure 3 shows some samples from our database. Since we adopt the first way and evaluate the segmentation method only, we need the true values of the Uighur database. With the word recognition objective in mind, the software we developed records four types of information: first, the correct location of every Uighur character; second, the meaning of the word; third, the number of characters; and fourth, the type of every character. In this paper we evaluate the segmentation performance, so only the location of every character is used. The Uighur input system on the intelligent mobile platform and the medium-scale handwritten Uighur word database were developed together.

B. Performance evaluation rules

We first identify which segmentation lines are correct, using the distance between a detected line and a ground-truth line as the criterion: if the distance is small enough, the two lines are considered a matched pair. Then, four conventional indexes [15] are used to evaluate the segmentation performance:

Detection Rate (DR) = TP / (TP + FN)
False Negative Rate (FNR) = FN / (FN + TP)
False Alarm Rate (FAR) = FP / (TP + FP)
Correct Rate (CR) = TP / (FP + TP)

Here TP, TN, FP and FN denote the numbers of true positive, true negative, false positive and false negative segmentation lines, respectively. This evaluation method is simple and intuitive, and it effectively evaluates the algorithm in terms of detection rate and false alarm rate.

Figure 8 shows some character segmentation results of our method. The black lines give the correct location of every character, and the red lines are the results of the proposed method. To make the comparison between the algorithm result and the truth labelling visible, the red lines are drawn only in the top part. The segmentation results are satisfactory: through the analysis of Uighur words and characters, the proposed method copes with the difficulties of segmentation.

Figure 9 shows the overall evaluation of the method. The test database includes 1,500 images, and these Uighur words contain 8,592 segmentation lines. Our segmentation algorithm outputs 9,816 lines, of which 8,056 are correct. The resulting four indexes are shown in Figure 9.

IV. CONCLUSIONS

Uighur word segmentation is the foundation of, and an essential step in, intelligent Uighur document information processing. In this paper, we present a novel algorithm that segments complex handwritten Uighur characters effectively and efficiently, and we also give a performance evaluation criterion for word segmentation. Moreover, to evaluate the performance of the proposed algorithm, we develop a mobile-phone-based handwritten Uighur word capture system and create a medium-scale handwritten Uighur word database with 1,500 word samples and manually labeled ground truth. Experiments on our dataset with quantitative analysis demonstrate the effectiveness of the approach. Future work will focus on integrating online handwriting information into the current algorithm and on developing a Uighur word recognition system on the mobile phone.
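The four indexes of the performance evaluation can be recomputed from the counts reported for the test database (8,592 ground-truth segmentation lines, 9,816 detected lines, 8,056 of them correct). The Python sketch below assumes that every correct line is a true positive, so that FN = 8592 - 8056 and FP = 9816 - 8056. The greedy nearest-line matching and its pixel tolerance `tol` are hypothetical, since the paper only states that two lines match when they are close enough.

```python
def match_lines(truth, detected, tol):
    """Greedy one-to-one matching of detected cut columns to ground truth.

    Returns (TP, FP, FN). A detected line is a true positive when the
    nearest unmatched ground-truth line is at most tol columns away.
    """
    unmatched = sorted(truth)
    tp = 0
    for d in sorted(detected):
        best = min(unmatched, key=lambda t: abs(t - d), default=None)
        if best is not None and abs(best - d) <= tol:
            unmatched.remove(best)
            tp += 1
    return tp, len(detected) - tp, len(truth) - tp

def indexes(tp, fp, fn):
    """The four indexes exactly as defined in the evaluation rules."""
    return {
        "DR": tp / (tp + fn),
        "FNR": fn / (fn + tp),
        "FAR": fp / (tp + fp),
        "CR": tp / (fp + tp),
    }

# Counts derived from the reported figures (see the assumption above).
reported = indexes(tp=8056, fp=9816 - 8056, fn=8592 - 8056)
```

Under that assumption the reported counts give DR of about 93.8%, FNR of about 6.2%, FAR of about 17.9% and CR of about 82.1%. As defined, DR and FNR sum to one, as do FAR and CR.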

Figure 8. Character segmentation result (red line) and true value of the handwritten Uighur word (black line)

Figure 9. Performance evaluation of Uighur word segmentation

ACKNOWLEDGMENT

This work is supported by the Natural Science Basic Research Plan in Shaanxi Province of China (Program No. 2009JQ8019), the Basic Science Research Fund in Xidian University (Program No. K50510010007), the Natural Science Foundation of China (Program No. 60872141), and the Foundation of State Key Laboratory of Integrated Services Networks (Program No. ISN090302).

REFERENCES

[1] J. Sternby, J. Morwing, J. Andersson, C. Friberg. On-line Arabic handwriting recognition with templates. Pattern Recognition, pp. 3278-3286, 2009.
[2] M. S. Khorsheed. Off-Line Arabic Character Recognition: A Review. Pattern Analysis & Applications, May 2002.
[3] F. Faradji, K. Faez, M. H. Mousavi. An HMM-based online recognition system for Farsi handwritten words. International Conference on Intelligent and Advanced Systems, 2007.
[4] R. Azmi, E. Kabir. A new segmentation technique for omni-font Farsi text. Pattern Recognition Letters, pp. 97-104, 2001.
[5] A. M. Zeki. The segmentation problem in Arabic character recognition: the state of the art. First International Conference on Information and Communication Technologies, August 2005.
[6] L. Lorigo, V. Govindaraju. Segmentation and pre-recognition of Arabic handwriting. In Proceedings of the International Conference on Document Analysis and Recognition, pp. 605-609, 2005.
[7] S. Wshah, Z. X. Shi, V. Govindaraju. Segmentation of Arabic handwriting based on both contour and skeleton segmentation. 10th International Conference on Document Analysis and Recognition.
[8] L. Lorigo, V. Govindaraju. Off-line Arabic handwriting recognition: a survey. Department of Computer Science and Engineering, University at Buffalo, Technical Report, 2005.
[9] S. Touj, N. Essoukri B. Amara, H. Amiri. Two approaches for Arabic script recognition-based segmentation using the Hough Transform. International Conference on Document Analysis and Recognition, pp. 654-658, 2007.
[10] M. Mostafa. An Adaptive Algorithm for the Automatic Segmentation of Printed Arabic Text. 17th National Computer Conference, Madinah, Saudi Arabia, pp. 437-444, 2004.
[11] L. Zheng, A. H. Hassin, Z. Tang. A new algorithm for machine printed Arabic character segmentation. Pattern Recognition Letters, 25(15), pp. 1723-1729, 2004.
[12] T. Sari, L. Souici, M. Sellami. Off-line Handwritten Arabic Character Segmentation and Recognition System. 8th International Workshop on Frontiers in Handwriting Recognition, pp. 452-457, 2002.
[13] A. A. Mohamed, B. J. Kasmiran, A. S. Salina. Features extraction method for Arabic characters based on pixel orientation technique. 5th WSEAS Conf. on Computational Intelligence, Man-Machine Systems and Cybernetics, Venice, Italy, November 2006.
[14] B. Al-Badr, R. Haralick. A segmentation-free approach to text recognition with application to Arabic text. International Journal on Document Analysis and Recognition, 1(3), pp. 147-166, 1998.
[15] F. Bashir, F. Porikli. Performance evaluation of object detection and tracking systems. IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, June 2006.