
LATENT SUPPORT VECTOR MACHINE FOR SIGN LANGUAGE RECOGNITION WITH KINECT

Chao Sun 1,2, Tianzhu Zhang 1,2, Bing-Kun Bao 1,2, Changsheng Xu 1,2
1 National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China
2 China-Singapore Institute of Digital Media, Singapore
ABSTRACT

In this paper, we propose a novel algorithm to model and recognize sign language with the Kinect sensor. We assume that in a sign language video some frames are expected to be both discriminative and representative. Under this assumption, each frame in a training video is assigned a binary latent variable indicating its discriminative capability. A Latent Support Vector Machine model is then developed to classify the signs and, at the same time, localize the discriminative and representative frames in the videos. In addition, we utilize the depth map together with the color image captured by the Kinect sensor to obtain more effective and accurate features, which further improve recognition accuracy. To evaluate our approach, we collected an American Sign Language (ASL) dataset of approximately 2000 phrases, where each phrase was captured by the Kinect sensor and hence includes color, depth, and skeleton information. Experiments on our dataset demonstrate the effectiveness of the proposed method for sign language recognition.

Index Terms: Sign Language Recognition, Latent SVM, Kinect sensor.

1. INTRODUCTION

Sign language is a visual language conveyed through hand and arm movements as well as facial expressions and lip motions. It is one of the most natural means of exchanging information for deaf and hearing-impaired people. The goal of sign language recognition is to translate sign language into text efficiently and accurately. Currently, automatic sign language recognition is still in its infancy, roughly decades behind automatic speech recognition [1], and it is undergoing a gradual transition from isolated to continuous recognition for small-vocabulary tasks. There are two challenges in sign language recognition. One is how to efficiently capture useful information from signers, and the other is how to model different signs and measure their similarities for recognition.

To efficiently capture useful information from signers, different kinds of sensors have been explored, ranging from tracking systems, such as data gloves [2], to computer vision techniques using cameras [3] and motion capture systems [4]. Until recently, commercially available depth camera systems were expensive, and only a few researchers have used depth information to recognize hand pose. Fortunately, the release of the Microsoft Kinect sensor has provided a low-cost, off-the-shelf depth sensor. The combination of the depth map and the color image from the Kinect sensor can contribute greatly to sign language recognition. In this paper, we utilize the Kinect sensor to track hand positions and derive novel 3-D features, which are quite useful and hence improve recognition performance.

To model varying signs, many kinds of models have been developed and applied. Yang et al. [5] used a Time Delay Neural Network to recognize American Sign Language. Vogler et al. [6] utilized Hidden Markov Models to perform word-level sign language recognition, whose performance relies heavily on the state selection of the HMM. Alternatively, many approaches treated the problem as gesture recognition and mainly focused on robust extraction of manual features or statistical modeling of signs; models such as [7] [8] have been introduced for this purpose. Recently, the Latent Support Vector Machine (SVM) formalism has been successfully applied to many tasks [9] [10] [11]. An advantage of latent SVM and its variants is that they allow for weak supervision of latent parts within an element to be recognized. This characteristic is also suitable for sign language recognition.
After observing sign language videos, we noticed that, within a video, some frames are more discriminative and representative than others. A viewer can recognize a sign language video correctly using only those discriminative and representative frames, regardless of the rest. As shown in Fig. 1, each sign language video contains a sequence of frames. However, many frames from different videos look very similar; only a fraction of the frames in each video are specific to the sign. These specific frames are discriminative and representative for each video, and people can recognize a video correctly from these frames alone. Inspired by this observation, we introduce the Latent SVM model into sign language recognition. The discriminative frames are treated as latent variables in this model. After model learning and inference, we can correctly recognize the sign of a video and find the most discriminative and representative frames within it.

Fig. 1. Examples of frames in sign language videos. The leftmost column lists the names of the signs, followed by frame sequences sampled from the corresponding videos. Images in red bounding boxes are discriminative frames.

The contributions of this work are threefold. (1) We collect a large dataset with ground-truth labels for research on sign language recognition. (2) We adopt the Latent Support Vector Machine for sign language modeling, which can localize the representative frames in a video and classify the sign simultaneously. (3) By using the information from the Kinect sensor, our proposed method improves recognition performance significantly.

The rest of the paper is organized as follows. In Section 2 we elaborate our sign language recognition approach based on Latent SVM. Experimental results are shown in Section 3. We conclude the paper in Section 4.

2. SIGN LANGUAGE RECOGNITION VIA LATENT SVM

Our task is to develop a learning framework for sign language recognition in videos containing signs. The model should recognize sign language videos accurately and, in addition, find the frames within each video that are discriminative and representative of the sign. In this section, we elaborate our frame-based latent SVM model.

2.1. Frame-based Latent SVM

Latent SVM [12] provides a framework in which we can treat the desired state values as latent variables and incorporate different correlations into potential functions in a discriminative manner. In our work, the desired state is the discriminative capability of each frame in a video. Three types of potential functions are formulated to encode the latent variables representing the frames into a unified learning framework. The best configurations of the latent variables for all frames are found by optimization, and the sign language videos are classified.

More formally, each sign language video x is to be classified with a semantic label y, where y \in \{1, 2, \ldots, L\} and L is the number of sign words in our dataset. Each video consists of a set of frames, whose visual feature vectors are denoted x_i, i \in \{1, 2, \ldots, N\}. For each frame, the discriminative capability is encoded in a latent variable z_i \in Z = \{0, 1\}, where z_i = 1 means that the i-th frame is discriminative and should be representative for sign language recognition, and z_i = 0 otherwise. Therefore, z = \{z_1, z_2, \ldots, z_N\} specifies the discriminative frames within each training video. In the following we introduce how to incorporate z into the proposed model and how to infer it along with model parameter learning.

The goal is to learn a discriminative function f_w over a sign language video x and its label y, where w denotes the model parameters. We use f_w(x, y) to indicate the compatibility among the visual features x, the sign label y, and the latent variables z. To score a video x with a class label y under a latent variable configuration z, we take f_w(x, y) = \max_{z} w^{\top}\Phi(x, z, y), where w^{\top}\Phi(x, z, y) is defined by combining different potential functions:
w^{\top}\Phi(x, z, y) = \sum_{i=1}^{N} \alpha^{\top}\phi(x_i, z_i) + \sum_{i=1}^{N} \beta^{\top}\psi(z_i, y) + \sum_{(i,j)} \gamma^{\top}\omega(z_i, z_j, y)    (1)

In this form, the parameter vector w is the concatenation of the parameters of all the factors. The model in the above equation simultaneously considers three potential functions, whose details are described in the following.

Unary Potential \alpha^{\top}\phi(x_i, z_i): This potential models the discriminative capability of a frame. Here \phi(x_i, z_i) represents a mapping of the visual feature x_i that depends on the latent variable z_i. The model parameter \alpha encodes the weights for the different latent variable values. Specifically, it is parameterized as

\alpha^{\top}\phi(x_i, z_i) = \sum_{b \in Z} \alpha_b^{\top} \, 1(z_i = b) \, x_i    (2)

where 1(\cdot) is the indicator function.

Unary Potential \beta^{\top}\psi(z_i, y): This potential function models the compatibility between the sign label y and the latent variable z_i, i.e., how likely a sign language video with class label y is to contain a frame with latent variable z_i. It is defined as

\beta^{\top}\psi(z_i, y) = \sum_{a \in L} \sum_{b \in Z} \beta_{a,b} \, 1(y = a) \, 1(z_i = b)    (3)

The parameter \beta_{a,b} measures the compatibility between y = a and z_i = b. After model learning, we select the latent variable value z^{*}_{y} for label y as the latent discriminative label according to \beta_{a,b}, i.e., z^{*}_{y} = \arg\max_{b \in Z} \beta_{y,b}. Frames labeled with the latent variable value z^{*}_{y} are treated as discriminative and representative ones.


Pairwise Potential \gamma^{\top}\omega(z_i, z_j, y): Intuitively, keyframes within the same video should have similar discriminative capability, so the latent variables of those keyframes are dependent. Hence, we assume that there are certain constraints between some pairs of latent variables (z_i, z_j). This pairwise potential function models the compatibility between the class label y and the dependence of the latent variables z_i and z_j, i.e., how likely a video with class label y is to contain a pair of frames with latent variables z_i and z_j. It is defined as

\gamma^{\top}\omega(z_i, z_j, y) = \sum_{a \in L} \sum_{b \in Z} \sum_{c \in Z} \gamma_{a,b,c} \, 1(y = a) \, 1(z_i = b) \, 1(z_j = c)    (4)

where the model parameter \gamma_{a,b,c} denotes the compatibility between the class label y = a and the latent variable configuration z_i = b, z_j = c.
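To make the potential functions above concrete, the following is a minimal sketch of how the joint score w^T Phi(x, z, y) of Eqs. (1)-(4) could be evaluated for a fixed latent assignment z. All names, array shapes, and the choice of pairing adjacent frames in the pairwise term are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def joint_score(frames, z, y, alpha, beta, gamma, pairs=None):
    """Evaluate w^T Phi(x, z, y) for one video under a fixed latent assignment.

    frames : (N, D) array of per-frame visual features x_i          (assumed)
    z      : length-N sequence of latent labels in {0, 1}
    y      : integer sign label in {0, ..., L-1}
    alpha  : (2, D) weights for the unary data term, Eq. (2)
    beta   : (L, 2) label/latent compatibility weights, Eq. (3)
    gamma  : (L, 2, 2) pairwise compatibility weights, Eq. (4)
    pairs  : list of (i, j) index pairs; adjacent frames by default  (assumed)
    """
    N = len(frames)
    if pairs is None:
        pairs = [(i, i + 1) for i in range(N - 1)]

    # Unary data term, Eq. (2): the row of alpha selected by z_i scores x_i.
    score = sum(alpha[z[i]].dot(frames[i]) for i in range(N))

    # Unary label term, Eq. (3): compatibility between the sign label and z_i.
    score += sum(beta[y, z[i]] for i in range(N))

    # Pairwise term, Eq. (4): compatibility of (z_i, z_j) under label y.
    score += sum(gamma[y, z[i], z[j]] for i, j in pairs)
    return score
```

In such a sketch the indicator sums of Eqs. (2)-(4) collapse into simple table lookups, which is how factors of this kind are typically evaluated in practice.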

2.2. Model Learning

Let (x^{(i)}, y^{(i)}), i = 1, 2, \ldots, K, be a set of K training videos. Our target is to learn the model parameter w that discriminates the correct sign label y. Here the discriminative latent variables are unobserved and will be inferred automatically along with model learning. We adopt the latent SVM formulation [12][13] to learn the model as follows:

\min_{w, \xi \ge 0} \; \frac{1}{2}\|w\|^{2} + C_1 \sum_{i=1}^{K} \xi_i
\text{s.t.} \;\; \max_{z} w^{\top}\Phi(x^{(i)}, z, y^{(i)}) - \max_{z} w^{\top}\Phi(x^{(i)}, z, y) \ge \Delta_{0/1}(y, y^{(i)}) - \xi_i, \quad \forall i, \; \forall y \in L    (5)

where C_1 is the trade-off parameter similar to that in SVMs, and \xi_i is the slack variable for the i-th training example, used to handle the soft margin. This objective requires that the score for the ground-truth label y^{(i)} be much higher than the scores for the other labels. The difference is recorded in a 0-1 loss function \Delta_{0/1}(y, y^{(i)}):

\Delta_{0/1}(y, y^{(i)}) = \begin{cases} 1 & \text{if } y \ne y^{(i)} \\ 0 & \text{otherwise} \end{cases}    (6)

The constrained optimization problem in Eq. (5) can be equivalently written as an unconstrained problem:

\min_{w} L(w) = \frac{1}{2}\|w\|^{2} + C_1 \sum_{i=1}^{K} G_i(w), \quad G_i(w) = \max_{y}\big(\Delta_{0/1}(y, y^{(i)}) + \max_{z} w^{\top}\Phi(x^{(i)}, z, y)\big) - \max_{z} w^{\top}\Phi(x^{(i)}, z, y^{(i)})    (7)

We use the non-convex bundle optimization of [14] to solve Eq. (7); that is, the algorithm iteratively builds an increasingly accurate piecewise quadratic approximation of L(w) based on its subgradient \partial_w L(w). The key issue is to compute the subgradients \partial_w L(w). We define

z^{(i)}_{y} = \arg\max_{z} w^{\top}\Phi(x^{(i)}, z, y), \quad \forall i, \; \forall y \in L,
\hat{z}^{(i)} = \arg\max_{z} w^{\top}\Phi(x^{(i)}, z, y^{(i)}),
\hat{y}^{(i)} = \arg\max_{y} \big(\Delta_{0/1}(y, y^{(i)}) + \max_{z} w^{\top}\Phi(x^{(i)}, z, y)\big)    (8)

to compute \partial_w L(w). Using the algorithm of [14], we can optimize Eq. (5) and output the optimal model parameter w^{*}. During each iteration, we can also infer the latent variables z as follows:

z^{(i)*} = \arg\max_{z} w^{\top}\Phi(x^{(i)}, z, y^{(i)})    (9)

This is a standard max-inference problem, and we use loopy belief propagation [15] to solve it approximately.
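The paper optimizes Eq. (7) with the non-convex bundle method of [14]. Purely as a hedged illustration of how the quantities in Eqs. (6)-(9) interact, the sketch below performs a single plain subgradient step instead; the callables score and phi, the exhaustive enumeration of latent configurations, and the learning rate are assumptions for clarity, not the authors' implementation.

```python
import itertools
import numpy as np

def subgradient_step(videos, labels, w, score, phi, n_labels, C1=1.0, lr=1e-3):
    """One plain subgradient update for the unconstrained objective, Eq. (7).

    score(x, z, y, w) -> float  evaluates w^T Phi(x, z, y) (e.g. the earlier sketch)
    phi(x, z, y)      -> array  returns the joint feature vector Phi(x, z, y)
    Exhaustive search over z is used only for clarity; it is exponential in the
    number of frames, so real implementations use approximate inference instead.
    """
    grad = w.copy()                                   # gradient of (1/2)||w||^2
    for x, y_true in zip(videos, labels):
        n_frames = len(x)
        configs = list(itertools.product([0, 1], repeat=n_frames))

        # Loss-augmented prediction: y_hat and z_hat from Eq. (8).
        best, y_hat, z_hat = -np.inf, None, None
        for y in range(n_labels):
            loss = 0.0 if y == y_true else 1.0        # 0-1 loss, Eq. (6)
            for z in configs:
                s = loss + score(x, z, y, w)
                if s > best:
                    best, y_hat, z_hat = s, y, z

        # Best latent assignment for the ground-truth label, Eq. (9).
        z_true = max(configs, key=lambda z: score(x, z, y_true, w))

        # Subgradient of G_i(w): Phi at the violator minus Phi at the truth.
        grad += C1 * (phi(x, z_hat, y_hat) - phi(x, z_true, y_true))

    return w - lr * grad
```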

2.3. Recognition

Given the learned model parameter w^{*}, we can directly apply it to perform sign language recognition on a test video x_t. The procedure scores the sign language video and provides the discriminative frames within it. The class label y^{*} and the latent keyframes z^{*} are obtained as follows:

(y^{*}, z^{*}) = \arg\max_{y} \big\{ \max_{z} w^{*\top}\Phi(x_t, z, y) \big\}    (10)
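A minimal sketch of the decoding in Eq. (10) follows. The score callable, the label count, and the coordinate-ascent inner loop are illustrative assumptions; the paper performs the maximization over z with loopy belief propagation [15], whereas coordinate ascent only finds a local optimum.

```python
import numpy as np

def recognize(x_t, w, score, n_labels, n_sweeps=5):
    """Predict (y*, z*) for a test video, Eq. (10), by coordinate ascent over z.

    score(x, z, y, w) -> float evaluates w^T Phi(x, z, y).
    """
    n_frames = len(x_t)
    best_score, best_y, best_z = -np.inf, None, None

    for y in range(n_labels):
        z = [0] * n_frames                  # start from "non-discriminative" everywhere
        for _ in range(n_sweeps):
            changed = False
            for i in range(n_frames):
                # Flip z_i to whichever value gives the higher joint score.
                cand = [score(x_t, z[:i] + [b] + z[i + 1:], y, w) for b in (0, 1)]
                b_best = int(np.argmax(cand))
                if b_best != z[i]:
                    z[i], changed = b_best, True
            if not changed:
                break
        s = score(x_t, z, y, w)
        if s > best_score:
            best_score, best_y, best_z = s, y, list(z)

    return best_y, best_z
```

The frames for which the returned z equals the discriminative value then serve as the selected keyframes of the test video.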

3. EXPERIMENTS

In this section, we first introduce our self-built sign language dataset and then conduct recognition experiments on it to validate the efficiency and effectiveness of our proposed method.

3.1. Kinect Sign Language Dataset

Currently, there is no public Kinect sign language dataset. The existing public sign language datasets are entirely based on 2D cameras; they lack depth information and thus cannot be used to evaluate the proposed method. We therefore built a Kinect sign language dataset ourselves. Our dataset includes 73 American Sign Language signs, each corresponding to a vocabulary item, as illustrated in Fig. 1. We recruited nine participants, each of whom stood in front of the Kinect sensor and performed all the signs three times. A total of 1971 phrases were collected, each of which includes a set of color images, a set of depth maps, and a set of skeleton information.
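For concreteness, one way such recorded material could be organized in code is sketched below; the class name, field names, and shapes are purely hypothetical, since the paper does not specify a storage format.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class KinectPhrase:
    """One recorded phrase: synchronized color, depth, and skeleton streams (hypothetical layout)."""
    sign_label: int                  # index into the 73 ASL signs
    signer_id: int                   # one of the nine participants
    color_frames: List[np.ndarray]   # per-frame H x W x 3 color images
    depth_maps: List[np.ndarray]     # per-frame H x W depth maps
    skeletons: List[np.ndarray]      # per-frame joint positions, e.g. (J, 3)
```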


Table 1. Comparison of different methods on our dataset. The percentages shown are the average accuracies per video.

Methods                  Mean Accuracy   Feature
Global BoW SVM [17]      65.6%           HOG
LSC [18]                 74.1%           HOG
Our model: Latent SVM    82.3%           HOG
Global BoW SVM [17]      70.2%           HOG + Kinect
LSC [18]                 79.1%           HOG + Kinect
Our model: Latent SVM    86.0%           HOG + Kinect

Fig. 2. An illustration of each output from the Kinect sensor.

3.2. Features

In our experiments, we adopt two kinds of features: HOG features and Kinect features. The HOG features describe appearance information. Based on the output of the Kinect, we can locate the hands and obtain their shape and motion features; in addition, we can estimate the body pose from the Kinect output. An illustration of the color image, depth map, and skeleton information is shown in Fig. 2.

HOG Features: To describe appearance information, we adopt HOG [16]. Based on the output of the Kinect, it is easy to obtain the mask image, crop the foreground, and extract the HOG features.

Kinect Features: The Kinect features include body pose, hand shape, and hand motion features. The body pose features are a combination of unit vectors, joint angles, and distances between skeleton points, e.g., elbows, shoulders, wrists, and hands. The hand shape features are HOG features extracted from a hand patch in every color frame. The hand motion features are optical flow features computed between adjacent frames.
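As a hedged illustration of how a per-frame body pose vector along the lines just described might be assembled from the Kinect skeleton, the sketch below combines segment unit vectors, an elbow angle, and inter-joint distances; the joint names, the specific angle, and the distance pairs are assumptions, since the paper does not enumerate them.

```python
import numpy as np

def body_pose_features(joints):
    """Per-frame body pose features from Kinect skeleton joints.

    joints: dict mapping joint names to 3-D positions, e.g.
            {"shoulder_l": ..., "elbow_l": ..., "wrist_l": ..., "hand_l": ..., ...}
    The joint set, the single elbow angle, and the distance pairs are illustrative.
    """
    def unit(v):
        n = np.linalg.norm(v)
        return v / n if n > 0 else v

    feats = []
    # Unit vectors of the upper- and lower-arm segments (left arm shown).
    feats.append(unit(joints["elbow_l"] - joints["shoulder_l"]))
    feats.append(unit(joints["wrist_l"] - joints["elbow_l"]))

    # Elbow joint angle from the two segment directions.
    cos_a = np.clip(np.dot(feats[0], feats[1]), -1.0, 1.0)
    feats.append(np.array([np.arccos(cos_a)]))

    # Distances between selected skeleton points.
    for a, b in [("hand_l", "hand_r"), ("hand_l", "shoulder_l"), ("hand_r", "shoulder_r")]:
        feats.append(np.array([np.linalg.norm(joints[a] - joints[b])]))

    return np.concatenate(feats)
```

The hand-shape HOG and optical-flow components described above would be concatenated with such a vector in the same spirit.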

3.3. Recognition Results

To evaluate our Latent SVM model, we implemented it according to the proposed algorithm. We also conducted comparison experiments with two baselines. The first is the method proposed in [17]: we extracted a global bag-of-words (BoW) feature for each whole sign language video and then trained a multi-class linear SVM classifier. The second follows [18], the local soft-assignment coding (LSC) method. In addition, to demonstrate the benefit of the Kinect sensor, we designed two settings. In the first, all methods use only the HOG features of the color images. In the second, all methods use both the HOG features and the features derived from the depth maps and skeleton information. The recognition accuracies are shown in Table 1.

From the results we can infer the following. (1) Whether or not the novel features from the Kinect are used, our Latent SVM outperforms all other methods, which demonstrates the effectiveness of our model. (2) When the novel features from the Kinect are additionally used, both baselines and our model outperform their counterparts that use only the HOG features of the color images. This shows that features from the Kinect sensor improve recognition performance and, consequently, that the Kinect sensor is well suited to sign language recognition. In addition, our model can find the discriminative and representative frames of each sign language video, indicated by the latent variables. Some illustrative results are shown in Fig. 3.

Fig. 3. Illustrative results of selected discriminative frames.

4. CONCLUSIONS

We presented a Latent SVM model for sign language recognition. This model can effectively recognize sign language videos and, at the same time, find the discriminative and representative frames of each video. Moreover, we utilized the Kinect sensor to efficiently capture useful information from signers and hence improve recognition accuracy. Experimental results demonstrated the effectiveness of our method.

5. ACKNOWLEDGEMENT

This work is supported in part by the National Basic Research Program of China (No. 2012CB316304), the National Natural Science Foundation of China (No. 61225009), and the Microsoft Research Asia UR Project. This work is also supported by the Singapore National Research Foundation under its International Research Centre@Singapore Funding Initiative and administered by the IDM Programme Office.


6. REFERENCES

[1] U. von Agris, J. Zieren, U. Canzler, B. Bauer, and K.-F. Kraiss, "Recent developments in visual sign language recognition," Universal Access in the Information Society, vol. 6, no. 4, pp. 323-362, 2008.

[2] R.-H. Liang and M. Ouhyoung, "A real-time continuous gesture recognition system for sign language," in Automatic Face and Gesture Recognition, 1998. Proceedings. Third IEEE International Conference on. IEEE, 1998, pp. 558-567.

[3] T. E. Starner, "Visual recognition of American Sign Language using hidden Markov models," Tech. Rep., DTIC Document, 1995.

[4] J. L. Hernandez-Rebollar, N. Kyriakopoulos, and R. W. Lindeman, "A new instrumented approach for translating American Sign Language into sound and text," in Automatic Face and Gesture Recognition, 2004. Proceedings. Sixth IEEE International Conference on. IEEE, 2004, pp. 547-552.

[5] M.-H. Yang, N. Ahuja, and M. Tabb, "Extraction of 2D motion trajectories and its application to hand gesture recognition," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 24, no. 8, pp. 1061-1074, 2002.

[6] C. Vogler and D. Metaxas, "Adapting hidden Markov models for ASL recognition by using three-dimensional computer vision methods," in Systems, Man, and Cybernetics, 1997. Computational Cybernetics and Simulation., 1997 IEEE International Conference on. IEEE, 1997, vol. 1, pp. 156-161.

[7] Tianzhu Zhang, Changsheng Xu, Guangyu Zhu, Si Liu, and Hanqing Lu, "A generic framework for event detection in various video domains," in Proceedings of the International Conference on Multimedia. ACM, 2010, pp. 103-112.

[8] Tianzhu Zhang, Jing Liu, Si Liu, Changsheng Xu, and Hanqing Lu, "Boosted exemplar learning for action recognition and annotation," Circuits and Systems for Video Technology, IEEE Transactions on, vol. 21, no. 7, pp. 853-866, 2011.

[9] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 32, no. 9, pp. 1627-1645, 2010.

[10] T. Lan, Y. Wang, and G. Mori, "Discriminative figure-centric models for joint action localization and recognition," in Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011, pp. 2003-2010.

[11] B. Yao and L. Fei-Fei, "Modeling mutual context of object and human pose in human-object interaction activities," in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010, pp. 17-24.

[12] P. Felzenszwalb, D. McAllester, and D. Ramanan, "A discriminatively trained, multiscale, deformable part model," in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, 2008, pp. 1-8.

[13] C.-N. J. Yu and T. Joachims, "Learning structural SVMs with latent variables," in Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009, pp. 1169-1176.

[14] T.-M.-T. Do and T. Artières, "Large margin training for hidden Markov models with partially observed states," in Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009, pp. 265-272.

[15] K. P. Murphy, Y. Weiss, and M. I. Jordan, "Loopy belief propagation for approximate inference: An empirical study," in Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc., 1999, pp. 467-475.

[16] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in CVPR, 2005.

[17] H. Wang, M. M. Ullah, A. Klaser, I. Laptev, C. Schmid, et al., "Evaluation of local spatio-temporal features for action recognition," in BMVC 2009 - British Machine Vision Conference, 2009.

[18] L. Liu, L. Wang, and X. Liu, "In defense of soft-assignment coding," in Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011, pp. 2486-2493.
