A supervised approach to support the analysis and the classification of non verbal humans communications

Vitoantonio Bevilacqua (1,2,*), Marco Suma (1), Dario D'Ambruoso (1), Giovanni Mandolino (1), Michele Caccia (1), Simone Tucci (1), Emanuela De Tommaso (1), Giuseppe Mastronardi (1,2)
(1) Dipartimento di Elettrotecnica ed Elettronica, Polytechnic of Bari, Italy
(2) e.B.I.S. s.r.l. (electronic Business in Security), Spin-Off of Polytechnic of Bari, Italy
(*) corresponding author:

Abstract. Background: It is well known that non-verbal communication is sometimes more useful and robust than verbal communication for understanding sincere emotions, through the analysis of spontaneous body gestures and facial expressions acquired from video sequences. At the same time, automatic or semi-automatic procedures that segment a human from a video stream and then extract features for robust supervised classification remain an active area of interest in computer vision and intelligent data analysis. Materials and Methods: We obtained data from four datasets: the first contains 100 images of human silhouettes (or templates) acquired from a video sequence dataset; the second contains 543 images of gestures from a prerecorded video of MotoGP driver Jorge Lorenzo; the third contains 200 images of mouths; and the fourth contains 100 images of noses. The third and fourth datasets contain images acquired with a tool implemented by the authors as well as samples available in public databases in the literature. We used supervised methods to train the proposed classifiers: three different EBP neural network architectures for human templates, mouths and noses, and the J48 algorithm for gestures. Results: On average, we obtained 80% correct classification for the binary classifier of human templates (no false positives), 90% correct classification for the happy/non-happy emotion, 85% for the binary disgust/non-disgust emotion and 80% correct classification for the 4 different gestures.
Keywords: Neural Network, Emotion Recognition, Human Silhouettes, Gesture Recognition, Facial Expression Recognition, Human Detection, Hands, Action Units, Centre of Gravity, Pose Estimation.

1 Introduction
Good communication is the foundation of successful relationships, both personal and professional. But we communicate with much more than words: many studies show that the majority of our communication is nonverbal. Nonverbal communication, or body language, includes facial expressions, gestures, eye contact, posture and even the tone of our voice. Although the details of his theory have evolved substantially since the 1960s, Ekman remains the most vocal proponent of the idea that emotions are discrete entities [1]. Unlike some forms of nonverbal communication, facial expressions are universal. As for gesture recognition, we consider how the way we move communicates a wealth of information to the world. This type of nonverbal communication includes our posture, stance, and subtle movements. Gestures are omnipresent in our daily lives. However, the meaning of gestures can differ greatly across cultures and regions, so care is needed to avoid misinterpretation. Building on these ideas, we want to provide an automatic system that is able to evaluate emotions in particular situations (videoconferences, meetings, neurological examinations, investigations).

2 Materials
Materials for all four datasets have been collected with the goal of increasing the variance of their samples, thereby enriching the information in the training examples needed by the proposed supervised classifiers.

2.1 Human silhouettes
The human silhouettes used in this paper come from people walking in a video stream dataset. The training examples consist of only 20 different binary silhouette images, obtained after a pre-processing phase of background subtraction. With this method, the training examples consist of the several different human silhouettes extracted from each frame.
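The background subtraction step that produces binary silhouettes can be illustrated with a minimal sketch. The paper does not specify the subtraction technique or its parameters, so the per-pixel absolute difference, the function name `subtract_background` and the threshold of 30 gray levels are assumptions made only for illustration.

```python
def subtract_background(frame, background, threshold=30):
    """Return a binary mask: 1 where the frame differs from the background.

    `frame` and `background` are gray-scale images given as lists of rows.
    The threshold value is a hypothetical choice, not taken from the paper.
    """
    return [
        [1 if abs(pixel - bg_pixel) > threshold else 0
         for pixel, bg_pixel in zip(frame_row, bg_row)]
        for frame_row, bg_row in zip(frame, background)
    ]


# Toy 3x4 example: a bright figure appears in front of a dark, static background.
background = [[10, 10, 10, 10],
              [10, 10, 10, 10],
              [10, 10, 10, 10]]
frame = [[10, 200, 10, 10],
         [10, 210, 190, 10],
         [12, 205, 10, 10]]

mask = subtract_background(frame, background)
for row in mask:
    print(row)
```

Small intensity changes (such as the value 12 in the last row) stay below the threshold and are treated as background, which is the reason for thresholding rather than testing for exact equality.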

Fig. 1. a) and b) sample frames; c) 4 different examples of human silhouettes with their differing dimensions and behaviours.

2.2 Facial Expressions
First, we explain the concept of Action Units (AUs): minimal, non-separable facial actions that are the building blocks of facial expressions. Combinations of AUs, with different intensities, generate a facial expression. According to our previous work [2], we can assert that, in general and regardless of other AUs, the presence of AU-10 unequivocally discriminates the disgust emotion, and the presence of AU-12 or AU-13 unequivocally discriminates the happy emotion. For this reason we are able to recognize two of the six primary emotions identified by Paul Ekman: happiness and disgust. To extract the middle and lower parts of the face we used our own tool; moreover, we used public databases of faces [3], from which we extracted our regions of interest.

2.3 Gestures
Each frame of the video has a resolution of 640x480 pixels. The automatic classification of gestures is based on studies by the psychologist David McNeill [4], who divides gestures into four main categories:
- deictic gestures: typical pointing movements, usually emphasized by the movement of the fingers or of other parts of the body that can be used for this purpose;
- iconic gestures: gestures that bear a formal relation to the semantic content of the discourse. They mainly occur in the area occupied by the torso of the subject being filmed;
- metaphoric gestures: gestures that depict figures referring to abstract concepts, such as moods or language. Such gestures are concentrated in the lower part of the torso;
- beat gestures: gestures that can be recognized by focusing only on the characteristics of their movements.
We decided to monitor the movement of the center of gravity (CG) of the hands in each frame, so as to calculate various evaluation parameters, such as the velocity with which gestures are made.
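The CG tracking idea can be sketched as follows. This is only an illustrative sketch: it assumes the hand has already been segmented into a binary mask, takes the CG as the mean position of the foreground pixels, and measures velocity as displacement in pixels per frame. The function names and the toy masks are hypothetical.

```python
import math


def centre_of_gravity(mask):
    """Return the (x, y) centroid of the foreground pixels, or None if empty."""
    sum_x = sum_y = count = 0
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            if value:
                sum_x += x
                sum_y += y
                count += 1
    if count == 0:
        return None
    return (sum_x / count, sum_y / count)


def speed(cg_prev, cg_curr, dt=1.0):
    """Euclidean displacement of the CG per unit time (dt=1.0 means one frame)."""
    return math.hypot(cg_curr[0] - cg_prev[0], cg_curr[1] - cg_prev[1]) / dt


def square_mask(width, height, left, top, size=2):
    """Build a toy binary mask containing one size x size square 'hand'."""
    return [
        [1 if left <= x < left + size and top <= y < top + size else 0
         for x in range(width)]
        for y in range(height)
    ]


# The toy hand moves 3 pixels right and 4 pixels down between two frames.
mask_prev = square_mask(8, 8, left=1, top=1)
mask_curr = square_mask(8, 8, left=4, top=5)

cg_prev = centre_of_gravity(mask_prev)
cg_curr = centre_of_gravity(mask_curr)
print(cg_prev, cg_curr, speed(cg_prev, cg_curr))
```

Dividing by a frame interval `dt` expressed in seconds would give the velocity in pixels per second; the frame rate of the video is not stated in the paper.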

3 Methods
The application of supervised neural networks trained with the Error Back Propagation (EBP) algorithm offers a simpler solution to complex problems such as the correct classification of silhouette shapes, facial expressions and gestures. Advantages of neural networks include their high tolerance to noise as well as their ability to classify patterns not used for training. In particular, we implemented supervised neural network classifiers for silhouettes and for the mouth and nose emotion features, and the J48 classifier for gestures.

3.1 Silhouettes classification
The neural network classifier is a two-layer feed-forward network with 396 inputs (corresponding to the 33*12 pixel size of the smallest figure, chosen to contain the smallest human silhouette), 6 logistic neurons in the first layer and 1 output neuron. The images passed to the neural network have the following characteristics: the height is greater than the width, the ratio between height and width ranges between 1.9 and 4, the height is greater than 33 pixels and the width is greater than 12 pixels. Each image is divided into sub-images, each containing a single human silhouette, which is always resized to 33*12 pixels; this procedure guarantees a constant number of neural network inputs for each classification sample. To achieve good generalization, the training set is selected with large variability. Positive examples cover poses and movements such as people not facing the camera (non-frontal images), people with their arms far from or close to the body, and people who are poorly identified because only one arm is visible; negative examples are objects similar to people, used as counterexamples.

3.2 Facial Expression classification
We built two NNs that work in parallel. The first one receives the shape of the mouth: in happy expressions the mouth should be open, the teeth should be visible and the mouth shape is curved (AU-12, AU-13). The second one receives the nose: in disgust expressions the nasolabial furrows are visible (AU-10).
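As an illustration of the silhouette stage described in Section 3.1, the following minimal sketch shows the geometric pre-selection of candidate regions and a forward pass through a 396-6-1 feed-forward network. Several points are assumptions: the bounds of the height/width ratio are treated as inclusive, the output neuron is taken to be logistic like the hidden ones, and the weights are random placeholders rather than trained values.

```python
import math
import random

N_INPUTS = 33 * 12   # every silhouette is resized to 33x12 pixels
N_HIDDEN = 6         # logistic neurons in the first layer


def is_candidate(width, height):
    """Apply the size and aspect-ratio criteria listed in Section 3.1."""
    if width <= 12 or height <= 33:
        return False
    ratio = height / width
    # A ratio of at least 1.9 already implies that height > width.
    return 1.9 <= ratio <= 4.0


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(pixels, hidden_w, hidden_b, out_w, out_b):
    """Forward pass: 396 inputs -> 6 logistic neurons -> 1 output neuron."""
    hidden = [
        logistic(sum(w * p for w, p in zip(weights, pixels)) + bias)
        for weights, bias in zip(hidden_w, hidden_b)
    ]
    return logistic(sum(w * h for w, h in zip(out_w, hidden)) + out_b)


# Placeholder weights (untrained), only to show the shapes involved.
rng = random.Random(0)
hidden_w = [[rng.uniform(-0.1, 0.1) for _ in range(N_INPUTS)]
            for _ in range(N_HIDDEN)]
hidden_b = [0.0] * N_HIDDEN
out_w = [rng.uniform(-0.1, 0.1) for _ in range(N_HIDDEN)]
out_b = 0.0

# A random binary 33x12 image, flattened to 396 values.
pixels = [rng.choice((0, 1)) for _ in range(N_INPUTS)]
score = forward(pixels, hidden_w, hidden_b, out_w, out_b)

print(is_candidate(20, 60), is_candidate(30, 40), 0.0 < score < 1.0)
```

In a trained network the output score would be compared against a decision threshold to label a region as human or non-human; the threshold used by the authors is not reported.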

[Figure: ROI extraction, gray-scale conversion, 8x8 vectorization; output vectors M of size 50x1]

Fig. 2. Segmentation and vectorization of the face.

Each gray-scale bitmap image is a band of 40x80 pixels, containing the lower and the middle part of the face respectively. To be used as input for the neural network, each band is arranged in an array and then normalized, obtaining a 1x50 vector (a function calculates the mean value of each 8x8 pixel block). Each network returns 0 (zero) when its target expression (happy for the mouth, disgust for the nose) is absent, and 1 otherwise. To train the NN for the mouth, we used a training set of 200 photos, composed of 100 negative and 100 positive examples, for 20000 epochs. This NN has a first layer of 300 neurons, a second layer of 200 neurons, a third layer of 10 neurons and 1 output neuron (300x200x10x1). To train the NN for the nose, we used a training set of 100 examples, composed of 50 negative and 50 positive examples, for 20000 epochs. This NN has a first layer of 400 neurons, a second layer of 80 neurons, a third layer of 10 neurons and 1 output neuron (400x80x10x1).
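The reduction of a 40x80 band to 50 values follows from averaging non-overlapping 8x8 blocks: (40/8) x (80/8) = 5 x 10 = 50. A minimal sketch is given below; it assumes the band is stored as 40 rows of 80 pixels and that normalization means dividing by the maximum gray level 255, neither of which is stated explicitly in the paper.

```python
def vectorize(image, block=8, max_value=255):
    """Average each block x block tile and scale the means to [0, 1].

    The image dimensions are assumed to be multiples of `block`.
    """
    height, width = len(image), len(image[0])
    features = []
    for top in range(0, height, block):
        for left in range(0, width, block):
            total = sum(
                image[y][x]
                for y in range(top, top + block)
                for x in range(left, left + block)
            )
            features.append(total / (block * block) / max_value)
    return features


# Toy band: left half white (255), right half black (0).
image = [[255 if x < 40 else 0 for x in range(80)] for _ in range(40)]
vector = vectorize(image)

print(len(vector))   # 50
print(vector[:10])   # first row of blocks: five 1.0 values, then five 0.0 values
```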

Fig. 3. Mouths and noses from our tool (the first four images) and from public databases.

3.3 Gestures
For gesture analysis, the supervised classifier is implemented by means of the J48 algorithm instead of an EBP NN classifier. Rule induction systems are currently employed in several different domains, ranging from loan request evaluation to fraud detection, bioinformatics and medicine [5]. The main goal of this scheme is to minimize the number of tree levels and tree nodes, thereby maximizing data generalization. The input is a 10-element array whose features are: the x and y coordinates of the right-hand CG; the x and y coordinates of the left-hand CG; the position of the right hand (with respect to the torso of the subject being filmed); the position of the left hand; the slant of the right and left hands (measured in radians); and the velocity of the movement of the right and left hands. To find the CG, frames are processed according to the following workflow: skin detection by color-space conversion from RGB to HSV; background subtraction to isolate only the hand regions; image smoothing and binarization; tracing of the rectangles that contain the hands; CG identification; edge and feature detection; template matching to detect the resting position of the hands; gesture classification; and storing of the data in a .csv file.
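The first step of this workflow, skin detection after conversion from RGB to HSV, can be sketched with the standard `colorsys` module. The hue, saturation and value thresholds below are hypothetical, chosen only to make the example work; the paper does not report the values actually used.

```python
import colorsys

# Hypothetical thresholds (HSV components in [0, 1]); not taken from the paper.
MAX_HUE = 50 / 360
MIN_SATURATION, MAX_SATURATION = 0.15, 0.90
MIN_VALUE = 0.35


def is_skin(r, g, b):
    """Classify one 8-bit RGB pixel as skin or non-skin in HSV space."""
    h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    return (h <= MAX_HUE
            and MIN_SATURATION <= s <= MAX_SATURATION
            and v >= MIN_VALUE)


def skin_mask(image):
    """Return a binary mask with 1 for skin-coloured pixels."""
    return [[1 if is_skin(*pixel) else 0 for pixel in row] for row in image]


# Toy 2x2 image: a skin-like tone, pure blue, mid gray, and the skin tone again.
image = [[(224, 172, 105), (0, 0, 255)],
         [(128, 128, 128), (224, 172, 105)]]
mask = skin_mask(image)
print(mask)
```

Working in HSV separates chromatic content (hue, saturation) from brightness (value), which is the usual motivation for converting from RGB before thresholding skin tones.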

4 Experimental results
In this paper we have presented a system that separately recognizes shapes and two of the six primary emotions, and analyzes information derived from gestures. The complete project aims to recognize all primary emotions. In the following we show, separately, the results related to facial expressions, gestures and silhouettes. For facial processing, using about 150 test images, the NNs achieved a success rate of about 90% for the happy/no-happy emotion and 85% for the disgust/no-disgust emotion. We consider these results reliable, also because in some particular cases not even human beings can distinguish emotions exactly. For gestures, the confusion matrix is shown in Table 1. The J48 classifier correctly classified approximately 80% of gestures. In particular, it was able to label metaphoric gestures precisely. Performance is not optimal for the recognition of deictic gestures and beat gestures. Iconic gestures are not present in the prerecorded video.
Table 1. Confusion matrix of the data set for gestures. Deictic (A); Spontaneous (B); Beat (C); not recognized (D); Metaphoric (E).

      A     B     C     D     E
A    53    71     3     0     0
B     6   280    10     3     1
C     1     9    54     1     0
D     1     2     1    20     0
E     0     1     1     0    25
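The overall rate of approximately 80% can be checked directly from Table 1: the diagonal sums to 432 out of 543 samples. The short sketch below performs this check and also computes a per-row rate; reading the rows as the actual classes is an assumption, since the table does not state its orientation.

```python
LABELS = ["A", "B", "C", "D", "E"]
MATRIX = [
    [53, 71, 3, 0, 0],
    [6, 280, 10, 3, 1],
    [1, 9, 54, 1, 0],
    [1, 2, 1, 20, 0],
    [0, 1, 1, 0, 25],
]

total = sum(sum(row) for row in MATRIX)                   # 543
correct = sum(MATRIX[i][i] for i in range(len(MATRIX)))   # 432
accuracy = correct / total

# Fraction of each row that lies on the diagonal (rows assumed to be actual classes).
per_row = {
    label: MATRIX[i][i] / sum(MATRIX[i])
    for i, label in enumerate(LABELS)
}

print(total, correct, round(accuracy, 3))   # 543 432 0.796
for label, rate in per_row.items():
    print(label, round(rate, 3))
```

The total of 543 matches the number of gesture images reported for the second dataset.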

On average, the neural network shows good results in terms of false positives; the following figure reports detected vs. total humans for each frame.

Fig. 4. Detected vs. total number of humans for each tested frame. Panels (detected vs. total): 4 vs. 5; 4 vs. 6; 4 vs. 6; 3 vs. 4; 2 vs. 5; 2 vs. 5.

The goal of this paper is to investigate emotion-related cues and to realize a system that separately recognizes emotional patterns of the body and face using neural networks. The research aims at developing an intelligent system that can interpret intellectual conversation between human beings. When we interact with others, we continuously give and receive countless wordless signals. The nonverbal signals we send either produce a sense of interest, trust, and desire for connection, or they generate disinterest, distrust, and confusion. The analyzed gestures and facial emotions represent non-verbal communication; they provide cues to what the speaker is saying, thus helping the listener to interpret the meaning of words. Future work envisages the design of a new multimodal system that simultaneously performs emotion recognition by means of several other facial bands (eyebrow band, eye bands), gesture recognition and human silhouette recognition.

[1] Paul Ekman, FACS: Facial Action Coding System, Research Nexus division of Network Information Research Corporation, Salt Lake City, UT 84107 (2002).
[2] V. Bevilacqua, D. D'Ambruoso, G. Mandolino, M. Suma, A new tool to support diagnosis of neurological disorders by means of facial expressions, IEEE Proc. of MeMeA, pp. 544-549.
[3]
[4]
[5] F. Menolascina, V. Bevilacqua et al., Novel Data Mining Techniques in aCGH based Breast Cancer Subtypes Profiling: the Biological Perspective, Proc. of IEEE Symp. on Comp. Intelligence in Bioinformatics and Comp. Biology (CIBCB 2007), pp. 9-16.