
Emotion Recognition using Face Images

Fatma Guney
Bogazici University
Computer Engineering
Istanbul, Turkey 34342
Email: guneyftm@gmail.com

Abstract—In this study, I present a real-time facial expression analysis system that I developed as the course project for Data Mining for Visual Media. Specifically, the task is to train a system that can recognize six basic emotion types, namely anger, disgust, fear, happiness, surprise and sadness, plus the neutral expression. For this task, I employed a local appearance-based representation using the Discrete Cosine Transform (DCT), which has been shown to be very effective for real-time processing and robust against lighting changes. Using these features, I trained one-versus-all support vector machine classifiers for each emotion type.

I. INTRODUCTION

A person's face changes according to the person's emotions or internal states. The face is a natural and powerful communication tool, and analyzing facial expressions through the movements of facial muscles leads to various applications. Facial expression recognition plays a significant role in Human Computer Interaction systems. Humans can understand and interpret each other's facial changes and use this understanding to respond and communicate. A machine capable of meaningful and responsive communication is one of the main goals in robotics. Many other areas also benefit from advances in facial expression analysis, such as psychiatry, psychology, educational software, video games, animation, lie detection and other practical real-time applications.
The objective of this study is to get an overview of the current methods and, based on the proposed solutions, to develop a real-time facial expression recognition system that recognizes the neutral expression and the six prototypic expressions: happiness, sadness, anger, surprise, disgust and fear. The study builds on the approach of [1].
In this study, a facial emotion recognition system is developed. First, face and eye detection based on the modified census transform (MCT) is performed on the input image. After detection, alignment based on the eye coordinates of the detected face is applied to scale and translate the face and to reduce variance in the feature space. The aligned face image is divided into local blocks and the discrete cosine transform is applied to these blocks. Concatenating the features of all blocks yields an overall feature vector, which is scaled before classification. For classification, one-versus-all Support Vector Machine (SVM) classifiers are trained with cross-validation.

II. RELATED WORK

There are many studies in the literature that aim to recognize facial changes and motions from visual data in order to analyze facial expressions. According to literature reviews, the steps of a facial expression analysis system can be summarized as face acquisition, facial data extraction or representation, and facial expression recognition [2]. The first step, face acquisition, is composed of face detection and head pose estimation.
The second step is facial data extraction and representation, which can be carried out as geometric and/or appearance-based feature extraction. The last step is facial expression recognition, which can be categorized as frame-based or sequence-based classification. There are many dimensions to facial expression analysis, including individual differences between subjects, the intensity of an expression, deliberate versus spontaneous expression, head orientation and scene complexity, image acquisition and resolution, reliability of the ground truth, databases, and the relation to other facial or non-facial behaviors. A multimodal approach, that is, the fusion of, for example, audio and visual expression systems, is an example methodology for combining facial and non-facial behaviors [3].
Within the scope of facial expression analysis, there are deeper-level studies based on facial muscles [1]. Facial parameterization to encode the movements of facial muscles is a highly studied subject in both psychology and computer science. The most common coding scheme is the Facial Action Coding System (FACS), which defines action units corresponding to atomic facial muscle actions; similarly, there are the MPEG-4 Facial Animation Parameters (FAPs) [3].
III. METHODS

A. Face Detection and Alignment

Fig. 1. Face detection and alignment examples.
The face and eyes are automatically detected using a modified census transform (MCT) based face and eye detector [4], [5]. Alignment of the detected face is necessary to decrease the variation in feature space caused by pose, angle and scale changes. The alignment is a transformation of the face in Euclidean space using the detected eye coordinates: the face is cropped and scaled according to a fixed distance between the two eyes, and the eyes are translated to the same fixed position in all images. Some examples of detection and alignment are shown in Fig. 1.
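As an illustration, one way to realize this eye-based alignment is a similarity transform estimated from the two eye positions. The sketch below (Python/OpenCV) is not the implementation used in the study; the target eye positions are assumptions chosen to match the 64x80 geometry described in Section IV-B.

```python
import numpy as np
import cv2

def align_face(image, left_eye, right_eye,
               out_size=(64, 80),                       # (width, height) of aligned face
               target_left=(16, 35), target_right=(48, 35)):
    """Rotate, scale and translate so both eyes land on fixed target positions.

    left_eye / right_eye are detected (x, y) coordinates; the target positions
    keep the eyes on row 35 with a 32-pixel inter-eye distance (assumed values).
    """
    left_eye = np.array(left_eye, dtype=np.float32)
    right_eye = np.array(right_eye, dtype=np.float32)
    tl = np.array(target_left, dtype=np.float32)
    tr = np.array(target_right, dtype=np.float32)

    # Uniform scale: desired inter-eye distance over detected inter-eye distance.
    scale = np.linalg.norm(tr - tl) / np.linalg.norm(right_eye - left_eye)
    # Rotation that makes the eye line horizontal (both target eyes share one row).
    dx, dy = right_eye - left_eye
    angle = np.degrees(np.arctan2(dy, dx))

    # 2x3 similarity transform about the left eye, then shift the left eye onto
    # its target position; the right eye follows by construction.
    matrix = cv2.getRotationMatrix2D((float(left_eye[0]), float(left_eye[1])), angle, scale)
    matrix[:, 2] += tl - left_eye
    return cv2.warpAffine(image, matrix, out_size)
```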

B. Local Appearance-based Face Representation

Fig. 2. Local appearance-based feature extraction scheme using DCT.

For face representation, a local appearance-based approach is used. In this approach, the face is divided into non-overlapping blocks and feature extraction is performed on these blocks instead of on the whole face, as can be seen in Fig. 2. When the appearance of the face changes due to, for example, occlusion, using local blocks has the advantage that only the affected block or blocks change, whereas with feature extraction on the entire face the whole representation is affected [6].
For feature extraction on the local regions, the discrete cosine transform (DCT) is used. The DCT is a signal analysis tool that is frequently used in facial image analysis because it provides frequency information in a compact representation. The DCT is also frequently preferred in real-time applications due to its fast computation. The DCT representation has been shown to be robust to lighting changes and scaling variations thanks to its decomposition capability: the elements sensitive to lighting and scaling variations can simply be removed. For example, the first coefficient represents the average intensity value of the face image, which is directly affected by illumination variations [6]. A detected and aligned face image is divided into blocks of 8x8 pixels, and each block is then represented by its DCT coefficients. A block size of 8x8 is commonly chosen in face recognition applications because it is small enough to provide stationarity within the block at a modest transform complexity, and large enough to provide sufficient compression [6].
The top-left DCT coefficient, i.e., the first one, is removed from the representation since it only represents the average intensity value of the block. From the remaining DCT coefficients, the five coefficients that carry the most information are extracted using zig-zag scanning, as shown in Fig. 3. Finally, the DCT coefficients extracted from each block are concatenated to construct the overall feature vector [6].

Fig. 3. Zig-zag scanning order used to select the DCT coefficients.

After selecting the coefficients, a two-step normalization is applied to the features of each block. First, since blocks with different brightness levels may have DCT coefficients at different value levels, the magnitude of each local feature vector is normalized to unit norm. Second, since the first coefficients have higher magnitudes, each coefficient is divided by its standard deviation to balance the contributions of the coefficients [6].
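A minimal sketch of this block-based DCT representation is given below (Python/SciPy). It is an illustration of the description above rather than the author's code: the number of retained coefficients is a parameter (the experiments in Section IV-B keep 10 per block), and the per-coefficient standard deviation is computed over the blocks of the image, which is one possible reading of the second normalization step.

```python
import numpy as np
from scipy.fft import dctn

# Zig-zag order of the first (row, col) positions in an 8x8 DCT block (JPEG order).
ZIGZAG = [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2),
          (0, 3), (1, 2), (2, 1), (3, 0), (4, 0)]

def extract_block_dct_features(face, block=8, n_coeffs=10):
    """Block-based DCT features: 2D DCT on non-overlapping blocks, drop the DC
    term, keep n_coeffs zig-zag coefficients per block, apply the two-step
    normalization and concatenate the blocks into one feature vector."""
    h, w = face.shape
    feats = []
    for y in range(0, h - h % block, block):
        for x in range(0, w - w % block, block):
            coeffs = dctn(face[y:y + block, x:x + block].astype(float), norm='ortho')
            # Skip the first (DC) coefficient, then follow the zig-zag scan.
            v = np.array([coeffs[r, c] for r, c in ZIGZAG[1:n_coeffs + 1]])
            v /= np.linalg.norm(v) + 1e-8        # step 1: unit norm per block
            feats.append(v)
    feats = np.array(feats)
    feats /= feats.std(axis=0) + 1e-8            # step 2: balance coefficient magnitudes
    return feats.ravel()
```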

C. Emotion Recognition using SVM

Emotion recognition is modeled as a separate one-versus-all problem for each emotion type. For each SVM classifier, a model with probability estimates is trained. All frames of the videos labeled with the respective emotion class are used as positive samples, and the samples of the other classes as negative samples.
For classification, each frame of a test video is classified by all trained classifiers, and the emotion with the highest probability is taken as the estimate for that frame. Then, voting among the emotion types returned for the frames of a video determines the emotion type for that video.
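The sketch below shows one way to set up the one-versus-all classifiers and the per-video voting using scikit-learn's SVC (the study itself uses LIBSVM [7]); the names, label strings and parameter dictionary are illustrative assumptions.

```python
from collections import Counter
import numpy as np
from sklearn.svm import SVC

EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise", "neutral"]

def train_one_vs_all(train_features, train_labels, params):
    """One RBF-kernel SVM with probability estimates per emotion: frames of that
    emotion are positive samples, frames of all other classes are negative."""
    models = {}
    y = np.asarray(train_labels)
    for emotion in EMOTIONS:
        c, gamma = params[emotion]       # per-class (C, gamma), e.g. from grid search
        models[emotion] = SVC(kernel="rbf", C=c, gamma=gamma, probability=True)
        models[emotion].fit(train_features, (y == emotion).astype(int))
    return models

def classify_video(models, frame_features):
    """Label every frame with its most probable emotion, then majority-vote over
    the frames to obtain the video-level emotion."""
    per_frame = []
    for f in frame_features:
        probs = {e: m.predict_proba([f])[0, 1] for e, m in models.items()}
        per_frame.append(max(probs, key=probs.get))
    return Counter(per_frame).most_common(1)[0][0]
```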
Scaling before applying the SVM is very important to prevent features with large ranges from dominating the others [7]. The mean and standard deviation are determined for each feature index. During training, each sample is scaled by making every feature zero-mean and unit-variance over all feature vectors. The normalization parameters are saved during training and applied at test time before classification.
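A small sketch of this scaling step (functionally the same as scikit-learn's StandardScaler; shown here only to make the train/test asymmetry explicit):

```python
import numpy as np

class FeatureScaler:
    """Zero-mean, unit-variance scaling per feature index; statistics are
    estimated on the training vectors and reused unchanged at test time."""

    def fit(self, train_vectors):
        X = np.asarray(train_vectors, dtype=float)
        self.mean_ = X.mean(axis=0)
        self.std_ = X.std(axis=0) + 1e-8   # guard against constant features
        return self

    def transform(self, vectors):
        return (np.asarray(vectors, dtype=float) - self.mean_) / self.std_
```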
In this study, an SVM with a radial basis function (RBF) kernel is used [1]. The RBF kernel transforms the features into a higher-dimensional space so that they can then be linearly separated by a hyperplane.
For the optimization of the slack parameter C and the kernel parameter γ, five-fold cross-validation is used. A grid search on C and γ using cross-validation is performed as recommended in [7]. In the grid search, pairs of (C, γ) are tried and the one with the best cross-validation accuracy is picked.
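One possible realization of this search is scikit-learn's GridSearchCV, sketched below; the exponent ranges are illustrative placeholders, not necessarily the ones used in the study.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Exponentially spaced candidate values for C and gamma (illustrative ranges).
param_grid = {
    "C":     [2.0 ** k for k in range(-3, 13, 2)],
    "gamma": [2.0 ** l for l in range(-15, 5, 2)],
}

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
# search.fit(train_features, train_labels)  # picks the (C, gamma) pair with the best
# print(search.best_params_)                # five-fold cross-validation accuracy
```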
IV. EXPERIMENTS

A. Dataset

The FGnet Facial Expression and Emotion Database is used in the experiments. It is an unpublished database with spontaneous and natural facial expressions. It contains video sequences for the six basic emotions plus neutral. The videos are of high quality, with good lighting conditions and a constant background. The database contains videos of 19 different individuals, each performing all six desired expressions and an additional neutral sequence three times. Thus 21 sequences are recorded for each individual, giving a total of 399 sequences in the database.

Two recordings of each emotion and of the neutral expression per subject are used for training and validation; the remaining sequences constitute the test set.
B. Experimental Setup

For emotion recognition, each detected face is scaled to 64x80 pixels and aligned so that the eye row is at row 35 and the distance between the eyes is 32 pixels [4], [5]. The DCT is performed on blocks of 8x8 pixels. For each block the first 10 coefficients in the zig-zag scanning order are kept, leading to an 8x10x10 = 800 dimensional feature vector. After scaling the overall feature vector, SVM parameter optimization is performed by grid search using five-fold cross-validation, with C = 2^k and γ = 2^l swept over a range of exponents k and l.
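As a quick check of the stated dimensionality (a sketch using only the numbers from this section):

```python
face_width, face_height, block_size = 64, 80, 8
coeffs_per_block = 10
n_blocks = (face_width // block_size) * (face_height // block_size)  # 8 * 10 = 80 blocks
print(n_blocks * coeffs_per_block)                                   # 80 * 10 = 800 dimensions
```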
C. Results
TABLE I
PARAMETERS OF SVM

Emotion      C      γ
Anger        2^10   2^1
Disgust      2^10   2^1
Fear         2^10   2^1
Happiness    2^12   2^1
Sadness      2^10   2^1
Surprise     2^10   2^1

The results of the SVM parameter optimization using grid search can be seen in Table I.
For the training step, the cross-validation error rates for each emotion type are shown in Table II. The error rate is calculated as
Error Rate = (FP + FN) / (TP + TN + FP + FN)          (1)

where TP is the number of correctly classified positive samples, FP the number of samples that have been classified incorrectly as positive, TN the number of correctly classified negative samples, and FN the number of samples that have been classified incorrectly as negative.
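Equation (1) as a small helper, with an illustrative (made-up) example in the comment:

```python
def error_rate(tp, tn, fp, fn):
    """Per-class error rate of Eq. (1): misclassified samples over all samples."""
    return (fp + fn) / (tp + tn + fp + fn)

# Example with made-up counts: error_rate(10, 70, 12, 8) == (12 + 8) / 100 == 0.20
```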
TABLE II
ERROR RATES FOR CROSS VALIDATION

Emotion/Fold   Fold-1   Fold-2   Fold-3   Fold-4   Fold-5   avg
Anger          0.19     0.27     0.12     0.26     0.13     0.19
Disgust        0.22     0.24     0.24     0.21     0.20     0.22
Fear           0.18     0.16     0.10     0.13     0.32     0.17
Happiness      0.17     0.08     0.10     0.06     0.12     0.10
Sadness        0.45     0.35     0.24     0.28     0.25     0.31
Surprise       0.33     0.10     0.23     0.10     0.08     0.16

From the test results, a confusion matrix is constructed over the emotion types, as shown in Table III.

TABLE III
CONFUSION MATRIX

             Anger   Disgust   Fear   Happiness   Sadness   Surprise
Anger         13
Disgust                18
Fear                             12
Happiness                                17
Sadness                                              16
Surprise                                                       15

V. DISCUSSION AND CONCLUSION


A real-time emotion recognition system using face data is proposed and developed. Given an input image or video sequence, faces are first detected and aligned for feature extraction. Local appearance-based feature extraction using the DCT is applied to each aligned face. The features of all blocks are concatenated to obtain the overall feature vector, and each feature is scaled before applying the SVM. For classification, one-versus-all SVM classifiers are trained and used. The obtained results show that SVM classification using local appearance-based features gives very promising results for the emotion recognition problem. The validation accuracies are not that high because of the spontaneous expressions: making people feel the intended emotion while they are sitting in front of a screen is not an easy task. In particular, some classes such as sadness or fear do not look very genuine in the recordings, and consequently their results are lower than those of the other emotion types. The test results are quite good. Fear is generally misclassified as surprise, and vice versa, because similar facial muscles are used for these expressions.
REFERENCES
[1] Gehrig, T. and Ekenel, H.K., "A Common Framework for Real-Time Emotion Recognition and Facial Action Unit Detection," IEEE Workshop on CVPR for Human Communicative Behavior Analysis, June 2011.
[2] Tian, Y.L., Kanade, T., and Cohn, J.F., "Facial Expression Analysis," in Handbook of Face Recognition, Li, S.Z. and Jain, A.K. (Eds.), pp. 247-276, Springer, New York, USA, 2005.
[3] Zeng, Z., Pantic, M., Roisman, G.I., and Huang, T.S., "A Survey of Affect Recognition Methods: Audio, Visual, and Spontaneous Expressions," IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(1):39-58, 2009.
[4] CVHCI Lab, Okapi (Open Karlsruhe Library for Processing Images), C++ library.
[5] Kublbeck, C. and Ernst, A., "Face detection and tracking in video sequences using the modified census transformation," Image and Vision Computing, 24(6):564-572, June 2006.
[6] Ekenel, H.K., "A Robust Face Recognition Algorithm for Real-World Applications," PhD thesis, Universitat Karlsruhe (TH), Karlsruhe, Germany, Feb. 2009.
[7] Chang, C.-C. and Lin, C.-J., "LIBSVM: A Library for Support Vector Machines," 2001, http://www.csie.ntu.edu.tw/~cjlin/libsvm
