
Volume 2, Issue 3, March 2012

ISSN: 2277 128X

International Journal of Advanced Research in Computer Science and Software Engineering


Research Paper Available online at: www.ijarcsse.com

Use of SVM Classifier & MFCC in Speech Emotion Recognition System


Bhoomika Panda1*, Debananda Padhi2, Kshamamayee Dash3, Prof. Sanghamitra Mohanty4
P.G. Dept. of CS&A, Utkal University, Vani Vihar, Bhubaneswar, Odisha, India
*bhoomika.panda@gmail.com
Abstract: Automatic Speech Emotion Recognition (SER) is a current research topic in the field of Human Computer Interaction (HCI) with a wide range of applications. Speech features, namely Mel Frequency Cepstrum Coefficients (MFCC), are extracted from the speech utterances. A Support Vector Machine (SVM) is used as the classifier to separate emotional states such as anger, happiness, sadness, neutral and fear, drawn from a database of emotional speech collected from various emotional drama soundtracks. The SVM gives 93.75% classification accuracy for the gender-independent case, 94.73% for male speech and 100% for female speech.

Keywords: Speech emotion, Emotion Recognition, SVM, MFCC, Emotion verification

I. INTRODUCTION
Automatic Speech Emotion Recognition is a very recent research topic in the Human Computer Interaction (HCI) field. As computers have become an integral part of our lives, the need has arisen for a more natural communication interface between humans and computers. To achieve this goal, a computer would have to be able to perceive its present situation and respond differently depending on that perception. Part of this process involves understanding a user's emotional state. To make human-computer interaction more natural, it would be beneficial to give computers the ability to recognize emotional situations the same way a human does. Automatic Emotion Recognition (AER) can be done in two ways: by speech or by facial expressions. In the field of HCI, speech is primary to the objectives of an emotion recognition system, as are facial expressions and gestures. Speech is considered a powerful mode for communicating intentions and emotions. In recent years, a great deal of research has been done to recognize human emotion using speech information [1], [2]. Many researchers have explored several classification methods, including Neural Networks (NN), Gaussian Mixture Models (GMM), Hidden Markov Models (HMM), the Maximum Likelihood Bayes classifier (MLC), Kernel Regression, K-nearest Neighbors (KNN) and Support Vector Machines (SVM) [3], [4]. The Support Vector Machine is used here as the classifier for emotion recognition. The SVM can be used for both classification and regression; it performs classification by constructing an N-dimensional hyperplane that optimally separates the data into categories. The classification is achieved by a linear or nonlinear separating surface in the input feature space of the dataset. Its main idea is to transform the original input set to a high-dimensional feature space by using a kernel function,

and then achieve optimum classification in this new feature space. In this paper, Section II gives the requirements of speech emotion recognition, Section III describes the SER feature extraction technique, Section IV describes the emotion database, Section V gives details of SVM classification, and Sections VI and VII present the experimental results and the conclusion and future work, respectively.

II. REQUIREMENT OF SER
In human-robotic interfaces: robots can be taught to interact with humans and recognize human emotions, so that robotic pets can understand not only spoken commands but also other information, such as the emotional and health status of their human commander, and act accordingly. In smart call-centers: SER helps to detect potential problems that arise from an unsatisfactory course of interaction; a frustrated customer is typically offered the assistance of a human operator or some reconciliation strategy. In intelligent spoken tutoring systems: detecting and adapting to students' emotions is considered an important strategy for closing the performance gap between human and computer tutors, as students' emotions can impact their performance and learning.

III. SER FEATURE EXTRACTION
In previous works, several features such as energy, pitch and formant frequencies were extracted for classifying speech affect; all of these are prosodic features. In general, prosodic features are the primary indicators of a speaker's emotional state. Here, in the feature extraction process, the Mel Frequency Cepstral Coefficients (MFCC) are extracted. Fig. 1 shows the MFCC feature extraction process, which contains the following steps:

Pre-processing: In the pre-processing stage, each signal is first de-noised by soft-thresholding its wavelet coefficients. Since the silent parts of the signal do not carry any useful information, those parts, including the leading and trailing edges, are eliminated by thresholding the energy of the signal. The signals are then divided into frames using a Hamming window of length 23 ms.
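The pre-processing stage can be sketched as follows. This is a minimal illustration assuming numpy, with a simple energy threshold standing in for both the wavelet de-noising and the silence-removal steps; the threshold value and the 50% frame overlap are assumptions for the sketch, not values taken from the paper:

```python
import numpy as np

def preprocess(signal, fs, frame_ms=23, energy_thresh=0.01):
    """Trim low-energy (silent) leading/trailing samples, then split the
    signal into Hamming-windowed frames of ~23 ms, as described above."""
    # Energy-based trimming: keep the span of samples whose energy exceeds
    # a small fraction of the peak energy (a stand-in for the paper's
    # wavelet soft-thresholding plus energy thresholding).
    energy = signal ** 2
    active = np.where(energy > energy_thresh * energy.max())[0]
    trimmed = signal[active[0]:active[-1] + 1]
    # Frame into 23 ms chunks with 50% overlap, applying a Hamming window.
    n = int(fs * frame_ms / 1000)
    hop = n // 2
    frames = [trimmed[i:i + n] * np.hamming(n)
              for i in range(0, len(trimmed) - n + 1, hop)]
    return np.array(frames)

# Toy signal: silence, a 440 Hz tone, silence.
fs = 16000
t = np.linspace(0, 1, fs, endpoint=False)
sig = np.concatenate([np.zeros(2000), np.sin(2 * np.pi * 440 * t), np.zeros(2000)])
frames = preprocess(sig, fs)
```

At 16 kHz a 23 ms frame is 368 samples, so `frames` has shape `(num_frames, 368)`.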
into the first number of components and shortens the vector to that number of components.
Feature Labelling: In feature labelling, each extracted feature is stored in a database along with its class label. Though the SVM is a binary classifier, it can also be used for classifying multiple classes. Each feature is associated with its class label, e.g. angry, happy, sad, neutral, fear.
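Feature labelling can be sketched as a small table of (label, feature-vector) pairs flattened into arrays ready for classifier training; the feature values below are made up purely for illustration:

```python
import numpy as np

# Hypothetical labelled "database": each MFCC feature vector is stored
# together with its emotion class label, as described above.
database = [
    ("angry",   [12.1, -3.4,  0.7]),
    ("happy",   [ 9.8, -1.2,  2.2]),
    ("sad",     [ 7.5, -4.0, -1.1]),
    ("neutral", [ 8.9, -2.5,  0.3]),
    ("fear",    [11.0, -3.9,  1.8]),
]

# Flatten into the (X, y) form expected by an SVM trainer.
y = [label for label, _ in database]
X = np.array([feats for _, feats in database])
```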
Figure 1: Block diagram of the MFCC feature extraction process (Input Emotional Speech, Pre-processing, Framing, Windowing, FFT, Magnitude Spectrum, Mel Filter Bank & Frequency Warping, Log Mel Spectrum, DCT, MFCC)


Framing: Framing is the process of segmenting the speech samples obtained from analog-to-digital conversion (ADC) into small frames with a time length in the range of 20-40 ms. Framing enables the non-stationary speech signal to be segmented into quasi-stationary frames and enables Fourier transformation of the speech signal, because the speech signal is known to exhibit quasi-stationary behaviour within short time periods of 20-40 ms.
Windowing: The windowing step windows each individual frame in order to minimize the signal discontinuities at the beginning and the end of each frame.
FFT: The Fast Fourier Transform (FFT) algorithm is ideally suited for evaluating the frequency spectrum of speech. The FFT converts each frame of N samples from the time domain into the frequency domain.
Mel Filterbank and Frequency Warping: The mel filter bank [8] consists of overlapping triangular filters with cutoff frequencies determined by the center frequencies of the two adjacent filters. The filters have linearly spaced centre frequencies and fixed bandwidth on the mel scale.
Take Logarithm: The logarithm has the effect of changing multiplication into addition. Therefore, this step simply converts the multiplication of magnitudes in the Fourier transform into addition.
Take Discrete Cosine Transform: The DCT is used to orthogonalise the filter energy vectors. Because of this orthogonalization step, the information in the filter energy vector is compacted
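The FFT, mel filter bank, logarithm and DCT steps above can be sketched compactly, assuming numpy; the FFT size, number of filters and number of cepstral coefficients are typical choices for this kind of pipeline, not values stated in the paper:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, fs, n_filters=26, n_ceps=13):
    """MFCC of one windowed frame: FFT -> magnitude spectrum ->
    mel filter bank -> log -> DCT, following the steps listed above."""
    n_fft = 512
    spec = np.abs(np.fft.rfft(frame, n_fft))          # magnitude spectrum
    # Triangular filters with linearly spaced centres on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    log_mel = np.log(fbank @ spec + 1e-10)            # log mel spectrum
    # DCT-II decorrelates (orthogonalises) the filter energies and
    # compacts the information into the first coefficients.
    k = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * k + 1) / (2 * n_filters))
    return dct @ log_mel

# One Hamming-windowed 23 ms frame (368 samples at 16 kHz) of a 440 Hz tone.
fs = 16000
frame = np.hamming(368) * np.sin(2 * np.pi * 440 * np.arange(368) / fs)
ceps = mfcc(frame, fs)
```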

Figure 2: Block Diagram of Speech Emotion Recognition System using SVM Classification

Feature Selection: The performance of a pattern recognition system depends highly on the discriminative ability of the features. By selecting the most relevant subset of the original feature set, we can increase the performance of the classifier and, at the same time, decrease the computational complexity. We use the forward selection method for each single binary classifier in our system in order to select the most efficient subset of features: at each step, the variable that increases the performance of the classifier the most is added to the feature subset. Fig. 2 illustrates the concept.
Classification: The recognition of human emotion is essentially a pattern recognition problem. We use LS-SVM (described in the next section) as the classifier in this research. Since we are dealing with a multi-class classification problem, we need a method to extend the two-class support vector classification methodology to a multi-class problem. There are different approaches to multi-category SVM in the literature, among which one-against-all and one-against-one (pairwise) are the most popular. In this paper we compare the results achieved by one-against-all, fuzzy one-against-all, pairwise and fuzzy pairwise [17]. For a comparative study, we also apply a linear classifier with a gradient descent optimization algorithm.
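The one-against-all scheme can be sketched as follows. Note that this toy uses a least-squares linear scorer in place of the paper's LS-SVM, purely to show the wrapper logic of training one binary classifier per class and predicting by highest score; the data points are made up:

```python
import numpy as np

class OneVsAll:
    """Minimal one-against-all wrapper (a sketch, not the paper's LS-SVM):
    one binary linear least-squares classifier per class; prediction is
    the class with the highest score."""
    def fit(self, X, y):
        self.classes_ = sorted(set(y))
        Xb = np.hstack([X, np.ones((len(X), 1))])      # append bias column
        self.W = []
        for c in self.classes_:
            t = np.where(np.array(y) == c, 1.0, -1.0)  # this class vs rest
            w, *_ = np.linalg.lstsq(Xb, t, rcond=None)
            self.W.append(w)
        return self

    def predict(self, X):
        Xb = np.hstack([X, np.ones((len(X), 1))])
        scores = Xb @ np.array(self.W).T               # one score per class
        return [self.classes_[i] for i in np.argmax(scores, axis=1)]

# Three well-separated toy clusters standing in for emotion classes.
X = np.array([[0., 0.], [1., 1.], [10., 0.], [9., 1.], [0., 10.], [1., 9.]])
y = ["sad", "sad", "happy", "happy", "angry", "angry"]
clf = OneVsAll().fit(X, y)
```

The same wrapper shape applies when the binary scorer is replaced by a real (LS-)SVM; only `fit` and the per-class score change.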

2012, IJARCSSE All Rights Reserved

Page | 226

IV. THE EMOTION DATABASE
An important issue to be considered in the evaluation of an emotional speech recognizer is the degree of naturalness of the database used to assess its performance. Incorrect conclusions may be drawn if a low-quality database is used. Moreover, the design of the database is critically important to the classification task being considered. For example, the emotions being classified may be infant-directed (e.g. soothing and prohibition) or adult-directed (e.g. joy and anger). In other databases, the classification task is to detect stress in speech. The classification task is also defined by the number and type of emotions included in the database. Databases can be divided into three types:
- databases of acted emotions, obtained by asking an actor to speak with a predefined emotion;
- databases coming from real-life systems (for example, call-centers);
- databases with elicited emotions, where emotions are provoked and self-report is used for labelling control.
Different types of databases are suitable for different purposes. In our work we have decided to create a database of the first type. We have recorded about 20 persons' voices in five different predefined emotions. We then calculate their MFCCs and create a database.

V. SVM CLASSIFICATION
The support vector machine (SVM) is an effective approach for pattern recognition. Here, the basics of the SVM are briefly presented; they can be found in Cristianini and Shawe-Taylor (2000), Fernandez Pierna, Baeten, Michotte Renier, Cogdill, and Dardenne (2004), Huang and Wang (2006), Kecman (2001), Scholkopf and Smola (2000) and Burges (1998). The main aim of an SVM classifier is to obtain a function f(x), which determines the decision boundary or hyperplane (Fernandez Pierna et al., 2004). This hyperplane optimally separates two classes of input data points (Fig. 3). M is the margin, which is the distance from the hyperplane to the closest point for both classes of data points (Fernandez Pierna et al., 2004; Gunn et al., 1998). In SVM, the data points can be of two types: linearly separable and non-linearly separable (Fernandez Pierna et al., 2004). For linearly separable data points, consider a training set of instance-label pairs (x_k, y_k), k = 1, 2, ..., t, where x_k ∈ R^n and y_k ∈ {+1, −1}. The data points can be classified as:

<w, x_k> + b >= +1  for y_k = +1
<w, x_k> + b <= −1  for y_k = −1   ---(2)

where <w, x_k> denotes the inner product of w and x_k. The inequalities in Eq. (2) can be combined as:

y_k (<w, x_k> + b) >= 1   ---(3)

The SVM classifier places the decision boundary using the maximal margin among all possible hyperplanes. To maximize M, ||w|| has to be minimized subject to the conditions given as:

min (1/2)||w||^2  subject to  y_k (<w, x_k> + b) >= 1   ---(4)

This optimization problem is a quadratic optimization problem. To solve it, the Lagrange function is used; the aim of this optimization is to find the appropriate Lagrange multipliers a_k:

L(w, b, a) = (1/2)||w||^2 − Σ_k a_k [y_k (<w, x_k> + b) − 1]   ---(5)

where w represents a vector that defines the boundary, x_k are input data points and b represents a scalar threshold value (Jack et al., 2002). a_k is a Lagrange multiplier and must satisfy a_k >= 0. For appropriate a_k, the L function should be minimized with respect to the w and b variables and, at the same time, maximized with respect to the non-negative dual variables a_k (Huang & Wang, 2006). Taking the derivatives of L gives Eqs. (6) and (7):

w = Σ_k a_k y_k x_k   ---(6)

Σ_k a_k y_k = 0   ---(7)

Here, there is only one minimum solution (Cornuejols, 2002; Fernandez Pierna et al., 2004); therefore, there are no problems with local minima in SVM. Back-propagation artificial neural networks (ANN) do suffer from local minima, which is an advantage of support vector machines over back-propagation ANN. Eq. (5) is converted to its dual problem under the Karush-Kuhn-Tucker conditions (Cornuejols, 2002; Fernandez Pierna et al., 2004; Gunn et al., 1998; Kuhn & Tucker, 1951):

a_k [y_k (<w, x_k> + b) − 1] = 0,  k = 1, ..., t   ---(8)

The a_k Lagrange multipliers are calculated by using:

---(9)

If Eqs. (6) and (7) are substituted into Eq. (5), the L function changes to the dual Lagrangian dL, which should be maximized with respect to the non-negative a_k. The dL function is given as:

dL(a) = Σ_k a_k − (1/2) Σ_k Σ_l a_k a_l y_k y_l <x_k, x_l>   ---(10)

This is a standard quadratic optimization problem. By solving this dual optimization problem, the most appropriate w and b parameters of the optimal hyperplane are estimated with


respect to a_k. The acquired optimal hyperplane f(x) can be given as:

f(x) = sign( Σ_k a_k y_k <x_k, x> + b )   ---(11)

If an input data point x_k has a non-zero Lagrange multiplier a_k, this x_k is called a support vector. Data points other than the support vectors are not needed for calculating f(x). If the data points are non-separable and a non-linear SVM is used, Eq. (2) should be changed to:

y_k (<w, x_k> + b) >= 1 − ξ_k   ---(12)

where the ξ_k are non-negative slack variables, ξ_k >= 0, k = 1, ..., t. The ξ_k variables keep the constraint violations as small as possible and provide the minimum training error (Huang & Wang, 2006). Thus, Eq. (4) can be rewritten as Eq. (13):

min (1/2)||w||^2 + C Σ_k ξ_k   ---(13)

Here, this optimization problem can be solved as in the separable case; in the dual problem the multipliers are then bounded as:

0 <= a_k <= C   ---(14)

where C is the soft-margin penalty parameter, an upper bound on a_k, determined by the user (Huang & Wang, 2006; Fernandez Pierna et al., 2004). The optimal decision function f(x) is the same as in Eq. (11). If a non-linear SVM is used, the training data points in the input space are transformed to a higher-dimensional feature space by a kernel function Φ(·). Thus, the inner products in the optimization problem (10) are substituted by the kernel function:

K(x_k, x_l) = <Φ(x_k), Φ(x_l)>   ---(15)

Thus, the optimal hyperplane f(x) of a non-linear SVM can be given as:

f(x) = sign( Σ_{k=1}^{SV} a_k y_k K(x_k, x) + b )   ---(16)

where SV is the number of support vectors. Many kernel functions, namely linear, polynomial, radial basis function (RBF), sigmoid, spline, Fourier, B-spline and exponential radial basis function (ERBF), are used for SVM (Gunn et al., 1998; Huang & Wang, 2006). These kernel functions are given in Eqs. (17)-(24), respectively. The kernel parameters in these kernel functions should be set properly to increase the classification accuracy of the SVM (Fernandez Pierna et al., 2004; Huang & Wang, 2006).

Linear kernel function: K(x, x_k) = <x, x_k>   ---(17)

Polynomial kernel function: K(x, x_k) = (<x, x_k> + 1)^d   ---(18)

Radial basis function (RBF) kernel: K(x, x_k) = exp(−||x − x_k||^2 / (2σ^2))   ---(19)

Exponential radial basis function (ERBF) kernel: K(x, x_k) = exp(−||x − x_k|| / (2σ^2))   ---(20)

Fourier kernel function: ---(21)

B-spline kernel function: ---(22)

Spline kernel function: ---(23)

Sigmoid kernel function: K(x, x_k) = tanh(d1 <x, x_k> + c)   ---(24)

where the d parameter is the degree of the polynomial, B-spline and Fourier kernel functions, the σ (sigma) parameter is the width of the RBF and ERBF kernel functions, and the d1 parameter is the scale of the sigmoid kernel function (c is an offset term).
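The closed-form kernels of Eqs. (17)-(19) and (24) can be written directly in numpy; σ, d and d1 are user-chosen parameters as the text notes, and the +1 offset in the polynomial kernel follows the common convention (the sigmoid offset is omitted here for simplicity):

```python
import numpy as np

# Sketches of the kernel functions in Eqs. (17)-(19) and (24).
def k_linear(x, xk):
    return np.dot(x, xk)                                   # Eq. (17)

def k_poly(x, xk, d=2):
    return (np.dot(x, xk) + 1.0) ** d                      # Eq. (18)

def k_rbf(x, xk, sigma=1.0):
    return np.exp(-np.sum((x - xk) ** 2) / (2 * sigma**2)) # Eq. (19)

def k_sigmoid(x, xk, d1=0.1):
    return np.tanh(d1 * np.dot(x, xk))                     # Eq. (24), offset omitted

x = np.array([1.0, 2.0])
xk = np.array([1.0, 2.0])
```

For identical points the RBF kernel evaluates to 1, its maximum, which is one quick way to sanity-check a kernel implementation.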
Figure 3: Separation of two classes by SVM

VI. EXPERIMENTAL RESULTS

Figure 4: (a) Frame blocking of polite speech, (b) Hamming window, (c) Hamming window applied to each frame of polite speech, (d) Frame blocking of angry speech, (e) Hamming window, (f) Hamming window applied to each frame of angry speech

Table 1: Confusion Matrix of RBF-SVM classifier (Gender Independent)

Emotion Recognition (%)
Emotion    Angry    Sad      Happy    Neutral   Fear
Angry      100      0        0        0         0
Sad        0        100      0        0         0
Happy      0        0        100      0         0
Neutral    0        6.25     0        93.75     0
Fear       0        0        30.76    0         69.24

Table 2: Confusion Matrix of RBF-SVM classifier (Male)

Emotion Recognition (%)
Emotion    Angry    Sad      Happy    Neutral   Fear
Angry      100      0        0        0         0
Sad        0        100      0        0         0
Happy      16.66    0        83.34    0         0
Neutral    0        0        0        100       0
Fear       0        0        0        14.85     85.15

Table 3: Confusion Matrix of RBF-SVM classifier (Female)

Emotion Recognition (%)
Emotion    Angry    Sad      Happy    Neutral   Fear
Angry      100      0        0        0         0
Sad        0        100      0        0         0
Happy      0        0        100      0         0
Neutral    0        0        0        100       0
Fear       0        0        0        0         100

Table 4: Confusion Matrix of Polynomial SVM classifier (Gender Independent)

Emotion Recognition (%)
Emotion    Angry    Sad      Happy    Neutral   Fear
Angry      100      0        0        0         0
Sad        0        100      0        0         0
Happy      0        0        100      0         0
Neutral    0        0        0        100       0
Fear       7.69     0        15.18    0         76.92

Table 5: Confusion Matrix of Polynomial SVM classifier (Male)

Emotion Recognition (%)
Emotion    Angry    Sad      Happy    Neutral   Fear
Angry      100      0        0        0         0
Sad        0        100      0        0         0
Happy      0        0        100      0         0
Neutral    0        0        0        100       0
Fear       0        0        14.28    0         85.72

Table 6: Confusion Matrix of Polynomial SVM classifier (Female)

Emotion Recognition (%)
Emotion    Angry    Sad      Happy    Neutral   Fear
Angry      100      0        0        0         0
Sad        0        100      0        0         0
Happy      0        0        100      0         0
Neutral    0        0        0        100       0
Fear       0        0        0        0         100
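Given a confusion matrix with rows as true emotions and entries in percent, the overall accuracy under an equal-samples-per-class assumption is the mean of the diagonal (per-class recall); for the all-diagonal female RBF-SVM matrix this gives 100%, matching the figure reported in the abstract:

```python
import numpy as np

# Confusion matrix in percent: rows = true emotion, columns = predicted.
# An all-diagonal matrix (as in the female RBF-SVM case) means no confusion
# between any pair of the five emotion classes.
cm = 100.0 * np.eye(5)

per_class_recall = np.diag(cm)        # recall of each emotion, in %
accuracy = per_class_recall.mean()    # assumes equal samples per class
```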

VII. CONCLUSION & FUTURE WORK
Although it is difficult to get a fully accurate result, we can show the variations that occur when emotion changes. Using the MFCC algorithm, features are extracted from which we can observe how pitch, frequency and other features change when emotion changes. We have applied the frame blocking and windowing steps of the MFCC algorithm to the same voice and the same sentence in two different emotions and shown the difference in pitch with the change in emotion. Different emotions can then be separated using classifying algorithms such as SVM and K-MEANS. We propose to create a speech corpus to build a database which will later be used by the classifying algorithm. We will use SVM to classify the different emotions.

ACKNOWLEDGEMENT
To complete this work, I have received valuable suggestions and guidance from experts in this field. The book Digital Processing of Speech Signals by L.R. Rabiner and R.W. Schafer is very helpful for beginners starting research in this field. I am also thankful to all the researchers and authors of the manuscripts from which I obtained valuable information for completing this work.

REFERENCES
[1] Rabiner, L.R., Schafer, R.W., Digital Processing of Speech Signals, Pearson Education, 1st Edition, 2004.
[2] Fernandez Pierna, J.A., Baeten, V., Michotte Renier, A., Cogdill, R.P., & Dardenne, P. (2004). Combination of support vector machines (SVM) and near-infrared (NIR) imaging spectroscopy for the detection of meat and bone meal (MBM) in compound feeds. Journal of Chemometrics, 18(7-8), 341-349.
[3] Gang, H., Jiandong, L., & Donghua, L. (2004). Study of modulation recognition based on HOCs and SVM. In Proceedings of the 59th Vehicular Technology Conference, VTC 2004-Spring, 17-19 May 2004 (Vol. 2, pp. 898-902).
[4] Guyon, I., Weston, J., Barnhill, S., & Vapnik, V. (2002). Gene selection for cancer classification using support vector machines. Machine Learning, 46(1-3), 389-422.
[5] S. Kim, P. Georgiou, S. Lee, S. Narayanan. Real-time emotion detection system using speech: multi-modal fusion of different timescale features. Proceedings of the IEEE Multimedia Signal Processing Workshop, Chania, Greece, 2007.
[6] T. Bänziger, K.R. Scherer. The role of intonation in emotional expression. Speech Communication, Vol. 46, 252-267, 2005.
[7] F. Yu, E. Chang, Y. Xu, H. Shum. Emotion detection from speech to enrich multimedia content. Lecture Notes in Computer Science, Vol. 2195, 550-557, 2001.
[8] Barra, R., Montero, J.M., Macias-Guarasa, D'Haro, L.F., San-Segundo, R., Cordoba, R. Prosodic and segmental rubrics in emotion identification. Proc. ICASSP 2005, Philadelphia, PA, March 2005.
[9] Boersma, P. Praat, a system for doing phonetics by computer. Glot International, Vol. 5, No. 9/10, pp. 341-345, 2001.
[10] V. Vapnik, The Nature of Statistical Learning Theory, Springer, 1995.
[11] Han, J., Kamber, M. Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, second edition, 2006.
[12] Liang, H., & Nartimo, I. (1998). A feature extraction algorithm based on wavelet packet decomposition for heart sound signals. In Proceedings of the IEEE-SP International Symposium (pp. 93-96).
[13] Mustafa, H., & Doroslovacki, M. (2004). Digital modulation recognition using support vector machine classifier. In Signals, Systems and Computers, Conference Record of the 38th Asilomar Conference (Vol. 2, pp. 2238-2242).
[14] M.D. Skowronski and J.G. Harris. Increased MFCC filter bandwidth for noise-robust phoneme recognition. Proc. ICASSP-02, Florida, May 2002.
[15] Cowie, R., Douglas-Cowie, E., Tsapatsoulis, N., Votsis, G., Kollias, S., Fellenz, W., and Taylor, J.G. Emotion recognition in human-computer interaction. IEEE Signal Processing Magazine, Vol. 18, No. 1, 32-80, Jan. 2001.
[16] D. Ververidis and C. Kotropoulos. Automatic speech classification to five emotional states based on gender information. Proceedings of the EUSIPCO 2004 Conference, Austria, 341-344, Sept. 2004.
