
Assimilate the Auditory Scale with Wavelet Packet Filters for Multistyle Classification of Speech Under Stress

Nurul Aida Amira Bt Johari, M. Hariharan, A. Saidatul, Sazali Yaacob


School of Mechatronic Engineering, Universiti Malaysia Perlis, Perlis, Malaysia
cintan.jerit@gmail.com, hari@unimap.edu.my
Abstract—Nowadays, people experience high stress levels due to heavy workloads, emergency phone calls, and multitasking. The emotional/stress state of a person affects his/her performance in daily life and in speech production. Research on understanding human emotional/stress states from speech has undergone considerable development over the past two decades. This paper presents a feature extraction method based on wavelet packet decomposition for detecting the emotional or stressed state of a person. Three different wavelet packet filter bank structures are designed, based on the Bark scale, the Mel scale, and the Equivalent Rectangular Bandwidth (ERB) scale. A Linear Discriminant Analysis (LDA) based classifier and a Support Vector Machine (SVM) are employed to identify the emotional/stressed state of a person. In this study, speech samples are taken from the Speech Under Simulated and Actual Stress (SUSAS) database. The experimental results show that the suggested method can be used to identify the stress and emotional state of a person.

Keywords—emotional/stressed states, wavelet packet transform, linear discriminant analysis, support vector machine, stress classification

I. INTRODUCTION

One important and challenging research topic today is the recognition of emotion and stress from a speaker's speech. It is an important application that helps in identifying a speaker's stress and emotional condition, and it belongs to the field of human-computer interaction, or affective computing. Recognition of human affective states is rapidly gaining interest among researchers and industrial developers, since it has a broad range of applications. The study is useful in applications such as robots that recognize emotion, metropolitan emergency telephone systems that direct emotional telephone calls to priority operators, and potentially in interactive voice response systems for multimedia, aircraft voice communication monitoring, and psychiatric diagnosis [1-3]. A user's stress and emotional state can be analyzed from speech patterns. Vocal parameters and prosodic features such as fundamental frequency, intensity (energy), and speaking rate are strongly related to the emotion expressed

in speech [4-12]. Many studies have shown distinctive differences in phonetic features between normal speech and speech produced under stress [4-12], using classifiers such as hidden Markov models [6-9] and neural networks [10-12]. Researchers have proposed different speech features; the most common are MFCCs (Mel-frequency cepstral coefficients), pitch, LPCs (linear prediction coefficients), autocorrelation coefficients, and Teager Energy Operator (TEO) based features [5,13]. Up to now, researchers have not identified a specific feature set for the recognition of emotional/stressed states through speech [12]. The wavelet transform is a promising tool for non-stationary speech analysis; it is capable of analyzing the speech signal in both time and frequency. Zhang Xueying and Jiao Zhiping [14] developed two filterbank structures, based on the Bark scale and the ERB scale, for speech recognition, in which the wavelet packet filter frequency bands are spaced closely to the Bark and ERB scales. The advantage of the wavelet packet is its ability to partition both the low and high frequency bands. The Bark scale [15] is a psychoacoustical scale proposed by Eberhard Zwicker in 1961; it was named after Heinrich Barkhausen, who proposed the first subjective measurements of loudness. In 1938, Fletcher introduced the critical band concept, the bandwidth of the human auditory filter at different characteristic frequencies along the cochlea. He assumed that the auditory filters were rectangular, and several physiologically motivated formulas have since been derived for the ERB values [16]. S. Datta and coworkers [18] developed a new filter structure using a Mel-like admissible wavelet packet structure, in which the filter frequency bands are spaced closely to the Mel scale. In [17], a two-level WP of 32nd-order Daubechies wavelets was used, obtained by minimizing the RMSE with respect to the centre frequencies of the Mel and Bark scales, and a log-energy feature was extracted.
The best result obtained in that study was ~95% for both PCA and LDA classifiers. This paper investigates the usefulness of three different wavelet packet filterbank structures, based on the Bark scale, Mel scale, and ERB scale. Energy and entropy features are extracted from the wavelet packet coefficients of each subband. The simulation results show that the suggested methods can be used to identify the emotional/stressed state of a person.

The work is supported by the grant: FRGS-9003-00224 from the Ministry of Higher Education of Malaysia.

II. DATABASE

The database employed in this study is SUSAS [13]. It consists of stressed speech samples recorded under both simulated and actual conditions. The simulated subcorpus of SUSAS contains 11 speaking styles [7]. Of the five domains available in SUSAS, only three are used here: the talking styles, single tracking task, and Lombard effect domains. The stressed speech was uttered by nine speakers representing three main USA dialects (General American, Boston, and New York). Four speaking styles are considered in this experiment: neutral, angry, Lombard, and loud. The SUSAS database contains 35 isolated words, and each style contains two recordings of each word by each speaker, giving a total of 2524 utterances. All speech samples are recorded at an 8 kHz sampling frequency, twice the bandwidth of the speech signal so as to avoid aliasing, with a resolution of 16 bits per sample. Pairwise stress classification (neutral vs. angry, neutral vs. loud, neutral vs. Lombard) is carried out in this work, where angry is treated as emotional speech while loud and Lombard are treated as stressed speech.

III. METHODOLOGY

In this research, a total of 2524 stressed and emotional speech samples from the angry, Lombard, and loud styles, with neutral as the reference state, are used. A voice activity detector (VAD) is applied to the samples to discard the unvoiced parts of the speech data. The segmented voiced portions are subjected to feature extraction using auditory wavelet packet filters based on the Bark scale, Mel scale, and ERB scale. The energy and entropy features are extracted from each wavelet packet subband, and SVM and LDA are used as classifiers. Fig. 1 depicts the feature extraction and classification phases of the multistyle pairwise stressed speech classification.

A. Wavelet Transform

The wavelet transform provides a time-frequency representation of a signal by decomposing it over dilated and translated wavelets. A wavelet is a waveform of effectively limited duration with an average value of zero. The wavelet transform is defined as the convolution of a signal f(t) with a wavelet function ψ(t) shifted in time by a translation parameter b and dilated by a scale parameter a [16]. The general definition of the wavelet transform is given as:
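The voiced/unvoiced segmentation step, based on short-time energy and zero-crossing rate, can be sketched as below. This is a minimal illustration on a synthetic signal; the frame length and thresholds are hypothetical, since the paper does not specify them.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def voiced_mask(x, frame_len=200, hop=100):
    """Mark frames as voiced: high short-time energy, low zero-crossing rate."""
    frames = frame_signal(x, frame_len, hop)
    energy = (frames ** 2).mean(axis=1)
    zcr = (np.abs(np.diff(np.sign(frames), axis=1)) > 0).mean(axis=1)
    # Hypothetical thresholds, set relative to the utterance's own statistics.
    return (energy > 0.5 * energy.mean()) & (zcr < zcr.mean())

# Synthetic test utterance: a 150 Hz tone (voiced-like) followed by weak noise.
fs = 8000
t = np.arange(fs) / fs
voiced = 0.5 * np.sin(2 * np.pi * 150 * t[:4000])
unvoiced = 0.05 * np.random.default_rng(0).standard_normal(4000)
x = np.concatenate([voiced, unvoiced])

mask = voiced_mask(x)
print(mask[:5], mask[-5:])   # tone frames marked True, noise frames False
```

Only the frames flagged True would be passed on to the wavelet packet feature extraction stage.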

W(a,b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} f(t)\, \psi^{*}\!\left(\frac{t-b}{a}\right) dt    (1)

where a and b are real, * denotes the complex conjugate, and ψ(t) is the wavelet function. The wavelet transform uses a multiresolution technique by which different frequencies are analyzed with different resolutions [18-20]. The discrete wavelet transform (DWT) of a sampled sequence f_n = f(nT) with sampling period T is computed as:

DWT_f[n, a^{j}] = a^{-j/2} \sum_{m=0}^{N-1} f[m]\, \psi^{*}\!\left(\frac{m-n}{a^{j}}\right)    (2)

where m and n are integers and the value of a is equal to 2.
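In practice the dyadic DWT of equation (2) is computed as a filter bank: convolve with a low-pass and a high-pass filter, then downsample by two. A minimal one-level sketch using the classic 4-tap Daubechies filter follows; this is an illustration, not the authors' exact implementation (the paper uses the 4th-order Daubechies wavelet).

```python
import numpy as np

# 4-tap Daubechies scaling (low-pass) filter; the wavelet (high-pass) filter
# is its quadrature mirror: g[n] = (-1)^n * h[L-1-n].
s3 = np.sqrt(3.0)
h = np.array([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / (4 * np.sqrt(2.0))
g = np.array([h[3], -h[2], h[1], -h[0]])

def dwt_level(x):
    """One DWT level: filter with h (approximation) and g (detail), then
    downsample by 2."""
    a = np.convolve(x, h[::-1])[::2]   # approximation coefficients
    d = np.convolve(x, g[::-1])[::2]   # detail coefficients
    return a, d

x = np.sin(2 * np.pi * np.arange(64) / 8.0)
a, d = dwt_level(x)
print(len(a), len(d))   # each band holds roughly half the samples
```

Repeating `dwt_level` on the approximation band alone gives the left-recursive DWT tree described in the next subsection.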


B. Wavelet Packets

In the DWT decomposition procedure, a signal is decomposed into two frequency bands: a lower frequency band (approximation coefficients) and a higher frequency band (detail coefficients). Only the low frequency band is used for further decomposition, so the DWT gives a left-recursive binary tree structure. In the wavelet packet (WP) decomposition procedure, both the lower and higher frequency bands are decomposed into two sub-bands, so the wavelet packet gives a balanced binary tree structure. In the tree, each subspace is indexed by its depth i and the number of subspaces p. The two wavelet packet orthogonal bases at a parent node (i, p) are given by the following forms [18-20]:

\psi_{i+1}^{2p}(k) = \sum_{n=-\infty}^{\infty} l[n]\, \psi_{i}^{p}(k - 2n)    (3)

where l[n] is the low pass (scaling) filter, and

\psi_{i+1}^{2p+1}(k) = \sum_{n=-\infty}^{\infty} h[n]\, \psi_{i}^{p}(k - 2n)    (4)

where h[n] is the high pass (wavelet) filter. Wavelet packet decomposition partitions the high frequency side into smaller bands, which cannot be achieved with the ordinary discrete wavelet transform.

Fig. 1. Block diagram of the feature extraction and classification phase.

IV. DESIGN OF WAVELET PACKET FILTERS

This section briefly explains the design of the wavelet packet filters and the feature extraction using them. Tables I, II, and III give the lower cut-off frequency (LCF), upper cut-off frequency (UCF), and bandwidth (BW) of all three wavelet packet filterbanks, whose frequency bands closely follow the Bark scale (16 bands), Mel scale (20 bands), and ERB scale (19 bands). The speech samples are sampled at 8 kHz, giving a 4 kHz bandwidth signal. The speech signals are filtered with the 16 Bark scale wavelet packet filters, the 20 Mel scale wavelet packet filters, and the 19 ERB scale wavelet packet filters [14]. The 4th-order Daubechies wavelet is used; it was chosen for the following properties [18]: time invariance, fast computation, and sharp filter transition bands. For a better representation of the sub-band signals, the energy and entropy features are often used [21,22]. The energy feature is extracted from the wavelet packet coefficients using equation (5):

Energy = \frac{1}{n} \sum_{i=1}^{n} \left( C_{n,k}^{P} \right)^{2}    (5)
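The balanced split of equations (3) and (4), applying both the low-pass and high-pass branch at every node, can be sketched as below. Note this fully expands the tree to a fixed depth for illustration; in the auditory filterbanks of Tables I-III the tree is instead pruned to mixed depths (nodes such as (3,7) or (6,4)) so that the band edges track the Bark, Mel, or ERB scale.

```python
import numpy as np

s3 = np.sqrt(3.0)
h = np.array([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / (4 * np.sqrt(2.0))  # low pass l[n]
g = np.array([h[3], -h[2], h[1], -h[0]])                             # high pass h[n]

def wp_decompose(x, depth):
    """Balanced wavelet packet tree: split EVERY band into a low and a high
    child (eqs. (3) and (4)), unlike the DWT, which splits only the
    approximation branch."""
    bands = [np.asarray(x, dtype=float)]
    for _ in range(depth):
        nxt = []
        for b in bands:
            nxt.append(np.convolve(b, h[::-1])[::2])  # child 2p   (low half)
            nxt.append(np.convolve(b, g[::-1])[::2])  # child 2p+1 (high half)
        bands = nxt
    return bands

x = np.random.default_rng(1).standard_normal(256)
bands = wp_decompose(x, depth=3)
print(len(bands))   # full level-3 tree has 2**3 = 8 bands
```

The returned list of subband coefficient arrays is exactly what the energy and entropy features of equations (5) and (6) are computed from.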

where P is the scale index, n represents the number of decomposition levels, and k represents the wavelet packet node. The Shannon entropy can be computed from the extracted wavelet packet coefficients through equation (6):

Entropy = -\sum_{i=1}^{n} \left( C_{n,k}^{P} \right)^{2} \log \left( C_{n,k}^{P} \right)^{2}    (6)

with n = 1, 2, ..., N and k = 0, 1, ..., 2^N - 1, where P is the scale index and n represents the number of decomposition levels. After the computation of the energy and entropy measures from the wavelet packet coefficients of each subband, a feature database is created, and these measures are used as input features for the classifiers to distinguish the speech samples as neutral versus angry, Lombard, or loud.
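Equations (5) and (6) amount to a mean-square and a Shannon-entropy summary per subband. A minimal sketch follows; the random subbands stand in for real wavelet packet coefficients, and the small `eps` guard (an addition of ours, not in the paper) avoids log(0):

```python
import numpy as np

def band_energy(c):
    """Mean squared wavelet packet coefficients, as in eq. (5)."""
    return np.mean(c ** 2)

def band_entropy(c, eps=1e-12):
    """Shannon entropy of the squared coefficients, as in eq. (6)."""
    p = c ** 2
    return -np.sum(p * np.log(p + eps))

# One feature vector per utterance = energy and entropy of every subband
# (16 Bark, 20 Mel, or 19 ERB bands in the paper's filterbanks).
rng = np.random.default_rng(2)
subbands = [rng.standard_normal(50) for _ in range(16)]
features = np.array([[band_energy(c), band_entropy(c)] for c in subbands]).ravel()
print(features.shape)   # (32,) -> 2 features per band, 16 bands
```

Stacking one such vector per utterance yields the feature database fed to the LDA and SVM classifiers.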


TABLE I. FREQUENCY BANDS OBTAINED FROM BARK SCALE WAVELET PACKET DECOMPOSITION

Filter   Bark filters (Hz)        Wavelet      WP bands (Hz)
number   LCF    UCF    BW         level/node   LCF     UCF     BW
1        0      100    100        5,0          0       125     125
2        100    200    100        5,1          125     250     125
3        200    300    100        6,4          250     312.5   62.5
4        300    400    100        6,5          312.5   375     62.5
5        400    510    110        5,3          375     500     125
6        510    770    260        4,2          500     750     250
7        770    920    150        5,6          750     875     125
8        920    1080   160        5,7          875     1000    125
9        1080   1270   190        4,4          1000    1250    250
10       1270   1480   210        4,5          1250    1500    250
11       1480   1720   240        4,6          1500    1750    250
12       1720   2000   280        4,7          1750    2000    250
13       2000   2320   320        4,8          2000    2250    250
14       2320   2700   380        4,10         2500    2750    250
15       2700   3150   450        4,11         2750    3000    250
16       3150   3700   550        3,7          3000    4000    1000

TABLE II. FREQUENCY BANDS OBTAINED FROM MEL SCALE WAVELET PACKET DECOMPOSITION

Filter   Mel filters (Hz)         Wavelet      WP bands (Hz)
number   LCF    UCF    BW         level/node   LCF     UCF     BW
1        0      100    100        5,0          0       125     125
2        100    200    100        5,1          125     250     125
3        200    300    100        5,2          250     375     125
4        300    400    100        5,3          375     500     125
5        400    500    100        5,4          500     625     125
6        500    600    100        5,5          625     750     125
7        600    700    100        5,6          750     875     125
8        700    800    100        5,7          875     1000    125
9        800    900    100        5,8          1000    1125    125
10       900    1000   100        5,9          1125    1250    125
11       1000   1149   149        5,10         1250    1375    125
12       1149   1320   171        5,11         1375    1500    125
13       1320   1516   196        4,6          1500    1750    250
14       1516   1741   225        4,7          1750    2000    250
15       1741   2000   259        4,8          2000    2250    250
16       2000   2297   297        4,9          2250    2500    250
17       2297   2639   342        4,10         2500    2750    250
18       2639   3031   392        4,11         2750    3000    250
19       3031   3482   451        3,6          3000    3500    500
20       3482   4000   518        3,7          3500    4000    500

TABLE III. FREQUENCY BANDS OBTAINED FROM ERB SCALE WAVELET PACKET DECOMPOSITION

Filter   ERB filters (Hz)         Wavelet      WP bands (Hz)
number   LCF    UCF    BW         level/node   LCF     UCF     BW
1        0      36     36         7,0          0       31.25   31.25
2        36     79     42         7,1          31.25   62.5    31.25
3        79     129    49         6,1          62.5    125     62.5
4        129    186    57         6,2          125     187.5   62.5
5        186    253    66         6,3          187.5   250     62.5
6        253    331    77         6,4          250     312.5   62.5
7        331    421    90         6,6          375     437.5   62.5
8        421    526    104        6,7          437.5   500     62.5
9        526    648    121        5,4          500     625     125
10       648    789    141        5,5          625     750     125
11       789    953    160        4,3          750     1000    250
12       953    1143   190        5,8          1000    1125    125
13       1143   1364   220        5,10         1250    1375    125
14       1364   1620   256        4,6          1500    1750    250
15       1620   1918   297        4,7          1750    2000    250
16       1918   2264   344        4,8          2000    2250    250
17       2264   2665   401        4,10         2500    2750    250
18       2665   3131   465        4,11         2750    3000    250
19       3131   3672   541        3,6          3000    3500    500
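How closely a wavelet packet band edge tracks the auditory scale can be checked numerically. The sketch below uses Zwicker's well-known Hz-to-Bark approximation (one common formula; the paper does not state which conversion it used) to compare the classic Bark band edge at 510 Hz with the nearest wavelet packet edge from Table I, 500 Hz at node (4,2):

```python
import math

def hz_to_bark(f):
    """Zwicker's approximation of the Bark scale (assumed formula, not
    taken from the paper)."""
    return 13.0 * math.atan(0.00076 * f) + 3.5 * math.atan((f / 7500.0) ** 2)

# Both edges land near Bark band number 5, so the 500 Hz WP edge is a
# close stand-in for the 510 Hz critical-band edge.
for f in (500, 510):
    print(f, round(hz_to_bark(f), 2))
```

The same comparison, repeated for every row of Tables I-III, is what justifies calling these tree structures "auditory" wavelet packet filterbanks.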

V. CLASSIFIERS

Several classifiers have been proposed in the area of classification of speech under stress. In this paper, a linear discriminant analysis based classifier and an SVM are used to test the effectiveness of the wavelet packet energy and entropy features.

A. Support Vector Machine

The SVM is a relatively new and promising method for solving nonlinear classification, function estimation, density estimation, and pattern recognition tasks [23-24]. It was originally proposed to classify samples between two classes. It maps the training samples of the two classes into a higher-dimensional space through a kernel function, and seeks an optimal separating hyperplane in this new space that maximizes its distance from the closest training points. During testing, a query point is categorized according to its distance from the hyperplane. SVM models are thus built around a kernel function that transforms the input data into an n-dimensional space in which a hyperplane can be constructed to partition the data. The linear kernel, multilayer perceptron kernel, and radial basis function (RBF) kernel are commonly used [23-24]. In this work, the RBF kernel is used, since it gives excellent generalization at low computational cost. In the RBF kernel, σ² (sig2) is an important parameter, as it changes the flexion of the hyperplane:

K(x, x_i) = \exp\!\left( -\frac{\| x - x_i \|^{2}}{2\sigma^{2}} \right)    (7)

In this work, the LS-SVMlab toolbox [25] is used to perform pairwise classification of speech under stress.
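The RBF kernel of equation (7) is easy to sketch on its own; this is just the kernel function, not the full LS-SVM training procedure (which the paper performs with the LS-SVMlab MATLAB toolbox). The sig2 default below is the paper's tuned value of 0.9:

```python
import numpy as np

def rbf_kernel(x, xi, sig2=0.9):
    """RBF kernel of eq. (7); sig2 is the squared bandwidth (the paper's
    tuned value is 0.9, used with regularization gam = 90 in the LS-SVM)."""
    diff = np.asarray(x, dtype=float) - np.asarray(xi, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sig2))

x = np.array([1.0, 2.0])
print(rbf_kernel(x, x))                  # identical points -> 1.0
print(rbf_kernel(x, x + 10.0) < 1e-6)    # distant points -> True (near 0)
```

A small sig2 makes the kernel decay quickly, giving a very flexible decision boundary; a large sig2 approaches a near-linear one.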

There are two parameters to be chosen optimally: the regularization parameter (γ, gam) and σ² (sig2), the squared bandwidth of the RBF kernel. Suitable values of the regularization parameter (γ, gam) and σ² (sig2) were chosen as 90 and 0.9, respectively, to obtain better accuracy.

B. Linear Discriminant Analysis

Discriminant analysis is a statistical technique for classifying objects into mutually exclusive and exhaustive groups based on a set of measurable object features; it is often called pattern recognition, supervised learning, or supervised classification. Linear discriminants (LD) [24] partition the feature space into the different classes using a set of hyperplanes. The parameters of this classifier model are fitted to the available training data by the method of maximum likelihood. With this method, training is achieved by direct calculation and is extremely fast relative to other classifier-building methods such as neural networks. The model assumes that the feature data have a Gaussian distribution for each class. In response to input features, linear discriminants provide a probability estimate for each class, and the final classification is obtained by choosing the class with the highest probability estimate. The LDA based classifier is implemented in MATLAB 7.0.

VI. RESULTS AND DISCUSSIONS

Many studies on automatic emotion and stress recognition from speech have used the SUSAS database, and their results have been presented using several methods and techniques. In 1997, R. Sarikaya presented an ungrouped stress classification study on 11 emotional and stressed speaking styles from the SUSAS database. He implemented a multilayer perceptron (MLP) with backpropagation training

TABLE IV. RESULTS OF AUDITORY WAVELET PACKET FILTERS USING LDA AND SVM CLASSIFIERS FOR ENERGY FEATURES

Wavelet packet filter   Bark Scale      Mel Scale       ERB Scale
Classifier              LDA     SVM     LDA     SVM     LDA     SVM
Neutral vs Angry        88.50   93.28   88.02   90.31   88.49   92.68
Neutral vs Lombard      86.31   90.27   86.15   88.88   73.81   82.34
Neutral vs Loud         91.63   95.43   91.35   95.43   93.25   96.62

TABLE V. RESULTS OF AUDITORY WAVELET PACKET FILTERS USING LDA AND SVM CLASSIFIERS FOR ENTROPY FEATURES

Wavelet packet filter   Bark Scale      Mel Scale       ERB Scale
Classifier              LDA     SVM     LDA     SVM     LDA     SVM
Neutral vs Angry        90.95   92.29   90.20   91.50   91.31   91.60
Neutral vs Lombard      91.31   91.46   91.47   91.86   90.28   92.65
Neutral vs Loud         93.97   96.23   93.81   96.23   93.53   96.62

as the stress classifier. Of the four proposed subband-based features, the subband cepstral (SC) feature proved the most promising, giving 59.1% accuracy [26]. T. L. Nwe (2003) selected four speaking styles, from the emotion category of anger and the stress categories of loud, Lombard, and clear speech, for her system. The extracted features were based on log frequency power coefficients (LFPC) and a Teager Energy Operator based LFPC feature, with reported accuracies of 87% and 89% for stress and emotion classification, respectively [27]. Ling He (2009) used spectrograms subdivided into three sets of alternative frequency bands (critical bands, the Bark scale, and the equivalent rectangular bandwidth (ERB) scale), as well as 12 log-Gabor filters, with the stress features modeled by a Gaussian mixture model (GMM). The results showed that the log-Gabor filters outperformed the alternative frequency bands, with classification rates between 40-80% using the energy feature.

In this study, we developed three sets of auditory wavelet packet filterbanks to imitate the frequency resolution of the human auditory system. In the experiments, evaluating the recognition of emotional and stressed speech from the simulated neutral, angry, Lombard, and loud styles of the SUSAS database, three different wavelet analyses were created, using the multiresolution capability of the wavelet packet (WP) transform to derive salient features of stress and emotional content. First, we segmented only the voiced parts of each input utterance using an end-point detector based on the zero-crossing rate (ZCR) and frame energy; this ensures that the useful information is analyzed and the noise is discarded. In the WP decomposition, emotion/stress discrimination is also carried out on the high frequency subbands: the lower as well as the higher frequency bands are decomposed, giving a balanced binary WP tree-structured filterbank.
Besides, the wavelet transform uses an adaptive window size that allocates more time to lower frequencies and less time to higher frequencies; this dynamic character is very important for differentiating emotional/stressed speech according to properties localized in time (space) and scale (frequency). Effective decomposition of emotional/stressed speech can be achieved by a best basis (band) selection criterion. A full j-level wavelet packet decomposition offers more than 2^{2^{j-1}} orthogonal bases, and the bases contain information at different frequency scales. Basis selection for the voiced speech must therefore be performed over a number of frequency bands before the feature extraction process; this helps in capturing the emotional/stressed information of the corresponding frequencies. For that reason the proposed wavelet packet tree-structured filterbanks are derived, and we investigate the wavelet packet analysis bands (filterbanks). The first set of wavelet packet frequency bands introduced here is the Bark scale wavelet packet (Bark SWP). The Bark SWP consists of 16 frequency bands, represented by 16 coefficients, for characterizing the neutral, angry, Lombard, and loud speaking styles. The Bark SWP forms bands of small bandwidth over the speech spectrum from 0 to 3 kHz and a wider bandwidth above 3 kHz. The second set of frequency bands is the Equivalent Rectangular Bandwidth scale Wavelet

Packet (ERB SWP). The ERB SWP forms 19 frequency bands covering the speech spectrum from 0 to 4 kHz and has the smallest bandwidths of the three types. The last set is the Mel scale wavelet packet (Mel SWP), which adopts 20 filterbanks for the emotion/stress speech under investigation. All the bands in the Bark SWP, ERB SWP, and Mel SWP were selected from the entire set of wavelet packet analysis bands so as to closely approximate the critical bands characterizing human auditory perception. Finally, the neutral, angry, Lombard, and loud speech could be analyzed and tested by measuring prosody-related features, namely the energy and entropy features, to characterize the emotional/stressed information in the neutral, angry, loud, and Lombard utterances. A conventional validation scheme is used for testing the effectiveness of the classifiers: 80% of the data are used for training and 20% for testing. Three experiments are conducted after extracting the energy and entropy features.
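The 80/20 validation scheme and a maximum-likelihood linear discriminant can be sketched end to end. Everything below is illustrative: the data are synthetic stand-ins for the energy+entropy feature database, and the shared-covariance LDA here is one common formulation, not necessarily the exact MATLAB routine the authors used.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic stand-in for the feature database: two classes (think neutral
# vs. loud), each row one utterance's energy+entropy feature vector.
n, dim = 200, 8
X = np.vstack([rng.standard_normal((n, dim)),          # class 0: "neutral"
               rng.standard_normal((n, dim)) + 1.5])   # class 1: shifted mean
y = np.array([0] * n + [1] * n)

# Conventional validation: 80% of the data for training, 20% for testing.
idx = rng.permutation(len(y))
cut = int(0.8 * len(y))
tr, te = idx[:cut], idx[cut:]

# Gaussian LDA with a shared covariance: w = S^-1 (mu1 - mu0).
mu0 = X[tr][y[tr] == 0].mean(axis=0)
mu1 = X[tr][y[tr] == 1].mean(axis=0)
S = np.cov(X[tr].T) + 1e-6 * np.eye(dim)   # small ridge for stability
w = np.linalg.solve(S, mu1 - mu0)
b = -0.5 * w @ (mu0 + mu1)

pred = (X[te] @ w + b > 0).astype(int)
acc = float((pred == y[te]).mean())
print(acc > 0.8)
```

The SVM experiments follow the same split; only the decision function changes.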

TABLE VI. RESULTS OF AUDITORY WAVELET PACKET FILTERS USING LDA AND SVM CLASSIFIERS FOR ENERGY + ENTROPY FEATURES

Wavelet packet filter   Bark Scale      Mel Scale       ERB Scale
Classifier              LDA     SVM     LDA     SVM     LDA     SVM
Neutral vs Angry        91.10   94.26   90.43   91.50   91.98   95.65
Neutral vs Lombard      90.79   93.05   91.42   95.23   91.62   94.84
Neutral vs Loud         94.60   96.42   94.84   97.81   93.25   96.03

First, the LDA and SVM classifiers are trained and tested with the energy features alone, and the results are tabulated in Table IV. The second experiment uses the entropy features alone, with results in Table V, and the third uses the combination of energy and entropy features, with results in Table VI. From the tables, it is observed that the SVM gives better classification for all the wavelet packet filters, and that the entropy features give better classification accuracy than the energy features. From the third experiment, it is observed that the combination of energy and entropy features gives a very promising classification accuracy of more than 94% for all the pairwise stressed speech classifications using the SVM classifier. Examining the pattern of results across the three experiments on the text-independent SUSAS utterances, we notice that using neutral speech as the reference reduces the error in confusing it with emotional angry utterances. The neutral vs. Lombard pair is the most confusable and shows the highest error, which reduces the recognition rate during classification. The combination of energy and entropy

features used to discriminate neutral and loud gives the best results, ~93-97% across the three wavelet packet filterbanks and the three experiments. The primary concern for the stressed speech is the Lombard utterances. Degradation of the results for the Lombard effect using the energy feature is observed in experiment 1, where an accuracy of only 73% is obtained with the WP filters that approximate the ERB scale. This is attributed to the large differences in the excitation generated by the vocal folds and glottis while pronouncing the words, projecting a very high fundamental frequency from the laryngeal cavity. This result infers that there is significant degradation when speech is produced under a stress condition such as Lombard.

VII. CONCLUSIONS

This paper presents a simple feature extraction method based on three different auditory wavelet packet filterbank structures for multistyle classification of speech under stress. To test the effectiveness and reliability of the suggested features, LDA and SVM based classifiers are used, and three experiments are conducted with the extracted features. The experimental results show that the suggested features give a very promising classification accuracy of more than 91% for all the combinations of emotional/stressed speech classification. The wavelet transform (WT) has the properties of time-frequency localization and multiresolution; the main reasons for its popularity are its complete theoretical framework, the great flexibility in choosing the bases or wavelet packets (WP), and its low computational complexity. The suggested method can be used to detect the emotional/stressed state of a person. In future work, feature reduction will be applied to reduce the feature dimension, and other classification algorithms will be developed to improve the current results with less computation.

ACKNOWLEDGEMENT

This work is supported by grant FRGS-9003-00224 from the Ministry of Higher Education of Malaysia.
The authors wish to thank our Vice Chancellor, Y. Bhg. Brig. Jen. Prof. Dato' Dr. Kamarudin Hussin, for his valuable support during the research work.

REFERENCES

[1] O. Kwon, K. Chan, J. Hao, and T. Lee, "Emotion recognition by speech signals," in Proc. EUROSPEECH 2003, Geneva, pp. 125-128, 2003.
[2] N. Mbitiru, P. Tay, J. Z. Zhang, and R. D. Adams, "Analysis of stress in speech using empirical mode decomposition," in Proc. 2008 IAJC-IJME International Conference, pp. 140-146, 2008.
[3] H. Selye, Stress Management and Research Center, http://www.smrc.com.my/index.html, retrieved 01/12/2009.
[4] G. Zhou, J. H. L. Hansen, and J. F. Kaiser, "Nonlinear feature based classification of speech under stress," IEEE Transactions on Speech and Audio Processing, vol. 9, pp. 201-216, 2001.
[5] S. E. Bou-Ghazale and J. H. L. Hansen, "A comparative study of traditional and newly proposed features for recognition of speech under stress," IEEE Transactions on Speech and Audio Processing, vol. 8, pp. 429-442, 2000.
[6] T. L. Nwe, S. W. Foo, and L. C. De Silva, "Classification of stress in speech using linear and nonlinear features," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), Hong Kong, pp. 1394-1398, 2003.
[7] T. L. Nwe, F. S. Wei, and L. C. De Silva, "Speech based emotion classification," in Proc. IEEE Region 10 International Conference on Electrical and Electronic Technology (TENCON '01), pp. 297-301, 2001.
[8] B. Schuller, G. Rigoll, and M. Lang, "Hidden Markov model-based speech emotion recognition," in Proc. 2003 International Conference on Multimedia and Expo (ICME '03), Baltimore, MD, USA, pp. 401-404, 2003.
[9] J. H. L. Hansen and B. D. Womack, "Feature analysis and neural network-based classification of speech under stress," IEEE Transactions on Speech and Audio Processing, vol. 4, pp. 307-313, 1996.
[10] J. Nicholson, K. Takahashi, and R. Nakatsu, "Emotion recognition in speech using neural networks," Neural Computing & Applications, vol. 9, pp. 290-296, 2000.
[11] C. H. Park and K. B. Sim, "Emotion recognition and acoustic analysis from speech signal," in Proc. International Joint Conference on Neural Networks, Portland, OR, USA, pp. 2594-2598, 2003.
[12] S. Casale, A. Russo, and S. Serrano, "Multistyle classification of speech under stress using feature subset selection based on genetic algorithms," Speech Communication, vol. 49, pp. 801-810, 2007.
[13] J. H. L. Hansen and S. E. Bou-Ghazale, "Getting started with SUSAS: A speech under simulated and actual stress database," in Proc. Eurospeech, Rhodes, Greece, pp. 1743-1746, 1997.
[14] Zhang Xueying and Jiao Zhiping, "Speech recognition based on auditory wavelet packet filter," in Proc. International Conference on Signal Processing, pp. 695-698, 2004.
[15] W. H. Abdulla, "Auditory based feature vectors for speech recognition systems," Electrical and Electronic Engineering Department, The University of Auckland, New Zealand, 2005.
[16] R. M. Rao and A. S. Bopardikar, Wavelet Transforms: Introduction to Theory and Applications, Pearson Education Asia, 2000.
[17] N. S. Nehe, D. V. Jadhav, and R. S. Holambe, "Multiresolution features and polynomial kernel subspace approach for isolated word recognition," in Proc. International Conference on Advances in Computing, Communication and Control (ICAC3), 2009.
[18] C. Burrus, R. Gopinath, H. Guo, J. Odegard, and I. Selesnick, Introduction to Wavelets and Wavelet Transforms: A Primer, Prentice Hall, Upper Saddle River, NJ, 1997.
[19] O. Farooq and S. Datta, "Mel filter-like admissible wavelet packet structure for speech recognition," IEEE Signal Processing Letters, vol. 8, no. 7, pp. 196-198, 2001.
[20] A. Cohen, I. Daubechies, and J. C. Feauveau, "Biorthogonal bases of compactly supported wavelets," Communications on Pure and Applied Mathematics, vol. 45, no. 5, pp. 485-560, 1992.
[21] R. Behroozmand and F. Almasganj, "Optimal selection of wavelet-packet-based features using genetic algorithm in pathological assessment of patients' speech signal with unilateral vocal fold paralysis," Computers in Biology and Medicine, vol. 37, pp. 474-485, 2007.
[22] E. Avci, D. Hanbay, and A. Varol, "An expert discrete wavelet adaptive network based fuzzy inference system for digital modulation recognition," Expert Systems with Applications, vol. 33, pp. 582-589, 2006.
[23] A. Ben-Hur, D. Horn, H. T. Siegelmann, et al., "A support vector clustering method," in Proc. 15th International Conference on Pattern Recognition, vol. 1, pp. 724-727, 2000.
[25] K. De Brabanter, P. Karsmakers, F. Ojeda, C. Alzate, J. De Brabanter, K. Pelckmans, B. De Moor, and J. Vandewalle, LS-SVMlab Toolbox User's Guide, version 1.7, 2010. http://www.esat.kuleuven.be/sista/lssvmlab
[26] R. Sarikaya and J. N. Gowdy, "Subband based classification of speech under stress," Digital Speech and Audio Processing Laboratory, Clemson University, 1997.
[27] T. L. Nwe, S. W. Foo, and L. C. De Silva, "Detection of stress and emotion in speech using traditional and FFT based log energy features," in Proc. ICICS-PCM 2003, 2003.