Proceedings of 2nd International Conference on RF & Signal Processing Systems - RSPS 2010
Abstract—In this paper we investigate a technique to extract vocal source based features from the LP residual of the speech signal for automatic speaker identification. Autocorrelation with a specific lag is computed for the residual signal to derive these features. Compared to traditional features like MFCC and PLPCC, which represent vocal tract information, these features represent complementary vocal cord information. Our experiments in fusing these two sources of information to represent speaker characteristics yield better speaker identification accuracy. We use Gaussian mixture model (GMM) based speaker modeling, and results are shown on two public databases to validate our proposition.

Index Terms—Speaker Identification, Feature Extraction, Autocorrelation, Residual Signal, Complementary Feature.

I. INTRODUCTION

Speaker identification [1]–[3] is the task of determining the identity of an unknown subject by its voice. It requires a robust feature extraction technique followed by a modeling approach. Through the feature extraction process, the crude speech signal undergoes several dimensionality reduction operations, and the resulting outputs are more compact and robust than the original speech. Although the speech signal is non-stationary, it shows short-term stationarity in intervals of 20-40 ms [4]; the vocal tract characteristics are almost static during this period. Most feature extraction techniques are based on short-term spectral analysis of the speech signal. The central idea is to capture information related to the formant frequencies and their characteristics, such as magnitude, bandwidth, and slope, through various spectrum estimation techniques. The shortcoming of these techniques is that they neglect the vocal cord (or vocal fold) characteristics, which also carry significant speaker specific information. The residual signal obtained through linear prediction (LP) analysis of speech contains information related to the vocal cord. Several approaches have been investigated to find and model speaker specific characteristics from this residual signal: auto-associative neural networks (AANN) [5], wavelet octave coefficients of residue (WOCOR) [6], residual phase [7], and higher order cumulants [8] were employed earlier to reduce the equal error rate (EER) of speaker recognition systems. In this work we consider the autocorrelation method for finding speaker specific traits from the residual. The contribution of this residual feature is fused with the spectral feature based system. We have conducted experiments on two popular publicly available speaker identification databases (POLYCOST and YOHO) using a GMM based classifier. It is observed that the performance of speaker identification systems based on various spectral features is improved in combined mode for different model orders of GMM. The rest of the paper is organized as follows. In Section II we briefly review basic LP analysis, followed by the proposed feature extraction technique. The speaker identification experiments and results are presented in Section III. Finally, the paper is concluded in Section IV.

II. FEATURE EXTRACTION FROM RESIDUAL SIGNAL

A. Linear Prediction Analysis and Residual Signal

In the LP model, the (n−1)-th to (n−p)-th samples of the speech wave (n, p are integers) are used to predict the n-th sample. The predicted value of the n-th speech sample [9] is given by

    ŝ(n) = Σ_{k=1}^{p} a(k) s(n−k)    (1)

where {a(k)}_{k=1}^{p} are the predictor coefficients and s(n) is the n-th speech sample. The value of p is chosen such that it can effectively capture the real and complex poles of the vocal tract in a frequency range equal to half the sampling frequency. The prediction coefficients (PC) are determined by minimizing the mean square prediction error [1], defined as

    E = (1/N) Σ_{n=0}^{N−1} (s(n) − ŝ(n))²    (2)

where the summation is taken over all N samples. The set of coefficients {a(k)}_{k=1}^{p} which minimizes the mean-squared prediction error is obtained as the solution of the set of linear equations

    Σ_{k=1}^{p} φ(j,k) a(k) = φ(j,0),  j = 1, 2, …, p    (3)

where

    φ(j,k) = (1/N) Σ_{n=0}^{N−1} s(n−j) s(n−k).    (4)
The PC {a(k)}_{k=1}^{p} are derived by solving the recursive equation (3). Using {a(k)}_{k=1}^{p} as model parameters, equation (5) represents the fundamental basis of the LP representation: any signal can be decomposed into a linear prediction and its prediction error,

    s(n) = Σ_{k=1}^{p} a(k) s(n−k) + e(n)    (5)

The LP transfer function can be defined as

    H(z) = G / (1 − Σ_{k=1}^{p} a(k) z^{−k}) = G / A(z)    (6)

where G is the gain scaling factor for the present input and A(z) is the p-th order inverse filter. The LP coefficients themselves can be used for speaker recognition, as they contain speaker specific information such as the vocal tract resonance frequencies and their bandwidths. However, some nonlinear transformations are usually applied to the PC to improve robustness; linear prediction cepstral coefficients (LPCC), line spectral frequencies (LSF), and log-area ratios (LAR) are such representations of the LPC.

The prediction error e(n) is called the residual signal, and it contains the complementary information that is not captured by the PC. It is worth mentioning that the residual signal conveys vocal source cues such as the fundamental frequency and pitch period.

B. Autocorrelation of Residual Signal

The autocorrelation function measures the similarity of a signal with itself and has a close relationship with the power spectral density (PSD). In speech processing, the autocorrelation function is most popular for estimating the pitch of the signal and in LP based speech analysis. In pitch detection algorithms, the fundamental period is computed as the distance between two consecutive peaks of the autocorrelation function. In LP analysis, on the other hand, the second order relationship among the speech samples is captured through the autocorrelation function, as described in Sec. II-A.

The autocorrelation of a discrete signal x(n) of length N is given by

    r(n) = Σ_{k=−N}^{N} x(k) x(k−n)    (7)

In eqn. (7) we consider the full correlation over shifts from −N to N. The correlation can also be computed with some lag in the original signal to check only the short-time similarity. If the maximum and minimum lags are bounded in magnitude by L, then the correlation can be calculated as in eqn. (8),

    r_l(n) = Σ_{k=−L}^{L} x(k) x(k−n)    (8)

Higher lag autocorrelation of the speech signal was employed earlier for robust speech recognition [10]. In our proposed method of feature extraction, we compute the autocorrelation function of the residual signal for a specific lag. Firstly, the residual signal is normalized between −1 and +1 and its first order central moment, i.e. the mean, is made zero. This also helps to reduce unwanted variation in the autocorrelation. Secondly, we compute the autocorrelation and take the upper half only (due to the symmetry of the correlation function). The value r_l(0) is also included, because it is related to the stress (energy) of a particular speech frame's residual. If we consider lags in [−L, L], the total number of coefficients becomes L+1. The residual feature extracted through this technique is referred to as ACRLAG throughout this paper.

C. Integration of Complementary Information

In this section we propose to integrate vocal tract and vocal cord parameters for identifying speakers. Although the two approaches differ significantly in performance, the ways they represent the speech signal are complementary to one another. Hence, it is expected that combining the advantages of both features will improve [11] the overall performance of the speaker identification system. The block diagram of the combined system is shown in Fig. 1. Spectral features and residual features are extracted from the training data in two separate streams. Speaker modeling is then performed for the respective features independently, and the model parameters are stored in the model database. At testing time, the same process is adopted for feature extraction. The log-likelihoods of the two different features are computed w.r.t. their corresponding models. Finally, the output scores are weighted and combined. To take advantage of both systems and their complementarity, we use score level linear fusion, formulated as

    LLR_combined = η · LLR_spectral + (1 − η) · LLR_residual    (9)

where LLR_spectral and LLR_residual are the log-likelihood ratios calculated from the spectral and residual based systems, respectively. The fusion weight is decided by the parameter η.

III. SPEAKER IDENTIFICATION EXPERIMENT

A. Experimental Setup

1) Pre-processing stage: In this work, the pre-processing stage is kept identical across the different feature extraction methods. It is performed using the following steps:
• Silence removal and end-point detection are done using an energy threshold criterion.
• The speech signal is then pre-emphasized with a 0.97 pre-emphasis factor.
• The pre-emphasized speech signal is segmented into frames of 20 ms each with 50% overlap, i.e. the total number of samples in each frame is N = 160 (sampling frequency Fs = 8 kHz).
• In the last step of pre-processing, each frame is windowed using the Hamming window given by

    w(n) = 0.54 − 0.46 cos(2πn/(N−1))    (10)

where N is the length of the window.
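The pre-processing chain and the ACRLAG extraction can be sketched as follows. This is a minimal NumPy sketch under our own reading of the text: silence removal is omitted, `preprocess` performs pre-emphasis, 20 ms framing with 50% overlap, and Hamming windowing per equation (10), and `acrlag` normalizes a residual frame to [−1, +1], removes its mean (our interpretation of the "first order central moment" remark), and keeps autocorrelation lags 0…L, i.e. L+1 coefficients. Function names and defaults are illustrative.

```python
import numpy as np

def preprocess(speech, fs=8000, alpha=0.97):
    """Pre-emphasis, 20 ms framing with 50% overlap, Hamming windowing
    (eq. (10)); silence removal is omitted in this sketch."""
    emphasized = np.append(speech[0], speech[1:] - alpha * speech[:-1])
    N = int(0.020 * fs)            # 160 samples per frame at 8 kHz
    hop = N // 2                   # 50% overlap
    window = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(N) / (N - 1))
    frames = [emphasized[i:i + N] * window
              for i in range(0, len(emphasized) - N + 1, hop)]
    return np.array(frames)

def acrlag(residual, L=12):
    """ACRLAG feature of one residual frame: normalize to [-1, 1],
    zero the mean, then keep autocorrelation lags 0..L (upper half),
    i.e. L+1 values including r_l(0)."""
    r = residual / np.max(np.abs(residual))   # normalize between -1 and +1
    r = r - np.mean(r)                        # zero first order central moment
    return np.array([np.dot(r[:len(r) - k], r[k:]) for k in range(L + 1)])
```

With L = 12 this yields the 13-dimensional ACRLAG vector used in the experiments of Section III.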
Fig. 1. Block diagram of Fusion Technique: Score level fusion of Vocal tract (short term spectral based feature) and Vocal cord information (Residual).
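The score-level fusion of equation (9) amounts to one line of arithmetic; a sketch (with hypothetical function names) might read:

```python
import numpy as np

def fuse_scores(llr_spectral, llr_residual, eta=0.5):
    """Eq. (9): LLR_combined = eta * LLR_spectral + (1 - eta) * LLR_residual,
    applied element-wise to per-speaker score vectors."""
    return eta * np.asarray(llr_spectral) + (1 - eta) * np.asarray(llr_residual)

def identify(llr_spectral, llr_residual, eta=0.5):
    """Return the index of the speaker with the highest fused score."""
    return int(np.argmax(fuse_scores(llr_spectral, llr_residual, eta)))
```

Setting η = 1 or η = 0 recovers the spectral-only and residual-only baselines, respectively.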
2) Classification & Identification stage: The GMM technique is used to obtain a probabilistic model for the feature vectors of a speaker. The idea of the GMM is to use a weighted sum of multivariate Gaussian densities to represent the probability density of the feature vectors:

    p(x) = Σ_{i=1}^{M} p_i b_i(x)    (11)

where x is a d-dimensional feature vector, b_i(x), i = 1, ..., M, are the component densities, and p_i, i = 1, ..., M, are the mixture weights, or priors, of the individual Gaussians. Each component density is given by

    b_i(x) = (1 / ((2π)^{d/2} |Σ_i|^{1/2})) exp(−(1/2) (x − µ_i)^t Σ_i^{−1} (x − µ_i))    (12)

where µ_i and Σ_i are the mean vector and covariance matrix of the i-th component. In our experiments, the GMMs are trained with 10 iterations, with clusters initialized by the vector quantization [13] algorithm.

In the identification stage, the log-likelihood score of the feature vectors of the utterance under test is calculated as

    log p(X|λ) = Σ_{t=1}^{T} log p(x_t|λ)    (14)

where X = {x_1, x_2, ..., x_T} is the sequence of feature vectors of the test utterance and λ denotes a speaker model. In a closed set SI task, an unknown utterance is identified as an utterance of the particular speaker whose model gives the maximum log-likelihood:

    Ŝ = arg max_{1≤k≤S} Σ_{t=1}^{T} log p(x_t|λ_k)    (15)
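A minimal sketch of the scoring rules (11)-(15), assuming diagonal covariance matrices (a common GMM choice, though the paper does not state it); the function names and the tuple layout of a model are ours:

```python
import numpy as np

def gmm_log_likelihood(X, weights, means, variances):
    """log p(X|lambda) of eq. (14) for a diagonal-covariance GMM:
    sum over frames of log sum_i p_i b_i(x_t), with b_i as in eq. (12)."""
    X = np.atleast_2d(X)                      # T x d frame matrix
    d = X.shape[1]
    ll = 0.0
    for x in X:
        # log b_i(x) for every component (M x d arrays of means/variances)
        diff2 = (x - means) ** 2 / variances
        log_b = -0.5 * (d * np.log(2 * np.pi)
                        + np.sum(np.log(variances), axis=1)
                        + np.sum(diff2, axis=1))
        # log sum_i p_i b_i(x), via log-sum-exp for numerical stability
        a = np.log(weights) + log_b
        ll += np.max(a) + np.log(np.sum(np.exp(a - np.max(a))))
    return ll

def identify_speaker(X, models):
    """Eq. (15): index of the speaker model with maximum log-likelihood."""
    scores = [gmm_log_likelihood(X, *m) for m in models]
    return int(np.argmax(scores))
```

Each model here is a tuple (weights, means, variances) with per-component rows; training (VQ initialization plus EM iterations) is outside the scope of this sketch.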
3) Databases: YOHO Database: […] a telephone channel. There are 138 speakers (106 males and 32 females); for each speaker there are 4 enrollment sessions of 24 utterances each and 10 test sessions of 4 utterances each. In this work, a closed set text-independent speaker identification problem is attempted in which we consider all 138 speakers as client speakers. For each speaker, all 96 (4 sessions × 24 utterances) utterances are used for developing the speaker model, while 40 (10 sessions × 4 utterances) utterances are put under test. Therefore, for 138 speakers we put 138 × 40 = 5520 utterances under test and evaluated the identification accuracies.

POLYCOST Database: The POLYCOST database [15] was recorded as a common initiative within the COST 250 action during January-March 1996. It contains around 10 sessions recorded by 134 subjects from 14 countries. Each session consists of 14 items, two of which (the MOT01 and MOT02 files) contain speech in the subject's mother tongue. The database was collected through the European telephone network. The recordings were made with ISDN cards on two XTL SUN platforms at an 8 kHz sampling rate. In this work, a closed set text-independent speaker identification problem is addressed in which only the mother tongue (MOT) files are used. The specified guideline [15] for conducting closed set speaker identification experiments is adhered to, i.e. 'MOT02' files from the first four sessions are used to build a speaker model, while 'MOT01' files from session five onwards are taken for testing. As with the YOHO database, all speakers (131, after deletion of three speakers) in the database were registered as clients.

4) Score Calculation: For the closed-set speaker identification problem, the identification accuracy as defined in [16] is used:

    Percentage of identification accuracy (PIA) = (No. of utterances correctly identified / Total no. of utterances under test) × 100    (16)

B. Speaker Identification Experiments and Results

Experiments were performed using GMM based classifiers of different model orders, in powers of two, i.e. 2, 4, 8, 16, etc. The number of Gaussians is limited by the amount of available training data (average training speech length per speaker after silence removal: (i) POLYCOST 40 s and (ii) YOHO 150 s). The number of mixtures is incremented up to 16 for the POLYCOST and 64 for the YOHO database. We evaluated the performance of the ACRLAG feature as a front end for the speaker identification task on both databases. Experiments were conducted with different orders of linear prediction and different lags to find the optimal performance. An LP order of 12-20 is sufficient for capturing speaker specific information from the residual signal. An exhaustive search was also carried out for finding the optimal value of the lag. The lag was chosen so that it can effectively capture the second order properties (autocorrelation) of one pitch period (or pitch like information) in the residual signal. Experimentally we observed that autocorrelation computation with a lag of 10-16 is sufficient to represent the residual information. In Table I the SI performance using ACRLAG is shown for an LP order of 13 and 12 lags; hence the number of residual features becomes 12+1=13. The performance attained through the residual feature alone (vocal cord information) is not very high, but it contains useful complementary information which cannot be explained by the standard spectral features. Hence, the LLRs of the two systems are combined as stated in Sec. II-C. In our experiments the dimension of the various spectral features is set to 19 for a fair comparison, and this order is frequently used to model spectral information. The LP based features (LPCC, LAR, LSF, PLPCC) are extracted using 19th order all-pole modeling; the filterbank based features (LFCC and MFCC), on the other hand, are extracted using 20 triangular bandpass filters which are equally spaced either on the Hertz scale (for LFCC) or on the Mel scale (for MFCC). In Tables II and III the PIA for the baseline systems as well as the fused systems (with η = 0.5) is shown. The performance of the fused system is better across different spectral features and different model orders of GMM for both databases. The improvement in performance is higher at lower model orders than at higher model orders; this is due to the base effect which is usually experienced in performance analysis of newly proposed features for a classifier system. For example, in the case of the PLPCC feature based SI system, performance is improved by 12% for the POLYCOST and 15.13% for the YOHO database using model order 2, but at higher model orders the improvements are 4.74% and 1.47%, respectively. The improvement on the POLYCOST database (telephonic) is also significantly better than that on YOHO (microphonic) for the various features.

In this work all voiced and unvoiced speech frames are utilized for extracting the residual features. Though only voiced frames contain pitch and significant residual information, unvoiced frames still contain information which is not considered in autoregressive (AR) based approaches (LP). Filterbank based approaches also remove pitch or pitch like information when warping with triangular filters. We also performed an experiment using only voiced frames and observed that the performance is not improved but rather degraded, due to the smaller amount of training data and the curse of dimensionality effect in GMM modeling.

It is worth noting that the performance of the combined system, which essentially considers 19+13=32 dimensions, is better than that of any system based on a single 32 dimensional spectral feature.

TABLE I
Speaker Identification Results on the POLYCOST and YOHO databases using the ACRLAG feature for different model orders of GMM (ACRLAG configuration: LP order = 13, number of lags = 12).

Database   No. of Mixtures   Identification Accuracy (%)
POLYCOST         2           45.7560
POLYCOST         4           49.3369
POLYCOST         8           55.7029
POLYCOST        16           59.6817
YOHO             2           41.2319
YOHO             4           49.3116
YOHO             8           56.9022
YOHO            16           63.2246
YOHO            32           68.9130
YOHO            64           73.3514
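The accuracy metric of equation (16) can be sketched as a one-line computation; the function name is ours:

```python
def percentage_identification_accuracy(predicted, actual):
    """Eq. (16): correctly identified utterances / total utterances x 100."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)
```

For example, with the 5520 YOHO test utterances, the PIA is simply the fraction of utterances whose arg-max speaker index (eq. (15)) matches the true speaker, scaled to a percentage.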
TABLE II
Speaker Identification Results on the POLYCOST database showing the performance of the baseline (single stream) system and the fused system.

TABLE III
Speaker Identification Results on the YOHO database showing the performance of the baseline (single stream) system and the fused system.