
SPEAKER IDENTIFICATION USING A POLYNOMIAL-BASED CLASSIFIER

K. T. Assaleh
Conexant Systems, Inc.
Newport Beach, CA 92660, USA

W. M. Campbell
Motorola SSG
Scottsdale, AZ 85257, USA

ABSTRACT

A new set of techniques for using polynomial-based classifiers for speaker identification is examined. This set of techniques makes application of polynomial classifiers practical for speaker identification by enabling discriminative training for large data sets. The training technique is shown to be invariant to fixed liftering and affine transforms of the feature space. Efficient methods for new class addition, low-complexity retraining, and identification across large populations are given. The method is illustrated by application to the YOHO database.

1. INTRODUCTION

The objective of speaker identification is to determine which individual is present given a sample of that person's speech. The process of speaker identification can be broken into two categories, open set and closed set. For the closed-set problem, we must choose a speaker from a given list, knowing that the speech is from an individual on that list. For the open-set problem, we must determine whether an individual is on a given list or is unknown. Another aspect of speaker identification is the use of text-dependent or text-independent classification. Text-dependent classification uses the structure of the words spoken in both training and testing. Text-independent classification does not assume any knowledge about the particular words spoken. For the purposes of this paper, we deal only with the closed-set text-independent speaker identification problem.
Many approaches have been proposed for the problem of speaker identification, including Gaussian Mixture Models [1], Hidden Markov Models [2, 3], VQ [4], and artificial neural networks [4]. A new polynomial classifier is proposed in this paper. Polynomials have been used as classifiers for many years [5]. Traditionally, the methods used either pertain to the Fisher linear discriminant or the log of a Gaussian model. Typical methods of training are based upon statistical parameter estimation. We propose an approach based upon the average mean-squared error. Typically this leads to large intractable problems [6] for large data sets. We propose a new method which makes the problem separable. We also demonstrate techniques which reduce computation considerably, making higher order problems possible.

The organization of the paper is as follows. In Section 2, we discuss the classifier structure and techniques for scoring. In Section 3, a new training algorithm is given. Section 4 shows how the new method easily adapts to reinforcement and new class addition. Section 5 applies the method to the YOHO database.

(The work was supported by Motorola SSG under internal research and development funding. The work of the first author was done while he was at Motorola SSG.)

2. CLASSIFIER STRUCTURE

The scenario for identification is as follows. Speech data is obtained from a speaker known to be in a given list. Speaker models are available for all speakers in the list. The goal is to find the model that best matches the input speech.
The basic classifier structure for implementing this scenario using polynomials is shown in Figure 1. Feature vectors, x_1, ..., x_N, are processed by a polynomial discriminant function. Every speaker i has a model, w_i. The output of the discriminant function is averaged over time, resulting in a score, s_i,

    s_i = \frac{1}{N} \sum_{j=1}^{N} f(w_i, x_j).    (1)

The speaker with the best score is selected as the best match,

    \text{best match} = \operatorname{argmax}_i \; s_i.    (2)

[Figure 1: Polynomial classifier structure. Feature vectors feed a discriminant function parameterized by the speaker model; the outputs are averaged over time to produce the score.]

The polynomial discriminant function we use is of the form

    f(x, w) = w^t p(x)    (3)

where p(x) is a vector of polynomial basis functions up to order K. We use basis functions of the form x_{i_1} x_{i_2} \cdots x_{i_j}. Thus, an example second order p(x) would be

    p(x) = \begin{bmatrix} 1 & x_1 & x_2 & x_1^2 & x_1 x_2 & x_2^2 \end{bmatrix}^t.    (4)
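For concreteness, the following is a minimal sketch of one way to build p(x); the paper does not prescribe a term ordering, so the use of combinations with replacement here is our choice.

```python
import itertools
import numpy as np

def poly_expand(x, order):
    """Expand feature vector x into the monomial basis p(x) up to the given
    order, e.g. [1, x1, x2, x1^2, x1*x2, x2^2] for two features, order 2."""
    terms = [1.0]
    for k in range(1, order + 1):
        # All monomials x_{i1}*x_{i2}*...*x_{ik} with i1 <= i2 <= ... <= ik.
        for idx in itertools.combinations_with_replacement(range(len(x)), k):
            terms.append(np.prod([x[i] for i in idx]))
    return np.array(terms)

# A 12-dimensional feature vector expanded to order 3 gives C(12+3, 3) = 455
# terms, matching the M = 455 quoted later in Section 4.
x = np.random.randn(12)
print(poly_expand(x, 3).shape)   # (455,)
```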

The advantage of using an approximation of the form (3) is twofold. First, the problem results in a linear approximation space, and linear approximation has good properties: for the approximation problem we pose in Section 3, we obtain globally optimal parameter values. A second property of the approximation is that one has direct access to all of the basis terms. If one looks at the Taylor series approximation of the function generated by an artificial neuron, the coefficients of the individual terms in the polynomial expansion are not directly accessible and there is considerable interaction between them; linear approximation allows direct access to those terms.

The scoring function (1) with a polynomial discriminant function has several interesting properties. First, scoring complexity across speakers can be reduced by a simple operation. If we define

    \bar{p} = \frac{1}{N} \sum_{j=1}^{N} p(x_j),    (5)

then scores can be computed for each model as s_i = w_i^t \bar{p}. This simple rearrangement reduces scoring complexity dramatically for large populations. A second interesting property of (1) is that fixed liftering of the input features will not improve classification performance. This property can be seen by noting that liftering results in scaled coefficients in the discriminant function (3). This invariance is in contrast to several other methods, which do not have this property [7, 8]. Stated in another manner, polynomial classifiers have built-in optimal liftering. More generally, an affine transform Ax + b of the input does not change the structure of the classifier; i.e., p(Ax + b) has the same order as p(x). This property can be useful for optimizing performance in noisy environments [9].
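A minimal sketch of scoring with the shortcut (5), reusing the poly_expand helper sketched above and assuming (our convention, not the paper's) that the speaker models are stacked as rows of a matrix W:

```python
import numpy as np

def identify(frames, W, order):
    """Score one utterance against all speaker models.

    frames: iterable of feature vectors x_1..x_N for the utterance.
    W:      array of shape (Nspk, M), row i holding model w_i.
    Returns the index of the best-matching speaker, eq. (2)."""
    # Average of the polynomial expansions, eq. (5): one pass over frames.
    p_bar = np.mean([poly_expand(x, order) for x in frames], axis=0)
    scores = W @ p_bar   # s_i = w_i^t p_bar for every speaker at once
    return int(np.argmax(scores))
```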

3. TRAINING METHOD

The basic method of optimizing the performance of the classifier is to use discriminative training. For each speaker i, a model w_i is produced. We train so that the ideal output of the discriminant function using model w_i is 1 on the speaker's data and 0 on all other speakers' data. Of course, this separation cannot be achieved exactly because of class overlap. Therefore, we use the mean-squared error as an objective criterion.

We define M_i as the matrix whose rows are the polynomial expansions of speaker i's data; i.e.,

    M_i = \begin{bmatrix} p(x_{i,1})^t \\ p(x_{i,2})^t \\ \vdots \\ p(x_{i,N_i})^t \end{bmatrix}    (6)

where N_i is the number of feature vectors for speaker i. We define

    M = \begin{bmatrix} M_1 \\ M_2 \\ \vdots \\ M_{N_{spk}} \end{bmatrix}    (7)

where N_{spk} is the number of speakers. The training problem may be stated as

    w_i = \operatorname{argmin}_w \| M w - o_i \|^2    (8)

where o_i is the vector consisting of N_i ones in the rows where the i-th speaker's data is located and zeros otherwise (i.e., the ideal output).

Applying the method of normal equations [10] to (8) gives the following problem:

    M^t M w_i = M^t o_i.    (9)

We rearrange (9) to

    \left( \sum_{j=1}^{N_{spk}} M_j^t M_j \right) w_i = M_i^t 1    (10)

where 1 is the vector of all ones. If we define R_j = M_j^t M_j, then (10) becomes

    \left( \sum_{j=1}^{N_{spk}} R_j \right) w_i = M_i^t 1.    (11)

Equation (11) is the basis of our training method. We note that the problem is now separable. We can individually compute R_j for each speaker j and then combine the final result into a matrix R = \sum_{j=1}^{N_{spk}} R_j. One advantage of using the matrices R_j is that their size does not change as more data becomes available. In fact, R_j contains many redundant terms. The unique terms in R_j are exactly the sums of basis terms of order 2K or less, where K is the polynomial classifier order. We denote the vector of terms of order 2K or less for a vector x by p_2(x).

Further simplifications are possible. Note that the matrix term on the left-hand side of equation (11) does not depend on i. Therefore, we need only compute a decomposition of R once and then solve for each w_i as we change the right-hand side. The vector M_i^t 1 does not need to be computed directly; we can obtain it from the entries of r_i.
The basic process for training is then as follows. For each speaker i, compute

    r_i = \sum_{j=1}^{N_i} p_2(x_{i,j}).    (12)

Now form r = \sum_{i=1}^{N_{spk}} r_i. From r, construct the matrix R and then find the Cholesky decomposition R = L^t L. For each speaker i, map r_i to M_i^t 1 and then solve L^t L w_i = M_i^t 1 using back substitution.
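To make the procedure concrete, here is a minimal sketch in Python/NumPy under stated assumptions: speaker data arrives as lists of feature vectors, poly_expand is the basis-expansion helper sketched in Section 2, and the full R_j = M_j^t M_j is accumulated rather than the compressed vector p_2(.) of unique terms. The small diagonal ridge is a numerical-stability addition of ours, not part of the paper's method.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def train_models(speaker_frames, order):
    """speaker_frames: list over speakers; element i is the list of
    feature vectors x_{i,1}..x_{i,Ni}. Returns model matrix W (Nspk x M)."""
    Ms = [np.array([poly_expand(x, order) for x in frames])
          for frames in speaker_frames]
    Rs = [M.T @ M for M in Ms]          # per-speaker R_j = M_j^t M_j
    bs = [M.sum(axis=0) for M in Ms]    # M_i^t 1 (column sums of M_i)

    # R = sum_j R_j is shared by all speakers, so factor it once (eq. 11).
    R = sum(Rs)
    factor = cho_factor(R + 1e-8 * np.eye(R.shape[0]))  # ridge: not in the paper
    W = np.array([cho_solve(factor, b) for b in bs])    # one cheap solve per speaker
    return W
```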
We augment training with a method for compensating for prior probabilities. Since the number of feature vectors for each class is typically never equal, the polynomial classifier may overtrain on one class. One way to see this is to note that the mean-squared error criterion in (8) causes the score range to shrink as the amount of anti-class data increases for a fixed amount of in-class data. This creates an imbalance between the scores for two different classes; i.e., a class with low prior probability will tend to score lower on average than a class with high prior probability. This score compression causes problems since we are selecting the maximum-scoring class model.

Several authors have tackled the problem of prior compensation. A summary of methods is given in [11]. We have tried two approaches to the problem.

One method from [11] weights the output of the discriminant function by the training-data prior probability; i.e., the new score is given by s_i' = s_i / \pi_i, where \pi_i = N_i / N and N = \sum_{j=1}^{N_{spk}} N_j. This scoring can be implemented by multiplying the result w_i of (11) by 1/\pi_i. The advantage of this method is that it improves performance (see Section 5) and involves no changes to the fundamental algorithm.

Another method of compensating for prior probabilities is to weight the objective function. We have not seen this approach used previously in speaker identification. For each speaker i, we consider the problem as a two-class situation (i.e., in-class and out-of-class). The contributions of in-class and out-of-class errors can then be balanced by weighting their contributions to the total mean-squared error; this results in the following optimization criterion for the i-th speaker:

    w_i = \operatorname{argmin}_w \left[ \frac{1}{N_i} \sum_{j=1}^{N_i} \left( w^t p(x_{i,j}) - 1 \right)^2 + \frac{1}{N_i'} \sum_{k \neq i} \sum_{j=1}^{N_k} \left( w^t p(x_{k,j}) \right)^2 \right]    (13)

where N_i' = N - N_i. We can incorporate this into the training algorithm by altering equation (11) to

    \left[ R_i + \frac{N_i}{N_i'} \sum_{j \neq i} R_j \right] w_i = M_i^t 1.    (14)

The resulting method of compensating for prior probabilities has the advantage of high accuracy. A disadvantage is that a Cholesky decomposition must be performed for each speaker. This increases computation, but it should be noted that most of the computation is in computing the r_i.
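Both compensation schemes fit naturally on top of the stored per-speaker quantities from the training sketch above; a hypothetical illustration (function and variable names are ours, not the paper's):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def train_with_prior_weighting(Rs, bs, Ns):
    """Solve eq. (14) per speaker: [R_i + (N_i/N_i') sum_{j!=i} R_j] w_i = M_i^t 1.

    Rs: list of per-speaker R_i matrices; bs: list of M_i^t 1 vectors;
    Ns: list of per-speaker frame counts N_i."""
    R_total, N = sum(Rs), sum(Ns)
    W = []
    for R_i, b_i, N_i in zip(Rs, bs, Ns):
        scale = N_i / (N - N_i)                    # N_i / N_i'
        lhs = R_i + scale * (R_total - R_i)        # sum_{j!=i} R_j = R_total - R_i
        W.append(cho_solve(cho_factor(lhs), b_i))  # one Cholesky per speaker
    return np.array(W)

def prior_divided_models(W, Ns):
    """Alternative from [11]: divide each score (equivalently, each model)
    by the training prior pi_i = N_i / N."""
    N = sum(Ns)
    return W * (N / np.asarray(Ns, dtype=float))[:, None]
```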

4. REINFORCEMENT AND CLASS ADDITION

New class addition is easily performed with the new method. Suppose we have N_{spk} classes and have stored the vectors r_i. As new class data is acquired, we form a new vector r_{N_{spk}+1} using (12). We then construct a new R and retrain all models. This reduces computation considerably since we do not need to recompute R from scratch. The low computational complexity can be seen by noting that forming R from the training data using matrix multiplication is O(N M^2), where N is the total number of feature vectors and M is the length of the vector p(x), whereas the cost of solving the matrix equation is only O(M^3). For typical problems, N \gg M, so that retraining is fast. For instance, for the YOHO database [12], if we assume the training consists of 96 phrases of 2.5 seconds each for 138 speakers, then approximately 3 million feature vectors are generated. For our typical system, M = 455, so that N M^2 = 62 \times 10^9 while M^3 = 94 \times 10^6.
Reinforcement can also be performed very quickly. We store the vectors r_i as in the class-addition case. Updating class i is performed by updating the vector r_i,

    r_i^{\text{new}} = r_i + p_2(x_i^{\text{new}}).    (15)

Retraining can be performed after (15) has been applied for all new feature vectors.
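A sketch of the corresponding bookkeeping, assuming the per-speaker R_i and M_i^t 1 from the earlier training sketch stand in for the paper's compressed r_i vectors:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def add_class(Rs, bs, new_frames, order):
    """Add one speaker: form its R and M^t 1 from its data, per eq. (12)."""
    M_new = np.array([poly_expand(x, order) for x in new_frames])
    Rs.append(M_new.T @ M_new)
    bs.append(M_new.sum(axis=0))
    return retrain(Rs, bs)

def reinforce(Rs, bs, i, new_frames, order):
    """Update speaker i with new data, the analogue of eq. (15)."""
    M_new = np.array([poly_expand(x, order) for x in new_frames])
    Rs[i] = Rs[i] + M_new.T @ M_new
    bs[i] = bs[i] + M_new.sum(axis=0)
    return retrain(Rs, bs)

def retrain(Rs, bs):
    # Only the O(M^3) factorization and solves are repeated;
    # the original training data is never revisited.
    factor = cho_factor(sum(Rs))
    return np.array([cho_solve(factor, b) for b in bs])
```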
5. RESULTS
We applied our method to the YOHO database [12, 13]. The
YOHO database is a multisession database for speaker verification. All speakers are enrolled and verified using combination lock phrases; e.g., 26-81-57. Enrollment consists
of four sessions where the user is prompted for 24 combination lock phrases. Verification is performed in ten separate
sessions each consisting of 4 phrases.
Spectral analysis was performed on a 30 ms frame of speech every 10 ms. Each frame was pre-emphasized with the filter 1 - 0.97 z^{-1}, and a Hamming window was applied. LP analysis was performed using 12 coefficients, and the result was transformed to 12 LPCCs (LP cepstral coefficients).
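As an illustration of the last step of this front end, the LP-to-cepstrum conversion can be computed with the standard recursion; a minimal sketch assuming the synthesis-filter convention H(z) = 1 / (1 - \sum_k a_k z^{-k}) (the paper does not state its sign convention):

```python
import numpy as np

def lpc_to_cepstrum(a, n_ceps):
    """Convert LP coefficients a_1..a_p (synthesis filter
    1 / (1 - sum_k a_k z^{-k})) to cepstral coefficients c_1..c_n via the
    standard recursion c_n = a_n + sum_{k=1}^{n-1} (k/n) c_k a_{n-k}."""
    p = len(a)
    c = np.zeros(n_ceps)
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k - 1] * a[n - k - 1]
        c[n - 1] = acc
    return c
```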
We trained on 138 YOHO speakers using all four enrollment sessions. Speaker identification was performed using 1-phrase and 4-phrase tests. All 40 phrases were used in the 1-phrase test. For the 4-phrase test, the scores were combined for all phrases within a session. All 138 speakers were scored using the classifier in Figure 1 for each identification, and the highest score was selected.

Table 1 shows the identification error rates of our system. Error rate is defined as the number of misclassified
utterances over the total number of utterances tested. These
results compare well to other systems. Results available in
the literature for identification using a 10 second test are
0:14% [2] and 0:22% [14]. Our best result is 0:36% for
text independent speaker identification. The comparison of
these systems to ours is not entirely fair since [2, 14] are text
dependent; we would expect a reduction in error rate for our
system if we used information about the prompted text.
Table 1: Classifier performance on the YOHO database.

Order   Error Rate % (1 phrase)   Error Rate % (4 phrase)
3       4.71                      0.73
4       2.74                      0.36
We also compared different strategies for prior compensation on a one-phrase (2.5 second) test; see Table 2. Table 2 shows that the weighting methods increase accuracy noticeably over the base system. The error-criterion weighting also improves on the prior-division method.
Table 2: Classifier performance for various prior compensation strategies.

Method                      Error Rate % (1 phrase)
None                        17.34
Score Divided by Prior      11.07
Error Criterion Weighting   4.71


6. SUMMARY AND CONCLUSIONS


A new method for applying polynomial classifiers to text-independent speaker identification was presented. A training method was illustrated which allowed the technique to be applied to large data sets. The training method was easily adapted for new class addition and reinforcement. The identification was performed with a low-complexity classifier structure which averaged the output of a polynomial discriminant function. The classifier scaled well to large databases and exhibited invariance to fixed liftering. The classifier was shown to have excellent performance on the YOHO database.
7. REFERENCES

[1] D. A. Reynolds, "Automatic speaker recognition using Gaussian mixture speaker models," The Lincoln Laboratory Journal, vol. 8, no. 2, pp. 173-192, 1995.

[2] C. W. Che and Q. Lin, "Speaker recognition using HMM with experiments on the YOHO database," in Proc. EUROSPEECH, pp. 625-628, 1995.

[3] A. E. Rosenberg, J. DeLong, C.-H. Lee, B.-H. Juang, and F. K. Soong, "The use of cohort normalized scores for speaker verification," in Proceedings of the International Conference on Spoken Language Processing, pp. 599-602, 1992.

[4] K. R. Farrell, R. J. Mammone, and K. T. Assaleh, "Speaker recognition using neural networks and conventional classifiers," IEEE Trans. on Speech and Audio Processing, vol. 2, pp. 194-205, Jan. 1994.

[5] K. Fukunaga, Introduction to Statistical Pattern Recognition. Academic Press, 1990.

[6] J. Schurmann, Pattern Classification. John Wiley and Sons, Inc., 1996.

[7] B.-H. Juang, L. R. Rabiner, and J. G. Wilpon, "On the use of bandpass liftering in speech recognition," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-35, pp. 947-954, July 1987.

[8] S. Furui, "Cepstral analysis technique for automatic speaker verification," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-29, pp. 254-272, Apr. 1981.

[9] R. J. Mammone, X. Zhang, and R. P. Ramachandran, "Robust speaker recognition," IEEE Signal Processing Magazine, vol. 13, pp. 58-71, Sept. 1996.

[10] G. H. Golub and C. F. Van Loan, Matrix Computations. Johns Hopkins University Press, 1989.

[11] N. Morgan and H. A. Bourlard, Connectionist Speech Recognition: A Hybrid Approach. Kluwer Academic Publishers, 1994.

[12] J. P. Campbell, Jr., "Testing with the YOHO CD-ROM voice verification corpus," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pp. 341-344, 1995.

[13] J. P. Campbell, Jr., "Speaker recognition: A tutorial," Proceedings of the IEEE, vol. 85, pp. 1437-1462, Sept. 1997.

[14] J. Colombi, D. Ruck, S. Rogers, M. Oxley, and T. Anderson, "Cohort selection and word grammar effects for speaker recognition," in International Conference on Acoustics, Speech, and Signal Processing, pp. 85-88, 1996.
