K. T. Assaleh
W. M. Campbell
Motorola SSG
Scottsdale, AZ 85257, USA
ABSTRACT
A new set of techniques for using polynomial-based classifiers for speaker identification is examined. This set of techniques makes application of polynomial classifiers practical
for speaker identification by enabling discriminative training for large data sets. The training technique is shown to be
invariant to fixed liftering and affine transforms of the feature space. Efficient methods for new class addition, low-complexity retraining, and identification across large populations are given. The method is illustrated by application
to the YOHO database.
1. INTRODUCTION
2. CLASSIFIER STRUCTURE

[Figure: classifier structure. Feature vectors are passed through the discriminant function using the speaker model; the resulting scores are averaged to produce the speaker score.]

Given a sequence of feature vectors x_1, ..., x_N and a model w_i for speaker i, the classifier score is the average of the discriminant function over the sequence,

    s_i = \frac{1}{N} \sum_{j=1}^{N} f(w_i; x_j).    (1)

The discriminant function is polynomial,

    f(x; w) = w^t p(x),    (2)

where p(x) is the vector of polynomial basis terms of the input feature vector x. The expansions of a set of training vectors are stacked into the matrix

    M = \begin{bmatrix} p(x_1)^t \\ p(x_2)^t \\ \vdots \\ p(x_N)^t \end{bmatrix}.    (3)

For example, for a two-dimensional feature vector and a second-degree discriminant, the basis expansion is

    p(x) = \begin{bmatrix} 1 & x_1 & x_2 & x_1^2 & x_1 x_2 & x_2^2 \end{bmatrix}^t.    (4)

Since the discriminant is linear in w, the score (1) can be computed as s_i = w_i^t \bar{p}, where \bar{p} is the average expansion

    \bar{p} = \frac{1}{N} \sum_{j=1}^{N} p(x_j);    (5)

this reduces identification across a large population to one inner product per speaker.

Let x_{i,1}, ..., x_{i,N_i} denote the N_i training vectors for speaker i, and define

    M_i = \begin{bmatrix} p(x_{i,1})^t \\ \vdots \\ p(x_{i,N_i})^t \end{bmatrix}.    (6)

The combined matrix for all N_{spk} speakers is

    M = \begin{bmatrix} M_1 \\ M_2 \\ \vdots \\ M_{N_{spk}} \end{bmatrix}.    (7)

Training a model is posed as the least-squares problem

    w^* = \arg\min_w \| M w - o \|_2,    (8)

where o is the vector of ideal outputs. The solution satisfies the normal equations

    M^t M w = M^t o,    (9)

which is a linear system of the form A x = b.
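As a concrete illustration of (2), (4), and (5), the sketch below (with made-up model weights and feature vectors; nothing here is from the paper) expands a two-dimensional feature vector into the second-degree basis and scores a sequence two ways: by averaging per-frame scores as in (1), and by scoring the average expansion as in (5):

```python
import numpy as np

def p(x):
    """Second-degree polynomial basis of a 2-D feature vector,
    as in eq. (4): [1, x1, x2, x1^2, x1*x2, x2^2]."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1**2, x1 * x2, x2**2])

# Hypothetical speaker model w (one weight per basis term).
w = np.array([0.1, 0.2, -0.3, 0.05, 0.0, 0.4])

# A short sequence of feature vectors x_1..x_N (made-up values).
X = np.array([[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]])

# Score via eq. (1): average of the per-frame scores w^t p(x_j) ...
s_direct = np.mean([w @ p(x) for x in X])

# ... equals w^t p_bar, with p_bar the average expansion of eq. (5).
p_bar = np.mean([p(x) for x in X], axis=0)
s_avg = w @ p_bar
```

Because the expansions are computed once per utterance, each additional speaker in an identification run costs only a single inner product w_i^t p_bar.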
3. TRAINING METHOD

The basic method of optimizing the performance of the classifier is to use discriminative training. For each speaker, a model w_i is produced. We train so that the ideal output of the discriminant function using model w_i is 1 on the speaker's data and 0 on all other speakers' data; that is, o in (8) is a vector o_i with ones in the rows corresponding to speaker i's training vectors and zeros elsewhere. Since M^t o_i = M_i^t 1, where 1 is the vector of all ones, we rearrange (9) to

    M^t M w_i = M_i^t 1.    (10)

Substituting (7) and defining R_j = M_j^t M_j, (10) becomes

    \left( \sum_{j=1}^{N_{spk}} R_j \right) w_i = M_i^t 1.    (11)

Denote R = \sum_{j=1}^{N_{spk}} R_j; only the right-hand side of (11) varies with the speaker. The vector M_i^t 1 does not need to be directly computed; we can obtain it from the entries of r_i.
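The rearrangement in (10)-(11) can be checked numerically. The toy sketch below (synthetic data and a degree-1 basis, chosen only for brevity) trains one speaker model both by direct least squares as in (8) and via the summed normal-equation form of (11); the two agree:

```python
import numpy as np

rng = np.random.default_rng(0)

def p(x):
    # Degree-1 basis [1, x1, x2], kept small for illustration.
    return np.array([1.0, x[0], x[1]])

# Synthetic frames for two speakers.
X1 = rng.normal(size=(5, 2))
X2 = rng.normal(size=(7, 2)) + 1.0

M1 = np.array([p(x) for x in X1])
M2 = np.array([p(x) for x in X2])
M = np.vstack([M1, M2])                       # eq. (7)

# Ideal outputs for speaker 1: ones on own rows, zeros elsewhere.
o1 = np.concatenate([np.ones(len(X1)), np.zeros(len(X2))])
w_ls = np.linalg.lstsq(M, o1, rcond=None)[0]  # eq. (8)

# Rearranged normal equations: (R1 + R2) w = M1^t 1, eqs. (10)-(11).
R1, R2 = M1.T @ M1, M2.T @ M2
w_ne = np.linalg.solve(R1 + R2, M1.T @ np.ones(len(X1)))
```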
The basic process for training is then as follows. For each speaker i, compute

    r_i = \sum_{j=1}^{N_i} p_2(x_{i,j}),    (12)

where p_2(x) denotes the vector of basis terms of twice the degree of p(x), so that the entries of r_i determine both R_i = M_i^t M_i and M_i^t 1. Now form r = \sum_{i=1}^{N_{spk}} r_i and map its entries to the matrix R. Then compute the Cholesky decomposition R = L L^t and obtain each model by solving L L^t w_i = M_i^t 1 by back substitution.

Because each speaker contributes a different number of training vectors, this solution is biased by the class priors in the training set. This can be compensated by weighting the error criterion,

    w_i = \arg\min_w \sum_{j=1}^{N_i} \left( w^t p(x_{i,j}) - 1 \right)^2 + \frac{N_i}{N_i'} \sum_{k \ne i} \sum_{j=1}^{N_k} \left( w^t p(x_{k,j}) \right)^2,    (13)
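The accumulate-then-factor training loop above can be sketched as follows. The example uses synthetic data and, for clarity, accumulates the full outer products p(x)p(x)^t rather than only the unique entries r_i that the method actually stores:

```python
import numpy as np

rng = np.random.default_rng(1)

def p(x):
    return np.array([1.0, x[0], x[1]])          # illustrative basis

# Synthetic training frames for three speakers.
speakers = [rng.normal(size=(6, 2)) + i for i in range(3)]

# Per-speaker correlation R_i = M_i^t M_i, accumulated frame by frame.
Rs = [sum(np.outer(p(x), p(x)) for x in X) for X in speakers]
R = sum(Rs)                                     # R = sum_i R_i

# Factor R once; solve L L^t w_i = M_i^t 1 for every speaker.
L = np.linalg.cholesky(R)
models = []
for X in speakers:
    b = sum(p(x) for x in X)                    # = M_i^t 1
    y = np.linalg.solve(L, b)                   # forward substitution
    models.append(np.linalg.solve(L.T, y))      # back substitution
```

The single factorization is what makes training over large populations cheap: each additional speaker costs only one pair of triangular solves.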
where N_i' = N - N_i. We can incorporate this into the training algorithm by altering equation (11) to

    \left[ R_i + \frac{N_i}{N_i'} \sum_{k \ne i} R_k \right] w_i = M_i^t 1.    (14)

The resulting method of compensating for prior probabilities has the advantage of high accuracy. A disadvantage is that a Cholesky decomposition must be performed for each speaker, since the matrix on the left-hand side of (14) now depends on i. This increases computation, but it should be noted that most of the computation is in computing the r_i.
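The prior-compensated system (14) can be checked against the weighted criterion (13): scaling the impostor rows of the least-squares problem by sqrt(N_i/N_i') yields the same model. A toy sketch with synthetic data and an illustrative low-degree basis:

```python
import numpy as np

rng = np.random.default_rng(3)

def p(x):
    return np.array([1.0, x[0], x[1]])

Xi = rng.normal(size=(4, 2))            # speaker i (small class)
Xo = rng.normal(size=(12, 2)) + 1.0     # all other speakers, pooled
Ni, Nip = len(Xi), len(Xo)              # N_i and N_i' = N - N_i

Mi = np.array([p(x) for x in Xi])
Mo = np.array([p(x) for x in Xo])

# Prior-compensated normal equations, eq. (14).
Ri, Ro = Mi.T @ Mi, Mo.T @ Mo
w14 = np.linalg.solve(Ri + (Ni / Nip) * Ro, Mi.T @ np.ones(Ni))

# Same model from the weighted criterion (13), posed as a
# row-scaled least-squares problem: impostor rows get weight
# sqrt(N_i/N_i'), so their squared error is weighted by N_i/N_i'.
c = np.sqrt(Ni / Nip)
A = np.vstack([Mi, c * Mo])
b = np.concatenate([np.ones(Ni), np.zeros(Nip)])
w13 = np.linalg.lstsq(A, b, rcond=None)[0]
```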
Retraining is also low in complexity. When a new training vector x_{i,new} is added for speaker i, only the corresponding vector r_i must be updated,

    r_i^{new} = r_i + p_2(x_{i,new}),    (15)

after which the affected models are re-solved.
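The incremental update (15) can be sketched as follows (again accumulating full outer products instead of only the unique entries r_i, and with made-up data): adding one vector touches only the accumulated statistics, and the result matches a full recomputation:

```python
import numpy as np

rng = np.random.default_rng(2)

def p(x):
    return np.array([1.0, x[0], x[0] ** 2])     # illustrative basis

X_old = rng.normal(size=(8, 1))
x_new = np.array([0.5])

# Accumulated statistics for one speaker.
R_old = sum(np.outer(p(x), p(x)) for x in X_old)
b_old = sum(p(x) for x in X_old)

# Low-complexity update, eq. (15): add only the new vector's terms.
R_upd = R_old + np.outer(p(x_new), p(x_new))
b_upd = b_old + p(x_new)

# Full recomputation over the enlarged set gives the same result.
X_all = np.vstack([X_old, x_new[None, :]])
R_full = sum(np.outer(p(x), p(x)) for x in X_all)
b_full = sum(p(x) for x in X_all)
```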
Table 1 shows the identification error rates of our system. Error rate is defined as the number of misclassified utterances over the total number of utterances tested. These results compare well to other systems. Results available in the literature for identification using a 10 second test are 0.14% [2] and 0.22% [14]. Our best result is 0.36% for text-independent speaker identification. The comparison of these systems to ours is not entirely fair since [2, 14] are text dependent; we would expect a reduction in error rate for our system if we used information about the prompted text.
Table 1: Classifier performance on the YOHO database.

    Order   Error Rate % (1 phrase)   Error Rate % (4 phrases)
      3             4.71                       0.73
      4             2.74                       0.36
We also compared different strategies for prior compensation for a one-phrase (2.5 second) test; see Table 2. Table 2 shows that the weighting method increases accuracy noticeably over the base system. The error criterion weighting also shows improvements over the prior division method.
Table 2: Classifier performance for various prior compensation strategies.

    Method                      Error Rate % (1 phrase)
    None                               17.34
    Score Divided by Prior             11.07
    Error Criterion Weighting           4.71