Abstract
The trimming scheme with a prefixed cut-off portion is known as a method of improving the robustness of statistical models, such as multivariate Gaussian mixture models (MGMMs), in small-scale tests by alleviating the impact of outliers. However, when this method is applied to real-world data, such as noisy speech processing, the optimal cut-off portion for removing the outliers is hard to determine, and the scheme sometimes removes useful data samples as well. In this paper, we propose a new method based on measuring the dispersion degree (DD) of the training data to avoid this problem, so as to realise automatic robust estimation for MGMMs. The DD model is studied using two different measures. For each measure, we theoretically prove that the DD of the data samples in the context of MGMMs approximately obeys a specific (chi or chi-square) distribution. The proposed method is evaluated on a real-world application with a moderately sized speaker recognition task. Experiments show that the proposed method can significantly improve the robustness of the conventional training method of GMMs for speaker recognition.
Keywords
Gaussian Mixture Models; K-means; Trimmed K-means; Speech Processing

Introduction
Algorithm Overview

1. Initialise the centroids $\mu_k^{(0)}$ of the $K$ clusters.
2. At the $n$-th iteration, the $n$-th trimmed subset $D^{(n)}$ is generated as a subset of the objects, and each retained sample is assigned into one of the $K$ classes according to
$$i = \arg\min_k \|x_t - \mu_k^{(n)}\|,\quad x_t \in D.$$
3. Re-calculate the centroids of the $K$ clusters according to the rule
$$\mu_k^{(n+1)} = \frac{1}{|C_k|} \sum_{x_t \in D,\, x_t \in C_k} x_t. \qquad (1)$$
where $\mu_k^{(0)}$ denote the initial centroids, $x \in D$ runs over the samples of the trimmed set, and $V(D \mid C)$ is the intra-class covariance of the trimmed data set $D$ with respect to the cluster set $C$, i.e.,
$$V(D \mid C) = \sum_{k=1}^{K} \sum_{x \in D,\, x \in C_k} \|x - \mu_k\|^2. \qquad (2)$$
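To make the update rule concrete, here is a minimal sketch of one trimmed K-means iteration in Python/NumPy. It illustrates the conventional prefixed-cutoff trimming that the rest of the paper improves upon; the trimming fraction `alpha` and all function and variable names are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def trimmed_kmeans_step(X, mu, alpha):
    """One iteration of trimmed K-means with a prefixed cutoff (eqs. (1)-(2)).

    X     : (T, d) data matrix
    mu    : (K, d) current centroids mu_k^{(n)}
    alpha : fraction of samples to trim, e.g. 0.05 (an assumed value)
    """
    # Distance of every sample to every centroid.
    dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=-1)  # (T, K)
    nearest = dists.argmin(axis=1)   # class index i = argmin_k ||x_t - mu_k||
    d_min = dists.min(axis=1)        # distance to the nearest centroid

    # Trimmed subset D: drop the alpha-fraction of samples farthest
    # from their nearest centroid.
    keep = d_min <= np.quantile(d_min, 1.0 - alpha)

    # Centroid update, eq. (1): mean of the kept samples in each cluster.
    new_mu = np.array([
        X[keep & (nearest == k)].mean(axis=0)
        if np.any(keep & (nearest == k)) else mu[k]
        for k in range(len(mu))
    ])

    # Intra-class covariance V(D|C) of the trimmed set, eq. (2).
    V = float(np.sum(d_min[keep] ** 2))
    return new_mu, keep, V
```

The sketch makes the paper's motivating weakness visible: `alpha` must be fixed in advance, so too small a value keeps outliers while too large a value discards useful samples.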
A statistic $T$ is continuous at a distribution $F$ if $T_n \to T(F)$ for $n \to \infty$. The robustness of an estimator $T$ on a sample $X$ with $n$ observations can be quantified by its finite-sample breakdown point $\varepsilon_n^*(T, X)$.
In the sequel, the dispersion degree of a sample $x$ with respect to a class $c$ is measured by a distance,
$$d(x \,\|\, c) = \mathrm{distance}(x, c). \qquad (6)$$
For the Euclidean measure with $c = \{\mu, \Sigma\}$,
$$d(x \,\|\, c) = \|x - \mu\| = \Big(\sum_{i=1}^{d} (x_i - \mu_i)^2\Big)^{1/2}. \qquad (7)$$
For the Mahalanobis measure with $c = \{\mu, \Sigma^{-1}\}$,
$$d(x \,\|\, c) = (x - \mu)^T \Sigma^{-1} (x - \mu). \qquad (8)$$
The dispersion degree of a sample $x$ in terms of a multi-class set $C = \{c_k\},\ k \in [1, K]$, with $c_k = \{\mu_k, \Sigma_k^{-1}\}$, is defined as
$$d(x \,\|\, C) = d(x \,\|\, c_j), \qquad (9)$$
where
$$j = \arg\max_{k=1}^{K} \Pr(x \mid c_k) \qquad (10)$$
and
$$\Pr(x \mid c_k) = \mathrm{N}(x \mid \mu_k, \Sigma_k^{-1}). \qquad (11)$$

Theorem 4.1. Let the data set $X = \{x_t\},\ t \in [1, T]$ be drawn from a Gaussian mixture,
$$X \sim G = \sum_{k=1}^{K} w_k\, \mathrm{N}(\mu_k, \Sigma_k^{-1}), \qquad (12)$$
and take
$$C = \{c_k\},\ k \in [1, K] \qquad (13)$$
as the multi-class set induced by $G$. Set $y$ to represent the dispersion degree of the data set $X$ in terms of the multi-class set $C$, which includes the dispersion degree of each sample in the data set $X$, derived based on the continuous Euclidean distance as defined in eq. (7), i.e.,
$$y = \{d(x_t \,\|\, C)\},\ t \in [1, T]. \qquad (14)$$
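As a concrete reading of eqs. (7)-(11), the following sketch computes the dispersion degree of each sample: assign the sample to the class with the highest likelihood, then measure its distance to that class. The function name is an illustrative assumption, the covariances are passed directly (rather than as precisions $\Sigma_k^{-1}$) for simplicity, and `metric` switches between the Euclidean measure of eq. (7) and the Mahalanobis measure of eq. (8).

```python
import numpy as np
from scipy.stats import multivariate_normal

def dispersion_degrees(X, mus, covs, metric="euclidean"):
    """Dispersion degree d(x_t || C) for each sample, eqs. (7)-(11).

    X    : (T, d) data
    mus  : list of K mean vectors mu_k
    covs : list of K covariance matrices
    """
    K = len(mus)
    # Pr(x | c_k) for every class, eq. (11).
    like = np.stack(
        [multivariate_normal(mus[k], covs[k]).pdf(X) for k in range(K)], axis=1)
    j = like.argmax(axis=1)  # winning class, eq. (10)

    y = np.empty(len(X))
    for t, x in enumerate(X):
        diff = x - mus[j[t]]
        if metric == "euclidean":   # eq. (7)
            y[t] = np.linalg.norm(diff)
        else:                       # eq. (8), Mahalanobis
            y[t] = diff @ np.linalg.inv(covs[j[t]]) @ diff
    return y
```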
Then $y$ approximately obeys a chi distribution,
$$y \sim \chi(x;\nu) = \frac{x^{\nu-1} e^{-x^2/2}}{2^{\nu/2-1}\,\Gamma(\nu/2)}, \qquad (15)$$
with degrees of freedom
$$\nu = \sum_{i=1}^{K} \frac{\big(\sum_{j=1}^{d}\sigma_{ij}^{4}\big)^{3}}{\big(\sum_{j=1}^{d}\sigma_{ij}^{6}\big)^{2}}, \qquad (16)$$
where $\Gamma(\cdot)$ denotes the gamma function,
$$\Gamma(z) = \int_{0}^{\infty} t^{z-1} e^{-t}\,dt. \qquad (17)$$

Proof. By the Bayesian decision rule (10), the data set is partitioned into $K$ subsets,
$$X = \{x^{(1)}, x^{(2)}, \ldots, x^{(K)}\}, \qquad (18)$$
where each sample $x^{(k)}$ of the $k$-th subset is assumed to be drawn from the $k$-th mixture component, so that
$$x^{(k)} - \mu_k \sim \mathrm{N}(0, \Sigma_k^{-1}). \qquad (19)$$
Under the Euclidean distance (7), the squared dispersion degree of $x^{(k)}$ is
$$h_k = \|x^{(k)} - \mu_k\|^2 = \sum_{i=1}^{d} \big(x_i^{(k)} - \mu_{ki}\big)^2. \qquad (20)$$
Since $(x_i^{(k)} - \mu_{ki})/\sigma_{ki} \sim \mathrm{N}(0,1)$, and thus $\big((x_i^{(k)} - \mu_{ki})/\sigma_{ki}\big)^2 \sim \chi^2(1)$, $h_k$ is a positive-definite quadratic form in normal variables,
$$h_k = \sum_{i=1}^{d} \sigma_{ki}^2\, \chi^2(1), \qquad (21)$$
which, by the chi-square approximation recapped in the Appendix, approximately obeys a central chi-square distribution $\chi^2(\nu_k)$ with
$$\nu_k = \frac{\big(\sum_{i=1}^{d}\sigma_{ki}^{4}\big)^{3}}{\big(\sum_{i=1}^{d}\sigma_{ki}^{6}\big)^{2}}. \qquad (22)$$
The squared dispersion degree of the whole data set is then the weighted sum
$$y^2 = \frac{1}{K}\sum_{k=1}^{K} h_k, \qquad (23)$$
so $y^2$ approximately obeys another chi-square distribution $\chi^2(l,\delta)$. Its cumulant quantities (eq. (54) in the Appendix) are
$$c_j = \sum_{k=1}^{K} \Big(\frac{1}{K}\Big)^{j} \nu_k = \frac{p}{K^{j}}, \qquad (24)$$
where
$$p = \sum_{i=1}^{K} \frac{\big(\sum_{j=1}^{d}\sigma_{ij}^{4}\big)^{3}}{\big(\sum_{j=1}^{d}\sigma_{ij}^{6}\big)^{2}}, \qquad (25)$$
and
$$s_1 = \frac{c_3}{c_2^{3/2}}, \qquad (26)$$
$$s_2 = \frac{c_4}{c_2^{2}}. \qquad (27)$$
Because
$$\frac{s_1^2}{s_2} = \frac{c_3^2}{c_2\, c_4} = \frac{p^2/K^6}{(p/K^2)(p/K^4)} = 1, \qquad (28)$$
we have $s_1^2 \le s_2$. Hence $y^2$ approximately obeys a central chi-square distribution $\chi^2(l,\delta)$ with $\delta = 0$ and
$$l = \frac{c_2^3}{c_3^2} = \frac{p^3/K^6}{p^2/K^6} = p, \qquad (29)$$
which gives $\nu = p$ as in eq. (16), and consequently $y$ approximately obeys the chi distribution (15).

To obtain a normal approximation of (15) (Theorem 4.2), consider the log-density
$$f(x) = \ln p(x) = \ln x^{\nu-1} + \ln e^{-x^2/2} + \text{const}. \qquad (30)$$
Setting the derivative of $f(x)$ to zero,
$$\frac{df(x)}{dx} = \frac{\nu-1}{x} - x = 0, \qquad (31)$$
we obtain the mode
$$x_{\max}^2 = \nu - 1. \qquad (32)$$
The curvature of $f(x)$ at the mode is
$$\frac{d^2 f(x)}{dx^2}\Big|_{x_{\max}} = -\frac{\nu-1}{x_{\max}^2} - 1 = -2, \qquad (33)$$
so a Laplace approximation around $x_{\max}$ gives the variance
$$\sigma^2 = \frac{1}{-\,d^2 f(x)/dx^2\,|_{x_{\max}}} = \frac{1}{2}. \qquad (34)$$
That is, for large $\nu$,
$$y \approx \mathrm{N}\big(\sqrt{\nu-1},\ 1/2\big). \qquad (35)$$
The derivation is done.
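The claim of Theorem 4.1 can be checked empirically. The sketch below, a simulation under assumed toy parameters, draws samples from a single diagonal-covariance Gaussian (so $K = 1$), computes the Euclidean dispersion degrees, and compares their moments with a chi distribution whose degrees of freedom follow eq. (22); the moment-matched scale factor is an assumption standing in for the standardisation used in the chi-square approximation.

```python
import numpy as np
from scipy.stats import chi

rng = np.random.default_rng(0)
d = 10
sigma = rng.uniform(0.5, 2.0, size=d)   # per-dimension std devs (toy values)

# Euclidean dispersion degrees of samples from N(0, diag(sigma^2)).
x = rng.normal(0.0, sigma, size=(100_000, d))
y = np.linalg.norm(x, axis=1)

# Degrees of freedom predicted by eq. (22): (sum s^4)^3 / (sum s^6)^2.
nu = np.sum(sigma**4) ** 3 / np.sum(sigma**6) ** 2

# Compare the simulated distribution of y with a scaled chi(nu) law.
scale = np.sqrt(np.mean(y**2) / nu)      # moment-matched scale factor
print("simulated mean/std :", y.mean(), y.std())
print("chi(nu)  mean/std  :", chi.mean(nu) * scale, chi.std(nu) * scale)
```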
Next, we shall present the other main result of dispersion degree modelling, in regard to using the Mahalanobis distance, in the following theorem.
Theorem 4.3. Let the data set $X = \{x_t\},\ t \in [1, T]$ be drawn from a Gaussian mixture,
$$X \sim G = \sum_{k=1}^{K} w_k\, \mathrm{N}(\mu_k, \Sigma_k^{-1}), \qquad (36)$$
and let
$$C = \{c_k\},\ k \in [1, K],\ c_k = \{\mu_k, \Sigma_k^{-1}\} \qquad (37)$$
be the multi-class set induced by $G$. Set $y$ to represent the dispersion degree of the data set $X$ in terms of the multi-class set $C$, which includes the dispersion degree of each sample in the data set $X$, derived based on the continuous Mahalanobis distance as defined in eq. (8), i.e.,
$$y = \{d(x_t \,\|\, C)\},\ t \in [1, T]. \qquad (38)$$
Then $y$ approximately obeys a chi-square distribution [2],
$$y \sim \chi^2(x;\nu) = \frac{x^{\nu/2-1} e^{-x/2}}{2^{\nu/2}\,\Gamma(\nu/2)}, \qquad (39)$$
where
$$\nu = dK. \qquad (40)$$

Proof. As before, the Bayesian decision rule partitions the data set into $K$ subsets, $X = \{x^{(1)}, x^{(2)}, \ldots, x^{(K)}\}$. By noting that $y^{(k)} = (x^{(k)} - \mu_k)^T \Sigma_k^{-1} (x^{(k)} - \mu_k)$ is the sum of $d$ squared independent standard normal variables, we get $y^{(k)} \sim \chi^2(d)$. Further, let
$$y = \frac{1}{K}\sum_{k=1}^{K} y^{(k)}, \qquad (41)$$
which is a weighted sum of chi-square variables, so that approximately
$$y \sim \chi^2(x, \nu). \qquad (42)$$
In the notation of the Appendix, the parameters of this quadratic form are
$$\lambda_k = \frac{1}{K}, \qquad (43)$$
$$h_k = d, \qquad (44)$$
$$\delta_k = 0, \qquad (45)$$
so the cumulant quantities become
$$c_j = \sum_{k=1}^{K} \lambda_k^{j} h_k = d\, K^{1-j}, \qquad (46)$$
$$s_1^2 = \frac{c_3^2}{c_2^3} = \frac{1}{dK}, \qquad (47)$$
$$s_2 = \frac{c_4}{c_2^2} = \frac{1}{dK}. \qquad (48)$$
As $s_1^2 \le s_2$, $y$ approximately obeys a central chi-square distribution $\chi^2(l, \delta)$ with $\delta = 0$ and $l = c_2^3/c_3^2 = dK$. Hence $\nu = dK$, which completes the proof.

Furthermore (Theorem 4.4), a chi-square distribution $\chi^2(\nu)$ can be accurately approximated by a normal distribution $\mathrm{N}(\nu, 2\nu)$ for large $\nu$ [12].
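The Mahalanobis case is easy to verify numerically, since $(x - \mu)^T \Sigma^{-1} (x - \mu)$ for a Gaussian sample is exactly $\chi^2(d)$. A minimal sketch with assumed toy parameters:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
d = 8
A = rng.normal(size=(d, d))
cov = A @ A.T + d * np.eye(d)     # a random SPD covariance (toy choice)
inv_cov = np.linalg.inv(cov)

x = rng.multivariate_normal(np.zeros(d), cov, size=50_000)
y = np.einsum("ti,ij,tj->t", x, inv_cov, x)  # Mahalanobis DD, eq. (8)

print("simulated mean/var :", y.mean(), y.var())
print("chi2(d)  mean/var  :", chi2.mean(d), chi2.var(d))  # d and 2d
# For large dof, chi2(nu) ~ N(nu, 2*nu): the normal approximation above.
```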
With the above results, the proposed automatic robust estimation proceeds in two steps. A data sample $x$ is retained only if its tail probability under the dispersion degree model $M$ exceeds a threshold $\tau$,
$$P(x \mid M) = \int_{d(x \,\|\, C)}^{\infty} p(y \mid M)\, dy > \tau. \qquad (49)$$
1. Dispersion degree model estimation for $M$: according to Theorems 4.1, 4.2, 4.3 and 4.4, $y \sim \mathrm{N}(\theta)$, where $\theta = (\mu, \sigma^2)$. Then $\theta$ can be asymptotically estimated as $\hat{\theta} = (\hat{\mu}, \hat{\sigma}^2)$ by the well-known formulae
$$\hat{\mu} = \frac{1}{T}\sum_{t=1}^{T} y_t, \qquad (50)$$
$$\hat{\sigma}^2 = \frac{1}{T-1}\sum_{t=1}^{T} (y_t - \hat{\mu})^2, \qquad (51)$$
where
$$y_t = d(x_t \,\|\, C). \qquad (52)$$
2. Outlier identification:
(a) For each data sample $x_t$:
i. Calculate $y_t = d(x_t \,\|\, C)$ and the tail probability $P(x_t \mid M)$ by eq. (49); if $P(x_t \mid M) \le \tau$, identify $x_t$ as an outlier and exclude it from training.
(b) End for
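Putting steps 1 and 2 together, a minimal sketch of the automatic trimming rule might look as follows; the threshold name `tau` and the Gaussian-tail form of the test follow eq. (49) under the normal approximation of the dispersion degree model, and `dispersion_degrees` is the illustrative helper sketched earlier.

```python
import numpy as np
from scipy.stats import norm

def identify_outliers(y, tau=0.01):
    """Automatic outlier identification, eqs. (49)-(52).

    y   : dispersion degree of every training sample, y_t = d(x_t || C)
    tau : tail-probability threshold (an assumed value)
    """
    mu_hat = y.mean()            # eq. (50)
    sigma_hat = y.std(ddof=1)    # eq. (51)
    # Upper-tail probability P(x | M) under the fitted normal model, eq. (49).
    tail = norm.sf(y, loc=mu_hat, scale=sigma_hat)
    # Samples whose tail probability does not exceed tau are trimmed.
    return tail <= tau           # boolean outlier mask

# usage: mask = identify_outliers(y); X_trimmed = X[~mask]
```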
Further Remarks: In the theory above, you may notice that both Theorem 4.1 and Theorem 4.3 rely on using the Bayesian decision rule for the partition process. As is well known in the pattern recognition domain [6], the Bayesian decision rule is an optimal classifier in the sense of minimising the decision risk in a squared form [6]. Hence, the quality of the recognition process should be able to satisfy the requirements of most pattern recognition applications. In the next section, we shall carry out experiments to show that such a procedure is effective.
Experiments
In the previous sections, we have discussed the ARETRIM algorithm from a theoretical viewpoint. In this section, we shall show its effectiveness by applying it to a real signal processing application.

Our proposed training approach aims at improving the robustness of classic GMMs by adopting the automatic trimmed K-means training technique. Thus, theoretically speaking, any application using GMMs, such as speaker/speech recognition or acoustical noise reduction, can be used to evaluate the effectiveness of our proposed method. Without loss of generality, we select a moderately sized speaker recognition task for evaluation, as the GMM is widely accepted as the most effective method for speaker recognition.

Speaker recognition has two common application tasks: speaker identification (SI), recognising a speaker from a set of enrolled speakers, and speaker verification, deciding whether an utterance was indeed produced by the claimed speaker.
Conclusions
In this paper, we have proposed an automatic trimming estimation algorithm for conventional Gaussian mixture models. This trimming scheme consists of several novel contributions to improve the robustness of Gaussian mixture model training by effectively removing outlier interference. First of all, a modified Lloyd's algorithm is proposed to realise the trimmed K-means clustering algorithm and is used for parameter initialisation of Gaussian mixture models. Secondly, the data dispersion degree is proposed for automatically identifying outliers. Thirdly, we have theoretically proved that the data dispersion degree in the context of GMM training approximately obeys a certain distribution (a chi or chi-square distribution), in terms of either the Euclidean or the Mahalanobis distance measure.
Appendix: The Chi-Square Approximation of Quadratic Forms

The proofs of Theorems 4.1 and 4.3 rely on the chi-square approximation to the distribution of non-negative definite quadratic forms in normal variables (Liu, Tang & Zhang, 2009). A weighted sum of independent chi-square variables,
$$Q(X) = \sum_{i=1}^{q} \lambda_i\, \chi^2_{h_i}(\delta_i), \qquad (53)$$
where $\chi^2_{h_i}(\delta_i)$ denotes the $i$-th chi-square distribution with $h_i$ degrees of freedom and non-centrality parameter $\delta_i$, can be approximated by a single chi-square distribution $\chi^2_l(\delta)$. Define
$$c_k = \sum_{i=1}^{q} \lambda_i^k h_i + k \sum_{i=1}^{q} \lambda_i^k \delta_i,\quad k = 1, 2, 3, 4, \qquad (54)$$
$$s_1 = c_3 / c_2^{3/2}, \qquad (55)$$
$$s_2 = c_4 / c_2^{2}. \qquad (56)$$
Then the parameters of the approximating distribution are chosen as
$$\delta = \begin{cases} 0, & \text{if } s_1^2 \le s_2 \\ s_1 a^3 - a^2, & \text{if } s_1^2 > s_2 \end{cases} \qquad (57)$$
$$l = \begin{cases} c_2^3 / c_3^2, & \text{if } s_1^2 \le s_2 \\ a^2 - 2\delta, & \text{if } s_1^2 > s_2 \end{cases} \qquad (58)$$
where
$$a = 1 / \big(s_1 - \sqrt{s_1^2 - s_2}\big). \qquad (59)$$

In the context of Theorem 4.1, each class contributes a quadratic form of a $d$-dimensional normal distribution, i.e.,
$$h = \sum_{i=1}^{d} \sigma_i^2\, \chi^2(1), \qquad (60)$$
for which $\lambda_i = \sigma_i^2$, $h_i = 1$ and $\delta_i = 0$, so that
$$c_k = \sum_{i=1}^{d} \sigma_i^{2k}, \qquad (61)$$
$$s_1 = \frac{c_3}{c_2^{3/2}} = \frac{\sum_{i=1}^{d} \sigma_i^6}{\big(\sum_{i=1}^{d} \sigma_i^4\big)^{3/2}}, \qquad (62)$$
$$s_2 = \frac{c_4}{c_2^2} = \frac{\sum_{i=1}^{d} \sigma_i^8}{\big(\sum_{i=1}^{d} \sigma_i^4\big)^{2}}. \qquad (63)$$
We now show that
$$s_1^2 \le s_2 \qquad (64)$$
always holds in this case, since it is equivalent to $\big(\sum_i \sigma_i^6\big)^2 \le \big(\sum_i \sigma_i^4\big)\big(\sum_i \sigma_i^8\big)$. Expanding both sides,
$$\Big(\sum_i \sigma_i^6\Big)^2 = \sum_i \sigma_i^{12} + \sum_i \sum_{j \ne i} \sigma_i^6 \sigma_j^6,\qquad \Big(\sum_i \sigma_i^4\Big)\Big(\sum_i \sigma_i^8\Big) = \sum_i \sigma_i^{12} + \sum_i \sum_{j \ne i} \sigma_i^4 \sigma_j^8, \qquad (65)$$
so it suffices to examine the difference of the cross terms,
$$C = \sum_i \sum_{j \ne i} \big(\sigma_i^6 \sigma_j^6 - \sigma_i^4 \sigma_j^8\big) = \sum_i \sum_{j \ne i} \sigma_i^4 \sigma_j^6 \big(\sigma_i^2 - \sigma_j^2\big). \qquad (66)$$
By pairing each term
$$A = \sigma_i^4 \sigma_j^6 \big(\sigma_i^2 - \sigma_j^2\big) \qquad (67)$$
with its counterpart $B$, i.e.,
$$B = \sigma_j^4 \sigma_i^6 \big(\sigma_j^2 - \sigma_i^2\big), \qquad (68)$$
we obtain
$$A + B = \sigma_i^4 \sigma_j^4 \big(\sigma_i^2 - \sigma_j^2\big)\big(\sigma_j^2 - \sigma_i^2\big) = -\sigma_i^4 \sigma_j^4 \big(\sigma_i^2 - \sigma_j^2\big)^2 \le 0. \qquad (69)$$
Hence
$$C \le 0, \qquad (70)$$
and (64) holds. Consequently, by (57) and (58), the parameters of the approximation are
$$\delta = 0 \qquad (71)$$
and
$$l = \frac{c_2^3}{c_3^2} = \frac{\big(\sum_{i=1}^{d} \sigma_i^4\big)^3}{\big(\sum_{i=1}^{d} \sigma_i^6\big)^2}, \qquad (72)$$
which is exactly the form of the degrees of freedom used in eq. (22).
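A direct transcription of eqs. (53)-(59) into code is given below; this is a sketch of the parameter-selection rule as stated above, not of Liu et al.'s full procedure (which also standardises the statistic before applying the approximating distribution), and the function name is an assumption.

```python
import numpy as np

def chi2_approx_params(lam, h, delta):
    """Parameters (l, delta) of the chi-square approximation, eqs. (53)-(59).

    lam, h, delta : arrays of weights, dofs and non-centralities of the
                    quadratic form Q = sum_i lam_i * chi2_{h_i}(delta_i)
    """
    lam, h, delta = map(np.asarray, (lam, h, delta))
    # Cumulant quantities c_1..c_4, eq. (54).
    c = {k: np.sum(lam**k * h) + k * np.sum(lam**k * delta) for k in (1, 2, 3, 4)}
    s1 = c[3] / c[2] ** 1.5        # eq. (55)
    s2 = c[4] / c[2] ** 2          # eq. (56)
    if s1**2 <= s2:                # central case, eqs. (57)-(58)
        return c[2]**3 / c[3]**2, 0.0
    a = 1.0 / (s1 - np.sqrt(s1**2 - s2))   # eq. (59)
    ncp = s1 * a**3 - a**2                 # eq. (57)
    return a**2 - 2.0 * ncp, ncp           # eq. (58)

# e.g. h_k = sum_i sigma_i^2 * chi2(1): lam=sigma**2, h=ones(d), delta=zeros(d)
# recovers l = (sum sigma^4)^3 / (sum sigma^6)^2 as in eq. (72).
```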
REFERENCES

Hampel, F. R., "A general qualitative definition of robustness", The Annals of Mathematical Statistics, vol. 42, pp. 1887-1896, 1971.

Hampel, F. R., Rousseeuw, P. J., Ronchetti, E., & Stahel, W. A., Robust Statistics: The Approach Based on Influence Functions, John Wiley and Sons, 1986.

Johnson, N. L., Kotz, S., & Balakrishnan, N., Continuous Univariate Distributions, John Wiley and Sons, 1994.

Liu, H., Tang, Y. Q., & Zhang, H. H., "A new chi-square approximation to the distribution of non-negative definite quadratic forms in non-central normal variables", Computational Statistics and Data Analysis, vol. 53, pp. 853-856, 2009.

Morris, A., Wu, D., & Koreman, J., "MLP trained to classify a small number of speakers generates discriminative features for improved speaker recognition", Proceedings of ...

Shafran, I., & Rose, R., "Robust speech detection and segmentation for real-time ASR applications", Proc. of Int. Conf. Acoust., Speech, Signal Process., vol. 1, pp. 432-435, 2003.

..., "... Techniques for Robust Speaker ...", 2008.

..., ISBN 0-07-042864-6, 1974.

Wu, D., et al., "...-GMM based on ...", pp. 1-20, 2009.

Wu, D., Li, J., & Wu, H. Q., "...-Gaussian mixture models ... transmission degradations on speaker recognition ...", IEEE Signal ...