c.xydeas@lancaster.ac.uk
INTRODUCTION
ICICS 2005
The probability of an observation sequence Y, given the model λ_AP, is

$$
P(Y \mid \lambda_{AP}) \;=\; \prod_{i=1}^{I} P\big(o^{(i)} \mid \lambda_{AP}^{(i)}\big) \tag{4}
$$

with the conditional observation probabilities estimated by counting:

$$
b\big(o_t(i) \mid V_t(i)\big) \;=\;
\frac{\displaystyle\sum_{k=1}^{K}\sum_{t'=1}^{T_k} h\big(U_{k,t'}(i),\, V_t(i)\big)}
     {\displaystyle\sum_{k=1}^{K}\sum_{t'=1}^{T_k} h\big(U_{k,t'}(i-1),\, V_t(i-1)\big)} \tag{5}
$$

where $U_{k,t'}(i) = [\,o_{k,t'}(1),\ o_{k,t'}(2),\ \dots,\ o_{k,t'}(i)\,]^T$ and $V_t(i) = [\,o_t(1),\ \dots,\ o_t(i)\,]^T$, the counts being accumulated over all $U_{k,t'}(i)$ in the K training data sets (i.e. K different sets of observation sequences), $k = 1, 2, \dots, K$.
The counting function $h(a,b)$ is one if and only if $a = b$.
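As a concrete illustration of this counting-based estimation, the sketch below computes a conditional observation probability as a ratio of counts over the K training sets, exactly in the spirit of the counting function h(a, b). This is a minimal sketch assuming discrete (SMQ-indexed) observation symbols; the function name `estimate_conditional` and the data layout are hypothetical, not from the paper.

```python
def estimate_conditional(training_sets, i, v_target):
    """Count-based estimate of P(o_t(i) | o_t(1), ..., o_t(i-1)).

    training_sets: list of K observation sequences; each sequence is a
        list of tuples (o_t(1), ..., o_t(N)) of discrete symbol indices.
    v_target: tuple (o_t(1), ..., o_t(i)), i.e. the vector V_t(i) whose
        conditional probability is required.
    """
    num = 0  # matches of the full vector:        h(U_{k,t}(i),   V_t(i))   = 1
    den = 0  # matches of the conditioning part:  h(U_{k,t}(i-1), V_t(i-1)) = 1
    for seq in training_sets:          # k = 1, ..., K
        for obs in seq:                # t = 1, ..., T_k
            if obs[:i - 1] == v_target[:i - 1]:
                den += 1
                if obs[i - 1] == v_target[i - 1]:
                    num += 1
    return num / den if den else 0.0
```

With i = 1 the conditioning part is empty and the estimate reduces to the unconditional relative frequency of the symbol, which is the expected degenerate case.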
A. Dependent-Multi-HMM-D2
A new model structure, DM-HMM-D2, has been developed, which is also based on the assumption that input features are inter-dependent.
[Figure: the bank of HMM models (HMM1, HMM2, ..., HMM10) operating on the SMQ observation vectors y_1, y_2, y_3, y_4 and their per-model observation sequences o^{(1)}, o^{(2)}, ....]

$$
Y = [\,y_1 \;\; y_2 \;\; y_3\,] \tag{6}
$$
Now, having y_1, y_2 and y_3, and using ŷ_4 to indicate an estimate of the missing SMQ vector y_4, this estimate can be selected such that the likelihood probability of the observation Y = {y_1, y_2, y_3, ŷ_4}, given an HMM model λ_i, i.e. P(Y = {y_1, y_2, y_3, ŷ_4} | λ_i), is maximum.
This effectively means that the HMM bank is designed, i.e. trained, to classify input patterns of four SMQ vectors {y_1, y_2, y_3, y_4} into one of i classes. Then, given the resulting HMM bank models and an incomplete observation vector sequence {y_1, y_2, y_3}, the system defines ŷ_4 so that P(Y* = {y_1, y_2, y_3, ŷ_4} | λ_i) is maximum over all i classes.
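The recovery step described above, i.e. selecting the ŷ_4 that maximises the likelihood over the bank of class models, can be sketched as an exhaustive search over a candidate codebook. This is a hypothetical illustration: `recover_missing_vector`, the candidate set, and the `likelihood` callback (which stands in for the HMM forward probability, not shown in the paper) are all assumptions.

```python
def recover_missing_vector(y_partial, candidates, models, likelihood):
    """Pick the candidate y4 (and the class) maximising P({y1, y2, y3, y4} | model_i).

    y_partial:  the received SMQ vectors [y1, y2, y3]
    candidates: codebook of possible values for the missing vector y4
    models:     bank of trained class models, one per class i
    likelihood: function (sequence, model) -> likelihood of the sequence
                under the model (placeholder for the HMM forward probability)
    """
    best_p, best_y4, best_class = float("-inf"), None, None
    for y4 in candidates:
        seq = y_partial + [y4]          # complete the four-vector sequence
        for i, model in enumerate(models):
            p = likelihood(seq, model)
            if p > best_p:              # keep the jointly best (y4, class)
                best_p, best_y4, best_class = p, y4, i
    return best_y4, best_class
```

The search cost is |candidates| × |models| likelihood evaluations, which is why a small number of classes (seven here) matters in practice.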
Of course each of these classes should represent clusters of
"similar" four-vector SMQ "sequences" Y whose
characteristics are captured by the corresponding bank of
HMM models. This concept of clusters of "similar" Y sequences of SMQ vectors can be readily accepted because of the structure imposed on speech signals in general, and on LSP tracks in particular, by language rules and the constraints of the human speech production mechanism.
Note that an SMQ frame size of 80 msecs and a T_ob of 320 msecs were selected to reflect phoneme/syllabic durations. The voiced-unvoiced nature of the speech signal was also selected as the clustering criterion, an assumption that results in significant variability between Y sequences belonging to the same class but, at the same time, leads to a small number of classes, j = 7. Also note that voiced/unvoiced classifications are produced at the output of a Voice Activity Detection (VAD) process operating on a 20 msecs frame basis, and thus Y sequences defined over T_ob = 320 msecs are classified into the j = 7 classes using 16 voiced/unvoiced flags (320/20 = 16). The classes are: 1) voiced, 2) unvoiced, 3) voiced to unvoiced, 4) unvoiced to voiced, 5) voiced to unvoiced to voiced, 6) unvoiced to voiced to unvoiced and 7) other.
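A minimal sketch of the class assignment described above: the 16 voiced/unvoiced flags are collapsed into their run pattern, which is then mapped to one of the seven classes, with every unlisted pattern falling into class 7 ("other"). The function name `vuv_class` is hypothetical.

```python
def vuv_class(flags):
    """Map 16 voiced (True) / unvoiced (False) 20 ms flags to a class 1-7."""
    runs = []                      # collapse the flags into alternating runs
    for f in flags:
        if not runs or runs[-1] != f:
            runs.append(f)
    V, U = True, False
    patterns = {
        (V,): 1,          # voiced
        (U,): 2,          # unvoiced
        (V, U): 3,        # voiced to unvoiced
        (U, V): 4,        # unvoiced to voiced
        (V, U, V): 5,     # voiced to unvoiced to voiced
        (U, V, U): 6,     # unvoiced to voiced to unvoiced
    }
    return patterns.get(tuple(runs), 7)  # anything else -> class 7 ("other")
```

Note that the run-pattern view makes class membership independent of where within the 320 msec window the transitions occur, which is one source of the within-class variability mentioned above.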
Now, given seven classes and ten different HMM networks in the system bank (see figure 1), the probability of observing O^{(i)}, given the j-th class, i-th HMM model λ_{j,i}, is P(O^{(i)} | λ_{j,i}), with i = 1, ..., 10 and j = 1, ..., 7. Then the total probability P(Y) over the system bank can be maximized using:

(Method 1)  P(Y) = max_j { P(O^{(1)} | λ_{j,1}) × P(O^{(2)} | λ_{j,2}) × ... × P(O^{(10)} | λ_{j,10}) },
[Figure 1: Training HMMs for Class 1 — the bank of models HMM1, ..., HMM10 with their input observation sequences O^{(1)}, ..., O^{(10)} derived from the SMQ vectors y_i.]
or

(Method 2)  P(Y) = max{ P(O^{(1)} | λ_{1,1}), P(O^{(1)} | λ_{2,1}), ..., P(O^{(1)} | λ_{7,1}) }
× max{ P(O^{(2)} | λ_{1,2}), P(O^{(2)} | λ_{2,2}), ..., P(O^{(2)} | λ_{7,2}) }
× ...
× max{ P(O^{(10)} | λ_{1,10}), P(O^{(10)} | λ_{2,10}), ..., P(O^{(10)} | λ_{7,10}) }
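The two combination rules can be contrasted in a few lines: Method 1 commits to a single best class j for the whole product, while Method 2 maximises over classes independently for each observation stream i, so its score is never smaller than Method 1's. A sketch using log-probabilities (products become sums), with hypothetical function names:

```python
def method1_score(logp):
    """logp[j][i] = log P(O^(i) | lambda_{j,i}).
    Method 1: best single class j for the whole product of streams."""
    return max(sum(row) for row in logp)

def method2_score(logp):
    """Method 2: the best class is chosen independently per stream i."""
    n_streams = len(logp[0])
    return sum(max(row[i] for row in logp) for i in range(n_streams))
```

Because Method 2 relaxes the single-class constraint, it trades class consistency across streams for a higher (more optimistic) likelihood score.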
[Figure: the DM-HMM-D2 probability-selection encoder/decoder.]
[Table: encoder/decoder PESQ performance of Method 1 and Method 2 (with missing vectors); PESQ scores per utterance type, e.g. 1.946306 for "Hello".]
CONCLUDING REMARKS