
W2C.

Recovery of LSP Coefficients in VoIP Systems


Costas Xydeas, Shih-Yang Chiao, Eric Jones
Dept. of Communication Systems,
Lancaster University,
Lancaster, United Kingdom

c.xydeas@Lancaster.ac.uk

Abstract-In order to deliver real time, high quality voice
services, VoIP system designers must tackle the packet-loss
problems that are inherent in packet-based networks. To combat
the inevitable speech quality deterioration resulting from the loss
of transmitted packets of speech information, techniques that
provide estimates of the lost information that is needed by the
speech recovery process are of considerable interest.
Furthermore, in future VoIP systems employing LPC based
speech coders, a significant percentage of the coded speech
information will represent the values of LPC coefficients and thus
a new probabilistic approach for estimating missing LPC filter
coefficients is presented in this paper. This approach employs a
new formulation of LSP recovery system architecture where
dependent-multiple Hidden Markov Models with Discrete
Densities (DM-HMM-D) operate in parallel. Each HMM
processes sequences of received quantized vectors of LSP
coefficients and, while allowing for the modeling of the
inter-dependencies that exist between LPC coefficients, the resulting
maximum likelihood observation probabilities are used to
provide the required estimates of missing LSPs. The proposed
missing-parameter estimation technique is generic and initial
experimental results demonstrate its considerable potential in
improving the quality of LPC based decoded speech in VoIP
applications.

Keywords- Hidden Markov Models, VoIP systems, LSP
coefficient recovery.

I. INTRODUCTION

Packetised speech transmission in general and Voice over
Internet Protocol (VoIP) [1][2] in particular represent an
important new technological area that has the potential to
completely revolutionize the world's phone communications.
Developers of VoIP systems face many obstacles as they
try to devise system architectures that merge traditional Plain
Old Telephone Service (POTS) networks with packet
networks. One of the biggest challenges to the successful
development of these systems is quality of service (QoS) [3].
Unlike traditional Internet Protocol (IP) systems, end users
demand that new voice-enabled packet systems deliver high
speech quality service in real time, at all times. In order to
achieve this, VoIP system designers must tackle the packet-loss
problems that are inherent in packet-based networks.
There are two important methodologies that can be used to
combat the inevitable speech quality deterioration resulting
from the loss of transmitted packets of speech information.
That is, i) the introduction of redundancy via channel coding
techniques and ii) the application of packet prediction
techniques that provide estimates of the lost information for use
in the place of the missing packets. Of interest here is the
application of HMM models within the framework of the
second methodology.
Note that existing VoIP systems employ Pulse Code
Modulation (PCM) coding (at 64 Kbits/s) and when missing
packets occur at the receiver, these packets of PCM samples
are estimated to be either "silence" (i.e. zero value samples), or
equal to the previous correctly received packet of samples. This
paper assumes that the next generation of VoIP systems will
employ efficient parametric speech coders which are most
likely to be of the Analysis-by-Synthesis LPC type, since all
the currently established speech coding standard algorithms
operating at bit rates in the range of 2.4 Kbits/s to 13 Kbits/s
are LPC based.
This means that a substantial part of the transmitted speech
information will be in the form of LPC parameters and when
transmitted packets are missing, estimates of these parameters
should be formed at the receiving end and used by the speech
recovery (decoding) process. Another assumption made is that
LPC information is efficiently quantized using the Split Matrix
Quantization (SMQ) method [4].
Thus a new probabilistic approach for estimating missing
LPC filter coefficients is presented in this paper. This approach
employs a new formulation of LSP recovery system
architecture where dependent-multiple Hidden Markov Models
with Discrete Densities (DM-HMM-D) operate in parallel.
Each HMM processes sequences of received quantized vectors
of LSP coefficients and, while allowing for the modeling of the
inter-dependencies that exist between LPC coefficients, the
resulting maximum likelihood observation probabilities are
used to provide the required estimates of missing LSPs.
A brief introduction to HMM classification is given in the
next section which also presents a new DM-HMM-D
methodology. Experiments and results that determine the
performance of the proposed LSP recovery system are
presented in Section 4, whereas conclusions are given in
Section 5.

0-7803-9282-5/05/$20.00 ©2005 IEEE, ICICS 2005

II. CONVENTIONAL HMM

A discrete, single-feature observation (input) HMM
(referred to as HMM-D) classification network is described in
[5,6]. Briefly, there are N states $\{S_1, S_2, \ldots, S_N\}$ in the network and
M possible observations can be generated by the model. At
every time step one of the states, say $S_j$, is entered based on a
state transition probability $\{a_{ij}\}$ that effectively depends only on
the previous state $S_i$.
After each transition is made, an observation, say the m-th
observation $o_m$, is produced from $S_j$ with a corresponding
observation probability $\{b_j(o_m)\}$. Note that the initial state
probabilities of the model are defined as $\{\pi_i\}$. In compact
notation $\lambda_p = \{\{a_{ij}\}, \{b_j(o_m)\}, \{\pi_i\}\}$ is set to indicate the p-th model
parameters, p=1,...,P. Therefore given an observation sequence
$O=[o_1, o_2, \ldots, o_T]$ that is obtained over a period of time T, the
likelihood probability $P(O|\lambda_p)$ can be calculated by tracing the
Viterbi path $Q=[q_1, q_2, \ldots, q_T]$:

$$P(O|\lambda_p) = \pi_{q_1} b_{q_1}(o_1)\, a_{q_1 q_2} b_{q_2}(o_2) \cdots a_{q_{T-1} q_T} b_{q_T}(o_T) = \pi_{q_1} b_{q_1}(o_1) \prod_{t=2}^{T} a_{q_{t-1} q_t}\, b_{q_t}(o_t) \qquad (1)$$

In many applications multiple features/measurements
(observations) are extracted/made at the same instant in time
and presented to the HMM process. These observations can be
represented statistically by two types of probability densities,
i.e. discrete or continuous.
When using discrete densities, a vector quantization (VQ)
technique is employed to represent an input vector with one of
the vectors that populate a finite size codebook. Sequences of
quantized vectors are then modeled by a single HMM-D
structure.
An alternative method (referred to as IM-HMM-D) can
also be formulated when features (elements) in the input vector
are assumed to be independent. In this case a single HMM
network is designed and employed for each discrete
(quantized) input feature (element). Thus when C features
$\{o_t^{(1)}, o_t^{(2)}, \ldots, o_t^{(C)}\}$ (vector elements) are defined at a given time
t, the system employs C HMM classifiers, i.e. there are C
models per class, and the total likelihood probability of the p-th
class $P(Y|\lambda_p)$, where $Y=\{O^{(1)}, O^{(2)}, \ldots, O^{(C)}\}$, is given as:

$$P(Y|\lambda_p) = \prod_{i=1}^{C} P(O^{(i)}|\lambda_p) \qquad (2)$$

In an alternative modeling approach S. Chiao and C.
Xydeas [7] recognized the existence of statistical dependencies
among the elements (features) of input observation vectors. In
their system (referred to as DM-HMM-D1) the resulting
likelihood probability is defined as:

$$\log P(Y|\lambda_p) = \sum_{i=1}^{C} \log\big(w_i(Y) \cdot P(O^{(i)}|\lambda_p)\big) \qquad (3)$$

where the value of the weight $w_i(Y)$ attached to the
likelihood probability of the i-th feature is a function of all C
features (observations). Furthermore $w_i(Y)$ varies with time and
effectively represents the "instantaneous" importance or
otherwise of individual input features within the HMM
classification framework.

A. Dependent-Multi-HMM-D2
A new model structure (DM-HMM-D2) has been
developed which is also based on the assumption that input
features are inter-dependent.
This network is an extension of the conventional HMM-D
methodology. It takes into account statistical dependencies and
hence possible relationships amongst different features by
employing conditional observation probabilities per feature
with respect to other features. Thus the maximum likelihood
probability of the i-th observation sequence for a given HMM-D
parameter set $\lambda_p$,

$$P(Y|\lambda_p) = \prod_{i=1}^{C} P(O^{(i)}|Y,\lambda_p) \qquad (4)$$

with

$$P(O^{(i)}|Y,\lambda_p) = f(\pi^{(i)}, a^{(i)}) \prod_{t=1}^{T} b^{(i)}(o_t^{(i)}|y_t) \qquad (5)$$

where $f(\pi^{(i)}, a^{(i)})$ is a function of initial state and state
transition probabilities, has been modified so that observation
probabilities of feature sequences are now "linked" to the
observation values of the remaining features.
In this case,

$$b^{(i)}(o_t^{(i)}|y_t) = b^{(i)}\big(o_t^{(i)} \,\big|\, \{o_t^{(1)}, \ldots, o_t^{(C)}\}\big) = \frac{\sum_{k=1}^{K}\sum_{t'=1}^{T_k} h(o_{k,t'}, o_t) \Big/ \sum_{k=1}^{K} T_k}{\sum_{k=1}^{K}\sum_{t'=1}^{T_k} h(U_{k,t'}(i), V_t(i)) \Big/ \sum_{k=1}^{K} T_k}$$

where $U_{k,t'}(i) = [o_{k,t'}^{(1)}, \ldots, o_{k,t'}^{(c)}, \ldots, o_{k,t'}^{(C)}]$
and $V_t(i) = [o_t^{(1)}, \ldots, o_t^{(c)}, \ldots, o_t^{(C)}]$, with $c \neq i$,
are calculated as the expected number of times in observing
$V_t(i)$ for all $U_{k,t'}(i)$ in K training data sets (i.e. K different sets of
observation sequences), k=1,2,...,K.
The counting function h(a,b) is one if and only if a=b,
otherwise it is zero. All values are stored in the codebook when
they are required to be used during the procedure.
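
The counting-based conditional observation probability that DM-HMM-D2 relies on can be illustrated with a minimal sketch. It assumes the training data are available as lists of quantization-index vectors and, for simplicity, ignores the hidden state; the function name and the toy data are hypothetical and are not taken from the authors' implementation.

```python
def conditional_obs_prob(training_sets, y_t, i):
    """Estimate P(y_t[i] | all other elements of y_t) by relative counts.

    training_sets: K sequences, each a list of equal-length index vectors.
    The counting function h(a, b) of the text is the equality test below.
    """
    others = tuple(v for c, v in enumerate(y_t) if c != i)
    joint = 0    # training vectors that match y_t in every element
    context = 0  # training vectors that match y_t in every element except i
    for sequence in training_sets:
        for vec in sequence:
            rest = tuple(v for c, v in enumerate(vec) if c != i)
            if rest == others:
                context += 1
                if vec[i] == y_t[i]:
                    joint += 1
    # The common normaliser (total number of training vectors) cancels.
    return joint / context if context else 0.0


# Toy data: K = 2 training sequences of two-element vectors (assumed values).
toy_sets = [[(1, 2), (1, 3), (1, 2)], [(0, 2), (1, 2)]]
p = conditional_obs_prob(toy_sets, (1, 2), i=0)
```

Because the same normaliser appears in the numerator and the denominator, only the two counts need to be stored, which is one plausible reading of the remark that all values are kept in a codebook.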

III. HMM BASED LSP RECOVERY

The HMM based estimation of missing LPC filter
information involves sets of ten LSP coefficients. These sets
are calculated originally from the analysis of successive 20
msecs speech segments and LPC coefficients are transformed
to LSP coefficients (an equivalent vocal tract parametric
representation form that is more appropriate for quantization),
prior to applying SMQ. Furthermore in these experiments SMQ
operates over four 20 msecs frames (one SMQ frame is equal to
four LSP frames, i.e. 80 msecs), that is over four sets of ten
LSP coefficients, in order to produce one set of ten SMQ
indexes ($LSP_i$, i=1,...,10). Also, the LPC information that is
contained within each transmitted packet corresponds to an 80
msecs speech segment (i.e. one SMQ frame).
Now, given that the current packet is missing and cannot
therefore be used in the speech synthesis process, an HMM
process that operates on previously received SMQ information
can be used to determine a "most likely" estimate for the
missing SMQ indices. In general, the HMM period of
observation time $T_{ob}$, which consists of current "c" (missing) and
previous "p" (received) SMQ frames, is $T_{ob}=(c+p)T_{pa}$, where
c<<p. In these experiments c=1, p=3 and $T_{pa}$=80 msecs.
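
As a worked check of the timing used in these experiments (all values are those quoted in the text):

```latex
T_{ob} = (c + p)\,T_{pa} = (1 + 3) \times 80\ \text{msecs} = 320\ \text{msecs},
\qquad
\frac{T_{ob}}{20\ \text{msecs}} = 16\ \text{LSP analysis frames per observation window}.
```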
In the case of a missing packet, the LSP
recovery/estimation process operates on four successive SMQ
frames with each frame represented by ten SMQ quantization
index values and the last frame being that of the missing
packet. Thus the estimation process employs a "bank" of ten
HMM models, i.e. i=1,...,10, with the "i-th" model operating on
the observation sequence $O^{(i)}=\{o_1, o_2, o_3, o_4\}^{(i)}$, see Figure 1.
Alternatively, this bank of HMMs operates on a sequence
of four observation (column) vectors $y_t=\{o_t^{(1)}, o_t^{(2)}, \ldots, o_t^{(10)}\}$,
where t=1,...,4 are the time indices of SMQ frames.
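
The two equivalent views of the data (four column vectors $y_t$, or ten per-coefficient sequences $O^{(i)}$) are related by a simple transpose. The sketch below assumes SMQ indices are held as plain Python lists; the function name and the index values are hypothetical.

```python
def per_coefficient_sequences(frames):
    """Turn T frames of C indices each into C sequences of length T.

    frames[t][i] is the SMQ index of coefficient i in frame t, so the
    result[i] is the observation sequence fed to the i-th HMM.
    """
    if len({len(frame) for frame in frames}) > 1:
        raise ValueError("all frames must have the same number of indices")
    return [list(seq) for seq in zip(*frames)]


# Four SMQ frames (t = 1..4) of ten made-up index values each.
frames = [[10 * t + i for i in range(10)] for t in range(4)]
sequences = per_coefficient_sequences(frames)
# sequences[0] is O^(1): the first index of each of the four frames.
```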
[Figure 1: block diagram. The Class-1 training data sets of observation vectors $y_1$, $y_2$, $y_3$, $y_4$ are split into the ten per-coefficient sequences $O^{(1)}, O^{(2)}, \ldots, O^{(10)}$, which are used to train HMM1, HMM2, ..., HMM10 for Class 1.]

Figure 1. Training HMMs on a per class basis.

However since the last (4th) observation vector is missing,
only the $y_1$, $y_2$ and $y_3$ vectors are available to the input of the
HMM bank. Thus over the 320 msecs observation period and
the original sixteen LPC vectors of ten coefficients each, which
were quantised to four SMQ vectors of ten coefficients each,
only the first three of the SMQ vectors are available to the LSP
recovery process, i.e.

$$Y = [y_1 \; y_2 \; y_3] \qquad (6)$$

Now, having $y_1$, $y_2$ and $y_3$ and using $\hat{y}_4$ to indicate an
estimate of the missing SMQ vector $y_4$, this estimate can be
selected such that the likelihood probability of observation
$Y^*=\{y_1, y_2, y_3, \hat{y}_4\}$, given an HMM model $\lambda$, i.e.
$P(Y^*=\{y_1, y_2, y_3, \hat{y}_4\}|\lambda)$, is maximum.
This effectively means that the HMM bank is designed, i.e.
trained, to classify input patterns of four SMQ vectors $\{y_1, y_2,
y_3, y_4\}$ to one of the classes j. Then given the resulting HMM
bank models $\lambda$ and an incomplete observation vector sequence
$\{y_1, y_2, y_3\}$, the system defines $\hat{y}_4$ so that
$P(Y^*=\{y_1, y_2, y_3, \hat{y}_4\}|\lambda)$ is maximum over all classes.
Of course each of these classes should represent clusters of
"similar" four-vector SMQ "sequences" Y whose
characteristics are captured by the corresponding bank of
HMM models. This concept of having clusters of "similar" Y
sequences of SMQ vectors can be easily accepted due to the
structure imposed on speech signals in general and LSP
tracks in particular by language rules and human speech
production mechanism constraints.
Note that an SMQ frame size of 80 msecs and $T_{ob}$ of 320
msecs were selected to reflect phoneme/syllabic durations. The
voiced-unvoiced nature of the speech signal was also selected
as the clustering criterion, an assumption that results however
in significant variability between Y sequences belonging to the
same class but, at the same time, leads to a small number of
classes, i.e. seven. Also note that voiced/unvoiced classifications are
produced at the output of a Voice Activity Detection (VAD)
process operating on a 20 msecs frame basis and thus Y
sequences defined over $T_{ob}$=320 msecs are classified into the
seven classes using 16 voiced/unvoiced flags (320/20=16). The
classes are 1) voiced, 2) unvoiced, 3) voiced to unvoiced, 4)
unvoiced to voiced, 5) voiced to unvoiced to voiced, 6)
unvoiced to voiced to unvoiced and 7) other.
Now given seven classes and ten different HMM networks
in the system bank, see Figure 1, the probability of observing
$O^{(i)}$ given the j-th class i-th HMM model is $P(O^{(i)}|\lambda_{j,i})$, with
i=1,...,10 and j=1,...,7. Then the total probability $P(Y|\lambda)$ over
the system bank can be maximized using:

(1)
$$P(Y|\lambda) = \max\{\, P(O^{(1)}|\lambda_{1,1}) \times P(O^{(2)}|\lambda_{1,2}) \times \cdots \times P(O^{(10)}|\lambda_{1,10}),\;
P(O^{(1)}|\lambda_{2,1}) \times P(O^{(2)}|\lambda_{2,2}) \times \cdots \times P(O^{(10)}|\lambda_{2,10}),\; \ldots,\;
P(O^{(1)}|\lambda_{7,1}) \times P(O^{(2)}|\lambda_{7,2}) \times \cdots \times P(O^{(10)}|\lambda_{7,10}) \,\}$$

or

(2)
$$P(Y|\lambda) = \max\{ P(O^{(1)}|\lambda_{1,1}), P(O^{(1)}|\lambda_{2,1}), \ldots, P(O^{(1)}|\lambda_{7,1}) \}
\times \max\{ P(O^{(2)}|\lambda_{1,2}), P(O^{(2)}|\lambda_{2,2}), \ldots, P(O^{(2)}|\lambda_{7,2}) \}
\times \cdots
\times \max\{ P(O^{(10)}|\lambda_{1,10}), P(O^{(10)}|\lambda_{2,10}), \ldots, P(O^{(10)}|\lambda_{7,10}) \}$$
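
The difference between selection rules (1) and (2) can be sketched in a few lines. The sketch assumes that, for one candidate estimate of the missing vector, the likelihoods $P(O^{(i)}|\lambda_{j,i})$ have already been computed and stored as a table `P[j][i]` (classes by models); the function names and the toy numbers are hypothetical.

```python
import math


def method1(P):
    """Rule (1): best single class, i.e. max over j of prod over i of P[j][i].

    Returns (score, class index).
    """
    scores = [math.prod(row) for row in P]
    best = max(range(len(P)), key=scores.__getitem__)
    return scores[best], best


def method2(P):
    """Rule (2): product over i of the per-model maximum over classes j.

    Returns (score, list with the winning class index for each model i).
    """
    n_models = len(P[0])
    picks = [max(range(len(P)), key=lambda j, i=i: P[j][i])
             for i in range(n_models)]
    score = math.prod(P[j][i] for i, j in enumerate(picks))
    return score, picks


# Toy table with 2 classes and 2 models (the paper uses 7 classes, 10 models).
P = [[0.5, 0.1],
     [0.2, 0.4]]
score1, best_class = method1(P)
score2, class_per_model = method2(P)
```

Since rule (2) maximises each factor separately, its score can never be smaller than that of rule (1). With real data, log-likelihoods would normally be summed instead of multiplying probabilities, to avoid numerical underflow.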

That is, $P(Y|\lambda)$ is maximized by defining (1) the largest
product of 10 $P(O^{(i)}|\lambda_{j,i})$ probabilities, with each product
defined within a given class, i.e. the value of j is fixed, or (2)
each of the 10 probability terms (i=1,...,10) which form the
product that defines $P(Y|\lambda)$ maximum is itself the maximum
$P(O^{(i)}|\lambda_{j,i})$ value across j=1,...,7 with the value of i fixed.

IV. EXPERIMENTAL PROCEDURE AND RESULTS

A total of 20,000 seconds (5.56 hours) of speech signals was
used as input to the HMM models training process. Having an
SMQ index vector estimated every 80 msecs, a total of 250,000
SMQ vectors (20,000,000/80=250,000) were produced. A
super frame window of length 320 msecs (4 SMQ frames) with a
shift step of 80 msecs (i.e. 1 vector) was then employed in order to
generate the appropriate HMM model training data. Thus the
window moves one SMQ frame at a time and for each shift it
generates a training data set of four SMQ vectors. This gave
rise to a total of 249,997 training data sets.
The performance of the proposed SMQ index estimation
(LSP recovery) methodology has been evaluated
experimentally via computer simulation with the HMM models
bank system implemented using two different approaches, that
is IM-HMM-D and DM-HMM-D2 [8]. Furthermore system
performance was evaluated while processing input files (e.g.
"Hello operator" file yielding 307 SMQ vectors), which were
not included in the generation of the data used in the HMM
model training process.
In these experiments missing packets/SMQ vectors were
"introduced" at regular intervals in the sequence of SMQ
vectors generated for each of the test files. That is, "m" SMQ
vectors were assumed as missing every "n" SMQ vectors in
the test sequence, for example m=1 and n=8. In addition it has
been assumed that the transmitted LSP parameters are those of
a Pitch Synchronous Prototype Interpolation Manchester coder
[9] operating at 2.4 Kbits/sec and thus voiced/unvoiced flags
are also available at the receiver.
The quality of the recovered speech signal at the output of
the Pitch Synchronous Prototype Interpolation Manchester
decoder was measured using the objective Perceptual
Evaluation of Speech Quality (PESQ) method [10] that returns
a quality score in the region of 0 to 5. In addition informal
subjective tests were also carried out.
Table 1 shows the PESQ scores obtained by using
IM-HMM-D and DM-HMM-D2 to predict the missing speech LSP
vectors.
The encoder/decoder performance in the last column is the
reference performance that is obtained by reproducing missing
speech LSP vectors ($y_4$) using immediately previously received
vectors ($y_3$). These results are typical of those obtained from
several experiments and show that for the "Hello" input speech
file and with 12.5% of LSP information being replaced by
estimates, DM-HMM-D2 provides higher scores than
IM-HMM-D. This highlights the importance of inter-dependencies
between LSPs and the fact that the conventional IM-HMM-D
approach ignores them. As a result the use of IM-HMM-D
within this specific LSP recovery system configuration
provides worse results than the reference "use previous value"
approach, i.e. use the received $y_3$ SMQ vector. A PESQ score of
1.995804 is obtained via applying the DM-HMM-D2 process
and Method 2.

Table 1: PESQ SCORES USING HMM BASED LSP RECOVERY METHOD (HELLO FILE, M=1 AND N=8)

|                            | IM-HMM-D |          | DM-HMM-D2 |          | Encoder/Decoder Performance (with missing vectors) |
| Probability Selection type | Method 1 | Method 2 | Method 1  | Method 2 |                                                    |
| PESQ Score (Hello)         | 1.804188 | 1.808352 | 1.89119   | 1.995804 | 1.946306                                           |
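
The data preparation described in this section (sliding a four-vector window over the training sequence, and marking "m" out of every "n" test vectors as missing) can be sketched as follows. The function names are hypothetical, and which vector inside each group of n is dropped is an assumption, since the text only states the ratio.

```python
def training_windows(vectors, width=4, step=1):
    """Slide a window of `width` SMQ vectors over the sequence, `step` at a time."""
    last_start = len(vectors) - width
    return [vectors[s:s + width] for s in range(0, last_start + 1, step)]


def loss_mask(n_vectors, m=1, n=8):
    """True where a vector is treated as missing.

    Assumption: the last m vectors of every group of n are the lost ones.
    """
    return [(k % n) >= n - m for k in range(n_vectors)]


# 250,000 SMQ vectors (integers stand in for the ten-index vectors).
windows = training_windows(list(range(250_000)))
print(len(windows))  # 249997, matching the number of training data sets

mask = loss_mask(16)  # vectors 7 and 15 are marked as missing (m=1, n=8)
```

With m=1 and n=8 one vector in eight is dropped, which is the 12.5% figure quoted in the discussion of the results.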


V. CONCLUDING REMARKS

Experimental results show only a slight improvement in
terms of PESQ scores and this only in the case of
employing DM-HMM-D2 with Method 2. Informal subjective
tests on the other hand suggest a more useful advantage.
From these experiments and when looking into the HMM
model parameter sets, the diagonal values of the matrix A,
where $A=\{a_{ij}\}$ represents hidden state transition probabilities,
$1 \le i,j \le N$ and N is the number of hidden states, are found to be
significantly higher than other values. This implies that when
estimating the next observation ($o_4^*$), $o_4$ is more likely to be
generated by the hidden state $S_j$ which previously produced the
observation $o_3$. In this case, $o_4$ is estimated by selecting the m
with the highest observation probability $b_j(o_m)$, where $1 \le m \le M$
and M is the number of possible quantized LSP vectors, i.e. the
estimated $o_4$ is highly likely to assume the same value as that
given in the previous hidden state $S_j$.
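
The effect described above can be reproduced with a small discrete HMM. The sketch below is an illustration under stated assumptions, not the authors' estimator: it finds the final state of the Viterbi path for the received observations, moves to the most probable next state, and returns the symbol with the highest observation probability in that state. All parameter values are made up.

```python
def viterbi_last_state(pi, A, B, obs):
    """Final state of the most likely state path for `obs` (plain probabilities)."""
    n = len(pi)
    delta = [pi[j] * B[j][obs[0]] for j in range(n)]
    for o in obs[1:]:
        delta = [max(delta[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return max(range(n), key=delta.__getitem__)


def estimate_next_symbol(pi, A, B, obs):
    """Most probable next state given the last state, then its most probable symbol."""
    last = viterbi_last_state(pi, A, B, obs)
    nxt = max(range(len(A)), key=A[last].__getitem__)
    return max(range(len(B[nxt])), key=B[nxt].__getitem__)


# Two states, three symbols; the diagonal of A dominates, as observed in the text.
pi = [0.6, 0.4]
A = [[0.9, 0.1],
     [0.1, 0.9]]
B = [[0.7, 0.2, 0.1],
     [0.1, 0.2, 0.7]]
estimate = estimate_next_symbol(pi, A, B, [2, 2, 2])
```

With a strongly diagonal A the model stays in the state that produced the last received observation, so the estimate tends to repeat that observation, which is consistent with the small gain over the "use previous value" reference.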
This situation arises from the use of a small number of
classes, i.e. seven, to represent/model the various Y sequences
of SMQ indexes. As mentioned earlier, each of these
classes should represent clusters of "similar" four-vector SMQ
"sequences" Y, and the current voiced/unvoiced based criterion
for generating these classes introduces an excessive degree
of intra-class variability. A different clustering criterion that
relates better to the "shape" characteristics of the SMQ
sequences, and thus offers enhanced similarity between
sequences belonging to the same class, is therefore required.
This is expected to significantly increase system performance
since classes will capture broad language based signal
characteristics.
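
For reference, the voiced/unvoiced clustering criterion discussed above can be sketched as a mapping from the 16 per-frame flags of one observation window to the seven classes of Section III. The rule used here (collapse the flags into runs and match the run pattern) is an assumption, since the paper does not spell out the exact decision logic; the names are hypothetical.

```python
from itertools import groupby

# Run patterns for classes 1 to 6; anything else falls into class 7 ("other").
RUN_PATTERN_TO_CLASS = {
    ("V",): 1,
    ("U",): 2,
    ("V", "U"): 3,
    ("U", "V"): 4,
    ("V", "U", "V"): 5,
    ("U", "V", "U"): 6,
}


def vu_class(flags):
    """Map a sequence of voiced (True) / unvoiced (False) flags to a class 1..7."""
    runs = tuple("V" if voiced else "U" for voiced, _ in groupby(flags))
    return RUN_PATTERN_TO_CLASS.get(runs, 7)


# 16 flags = one 320 msecs window at 20 msecs per flag.
example = [True] * 8 + [False] * 8
label = vu_class(example)  # class 3: voiced to unvoiced
```

Because the pattern ignores where inside the window a transition occurs, very different index sequences share a class, which illustrates the intra-class variability noted above.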
REFERENCES
[1] R. Arcomano, VoIP How to Tutorial, Technical Report, Free Software
Foundation Inc., Boston, USA, 2001.
[2] J. Tyson and R. Valdes, How VoIP Works, Technical Report,
HowStuffWorks Inc., 2004.
[3] P. Ferguson and G. Huston, Quality of Service: Delivering QoS on the
Internet and in Corporate Networks, John Wiley and Sons Publishers,
New York, 1998.
[4] C. S. Xydeas and C. Papanastasiou, Split Matrix Quantisation of LPC
parameters, Proceedings of IEEE Transactions in Speech and Audio,
Vol. 7, No. 2, pp. 113-125, March 1999.
[5] Lawrence R. Rabiner, A Tutorial on Hidden Markov Models and
Selected Application in Speech Recognition, Proceedings of IEEE,
vol. 77, no. 2, pp. 257-285.
[6] Zoubin Ghahramani, An Introduction to Hidden Markov Models and
Bayesian Networks, International Journal of Pattern Recognition and
Artificial Intelligence, vol. 15, no. 1, pp. 9-42, 2001.
[7] Shih-Yang Chiao and C. S. Xydeas, Modelling Behaviors of Players in
Competitive Environment, IEEE/WIC International Conference on
Intelligent Agent Technology, pp. 566-569.
[8] Shih-Yang Chiao and Costas S. Xydeas, Using Hierarchical HMMs in
Dynamic Behaviour Modelling, Proceedings of Seventh International
Conference on Information Fusion, Fusion 2004, pp. 576-582,
Stockholm, Sweden, 2004.
[9] F. Zafiropoulos and C. Xydeas, Model Based Packet loss Concealment for
AMR Coders, IEEE International Conference on Acoustics, Speech and
Signal Processing, Vol. I, pp. 112-115, April 2003.
[10] P. Ordas and B. Fox, Perceptual Evaluation of Speech Quality (PESQ),
Technical Report, Microtronix Systems Ltd, 2004.