

TABLE II
GROUP WITH THE OPTIMAL STATISTICAL PERFORMANCE FOR THE FIRST-ORDER MARKOV PROCESS

[The table lists, for block lengths n = 12, 16, 18, 24, 30, 32, 36, 40, 42, 48, 54, 56, 60, and 64 and several values of the correlation coefficient \rho, the group with the optimal statistical performance; entries are groups such as S_3 \times C_k, Q_2 \times C_k, and cyclic groups C_n. The individual entries are not recoverable from this scan.]

TABLE III
DISPERSION FOR THE GROUP FILTER WITH G = S_3 x C_2

  rho     D_{S_3 x C_2}
  0.50    0.4545
  0.60    0.4294
  0.70    0.3944
  0.80    0.3437
  0.90    0.2628
  0.92    0.2398
  0.94    0.2128
  0.96    0.1800
  0.99    0.111

TABLE IV
GROUP WITH THE OPTIMAL STATISTICAL PERFORMANCE FOR THE RANDOM SINE WAVE

[The table lists, for n = 8, 12, 16, 18, 24, 30, 32, 36, 40, 42, 48, 54, 56, 60, and 64 and several parameter values (e.g., 1.01, 1.05, 1.1), the optimal group; entries are groups such as C_8, C_12, C_16, S_3 \times C_k, and Q_2 \times C_k. The individual entries are not recoverable from this scan.]

ACKNOWLEDGMENT

The authors wish to thank Mr. O. Zimmerman for conducting the computer experiments of Section III.

Maximum Likelihood Estimation for Multivariate Mixture Observations of Markov Chains


B. H. JUANG, STEPHEN E. LEVINSON, AND M. M. SONDHI, SENIOR MEMBER, IEEE

Abstract—To use probabilistic functions of a Markov chain to model certain parameterizations of the speech signal, we extend an estimation technique of Liporace to the cases of multivariate mixtures, such as Gaussian sums, and products of mixtures. We also show how these problems relate to Liporace's original framework.
Manuscript received September 10, 1984; revised July 12, 1985. This work was presented at the 1985 IEEE International Symposium on Information Theory, Brighton, England, June 24-28. The authors are with the Acoustics Research Department, Bell Laboratories, Murray Hill, NJ 07974. IEEE Log Number 8406633.



INTRODUCTION

In a recently published paper, Liporace [6] derived a method for estimating the parameters of a broad class of elliptically symmetric probabilistic functions of a Markov chain. The corollary to that work presented here was motivated by the desire to use this general technique to model the speech signal for which it is known [4], [8] that, unfortunately, certain of its most useful parameterizations do not possess the prescribed symmetry. Since any continuous probability density function can be approximated arbitrarily closely by a normal mixture [9], it is reasonable to use such constructs to avoid the restrictions imposed by the requirement of elliptical symmetry. In this correspondence we adapt the method and proof of [6] to two types of mixture densities.
NOMENCLATURE

Throughout this presentation we shall, where possible, adopt the notation used in [6]. Consider an unobservable n-state Markov chain with state transition matrix A = [a_{ij}]_{n \times n}. Associated with each state j of the hidden Markov chain is a probability density function b_j(x) of the observed d-dimensional random vector x. Here we shall consider densities of the form

b_j(x) = \sum_{k=1}^{m} c_{jk} \mathcal{N}(x, \mu_{jk}, U_{jk})                    (1)

where m is known; c_{jk} \ge 0 for 1 \le j \le n, 1 \le k \le m; \sum_{k=1}^{m} c_{jk} = 1 for 1 \le j \le n; and \mathcal{N}(x, \mu, U) denotes the d-dimensional normal density function of mean vector \mu and covariance matrix U. It is convenient then to think of our hidden Markov chains as being defined over a parameter manifold \Lambda = \mathscr{A} \times \mathscr{C} \times \mathscr{R}^d \times \cdots \times \mathscr{U}^d, where \mathscr{A} is the set of all n x n row-wise stochastic matrices; \mathscr{C} is the set of all n x m row-wise stochastic matrices; \mathscr{R}^d is the usual d-dimensional Euclidean space; and \mathscr{U}^d is the set of all d x d real symmetric positive definite matrices. Then for a given sequence of observations, O = O_1, O_2, \ldots, O_T, of the vector x and a particular choice of parameter values \lambda \in \Lambda, we can efficiently evaluate the likelihood function L_\lambda(O) of the hidden Markov chain by the forward-backward method of Baum [1]. The forward and backward partial likelihoods, \alpha_t(j) and \beta_t(i), are computed recursively from

\alpha_t(j) = \left[ \sum_{i=1}^{n} \alpha_{t-1}(i) a_{ij} \right] b_j(O_t)        (2)

and

\beta_t(i) = \sum_{j=1}^{n} a_{ij} b_j(O_{t+1}) \beta_{t+1}(j)                     (3)

respectively, for any t between 1 and T - 1. The recursion is initialized by setting \alpha_0(1) = 1, \alpha_0(j) = 0 for 2 \le j \le n, and \beta_T(i) = 1 for 1 \le i \le n, whereupon we may write

L_\lambda(O) = \sum_{i=1}^{n} \alpha_T(i).                                         (4)
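Concretely, (1)-(4) map directly onto a few lines of code. The following is a minimal NumPy sketch of ours (it is not part of the correspondence, and all function names are ours; scipy's multivariate_normal supplies the density \mathcal{N}). It uses the unscaled recursions and therefore underflows for long sequences, a point taken up in the Discussion:

import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(x, c, mu, U):
    """b_j(x) of (1) for every state j: c[j,k] weights, mu[j,k] means, U[j,k] covariances."""
    n, m = c.shape
    b = np.zeros(n)
    for j in range(n):
        for k in range(m):
            b[j] += c[j, k] * multivariate_normal.pdf(x, mean=mu[j, k], cov=U[j, k])
    return b

def forward_backward(O, A, c, mu, U):
    """Unscaled recursions (2)-(3) with alpha_0 = (1, 0, ..., 0) and beta_T = 1."""
    T, n = len(O), A.shape[0]
    B = np.array([mixture_density(O[t], c, mu, U) for t in range(T)])  # B[t,j] = b_j(O_t)
    alpha = np.zeros((T, n))
    alpha[0] = A[0] * B[0]              # alpha_1(j) = a_{1j} b_j(O_1), since alpha_0(1) = 1
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[t]
    beta = np.ones((T, n))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[t + 1] * beta[t + 1])
    L = alpha[-1].sum()                 # likelihood L_lambda(O) of (4)
    return alpha, beta, B, L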

THE ESTIMATION ALGORITHM

The parameter estimation problem is then one of maximizing L_\lambda(O) with respect to \lambda for a given O. One way to maximize L_\lambda is to use conventional methods of constrained optimization. Liporace, on the other hand, advocates a reestimation technique analogous to that of Baum et al. [1], [2]. It is essentially a mapping \mathcal{T}: \Lambda \to \Lambda with the property that

L_{\mathcal{T}(\lambda)}(O) \ge L_\lambda(O)                                       (5)

with equality iff \lambda is a critical point of L_\lambda(O), that is, \nabla_\lambda L_\lambda(O) = 0. Thus a recursive application of \mathcal{T} to some initial value of \lambda converges to a local maximum (or possibly an inflection point) of the likelihood function. Liporace's result [6] relaxed the original requirement of Baum et al. [2] that b_j(x) be strictly log concave to the requirement that it be strictly log concave and/or elliptically symmetric. We will further extend the class of admissible pdf's to mixtures and products of mixtures of strictly log concave and/or elliptically symmetric densities. For the present problem we will show that a suitable mapping \mathcal{T} is given by the following equations:

\bar{a}_{ij} = \mathcal{T}(a_{ij}) = \frac{\sum_{t=1}^{T-1} \alpha_t(i) a_{ij} b_j(O_{t+1}) \beta_{t+1}(j)}{\sum_{t=1}^{T-1} \alpha_t(i) \beta_t(i)}          (6)

\bar{c}_{jk} = \mathcal{T}(c_{jk}) = \frac{\sum_{t=1}^{T} \rho_t(j, k)}{\sum_{t=1}^{T} \alpha_t(j) \beta_t(j)}                                              (7)

\bar{\mu}_{jk} = \mathcal{T}(\mu_{jk}) = \frac{\sum_{t=1}^{T} \rho_t(j, k) O_t}{\sum_{t=1}^{T} \rho_t(j, k)}                                                (8)

and

\bar{U}_{jk} = \mathcal{T}(U_{jk}) = \frac{\sum_{t=1}^{T} \rho_t(j, k) (O_t - \bar{\mu}_{jk})(O_t - \bar{\mu}_{jk})'}{\sum_{t=1}^{T} \rho_t(j, k)}          (9)

for 1 \le i, j \le n, 1 \le k \le m, and 1 \le r, s \le d. In (6)-(9),

\rho_t(j, k) = \left[ \sum_{i=1}^{n} \alpha_{t-1}(i) a_{ij} \right] c_{jk} \mathcal{N}(O_t, \mu_{jk}, U_{jk}) \beta_t(j)                                   (10)

both for t = 1, where the bracketed sum reduces to a_{1j} by the initialization of \alpha_0, and for 1 < t \le T.
(For fixed k, \rho_t(j, k) is formally identical to \rho_t(j), as defined by Liporace.)
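The mapping \mathcal{T} of (6)-(10) likewise admits a compact implementation. Continuing the sketch above (again ours, unscaled, and purely illustrative):

def reestimate(O, A, c, mu, U):
    """One application of the mapping T of (6)-(10); an unscaled sketch."""
    O = np.asarray(O)
    T, n = len(O), A.shape[0]
    m = c.shape[1]
    alpha, beta, B, L = forward_backward(O, A, c, mu, U)

    # rho[t,j,k] of (10): weight of state j and mixture term k at time t.
    N = np.zeros((T, n, m))                 # N[t,j,k] = normal density of O_t under (j,k)
    for j in range(n):
        for k in range(m):
            N[:, j, k] = multivariate_normal.pdf(O, mean=mu[j, k], cov=U[j, k])
    pred = np.empty((T, n))
    pred[0] = A[0]                          # sum_i alpha_0(i) a_{ij} = a_{1j}
    pred[1:] = alpha[:-1] @ A               # sum_i alpha_{t-1}(i) a_{ij}
    rho = pred[:, :, None] * c[None] * N * beta[:, :, None]

    # (6): transition probabilities.
    num = np.einsum('ti,ij,tj,tj->ij', alpha[:-1], A, B[1:], beta[1:])
    A_new = num / (alpha[:-1] * beta[:-1]).sum(0)[:, None]
    # (7): mixture weights.
    c_new = rho.sum(0) / (alpha * beta).sum(0)[:, None]
    # (8)-(9): means and covariances, weighted by rho.
    w = rho.sum(0)                          # (n, m) total weights per component
    mu_new = np.einsum('tjk,td->jkd', rho, O) / w[:, :, None]
    diff = O[:, None, None, :] - mu_new[None]         # O_t - mu_bar_{jk}
    U_new = np.einsum('tjk,tjkr,tjks->jkrs', rho, diff, diff) / w[:, :, None, None]
    return A_new, c_new, mu_new, U_new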

Proof of the Formulas: Equations (6) and (7) for the reestimation of a_{ij} and c_{jk} follow directly from a theorem of Baum and Sell [3], because the likelihood function L_\lambda(O) given in (4) is a polynomial with nonnegative coefficients in the variables a_{ij}, c_{jk}, 1 \le i, j \le n, 1 \le k \le m. To prove (8) and (9) our strategy, following Liporace, is to define an appropriate auxiliary function Q(\lambda, \bar{\lambda}). This function will have the property that Q(\lambda, \bar{\lambda}) > Q(\lambda, \lambda) implies L_{\bar{\lambda}}(O) > L_\lambda(O). Further, as a function of \bar{\lambda} for any fixed \lambda, Q(\lambda, \bar{\lambda}) will have a unique global maximum given by (6)-(9).

As a first step to derive such a function we express the likelihood function as a sum over the set, \mathscr{S}, of all state sequences S = (s_1, s_2, \ldots, s_T):

L_\lambda(O) = \sum_{S \in \mathscr{S}} \prod_{t=1}^{T} a_{s_{t-1} s_t} b_{s_t}(O_t)                                                                      (11)

with the convention s_0 = 1.



Let us partition the likelihood function further by choosing a particular sequence, K = (k_1, k_2, \ldots, k_T), of mixture densities. As in the case of state sequences, we denote the set of all mixture sequences as \mathscr{K} = \{1, 2, \ldots, m\}^T. Thus for some particular K \in \mathscr{K} we can write the joint likelihood of O, S, and K as

L_\lambda(O, S, K) = \prod_{t=1}^{T} a_{s_{t-1} s_t} c_{s_t k_t} \mathcal{N}(O_t, \mu_{s_t k_t}, U_{s_t k_t}).                                             (12)

We have now succeeded in partitioning the likelihood function as

L_\lambda(O) = \sum_{S \in \mathscr{S}} \sum_{K \in \mathscr{K}} L_\lambda(O, S, K).                                                                       (13)

In view of the similarity of the representation (13) to that of L_\lambda in [6], we now define the auxiliary function

Q(\lambda, \bar{\lambda}) = \sum_{S} \sum_{K} L_\lambda(O, S, K) \log L_{\bar{\lambda}}(O, S, K).                                                          (14)

When the expressions for L_\lambda and L_{\bar{\lambda}} derived from (12) are substituted in (14) and terms are collected by state and mixture index, we get

Q(\lambda, \bar{\lambda}) = \sum_{t=1}^{T} \sum_{j=1}^{n} \sum_{k=1}^{m} \gamma_{t,j,k} \log\left[ \bar{c}_{jk} \mathcal{N}(O_t, \bar{\mu}_{jk}, \bar{U}_{jk}) \right] + \text{terms involving only } \bar{a}_{ij}                                  (15)

where \gamma_{t,j,k} \ge 0. The innermost summation in (15) is formally identical to that used by Liporace in his proof; therefore, the properties which he demonstrated for his auxiliary function with respect to \mu and U hold in our case as well, thus giving us (8) and (9). We may thus conclude that (5) is correct for \mathcal{T} defined by (6)-(9). Furthermore, the parameter separation made explicit in (12)-(15) allows us to apply the same algorithm to mixtures of strictly log concave densities and/or elliptically symmetric densities as treated by Liporace in [6].
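For completeness (this step is standard and only implicit in the argument above), the fact that increasing Q increases the likelihood follows from the concavity of the logarithm. Writing p(S, K) = L_\lambda(O, S, K) / L_\lambda(O), which by (13) sums to one, Jensen's inequality gives

\log \frac{L_{\bar{\lambda}}(O)}{L_\lambda(O)} = \log \sum_{S, K} p(S, K)\, \frac{L_{\bar{\lambda}}(O, S, K)}{L_\lambda(O, S, K)} \ge \sum_{S, K} p(S, K) \log \frac{L_{\bar{\lambda}}(O, S, K)}{L_\lambda(O, S, K)} = \frac{Q(\lambda, \bar{\lambda}) - Q(\lambda, \lambda)}{L_\lambda(O)},

so Q(\lambda, \bar{\lambda}) > Q(\lambda, \lambda) indeed implies L_{\bar{\lambda}}(O) > L_\lambda(O).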

DISCUSSION

In [6] Liporace notes that by setting a_{ij} = p_j, 1 \le j \le n, for all i, the special case of a single mixture can be treated. It is natural then to think of using a model with n clusters of m states, each with a single associated Gaussian density function, as a way of treating the Gaussian mixture problem considered here.

The transformation can be accomplished in the following way. First we expand the state space of our n-state model as shown in Fig. 1, in which we have added states j_1 through j_{m+1} for each state j in the original Markov chain. Associated with states j_1, j_2, \ldots, j_m are distinct Gaussian densities corresponding to the m terms of the jth Gaussian mixture in our initial formulation. The transitions exiting state j have probabilities equal to the corresponding mixture weights. State j_{m+1} is a distinguished state that is entered with probability 1 from the other new states, exits to state j' with probability a_{jj'}, and generates no observation in so doing. The transition matrix for this configuration can be written down by inspection. A large number of the entries in it will be zero or unity. As these are unaltered by (6) and (7), they need not be reestimated. Using this reconfiguration of the state diagram, Liporace's formulas can be used in case b_j(x) is any mixture of elliptically symmetric densities.

Fig. 1. Equivalent (m + 2)-state configuration for each state with an m-term Gaussian mixture.
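The construction just described can be made concrete. The following sketch (our reading of Fig. 1, with an index layout of our own choosing) assembles the expanded transition matrix from A and the mixture-weight matrix c:

def expand_states(A, c):
    """Transition matrix of the Fig. 1 construction (an illustrative reading).

    Each original state j becomes m + 2 states: a non-emitting entry state,
    m emitting states (one Gaussian term each), and a non-emitting collector
    state j_{m+1} that carries the original transitions A[j, :].
    """
    n, m = c.shape
    P = np.zeros((n * (m + 2), n * (m + 2)))
    for j in range(n):
        entry, coll = j * (m + 2), j * (m + 2) + m + 1
        for k in range(m):
            P[entry, entry + 1 + k] = c[j, k]   # choose mixture term k
            P[entry + 1 + k, coll] = 1.0        # every mixture state feeds the collector
        for j2 in range(n):
            P[coll, j2 * (m + 2)] = A[j, j2]    # original transition j -> j2
    return P

As the text notes, most entries of P are zero or one; only the blocks holding c and A carry free parameters.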

A variant on the Gaussian mixture theme results from using b_j(x) of the form of a product of mixtures,

b_j(x) = \prod_{\delta=1}^{D} \sum_{k=1}^{m} c_{j \delta k} \mathcal{N}(x^{(\delta)}, \mu_{j \delta k}, U_{j \delta k})                                    (16)

where the observation vector is partitioned into D sub-vectors x^{(1)}, \ldots, x^{(D)}. What we have considered so far is the special case of (16) for D = 1. From the structure of our derivation it is clear that for hidden Markov chains having densities of the form (16), reestimation formulas can be derived as before by solving \nabla_{\bar{\lambda}} Q(\lambda, \bar{\lambda}) = 0. Such solutions will yield results quite analogous to (6)-(10). Note that this case too can be represented as a reconfiguration of the state diagram.

One numerical difficulty which may be manifest in the methods described is the phenomenon noted by Nadas [7], in which one or more of the mean vectors converge to a particular observation while the corresponding covariance matrix approaches a singular matrix. Under these conditions, L_\lambda(O) \to \infty but the value of \lambda is meaningless. A practical, if unedifying, remedy for this difficulty is to try a different initial \lambda. Alternatively, one can drop the offending term from the mixture, since it is contributing only at one point of the observation space.

Finally, we call attention to two minor facets of these algorithms. First, for flexibility in modeling, the number of terms in each mixture may vary with state, so that m in (1) could as well be m_j. A similar dependence on dimension results if m in (16) is replaced by m_{j\delta}. In either case, the constraints on the mixture weights must be satisfied. Second, for realistic numbers of observations, for example, T \ge 5000, the reestimation formulas will underflow on any existing computer. The basic scaling mechanism described in [5] can be used to alleviate the problem, but it must be modified to account for the fact that the scaled \alpha\beta product will be missing the tth scale factor. To divide out the product of scale factors, the tth summand in both numerator and denominator of (7), (9), and (10) must be multiplied by the missing coefficient.

At this writing, numerical experiments based on Monte Carlo simulations and classification experiments using real speech signals are being conducted. We hope to report the results of these studies upon their completion.
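As an illustration of the scaling point, here is a sketch of the standard scaled forward pass of [5], continuing the earlier sketches (ours, not the paper's code); the closing comment marks where the missing tth scale factor enters the reestimation sums:

def scaled_forward(O, A, c, mu, U):
    """Scaled forward recursion in the spirit of [5]; an illustrative sketch."""
    T, n = len(O), A.shape[0]
    B = np.array([mixture_density(O[t], c, mu, U) for t in range(T)])
    alpha_hat = np.zeros((T, n))
    g = np.zeros(T)                          # g[t] = reciprocal of the step-t row sum
    for t in range(T):
        a = A[0] * B[0] if t == 0 else (alpha_hat[t - 1] @ A) * B[t]
        g[t] = 1.0 / a.sum()
        alpha_hat[t] = a * g[t]
    log_L = -np.log(g).sum()                 # log L_lambda(O), free of underflow
    # A scaled alpha at time t-1 combined with a scaled beta at time t carries
    # every scale factor except g[t]; multiplying the tth summand of the
    # reestimation sums by g[t], as described in the text, restores a factor
    # common to numerator and denominator, which then cancels.
    return alpha_hat, g, log_L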

REFERENCES

[1] L. E. Baum, "An inequality and associated maximization technique in statistical estimation for probabilistic functions of a Markov process," in Inequalities, vol. III, O. Shisha, Ed. New York: Academic, 1972, pp. 1-8.
[2] L. E. Baum, T. Petrie, G. Soules, and N. Weiss, "A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains," Ann. Math. Statist., vol. 41, pp. 164-171, 1970.
[3] L. E. Baum and G. R. Sell, "Growth transformations for functions on manifolds," Pac. J. Math., vol. 27, pp. 211-227, 1968.
[4] A. H. Gray and J. D. Markel, "Quantization and bit allocation in speech processing," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-24, pp. 459-473, Dec. 1976.
[5] S. E. Levinson, L. R. Rabiner, and M. M. Sondhi, "An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition," Bell Syst. Tech. J., vol. 62, pp. 1035-1074, Apr. 1983.
[6] L. R. Liporace, "Maximum likelihood estimation for multivariate observations of Markov sources," IEEE Trans. Inform. Theory, vol. IT-28, pp. 729-734, Sept. 1982.
[7] A. Nadas, "Hidden Markov chains, the forward-backward algorithm, and initial statistics," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-31, pp. 504-506, Apr. 1983.
[8] L. R. Rabiner, J. G. Wilpon, and J. G. Ackenhusen, "On the effects of varying analysis parameters on an LPC-based isolated word recognizer," Bell Syst. Tech. J., vol. 60, pp. 893-911, 1981.
[9] H. W. Sorenson and D. L. Alspach, "Recursive Bayesian estimation using Gaussian sums," Automatica, vol. 7, pp. 465-479, 1971.

