
Chapter 1 Introduction

Introduction to Speech Processing, 5th Edition, 2010

Summary
In this chapter the basic skills needed to work with the human speech signal will be introduced. The human speech signal is one of the human interface tools: the human body can interact with its environment using the hands, eyes, hearing, touch, smell and speech. Electrical engineers treat the Human Speech Signal (HSS) as a digital signal that has some exclusive features: it is a random signal, and it carries a lot of information encoded into patterns. In this chapter the mathematical skills needed for speech signal manipulation will be introduced. In addition, the speech production mechanism will be illustrated.

Objectives
Understanding the human speech production mechanism.
Recalling knowledge of digital filters.
Recalling knowledge of statistical processes and random variables.



1. Human speech production mechanism


Speech is the main method by which humans interact with one another; it is their primary way of communicating. Figure 1 introduces a basic model of a speech communication system.

Figure 1 Schematic of speech production/ perception


The produced speech signal carries the human message; the information is encoded within the signal, and the human brain has the ability to encode and decode it. The process starts in the talker's brain by formulating a certain message. The message is then encoded into a certain phonetic sequence. The phoneme is the smallest information unit in speech technology; it is like the character in any written language. Each phone is then produced by sending a certain sequence of commands to the muscles controlling the vocal tract and vocal cords. The resulting signal is an analog signal. It is radiated from the mouth through the surrounding air: it perturbs the air particles so that the disturbance propagates to the listener's ears. The listener's ear acts as a receptor for this deformation of the air. It reverses the process, regenerating the pressure and velocity waveform that was generated by the speech apparatus of the talker. The signal is analyzed by the basilar membrane (part of the auditory system), and some features are extracted for the subsequent recognition process. Figure 2 provides a classification of speech applications.


Figure 2 Speech applications



Figure 3 Speech production Organs



Figure 4 Frequency response distribution in the basilar membrane

2. Statistical process
Referring to figure 1, part of the speech recognition system depends on features extracted from the perceived speech. Features are physical properties that identify a certain phenomenon. For example, female voices generally sound higher in pitch than male voices. This is a physical phenomenon, and we can measure it by extracting the fundamental frequency from the speech signal: the fundamental frequency evaluates to higher values for higher-pitched voices.
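As an illustration of this idea, the following short Matlab sketch estimates the fundamental frequency of one voiced frame by a simple autocorrelation peak search. It is a minimal added illustration only; the file name, the frame position and the 50-400 Hz pitch search range are assumptions, not part of the original text.

% Minimal sketch: estimate the fundamental frequency (pitch) of one voiced frame
% using the autocorrelation method. File name and frame location are assumed.
[y, fs] = wavread('speech.wav');        % read the recorded utterance (assumed file)
frame   = y(1:round(0.03*fs));          % take a 30 ms analysis frame
frame   = frame - mean(frame);          % remove the DC offset
r = xcorr(frame, 'coeff');              % normalized autocorrelation
r = r(numel(frame):end);                % keep non-negative lags only
minLag = round(fs/400);                 % highest pitch assumed: 400 Hz
maxLag = round(fs/50);                  % lowest pitch assumed: 50 Hz
[mx, k] = max(r(minLag:maxLag));        % strongest peak inside the pitch range
F0 = fs / (minLag + k - 1);             % lag of the peak -> fundamental frequency (Hz)
fprintf('Estimated fundamental frequency: %.1f Hz\n', F0);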



One can imagine that the muscles responsible for controlling the vocal tract will not reproduce exactly the same configuration every time the talker wants to communicate the same message. Certainly there will be small deviations, and these are reflected as small differences in the values of the extracted features. It is therefore not possible to treat features as deterministic values, even though they are generated by a deterministic process. A random process is a function of random variables, and random variables are modeled using probability distribution functions. Figure 5 gives a close look at the speech signal as an information source. The figure has three parts: the top is the waveform, the middle is the spectrogram and the bottom is the annotation. The speech waveform is a graph of the analog values of the speech signal; the y axis represents the signal value, which depends on the microphone used to record the signal, and the x axis is the time in this graph and in the other two graphs. The middle graph is the spectrogram. This is a three-dimensional graph that cross-references time, frequency and power: the y axis is the frequency and the z axis, normal to the paper, represents the signal power, with more power shown as darker points. The x axis is the time, as mentioned before. Looking at the figure, we can see that the signal has a non-homogeneous frequency distribution over time. The signal is the Arabic word pronounced khamsah.


Figure 5: Speech signal as information source.

The observer can notice that this four-Arabic-letter word includes six different homogeneous areas. Each area represents a duration of time over which the features are stable. These stable areas are called the sounds, or in other words the phonemes. So the phonemes play the same role in spoken speech that the letters play in the written language. The above discussion gives us the idea of segmenting the speech signal into short durations so that we can treat the signal as stationary within each duration, as shown in figure 6.


Figure 6 Speech signal is segmented into sequence of features frames.

The segmentation process is very important because it makes it possible to model the speech production system. A phoneme may consist of a single frame or a sequence of frames, and the statistical model should suit the statistical process being described. Consider the following example (for a list of symbols of the Arabic phonemes refer to figure 7). Take the three Arabic vowels {a, o, e} and use formant features. Formants are the resonant frequencies of the vocal tract; they appear as dark bands in figure 5. The first two formants of the three Arabic vowels are used to build three Gaussian pdfs. Viewing the {F1, F2} space (figure 9), it clearly appears that we have three classes, and we can model each class using a single Gaussian distribution.


Figure 7 Arabic phonemes

Figure 8 Formants of certain frame of speech signal.



Figure 9 F1-F2 graph for two Arabic vowels.

F1 and F2 are random variables. Frames of {F1, F2} are modeled by a bivariate Gaussian distribution:

$$p(\mathbf{x}) = \frac{1}{2\pi\sqrt{|\Sigma|}}\exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{T}\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right),\qquad \mathbf{x} = [F_1\;\;F_2]^{T}$$

To use this function you should estimate two parameters:
1- the covariance matrix
2- the mean vector
The covariance matrix indicates the correlation between the different random variables {F1, F2}:

$$\Sigma = \begin{bmatrix}\sigma_{11} & \sigma_{12}\\ \sigma_{21} & \sigma_{22}\end{bmatrix}$$

The values are estimated from the available training data. We should collect a suitable set of descriptive frames that covers all situations of {F1, F2}. Then we can estimate the model parameters, and after that we can use the pdf to evaluate the probability of a certain frame of {F1, F2} against the trained model. As a numerical example, assume the training set consists of the five frames

$$\begin{bmatrix}F_1 & F_2\end{bmatrix} = \begin{bmatrix}200 & 150\\ 300 & 120\\ 255 & 111\\ 221 & 110\\ 295 & 145\end{bmatrix}$$

The elements of the covariance matrix are estimated as

$$\sigma_{11} = \sum_{F_1}(F_1-\mu_1)^2\,P(F_1)$$

where $P(F_1)$ is the probability of that value of $F_1$. In case all vectors are equally probable (this should be almost the normal case), $P(F_1) = \frac{1}{N}$.

$$\sigma_{12} = \sigma_{21} = \sum_{F_1}\sum_{F_2}(F_1-\mu_1)(F_2-\mu_2)\,P(F_1,F_2)$$

$P(F_1, F_2)$ is the probability of having a vector that contains both values $F_1$ and $F_2$ at the same time. In case the variables are independent, $P(F_1,F_2) = P_1(F_1)\,P_2(F_2)$ and the cross terms $\sigma_{12} = \sigma_{21}$ evaluate to zero. In case a dependency exists, each vector may recur within the data set; in this case $P(F_1,F_2)$ evaluates to $P(F_1,F_2) = \frac{c}{N}$, where $c$ is the number of recurrences and $N$ is the total number of training vectors in the data set.
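As a minimal added illustration of this estimation step (not part of the original text), the following Matlab fragment computes the mean vector and covariance matrix of the five example frames above and evaluates the trained Gaussian pdf for a new frame. The test frame [240 130] is an assumed value used only for demonstration.

% Minimal sketch: estimate mean vector and covariance matrix from training
% frames of {F1, F2}, then score a new frame against the trained Gaussian.
data = [200 150; 300 120; 255 111; 221 110; 295 145];  % five {F1,F2} training frames
mu    = mean(data);             % 1x2 mean vector
Sigma = cov(data);              % 2x2 covariance matrix
x     = [240 130];              % an assumed new frame to be evaluated
p     = mvnpdf(x, mu, Sigma);   % probability density of the frame under the model
fprintf('p([240 130]) = %g\n', p);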


Figure 10 a) F1-F2 graph for two Arabic vowels. b) Gaussian pdf for the training data. c) 3D graph of the Gaussian pdf.



mu = [0 0];
Sigma = [.25 .3; .3 1];
F1 = -3:.2:3; F2 = -3:.2:3;
[F1,F2] = meshgrid(F1,F2);
F = mvnpdf([F1(:) F2(:)],mu,Sigma);
F = reshape(F,length(F2),length(F1));
surf(F1,F2,F);
caxis([min(F(:))-.5*range(F(:)),max(F(:))]);
axis([-3 3 -3 3 0 .4])
xlabel('F1'); ylabel('F2'); zlabel('Probability Density');
Figure 11 Matlab script that evaluates the multivariate normal distribution of certain random variables.

Consider the following hypothetical case indicated in the following figure:

Figure 12 Hypothetical case where the F1-F2 data belong to a single class, for example a single phone.

The previous figure shows data that cannot be fit by a single Gaussian. As shown in the figure, the data distribution has two groups: the data for the same phoneme is concentrated in two separate areas, or in other words it appears that there are two points that may be taken as centers. A Gaussian mixture can represent such multimodal data. Figure 10-b indicates the contours of a Gaussian mixture pdf.
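For reference, a Gaussian mixture models the probability density as a weighted sum of M Gaussian components, where the weights (the "portions" returned by the code in figure 13) are non-negative and sum to one:

$$p(\mathbf{x}) = \sum_{m=1}^{M} w_m\,\mathcal{N}(\mathbf{x};\,\boldsymbol{\mu}_m,\Sigma_m),\qquad \sum_{m=1}^{M} w_m = 1$$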


%%[m s p ll] = GMI(data,n)
% Initialize Gaussian mixture using data and n. n is the number of Gaussians.
% This function returns on success the mean vector m, covariance matrix s,
% portions vector p and negative log likelihood ll.
% data is row based. Each row is a features vector.
function [m s p ll] = GMI(data,n)
options = statset('Display','final');
obj = gmdistribution.fit(data,n,'Options',options);
s = obj.Sigma;
m = obj.mu;
p = obj.PComponents;
ll = obj.NlogL;
end

%%[p] = GMcalProp(o,seg,mu,p)
% Calculate the Gaussian mixture probability for observation vectors o. The
% output is stored in the vector p.
% o   : Observation matrix. Each row corresponds to a certain features vector.
% seg : Covariance matrix of nxn elements. n is the size of the features vector.
% mu  : The mean vector. It is 1xn.
% p   : The portions vector. It is 1xm. m is the number of Gaussian mixtures.
function [p] = GMcalProp(o,seg,mu,p)
obj = gmdistribution(mu,seg,p);
p = pdf(obj,o);
end
Figure 13 Matlab code to estimate Gaussian mixture of dataset.


3. Digital speech processing


Signal processing involves the transformation of a signal into a form which is in some sense more desirable. In digital speech processing we are concerned with discrete-time systems, or equivalently, transformations of an input sequence into an output sequence.

Linear shift-invariant systems are useful for performing filtering operations on speech signals and, perhaps more importantly, they are useful as models for speech production.

4. Digital signal processing and digital filters
In this section the basics of signal processing will be recalled. The section starts by introducing the Fourier series and ends with digital filters.

1. Fourier series of a periodic continuous time signal

The Fourier series is a link to the frequency domain for periodic time signals. Consider the following equation:

$$f(t) = a_0 + \sum_{n=1}^{\infty}\left(a_n\cos n\omega_0 t + b_n\sin n\omega_0 t\right)$$


This is a general closed-form equation that expresses any periodic function of time f(t) as a sum of harmonics (sine and cosine signals). To make the above equation true and valid, the coefficients $a_n$, $b_n$ and $a_0$ should be calculated.

To find $a_0$, integrate both sides over one period $T$:

$$\int_T f(t)\,dt = \int_T\left[a_0 + \sum_{n=1}^{\infty}\left(a_n\cos n\omega_0 t + b_n\sin n\omega_0 t\right)\right]dt$$

All terms in the Right Hand Side (RHS) will evaluate to zero excluding the first term. This is because $T$ is the fundamental period and $\omega_0 = \frac{2\pi}{T}$ (rad/sec). Hence

$$a_0 = \frac{1}{T}\int_T f(t)\,dt$$

$a_0$ is the average value of the periodic function $f(t)$.

In the same way we can evaluate $a_n$ and $b_n$. Multiplying both sides by $\cos n\omega_0 t$ and integrating over one period,

$$\int_T f(t)\cos n\omega_0 t\,dt = \int_T\left[a_0 + \sum_{m=1}^{\infty}\left(a_m\cos m\omega_0 t + b_m\sin m\omega_0 t\right)\right]\cos n\omega_0 t\,dt$$

All terms in the RHS evaluate to zero excluding the term of $a_n$, since $\int_T\cos^2 n\omega_0 t\,dt = \frac{T}{2}$. Therefore

$$a_n = \frac{2}{T}\int_T f(t)\cos n\omega_0 t\,dt$$

Following the same way but multiplying both sides by $\sin n\omega_0 t$:

$$b_n = \frac{2}{T}\int_T f(t)\sin n\omega_0 t\,dt$$

This explains that any periodic time signal of main period $T$ can be expressed as a sum of sine and cosine signals of the main period $T$ and integer multiples of the main frequency $\omega_0$. This is a very important gate that opens new horizons in processing periodic time signals by considering their components instead of the function itself. Filters are the first application of this evolutionary step.

It is better to express the function in terms of the complex exponential. Keeping in mind that

$$e^{jn\omega_0 t} = \cos n\omega_0 t + j\sin n\omega_0 t$$

$$f(t) = \sum_{n=-\infty}^{\infty} C_n\, e^{jn\omega_0 t}$$

This is the same as the previous representation of the signal, but it is more familiar in engineering to express it in phasor form (magnitude and phase components). In the same way we can obtain the complex parameter $C_n$: multiply both sides by $e^{-jn\omega_0 t}$ and integrate over one period,

$$\int_T f(t)\,e^{-jn\omega_0 t}\,dt = \int_T\sum_{m=-\infty}^{\infty} C_m\,e^{jm\omega_0 t}\,e^{-jn\omega_0 t}\,dt = T\,C_n$$

Hence

$$C_n = \frac{1}{T}\int_T f(t)\,e^{-jn\omega_0 t}\,dt$$

Recalling the previous presentation of $f(t)$,

$$a_0 = \frac{1}{T}\int_T f(t)\,dt,\qquad a_n = \frac{2}{T}\int_T f(t)\cos n\omega_0 t\,dt,\qquad b_n = \frac{2}{T}\int_T f(t)\sin n\omega_0 t\,dt$$

it is directly evaluated that

$$C_0 = a_0,\qquad C_n = \frac{a_n - j\,b_n}{2}$$

Let us list the following points:
1- The frequency domain coefficients $C_n$ are discrete. They are also called the spectral coefficients.
2- The difference in the $\omega$ domain between two successive coefficients is $\omega_0 = \frac{2\pi}{T}$.
3- $T$ is the period of the time domain periodic function.
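As a numerical illustration (added here; the 50 Hz square wave and the number of harmonics are assumed choices, not from the original text), the following Matlab sketch computes the complex spectral coefficients Cn of one period of a periodic signal and reconstructs the signal from a finite number of harmonics:

% Minimal sketch: numerical Fourier series coefficients and reconstruction.
T  = 1/50;                       % fundamental period of the assumed test signal (sec)
fs = 10000;                      % sampling rate used for the numerical integration
t  = 0:1/fs:T-1/fs;              % one period of the time axis
f  = sign(sin(2*pi*50*t));       % 50 Hz square wave, f(t)
w0 = 2*pi/T;                     % fundamental frequency (rad/sec)
N  = 15;                         % number of harmonics kept
fr = zeros(size(t));             % reconstructed signal
for n = -N:N
    Cn = sum(f .* exp(-1j*n*w0*t)) / fs / T;   % Cn = (1/T) * integral over one period
    fr = fr + Cn * exp(1j*n*w0*t);             % add the n-th harmonic back
end
plot(t, f, t, real(fr)); xlabel('t (sec)'); legend('f(t)', 'reconstruction');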

2. Non-periodic continuous time functions and the Fourier transform
Recalling the above points, let us think about the effect on the above treatment of the time function $f(t)$ in case the function is not periodic.


In that case we have no period $T$ to start with. We should think in a different way to get the frequency domain components of the time signal. The basic and straightforward direction is to handle it as a periodic signal with a period equal to infinity:

$$T \to \infty \quad\Rightarrow\quad \omega_0 = \frac{2\pi}{T} \to 0$$

We can expect that the spectral coefficients are going to be much closer to each other; they will touch each other. This means that the frequency domain function is going to be a continuous function instead of a discrete one, as it is in the case of periodic time signals. Let us set some values according to the new situation:

$$\Delta\omega = \omega_0 = \frac{2\pi}{T} \to d\omega,\qquad n\omega_0 \to \omega,\qquad \omega \to -\infty \text{ at } n = -\infty,\qquad \omega \to +\infty \text{ at } n = +\infty$$

Now let us recall the Fourier series equations

$$f(t) = \sum_{n=-\infty}^{\infty} C_n\,e^{jn\omega_0 t},\qquad C_n = \frac{1}{T}\int_T f(t)\,e^{-jn\omega_0 t}\,dt$$

and substitute $C_n$ into the expression for $f(t)$:

$$f(t) = \sum_{n=-\infty}^{\infty}\left[\frac{1}{T}\int_T f(\tau)\,e^{-jn\omega_0\tau}\,d\tau\right]e^{jn\omega_0 t}$$

Replacing $\frac{1}{T} = \frac{\omega_0}{2\pi} \to \frac{d\omega}{2\pi}$ and letting the sum become an integral,

$$f(t) = \frac{1}{2\pi}\int_{-\infty}^{\infty}\left[\int_{-\infty}^{\infty} f(\tau)\,e^{-j\omega\tau}\,d\tau\right]e^{j\omega t}\,d\omega$$

Defining

$$F(\omega) = \int_{-\infty}^{\infty} f(t)\,e^{-j\omega t}\,dt,\qquad f(t) = \frac{1}{2\pi}\int_{-\infty}^{\infty} F(\omega)\,e^{j\omega t}\,d\omega$$

The last two equations are called the Fourier transform pair. As shown, they are produced in a straightforward way from the Fourier series of a periodic signal of infinite period, or in simple words for non-periodic signals. Let us list the following point:
1- Non-periodic signals have continuous spectral coefficients in the frequency domain.

Example 1
Consider an impulse in the frequency domain,

$$F(\omega) = \delta(\omega - \omega_0)$$

Then

$$f(t) = \frac{1}{2\pi}\int_{-\infty}^{\infty}\delta(\omega-\omega_0)\,e^{j\omega t}\,d\omega = \frac{1}{2\pi}\,e^{j\omega_0 t}$$

Example 2
Now let us consider the spectral function given by an impulse train as follows:

$$F(\omega) = \sum_{k=-\infty}^{\infty} 2\pi\,\delta(\omega - k\omega_0)$$

Consider the results of example 1:

$$f(t) = \frac{1}{2\pi}\int_{-\infty}^{\infty}\sum_{k=-\infty}^{\infty}2\pi\,\delta(\omega-k\omega_0)\,e^{j\omega t}\,d\omega = \sum_{k=-\infty}^{\infty} e^{jk\omega_0 t}$$

This indicates that $f(t)$ is a periodic function of time with period $T = \frac{2\pi}{\omega_0}$. Compare $f(t)$ to the Fourier series of a periodic signal: they are identical, with all spectral coefficients $C_k = 1$.

Example 3
Find the Fourier transform of the following time function (an impulse train in time):

$$x(t) = \sum_{k=-\infty}^{\infty}\delta(t - kT)$$

By using the results of example 2, we can handle this function as a periodic function of period $T$. Its Fourier series coefficients are

$$C_n = \frac{1}{T}\int_T \delta(t)\,e^{-jn\omega_0 t}\,dt = \frac{1}{T}$$

Then

$$X(\omega) = \sum_{k=-\infty}^{\infty}\frac{2\pi}{T}\,\delta(\omega - k\omega_0) = \omega_0\sum_{k=-\infty}^{\infty}\delta(\omega - k\omega_0)$$

This is a very important result: the spectrum of an impulse train in the time domain with period $T$ is also an impulse train in the frequency domain, with spacing $\omega_0 = \frac{2\pi}{T}$.
3. Discrete Fourier transform for a periodic sequence

Let us continue our analysis toward non-continuous (sampled) time functions. The cause of the discontinuity is that the time signal is sampled; this is the first step toward the digital world. What will be the effect in the frequency domain for periodic and non-periodic sequences? Let us start with a periodic sequence of period N samples, and again start from the Fourier series representation:

$$f(t) = \sum_{n} C_n\,e^{jn\omega_0 t},\qquad C_n = \frac{1}{T}\int_T f(t)\,e^{-jn\omega_0 t}\,dt$$

Now we have

$$T = NT_s \text{ (sec)},\qquad \Omega_0 = \frac{2\pi}{NT_s}\;\left(\frac{\text{rad}}{\text{sec}}\right),\qquad t = kT_s,\qquad \Delta t = T_s$$

where $T_s$ is the sample period and $\Delta t = T_s$ is the minimum time difference between two successive samples.

Substituting into the Fourier series equations,

$$f(kT_s) = \sum_{n} C_n\,e^{jn\frac{2\pi}{NT_s}kT_s} = \sum_{n} C_n\,e^{jn\frac{2\pi}{N}k}$$

where $k$ is the sample index. Let

$$\omega_o = \Omega_0\,T_s = \frac{2\pi}{N}\;\left(\frac{\text{rad}}{\text{sample}}\right)$$

so that

$$f(kT_s) = \sum_{n} C_n\,e^{jn\omega_o k}$$

$\omega_o$ is the digital frequency in rad/sample. We can remove the independent variable $T_s$ from the function brackets, as it is a constant:

$$f(k) = \sum_{n} C_n\,e^{jn\omega_o k}$$

Let us examine the frequency domain now. The integral over one period $T = NT_s$ becomes a sum over $N$ samples:

$$C_n = \frac{1}{T}\int_T f(t)\,e^{-jn\Omega_0 t}\,dt = \frac{1}{NT_s}\sum_{k=1}^{N} f(k)\,e^{-jn\omega_o k}\,T_s = \frac{1}{N}\sum_{k=1}^{N} f(k)\,e^{-jn\omega_o k}$$

Let

$$F(n\omega_o) = \sum_{k=1}^{N} f(k)\,e^{-jn\omega_o k},\qquad C_n = \frac{1}{N}\,F(n\omega_o)$$

We can notice the following points:
1- $F(n\omega_o)$ is a discrete function.
2- The distance between any two successive points in the frequency domain is $\omega_o$ (rad/sample).
3- $F(n\omega_o)$ is a periodic function in the $\omega$ domain, with period $2\pi$. This is a direct impact of the fact that

$$\sum_{k=1}^{N} f(k)\,e^{-jn(\omega_o+2\pi)k} = \sum_{k=1}^{N} f(k)\,e^{-jn\omega_o k}$$

4- The relation between the real frequency $\Omega$ (rad/sec) and the digital frequency $\omega$ (rad/sample) is

$$\Omega = \omega\,f_s$$
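As a small numerical check (an added illustration; the 8 kHz sampling rate and 1 kHz tone are assumed values), the following Matlab lines show how a DFT bin index maps to a digital frequency in rad/sample and to a real frequency in Hz:

% Minimal sketch: mapping DFT bin index -> digital frequency -> real frequency.
fs = 8000;                    % sampling rate (assumed)
N  = 64;                      % number of samples in one analysis block
k  = 0:N-1;
x  = cos(2*pi*1000*k/fs);     % a 1 kHz tone sampled at fs
X  = fft(x);                  % N-point DFT
[mx, n] = max(abs(X(1:N/2))); % strongest bin in the lower half of the spectrum
wo = 2*pi*(n-1)/N;            % digital frequency of that bin (rad/sample)
f  = wo*fs/(2*pi);            % real frequency (Hz), i.e. Omega = wo*fs in rad/sec
fprintf('bin %d -> %.3f rad/sample -> %.1f Hz\n', n-1, wo, f);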

4. Discrete Fourier transform for a non-periodic time sequence
Going on from this point, we can consider the case of a non-periodic sampled time signal. As we did before, we can treat it as periodic with an infinite period. Let us start with the Fourier series pair for a sampled periodic time signal:

$$f(k) = \frac{1}{N}\sum_{n} F(n\omega_o)\,e^{jn\omega_o k},\qquad F(n\omega_o) = \sum_{k=1}^{N} f(k)\,e^{-jn\omega_o k}$$

Consider the following. As $N \to \infty$,

$$\Delta\omega = \omega_o = \frac{2\pi}{N} \to d\omega,\qquad n\omega_o \to \omega\;\left(\frac{\text{rad}}{\text{sample}}\right),\qquad \omega \to -\infty \text{ at } n=-\infty,\qquad \omega \to +\infty \text{ at } n=+\infty$$

This is the same issue that happened in the case of the Fourier transform: the distance between successive points in the frequency domain tends to zero, which leads to a continuous frequency domain function. The corresponding real frequency is $\Omega = \omega f_s$ (rad/sec). Let us start:

$$f(k) = \frac{1}{N}\sum_{n}\left[\sum_{m} f(m)\,e^{-jn\omega_o m}\right]e^{jn\omega_o k} = \sum_{n}\frac{\omega_o}{2\pi}\left[\sum_{m} f(m)\,e^{-jn\omega_o m}\right]e^{jn\omega_o k}$$

In the limit the sum over $n$ becomes an integral over one period of $2\pi$ in $\omega$:

$$f(k) = \frac{1}{2\pi}\int_{2\pi} F(\omega)\,e^{j\omega k}\,d\omega,\qquad F(\omega) = \sum_{k=-\infty}^{\infty} f(k)\,e^{-j\omega k}$$

We can write the following points:
1- $F(\omega)$ is a continuous function in $\omega$.
2- $F(\omega)$ is periodic with a period of $2\pi$.
3- Recalling $\Omega = \omega f_s$: at $\omega = 2\pi$, $f = \frac{\Omega}{2\pi} = \frac{2\pi f_s}{2\pi} = f_s$ (Hz). This introduces the very important rule of sampling theory: we should sample at a rate of at least twice the maximum signal frequency to ensure that there is no overlap (aliasing) in the frequency domain.

Figure 14 The effect of the sampling rate in the frequency domain, for Fs > 2Fmax, Fs = 2Fmax and Fs < 2Fmax.
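To make the aliasing effect of figure 14 concrete, the following short Matlab sketch (an added illustration; the tone frequency and the two sampling rates are assumed values) samples a 3 kHz sine once above and once below the Nyquist rate and compares the apparent frequencies:

% Minimal sketch of aliasing: a 3 kHz tone sampled above and below twice its
% frequency. Below the Nyquist rate the tone shows up at a false frequency.
f0 = 3000;                         % tone frequency in Hz (assumed)
for fs = [8000 4000]               % 8 kHz > 2*f0 (no aliasing), 4 kHz < 2*f0 (aliasing)
    N = 1024;
    n = 0:N-1;
    x = cos(2*pi*f0*n/fs);         % sampled tone
    X = abs(fft(x));
    [mx, k] = max(X(1:N/2));       % strongest component below fs/2
    fprintf('fs = %d Hz -> apparent frequency %.0f Hz\n', fs, (k-1)*fs/N);
end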




Example 4
Find the spectrum of the following discrete sequence (an impulse train of period P samples):

$$x(n) = \sum_{k=-\infty}^{\infty}\delta(n - kP)$$

Following the same direction as in examples 1 through 3, let us first find the time sequence whose spectrum is an impulse at $\omega_o$ (repeated every $2\pi$),

$$X(\omega) = 2\pi\,\delta(\omega - \omega_o)$$

Its inverse transform over one period of $2\pi$ is

$$x(n) = \frac{1}{2\pi}\int_{2\pi} 2\pi\,\delta(\omega-\omega_o)\,e^{j\omega n}\,d\omega = e^{j\omega_o n}$$

The given impulse train is identical to the Fourier series of a periodic discrete signal, so we can handle it as a periodic delta function of period $P$. Its spectral coefficients are

$$C_n = \frac{1}{P}\sum_{m\in\langle P\rangle}\delta(m)\,e^{-jn\omega_o m} = \frac{1}{P}$$

Then

$$X(\omega) = \sum_{k=-\infty}^{\infty}\frac{2\pi}{P}\,\delta(\omega - k\omega_o) = \omega_o\sum_{k=-\infty}^{\infty}\delta(\omega - k\omega_o),\qquad \omega_o = \frac{2\pi}{P}$$

So the spectrum of a discrete impulse train with period $P$ samples is an impulse train in the frequency domain with spacing $\omega_o = \frac{2\pi}{P}$ rad/sample.


5. Z-transform
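For reference, the two-sided Z-transform of a discrete sequence x[n], whose defining sum is assumed but not written out in these notes, is

$$X(z) = \mathcal{Z}\{x[n]\} = \sum_{n=-\infty}^{\infty} x[n]\,z^{-n}$$

where z is a complex variable.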

Region of convergence
The region of convergence (ROC) is the set of points in the complex plane for which the Z-transform summation converges.


Example 1 (No ROC)
Let $x[n] = 0.5^{n}$ for all $n$. Expanding on the interval $(-\infty,\infty)$ it becomes

$$x[n] = \{\dots,\;0.5^{-3},\;0.5^{-2},\;0.5^{-1},\;1,\;0.5,\;0.5^{2},\;0.5^{3},\;\dots\}$$

Looking at the sum

$$X(z) = \sum_{n=-\infty}^{\infty} 0.5^{n}\,z^{-n}$$

there are no values of $z$ that satisfy the convergence condition, so this sequence has no ROC.

Example 2 (causal ROC)

(ROC shown in blue, the unit circle as a dotted grey circle and the circle $|z| = 0.5$ as a dashed black circle.)

Let $x[n] = 0.5^{n}\,u[n]$ (where $u$ is the Heaviside step function). Expanding on the interval $[0,\infty)$ it becomes

$$x[n] = \{1,\;0.5,\;0.5^{2},\;0.5^{3},\;\dots\}$$

Looking at the sum

$$X(z) = \sum_{n=0}^{\infty} 0.5^{n}\,z^{-n} = \sum_{n=0}^{\infty}\left(\frac{0.5}{z}\right)^{n} = \frac{1}{1 - 0.5\,z^{-1}}$$

The last equality arises from the infinite geometric series, and the equality only holds if $|0.5\,z^{-1}| < 1$, which can be rewritten in terms of $z$ as $|z| > 0.5$. Thus, the ROC is $|z| > 0.5$. In this case the ROC is the complex plane with a disc of radius 0.5 at the origin "punched out".

Example 3 (anticausal ROC)

(ROC shown in blue, the unit circle as a dotted grey circle and the circle $|z| = 0.5$ as a dashed black circle.)

Let $x[n] = -\,0.5^{n}\,u[-n-1]$ (where $u$ is the Heaviside step function). Expanding on the interval $(-\infty,-1]$ it becomes

$$x[n] = \{\dots,\;-0.5^{-3},\;-0.5^{-2},\;-0.5^{-1}\}$$

Looking at the sum

$$X(z) = -\sum_{n=-\infty}^{-1} 0.5^{n}\,z^{-n} = -\sum_{m=1}^{\infty}(2z)^{m} = \frac{1}{1 - 0.5\,z^{-1}}$$

Using the infinite geometric series again, the equality only holds if $|2z| < 1$, which can be rewritten in terms of $z$ as $|z| < 0.5$. Thus, the ROC is $|z| < 0.5$. In this case the ROC is a disc centered at the origin and of radius 0.5.

What differentiates this example from the previous example is only the ROC; this is intentional, to demonstrate that the transform result alone is insufficient.

Examples conclusion
Examples 2 and 3 clearly show that the Z-transform $X(z)$ of $x[n]$ is unique only when the ROC is specified. Creating the pole-zero plot for the causal and the anticausal case shows that the ROC in either case does not include the pole at 0.5. This extends to cases with multiple poles: the ROC will never contain poles. In example 2, the causal system yields an ROC that includes $|z| = \infty$, while the anticausal system in example 3 yields an ROC that includes $z = 0$.

(ROC shown as a blue ring.)

In systems with multiple poles it is possible to have an ROC that includes neither $z = 0$ nor $|z| = \infty$; the ROC then forms a circular band. For example,

$$x[n] = 0.5^{n}\,u[n] \;-\; 0.75^{n}\,u[-n-1]$$

has poles at 0.5 and 0.75. The ROC will be $0.5 < |z| < 0.75$, which includes neither the origin nor infinity. Such a system is called a mixed-causality system, as it contains a causal term $0.5^{n}u[n]$ and an anticausal term $-0.75^{n}u[-n-1]$.

The stability of a system can also be determined by knowing the ROC alone. If the ROC contains the unit circle (i.e., $|z| = 1$) then the system is stable. In the above systems the causal system (Example 2) is stable because $|z| > 0.5$ contains the unit circle.

If you are provided the Z-transform of a system without an ROC (i.e., an ambiguous $x[n]$), you can determine a unique $x[n]$ provided you desire the following:
Stability: if you need stability then the ROC must contain the unit circle.
Causality: if you need a causal system then the ROC must contain infinity and the system function will be a right-sided sequence. If you need an anticausal system then the ROC must contain the origin and the system function will be a left-sided sequence.
If you need both stability and causality, all the poles of the system function must be inside the unit circle. The unique $x[n]$ can then be found.

Properties of the z-transform (time domain, Z-domain and ROC):
Notation: $x[n] \leftrightarrow X(z)$, ROC $R$
Linearity: $a_1 x_1[n] + a_2 x_2[n] \leftrightarrow a_1 X_1(z) + a_2 X_2(z)$, ROC at least the intersection of ROC1 and ROC2
Time shifting: $x[n-k] \leftrightarrow z^{-k} X(z)$, ROC $R$, except $z = 0$ if $k > 0$ and $z = \infty$ if $k < 0$
Scaling in the z-domain: $a^{n} x[n] \leftrightarrow X(z/a)$, ROC $|a|R$
Time reversal: $x[-n] \leftrightarrow X(1/z)$, ROC $1/R$
Conjugation: $x^{*}[n] \leftrightarrow X^{*}(z^{*})$, ROC $R$
Real part: $\mathrm{Re}\{x[n]\} \leftrightarrow \tfrac{1}{2}[X(z) + X^{*}(z^{*})]$, ROC $R$
Imaginary part: $\mathrm{Im}\{x[n]\} \leftrightarrow \tfrac{1}{2j}[X(z) - X^{*}(z^{*})]$, ROC $R$
Differentiation: $n\,x[n] \leftrightarrow -z\,\frac{dX(z)}{dz}$, ROC $R$
Convolution: $x_1[n] * x_2[n] \leftrightarrow X_1(z)\,X_2(z)$, ROC at least the intersection of ROC1 and ROC2
Correlation: $r_{x_1 x_2}[l] = x_1[l] * x_2[-l] \leftrightarrow X_1(z)\,X_2(z^{-1})$, ROC at least the intersection of the ROC of $X_1(z)$ and $X_2(z^{-1})$
Multiplication: $x_1[n]\,x_2[n] \leftrightarrow \frac{1}{2\pi j}\oint X_1(v)\,X_2(z/v)\,v^{-1}\,dv$, ROC at least $R_1 R_2$
Parseval's relation: $\sum_{n=-\infty}^{\infty} x_1[n]\,x_2^{*}[n] = \frac{1}{2\pi j}\oint X_1(v)\,X_2^{*}(1/v^{*})\,v^{-1}\,dv$
Initial value theorem: $x[0] = \lim_{z\to\infty} X(z)$, if $x[n]$ is causal
Final value theorem: $x[\infty] = \lim_{z\to 1}(z-1)\,X(z)$, only if the poles of $(z-1)X(z)$ are inside the unit circle

Table of common Z-transform pairs
Here: $u[n] = 1$ for $n \ge 0$, $u[n] = 0$ for $n < 0$; $\delta[n] = 1$ for $n = 0$, $\delta[n] = 0$ otherwise.

Signal $x[n]$ ........ Z-transform $X(z)$ ........ ROC
$\delta[n]$ ........ $1$ ........ all $z$
$u[n]$ ........ $\dfrac{1}{1 - z^{-1}}$ ........ $|z| > 1$
$a^{n} u[n]$ ........ $\dfrac{1}{1 - a z^{-1}}$ ........ $|z| > |a|$
$-a^{n} u[-n-1]$ ........ $\dfrac{1}{1 - a z^{-1}}$ ........ $|z| < |a|$
$n\,a^{n} u[n]$ ........ $\dfrac{a z^{-1}}{(1 - a z^{-1})^{2}}$ ........ $|z| > |a|$
$\cos(\omega_0 n)\,u[n]$ ........ $\dfrac{1 - \cos(\omega_0)\,z^{-1}}{1 - 2\cos(\omega_0)\,z^{-1} + z^{-2}}$ ........ $|z| > 1$
$\sin(\omega_0 n)\,u[n]$ ........ $\dfrac{\sin(\omega_0)\,z^{-1}}{1 - 2\cos(\omega_0)\,z^{-1} + z^{-2}}$ ........ $|z| > 1$

Chapter 2 Speech Signal as an information source


Summary
This chapter provides details about the information included in the speech signal. The speech signal conveys a great deal of information; some of it, such as the talker's identity, the message and the emotions, will be explored. Research is very active in modeling such information. This chapter introduces basic information models and also provides some explanation of speech databases. The spectral characteristics of different speech components will be discussed, the classification of speech components will be introduced, how emotions can be modeled will be discussed, and the relation between prosodic characteristics and speech components will be discussed.

Objectives
Explore human speech from a linguistic perspective.
Understanding speech corpora.
Understanding the human speech signal as an information source.


1. Acoustical parameters
Most languages, including Arabic, can be described in terms of a set of distinctive sounds, or phonemes. In particular, for American English, there are about 42 phonemes [2], including vowels, diphthongs, semivowels and consonants. There are a variety of ways of studying phonetics; e.g., linguists study the distinctive features or characteristics of the phonemes. For our purposes it is sufficient to consider an acoustic characterization of the various sounds, including the place and manner of articulation, waveforms, and spectrographic characterizations of these sounds. Figure 1 shows how the sounds of American English are broken into phoneme classes. The four broad classes of sounds are vowels, diphthongs, semivowels, and consonants. Each of these classes may be further broken down into subclasses that are related to the manner and place of articulation of the sound within the vocal tract. Each of the phonemes in Figure 1 (a) can be classified as either a continuant or a noncontinuant sound. Continuant sounds are produced by a fixed (non-time-varying) vocal tract configuration excited by the appropriate source. The class of continuant sounds includes the vowels, the fricatives (both unvoiced and voiced), and the nasals. The remaining sounds (diphthongs, semivowels, stops and affricates) are produced by a changing vocal tract configuration. These are therefore classed as noncontinuants.


Figure 1: (a) Phonemes in American English, (b) Arabic phonemes.


The Arabic language has basically 34 phonemes, 28 consonants and six vowels (see figure 1 b).

2. Phonological hierarchy
The phonological hierarchy describes a series of increasingly smaller regions of a phonological utterance. From larger to smaller units, it is as follows:
Utterance
Prosodic declination unit (DU) / intonational phrase (I-phrase)
Prosodic intonation unit (IU) / phonological phrase (P-phrase) (intonation: the use of changing pitch to convey syntactic information, e.g. a questioning intonation)
Prosodic list unit (LU)
Clitic group
Phonological word (P-word)
Foot (F): "strong-weak" syllable sequences such as English ladder, button, eat it
Syllable (σ): e.g. cat (1 σ), ladder (2 σ)
Mora (μ) ("half-syllable")
Segment (phoneme): e.g. [k], [æ] and [t] in cat
Feature

Syllable
A syllable is a unit of organization for a sequence of speech sounds. For example, the word water is composed of two syllables: wa and ter. A syllable is typically made up of a syllable nucleus (most often a vowel) with optional initial and final margins (typically, consonants). Syllables are often considered the phonological "building blocks" of words. They can influence the rhythm of a language, its prosody, its poetic meter, its stress patterns, etc. A word that consists of a single syllable (like English cat) is called a monosyllable (such a word is monosyllabic), while a word consisting of two syllables (like monkey) is called a disyllable (such a word is disyllabic). A word consisting of three syllables (such as indigent) is called a trisyllable (the adjective form is trisyllabic). A word consisting of more than three syllables (such as intelligence) is called a polysyllable (and could be described as polysyllabic), although this term is often used to describe words of two syllables or more.

Phoneme
In human language, a phoneme is the smallest posited structural unit that distinguishes meaning. Phonemes are not the physical segments themselves, but, in theoretical terms, cognitive abstractions or categorizations of them. An example of a phoneme is the /t/ sound in the words tip, stand, water, and cat. (In transcription, phonemes are placed between slashes, as here.) These instances of /t/ are considered to fall under the same sound category despite the fact that in each word they are pronounced somewhat differently. The difference may not even be audible to native speakers, or the audible differences may not be apparent.

Phones
A phoneme may cover several recognizably different speech sounds, called phones. In our example, the /t/ in tip is aspirated, [tʰ], while the /t/ in stand is not, [t]. (In transcription, speech sounds that are not phonemes are placed in brackets, as here.)

Allophones
Phones that belong to the same phoneme, such as [tʰ] and [t] for English /t/, are called allophones. A common test to determine whether two phones are allophones or separate phonemes relies on finding minimal pairs: words that differ by only the phones in question. For example, the words tip and dip illustrate that [t] and [d] are separate phonemes, /t/ and /d/, in English, whereas the lack of such a contrast in Korean (/tata/ is pronounced [tada], for example) indicates that in this language they are allophones of a phoneme /t/.

3. Corpus Corpus linguistics is the study of language as expressed in samples (corpora) or "real world" text. This method represents a digestive approach to deriving a set of abstract rules by which a natural language is governed or else relates to another language. Originally done by hand, corpora are largely derived by an automated process, which is corrected. Computational methods had once been viewed as a holy grail of linguistic research, which would ultimately manifest a ruleset for natural language processing and machine translation at a high level. Such has not been the case, and since the cognitive revolution, cognitive linguistics has been largely critical of many claimed practical uses for corpora. However, as computation capacity and speed have increased, the use of corpora to study language and term relationships en masse has gained some respectability. The corpus approach runs counter to Noam Chomsky's view that real language is riddled with performance-related errors, thus requiring careful analysis of small speech samples obtained in a highly controlled laboratory setting. Corpus Linguistics has generated a number of research methods, attempting to trace a path from data to theory: Annotation consists of the application of a scheme to texts. Annotations may include structural markup, POS-tagging, parsing, and numerous other representations. Abstraction consists of the translation (mapping) of terms in the scheme to terms in a theoretically motivated model or dataset. Abstraction typically includes linguist-directed search but may include e.g., rule-learning for parsers. Analysis consists of statistically probing, manipulating and generalizing from the dataset. Analysis might include statistical evaluations, optimization of rule-bases or knowledge discovery methods.

Most lexical corpora today are POS-tagged. However, even corpus linguists who work with 'un-annotated plain text' inevitably apply some method to isolate the terms they are interested in from surrounding words. In such situations annotation and abstraction are combined in a lexical search. The advantage of publishing an annotated corpus is that other users can then perform experiments on the corpus. Linguists with other interests and differing perspectives than the originators can exploit this work.
Speech corpora are designed to provide a source of segmented sound samples for researchers or for application manufacturers. There are many famous databases on the market, targeting many different languages. Any speech corpus consists of all or some of the following parts:
1. Annotation information.
2. Waveform samples.
3. Segmentation information.
4. Text that describes the spoken utterance.
5. Phoneme statistics.
6. Recording information.
7. Talker statistics.


3.1. TIMIT corpus
TIMIT is a corpus of phonemically and lexically² transcribed³ speech of American English speakers of different sexes and dialects⁴. Each transcribed element has been represented precisely in time. TIMIT was designed to advance acoustic-phonetic knowledge and automatic speech recognition systems. It was commissioned by DARPA and worked on by many sites, including Texas Instruments (TI) and

² By the means of words.
³ To represent (speech sounds) by phonetic symbols.
⁴ A regional or social variety of a language distinguished by pronunciation, grammar, or vocabulary, especially a variety of speech differing from the standard literary language or speech pattern of the culture in which it exists.


Massachusetts Institute of Technology (MIT), hence the corpus' name. There is also a telephone bandwidth version called NTIMIT (Network TIMIT). The TIMIT corpus of read speech is designed to provide speech data for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems. TIMIT contains broadband recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences. The TIMIT corpus includes time-aligned orthographic, phonetic and word transcriptions as well as a 16-bit, 16 kHz speech waveform file for each utterance. Corpus design was a joint effort among the Massachusetts Institute of Technology (MIT), SRI International (SRI) and Texas Instruments, Inc. (TI). The speech was recorded at TI, transcribed at MIT and verified and prepared for CD-ROM production by the National Institute of Standards and Technology (NIST). The TIMIT corpus transcriptions have been hand verified. Test and training subsets, balanced for phonetic and dialectal coverage, are specified. Tabular computer-searchable information is included as well as written documentation.

3.2. SCRIBE corpus


The material consists of a mixture of read speech and spontaneous speech. The read speech material consists of sentences selected from a set of 200 'phonetically rich' sentences (SET-A) and 460 'phonetically compact' sentences (SET-B) and a two-minute continuous passage. The 'phonetically rich' sentences were designed at CSTR to be phonetically balanced. The 'phonetically compact' sentences were based on a British version of the MIT compact sentences (as in TIMIT) which were expanded to include relevant RP contrasts (the set contains at least one example of every possible triphone in British English). The passage was designed at UCL to contain accent sensitive material. The spontaneous speech material was collected from a constrained 'free speech' situation where a talker gave a verbal description of a picture.

The recordings were divided between a 'many talker' set and a 'few talker' set. In the 'many talker' set, each speaker recorded ten sentences from the 'phonetically rich' sentences and ten sentences from the 'phonetically compact' sentences. In the 'few talker' set, each speaker recorded 100 sentences from the 'phonetically rich' set and 100 from the 'phonetically compact' set. Speakers were recruited from four 'dialect areas': South East (DR1), Glasgow (DR2), Leeds (DR3) and Birmingham (DR4). The aim was to employ 5 male and 5 female speakers from each dialect area for the fewtalker sub corpus, with 20 male and 20 female speakers from each dialect area for the many-talker corpus. In fact this number of speakers was not fully achieved. The original aim of the project was to release the corpus as a collection of audio recordings with just orthographic transcription, but with a small percentage to be phonetically annotated in the style of the TIMIT corpus. FILE EXTENSIONS
SES  Sentence(s) English Sampled       Pressure microphone signal
PES  Passage English Sampled           Pressure microphone signal
FES  Free-speech English Sampled       Pressure microphone signal
SE2  Sentence(s) Eng. 2nd channel      Close-talking microphone signal
PE2  Passage Eng. 2nd channel          Close-talking microphone signal
FE2  Free-speech Eng. 2nd channel      Close-talking microphone signal
SET  Sentence English Text             Text used to prompt the subject
PET  Passage English Text              Text used to prompt the subject
FER  Free-speech Eng. tRanscription    Orthographic transcription as text
SEO  Sentence Eng. Orthography         Orthographic time aligned labels
SEA  Sentence Eng. Acoustic labels     Acoustic phonetic time aligned labels
PEA  Passage Eng. Acoustic labels      Acoustic phonetic time aligned labels
SEB  Sentence Eng. Broad labels        Broad phonetic time aligned labels
PEB  Passage Eng. Broad labels         Broad phonetic time aligned labels


3.3. Steps to prepare a speech corpus
1. Prepare the text file.
2. Read the text file aloud and record the speech.
3. Prepare a transcription for the text.
4. Open the speech using a suitable tool that enables you to view the speech waveform and the associated spectrogram (SFS 5).
5. With the aid of the spectrogram, locate each character of the transcription file: find the stable parts on the spectrogram, mark them, then write the symbol. (The number of stable parts should equal the number of symbols in the transcription file.)
6. Store the annotations into a suitable file. Name the file exactly as the original speech file with a new descriptive file extension.

Examples of annotation file HTK format (.Lab)


0        32829500 SIL
32829500 34364500 VOI
34364500 34529500 UNV
34529500 34918500 VOI
34918500 37454000 UNV
37454000 42013500 VOI
42013500 43930000 UNV
43930000 46479500 VOI
46479500 47297000 SIL
47297000 48293000 UNV
48293000 50451000 VOI
50451000 51963000 UNV
51963000 52388500 SIL
52388500 54029500 UNV
54029500 54817500 VOI
54817500 56515000 UNV
56515000 56687500 VOI
56687500 57779500 UNV

Speech Filing System, http://www.phon.ucl.ac.uk/resource/sfs/


Each row indicates a start time, an end time and a transcription symbol. The units of time are 100 (ns). For example, the first row tells us that there is a SIL segment that starts at time 0 and ends at 32829500 × 100 × 10⁻⁹ ≈ 3.28 (s).
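As a small added illustration (the file name 'sample.lab' is an assumption), the following Matlab lines read an HTK label file of this form and convert the times from 100 ns units to seconds:

% Minimal sketch: read an HTK .lab file (start, end, label) and convert the
% 100 ns time units into seconds. The file name is assumed.
fid = fopen('sample.lab', 'r');
C = textscan(fid, '%f %f %s');      % start times, end times, labels
fclose(fid);
startSec = C{1} * 100e-9;           % 100 ns units -> seconds
endSec   = C{2} * 100e-9;
for i = 1:numel(C{3})
    fprintf('%-4s %8.3f - %8.3f s\n', C{3}{i}, startSec(i), endSec(i));
end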


Figure 2 Mapping annotation of HTK format into seconds

Figure 3 The multi-view screen in SFS is used to annotate a speech sample from the SCRIBE corpus.


4. Speech stream information extraction model


In this section, the speech stream will be processed as an information source. To start, we first have to declare what information we need to extract from the speech stream; phonemes will be considered as the information in this discussion. Let us start at the very beginning of the problem: the story starts in the first moments we face the world, the birthday of the human body. From the very beginning, the perception mechanisms start to collect all sounds available in the environment. The first step is information packaging: speech samples are packaged into short-duration time frames. Each frame should not exceed 30 (ms); this ensures that the spectral parameters of the speech signal are stationary within the frame. This stage is visualized in figure 4, where each frame is 10 (ms). Within its 10 (ms) the frame may fall into three different states:
1- The frame is a pure part of a single phone.
2- The frame includes a pure part of a single phone and a preceding or following part that is a transition from or to an adjacent phone.
3- The frame covers all parts of a phone.
Each phone may be considered as three parts: two transition parts to the adjacent phones and one stationary middle part. Frames are colored in figure 4 to keep this three-state model in mind.

Figure 4 Speech perception stage: the speech stream is converted into a stream of frames; each frame is 10 (ms), so 30 (ms) corresponds to three frames.

The first process is to locate the speech activities. It is the first intelligent process performed by the human child's brain: the brain starts to discriminate human speech from the other available environmental sounds.

Figure 5: Baby stage: an utterance passes through a phone detector that enhances old models and generates new models. Phone models are continuously updated.


With time, the child's brain starts to discriminate human sounds and to recognize the phonemes.

Figure 6: Child stage: models are labeled with phone identities using the transcription, which is included in the learning process.

Growing up, the human brain starts to perform complex analytical functions. It starts to apply grammar and to understand the emotions carried by the speech signal. The meaning starts to be adjusted according to context, and huge word networks are constructed in the mind.



Figure 7: Corpus and language models are included in the training process. Brain starts to make complex processing of speech stream. The brain starts to apply grammars and to recognize words and meanings. The lexical word net starts to be constructed.

Figure 8 introduces a simple mathematical model of speech stream processing as discussed. In the first step, the speech stream is cut into small time durations. This process is called time framing. It is important in order to have stationary signals for the subsequent mathematical operations, so that we can process a stream of frames instead of a stream of samples. A frame may be 30 (ms) or less for the human speech signal; this minimizes the effect of the time-varying dynamics of the speech features. Step 1 is to mark frames as speech / non-speech; the output of this stage should be purely speech frames. Step 2 is to mark the stream of speech frames into phones; this is a coloring phase. Step 3 is to merge the frames of the same phone into one part. This is the information stream.
Figure 8: Speech stream information extraction model (silence markers, stationary delimiters, marker translation, observation sets, phone recognizer, phone sequence, phone-to-text).

Figure 9 provides a simple pipeline process that describes the basic steps of the information extraction model discussed in the previous paragraph.

Each process is responsible for reading from one buffer and writing to another; in this way the total pipeline operation is constructed. Each buffer is an entity that manages its own store. The pipeline converts digitized speech (samples/sec) into frames (frames/sec), then phones (phones/sec), then words (words/sec), with a buffer between each pair of processes (Buffer A, Process A, Buffer B, Process B, Buffer C, Process C, Buffer D, Process D).

Figure 9: Information extraction pipeline.

5. Speech processor model
In this section a preliminary model for a speech processor will be introduced. The model is based on phone units.

Figure 10: Configurable phone cell (CPC)

The wavelet packets technique is utilized to detect phone features, and the best-tree-nodes method is used as a phone identifier. This gives a real representation of the phone frequency components over predetermined frequency bands. The node index is used as a feature; its presence indicates that the phone has a frequency component with a suitable relative power in that band. This speeds up the process, since no feature extraction is performed in bands that are not selected, and it also makes the process independent of the signal level. The following figure indicates how signals with different properties are discriminated using the best tree algorithm.


A tree-matching algorithm is utilized to give a weight for the match, not just a [matched / not matched] condition. This is important for such stochastic signals. Figure 11 represents the overall system layout.

Figure 11: Scheme of speech processor

Figure 12: Speech processor training flowchart. In training mode, the transcript file is analyzed to obtain (1) the number of training phones N, (2) the number of new CPC units needed, (3) references to CPC units already matched in previous training sessions, and (4) the phone sequence list. The flow then loops over each phone in the transcription: if the phone does not already exist, a new CPC is allocated, otherwise its CPC is located. Frames from the input buffer are passed to the CPC, and a group test over the next three frames decides whether a frame belongs to the current group or whether the average count indicates a group change. Garbage frames are excluded, the remaining frames form the HMM training set, and the CPC starts training its HMM after the loop finishes.

References


[1] L.R. Rabiner, R.W. Schafer, "Digital Processing of Speech Signals", Prentice-Hall, ISBN 0-13-213603-1.
[2] Thomas W. Parsons, "Voice and Speech Processing", McGraw-Hill Inc., 1987, pp. 57-98, 136-192, 291-317.
[3] Nemat Sayed Abdel Kader, "Arabic Text-to-Speech Synthesis by Rule", Ph.D. thesis, Cairo University, Faculty of Engineering, Communication Dept., 1992, p. 165.
[4] Wikipedia contributors, "Phonological hierarchy," Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/w/index.php?title=Phonological_hierarchy&oldid=204742230 (accessed October 24, 2008).
[5] Amr M. Gody, "Speech Processing Using Wavelet Based Algorithms", Ph.D. thesis, Cairo University, Faculty of Engineering, Communication Dept., 1999.
[6] Amr M. Gody, "Human Hearing Mechanism Codec (HHMC)", CLE2007, The Sixth Conference on Language Engineering, Ain-Shams University, Cairo, Egypt, December 2007.
[7] Amr M. Gody, "Wavelet Packets Best Tree 4 Points Encoded (BTE) Features", The Eighth Conference on Language Engineering, Ain-Shams University, Cairo, Egypt, December 2008.
[8] Amr M. Gody, "Voiced/Unvoiced and Silent Classification Using HMM Classifier Based on Wavelet Packets BTE Features", The Eighth Conference on Language Engineering, Ain-Shams University, Cairo, Egypt, December 2008.


Chapter 3 Speech Analysis


Summary
In this chapter Linear Prediction Coding (LPC) will be introduced. The importance of this method lies both in its ability to provide extremely accurate estimates of the speech features and in its relative speed of computation. Some of the issues involved in using it in practical speech applications will be discussed.

Objectives Understanding LPC. Practice LPC using Matlab


1. Linear Prediction Coding (LPC)
Linear prediction is one of the most important tools in speech processing. It can be utilized in many ways, but with regard to speech processing the most important property is the ability to model the vocal tract. It can be shown that the lattice-structured model of the vocal tract is an all-pole filter, which means a filter that has only poles. One can also think of the lack of zeros as restricting the filter to boosting certain frequencies, which in this case are the formant frequencies of the vocal tract. In reality the vocal tract is not composed of lossless uniform tubes, but in practice modeling the vocal tract with an all-pole filter works fine. Linear prediction (LP) is a useful method for estimating the parameters of this all-pole filter from a recorded speech signal.

Figure 1: Vocal tract model


This system is excited by an impulse train for voiced speech or a random noise sequence for unvoiced speech. Thus, the parameters of this model are: voiced/unvoiced classification, pitch period for voiced speech, gain parameter G, and the coefficients $a_k$ of the digital filter. These parameters vary slowly with time.


The term linear prediction refers to the prediction of the output of a linear system based on its input and previous outputs:

$$s(n) = \sum_{k=1}^{p} a_k\,s(n-k) + \sum_{m=0}^{L} b_m\,u(n-m) \qquad [1]$$

For systems where the input is unknown, it is better to try to express the output in terms of the current input and the previous outputs. This is achieved by putting $b_m = 0$ for $m > 0$:

$$s(n) = \sum_{k=1}^{p} a_k\,s(n-k) + b_0\,u(n) \qquad [2]$$

Once we know the current input and the previous outputs we can predict the current output. The target is to evaluate the behavior of the unknown system H(z) once we have such information (previous outputs and current input):

$$H(z) = \frac{S(z)}{U(z)} \qquad [3]$$
$$S(z) = H(z)\,U(z) \qquad [4]$$

To build a model that may generate the output of equation 2, we consider an all-pole model. Assume the following transfer function:

$$H(z) = \frac{G}{A(z)} \qquad [5]$$

while

$$A(z) = 1 + \sum_{k=1}^{p} a_k\,z^{-k} \qquad [6]$$

Now let us evaluate the predicted output from that system. Substituting [5] and [6] into [4],

$$S(z)\left(1 + \sum_{k=1}^{p} a_k\,z^{-k}\right) = G\,U(z) \qquad [7]$$
$$S(z)\,A(z) = G\,U(z) \qquad [8]$$

Taking the inverse Z-transform,

$$s(n) = G\,u(n) - \sum_{k=1}^{p} a_k\,s(n-k) \qquad [9]$$

We can go further into our assumptions by putting $G = 1$:

$$s(n) = u(n) - \sum_{k=1}^{p} a_k\,s(n-k) \qquad [10]$$

The input is unknown for the speech signal. If we get rid of the input,

$$\tilde{s}(n) = -\sum_{k=1}^{p} a_k\,s(n-k) \qquad [11]$$


From the above derivations we have the following points:
1- The vocal tract is modeled by an all-pole model. This implies that the output is a function of the previous outputs and the current input.
2- We are targeting to model the vocal tract filter using the observed utterance. The input is unknown.
3- The mission now is to find the optimal parameters $\{a_k\}$ that ensure the minimum error

$$E = \sum_n \big(s(n) - \tilde{s}(n)\big)^2 = \sum_n e(n)^2 \qquad [12]$$

Let us start to process equation 12. As the number of available samples is limited due to the framing process, $e(n)$ has a value only at certain time indexes (instants defined by the index $n$). This means that the energy of $e(n)$ will not change if we sum over an infinite range:

$$E = \sum_{n=-\infty}^{\infty}\big(s(n) - \tilde{s}(n)\big)^2 = \sum_{n=-\infty}^{\infty}\left(s(n) + \sum_{k=1}^{p} a_k\,s(n-k)\right)^2 = \sum_{n=-\infty}^{\infty}\left(\sum_{k=0}^{p} a_k\,s(n-k)\right)^2 \qquad [13]$$


So the problem now is to minimize $E$ with respect to the variables $a_i$, where $i = \{1,2,\dots,p\}$ and

$$a_0 = 1 \qquad [14]$$

The minimum is found by setting the partial derivatives to zero:

$$\frac{\partial E}{\partial a_i} = 0,\qquad i = 1,2,\dots,p \qquad [15]$$

$$\frac{\partial E}{\partial a_i} = 2\sum_{n=-\infty}^{\infty}\left(\sum_{k=0}^{p} a_k\,s(n-k)\right)s(n-i) = 0 \qquad [16]$$

$$\sum_{k=0}^{p} a_k\sum_{n=-\infty}^{\infty} s(n-k)\,s(n-i) = 0 \qquad [17]$$

By rearranging Eqn. 17:

$$\frac{\partial E}{\partial a_i} = 2\sum_{k=0}^{p} a_k\sum_{n=-\infty}^{\infty} s(n-k)\,s(n-i) \qquad [18]$$

Let

$$\phi(k,i) = \sum_{n=-\infty}^{\infty} s(n-k)\,s(n-i) \qquad [19]$$

Equation 19 is the autocorrelation of $s(n)$ with a delay $k-i$. We can simplify $\phi(k,i)$ further by replacing $n-i = m$:

$$\phi(k,i) = \sum_{m=-\infty}^{\infty} s(m)\,s\big(m-(k-i)\big) = R(k-i) \qquad [20]$$

Substituting 19 into 18:

$$\frac{\partial E}{\partial a_i} = 2\sum_{k=0}^{p} a_k\,\phi(k,i) \qquad [21]$$

Recalling equation 15,

$$2\sum_{k=0}^{p} a_k\,\phi(k,i) = 0 \qquad [22]$$

It is required to solve equation 22 for $a_k$, recalling that $i = \{1,2,\dots,p\}$. To satisfy equation 22 for every value of $i$:

$$2\sum_{k=0}^{p} a_k\,\phi(k,1) = 0,\qquad 2\sum_{k=0}^{p} a_k\,\phi(k,2) = 0,\qquad \dots,\qquad 2\sum_{k=0}^{p} a_k\,\phi(k,p) = 0 \qquad [23]$$


Recalling from equations 19 and 20 that $R(k) = R(-k)$, and from equation 14 that $a_0 = 1$, the first equation of [23] may be expanded as follows:

$$2\sum_{k=0}^{p} a_k\,\phi(k,1) = 2\big[R(1) + a_1 R(0) + a_2 R(1) + \dots + a_p R(p-1)\big] = 0 \qquad [24]$$

$$\sum_{k=1}^{p} a_k\,R(k-1) = -R(1) \qquad [25]$$

Following the same direction as in 24 and 25, we can write the other terms of the equality in equation 23 as follows:

$$\sum_{k=1}^{p} a_k\,R(k-1) = -R(1),\qquad \sum_{k=1}^{p} a_k\,R(k-2) = -R(2),\qquad \dots,\qquad \sum_{k=1}^{p} a_k\,R(k-p) = -R(p) \qquad [26]$$

Equation 26 may be written in matrix form:

$$\begin{bmatrix} R(0) & R(1) & \cdots & R(p-1)\\ R(1) & R(0) & \cdots & R(p-2)\\ \vdots & & \ddots & \vdots\\ R(p-1) & R(p-2) & \cdots & R(0) \end{bmatrix} \begin{bmatrix} a_1\\ a_2\\ \vdots\\ a_p \end{bmatrix} = -\begin{bmatrix} R(1)\\ R(2)\\ \vdots\\ R(p) \end{bmatrix} \qquad [27]$$

Equation 27 involves a symmetric matrix of autocorrelation coefficients. Let us recall what we have up to now:
1- The linear prediction coefficients may be calculated using equation 27. They describe a linear system that we can utilize to regenerate the output with minimum error. This model does not assume the real input that generates the original signal; rather it assumes an impulse input.
2- $a_0 = 1$.
3- $H(z) = \dfrac{G}{A(z)}$.
From this point let us figure out the possible ways to solve equation 27.
Figure 2 LPC of certain speech frame


%%[A] = LPCDemo(file,N,T,p)
% This function demonstrates LPC by plotting the frame spectrum, the estimated
% frame spectrum, the frame waveform and the estimated frame waveform. The frame
% is extracted from the speech samples in the WAV file 'file'. The function
% returns the LPC parameter vector of the requested frame.
% file : WAV file
% N    : Frame number. The first frame is frame number 0. Default = 1
% T    : Frame size in ms. Default = 10 (ms)
% p    : LPC order. Default = 12
%%-----------------------------------------------------------------------
function [A] = LPCDemo(file,N,T,p)
nbIn = nargin;
if nbIn < 1 , error('Not enough input arguments.');
elseif nbIn == 1, N=1; T=10; p=12;
elseif nbIn == 2, T=10; p=12;
elseif nbIn == 3, p=12;
end;
[y fs] = wavread(file);
frame_size = round(T*1e-3*fs);          % frame length in samples
Start = N*frame_size + 1;               % first sample of frame N (frame 0 starts at sample 1)
End = Start + frame_size - 1;
Frame = y(Start:End);
temp = abs(fft(Frame));
Spectral_Parameters = temp(1:int32(frame_size/2));
dltaf = fs / frame_size;
dltat = 1/fs;
t = (Start:End)*dltat;
f = 0:dltaf:fs/2-dltaf;
A = lpc(Frame,p);                       % LPC coefficients (Matlab built-in lpc)
B = 1;
x = zeros(frame_size,1); x(1) = 1;      % single impulse excitation
Estimated_frame = filter(B,A,x);        % all-pole synthesis 1/A(z)
temp = abs(fft(Estimated_frame));
Estimated_Spectral_Parameters = temp(1:int32(frame_size/2));
subplot(2,2,1);plot(t,Estimated_frame);xlabel('Time(sec)');ylabel('Amplitude');title('Estimated Frame for single impulse');
subplot(2,2,2);plot(t,Frame);xlabel('Time(sec)');ylabel('Amplitude');title('Frame');
subplot(2,2,4);plot(f,Spectral_Parameters);xlabel('Frequency(Hz)');ylabel('Spectral value');title('Spectrum of Frame');
subplot(2,2,3);plot(f,Estimated_Spectral_Parameters);xlabel('Frequency(Hz)');ylabel('Spectral value');title('Spectrum of Estimated Frame');
end


Figure 3 Speech properties are time varying.

Figure 3 indicates that the speech waveform changes in shape over time. As shown in the figure, each step on the time axis is 5 (ms). LPC parameters provide filter parameters for a certain time duration. The duration should be selected such that the speech waveform has stationary properties; typical values for the analysis period are 10, 15, 20, 25 and 30 (ms). Let us consider a simple example. Assume a speech signal sampled at 32000 (Hz) and an analysis period of 20 (ms). The frame length in samples will be 32000 × 20 × 10⁻³ = 640 (samples).

Assume that the LPC model considered in this chapter is used to model speech production process. Assume that the LPC order is 12. This means that 12 parameters will be used to generate equivalent speech waveform in the designated analysis period. For this example it is 20 (ms). The model in figure1 will be excited using a train of impulses with a period equals to the fundamental frequency (pitch) in case of voiced sound. The above example indicates that 640 samples frame length is replaced by 12 parameters + 1 extra parameter that indicated if this frame is for voiced sound or unvoiced sound. This implies 640:13 compression ratio.

Figure 4 Encoding /Decoding process using LPC parameters

The process is illustrated in figure 4. The speech waveform is sliced into frames; each frame is 20 (ms), i.e. 640 samples. Each frame is applied to the LPC process, which extracts the LPC parameters, here 12 parameters. The pitch information is also packaged into the parameters frame. So each frame of time samples is converted into 12 LPC parameters plus 2 parameters that hold the pitch period and the type of the frame (voiced/unvoiced). The sequence of parameter frames is applied to the model, and the predicted speech waveform is constructed. The MUX block is a multiplexer: it multiplexes the two inputs using the voiced/unvoiced selector that is included in the parameters frame.


References
[1] L.R. Rabiner, R.W. Schafer, "Digital Processing of Speech Signals", Prentice-Hall, ISBN: 0-13-213603-1.
[2] Thomas W. Parsons, "Voice and Speech Processing", McGraw-Hill Inc., 1987.


Chapter 4 Popular Speech features

Popular Speech features

5th Edition

2008

Summary In this chapter some of the popular speech signal features will be explained. Speech features are the base of speech recognition; they are used to discriminate different speech sounds. There are many features that may be utilized in automatic speech recognition systems. In this chapter we will pass through pitch, formants, cepstrum, energy, zero crossing and Mel cepstrum. The algorithms to evaluate the mentioned features will be explained and provided in the C# or Matlab scripting languages.

Objectives Understanding Pitch. Understanding Reflection coefficients. Understanding Formants. Understanding Cepstrum and Mel Cepstrum. Understanding Energy and Zero crossing rate of different speech sounds. Practice using Matlab and C#.


1. Energy
The energy is a property that can be used to discriminate voiced speech from unvoiced speech. Let us consider the short-time energy. Short time means the energy of the analysis period.

$$E = \sum_{m=0}^{N-1} s(m)^{2} \qquad [1]$$

where N is the frame length in samples.

Equation 1 evaluates the energy of the signal s(n). We are dealing here with the short-time energy of a single frame.
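A small Matlab sketch of equation 1 applied frame by frame is given below; the file name is only an example, and wavread can be replaced by audioread in newer Matlab versions.

[s, fs] = wavread('speech.wav');       % assumed file name
N = round(20e-3 * fs);                 % 20 ms analysis window
nFrames = floor(length(s) / N);
E = zeros(nFrames, 1);
for k = 1:nFrames
    frame = s((k-1)*N + 1 : k*N);
    E(k) = sum(frame .^ 2);            % short-time energy of one frame
end
plot(E); xlabel('Frame index'); ylabel('Energy');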

Figure 1: Impulse train and its Fourier transform.

As shown in figure 1, the impulse train has a Fourier transform that is also an impulse train, with a frequency spacing of $2\pi/P$ (rad/sample), where P is the period in samples. This result is very important and may be utilized to understand all events in the frequency domain. Let us start by analyzing the windowing effect. It is the process of cutting a certain period off the signal to extract features. This process is called the framing stage. The cut signal is called the analysis period. Figure 2 indicates the model of the framing process.

Figure 2: The window filter

The speech frame $S_f(n)$ is the result of applying a frame of an impulse train U(n) to the filter H(z). H(z) is a model of the glottal pulse shape filter multiplied by the vocal tract filter. The frame of the impulse train is the result of multiplying an infinite impulse train P(n) by the window W(n):

$$U(n) = P(n)\,W(n)$$

The multiplication in the time domain is a convolution in the frequency domain between W(z) and P(z). Figure 3 indicates the frequency response of a rectangular window of length 30 samples.


Figure 3: Rectangular window and its spectrum.

As shown in figure 3, the window in the frequency domain is a low-pass filter with a cut-off frequency of $2\pi/30$ in this example. The convolution between P(z) and W(z) will produce a train of W(z) in the frequency domain.

Figure 4: Frame of impulses of period P = 5


As shown in figure 4, the frequency response of U(n) consists of impulses shaped by the window filter of figure 3. Now U(n) is the excitation of H(z). The multiplication between the shaped impulses and H(z) is much more sensitive to the frequency contents of H(z) than an impulse train of almost zero width in the frequency domain, as shown in figure 5. The pulse width increases if the window length decreases:

$$PW = 2\cdot\frac{2\pi}{N} = \frac{4\pi}{N}$$

PW is the pulse width. As shown in figure 3, half of the pulse width is $2\pi/N$.

When the pulse width increases, more frequency contents will be included. This certainly is a significant factor if it is needed to catch a phenomenon that fluctuates in the time domain or has many adjacent frequency components in the frequency domain. Also, if the window is very narrow, for example in the order of the impulse period, the shaped impulses will overlap in the frequency domain and may cause noisy information. So choosing a suitable window length helps to detect more frequency components from E(z), which in turn helps to detect the fast fluctuations in E(z).

Figure 5: U(n)·H(n) in the frequency domain. (a) U(n) is a multiplication between the window function and the glottal impulses. (b) No window function is applied to the glottal impulses.

$$E_{n} = \sum_{m=-\infty}^{\infty}\big(s(m)\,w(n-m)\big)^{2} = \sum_{m=-\infty}^{\infty} s(m)^{2}\,h(n-m) \qquad [2]$$

where

$$h(n) = w(n)^{2}$$

For a rectangular window such as the one shown in figure 2, the impulse response is given by

$$h(n) = \begin{cases} 1 & 0 \le n \le N-1 \\ 0 & \text{otherwise} \end{cases}$$

and its frequency response is

$$H(e^{j\omega}) = \frac{\sin(\omega N/2)}{\sin(\omega/2)}\,e^{-j\omega (N-1)/2} \qquad [3]$$


Figure 6: S(n) is convolved in the time domain with the window filter to obtain the frame samples.


Figure 7: Rectangular window and its spectrum. Window width 220.


Figure 8: Rectangular window and its spectrum. Window width 40.

Figure 7 shows the frequency response of a rectangular window of length 220 (samples). The cut-off frequency of the filter is 0.028 (rad/sample). In figure 8, where the window width is 40 (samples), the cut-off frequency is 0.157 (rad/sample). The following Matlab script is used to draw figures 7 and 8.
%% function [omega,ys] = WFD(N)
% WFD is for Window Filter Demo. This function calculates the spectrum of the
% rectangular window. It returns the digital frequency Omega and the associated
% absolute spectrum for a window of width N.
%% ----------------------------------------------------------------------
% Amr M. Gody, Fall 2008
%%
function [omega,ys] = WFD(N)
t = -N*10 : N*10;
M = size(t,2) - 1;
x = rectpuls(t, N);              % rectangular window of width N samples
y = fft(x);
ys = abs(y);
dltaOmega = 2*pi/M;
L = 0:M;
omega = dltaOmega .* L;
tt = 1; yy = 1.4;                % extra point used only to widen the y-axis of the plot
subplot(2,1,1); plot(t, x, '-', tt, yy); xlabel('Time (Index)'); ylabel('Magnitude');
title('Sampled Rectangular window');
subplot(2,1,2); plot(omega, ys); xlabel('Omega (rad/sample)'); ylabel('Absolute spectrum');
title('Frequency Domain');
end

As shown in figure 9, energy is a function of window width. When the window is too short, the cut-off frequency is relatively large, so the fluctuation in energy is much more noticeable. The energy function may be used to discriminate voiced sounds from unvoiced sounds: voiced sounds have noticeable energy values with respect to unvoiced sounds.

Figure 9: Energy as a function of window width for a certain speech sample. The speech utterance is "What she said".

From the above discussion we can draw the following conclusions:


1- The rectangular window affects the frequency contents of the analysis frame. The window acts as a low-pass filter that shapes the impulse train of the vocal cords. The cut-off frequency is inversely proportional to the window width; it is 2π/N (rad/sample), where N is the window length in samples.
2- The window length is a key factor in detecting the fluctuation in the energy function. The fluctuation may be modeled as frequency components spread over a certain frequency band. The window length shapes the impulse train in the frequency domain; it makes the output signal much denser in frequency components than the zero-width impulse train.
3- The energy function is used to discriminate the voiced sounds from the unvoiced sounds.

2. Zero Crossing
Speech signals are broadband signals and the interpretation of the average zero-crossing rate is therefore much less precise. However, rough estimates of spectral properties can be obtained using a representation based on the short-time average zero-crossing rate. Before discussing the interpretation of the zero-crossing rate for speech, let us first define and discuss the required computations.

$$Z_{n} = \sum_{m=-\infty}^{\infty}\big|\,\mathrm{sgn}\big(x(m)\big) - \mathrm{sgn}\big(x(m-1)\big)\,\big|\; w(n-m) \qquad [4]$$

Where

$$\mathrm{sgn}\big(x(m)\big) = \begin{cases} 1 & x(m) \ge 0 \\ -1 & x(m) < 0 \end{cases} \qquad [5]$$

$$w(n) = \begin{cases} \dfrac{1}{2N} & 0 \le n \le N-1 \\ 0 & \text{otherwise} \end{cases} \qquad [6]$$


Figure 10: Short-time average zero crossing

If a zero crossing occurred, the term inside equation 4 evaluates to 2. To get the total number of zero crossings over the N samples we first divide by 2, and then divide by N to get the average. Now let us see how the short-time average zero-crossing rate applies to speech signals. The model for speech production suggests that the energy of voiced speech is concentrated below about 3 kHz because of the spectrum fall-off introduced by the glottal wave, whereas for unvoiced speech most of the energy is found at higher frequencies. Since high frequencies imply high zero-crossing rates, and low frequencies imply low zero-crossing rates, there is a strong correlation between zero-crossing rate and energy distribution with frequency. A reasonable generalization is that if the zero-crossing rate is high, the speech signal is unvoiced, while if the zero-crossing rate is low, the speech signal is voiced. This, however, is a very imprecise statement because we have not said what is high and what is low, and of course it really is not possible to be precise. Figure 11 shows a histogram of average zero-crossing rates (averaged over 10 (msec)) for both voiced and unvoiced speech. Note that a Gaussian curve provides a reasonably good fit to each distribution. The mean short-time average zero-crossing rate is 49 per 10 (msec) for unvoiced and 14 per 10 (msec) for voiced speech. Clearly the two distributions overlap, so an unequivocal voiced/unvoiced decision is not possible based on the short-time average zero-crossing rate alone. Nevertheless, such a representation is quite useful in making this distinction.
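The following sketch computes the short-time average zero-crossing rate of equations 4-6 over 10 (ms) frames; the file name is an assumption.

[s, fs] = wavread('speech.wav');       % assumed file name
N = round(10e-3 * fs);                 % 10 ms frames, as in the histogram of figure 11
nFrames = floor(length(s) / N);
Z = zeros(nFrames, 1);
for k = 1:nFrames
    frame = s((k-1)*N + 1 : k*N);
    sg = sign(frame);
    sg(sg == 0) = 1;                   % sgn(x) = 1 for x >= 0, as in equation 5
    % |sgn(x(m)) - sgn(x(m-1))| equals 2 at each crossing; divide by 2N for the average
    Z(k) = sum(abs(diff(sg))) / (2 * N);
end
plot(Z); xlabel('Frame index'); ylabel('Average zero-crossing rate');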


Figure 11: Histogram of the zero-crossing rate averaged over 10 (ms) for voiced/unvoiced samples. The solid curve is a Gaussian distribution fit.

Clearly, the zero-crossing rate is strongly affected by dc offset in the analog-to-digital converter, by 60 Hz hum in the signal, and by any noise that may be present in the digitizing system. Therefore, extreme care must be taken in the analog processing prior to sampling to minimize these effects. For example, it is often preferable to use a bandpass filter, rather than a lowpass filter, as the anti-aliasing filter, so as to eliminate dc and 60 Hz components in the signal.


Figure 12: Average zero-crossing rate for three different utterances.

3. Reflection coefficients
A widely used model for speech production is based upon the assumption that the vocal tract can be represented as a concatenation of lossless acoustic tubes. Figure 13 shows the main organs that are responsible for speech production. Starting from this model, we can see that the speech waveform is a flow of air inside different tubes. The tubes are not uniform; the cross section varies along the tube path to articulate the different sounds. These changing properties cause turbulence in the air flow.


Figure 13: Speech production schematic in human being.

To study the air flow inside those tubes, we should consider the velocity of the air, the pressure along the tube path and the properties of the tubes. Let us consider the following parameters:


Figure 14: Block diagram of uniform tube.

Considering the hypothetical tube of constant cross section indicated in figure 14, we have the following relations that rule the pressure and the velocity:

$$-\frac{\partial p}{\partial x} = \frac{\rho}{A}\,\frac{\partial u}{\partial t}, \qquad -\frac{\partial u}{\partial x} = \frac{A}{\rho c^{2}}\,\frac{\partial p}{\partial t} \qquad [7]$$

where p is the sound pressure, u is the volume velocity, A is the cross-sectional area of the tube, ρ is the air density and c is the speed of sound.

Now let us expand the discussion to cover the whole vocal tract tube. Referring to figure 15, the whole vocal tract excluding the nasal cavity is modeled as a long straight tube with a continuously changing cross-sectional area.

Figure 15: The tube model of continuous area function.

Let us solve the differential equations in 7. Equation 7 is derived for a uniform cross-section tube such as the one indicated in figure 14. To overcome this limitation, let us segment the tube of figure 15 into segments, each of uniform cross section, as shown in figure 16.

Figure 16: Segmented vocal tract tube.

The solution will be derived for a single segment. Assume a time-harmonic input as in equations 8 and 9. This assumption is very practical, as any signal may be considered as a combination of sines and cosines; also the system is linear, so the output is the sum of the outputs resulting from the individual input harmonics. The term $e^{j\omega t}$ will appear in all equation terms.

$$p(x,t) = P(x)\,e^{j\omega t} \qquad [8]$$

$$u(x,t) = U(x)\,e^{j\omega t} \qquad [9]$$

Substituting equations 8 and 9 into equation 7:

$$-\frac{dP}{dx} = j\omega\,\frac{\rho}{A}\,U, \qquad -\frac{dU}{dx} = j\omega\,\frac{A}{\rho c^{2}}\,P \qquad [10]$$

Applying $\dfrac{d}{dx}$ on equation 10:

$$\frac{d^{2}P}{dx^{2}} = -j\omega\,\frac{\rho}{A}\,\frac{dU}{dx} \qquad [11]$$

Substitute from equation 10 into equation 11:

$$\frac{d^{2}P}{dx^{2}} = -\frac{\omega^{2}}{c^{2}}\,P \qquad [12]$$

Let us define the constant

$$\beta = \frac{\omega}{c} \qquad [13]$$

The general solution of equation 12 is the sum of an incident and a reflected wave:

$$P(x) = P^{+}e^{-j\beta x} + P^{-}e^{+j\beta x} \qquad [14]$$

$$U(x) = U^{+}e^{-j\beta x} + U^{-}e^{+j\beta x} \qquad [15]$$

so that

$$p(x,t) = \big(P^{+}e^{-j\beta x} + P^{-}e^{+j\beta x}\big)e^{j\omega t}, \qquad u(x,t) = \big(U^{+}e^{-j\beta x} + U^{-}e^{+j\beta x}\big)e^{j\omega t} \qquad [16]$$

The + and − signs are for the incident and reflected waves respectively. The x direction is considered as the incident direction. To get the wave impedance, assume no reflection:

$$p(x,t) = P^{+}e^{-j\beta x}e^{j\omega t}, \qquad u(x,t) = U^{+}e^{-j\beta x}e^{j\omega t} \qquad [17]$$

Substitute in equation 10 from equations 17:

$$j\beta\,P^{+} = j\omega\,\frac{\rho}{A}\,U^{+} \qquad [18]$$

$$\frac{P^{+}}{U^{+}} = \frac{\omega\rho}{\beta A} = \frac{\rho c}{A} \qquad [19]$$

R is the speech wave impedance inside the lossless transmission-line medium (vocal tract model):

$$R = \frac{P^{+}}{U^{+}} \qquad [20]$$

$$R = \frac{\rho c}{A} \qquad [21]$$

Recalling that the speech wave has a direction for the velocity, equation 16 may be written in terms of forward and backward traveling waves:

$$u(x,t) = u^{+}\!\big(t - x/c\big) - u^{-}\!\big(t + x/c\big) \qquad [22]$$

$$p(x,t) = R\,\big[u^{+}\!\big(t - x/c\big) + u^{-}\!\big(t + x/c\big)\big] \qquad [23]$$

Equations 22 and 23 will be denoted with the index of the corresponding segment of figure 16:

$$p_{k}(x,t) = \frac{\rho c}{A_{k}}\big[u_{k}^{+}\!\big(t - x/c\big) + u_{k}^{-}\!\big(t + x/c\big)\big] \qquad [24]$$

$$u_{k}(x,t) = u_{k}^{+}\!\big(t - x/c\big) - u_{k}^{-}\!\big(t + x/c\big) \qquad [25]$$

With a little manipulation of equation 25 we can obtain equation 27. Let $\tau_{k}$ be the one-way propagation time through segment k of length $l_{k}$:

$$\tau_{k} = \frac{l_{k}}{c} \qquad [26]$$

$$u_{k}(l_{k},t) = u_{k}^{+}\!\big(t - \tau_{k}\big) - u_{k}^{-}\!\big(t + \tau_{k}\big) \qquad [27]$$

$$p_{k}(l_{k},t) = \frac{\rho c}{A_{k}}\big[u_{k}^{+}\!\big(t - \tau_{k}\big) + u_{k}^{-}\!\big(t + \tau_{k}\big)\big] \qquad [28]$$

The propagation within a segment causes a shift in the velocity wave, as indicated in equation 27, equal to $\tau_{k}$. For a complete picture refer to figure 17.

Figure 17: Reflection at a glance due to Vocal tract segmentations model.

Let us consider the following boundary conditions between any two successive segments: pressure and volume velocity must be continuous at the interface between segments k and k+1,

$$p_{k}(l_{k},t) = p_{k+1}(0,t), \qquad u_{k}(l_{k},t) = u_{k+1}(0,t) \qquad [29]$$

The reflection coefficient at the interface between segments k and k+1 follows from these boundary conditions. It has the same form as the reflection and transmission coefficients of a wave propagating from medium 1 to medium 2 in a transmission line. Now let us solve equations 25 for $u_{k}^{+}$ and $u_{k}^{-}$.


Figure 17 and the differential equations 10 lead to the transmission line analogy. Each segment may be modeled as a lossless transmission line. P and U play the role of V and I in the transmission line, and the characteristic impedance of the transmission line will be R.
Figure 18: Analogous circuit for vocal tract lossless tube model
Figure 19: Transmission line segment model

$$-\frac{\partial v}{\partial x} = L\,\frac{\partial i}{\partial t} \qquad [30]$$

$$-\frac{\partial i}{\partial x} = C\,\frac{\partial v}{\partial t} \qquad [31]$$

Comparing equation 7 to equations 30 and 31, we can find that

$$L = \frac{\rho}{A}, \qquad C = \frac{A}{\rho c^{2}} \qquad [32]$$

The characteristic impedance of a lossless transmission line is given by:

$$Z_{0} = \sqrt{\frac{L}{C}} \qquad [33]$$

Substituting from 32 into 33:

$$Z_{0} = \sqrt{\frac{\rho/A}{A/(\rho c^{2})}} = \frac{\rho c}{A} \qquad [34]$$

Equation 34 matches equation 21, as expected. The above discussion leads us to use transmission-line calculations. Each segment in figure 17 is a load segment as shown in figure 18. The wave turns into a standing wave due to the reflection at the load points. The reflection and transmission coefficients are given by equations 35.
$$r_{1,2} = \frac{Z_{2} - Z_{1}}{Z_{2} + Z_{1}}, \qquad t_{1,2} = \frac{2 Z_{2}}{Z_{2} + Z_{1}} \qquad [35]$$

Since $Z_{k} = \rho c / A_{k}$, the coefficients may equivalently be written in terms of the cross-sectional areas:

$$r_{1,2} = \frac{A_{1} - A_{2}}{A_{1} + A_{2}}, \qquad t_{1,2} = \frac{2 A_{1}}{A_{1} + A_{2}}$$
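A short Matlab sketch of the reflection coefficients written in terms of the cross-sectional areas is shown below; the area values are hypothetical and serve only to illustrate the computation.

areas = [2.6 1.8 1.4 1.0 1.3 2.0 3.2];   % hypothetical segment areas A_1 ... A_K (cm^2)
K = length(areas);
r = zeros(1, K-1);
for k = 1:K-1
    % reflection coefficient at the junction between segment k and segment k+1
    r(k) = (areas(k) - areas(k+1)) / (areas(k) + areas(k+1));
end
disp(r)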

4. Pitch
The opening and closing of the vocal cords break the air stream up into pulses, as shown in figure 20. The repetition rate of the pulses is termed the pitch.

Figure 20: The glottal pulse train.


Figure 21: Vocal cords

5. Cepstrum

The output of the characteristic system in figure 22, $\hat{x}(n)$, is called the "complex cepstrum". The term "complex cepstrum" implies that the complex logarithm is involved. Given that it is possible to compute the complex logarithm so as to satisfy equation 36, the output of the characteristic system for convolution is the inverse transform of the complex logarithm of the Fourier transform of the input, i.e.

Figure 22: Complex cepstrum

$$\hat{X}(e^{j\omega}) = \log X(e^{j\omega}) = \log\big[X_{1}(e^{j\omega})\,X_{2}(e^{j\omega})\big] = \log X_{1}(e^{j\omega}) + \log X_{2}(e^{j\omega}) \qquad [36]$$

$$\hat{x}(n) = \frac{1}{2\pi}\int_{-\pi}^{\pi}\hat{X}(e^{j\omega})\,e^{j\omega n}\,d\omega \qquad [37]$$

Equation 37 gives a very important aspect of the cepstrum. The Fourier transform of a certain process may be considered as the result of cascaded filters, as shown in equation 36. Each filter in the cascade may hold a certain piece of information. These are hypothetical filters: we can assume that the total information is a compound of smaller pieces of information. The log operator breaks the compound information down into superimposed pieces. The final stage of the process is the inverse transform. This makes the cepstrum parameters the output of a linear system that is triggered by those superimposed pieces of information. The cepstrum domain thus acts as a parallel time domain that gives a new perspective on the original speech signal. For example, the signal in the time domain is a sum of sines and cosines; we can discriminate those signals in the frequency domain by using suitable filters. The signal in the parallel time domain (quefrency domain) is the result of a sum of signals in the log-frequency domain, and those signals may be discriminated using suitable filters (called lifters) in the parallel time domain. Much information may be emphasized from this new perspective. The keywords in this new parallel domain are slightly altered to remind us of the way it is created. Figure 23 provides the keywords in the newly established domain. This is homomorphic filtering.

Figure 23: Keywords in the cepstrum domain.


What is homomorphic filtering?


Homomorphic filtering is a generalized technique for signal and image processing, involving a nonlinear mapping to a different domain in which linear filtering techniques are applied, followed by mapping back to the original domain. Homomorphic filtering is used in the log-spectral domain to separate filter effects from excitation effects, for example in the computation of the cepstrum as a sound representation; enhancements in the log-spectral domain can improve sound intelligibility.

5.1. Speech signal in the complex cepstrum domain

Recall that the model for speech production consists essentially of a slowly time-varying linear system excited by either a quasi-periodic impulse train or by random noise. Thus, it is appropriate to think of a short segment of voiced speech as having been generated by exciting a linear time-invariant system by a periodic impulse train. Similarly, a short segment of unvoiced speech can be thought of as resulting from the excitation of a linear time-invariant system by random noise. That is, a short segment of voiced speech can be thought of as a segment from the waveform

$$s(n) = p(n) * g(n) * v(n) * r(n) \qquad [38]$$

Equation 38 illustrates that the voiced segment is the result of the convolution of a periodic pulse train p(n) with the glottal pulse shape filter g(n), the vocal tract filter v(n) and the radiation-effects filter r(n). For unvoiced speech the waveform segment is given by equation 39 below:

$$s(n) = u(n) * v(n) * r(n) \qquad [39]$$

u(n): the random noise excitation.


From equations 38 and 39, and from the definition of the complex cepstrum, we can see that using the cepstrum makes it possible to separate the excitation from the vocal tract model. This is very important in the automatic speech recognition process: the excitation is speaker dependent, so by isolating that effect it is expected that the recognition efficiency will be highly enhanced.


Figure 24: 20(ms) segment of voiced speech. Hamming window is applied.

A segment of voiced speech is shown in figure 24. The Hamming window is applied to minimize the effect of the sharp changes at the segment ends. The effect of the excitation pulse train clearly appears as high peaks, while the filters' effect appears in the shape of the pulses.

Figure 25: Cepstrum of voiced segment.


Figure 26: Cepstrum analysis of a Hamming-windowed voiced speech segment.

%% Cepstrum_Demo(file, start_time, duration)
% This function demonstrates cepstrum analysis. It applies complex cepstrum
% analysis on a short segment of the given duration in (ms). To use it, take a
% wav file and choose a period of voiced speech and periods of unvoiced speech.
% Save the figures and make a conclusion. Try to focus on the pitch period.
% file       : WAV file
% start_time : starting time in seconds of the segment
% duration   : duration of the segment in milliseconds
%% ----------------------------------------------------------------------
% Amr M. Gody, Fall 2008
%%
function Cepstrum_Demo(file, start_time, duration)
[y, fs] = wavread(file);
LastSample = size(y,1);
t = linspace(0, LastSample/fs, size(y,1));
end_time = start_time + duration * 1e-3;
start_sample = round(start_time * fs) + 1;   % +1: MATLAB indexing starts at 1
end_sample = round(end_time * fs);
S = y(start_sample:end_sample);
t_s = t(start_sample:end_sample);
t_s = t_s - t_s(1);                % shift the time axis of the segment to 0
S_H = S .* hamming(size(S,1));
C = cceps(S_H);                    % complex cepstrum of the windowed segment
subplot(3,2,1); plot(t,y);     xlabel('Time (Sec)'); ylabel('Amplitude'); title('Original wav file');
subplot(3,2,2); plot(t_s,S);   xlabel('Time (Sec)'); ylabel('Amplitude'); title('WAV Segment');
subplot(3,2,3); plot(t_s,S_H); xlabel('Time (Sec)'); ylabel('Amplitude'); title('WAV Segment after Hamming');
subplot(3,2,4); plot(t_s,C);   xlabel('Time (Sec)'); ylabel('Amplitude'); title('Complex Cepstrum of Hamming segment');
end


6. Mel cepstrum
It is the same as the cepstrum with only one difference: the log is applied to the mel-scale spectrum instead of the spectrum.

It is believed that the human hearing system is the best recognition system; by trying to simulate the human hearing system, good practical results may be achieved. The speech signal is processed in such a manner that low frequency components have more weight than high frequency components [3]. The human ear responds to speech in the manner indicated by the Mel scale in figure 27. This curve explains a very important fact: human ears cannot differentiate well between sounds in the high frequency range, while they can do so in the low frequency range. The Mel scale is a scale that reflects what humans can hear. As shown by figure 27, a change in frequency from 4000 (Hz) to 8000 (Hz) makes only a 1000 (Mel) change on the Mel scale. This is not the case in the low frequency range, which starts at 0 (Hz) and ends at 1000 (Hz); in this range a 1000 (Hz) change is equivalent to a 1000 (Mel) change. This explains that human hearing is very sensitive to frequency variations in the low range, while this is not the case in the high range.

Figure 27: Mel scale curve that models the human hearing response to different frequencies[4]
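The Mel curve of figure 27 can be sketched with the commonly used mapping mel = 2595·log10(1 + f/700); note that this is only one of several published variants of the Mel scale.

f = 0:10:8000;                        % frequency axis in Hz
mel = 2595 * log10(1 + f / 700);      % corresponding Mel values
plot(f, mel); xlabel('Frequency (Hz)'); ylabel('Mel');
% With this formula 1000 Hz maps to about 1000 Mel, while the whole 4000-8000 Hz
% range adds comparatively little on the Mel axis.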


7. Exercises

Exercise 1
Use the following Matlab script to do the exercises. Find the frequency response of an impulse train with a repetition frequency of 120 (Hz). Assume the sampling rate is 10 (kHz).

Answer
Write the following command lines:

g = 1;
Tp = 1/120;
Ts = 1/10000;
P = Tp/Ts;
[f,h] = cr_fr(g,P);


Exercise 2
Now apply a rectangular window of length 20 (ms).

Answer
Add the following two command lines:

Tw = 20 * 1e-3 * 10000;
[f,h] = cr_fr(g,P,Tw);


The observations:
1- The distance between two maxima is 2π/P = 0.0754 (rad/sample), which corresponds to 120 (Hz) at the 10000 (Hz) sampling rate.
2- The impulses in the frequency domain are shaped by the window frequency response.

Exercise 3: Now consider the following glottal pulse shape and find its frequency response.

g = [[0:0.1:5]';[5:-0.1:0]'];   % falling part assumed as 5:-0.1:0; the original [5:0] is an empty vector
[f,h] = cr_fr(g);

Exercise 4: Now let us find the frequency response of a train of impulses applied to the glottal pulse of exercise 3. Assume a period of 150 samples.

[f,h] = cr_fr(g,150);


Observations:
1- The frequency response is repeated with a period corresponding to the impulse train period.
2- The frequency response is a discrete signal.
3- The frequency response in exercise 3 is the envelope of the frequency response in this exercise.
4- The distance between any two successive impulses in the frequency domain is 2π/150 = 0.0419 (rad/sample).


Exercise 5:
Now apply a window of length 600 samples to the signal of exercise 4.

[f,h] = cr_fr(g,150,600);

Observations:
1- The window of length 600 causes a shaping of the impulses in the frequency domain.
2- The frequency response of exercise 3 is the envelope of this frequency response.
3- The distance between any two peaks in the frequency domain is 2π/150 = 0.0419 (rad/sample).


4- The bandwidth of the impulse is 2π/600 = 0.0105 (rad/sample).
5- The signal power is distributed over more frequency components. Compare the frequency response of exercise 4 to this frequency response.


Exercise 6: Now evaluate the vocal tract model for a certain voiced speech segment. Use the function:

v = cr_vtf_1(a2);


Exercise 7: Using the filter obtained in exercise 6, with pitch period = 286 samples, obtain the corresponding filter response to an impulse train whose period equals the pitch period.

[f,h] = cr_fr(v,286);

Exercise 8: Using the following chart, which shows the spectrum of the vocal tract filter of the signal in exercise 6, locate the formants and the pitch on the chart.

Pitch frequency = 2π/286 = 0.0220 (rad/sample).


Observations:
1- The frequency response of the vocal tract filter is sampled at the pitch frequency.
2- The formant locations have the maximum power.

Exercise 9: Include the window effect in the previous example, by using a window of length equal to 4 pitch periods: Wl = 4 × 286 = 1144 samples.

[f2 h2] = cr_fr(v,286);
[f1 h1] = cr_fr(v,286,1144);
[f h] = cr_fr(v);
h = cr_TR(h2,h);
h1 = cr_TR(h2,h1);
subplot(3,1,1); plot(f2(1:100000),h(1:100000));
subplot(3,1,2); plot(f2(1:100000),h2(1:100000));
subplot(3,1,3); plot(f2(1:100000),h1(1:100000));

Observations:
1- The power is distributed over more frequency components. This explains the smaller spectrum values with respect to the figure in the middle.
2- The bandwidth of the impulse is 2π/1144 = 0.0055 (rad/sample).


7.1. Matlab functions

function CR_fr

%% function [o,h] = CR_fr(g,p,Wn,Wt)
% FR evaluates the Frequency Response of a digital function.
% The function is sample based. The frequency response is in the digital
% frequency domain; it does not include the sampling period. If you want to
% evaluate the real frequency you should multiply the frequency axis by the
% sampling frequency. The function plots the frequency domain as well as the
% time domain.
% G  : The discrete-time signal for one period.
% P  : The period in samples. Default is 0, meaning a non-periodic signal.
% O  : The digital frequency values associated with the frequency response values (H).
% H  : The frequency response.
% Wn : Window length in samples. Default = signal length.
% Wt : Window type. Default is a rectangular window.
%% ----------------------------------------------------------------------
% Amr M. Gody, Fall 2008
%%
function [o,h] = CR_fr(g,p,Wn,Wt)
nbIn = nargin;
if nbIn < 1,
    error('Not enough input arguments.');
elseif nbIn == 1,
    %% Period = 0, rectangular window, window length = signal length
    g = ToColumn(g);
    Ng = size(g,1);
    p = 0;
    Wn = Ng * 10;
    Wt = 1;
    N = Ng * 10;
    x = [g; zeros(N-Ng,1)];
elseif nbIn == 2,
    %% P ~= 0, rectangular window, window length = signal length
    g = ToColumn(g);
    Ng = size(g,1);
    if p > Ng, Xg = [g; zeros(p-Ng,1)]; else Xg = g(1:p); end
    x = Xg;
    for k = 1:100, x = [x; Xg]; end
    N = size(x,1);
elseif nbIn == 3,
    %% P ~= 0, rectangular window, window length ~= signal length
    g = ToColumn(g);
    Ng = size(g,1);
    if p > Ng, Xg = [g; zeros(p-Ng,1)]; else Xg = g(1:p); end
    x = Xg;
    for k = 1:100, x = [x; Xg]; end
    x = x(1:Wn);
    N = Wn;
elseif nbIn == 4,
    %% P ~= 0, Hamming window, window length ~= signal length
    g = ToColumn(g);
    Ng = size(g,1);
    if p > Ng, Xg = [g; zeros(p-Ng,1)]; else Xg = g(1:p); end
    x = Xg;
    for k = 1:100, x = [x; Xg]; end
    x = x(1:Wn) .* hamming(Wn);
    N = Wn;
end
%%
N = N * 10;   % increase the calculation resolution; this does not affect the results
h = abs(fft(x,N));
dlta = 2*pi/N;
o = 0:dlta:(N-1)*dlta;
n = 0:max(size(x))-1;
if (p ~= 0)
    subplot(2,1,1); bar(n(1:4*p), x(1:4*p), 0.001); xlabel('Time (Sample)');
else
    subplot(2,1,1); bar(n, x, 0.001); xlabel('Time (Sample)');
end
subplot(2,1,2); plot(o,h); xlabel('Frequency (Rad/sample)');
end


function cr_VTF_1

% [v] = cr_VTF_1(Frame)
% VTF = Vocal Tract Filter estimation. The function calculates the impulse
% response of the vocal tract for a certain frame and returns the vocal tract
% filter in the time domain.
% Frame : waveform frame
%% ----------------------------------------------------------------------
% Amr M. Gody, Fall 2008
%%
function [v] = cr_VTF_1(Frame)
nbIn = nargin;
if nbIn < 1,
    error('Not enough input arguments.');
end;
p = 12;                          % LPC order
% H(z) = B/A. B and A are the coefficients of the z-series in the numerator
% and the denominator.
A = lpc(Frame, p);
frame_size = max(size(Frame));
B = 1;
x = zeros(frame_size,1);
x(1) = 1;
v = filter(B,A,x);               % impulse response of the all-pole vocal tract model
subplot(2,1,1); plot(Frame); xlabel('Time (Sample)'); title('Speech frame');
subplot(2,1,2); plot(v);     xlabel('Time (Sample)'); title('Vocal tract impulse response');
end

function cr_TR

%% [f] = cr_TR(f1,f2)
% TR = Time warping. This function aligns f2 to f1. It resamples f2 such that
% f2 length = f1 length. The function returns f2 at the new sampling rate.
%% ----------------------------------------------------------------------
% Amr M. Gody, Fall 2008
%%
function [f] = cr_TR(f1,f2)
nbIn = nargin;
if nbIn < 2,
    error('Not enough input arguments. TR = Time warping: aligns f2 to f1 by resampling f2 to the length of f1.');
end;
s1 = max(size(f1));
s2 = max(size(f2));
u2 = linspace(1,s2,s2);
u = linspace(1,s2,s1);
f = interp1(u2,f2,u);            % resample f2 onto s1 points
end



8. References
[1] L.R. Rabiner, R.W. Schafer, "Digital Processing of Speech Signals", Prentice-Hall, ISBN: 0-13-213603-1.
[2] Thomas W. Parsons, "Voice and Speech Processing", McGraw-Hill Inc., 1987.
[3] Alessia Paglialonga, "Speech Processing for Cochlear Implants with the Discrete Wavelet Transform: Feasibility Study and Performance Evaluation", Proceedings of the 28th IEEE EMBS Annual International Conference, New York City, USA, Aug 30-Sept 3, 2006.
[4] Mel scale, http://en.wikipedia.org/wiki/Mel_scale


Chapter 5 Speech Recognition Basics

Speech Recognition Basics

7th Edition

2009

Summary In this chapter the basic ideas in speech recognition will be illustrated. Automatic Speech Recognition (ASR) techniques are widely spread, from phone recognition to context understanding. Many applications nowadays benefit from ASR. Speech is considered a new human user interface, just like the mouse and keyboard in modern computer systems. Moreover, speech understanding machines are widely used in systems that require intensive interaction with clients, such as airline ticket reservation and restaurant orders. Speech dialog is replacing the human-human interface. These types of predefined speech dialogs are very stable because the speech stream is guided by a predefined grammar. The idea of distance measures between speech features will be introduced through different types of measures. Time alignment methods such as dynamic time warping will be discussed, along with the different types of template training methods.

Objectives Understanding the idea of speech normalization. Understanding source coding using vector quantization. Understanding clustering methods and how they are used in recognition. Becoming more familiar with using C# and Matlab in speech recognition.


1. Time alignment
Time alignment is the process of normalizing the length of all speech samples of the same spoken word. This alignment should account for speech variability due to different articulations by different speakers, or variability of the same speaker, which affects the way a word is spoken. In this case all speech samples belong to the same word, but they differ in length. In order to perform recognition we should first make a time alignment. The time alignment should align the features along the time of the spoken word. This dynamic process is called dynamic programming or Dynamic Time Warping (DTW). Figure 1 illustrates two samples of the same word. Although the two samples contain the same phone sequence, the duration of each phone is not linearly scaled. This leads us to dynamic programming: we should consider this fact during time alignment. It is not a linear process.

Figure 1 Two speech signals for the same word.

To discuss dynamic programming, let us consider the following problem. A space of N states is given, where each state represents a certain stationary time frame of the speech signal. Assume that a cost function is defined to best align the frame sequence to a reference utterance based on phonetic contents, as illustrated in figure 1. In figure 1 assume that the frame sequence is the top stream and the reference utterance is the bottom stream. Assume that the reference stream is M frames long while our utterance is N frames long. We need to reform the sequence of N frames at the top to best align with the reference word of M frames, based on the phonetic contents. So our target is to reform the sequence to be M frames long, starting at frame number 1 and ending at frame number n.
Table 1: c(i, j), the cost function for a certain phone sequence.

         j = 1    j = 2    j = 3
i = 1      57       54       53
i = 2      58       86        3
i = 3      59       22       90
Table 1 illustrates a cost function for a certain sequence of the three frames {1, 2, 3}. Let us assume that the sequence should be aligned to a reference utterance of 4 frames, i.e. 3 hops. Assume that the first frame is number 2 and that the sequence ends with frame number 3. What is the sequence that best aligns the utterance to the reference 4-frame utterance with minimum cost? To answer this question let us go through the following discussion.


Figure 2: The problem of finding the minimum cost path between point 1 and point i.

Each point in figure 2 represents a time frame from the speech signal. Consider the space of N states indicated in figure 2. It is needed to evaluate the best path between two certain states. For practical situations, the cost function may be the distance between the two vectors:

$$c(i,j) = d(v_{i}, v_{j}) = \sqrt{\sum_{k=1}^{M}\big(v_{i}[k] - v_{j}[k]\big)^{2}} \qquad [1]$$

Equation 1 is the Euclidean distance between two vectors of length M. Let us consider that we are given a cost function that expresses the cost of moving from vector v_i to vector v_j. Now let us get back to our problem. We need to find the optimal path between two given vectors in the space, assuming that the maximum number of allowable hops between any two vectors is M.


Figure 3: Trellis diagram of N speech vectors.

Let us arrange the nodes in a trellis diagram as shown in figure 3. The x direction in the diagram represents steps in time. The y direction represents the available nodes. The arrow between any two nodes represents the hop direction from the first node (the left one) to the second node (the right one). The cost of jumping from v_i to v_j in one time step can be expressed as:

$$\xi_{1}(i,j) = c(i,j) \qquad [2]$$

If we consider the possibility of 2 hops between v_i and v_j, equation 2 should be restructured as follows:

$$\xi_{2}(i,j) = \min_{k}\big[\xi_{1}(i,k) + c(k,j)\big] \qquad [3]$$

Equation 3 evaluates the best path between v_i and v_j in two hops. By following the same recursion we can evaluate ξ₃(i,j), ξ₄(i,j), …, ξ_M(i,j).


Now let us apply what we have obtained to our problem, which is defined by the cost function given in table 1. Figure 4 indicates the method of finding the best path using the cost function of table 1, with 3 steps, start node 2 and end node 3. The square above each circle is the accumulated cost ξ and the number above each arrow is the cost c(i, j). The best path is marked by a star symbol.

ξ₁(2,1) = 58    ξ₂(2,1) = 62    ξ₃(2,1) = 83
ξ₁(2,2) = 86    ξ₂(2,2) = 25    ξ₃(2,2) = 111
ξ₁(2,3) = 3     ξ₂(2,3) = 89    ξ₃(2,3) = 28


Figure 4: Tracing the problem of dynamic programming for N=3 and according to the cost function given by table 1
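The recursion of equations 2 and 3 can be checked numerically with a few lines of Matlab applied to the cost function of table 1 (start node 2, end node 3, three hops). Backtracking through the stored indices recovers the node sequence 2, 3, 2, 3 with total cost 28, matching the trace above.

c = [57 54 53;
     58 86  3;
     59 22 90];                  % c(i,j) from table 1
start = 2; stop = 3; hops = 3;
N = size(c,1);
xi = c(start, :);                % xi_1(start, j) = c(start, j)
pred = zeros(hops, N);
for h = 2:hops
    xi_new = zeros(1, N);
    for j = 1:N
        [xi_new(j), pred(h, j)] = min(xi + c(:, j)');   % xi_h = min_k [xi_{h-1}(k) + c(k, j)]
    end
    xi = xi_new;
end
best_cost = xi(stop)             % evaluates to 28 for table 1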

Exercise 1: Write a C# function that evaluates the best path between two given nodes. Assume that the cost function is given. Assume that the maximum possible number of hops is M.
/// <summary>
/// Get the best path for a certain number of hops. The path is returned in an
/// int array of m+1 elements. The first element is the starting node and the
/// last element is the end node.
/// </summary>
/// <remarks>
/// Chapter 5
/// Amr M. Gody
/// sites.google.com/site/agdomains
/// 2009
/// </remarks>
/// <param name="start">Start node. The first one is node number 0.</param>
/// <param name="end">End node.</param>
/// <returns>
/// 0: if success
/// -1: if false
/// </returns>
public int GetBestPath(int start, int end, int m, ref int[] path, ref int ct)
{
    if (m > S)
        throw new Exception("m should be less than or equal S.");
    try
    {
        if (path.GetLength(0) != m + 1)
            throw new Exception("The path array should be m elements in size.");
        epsai = new int[S, N, N];

        // 1 - Initialization
        path[0] = start;
        int min = epsai[0, start, 0] = cost[start, 0];
        for (int j = 1; j < N; j++)
        {
            epsai[0, start, j] = cost[start, j];
            if (epsai[0, start, j] < min)
            {
                min = epsai[0, start, j];
                path[1] = j;
            }
        }

        // 2 - Recursion
        for (int hops = 1; hops < m - 1; hops++)
        {
            min = epsai[hops, start, 0] = GetMinmum(epsai, hops - 1, start, 0);
            path[hops + 1] = 0;
            for (int n = 1; n < N; n++)
            {
                epsai[hops, start, n] = GetMinmum(epsai, hops - 1, start, n);
                if (epsai[hops, start, n] < min)
                {
                    min = epsai[hops, start, n];
                    path[hops + 1] = n;
                }
            }
        }

        // 3 - Termination
        epsai[m - 1, start, end] = GetMinmum(epsai, m - 2, start, end);
        path[m] = end;
        ct = epsai[m - 1, start, end];
        return 0;
    }
    catch (Exception e)
    {
        throw e;
    }
}

public int GetMinmum(int[, ,] epsai, int hops, int start, int end)
{
    int min_sum = epsai[hops, start, 0] + cost[0, end];
    int sum;
    for (int n = 1; n < N; n++)
    {
        sum = epsai[hops, start, n] + cost[n, end];
        if (sum < min_sum)
        {
            min_sum = sum;
        }
    }
    return min_sum;
}

Now let us get back to the second question. Given a sequence of feature vectors that composes a certain word, and given the feature space, what is this word? This is a speech recognition problem. To answer this question let us reformulate it. Figure 5 introduces the recognition process.

Figure 5: The recognition process using dynamic programming.

The recognition process has 2 main phases. The first phase is the training phase and the second phase is the testing phase. Considering the dynamic programming method, the training phase deals with constructing a prototype sequence for each word considered in the recognition system. The prototype is the best sequence of feature vectors that produces that word. It is the average in sequence length and the centroid in feature space of that word. The process of building a prototype of a certain word is illustrated in figure 6. The process starts with time alignment of the available training samples of that word, and then the average of all aligned samples is taken.

Figure 6: The process of building the prototype of certain word.

The second phase is the testing phase. In this phase, the unknown sequence is tested against all available prototypes. The prototype that gives the minimum distance is taken as the decision. The dynamic programming introduced so far should be modified to consider some practical issues. For example, the speech sequence is a temporal process; the time alignment should therefore preserve the temporal order of the aligned sample. It is not acceptable for a later frame to be aligned before an earlier one in the sequence, as this would destroy the temporal information of the sample being aligned. The following sections provide more information about the temporal information and some other practical issues that should be considered during the time alignment process.


2. Time alignment and normalization
This process of time normalization is very important in pattern recognition. The recognition process will then concern spectral distortion without taking time differences into consideration. Equation 4 explains the linear normalization of two signals during comparison; in this case, signal X is considered as the reference.

$$d(X,Y) = \sum_{i_x=1}^{T_x} d\big(x_{i_x},\, y_{i_y}\big) \qquad [4]$$

where

$$i_y = \frac{T_y}{T_x}\, i_x \qquad [5]$$
In the linear normalization we considered a linear relation that joins the frame indices of both signals, as indicated in equation 5. This linear relationship cannot be kept in practical situations. We should make it more practical by considering a common reference signal of a central common length. For example, assume that we have the following database:

#       Utterance symbol    Duration (frames)
1       Cat                 20
…       …                   …
1000    Cat                 15

The database contains 1000 different utterances of the word "Cat". Let us assume that the average length is 18 (frames). Then we can normalize all utterances to the sample that is closest in length to 18. The process of time normalization is called the time warping process. Figure 7 indicates the time warping of two signals x and y to a common reference signal of length T.
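A tiny Matlab sketch of the linear index mapping of equation 5, with assumed utterance lengths:

Tx = 20; Ty = 15;                    % assumed frame counts of the two utterances
ix = 1:Tx;
iy = max(1, round(ix * Ty / Tx));    % linearly mapped index into signal Y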


i_x is the time index in signal X. For example, if the signal X is warped to a signal of length T, this means that

$$\varphi_x(1) = 1 \qquad [6]$$

$$\varphi_x(T) = T_x \qquad [7]$$

Figure 7 illustrates a nonlinear time warping process. Let us define warping functions φ_x(k) and φ_y(k). These functions relate the common time index k to the time indices of signal X and signal Y respectively: i_x = φ_x(k), i_y = φ_y(k), with φ_x(1) = φ_y(1) = 1.

Figure 7: Warping of two signals x and y to a common time index.

As shown in figure 7, the time index for signal x on the X-axis is calculated based on the warping function φ_x(k), and likewise the time index of signal Y on the Y-axis is calculated using φ_y(k). For example, the points on the figure that warp based on the above warping functions are evaluated as in the following table:

k    i_x = φ_x(k)    i_y = φ_y(k)
1         1               1
2         2               2
3         4               3
4         4               4
5         5               6
6         7               7
7         8               7
8         8               8

According to the above table we can draw i_x against i_y as shown in figure 7.

3. Constraints in dynamic programming

Dynamic programming is used to align two time signals based on feature matching. This is called temporal matching. Suppose that we have two signals that represent the same utterance. The two signals have different temporal properties: they convey the same information but in different time frames (recall figure 1 for more details). Typical warping constraints that are considered necessary and reasonable for time alignment between utterances include the following:
- Endpoint constraints
- Monotonicity conditions
- Local continuity constraints
- Global path constraints
- Slope weighting

An endpoint constraint deals with the end points. It means that the two signals that represent the same utterance should have the same end points; that is, the first and the last frames of both signals will be aligned together. Let us define a function called the warping function. This function gives the corresponding frame number during the warping process.

φ_x(k) is the warping function of signal x. Equation 8 reads as follows: frame number 1 of signal x is warped to index 1 of the common time axis, and the last index T is warped to the last frame T_x.

$$\varphi_x(1) = 1, \qquad \varphi_x(T) = T_x \qquad [8]$$

Back to the endpoint constraints, we can write them for both signals as

$$\varphi_x(1) = 1, \quad \varphi_x(T) = T_x, \qquad \varphi_y(1) = 1, \quad \varphi_y(T) = T_y \qquad [9]$$

Equation 9 reflects that the end points of the signals should be respected during the warping process. It implies that the signals X and Y are warped to a length of T frames.

Monotonicity conditions
Suppose that it is needed to align the two signals on a temporal basis. This is a very important step before making any further recognition process; it is the normalization process. Dynamic programming can be used to make such a time alignment. The method discussed in section 1 is very generic and may not be practical. Let us navigate through an example. Recall figure 7: we need to align signal Y to signal X in a way that ensures minimum temporal distortion.


Figure 8: Trellis diagram. The x axis represents signal X and the y axis represents signal Y. The solid path represents the best path that ensures minimum temporal disturbance.

It is not logical to go backward. If frame 3 of signal y is aligned to frame 2 of signal x, this means that the subsequent frames of signal y should be aligned to frames with a time index higher than or equal to 2 in signal x. Recalling figure 4, the best sequence was 2, 3, 2, 3. This is very strange: frame number 3 is followed by frame number 2 to best align with the reference signal. This is what pure mathematics comes up with when the monotonicity constraint is not applied.


Figure 9: Negative slope caused by dynamic programming without monotonicity constraint.

The monotonicity constraint, as shown in figure 8, implies that any evaluated path will not have a negative slope. The constraint eliminates the possibility of (time-)reverse warping along the time axis, even within a short time interval. Local continuity constraints deal with the maximum increment allowed between frames during the alignment process.

Figure 10: path function.

Let us define the path function. This function describes a path along the trellis diagram. Recalling figure 10, P1 is a path that can be described as:

$$P_{1} = (1,1)(1,0) \qquad [10]$$


Equation 10 describes the path as increments in both directions. Path P1 reads as follows: one increment in the x direction together with one increment in the y direction, then one increment in the x direction and zero increments in the y direction.

Exercise: Consider the path functions in figure 10 and describe the following path:

The solution of this exercise will be as follows


Figure 10 illustrates some popular path functions. Because of the local continuity constraints, certain portions of the (i_x, i_y) plane are excluded from the region the optimal warping path can traverse. To evaluate the global constraints let us define the following slope parameters:

$$Q_{\min} = \min_{l}\left\{\frac{\Delta i_y(l)}{\Delta i_x(l)}\right\} \qquad \text{(the slope of the minimum asymptote)} \qquad [11]$$

$$Q_{\max} = \max_{l}\left\{\frac{\Delta i_y(l)}{\Delta i_x(l)}\right\} \qquad \text{(the slope of the maximum asymptote)} \qquad [12]$$

l is the path index. To calculate the asymptotes of the possible moves, consider equations 11 and 12 and figure 11. The maximum asymptote, considering the point (1, 1) as a beginning, is given by equation 13:

$$i_y(k) = 1 + Q_{\max}\big(i_x(k) - 1\big) \qquad [13]$$

Figure 11: Path description for many patterns

For the path set shown in figure 10, the three local paths have slopes Δi_y/Δi_x of 1/2, 1 and 2 respectively, so that

$$Q_{\min} = \min\left\{\tfrac{1}{2},\,1,\,2\right\} = \tfrac{1}{2} \qquad \text{and} \qquad Q_{\max} = \max\left\{\tfrac{1}{2},\,1,\,2\right\} = 2$$
18


Figure 12: Tracing jumps through certain local path constraint

The minimum asymptote is given by:

$$i_y(k) = 1 + Q_{\min}\big(i_x(k) - 1\big)$$

This makes the combined condition

$$1 + Q_{\min}\big(i_x(k) - 1\big) \;\le\; i_y(k) \;\le\; 1 + Q_{\max}\big(i_x(k) - 1\big) \qquad [14]$$

We should remember that there is another constraint that assumes i_x(T) = T_x and i_y(T) = T_y. This is the endpoint constraint. We should find the effect of this constraint, combined with the path constraint, on the global path constraint. Using the same method as above, but anchored at the end point (T_x, T_y), the upper asymptote is given by:

$$i_y(k) = T_y + Q_{\min}\big(i_x(k) - T_x\big)$$

and the lower asymptote is given by:

$$i_y(k) = T_y + Q_{\max}\big(i_x(k) - T_x\big)$$


Figure 13: Global constraint

$$T_y + Q_{\max}\big(i_x(k) - T_x\big) \;\le\; i_y(k) \;\le\; T_y + Q_{\min}\big(i_x(k) - T_x\big) \qquad [15]$$

Figure 13 illustrates the whole picture, combining the local-path asymptotes with the endpoint constraint. Equations 14 and 15 give the global constraints according to the given local paths. This global constraint reflects another point: equation 15 depends on T_x and T_y. What does this mean? To answer this question let us test the above inequalities in 15. In our case we have Q_max = 2 and Q_min = 0.5. If T_y = 2·T_x, i.e. the ratio of the two lengths equals Q_max, the area enclosed by 14 and 15 collapses into a straight line. This is the maximum possible

allowable difference between the lengths of the two signals to be warped according to the local paths defined above.

4. Vector quantization
Vector quantization is a very efficient source-coding technique. Vector quantization is a procedure that encodes an input vector into an integer (index) that is associated with an entry of a collection (codebook) of reproduction vectors. The reproduction vector chosen is the one that is closest to the input vector in a specified distortion sense. The coding efficiency is obviously achieved in the process of converting the (continuously valued) vector into a compact integer representation, which ranges, for example, from 1 to N, with N being the size of (number of entries in) the codebook. The performance of the vector quantizer, however, depends on whether the set of reproduction vectors, often called code words, is properly chosen such that the distortion is minimized. The block diagram of figure 14 shows a binary-split codebook generation algorithm that produces a good codebook based on a given training data set. Characteristics of the information source that produced the given training data are embedded in the codebook. This is the key to using vector quantization in recognition. The problem ends up with L codebooks, one codebook for each class being recognized, for example a codebook for each word. Then the unknown word is simply evaluated against each codebook and the distortion is calculated. The codebook that gives the minimum distortion is used to make the decision: the information source associated with that codebook determines the recognized word. Figure 15 illustrates the process.
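The following is a compact Matlab sketch of the binary-split codebook training of figure 14. The training matrix X (one feature vector per row), the stopping threshold and the assumption that the codebook size M is a power of two are illustrative choices, not part of the original text.

function C = lbg_sketch(X, M)
eps_split = 0.01;                    % perturbation used when splitting centroids
tol = 1e-3;                          % stop when the distortion stops improving
C = mean(X, 1);                      % start from the centroid of all training data
while size(C, 1) < M
    C = [C*(1+eps_split); C*(1-eps_split)];      % split each centroid into two
    D_prev = inf;
    while true
        % classify: assign each training vector to the nearest centroid
        nC = size(C, 1);
        dist = zeros(size(X,1), nC);
        for k = 1:nC
            diffs = X - repmat(C(k,:), size(X,1), 1);
            dist(:,k) = sum(diffs.^2, 2);
        end
        [dmin, idx] = min(dist, [], 2);
        D_current = mean(dmin);                  % average distortion
        % update: recompute each centroid from the vectors assigned to it
        for k = 1:nC
            if any(idx == k)
                C(k,:) = mean(X(idx == k, :), 1);
            end
        end
        if abs(D_prev - D_current) < tol, break; end
        D_prev = D_current;
    end
end
end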


Figure 14: Codebook creation (binary-split algorithm). Starting with i = 1 and m = 2^i = 2, the centroid C_1 of all training data is found. Then, in a loop: each centroid of {C_1, C_2, …, C_m} is split; the training data are classified into m classes; a new centroid is found for each class; the distortion D_current is evaluated and compared with D_prev, repeating until |D_current − D_prev| < ε. If m < M, set i = i + 1 and m = 2^i and repeat the split; otherwise the codebook is done.


Figure 15: Recognition using vector quantization. The unknown utterance U is evaluated against every codebook, giving the distortions D1(U), D2(U), …, Dn(U); the index of the minimum distortion (Min_Index) selects the recognized word W(min_index).


5. References
[1] L.R. Rabiner, R.W. Schafer, "Digital Processing of Speech Signals", Prentice-Hall, ISBN: 0-13-213603-1.
[2] Thomas W. Parsons, "Voice and Speech Processing", McGraw-Hill Inc., 1987.
[3] Alessia Paglialonga, "Speech Processing for Cochlear Implants with the Discrete Wavelet Transform: Feasibility Study and Performance Evaluation", Proceedings of the 28th IEEE EMBS Annual International Conference, New York City, USA, Aug 30-Sept 3, 2006.
[4] Mel scale, http://en.wikipedia.org/wiki/Mel_scale


Chapter 6 Hidden Markov Model

Hidden Markov Model

Version 7

2010

Summary This chapter introduces speech recognition using a very powerful statistical model called the Hidden Markov Model (HMM). The concept and methodology of HMM will be discussed in this chapter. HMM is the most popular practical tool used in Automatic Speech Recognition (ASR) applications. HMM captures the phonetic contents as well as the temporal properties of the phonemes inside the phrase or the word; time is not a direct factor in the recognition process. In this chapter the theory of HMM will be discussed, along with how HMM can be used to model discrete and continuous word recognition and how HMM can implement a Gaussian mixture probability distribution to model multimodal phoneme properties.

Objectives Understanding HMM. Understanding the speech signal as a statistical process. Understanding Gaussian mixture modeling for multimodal phoneme properties. Practice using Matlab and C#.


1. HMM
To understand HMM, let us start with the discrete Markov process. Consider the following example: a 3-state sound generator. This is a music box that generates only 3 sounds. When the user presses the generate button, the music box generates one of the 3 sounds at random. The outputs are the sound itself and the associated light indicator, as shown in figure 1.

Figure 1: 3-state sound generator.

Let us go through the process by considering the state diagram that represents the above stochastic process. The states are:
1- Sound 1; green color.
2- Sound 2; yellow color.
3- Sound 3; red color.


The process is: the user presses the button to issue a new sound, and the observer registers both the sound ID and the color of the light indicator. Assume that the state diagram in figure 2 is obtained according to this experiment:

Figure 2: State diagram of the music machine.

Let us tabulate the information from figure 2.

$$T = \begin{bmatrix} 0.3 & 0.3 & 0.4 \\ 0.5 & 0.2 & 0.3 \\ 0.2 & 0.4 & 0.4 \end{bmatrix} \qquad [1]$$


Equation 1 is the transition matrix. It gives the transition probability between any two states in the state diagram. The sum of any row should evaluate to 1. This is logical, as the sum covers all possible transition probabilities. Consider also the matrix in equation 2:

$$\pi = \begin{bmatrix} 0.3 \\ 0.2 \\ 0.5 \end{bmatrix} \qquad [2]$$

The matrix in 2 gives the probability of being in a certain state at the first press of the button. It is called the initial state probability matrix π. For example, the probability of being in state 1 according to equation 2 is p = 0.3. Both T and π construct the model that represents the experiment. We can give the model a certain symbol:

$$\lambda = \{T, \pi\} \qquad [3]$$

Now, after defining the system model parameters, we can answer the following questions.

Q1) What is the probability of having the following sequence of observations, given the system model λ?
{Red, Red, Red, Green, Yellow, Red, Green, Yellow}

A1) It is preferable to state the question in probabilistic terms. The question is: calculate P(O|λ), where O is the observation sequence O = {o₁, o₂, o₃, …, o_t}, an observation of t symbols. In our example o₁ = Red, o₂ = Red, …, o₈ = Yellow. Figure 3 gives an insight into the process. The problem now turns into a simple expression: it is the multiplication of all the terms:


$$P(O|\lambda) = P(3)\,P(3|3)^{2}\,P(1|3)^{2}\,P(2|1)^{2}\,P(3|2) = \pi_{3}\,T_{33}^{2}\,T_{31}^{2}\,T_{12}^{2}\,T_{23} = 0.5\times0.4^{2}\times0.2^{2}\times0.3^{2}\times0.3 = 8.6\times10^{-5}$$

Figure 3: Probabilistic process of the music box problem
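The computation in A1 can be verified with a few Matlab lines using the matrices of equations 1 and 2 (Green = 1, Yellow = 2, Red = 3):

T   = [0.3 0.3 0.4; 0.5 0.2 0.3; 0.2 0.4 0.4];
pi0 = [0.3; 0.2; 0.5];
O   = [3 3 3 1 2 3 1 2];             % Red, Red, Red, Green, Yellow, Red, Green, Yellow
p   = pi0(O(1));
for t = 2:length(O)
    p = p * T(O(t-1), O(t));         % multiply by P(o_t | o_{t-1})
end
disp(p)                              % about 8.6e-5, as computed above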

Q2) Given that the system is in state 2, what is the probability to stay in state 2 for 9 consecutive turns? A2) This is a probability of Observation O given and initial state 1 = = {1 , 2 , 3 , 4 , , }

(|, i ) is our target. The difference off the first case is that the first state is given in the question itself. It is stated clearly in the question. Where 1 = 2 = = 9 = 10 =


P(O|λ, q_1 = 2) = P(O, q_1 = 2 | λ) / P(q_1 = 2)
               = [π_2 · (a_22)⁸ · (1 − a_22)] / π_2
               = (a_22)⁸ · (1 − a_22)
               = 0.2⁸ × (1 − 0.2)
               = 2.048 × 10⁻⁶

We can generalize the result of Q2. The probability of staying in state i for d turns is

p_i(d) = (a_ii)^(d−1) · (1 − a_ii)        [4]

Using equation 4 we can obtain the average number of turns spent in a certain state from the following analysis:

d     p_i(d)      Comments
1     p_i(1)      In state i for 1 turn
2     p_i(2)      In state i for 2 turns
...   ...         ...
n     p_i(n)      In state i for n turns

d̄_i = Σ_{d=1}^{∞} d · p_i(d) = Σ_{d=1}^{∞} d · (a_ii)^(d−1) · (1 − a_ii) = 1 / (1 − a_ii)
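As a quick check of equation 4, consider state 2 of the music box, where a_22 = 0.2: the probability of staying exactly 3 turns is p_2(3) = 0.2² × 0.8 = 0.032, and the average number of turns spent in state 2 is 1/(1 − 0.2) = 1.25.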

Now let us change the experiment a little bit. In the above experiment the observations are the same as the states. What will happen if the states become hidden? In other words, the output or the observations are not the states themselves; rather they are events emitted by the different states. There are 3 music machines exactly the same as the one given in figure 1. A man goes into the room which contains the 3 machines and presses a button on any one of the 3 machines at random. The observer resides in another room. He receives the observation by asking the man who performs the experiment for the color; he has no information about which machine was utilized to generate that color. Now the output is the colors but the state represents the machine. The machines are hidden. Figure 4 gives a representation of the experiment. The actor implements the experiment in the room then brings the


result to the observer. The observer cannot see which machine is used to generate the sound. Let us explore the information in this experiment:
1- The observations are not the states.
2- Each state represents a machine, which is hidden in the room.
3- The output at each state is discrete; there are only 3 symbols {Green, Yellow and Red}.
4- The model includes a new entity that represents the symbol emitting probability at each state. This is the matrix B:

B = | b_11  b_12  b_13 |
    | b_21  b_22  b_23 |
    | b_31  b_32  b_33 |

Figure 4: Hidden Music machines are used to generate the observations.




Each column in the B matrix represents the symbol probabilities in the corresponding state. For example, the probability of emitting Red in state 3 is the entry in the Red row of column 3, while the probability of emitting Green in state 2 is the entry in the Green row of column 2.
HMM model

π: the matrix that gives the probability of being in a certain state at time t = 1. For our example a state represents a machine.
T: the matrix that gives the state transition probabilities. It is an N × N matrix, where N is the number of states. For our case N = 3.
B: the matrix that gives the symbol probabilities in each state. It is an M × N matrix, where M is the number of symbols and N is the number of states. For our case M = 3 and N = 3.

Now the model is

λ = {T, B, π}

Now let us consider the same state diagram in figure 2 and the same model, with B as:

B = | 0.3  0.4  0.2 |
    | 0.2  0.3  0.4 |
    | 0.5  0.3  0.4 |

Let us write B in probabilistic form: b_j(k) = P(o_t = k | q_t = j). For example b_2(Red) = 0.3, b_1(Yellow) = 0.2 and b_3(Green) = 0.2. Now let us ask the following questions: Q1) Given the model λ, what is the probability of the following observation? {Red, Red, Red, Green, Yellow, Red, Green, Yellow} Q2) Given the model λ, what is the state sequence associated with the observation in Q1?


Q3) Given a training set, what is the model that best describes the training set? In other words, how can you adapt the model parameters to best fit the given training set?
Speech production model

Let us recall the system in figure 4 and relate it to a speech production model. The closed room is the brain, the actor is the speech production mechanism, and the observer is the listener. In figure 4 the system has only 3 states. This is equivalent to a brain that has only 3 different phones with which to express any word, i.e. a 3-phone language. It is a hypothetical language used to illustrate how a language model can be expressed with an HMM. Also, there are 3 different sounds for each state. This is equivalent to the variations in the speech properties that express the phone; it models the practical situation of having more than one pronunciation for each single phone. Figure 2 expresses the grammar that relates the phones to one another. One may think that, if there are M words in the language, it is better to build an HMM for each word; then, given an unknown observation, we can score it against each model, and the decision will be for the maximum score. Yes, this is true.

Now let us go back to our questions:


Evaluating Observation probability

Given the model λ, what is the probability of the following observation?

To simplify notation, let us enumerate the output as Green = 1, Yellow = 2 and Red = 3. This should not be confused with the state numbers. Hence: O = {3, 3, 3, 1, 2, 3, 1, 2}



To evaluate the probability of that sequence given the model λ, let us go through the example. The observation starts with Red, or symbol 3. According to the model there are 3 possibilities: 1- being Red from machine 1 with probability 0.5; 2- or being Red from machine 2 with probability 0.3; 3- or being Red from machine 3 with probability 0.4. So the probability of being Red at t = 1 is P(o_1 = Red) = 0.3 × 0.5 + 0.2 × 0.3 + 0.5 × 0.4 = 0.41.

Let us recall the model parameters:

T = | 0.3  0.3  0.4 |      B = | 0.3  0.4  0.2 |      π = | 0.3 |
    | 0.5  0.2  0.3 |          | 0.2  0.3  0.4 |          | 0.2 |
    | 0.2  0.4  0.4 |          | 0.5  0.3  0.4 |          | 0.5 |

Forward probability

Let us define a new term, the forward probability:

α_t(i) = P(o_1, o_2, ..., o_t, q_t = i | λ)

To get the probability of an observation of length T, we calculate the forward probability for all states at time T, then sum them all. For our example T = 8. Hence:

P(O|λ) = Σ_{i=1}^{3} α_8(i)        [5]

To calculate α_8(i), consider figure 5. The forward probability is calculated by induction. We start by calculating α_1(i) for i = 1 ... 3:

α_1(1) = P(o_1, q_1 = 1 | λ) = π_1 · b_1(o_1) = 0.3 × 0.5 = 0.15
α_1(2) = π_2 · b_2(o_1) = 0.2 × 0.3 = 0.06
α_1(3) = π_3 · b_3(o_1) = 0.5 × 0.4 = 0.20

From figure 5, to calculate α_2(1):

α_2(1) = P(o_1, o_2, q_2 = 1 | λ) = [ Σ_{i=1}^{3} α_1(i) · a_i1 ] · b_1(o_2)
       = (0.20 × 0.2 + 0.06 × 0.5 + 0.15 × 0.3) × 0.5 = 0.0575        [6]

The same procedure is followed to calculate α_2(2) and α_2(3). Hence:

α_2(1) = 0.0575,   α_2(2) = 0.0411,   α_2(3) = 0.0632

We repeat the same process until it ends with α_8(1), α_8(2) and α_8(3). At that point we can calculate P(O|λ) as in equation 5.


Figure 5: Calculating P(O|λ)
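Before attempting the exercises below, it may help to see the forward induction in code. The following is only a minimal C# sketch, not part of the original module listings; it follows the same array layout as the Viterbi program later in this chapter (B indexed as B[symbol, state], T as T[from, to]), and the method name ForwardProbability is illustrative.

// Minimal sketch of the forward algorithm (equations 5 and 6).
// pi[i]  : initial state probabilities
// T[i,j] : transition probability from state i to state j
// B[k,j] : probability of emitting symbol k in state j
// o[t]   : observed symbol indices (zero-based)
static double ForwardProbability(double[] pi, double[,] T, double[,] B, int[] o)
{
    int N = pi.Length;                          // number of states
    double[,] alpha = new double[o.Length, N];

    // Initialization: alpha_1(i) = pi_i * b_i(o_1)
    for (int i = 0; i < N; i++)
        alpha[0, i] = pi[i] * B[o[0], i];

    // Induction: alpha_{t+1}(j) = [sum_i alpha_t(i) * a_ij] * b_j(o_{t+1})
    for (int t = 1; t < o.Length; t++)
        for (int j = 0; j < N; j++)
        {
            double sum = 0.0;
            for (int i = 0; i < N; i++)
                sum += alpha[t - 1, i] * T[i, j];
            alpha[t, j] = sum * B[o[t], j];
        }

    // Termination: P(O | lambda) = sum_i alpha_T(i)
    double p = 0.0;
    for (int i = 0; i < N; i++)
        p += alpha[o.Length - 1, i];
    return p;
}

With the model of this chapter and the observation {3, 3, 3, 1, 2, 3, 1, 2}, the first induction step of this sketch reproduces the values α_2(1) = 0.0575, α_2(2) = 0.0411 and α_2(3) = 0.0632 computed above.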


Exercise 1
Write a C# function that calculates P(O|λ). The function takes two arguments: O, an object that describes the observation sequence, and L, an object that describes the HMM model. The function returns the score of the observation O against the model L.

Exercise 2
Write a C# function that evaluates the class of a given observation. Assume that each class is described by an HMM model and that the function of exercise 1 is available. The function takes two arguments: O, an object that describes the observation sequence, and a list whose elements are HMM models as in exercise 1. The function returns the index of the winning class in the given list.

Now let us answer the second question


Estimating states sequence

Given the model λ, what is the state sequence associated with the observation in Q1? Let us consider the same observation:

O = {3, 3, 3, 1, 2, 3, 1, 2} = {Red, Red, Red, Green, Yellow, Red, Green, Yellow}        [7]

To answer this question, we seek the best path in figure 5. Let us think about the difference between questions 1 and 2. In question 1 we are trying to find the probability of a certain observation against the given model λ; recalling figure 5, at each time index we evaluate all possible ways the given observation could be produced. But in our case now, we try to find the

most probable path that produces the given observation. Let us implement it on figure 5 given the observation in (7). Let us define the variables δ_t(i) and ψ_t(i) as follows:

The Viterbi algorithm:

All possible state sequences that end in state i at time index t are enumerated, and the best probability over those sequences for the given observation prefix is assigned to δ_t(i):

δ_t(i) = max over q_1, ..., q_{t−1} of P(q_1, q_2, q_3, ..., q_t = i, o_1, o_2, o_3, ..., o_t | λ)        [8]

Hence, by induction (N is the number of states):

δ_{t+1}(j) = [ max_{1≤i≤N} δ_t(i) · a_ij ] · b_j(o_{t+1})        [9]

and

ψ_{t+1}(j) = arg max_{1≤i≤N} [ δ_t(i) · a_ij ]        [10]

The last state of the best sequence is then obtained from δ_T(i) as

q_T = arg max_{1≤i≤N} δ_T(i)        [11]

Then back-propagate over t to get all q:

q_t = ψ_{t+1}(q_{t+1}),   t = T−1, T−2, ..., 1        [12]

Let us apply equations 9, 10, 11 and 12 to our problem:


Initialization (t = 1, o_1 = Red):

δ_1(1) = π_1 · b_1(Red) = 0.3 × 0.5 = 0.15
δ_1(2) = π_2 · b_2(Red) = 0.2 × 0.3 = 0.06
δ_1(3) = π_3 · b_3(Red) = 0.5 × 0.4 = 0.2

Induction (t = 2, o_2 = Red):

δ_2(1) = max{δ_1(1)·a_11, δ_1(2)·a_21, δ_1(3)·a_31} · b_1(Red)
       = max{0.15×0.3, 0.06×0.5, 0.2×0.2} · 0.5 = 0.0450 × 0.5 = 0.0225,   ψ_2(1) = 1
δ_2(2) = max{0.15×0.3, 0.06×0.2, 0.2×0.4} · 0.3 = 0.08 × 0.3 = 0.024,      ψ_2(2) = 3
δ_2(3) = max{0.15×0.4, 0.06×0.3, 0.2×0.4} · 0.4 = 0.08 × 0.4 = 0.032,      ψ_2(3) = 3

Repeating the induction up to t = 8 gives the following values:

t   δ_t(1)     δ_t(2)     δ_t(3)     ψ_t(1)  ψ_t(2)  ψ_t(3)   O_t      q_t
1   0.15       0.06       0.2        0       0       0        Red      3
2   0.0225     0.024      0.032      1       3       3        Red      3
3   0.006      0.00384    0.00512    2       3       3        Red      3
4   0.000576   0.000819   0.00048    2       3       1        Green    2
5   8.19E-05   5.76E-05   9.83E-05   2       3       2        Yellow   3
6   1.44E-05   1.18E-05   1.57E-05   2       3       3        Red      3
7   1.77E-06   2.52E-06   1.26E-06   2       3       3        Green    2
8   2.52E-07   1.59E-07   3.02E-07   2       1       2        Yellow   3

The last column is obtained by back-tracking: the maximum of δ_8(i) is 3.02E-07 at state 3, so q_8 = 3, and q_t = ψ_{t+1}(q_{t+1}) for t = 7, 6, ..., 1. The best state sequence is therefore {3, 3, 3, 2, 3, 3, 2, 3}.

Figure 6: The best path sequence.


Code

C# Module that calculates state sequence based on certain model


using System;
using System.Collections.Generic;
using System.Text;

namespace sq
{
    class Program
    {
        static void Main(string[] args)
        {
            ConfigReader cr = new ConfigReader();
            double[,] B;
            double[,] T;
            double[] pi;
            string[] symbols;
            int[] o;
            int StateCount, SymbolCount, ObservationLength;
            double max;
            int arg_max;

            // Reading Observation
            symbols = cr.Symbols;
            o = cr.o;

            // Reading system Model
            B = cr.B;
            T = cr.T;
            pi = cr.PI;
            StateCount = T.GetLength(0);
            SymbolCount = B.GetLength(0);
            ObservationLength = o.GetLength(0);

            // Calculating best sequence using the Viterbi algorithm
            double[,] dlta = new double[ObservationLength, StateCount];
            int[,] epsi = new int[ObservationLength, StateCount];

            // Initialization
            for (int i = 0; i < StateCount; i++)
            {
                epsi[0, i] = 0;
                dlta[0, i] = pi[i] * B[o[0], i];
            }

            // Induction
            for (int t = 1; t < ObservationLength; t++)
            {
                for (int i = 0; i < StateCount; i++)
                {
                    max = dlta[t - 1, 0] * T[0, i];
                    arg_max = 0;
                    for (int j = 1; j < StateCount; j++)
                    {
                        double p = dlta[t - 1, j] * T[j, i];
                        if (p > max)
                        {
                            max = p;
                            arg_max = j;
                        }
                    }
                    dlta[t, i] = max * B[o[t], i];
                    epsi[t, i] = arg_max;
                }
            }


//-- Termination max = dlta [ObservationLength -1, 0]; arg_max = 0; for(int i=1;i<StateCount ;i++) { if(dlta [ObservationLength -1,i] > max) { max = dlta [ObservationLength -1,i]; arg_max = i; } } // -- State Sequence int[] q = new int[ObservationLength]; q[ObservationLength - 1] = arg_max; for(int t = ObservationLength -2;t>=0;t--) { q[t] = epsi[t+1, q[t + 1]]; } // Storing the results into a file System.IO.StreamWriter sr = new System.IO.StreamWriter("sq_results.txt", true); sr.WriteLine(); sr.WriteLine("//////////////////////////////"); sr.WriteLine(DateTime.Now.ToString()); sr.WriteLine("----------------------------"); sr.WriteLine("State Sequence finder module"); sr.WriteLine("-----------------------------"); sr.WriteLine("HMM parameters"); sr.WriteLine("[PI]"); for (int i = 0; i < StateCount; i++) sr.WriteLine(pi[i].ToString()); sr.WriteLine(); sr.WriteLine("[T]"); for (int i = 0; i < StateCount; i++) { for (int j = 0; j < StateCount; j++) sr.Write(T[i, j].ToString() + "\t"); sr.WriteLine(); } sr.WriteLine(); sr.WriteLine("[B]"); for (int i = 0; i < SymbolCount; i++) { for (int j = 0; j < StateCount; j++) sr.Write(B[i, j].ToString() + "\t"); sr.WriteLine(); } sr.WriteLine("--------------------------------"); sr.Write("t\t"); for(int i=0;i<StateCount ;i++) sr.Write ("dlta"+string .Format ("_{0}\t\t",i+1)); for (int i = 0; i < StateCount; i++) sr.Write("Epsi" + string.Format("_{0}\t\t", i + 1)); sr.Write("O\tq"); sr.WriteLine(); sr.WriteLine("------------------------------------------------------------------------------"); for (int t = 0; t < ObservationLength ; t++) { sr.Write (string .Format ("{0}\t",t+1)); for(int i=0;i<StateCount ;i++) sr.Write (string .Format ("{0}\t\t",System .Math .Round ( dlta [t,i],5))); for (int i = 0; i < StateCount; i++) sr.Write(string.Format("{0}\t\t", epsi [t,i])); sr.Write(string.Format ("{0}\t{1}",symbols [o[t]],q[t]+1)); sr.WriteLine();


} sr.WriteLine("------------------------------------------------------------------------------"); sr.Close(); } } }

using System; using System.Collections.Generic; using System.Text; namespace sq { public class ConfigReader { private System.Data.DataSet m_objDataSet; private int m_nNumberOfStates; private int m_nNumberOfSymbolsPerStates; private int m_nLength; /// <summary> /// Number of symbols being recognized /// </summary> private int m_nSymbolsCount; public ConfigReader() { try { string file = Environment.CurrentDirectory + "\\sq.xml"; if (!System.IO.File.Exists(file)) throw new Exception("Configuration file is not exist"); m_objDataSet = new System.Data.DataSet(); m_objDataSet.ReadXml(file); m_nNumberOfStates = m_objDataSet.Tables["pi"].Rows.Count; m_nNumberOfSymbolsPerStates = m_objDataSet.Tables["B"].Rows.Count; m_nLength = m_objDataSet.Tables["o"].Rows.Count; m_nSymbolsCount = m_objDataSet.Tables["symbol"].Rows.Count; } catch (Exception e) { throw e; } } public double[,] B { get { double[,] temp = new double[m_nNumberOfSymbolsPerStates , m_nNumberOfStates ]; for (int i = 0; i < temp.GetLength(0); i++) for (int j = 0; j < temp.GetLength(1); j++) temp[i,j] =Convert .ToDouble ( m_objDataSet.Tables["B"].Rows[i][j]); return temp; } set { } } public double[] PI { get {


double [] pi = new double [m_nNumberOfStates ]; for (int i = 0; i < m_nNumberOfStates; i++) pi[i] = Convert .ToDouble ( m_objDataSet.Tables["PI"].Rows[i][0]); return pi; } set { } } /// <summary> /// Transition Matrix /// </summary> public double[,] T { get { double[,] temp = new double[m_nNumberOfStates, m_nNumberOfStates]; for (int i = 0; i < temp.GetLength(0); i++) for (int j = 0; j < temp.GetLength(1); j++) temp[i, j] = Convert.ToDouble(m_objDataSet.Tables["T"].Rows[i][j]); return temp; } set { } } /// <summary> /// Number of states /// </summary> public int n { get { return m_nNumberOfStates ; } set { } } /// <summary> /// Number of symbols per state /// </summary> public int m { get { return m_nNumberOfSymbolsPerStates ; } set { } } /// <summary> /// Observation /// </summary> public int[] o { get { int [] temp = new int [m_nLength ]; for (int i = 0; i < m_nLength ; i++) temp [i] = Convert.ToInt32 (GetIndex ( m_objDataSet.Tables["o"].Rows[i][0].ToString ()));


return temp ; } set { } } /// <summary> /// Observation length /// </summary> public int Length { get { return m_nLength; } set { } } /// <summary> /// Symbols array. /// </summary> public string[] Symbols { get { string [] temp = new string [m_nSymbolsCount ]; for (int i = 0; i < m_nSymbolsCount ; i++) temp[i] = Convert.ToString(m_objDataSet.Tables["symbol"].Rows[i][0]); return temp; } set { } } public int GetIndex(string m) { for (int i = 0; i < m_nSymbolsCount; i++) if (m.CompareTo(m_objDataSet.Tables["symbol"].Rows[i][0].ToString()) == 0) return i; return -1; } } }


Building speech recognizer using HTK

2. Case Study

The problem of speech recognition is illustrated here through building a complete speech recognition system. HTK¹ will be utilized to build and evaluate the system.

a. Problem Definition

An Arabic dialer system is to be built using HTK. There are two methods of dialing: 1- An utterance like "Dial number1, number2, number3 ... numberN", spoken as continuous speech. 2- Recalling the name from the phone book.

b. Procedure

Step 1 is to build the dictionary. The dictionary illustrates how each word is spoken in terms of the basic speech units of the target language; in our case it gives the utterance in Arabic phonemes. For example, the word:

In Arabic    In English    Pronunciation
   —         one           w a2 ~h i d

All words used in the system should have at least one entry in the dictionary. Step 2 is to build the grammar. The grammar is the guidance that will be followed to detect the tokens; the tokens are the words to be recognized or detected.


¹ Hidden Markov Toolkit. HTK consists of a set of library modules and tools available in C source form. The tools provide sophisticated facilities for speech analysis, HMM training, testing and results analysis. The software supports HMMs using both continuous density mixture Gaussians and discrete distributions and can be used to build complex HMM systems. For more information see http://htk.eng.cam.ac.uk/


Step 3 is data preparation for training and testing purposes. This is a very important step: the data should provide a sufficient number of samples to train all possible phone combinations for triphone recognition. The database will be transcribed and annotated using SFS². Step 4 is to extract the features from the training data. This is a key step for successful recognition; selecting good, highly discriminating features is very important. In this project Mel Cepstrum features will be used. Step 5 is to build and initialize the basic HMM models for the monophone recognition process. The available training database and the pronunciation dictionary will be used to initialize the basic models. Step 6 is updating the silence and pause models. The silence model should allow for long durations of silence, which is not the case for a plain monophone model. Step 7 is modifying the models to consider triphones instead of single isolated monophones. This is a more practical situation that better fits the transition periods between different phones. Step 8 is system evaluation. In this step the testing database will be used to evaluate system performance.

c. Dictionary

This is the first step in building the recognition system. In this phase we should account for all words that will be used in the process and store them all in a text file. This file will be used later on to locate the proper pronunciation for each spoken word. The file name "dict" is chosen for this example to store the dictionary words. The phone symbols must be kept consistent throughout all the subsequent processes in the recognizer. The alphabet used in this tutorial is illustrated in table 1.
² Speech Filing System (SFS). It performs standard operations such as acquisition, replay, display and labelling, spectrographic and formant analysis and fundamental frequency estimation. It comes with a large body of ready-made tools for signal processing, synthesis and recognition, as well as support for your own software development. http://www.phon.ucl.ac.uk/resource/sfs/


Table 1: Arabic phone alphabet symbols (each phone symbol corresponds to one Arabic character; /sp/ and /sil/ denote short pause and silence):

/@/  /b/  /t/  /t_h/  /d_j/  /~h/  /x/  /d/  /~z/  /r/  /z/  /s/  /s_h/  /Sa/  /Da/  /Ta/  /T_Ha/  /@~/  /g_h/  /f/  /q/  /k/  /l/  /m/  /n/  /h/  /w/  /y/  /a/  /a2/  /u/  /u2/  /i/  /i2/  /sp/  /sil/

Code

Below is the list of "dict" file contents.


dial        @ u T l u b
zero        Sa i f r
one         w a2 ~h i d
two         @ i t n i2 n2
three       t a l a2 t a h
four        @ a r b a ~@ a h
five        x a m s a h
six         s i t a h
seven       s a b ~@ a h
eight       t a m a n i a h
nine        t i s ~@ a h
Amr         ~@ a m r
Osama       @ o s a m a h
Salwa       s a l w a h
SENT-END    sil
SENT-START  sil
.           sil


Note

The words in the dictionary are written in English letters while the pronunciations are in Arabic phone symbols. This should not confuse the reader: the system responds to the Arabic pronunciations, and the words are simply the text that is output when a pronunciation is identified. In our case the system outputs the text in English letters when it detects the Arabic pronunciation associated with it according to the dictionary. For example, when the Arabic pronunciation "Sa i f r" is detected, the system will output the text "Zero" to announce this recognized pronunciation.

d. Grammar

I consider this step as the system outline design. The grammar is the guidance for what the speaker should say to provide information to the system. It is just like designing a dialog screen in a software application to pull information from the user. In our case we are targeting a dialer application, so we need to design some acceptable dialogues to catch the user input. The dialog contains some tags to be used for catching the required tokens; the tags act just like the labels in a dialog box that guide the user to the information he should provide in the associated area. Figure 7 illustrates the analogy between a software application dialog and a speech dialog. As shown in figure 7, there are two ways of dialling the number: either by selecting the name from the phone book list, or by keying the number on the push buttons and then pressing the button that places the call. In this case the prompts are the text written to guide the user. In this dialog we have two prompts: the first one is "Select from phone book", and the second one is the label on that button. Let us go back to the speech dialog that we need to design to perform the same function as the one provided in figure 7. We have the following dialogs:


Figure 7: Sample GUI illustrating different tags and tokens in dialler application.

Dialog 1: Command followed by the numbers to be dialled.

In this case the Tag is the command word ("dial") and the Token is the sequence of numbers to be dialled.

Dialog 2: Just pronounce the name directly.

In this case the Tag = NONE and the Token = Name.

In this case it is assumed that the system will retrieve the number later on from a database that contains the number for this phone book entry. To formulate this grammar using HTK, we need to write it in a text file. Figure 8 illustrates the word network that expresses the grammar of this dialer system.

Figure 8: Word network that describes the possible grammar for the dialer recognizer.
Code

Below is the list of "gram" file contents. $digit = zero | one | two | three | four | five | six | seven | eight | nine; $name = Amr | Osama | Salwa; ( SENT-START ( dial <$digit> | $name) SENT-END ) The above text will be stored into a file. You may choose any name for this text file. The file name "gram" is chosen. Then the following command will be invoked to create the word net file:

Command line

HParse gram wdnet

The HParse command will parse the file "gram" and generate the file "wdnet".
Code

Below is the list of "wdnet" file contents.


VERSION=1.0
N=20 L=41
I=0 W=SENT-END
I=1 W=Salwa
I=2 W=!NULL
I=3 W=Osama
I=4 W=Amr
I=5 W=nine
I=6 W=!NULL
I=7 W=eight
I=8 W=seven
I=9 W=six
I=10 W=five
I=11 W=four
I=12 W=three
I=13 W=two
I=14 W=one
I=15 W=zero
I=16 W=dial
I=17 W=SENT-START
I=18 W=!NULL
I=19 W=!NULL
J=0 S=2 E=0
J=1 S=6 E=0
J=2 S=17 E=1
J=3 S=1 E=2
J=4 S=3 E=2
J=5 S=4 E=2
J=6 S=17 E=3
J=7 S=17 E=4
J=8 S=6 E=5
J=9 S=16 E=5
J=10 S=5 E=6
J=11 S=7 E=6
J=12 S=8 E=6
J=13 S=9 E=6
J=14 S=10 E=6
J=15 S=11 E=6
J=16 S=12 E=6
J=17 S=13 E=6
J=18 S=14 E=6
J=19 S=15 E=6
J=20 S=6 E=7
J=21 S=16 E=7
J=22 S=6 E=8
J=23 S=16 E=8
J=24 S=6 E=9
J=25 S=16 E=9
J=26 S=6 E=10
J=27 S=16 E=10
J=28 S=6 E=11
J=29 S=16 E=11
J=30 S=6 E=12
J=31 S=16 E=12
J=32 S=6 E=13
J=33 S=16 E=13
J=34 S=6 E=14
J=35 S=16 E=14
J=36 S=6 E=15
J=37 S=16 E=15
J=38 S=17 E=16
J=39 S=19 E=17
J=40 S=0 E=18

This is the word network file in lattice format. The file indicates that there are 20 nodes (N=20) and 41 links (L=41). The first node (I=0) represents the word "SENT-END" (W=SENT-END). Figure 9 illustrates part of the word network associated with our grammar: the words are enclosed by oval shapes, the word ID is attached to each oval and enclosed by a circle, and the links are labelled with the joint numbers listed in the lattice file.
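For example, link J=9 with S=16 and E=5 connects node 16 (the word "dial") to node 5 (the word "nine"), and link J=2 with S=17 and E=1 connects SENT-START (node 17) to the name Salwa (node 1); taken together, the links realize the grammar drawn in figure 8.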


Figure 9: Part of the word network expressed by the lattice format file "wdnet"

e. Feature Extraction

The database is prepared using the SFS program. Database preparation means: 1- Recording: the data is recorded and stored in a suitable audio file format, for example WAV files. 2- Annotation: the recorded samples are annotated and transcribed. This allows the recognizer to find the suitable audio segments to train the models. SFS may be used for both of the above tasks; it is an all-in-one package. We can later export the annotation file and the WAV file from the created SFS file.


Command line

sfs2wav -o wavfile sfsFile

The command line sfs2wav is used to export the WAV file from the container sfsFile. wavfile and sfsFile are file names; they may be replaced by your own file names. The annotation file is exported using the following SFS command line.
Command line

anlist -h -o labFile sfsFile

You should repeat the mentioned two command lines on all database files to export the WAV and the associated annotation file in HTK format. The following is part of C# code that illustrates this process.
Code

string path = @"C:\Database\TrainingSet";
string[] fillist = System.IO.Directory.GetFiles(path, "*.sfs");
foreach (string file in fillist)
{
    string sfsFile = file;
    string labFile = file.Split('.')[0] + ".lab";
    string wavfile = file.Split('.')[0] + ".wav";
    string cmnd = "sfs2wav -o " + wavfile + " " + sfsFile;
    string res = Exec(cmnd);
    cmnd = "anlist -h -o " + labFile + " " + sfsFile;
    res = Exec(cmnd);
}

Code

The code below is for the function Exec. This function is used to run a command line from within a C# program.

// Requires: using System.Diagnostics; (for the Process class)
string Exec(string CommandLine)
{
    string buffer;
    string[] splitters = { " " };
    string cmd = CommandLine.Split(splitters, StringSplitOptions.RemoveEmptyEntries)[0];
    string args = CommandLine.Substring(cmd.Length + 1);
    Process a = new Process();
    a.StartInfo.FileName = cmd;
    a.StartInfo.Arguments = args;
    a.StartInfo.RedirectStandardOutput = true;
    a.StartInfo.UseShellExecute = false;
    a.StartInfo.WindowStyle = ProcessWindowStyle.Hidden;
    a.Start();
    buffer = a.StandardOutput.ReadToEnd();
    a.WaitForExit();
    return buffer;
}

Now we should have all training database in the suitable format for further processing using HTK tool set. Each file is stored in a standard WAV file and has a label file in standard HTK format.
Code

HTK formatted label file is shown below. The numbers are scaled in 100(ns) per unit. For example 462100 in the file below is 0.0462 (sec)
462100 1270800 w
1270800 3442800 a2
3442800 4829200 ~h
4829200 5961500 i
5961500 7297000 d

Before starting to use any HTK tools we should prepare the configuration that will be used throughout this tutorial. HTK accepts the configuration as a simple text file. Let us create a file named "Config" to store the common configuration. The file is shown below.


Code

Configuration file. File name is Config.


# Coding parameters
TARGETKIND = MFCC_0_D_A
TARGETRATE = 100000.0
SAVECOMPRESSED = T
SAVEWITHCRC = T
WINDOWSIZE = 250000.0
USEHAMMING = T
PREEMCOEF = 0.97
NUMCHANS = 26
CEPLIFTER = 22
NUMCEPS = 12
ENORMALISE = F
SOURCEFORMAT = WAV

The configuration file is used by all HTK tools to minimize the number of parameters passed on the command line. Many parameters can be configured through the configuration file; for a full reference you should refer to the HTK manual. The above are the common configurations. The configuration parameters used here are:
TARGETKIND = MFCC_0_D_A

This indicates the kind of features that will be used: Mel Cepstrum (MFCC), with C0 used as the energy component. The delta and acceleration coefficients are computed and appended to the static MFCC.
TARGETRATE = 100000.0

Gives the frame period in units of 100 ns. This evaluates to 10 ms in our case.
SAVECOMPRESSED = T SAVEWITHCRC = T

The output should be saved in compressed format, and a CRC checksum should be added.
WINDOWSIZE = 250000.0 USEHAMMING = T PREEMCOEF = 0.97


A Hamming window with length 25 ms and pre-emphasis with coefficient 0.97 are used.


NUMCHANS = 26 CEPLIFTER = 22 NUMCEPS = 12

This configures the filter-bank and liftering stages. The number of channels in the filter bank is 26 and the number of cepstral coefficients is 12.
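Putting the coding parameters together: NUMCEPS = 12 cepstral coefficients plus the C0 energy term give 13 static features per frame, and appending the delta and acceleration coefficients (TARGETKIND = MFCC_0_D_A) gives 13 × 3 = 39 values per frame, which matches the <VecSize> 39 declared in the HMM prototype later in this chapter.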
ENORMALISE = F SOURCEFORMAT = WAV

Here the per-frame energy normalisation is disabled, and the source of the speech signal is the standard WAV file format.

We will need to create a text file that contain a list of all database files and the associated output feature file. Part of this file is illustrated below.
Code

The script file that stores the list of database files which will be processed for feature extraction process. Let us name this file as "codetr.scp"
1015356936.wav 1015356936.mfc
106707170.wav 106707170.mfc
1091955181.wav 1091955181.mfc
1135829975.wav 1135829975.mfc
1144342368.wav 1144342368.mfc

Now we can apply the following command line to extract the features from the database files.
Command line

HCopy -T 1 -C config -S codetr.scp


f. HMM models design

Now we need to design the models that will be used in the recognition process. A 3-state left-to-right model will be used at the phone level. In this case we assume that a phone consists of three parts: two transition parts at


phone boundaries and a middle part. We need to build a model and initialize it for each phone under test. We have a labeled database and associated feature files; those labeled files will be used to train each corresponding phone model. So we need to prepare the following files to build the models: 1- A Master Label File (MLF) that contains the cross-reference between the available training database and the phonetic contents. 2- A single HMM prototype file. This file will be used as the initial model for building a separate monophone HMM for each phone.
Code

MLF file. Let us name it "phones0.mlf"


#!MLF!# "F:\TEMP\temp1\TrainSet\1816204115.lab" sil Sa i f r sil . "F:\TEMP\temp1\TrainSet\1639459726.lab" sil Sa i f r sil . "F:\TEMP\temp1\TrainSet\1903731370.lab" sil Sa i f r sil .


At this stage the short-pause label is dropped from the MLF. We will not build a model for it now; we will use the silence model later on to build the SP model.


Code

The initial prototype for 3 emitting states and 2 Gaussian mixtures is presented below. Let us name it "proto".
~h "proto" <BeginHMM> <VecSize> 39 <MFCC_0_D_A> <NumStates> 5 <State> 2 <NumMixes> 2 <Mixture> 1 0.5 <Mean> 39 0 0 0 0 0 0 <Variance> 39 1 1 1 1 1 1 <Mixture> 2 0.5 <Mean> 39 0 0 0 0 0 0 <Variance> 39 1 1 1 1 1 1 <State> 3 <NumMixes> 2 <Mixture> 1 0.5 <Mean> 39 0 0 0 0 0 0 <Variance> 39 1 1 1 1 1 1 <Mixture> 2 0.5 <Mean> 39 0 0 0 0 0 0 <Variance> 39 1 1 1 1 1 1 <State> 4 <NumMixes> 2 <Mixture> 1 0.5 <Mean> 39 0 0 0 0 0 0 <Variance> 39 1 1 1 1 1 1 <Mixture> 2 0.5 <Mean> 39 0 0 0 0 0 0 <Variance> 39 1 1 1 1 1 1 <TransP> 5 0 0.16 0.84 0 0 0 0.04 0.47 0.49 0 0 0 0.26 0.38 0.36 0 0 0 0.22 0.78 0 0 0 0 0 <EndHMM>


Command line

The below command line is used to initialize the prototype model using the available database.
HCompV -C config -f 0.01 -m -S train.scp proto

Code

The file train.scp contains the list of the available training files. Below is part of the file "Train.scp"
F:\TEMP\temp1\TrainSet\1015356936.wav
F:\TEMP\temp1\TrainSet\106707170.wav
F:\TEMP\temp1\TrainSet\1135829975.wav
F:\TEMP\temp1\TrainSet\1197356831.wav
F:\TEMP\temp1\TrainSet\1236274632.wav
F:\TEMP\temp1\TrainSet\1243860769.wav
F:\TEMP\temp1\TrainSet\1265385611.wav
F:\TEMP\temp1\TrainSet\1299215089.wav
F:\TEMP\temp1\TrainSet\131557568.wav
F:\TEMP\temp1\TrainSet\1316301303.wav
F:\TEMP\temp1\TrainSet\1363171868.wav
F:\TEMP\temp1\TrainSet\1421050189.wav

Note

It is assumed that the associated label files are available in the same location as the corresponding WAV files. The above command line will generate two files: 1- the modified file proto, after initializing its parameters from the database; 2- the variance floor macro file named "vFloors". This macro file is generated due to the -f option in the command line. The variance floor in this case is equal to 0.01 times the global variance; it is a vector of values which will be used to set a floor on the variances estimated in the subsequent steps.


Code

The file named "vFloors" is presented below.


~v varFloor1
<Variance> 39
 3.558010e-001 2.060277e-001 1.728483e-001

Now we need to use the prototype file to create a separate initial HMM model for each phone under test. It is better to merge all models into a single file that contain them all. We will create two new files that will be used for all subsequent processes. 1- "Hmmdefs" file. This file contains all HMM definitions. 2- "Macros" file. This file contains the common macros in all HMM definition models.
Code

Below is the file "macros"


~o
<STREAMINFO> 1 39
<VECSIZE> 39<NULLD><MFCC_D_A_0><DIAGC>
~v varFloor1
<Variance> 39
 3.558010e-001 2.060277e-001 1.728483e-001

Code

Below is part of the file "hmmdefs".


~h "Sa" <BEGINHMM> <NUMSTATES> 5 <STATE> 2 <NUMMIXES> 2 <MIXTURE> 1 5.000000e-001 <MEAN> 39 -7.067261e+000 1.175784e+000 2.454413e+000 <VARIANCE> 39 3.558010e+001 2.060277e+001 1.728483e+001 Amr M. Gody


<GCONST> 9.240792e+001 <MIXTURE> 2 5.000000e-001 <MEAN> 39 -7.067261e+000 1.175784e+000 2.454413e+000 <VARIANCE> 39 3.558010e+001 2.060277e+001 1.728483e+001 <GCONST> 9.240792e+001 <STATE> 3 <NUMMIXES> 2 <MIXTURE> 1 5.000000e-001 <MEAN> 39 -7.067261e+000 1.175784e+000 2.454413e+000 <VARIANCE> 39 3.558010e+001 2.060277e+001 1.728483e+001 <GCONST> 9.240792e+001 <MIXTURE> 2 5.000000e-001 <MEAN> 39 -7.067261e+000 1.175784e+000 2.454413e+000 <VARIANCE> 39 3.558010e+001 2.060277e+001 1.728483e+001 <GCONST> 9.240792e+001 <STATE> 4 <NUMMIXES> 2 <MIXTURE> 1 5.000000e-001 <MEAN> 39 -7.067261e+000 1.175784e+000 2.454413e+000 <VARIANCE> 39 3.558010e+001 2.060277e+001 1.728483e+001 <GCONST> 9.240792e+001 <MIXTURE> 2 5.000000e-001 <MEAN> 39 -7.067261e+000 1.175784e+000 2.454413e+000 <VARIANCE> 39 3.558010e+001 2.060277e+001 1.728483e+001 <GCONST> 9.240792e+001 <TRANSP> 5 0.000000e+000 1.600000e-001 8.400000e-001 0.000000e+000 0.000000e+000 0.000000e+000 4.000000e-002 4.700000e-001 4.900000e-001 0.000000e+000 0.000000e+000 0.000000e+000 2.600000e-001 3.800000e-001 3.600000e-001 0.000000e+000 0.000000e+000 0.000000e+000 2.200000e-001 7.800000e-001 0.000000e+000 0.000000e+000 0.000000e+000 0.000000e+000 0.000000e+000 <ENDHMM> ~h "i" <BEGINHMM> <NUMSTATES> 5 <STATE> 2


<NUMMIXES> 2 <MIXTURE> 1 5.000000e-001 <MEAN> 39 . . . . <ENDHMM>

Figure 10: Master Macro Files (MMF) for HMM. To create the Master Macro File (MMF), simply copy the text from the prototype into the MMF file and repeat it for each phone, renaming each model with one of the phones under test as indicated in figure 10. Then, to create the file macros, copy the first 3 lines of the prototype and the contents of the file vFloors mentioned shortly before into a single file; this is the macros file. See figure 10 for more details.
Code

The first 3 lines in the proto file.


~o
<STREAMINFO> 1 39
<VECSIZE> 39<NULLD><MFCC_D_A_0><DIAGC>


As can be seen in the files above, some lines start with the symbol ~. This symbol indicates that the next character is a macro identification symbol. Each macro has a unique meaning in HTK; for a complete reference of the macros you should refer to the HTK manual. The macro ~h is for the HMM name, so each model starts with a macro that gives the name of the model. Let us navigate through the listed macros.

~h "i"

This is model name. Here is the name is i.


<BEGINHMM>

This tag indicates the beginning of the model definition.


<NUMSTATES> 5

This tag indicates that the model has 5 states, including the non-emitting states. It is always assumed that the first and the last states are non-emitting; they mark the entry and exit of the model.
<STATE> 2

This tag announces the beginning of the definition of state number 2. You may notice that there is no definition of state number 1. State number 1 and state number 5 are non emitting states so there is nothing to be defined for them.
<NUMMIXES> 2


This tag defines the number of Gaussian mixtures used to define the emitting probability distribution function for state 2.


<MIXTURE> 1 5.000000e-001

This tag announces the beginning of the definition of Mixture number 1. It also defines the mix ratio. It is 0.5 for this mixture.
<MEAN> 39

This tag defines the mean array. It is 39 elements. This should be the same length as the feature vector length. Then the mean array starts. Here are the first three elements in this array.
-7.067261e+000 1.175784e+000 2.454413e+000

<VARIANCE> 39

This tag announces the start of the variance array. In general the covariance matrix gives the correlation between the feature vector parameters. If there is no correlation between the feature vector parameters it is a diagonal N×N matrix, whose diagonal elements indicate the variance of each feature vector element. In this case we do not need to store the zero elements of the N×N array; it is sufficient to store only the N diagonal elements, as shown below. Below are the first 3 elements of the 39 total elements.
3.558010e+001 2.060277e+001 1.728483e+001

<GCONST> 9.240792e+001
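<GCONST> is a constant that HTK precomputes from the mixture variances (essentially the logarithm of the Gaussian normalisation term). It is stored in the model file to speed up the output probability calculation and does not need to be edited by hand.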

Then the above tags will be repeated for each state in the model. After that the definition of the transition matrix will begin by the following tag.
<TRANSP> 5

This is to announce the beginning of the transition matrix. It is 5 x 5 elements here. Below is the elements array
0.0e+000 1.6e-001 8.4e-001 0.0e+000 0.0e+000
0.0e+000 4.0e-002 4.7e-001 4.9e-001 0.0e+000
0.0e+000 0.0e+000 2.6e-001 3.8e-001 3.6e-001
0.0e+000 0.0e+000 0.0e+000 2.2e-001 7.8e-001
0.0e+000 0.0e+000 0.0e+000 0.0e+000 0.0e+000


Note that the non-emitting states are included in the transition matrix. The probability of staying in state 1 is 0; this is indicated by the element a_11 = 0. From state 1 we can only move to state 2 or state 3, which is expressed by the values a_12 = 0.16 and a_13 = 0.84. Going through the transition matrix we can figure out that the model is the one given in figure 11.

Figure 11: Left to Right model HMM.
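As a side note, the duration result of equation 4 applies here as well: with a self-loop probability a_22 = 0.04, the expected number of consecutive frames spent in the first emitting state of this prototype is 1/(1 − 0.04) ≈ 1.04, so larger self-loop values would be needed to model longer phone segments; the re-estimation steps below adjust these values from the training data.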

Starter Model

Create a subfolder named hmm0 and copy the files hmmdefs and macros into it. This will be the first set of model definitions. Now we can re-estimate the parameters using the available database.
Re-estimating models parameters
herest -C config -I phones0.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm0\macros -H hmm0\hmmdefs -M hmm1 monophones0


The file monophones0 contains a list of all monophones under test.


Code

The content of the file monophones0 which contains the list of all monophones under test is shown below.
Sa i f r sil w a2 ~h d @ t n i2 n2 a l h b ~@ x m s u Ta

The tool HErest will fetch the training data defined in the file train.scp for each phone listed in the file monophones0. It uses the file phones0.mlf to locate the files that contain the proper speech segments for training each phone. The initial parameters are read from the macro files hmmdefs and macros stored in the subfolder hmm0, and the final parameter estimates are stored in similar macro files in the subfolder hmm1. During training the token propagation method is utilized, and pruning of the good tokens is applied. Let us discuss the token propagation method. Consider figure 12. A token is a register object that is propagated through all available paths of the network defined by the system. Assume a word of length T monophones.

To find the best path, many tokens are launched to navigate every possible path of length T monophones. During the propagation, the probability is accumulated along with the monophone symbols. At the end, the path that evaluates to the maximum probability is considered the winner path. In figure 12, 2 tokens are launched to navigate the 2 paths as shown in the figure.

Figure 12: Token propagation

To minimize the calculation time in large networks, pruning is applied. It is assumed that there is a bandwidth for the propagated tokens: the bandwidth is a margin of probability relative to the maximum probability at time t. Figure 13 illustrates how the surviving tokens are chosen. The pruning process is controlled on the HErest command line by the -t parameter.
-t 250.0 150.0 1000

This parameter indicates that the pruning bandwidth starts at 250.0 during the re-estimation process. If a training file cannot be processed within this bandwidth, the bandwidth is incrementally increased by 150.0 and the file is reprocessed. The process is repeated until the file is processed successfully or the bandwidth exceeds 1000.0; in that case an error message is generated indicating a problem in the training file.

Figure 13: Pruning tokens.

Figure 14 explains how pruning is implemented in the token propagation method for a large network. Now we need to modify the SIL model and include the Speech Pause (SP) model in the starter model. It is much better to modify the SIL model so that it absorbs the environment noise and can extend much longer than the normal phones. This is achieved by adding a transition from state 4 to state 2: in this way the time spent in this model may be extended, since the token is not forced to propagate to state 5 (the end state) as in the regular phones but may propagate back to state 2. We also add a transition from state 2 to state 4 to ensure that the token can leave the model quickly if it needs to move on to the next phone.

Figure 14: Pruning the good tokens in large networks.


SIL model

Although we manually added some transition parameters to the transition matrix, the parameters will later be estimated from the available training data. We just add the track for the token to follow; the training data then strengthens or weakens the initial probability of any track. Regarding SP, it represents a short pause. It is very similar to SIL but for a very short period. We can make a model for SP and append it manually to

the MMF (the last hmmdefs). It is a 3-state model whose emitting state (state 2) will be shared with the middle state of SIL. This is done to make use of the available SIL training data and to avoid an under-training problem due to the lack of data: SP may appear very rarely in the speech, so it is better to make use of the similar SIL data when training the SP model.
Figure 15: SIL and SP models sharing the middle state (SIL state 3 is tied to SP state 2).

We will need to add the initial definition of the SP model into the MMF file. This may be added manually.
Code

~h "sp" Amr M. Gody

Page

47

The initial model definition of SP model is shown below. This part should be appended into the MMF file (hmmdefs).

Hidden Markov Model

Version 7

2010

<BeginHMM> <VecSize> 39 <MFCC_0_D_A> <NumStates> 3 <State> 2 <NumMixes> 1 <Mixture> 1 1 <Mean> 39 0 0 0 0 0 0 0 0 0 <Variance> 39 0 0 0 0 0 0 0 0 0 <TransP> 3 0 0.26 0.74 0 0.05 0.95 0 0 0 <EndHMM>

SP model

Parameter values may be anything; they will be modified in the re-estimation process using the training data.
Command line

The command line below edits the HMM models for SIL and SP so that they match the design in figure 15.
HHEd -H hmm2\macros -H hmm2\hmmdefs -M hmm3 sil.hed monophones1

Code

The below is the contents of the file "sil.hed"


AT 2 4 0.2 {sil.transP}
AT 4 2 0.2 {sil.transP}
AT 1 3 0.3 {sp.transP}
TI silst {sil.state[3],sp.state[2]}

Now the SP model is included in the Macro Definition File. We also need to modify the model list file by adding sp to it; it is better to rename the resulting file to monophones1.
Code

Below is the content of the file monophones1.


Sa i f r sil w a2 ~h d @ t n i2 n2 a l h b ~@ x m s u Ta sp

Now we may apply a single re-estimation pass on the last MMF files using the available training data. The last HMM macro files are available in the subfolder hmm3.

Re-estimating models parameters

herest -C config -I phones1.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm3\macros -H hmm3\hmmdefs -M hmm4 monophones1

MLF with SP

The master label file phones1.mlf is the same as phones0.mlf with the exception of adding the SP label.
What we have now

Now we have the following files


File name                         Location   Comment
Config                            Root       Contains the common parameters for HTK tools.
Monophones1                       Root       Contains the list of symbols considered in the recognition process.
Phones1.mlf                       Root       Cross-reference between training files and phonetic contents.
wdnet                             Root       Word network in lattice format.
Dict                              Root       Dictionary giving how each word is spoken in terms of the symbols (Arabic monophones) used in the recognition process.
Train.scp                         Root       Script file containing the list of training files.
Recorded materials for training   TrainSet   All WAV files needed for training the models.
Feature files for training        TrainSet   All feature files associated with the WAV files; same names as the WAV files with the extension .mfc.
Label files                       TrainSet   All label files for the associated WAV files; same names with the extension .lab.
hmmdefs                           Hmm4       Master Macro File that contains the model definitions of all symbols included in the recognition.
Macros                            Hmm4       Master Macro File that includes the common macro definitions for all symbols.

g. Evaluating the recognition process

The above HMM models need to be evaluated using a testing database. This database should be prepared and labeled in the same way as the training database to make it possible to count the correct answers. So we should prepare some data as indicated before and assign it to the testing process.
Code

Below is the file test.lab


93100 6237100 .
6237100 12241500 dial
12241500 16104700 .
16104700 22574500 one
22574500 25274200 .
25274200 31418200 two
31418200 36677800 .
36677800 43240700 five
43240700 47616000 .
47616000 52689500 six
52689500 56320000 .

Code

The file test.scp will be prepared. The file list all samples prepared for testing process. This file is shown below.

Testing the recognizer

Here is below the recognizer will be tested using the file test.wav.
HVite -C Config -H hmm4\macros -H hmm4\hmmdefs -S test.wav -i recout.mlf -w wdnet -p 100.0 -s 5.0 dict monophones1

This command will produce a master label format file called Recout.mlf.
Code

Below is the file recout.mlf


#!MLF!# "test.rec" 0 2100000 . -1479.104126 2100000 16200000 dial -12947.947266 16200000 16400000 . -171.385193 16400000 22100000 one -5647.436523 22100000 24400000 . -1825.443726 24400000 30800000 two -6835.759766 30800000 31000000 . -89.025650 31000000 33800000 five -2346.800781 33800000 36100000 . -2020.853516 36100000 52300000 six -16902.412109 52300000 56100000 . -3107.816406

. To evaluate the results, we may prepare the file testref.mlf. This file contains all label information of all files using in the test.


Code

The file testref.mlf is shown below:


#!MLF!# "test.lab" 93100 6237100 . 6237100 12241500 dial 12241500 16104700 . 16104700 22574500 one 22574500 25274200 . 25274200 31418200 two 31418200 36677800 . 36677800 43240700 five 43240700 47616000 . 47616000 52689500 six 52689500 56320000 . .

Testing the recognizer

Below is the command line to evaluate the recognizer


HResults -I testref.mlf monophones1 recout.mlf

The final results

Below is the result of the current system


====================== HTK Results Analysis ======================
Date: Sat Dec 12 20:43:01 2009
Ref : testref.mlf
Rec : recout.mlf
------------------------ Overall Results -------------------------
SENT: %Correct=100.00 [H=1, S=0, N=1]
WORD: %Corr=100.00, Acc=100.00 [H=11, D=0, S=0, I=0, N=11]
===================================================================


3. References
[1] L.R. Rabiner, R.W. Schafer, "Digital Processing of Speech Signals", Prentice-Hall, ISBN 0-13-213603-1.
[2] Thomas W. Parsons, "Voice and Speech Processing", McGraw-Hill Inc., 1987.
[3] Alessia Paglialonga, "Speech Processing for Cochlear Implants with the Discrete Wavelet Transform: Feasibility Study and Performance Evaluation", Proceedings of the 28th IEEE EMBS Annual International Conference, New York City, USA, Aug 30 - Sept 3, 2006.
[4] Mel scale, http://en.wikipedia.org/wiki/Mel_scale
[5] HTK manual, http://htk.eng.cam.ac.uk/ftp/software/htkbook.pdf.zip

