
Hidden Markov Model:

theoretical recap & applications


Introduction to HMMs
and their use in the speech
synthesis context

Claudio Zito
Overview

HMM as statistical signal models


From MC to HMM. Example: coin toss
Three basic problems for HMMs
Short introduction to TTS systems
HMM-based speech synthesis system

Claudio Zito
Why HMM?

 The model is very rich in mathematical structure and hence can form the theoretical basis for use in a wide range of applications
 The model, when applied properly, works very well in practice

Claudio Zito
Introduction
 Real-world processes generally produce observable outputs which can be characterized as signals:
 Discrete or continuous
 Stationary or non-stationary
 Pure or corrupted
 A problem of fundamental interest is characterizing such real-world signals in terms of signal models
Claudio Zito
Introduction
 Deterministic approach: the model generally exploits some known specific properties of the signal (e.g. the signal is a sine wave, or a sum of exponentials). All that is required is to estimate the values of the parameters of the signal model (e.g. amplitude, frequency, etc.)
 Statistical approach: the model tries to characterize only the statistical properties of the signal (Gaussian processes, Poisson processes, Markov models, etc.)
Claudio Zito
Introduction

 The underlying assumption of the statistical model is that the signal can be well characterized as a parametric random process, and that the parameters of the stochastic process can be determined (estimated) in a precise, well-defined manner

Claudio Zito
Markov Chain
 Consider a system with N distinct states S1, S2,…,SN
 At regularly spaced discrete times, the system undergoes a change of state according to a set of probabilities associated with the state

Claudio Zito
Markov Chain
 Let us denote the time instants associated with state changes as t = 1, 2, …
 Let us denote the state at time t as qt
 A full probabilistic description of the system would, in general, require specification of the current state (time t) as well as all the predecessor states
 In the special case of a first-order Markov chain, this probabilistic description is truncated to just the current state and its predecessor state:
P[qt = Sj | qt-1 = Si, qt-2 = Sk, …] = P[qt = Sj | qt-1 = Si]
Claudio Zito
Markov Chain

 We further assume the process is independent of time, leading to the set of state transition probabilities aij of the form:
aij = P[qt+1 = Sj | qt = Si], 1 ≤ i,j ≤ N
 The state transition coefficients obey the standard stochastic constraints:
aij ≥ 0
∑i=1N aij = 1 (for each row i)
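To make this concrete, a minimal Python sketch (not from the original slides) that simulates such a first-order chain; the 3-state transition matrix is an illustrative assumption:

```python
import numpy as np

# Illustrative 3-state transition matrix A; each row satisfies
# a_ij >= 0 and sums to 1 (the standard stochastic constraints).
A = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.5, 0.2],
              [0.2, 0.3, 0.5]])

def simulate_chain(A, q1, T, rng=np.random.default_rng(0)):
    """Return a state sequence q_1..q_T starting from state q1."""
    states = [q1]
    for _ in range(T - 1):
        # The next state depends only on the current state (first-order):
        # it is drawn from the row of A indexed by the current state.
        states.append(int(rng.choice(len(A), p=A[states[-1]])))
    return states

print(simulate_chain(A, q1=0, T=10))  # e.g. [0, 0, 1, 1, 2, ...]
```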

Claudio Zito
Overview

HMM as statistical signal models


From MC to HMM. Example: coin toss
Three basic problems for HMMs
Short introduction to TTS systems
HMM-based speech synthesis system

Claudio Zito
Extension to HMM
 We extend the concept of a Markov chain to include the case where the observation is a probabilistic function of the state
 The resulting model is a doubly embedded stochastic process with an underlying stochastic process that is not observable (hidden), but can only be observed through another set of stochastic processes that produce the sequence of observations
Claudio Zito
Coin toss model
Assume the following scenario:
 You are in a room with a barrier through which you cannot see what is happening
 On the other side of the barrier is another person who is performing a coin (or multiple coin) tossing experiment
 The other person will not tell you exactly what he is doing, but only the result of each coin flip
Given the above scenario, the problem of interest is: how do we build an HMM to explain the observed sequence of heads and tails?
Claudio Zito
Coin toss model:
how many states?
 One choice would be to assume that a single biased coin was being tossed; we could then model the situation with a 2-state model in which each state corresponds to a side of the coin (i.e. heads or tails)
 In Fig (2), a 2-state model in which each state corresponds to a different, biased coin. Each state is characterized by a probability distribution of heads and tails, and transitions between states are characterized by a state transition matrix
 In Fig (3), a 3-state model. This model corresponds to using 3 biased coins, and choosing from among the three based on some probabilistic event
 The physical mechanism which accounts for how state transitions are selected could itself be a set of independent coin tosses, or some other probabilistic event
Claudio Zito
Coin toss model
which model best matches?
It should be clear that:
 the simple 1-coin model has only 1 unknown parameter;
 the 2-coin model has 4 unknown parameters;
 the 3-coin model has 9 unknown parameters.
It might be the case that only a single coin is being tossed; then using the 3-coin model would be inappropriate, since the actual physical event would not correspond to the model being used
Claudio Zito
Elements of an HMM
 N, the number of states in the model
 Individual states are denoted S = {S1, S2,…,SN}
 The state at time t is qt
 M, the number of distinct observation symbols per state
 Individual symbols are denoted V = {V1, V2,…, VM}
 The state transition probability distribution A = {aij}, where
aij = P[qt+1 = Sj | qt = Si], 1 ≤ i,j ≤ N
 The observation symbol probability distribution in state j, B = {bj(k)}, where
bj(k) = P[Vk at t | qt = Sj], 1 ≤ j ≤ N, 1 ≤ k ≤ M
 The initial state distribution π = {πi}, where
πi = P[q1 = Si], 1 ≤ i ≤ N

Claudio Zito
Elements of an HMM

Claudio Zito
Elements of an HMM
 Given appropriate values of N, M, A, B and π, the HMM can be used as a generator to give an observed sequence
O = O1O2…OT
as follows (see the sketch below):
1. Choose an initial state q1 = Si according to π
2. Set t = 1
3. Choose Ot = Vk according to bi(k)
4. Transition to a new state qt+1 = Sj according to aij
5. Set t = t+1; return to 3) if t < T; otherwise terminate the procedure
 For convenience we use the compact notation
λ = (A, B, π)
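A minimal Python sketch of this generator procedure; the 2-state, 2-symbol parameter values are illustrative assumptions (e.g. two biased coins emitting heads = 0, tails = 1):

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative model lambda = (A, B, pi), N = 2 states, M = 2 symbols.
A  = np.array([[0.9, 0.1],   # a_ij = P[q_{t+1}=S_j | q_t=S_i]
               [0.2, 0.8]])
B  = np.array([[0.5, 0.5],   # b_j(k) = P[V_k at t | q_t=S_j]
               [0.8, 0.2]])
pi = np.array([0.6, 0.4])    # pi_i = P[q_1 = S_i]

def generate(A, B, pi, T):
    """Use the HMM as a generator of an observation sequence O_1..O_T."""
    q = rng.choice(len(pi), p=pi)                   # step 1: q_1 ~ pi
    O = []
    for _ in range(T):
        O.append(int(rng.choice(B.shape[1], p=B[q])))  # step 3: O_t ~ b_q(.)
        q = rng.choice(len(A), p=A[q])                 # step 4: q_{t+1} ~ a_q.
    return O

print(generate(A, B, pi, T=10))
```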

Claudio Zito
Overview

HMM as statistical signal models


From MC to HMM. Example: coin toss
Three basic problems for HMMs
Short introduction to TTS systems
HMM-based speech synthesis system

Claudio Zito
The three basic problems
for HMMs
1. Given the observed sequence O = O1O2…OT
and the model λ = (A, B, π), how do we
efficiently compute P(O| λ)?
2. Given the observed sequence O = O1O2…OT
and the model λ, how do we choose a
corresponding state sequence Q = q1q2…qT
which is optimal in some meaningful sense?
3. How do we adjust the model parameters λ
= (A, B, π) to maximize P(O| λ)?
Claudio Zito
To fix ideas: simple isolated
word speech recognizer
 Idea: for each word we want to design a separate N-state HMM
 Input: the speech signal of a given word, represented as a time sequence of coded spectral vectors
 Task 1: optimally estimate the model parameters for each word model (problem 3)
 Task 2: segment each of the word training sequences into states, and then study the properties of the spectral vectors that lead to the observations occurring in each state (problem 2)
 Task 3: once the set of HMMs has been designed and optimized, recognition of an unknown word is performed (problem 1)

Claudio Zito
To fix ideas: simple isolated
word speech recognizer

Claudio Zito
Output probability of HMM

Claudio Zito
The three basic problems
for HMMs
1. Given the observed sequence O = O1O2…OT
and the model λ = (A, B, π), how do we
efficiently compute P(O| λ)?
2. Given the observed sequence O = O1O2…OT
and the model λ, how do we choose a
corresponding state sequence Q = q1q2…qT
which is optimal in some meaningful sense?
3. How do we adjust the model parameters λ
= (A, B, π) to maximize P(O| λ)?
Claudio Zito
Solution to the
evaluation problem
We wish to calculate P(O|λ)
The most straightforward way is to enumerate every possible state sequence Q:
P(O|λ) = ∑all Q P(O|Q,λ) P(Q|λ), where P(O|Q,λ) = Πt=1T P(Ot|qt,λ)
This involves on the order of 2T·N^T calculations
A more efficient procedure, the "forward-backward procedure", involves on the order of N^2·T calculations

Claudio Zito
Forward-backward procedure
 Forward variable αt(i), defined as
αt(i) = P(O1…Ot, qt=Si | λ)
 We can solve for αt(i) inductively, as follows:
 Initialization:
α1(i) = πi bi(O1), 1 ≤ i ≤ N
 Induction:
αt+1(j) = [∑i=1N αt(i) aij] bj(Ot+1), 1 ≤ t ≤ T-1, 1 ≤ j ≤ N
(since αt(i) aij = P(O1…Ot, qt=Si | λ) P(qt+1=Sj | qt=Si), summing over i gives P(O1…Ot, qt+1=Sj | λ))
 Termination:
P(O|λ) = ∑i=1N αT(i)
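A minimal Python sketch of the forward pass, assuming the conventions above (A is N×N, B is N×M, and O is a sequence of symbol indices):

```python
import numpy as np

def forward(A, B, pi, O):
    """Forward procedure: returns alpha (T x N) and P(O|lambda)."""
    T, N = len(O), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]                 # initialization: pi_i b_i(O_1)
    for t in range(T - 1):
        # induction: alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] b_j(O_{t+1})
        alpha[t + 1] = (alpha[t] @ A) * B[:, O[t + 1]]
    return alpha, alpha[-1].sum()              # termination: sum_i alpha_T(i)
```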

Claudio Zito
Forward-backward procedure
 Backward variable βt(i), defined as
βt(i) = P(Ot+1…OT | qt=Si, λ)
 We can solve for βt(i) inductively, as follows:
 Initialization:
βT(i) = 1, 1 ≤ i ≤ N
 Induction:
βt(i) = ∑j=1N aij bj(Ot+1) βt+1(j), t = T-1, T-2, …, 1, 1 ≤ i ≤ N
 The backward part is not strictly necessary to solve the evaluation problem, but it will be used to help solve the learning problem
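A matching sketch of the backward pass, under the same matrix conventions:

```python
import numpy as np

def backward(A, B, O):
    """Backward procedure: returns beta (T x N)."""
    T, N = len(O), A.shape[0]
    beta = np.zeros((T, N))
    beta[-1] = 1.0                             # initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        # induction: beta_t(i) = sum_j a_ij b_j(O_{t+1}) beta_{t+1}(j)
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])
    return beta
```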
Claudio Zito
The three basic problems
for HMMs
1. Given the observed sequence O = O1O2…OT
and the model λ = (A, B, π), how do we
efficiently compute P(O| λ)?
2. Given the observed sequence O = O1O2…OT
and the model λ, how do we choose a
corrisponding state sequence Q = q1q2…qT
which is optimal in some meaningful sense?
3. How do we adjust the model parameters λ
= (A, B, π) to maximize P(O| λ)?
Claudio Zito
Solution to the
decoding problem
 We wish to calculate the "optimal" state sequence associated with the given observed sequence
 One possible optimality criterion is to choose the states qt which are individually most likely… problem: the "optimal" state sequence may not even be a valid state sequence
 A widely used, reasonable criterion is to find the single state sequence (path) which maximizes P(Q|O,λ), or equivalently P(Q,O|λ) (Viterbi algorithm)

Claudio Zito
Viterbi Algorithm
We need to define the quantity δt(i):
δt(i) = maxpath P[q1…qt-1, qt=Si, O1…Ot | λ]
where path = q1…qt-1
By induction we have
δt+1(j) = [maxi δt(i) aij] bj(Ot+1)
We then need to keep track of the argument which maximizes the formula above, for each t and j. We do this via the array ψt(j)
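A minimal Python sketch of the algorithm, under the same matrix conventions as the earlier sketches:

```python
import numpy as np

def viterbi(A, B, pi, O):
    """Viterbi algorithm: most likely state path for observations O."""
    T, N = len(O), len(pi)
    delta = np.zeros((T, N))
    psi   = np.zeros((T, N), dtype=int)        # argmax back-pointers psi_t(j)
    delta[0] = pi * B[:, O[0]]
    for t in range(1, T):
        # delta_t(j) = [max_i delta_{t-1}(i) a_ij] b_j(O_t)
        scores   = delta[t - 1][:, None] * A   # scores[i, j]
        psi[t]   = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, O[t]]
    # backtrack from the best final state
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1], delta[-1].max()
```

Replacing the max/argmax with a sum recovers the forward recursion, which is exactly the difference noted on the next slide.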
Claudio Zito
Viterbi Algorithm

 It is similar in implementation to the forward calculation
 The major difference is the maximization over previous states in the recursion step, which is used in place of the summing procedure

Claudio Zito
The three basic problems
for HMMs
1. Given the observed sequence O = O1O2…OT
and the model λ = (A, B, π), how do we
efficiently compute P(O| λ)?
2. Given the observed sequence O = O1O2…OT
and the model λ, how do we choose a
corrisponding state sequence Q = q1q2…qT
which is optimal in some meaningful sense?
3. How do we adjust the model parameters λ
= (A, B, π) to maximize P(O| λ)?
Claudio Zito
Solution to the
learning problem
 We seek a method to adjust the model parameters (A, B, π) to maximize the probability of the given observed sequence
 There is no known way to analytically solve the problem
 Given any finite observation sequence as training data, there is no optimal way of estimating the model parameters
 We can, however, choose λ = (A, B, π) such that P(O|λ) is locally maximized using
 an iterative procedure (Baum-Welch), or
 gradient techniques

Claudio Zito
Baum-Welch algorithm
 A set of reasonable reestimation formulas for the model parameters, easily calculated using the forward and backward variables, is:
π̄i = γ1(i)
āij = ∑t=1T-1 ξt(i,j) / ∑t=1T-1 γt(i)
b̄j(k) = ∑t: Ot=Vk γt(j) / ∑t=1T γt(j)
where γt(i) = P(qt=Si | O, λ) and ξt(i,j) = P(qt=Si, qt+1=Sj | O, λ)
 The iterative procedure using the reestimated parameters improves the probability of O being observed from the model, until some limiting point is reached
 The procedure's output is the maximum likelihood estimate of the HMM
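A minimal sketch of one reestimation step, reusing the `forward` and `backward` sketches above; this is an illustrative implementation of the standard formulas, not code from the slides:

```python
import numpy as np

def baum_welch_step(A, B, pi, O):
    """One Baum-Welch reestimation step; returns updated (A, B, pi)."""
    T, N = len(O), len(pi)
    alpha, prob = forward(A, B, pi, O)     # from the earlier sketches
    beta = backward(A, B, O)
    gamma = alpha * beta / prob            # gamma_t(i) = P(q_t=S_i | O, lambda)
    # xi_t(i,j) = alpha_t(i) a_ij b_j(O_{t+1}) beta_{t+1}(j) / P(O|lambda)
    xi = (alpha[:-1, :, None] * A[None, :, :] *
          (B[:, O[1:]].T * beta[1:])[:, None, :]) / prob
    new_pi = gamma[0]
    new_A  = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B  = np.zeros_like(B)
    for k in range(B.shape[1]):            # expected emissions of symbol V_k
        new_B[:, k] = gamma[np.array(O) == k].sum(axis=0)
    new_B /= gamma.sum(axis=0)[:, None]
    return new_A, new_B, new_pi
```

Iterating this step improves P(O|λ) monotonically until a (local) limiting point is reached.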
Claudio Zito
Notes on reestimation
procedure

 The forward-backward (Baum-Welch) algorithm leads to local maxima only
 The reestimation formulas can be interpreted as an implementation of the EM algorithm

Claudio Zito
Overview

HMM as statistical signal models


From MC to HMM. Example: coin toss
Three basic problems for HMMs
Short introduction to TTS systems
HMM-based speech synthesis system

Claudio Zito
History and development of
speech synthesis
The earliest efforts to produce synthetic speech were made over two hundred years ago. In St. Petersburg in 1779, the Russian professor Christian Kratzenstein explained the physiological differences between five long vowels (/a/, /e/, /i/, /o/, and /u/) and built an apparatus to produce them artificially

Claudio Zito
History and development of
speech synthesis
 A few years later, in Vienna 1791, Wolfgang von Kempelen
introduced his "Acoustic-Mechanical Speech Machine", which
was able to produce single sounds and some sound
combinations.
 Kempelen started his work before Kratzenstein, in 1769, and
after over 20 years of research he also published a book in
which he described his studies on human speech production and
the experiments with his speaking machine.
 The essential parts of the machine were a pressure chamber for
the lungs, a vibrating reed to act as vocal cords, and a leather
tube for the vocal tract action. By manipulating the shape of the
leather tube he could produce different vowel sounds.
Consonants were simulated by four separate constricted
passages and controlled by the fingers.
Claudio Zito
History and development of
speech synthesis
 In about the mid-1800s, Charles Wheatstone constructed his famous version of von Kempelen's speaking machine.
 It was a bit more complicated and was capable of producing vowels and most of the consonant sounds. Some sound combinations and even full words could also be produced.

Claudio Zito
History and development of
speech synthesis
 In the late 1800s, Alexander Graham Bell and his father, inspired by Wheatstone's speaking machine, constructed a similar kind of speaking machine.
 Bell also made some questionable experiments with his terrier: he put his dog between his legs and made it growl, then modified its vocal tract by hand to produce speech-like sounds

Claudio Zito
History and development of
speech synthesis
 The first device to be considered a speech synthesizer was the VODER (Voice Operating Demonstrator), introduced by Homer Dudley.
 The VODER was inspired by the VOCODER (Voice Coder) developed at Bell Laboratories in the mid-thirties.
 The original VOCODER was a device for analyzing speech into slowly varying acoustic parameters that could then drive a synthesizer to reconstruct an approximation of the original speech signal.
Claudio Zito
History and development of
speech synthesis

Claudio Zito
Text-To-Speech system
 Modern speech synthesis technologies involve:
 Front-end
 Analyses text and converts it to a linguistic specification
 Back-end
 Converts the linguistic specification to speech, e.g. by
 formant synthesis,
 concatenating small pieces of pre-recorded speech, or
 generating speech from a model

Claudio Zito
From words to linguistic
specification

Claudio Zito
Concatenative corpus-based
unit selection TTS
 The waveform is made by concatenating different acoustic units from a database
 No digital synthesis
 Pros
 More natural voice
 Cons
 Large database
 Target and concatenation costs
 Speaker dependent
Claudio Zito
Overview

HMM as statistical signal models


From MC to HMM. Example: coin toss
Three basic problems for HMMs
Short introduction to TTS systems
HMM-based speech synthesis system

Claudio Zito
HMM-based TTS system (HTS)
 It can generate various voice characteristics
 System overview:
 In the training part, spectrum and excitation parameters are extracted from a speech database and modeled by context-dependent HMMs.
 In the synthesis part, context-dependent HMMs are concatenated according to the text to be synthesized.
Claudio Zito
From linguistic specification
to speech
/dh/ /ax/ /k/ /ae/ /t/ …

Claudio Zito
HTS implementation
on Festival
 Training data: 524 sentences from the CMU Communicator database
 HMM input: the speech signal was sampled at 16 kHz and windowed by a 25-ms Blackman window with a 5-ms shift. Mel-cepstral coefficients were obtained by a mel-cepstral analysis technique
 Architecture: 5-state left-to-right HMMs with single diagonal-covariance Gaussian output distributions
 HMM output: spectrum part and excitation part of the waveform
 Each context-dependent HMM corresponds to a phoneme-sized speech unit
Claudio Zito
HMM-based TTS system (HTS)
 The synthesis part proceeds as follows:
 An arbitrarily given text to be synthesized is converted to a context-based label sequence.
 According to the label sequence, a sentence HMM is constructed by concatenating context-dependent HMMs.
 State durations, mel-cepstral coefficients and values including voiced/unvoiced decisions of the sentence HMM are determined so as to maximize the output probability for the HMM
Claudio Zito
HMM-based TTS system (HTS)

Claudio Zito
Summary
 State-of-the-art unit selection speech synthesis can generate natural-sounding, high quality speech
 Human-like talking machines need to generate speech with
 arbitrary speakers' voice characteristics
 various speaking styles, including native and non-native speaking styles in different languages
 varying emotional expressions
 It is still difficult to achieve such flexibility with unit-selection synthesizers, since they need a large-scale speech corpus for each voice
 With statistical parametric speech synthesis based on HMMs
 the original speaker's voice characteristics can easily be reproduced
 using a very small amount of adaptation speech data
 it is still difficult to generate natural-sounding, high quality speech

Claudio Zito
References
 L.R. Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2):257-286, 1989
 K. Tokuda, H. Zen, A.W. Black. An HMM-Based Speech Synthesis System Applied to English. 2002
 History and Development of Speech Synthesis, url: http://www.acoustics.hut.fi/publications/files/theses/lemmetty_mst/chap2.html
 HMM-based Speech Synthesis System (HTS), url: http://hts.sp.nitech.ac.jp/
 Effective Multilingual Interaction in Mobile Environments (EMIME), url: http://www.emime.org/learn/speech-synthesis/read/tutorial-slides-of-hmm-based-speech-synthesis

Claudio Zito
