Claudio Zito
Overview
Why HMM?
Introduction
Real-world processes generally produce observable outputs which can be characterized as signals:
Discrete or continuous
Stationary or non-stationary
Pure or corrupted
A problem of fundamental interest is characterizing such real-world signals in terms of signal models
Deterministic approach: the model generally exploits some known specific properties of the signal (the signal is a sine wave, a sum of exponentials, etc.). All that is required is to estimate the values of the parameters of the signal model (e.g. amplitude, frequency, etc.)
Statistical approach: the model tries to characterize only the statistical properties of the signal (Gaussian processes, Poisson processes, Markov models, etc.)
Markov Chain
Consider a system with N distinct states S_1, S_2, …, S_N
At regularly spaced discrete times, the system undergoes a change of state (possibly back to the same state) according to a set of probabilities associated with the state.
Let t = 1, 2, … denote the time instants associated with state changes
Let q_t denote the state at time t
A full probabilistic description of the system would, in general, require specification of the current state (time t) as well as all the predecessor states
In the special case of a first-order Markov chain, this probabilistic description is truncated to just the current state and its predecessor state:
P[q_t = S_j | q_{t-1} = S_i, q_{t-2} = S_k, …] = P[q_t = S_j | q_{t-1} = S_i]
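The first-order property can be illustrated with a short simulation. This is a minimal sketch, not part of the original slides: the 3-state transition matrix, the initial distribution and the function name `simulate_markov_chain` are assumptions chosen for illustration.

```python
import random


def simulate_markov_chain(transition, initial, steps, rng):
    """Sample a state sequence q_1..q_steps from a first-order Markov chain."""
    states = list(range(len(initial)))
    # Draw q_1 from the (assumed) initial distribution.
    q = rng.choices(states, weights=initial)[0]
    path = [q]
    for _ in range(steps - 1):
        # The next state depends only on the current state (row q of the matrix).
        q = rng.choices(states, weights=transition[q])[0]
        path.append(q)
    return path


# Hypothetical 3-state chain; each row of `transition` sums to 1.
transition = [
    [0.8, 0.1, 0.1],
    [0.2, 0.6, 0.2],
    [0.3, 0.3, 0.4],
]
initial = [1.0, 0.0, 0.0]
path = simulate_markov_chain(transition, initial, 10, random.Random(0))
print(path)
```

States are represented by their indices 0..N-1, so index i stands for S_{i+1}. The sampler never looks further back than the current state, which is exactly the truncation expressed by the equation above.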
Extension to HMM
We extend the concept of Markov chains to include the case where the observation is a probabilistic function of the state
The resulting model is a doubly embedded stochastic process with an underlying stochastic process that is not observable (hidden), but can only be observed through another set of stochastic processes that produce the sequence of observations
Coin toss model
Assume the following scenario:
You are in a room with a barrier through which you cannot see what is happening.
On the other side of the barrier is another person who is performing a coin (or multiple-coin) tossing experiment.
The other person will not tell you exactly what he is doing, only the result of each coin flip
Given the above scenario, the problem of interest is how to build an HMM to explain the observed sequence of heads and tails
Coin toss model:
how many states?
One choice would be to assume that a single biased coin was being tossed, and we could model the situation with a 2-state model (i.e. one state for heads and one for tails)
In Fig. (2), a 2-state model: each state corresponds to a different, biased coin. Each state is characterized by a probability distribution of heads and tails, and transitions between states are characterized by a state transition matrix
The physical mechanism which accounts for how state transitions are selected could itself be a set of independent coin tosses, or some other probabilistic event
In Fig. (3), a 3-state model: this model corresponds to using 3 biased coins and choosing from among the three based on some probabilistic event
Coin toss model:
which model matches best?
It should be clear that
the simple 1-coin model has only 1 unknown parameter;
the 2-coin model has 4 unknown parameters;
and the 3-coin model has 9 unknown parameters.
It might be the case that only a single coin is being tossed; using the 3-coin model would then be inappropriate, since the actual physical event would not correspond to the model being used
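The parameter counts above can be reproduced with a small calculation. This sketch rests on two assumptions that are not stated on the slide: the initial state distribution is not counted, and the 1-coin model is treated as a degenerate 1-state HMM with two output symbols. The function name `free_parameters` is hypothetical.

```python
def free_parameters(num_states, num_symbols):
    """Count free parameters of a discrete HMM, ignoring the initial distribution."""
    # Each row of the transition matrix sums to 1: num_states - 1 free entries per row.
    transition = num_states * (num_states - 1)
    # Each row of the emission matrix sums to 1: num_symbols - 1 free entries per row.
    emission = num_states * (num_symbols - 1)
    return transition + emission


counts = {coins: free_parameters(coins, 2) for coins in (1, 2, 3)}
for coins, count in counts.items():
    print(coins, "coin(s):", count)
```

Under these assumptions the counts come out as 1, 4 and 9, matching the slide.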
Elements of an HMM
N, the number of states in the model
Individual states are denoted S = {S_1, S_2, …, S_N}
The state at time t is denoted q_t
M, the number of distinct observation symbols per state
Individual symbols are denoted V = {V_1, V_2, …, V_M}
The state transition probability distribution A = {a_ij}, where
a_ij = P[q_{t+1} = S_j | q_t = S_i], 1 ≤ i, j ≤ N
The observation symbol probability distribution in state j, B = {b_j(k)}, where
b_j(k) = P[V_k at t | q_t = S_j], 1 ≤ j ≤ N, 1 ≤ k ≤ M
The initial state distribution π = {π_i}, where
π_i = P[q_1 = S_i], 1 ≤ i ≤ N
Given appropriate values of N, M, A, B and π, the HMM can be used as a generator to give an observed sequence
O = O_1 O_2 … O_T
as follows:
1. Choose an initial state q_1 = S_i according to π
2. Set t = 1
3. Choose O_t = V_k according to b_i(k)
4. Transit to a new state q_{t+1} = S_j according to a_ij
5. Set t = t+1; return to step 3 if t ≤ T; otherwise terminate the procedure
For convenience we use the compact notation
λ = (A, B, π)
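The generation procedure can be sketched in a few lines. The 2-coin parameter values and the function name `generate` are assumptions made for illustration; states and symbols are represented by 0-based indices.

```python
import random


def generate(A, B, pi, T, rng):
    """Generate T hidden states and observation indices from lambda = (A, B, pi)."""
    n_states, n_symbols = len(A), len(B[0])
    # Step 1: choose the initial state q_1 according to pi.
    q = rng.choices(range(n_states), weights=pi)[0]
    states, observations = [], []
    for _ in range(T):
        # Step 3: choose O_t = V_k according to b_i(k) for the current state.
        k = rng.choices(range(n_symbols), weights=B[q])[0]
        states.append(q)
        observations.append(k)
        # Step 4: move to a new state according to row q of A.
        q = rng.choices(range(n_states), weights=A[q])[0]
    return states, observations


# Hypothetical 2-coin model: symbol 0 = heads, symbol 1 = tails.
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.5, 0.5]
states, observations = generate(A, B, pi, 8, random.Random(1))
print(states)
print(observations)
```

Only `observations` would be visible to the person behind the barrier in the coin toss scenario; `states` is the hidden part of the process.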
The three basic problems
for HMMs
1. Given the observed sequence O = O_1 O_2 … O_T and the model λ = (A, B, π), how do we efficiently compute P(O|λ)?
2. Given the observed sequence O = O_1 O_2 … O_T and the model λ, how do we choose a corresponding state sequence Q = q_1 q_2 … q_T which is optimal in some meaningful sense?
3. How do we adjust the model parameters λ = (A, B, π) to maximize P(O|λ)?
To fix ideas: simple isolated
word speech recognizer
Idea: for each word we want to design a separate N-state HMM
Input: the speech signal of a given word, represented as a time sequence of coded spectral vectors
Task 1: optimally estimate the model parameters for each word model (problem 3)
Task 2: segment each of the word training sequences into states, and then study the properties of the spectral vectors that lead to the observations occurring in each state (problem 2)
Task 3: once the set of HMMs has been designed and optimized, recognize an unknown word (problem 1)
Solution to the
evaluation problem
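We wish to calculate P(O|λ)
The most straightforward way is to enumerate every possible state sequence Q = q_1 q_2 … q_T:
P(O|λ) = ∑_Q P(O|Q,λ) P(Q|λ), with P(O|Q,λ) = ∏_{t=1}^{T} P(O_t|q_t,λ)
This involves on the order of 2T·N^T calculations
A more efficient procedure, called the "forward-backward procedure", involves on the order of N^2·T calculations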
Forward-backward procedure
Forward variable α_t(i), defined as
α_t(i) = P(O_1 … O_t, q_t = S_i | λ)
We can solve for α_t(i) inductively, as follows:
Initialization
α_1(i) = π_i b_i(O_1), 1 ≤ i ≤ N
Induction
α_{t+1}(j) = [∑_{i=1}^{N} α_t(i) a_ij] b_j(O_{t+1}),
1 ≤ t ≤ T-1,
1 ≤ j ≤ N
Each term of the sum is α_t(i) a_ij = P(O_1 … O_t, q_t = S_i | λ) P(q_{t+1} = S_j | q_t = S_i)
Termination
P(O|λ) = ∑_{i=1}^{N} α_T(i)
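The initialization, induction and termination steps translate directly into code. This is a minimal sketch for a discrete HMM with 0-based indices; the 2-state example values are assumptions, and no scaling is applied, so it is only suitable for short sequences.

```python
def forward(A, B, pi, obs):
    """Return the table alpha[t][i] and P(O | lambda) for a discrete HMM."""
    n = len(pi)
    # Initialization: alpha_1(i) = pi_i * b_i(O_1).
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        # Induction: sum over predecessor states i, then weight by b_j(O_{t+1}).
        alpha.append([
            sum(prev[i] * A[i][j] for i in range(n)) * B[j][o]
            for j in range(n)
        ])
    # Termination: P(O | lambda) is the sum of the final forward variables.
    return alpha, sum(alpha[-1])


# Hypothetical 2-state model with 2 observation symbols.
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.5, 0.5]
alpha, likelihood = forward(A, B, pi, [0, 1, 0])
print(likelihood)
```

Each time step costs N^2 multiplications, which gives the N^2·T order of growth instead of the N^T growth of direct enumeration.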
Backward variable β_t(i), defined as
β_t(i) = P(O_{t+1} … O_T | q_t = S_i, λ)
We can solve for β_t(i) inductively, as follows:
Initialization
β_T(i) = 1, 1 ≤ i ≤ N
Induction
β_t(i) = ∑_{j=1}^{N} a_ij b_j(O_{t+1}) β_{t+1}(j),
t = T-1, T-2, …, 1,
1 ≤ i ≤ N
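The backward recursion can be sketched in the same style. The example model values are assumptions, time is indexed from 0, and no scaling is applied.

```python
def backward(A, B, obs, n_states):
    """Return the table beta[t][i] for a discrete HMM (0-based t)."""
    T = len(obs)
    # Initialization: beta_T(i) = 1 for every state.
    beta = [[1.0] * n_states for _ in range(T)]
    # Induction, running backwards from the second-to-last time step.
    for t in range(T - 2, -1, -1):
        for i in range(n_states):
            beta[t][i] = sum(
                A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                for j in range(n_states)
            )
    return beta


# Hypothetical 2-state model with 2 observation symbols.
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.5, 0.5]
obs = [0, 1, 0]
beta = backward(A, B, obs, 2)
# P(O | lambda) can also be recovered from the first row of beta.
likelihood = sum(pi[i] * B[i][obs[0]] * beta[0][i] for i in range(2))
print(likelihood)
```

The value recovered from the first row of `beta` should agree with the one obtained by summing the final forward variables, which is a convenient consistency check.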
Viterbi Algorithm
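We need to define the quantity δ_t(i):
δ_t(i) = max_{q_1 … q_{t-1}} P[q_1 … q_{t-1}, q_t = S_i, O_1 … O_t | λ]
i.e. the highest probability along a single path that accounts for the first t observations and ends in state S_i
By induction we have
δ_{t+1}(j) = [max_i δ_t(i) a_ij] b_j(O_{t+1})
We also need to keep track of the argument that maximized the formula above, for each t and j. We do this via the array ψ_t(j)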
It is similar in implementation to the forward calculation
The major difference is the maximization over previous states in the recursion step, which is used in place of the summing procedure
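A compact sketch of the Viterbi recursion with backtracking follows. The example model is an assumption, indices are 0-based, and plain probabilities are used rather than log probabilities, so it is only suitable for short sequences.

```python
def viterbi(A, B, pi, obs):
    """Return the most likely state path and its joint probability with obs."""
    n = len(pi)
    # Initialization: delta_1(i) = pi_i * b_i(O_1); no predecessor yet.
    delta = [pi[i] * B[i][obs[0]] for i in range(n)]
    psi = []
    for o in obs[1:]:
        step, new_delta = [], []
        for j in range(n):
            # Maximize over the previous state i instead of summing.
            best_i = max(range(n), key=lambda i: delta[i] * A[i][j])
            step.append(best_i)
            new_delta.append(delta[best_i] * A[best_i][j] * B[j][o])
        psi.append(step)
        delta = new_delta
    # Termination, then backtracking through the stored arguments psi.
    last = max(range(n), key=lambda i: delta[i])
    path = [last]
    for step in reversed(psi):
        path.append(step[path[-1]])
    path.reverse()
    return path, delta[last]


# Hypothetical 2-state model with 2 observation symbols.
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.5, 0.5]
obs = [0, 0, 1]
path, prob = viterbi(A, B, pi, obs)
print(path, prob)
```

Compared with the forward calculation, the only structural changes are the `max` in place of the sum and the extra bookkeeping list `psi` needed to recover the path.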
Solution to the
learning problem
We want a method to adjust the model parameters (A, B, π) to maximize the probability of the given observed sequence
There is no known way to solve this problem analytically
Given any finite observation sequence as training data, there is no optimal way of estimating the model parameters
We can, however, choose λ = (A, B, π) such that P(O|λ) is locally maximized, using
An iterative procedure (Baum-Welch)
Gradient techniques
Claudio Zito
Baum-Welch algorithm
The Baum-Welch algorithm applies a set of reasonable re-estimation formulas to the model parameters π, A and B
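As an illustration, the standard Baum-Welch updates, as given in Rabiner's tutorial, can be sketched for a single observation sequence. The quantities γ_t(i) = P(q_t = S_i | O, λ) and ξ_t(i,j) = P(q_t = S_i, q_{t+1} = S_j | O, λ) and the function name `baum_welch_step` are introduced here and do not appear in the slides. The example model and sequence are assumptions, and no scaling is applied.

```python
def baum_welch_step(A, B, pi, obs):
    """One Baum-Welch re-estimation of (A, B, pi) from a single sequence."""
    n, m, T = len(pi), len(B[0]), len(obs)
    # Forward and backward tables (0-based time index).
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for t in range(1, T):
        alpha.append([
            sum(alpha[t - 1][i] * A[i][j] for i in range(n)) * B[j][obs[t]]
            for j in range(n)
        ])
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [
            sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(n))
            for i in range(n)
        ]
    likelihood = sum(alpha[-1])
    # gamma[t][i]: probability of being in state i at time t, given O.
    gamma = [[alpha[t][i] * beta[t][i] / likelihood for i in range(n)]
             for t in range(T)]
    # xi[t][i][j]: probability of the transition i -> j at time t, given O.
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / likelihood
            for j in range(n)] for i in range(n)] for t in range(T - 1)]
    # Re-estimation: expected counts divided by expected occupancies.
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1))
              / sum(gamma[t][i] for t in range(T - 1))
              for j in range(n)] for i in range(n)]
    new_B = [[sum(gamma[t][j] for t in range(T) if obs[t] == k)
              / sum(gamma[t][j] for t in range(T))
              for k in range(m)] for j in range(n)]
    return new_A, new_B, new_pi, likelihood


# Hypothetical starting model and training sequence.
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.5, 0.5]
obs = [0, 0, 1, 0, 1, 1, 0, 0]
history = []
for _ in range(5):
    A, B, pi, likelihood = baum_welch_step(A, B, pi, obs)
    history.append(likelihood)
print(history)
```

Each entry of `history` is P(O|λ) under the parameters before the corresponding update, so the list should be non-decreasing. This reflects the local maximization described above; it does not guarantee a global optimum.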
History and development of
speech synthesis
The earliest efforts to produce synthetic speech were made over two hundred years ago. In St. Petersburg in 1779, the Russian professor Christian Kratzenstein explained the physiological differences between five long vowels (/a/, /e/, /i/, /o/, and /u/) and built an apparatus to produce them artificially
A few years later, in Vienna 1791, Wolfgang von Kempelen
introduced his "Acoustic-Mechanical Speech Machine", which
was able to produce single sounds and some sound
combinations.
Kempelen started his work before Kratzenstein, in 1769, and
after over 20 years of research he also published a book in
which he described his studies on human speech production and
the experiments with his speaking machine.
The essential parts of the machine were a pressure chamber for
the lungs, a vibrating reed to act as vocal cords, and a leather
tube for the vocal tract action. By manipulating the shape of the
leather tube he could produce different vowel sounds.
Consonants were simulated by four separate constricted
passages and controlled by the fingers.
In about the mid-1800s, Charles Wheatstone constructed his famous version of von Kempelen's speaking machine.
It was a bit more complicated and was capable of producing vowels and most of the consonant sounds.
Some sound combinations and even full words could also be produced.
In the late 1800s, Alexander Graham Bell and his father, inspired by Wheatstone's speaking machine, constructed the same kind of speaking machine.
Bell also made some questionable experiments with his terrier. He put the dog between his legs and made it growl, then modified its vocal tract by hand to produce speech-like sounds
The first device to be considered a speech synthesizer was the VODER (Voice Operating Demonstrator), introduced by Homer Dudley.
The VODER was inspired by the VOCODER (Voice Coder), developed at Bell Laboratories in the mid-thirties.
The original VOCODER was a device for analyzing speech into slowly varying acoustic parameters that could then drive a synthesizer to reconstruct an approximation of the original speech signal.
Text-To-Speech system
Modern speech synthesis technologies involve:
Front-end
Analyses text and converts to linguistic specification
Back-end
Converts linguistic specification to speech
Formant synthesis
Concatenate small pieces of pre-recorded speech
Generate speech from a model
Concatenative corpus-based
unit selection TTS
The waveform is made by concatenating different acoustic units from a database
No digital synthesis
Pros
More natural voice
Cons
Large database
Target and concatenation costs
Speaker dependent
HMM-based TTS system (HTS)
It can generate various voice characteristics
The system overview
In the training part, spectrum and excitation parameters are extracted from a speech database and modeled by context-dependent HMMs.
In the synthesis part, context-dependent HMMs are concatenated according to the text to be synthesized.
From linguistic specification
to speech
/dh/ /ax/ /k/ /ae/ /t/ …
HTS implementation
on Festival
Training data: 524 sentences from the CMU Communicator database
HMM input: the speech signal was sampled at 16 kHz and windowed by a 25-ms Blackman window with a 5-ms shift. Mel-cepstral coefficients were obtained by a mel-cepstral analysis technique
Architecture: 5-state left-to-right HMMs with single diagonal Gaussian output distributions
HMM output: the spectrum part and the excitation part of the waveform
Each context-dependent HMM corresponds to a phoneme-sized speech unit
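The framing settings above (16 kHz sampling, 25-ms Blackman window, 5-ms shift) can be made concrete with a short sketch. Only those three numbers come from the slide. The test tone, the helper names `blackman` and `frames`, and the use of the classic Blackman coefficients (0.42, 0.5, 0.08) are assumptions. The mel-cepstral analysis step itself is not shown.

```python
import math

SAMPLE_RATE_HZ = 16000   # from the slide
WINDOW_MS = 25           # from the slide
SHIFT_MS = 5             # from the slide

window_len = SAMPLE_RATE_HZ * WINDOW_MS // 1000   # samples per frame
shift_len = SAMPLE_RATE_HZ * SHIFT_MS // 1000     # samples between frame starts


def blackman(n_points):
    """Classic Blackman window coefficients (0.42, 0.5, 0.08)."""
    return [
        0.42
        - 0.5 * math.cos(2 * math.pi * n / (n_points - 1))
        + 0.08 * math.cos(4 * math.pi * n / (n_points - 1))
        for n in range(n_points)
    ]


def frames(signal, window, shift):
    """Split a signal into overlapping frames and apply the window to each."""
    size = len(window)
    return [
        [signal[start + n] * window[n] for n in range(size)]
        for start in range(0, len(signal) - size + 1, shift)
    ]


# Hypothetical input: one second of a 440 Hz tone.
signal = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE_HZ)
          for n in range(SAMPLE_RATE_HZ)]
windowed = frames(signal, blackman(window_len), shift_len)
print(window_len, shift_len, len(windowed))
```

With these settings each frame holds 400 samples and consecutive frames start 80 samples apart, so neighbouring frames overlap by 80 percent.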
HMM-based TTS system (HTS)
The synthesis part works as follows:
An arbitrarily given text to be synthesized is converted to a context-based label sequence.
According to the label sequence, a sentence HMM is constructed by concatenating context-dependent HMMs.
State durations, mel-cepstral coefficients and excitation values (including voiced/unvoiced decisions) of the sentence HMM are determined so as to maximize the output probability for the HMM
Summary
State-of-the-art unit selection speech synthesis can generate natural-sounding, high-quality speech
Human-like talking machines need to generate speech with
arbitrary speakers' voice characteristics
various speaking styles, including native and non-native speaking styles in different languages
varying emotional expressions
It is still difficult to achieve such flexibility with unit-selection synthesizers, since they need a large-scale speech corpus for each voice
With statistical parametric speech synthesis based on HMMs
the original speaker's voice characteristics can easily be reproduced
using a very small amount of adaptation speech data
but it is still difficult to generate natural-sounding, high-quality speech
References
L. R. Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2):257-286, 1989
K. Tokuda, H. Zen, A. W. Black. An HMM-Based Speech Synthesis System Applied to English, 2002
History and Development of Speech Synthesis, url: http://www.acoustics.hut.fi/publications/files/theses/lemmetty_mst/chap2.html
HMM-based Speech Synthesis System (HTS), url: http://hts.sp.nitech.ac.jp/
Effective Multilingual Interaction in Mobile Environments (EMIME), url: http://www.emime.org/learn/speech-synthesis/read/tutorial-slides-of-hmm-based-speech-synthesis