The Speech Dialog Circle
[Figure: the speech dialog circle. The customer says "I need a flight from Catania to Roma roundtrip"; the system replies by voice, "Which date do you want to fly from Catania to Roma?" The cycle runs through:]
- ASR (Automatic Speech Recognition): speech → words spoken
- SLU (Spoken Language Understanding): words → meaning (ORIGIN: Catania; DESTINATION: Roma; FLIGHT_TYPE: ROUNDTRIP)
- DM (Dialog Management): meaning → action, using data and rules ("What's next?", e.g. get departure dates)
- SLG (Spoken Language Generation): action → words
- TTS (Text-to-Speech Synthesis): words → voice reply to the customer
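The circle above can be sketched as a toy pipeline. Every function below is a hypothetical stand-in (the slot-filling SLU is a crude keyword matcher), not a real system component:

```python
# Toy sketch of the speech dialog circle: ASR -> SLU -> DM -> SLG -> TTS.

def asr(audio):
    # Automatic Speech Recognition: audio -> words (stubbed)
    return "I need a flight from Catania to Roma roundtrip"

def slu(words):
    # Spoken Language Understanding: words -> meaning (slot filling)
    meaning, tokens = {}, words.lower().split()
    if "from" in tokens:
        meaning["ORIGIN"] = tokens[tokens.index("from") + 1].capitalize()
    if "to" in tokens:
        meaning["DESTINATION"] = tokens[tokens.index("to") + 1].capitalize()
    if "roundtrip" in tokens:
        meaning["FLIGHT_TYPE"] = "ROUNDTRIP"
    return meaning

def dm(meaning):
    # Dialog Management: meaning -> next action
    return "get departure dates" if "ORIGIN" in meaning else "ask origin"

def slg(action):
    # Spoken Language Generation: action -> words (stubbed)
    return "Which date do you want to fly from Catania to Roma?"

def tts(words):
    # Text-to-Speech: words -> audio (stubbed as the text itself)
    return words

reply = tts(slg(dm(slu(asr(None)))))
```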
Automatic Speech Recognition (ASR)
A Brief History of Speech Recognition
[Figure: accuracy versus time. ASR research moved from handcrafted rules, heuristics, and local optimization toward mathematical formalization, automatic learning from data, and global optimization, notably via HMMs. Cartoon caption: "Christopher C. discovers a new continent."]
Milestones in ASR Technology Research
[Table: milestones in ASR technology research.]
- Small vocabulary, acoustic phonetics-based: isolated words
- Medium vocabulary, template-based: isolated words; connected digits
- Large vocabulary, statistical-based: connected words; continuous speech
- Large vocabulary; syntax, semantics: continuous speech; speech understanding
- Very large vocabulary; semantics, multimodal dialog, TTS: spoken dialog; multiple modalities

[Figure: the word "Payback" aligned to its phone sequence, with recognition score 29.4.]
Training a Speech Recognizer
[Figure: the training sample "This is a test." is segmented and labeled phoneme by phoneme (th-i-s i-s a t-e-s-t). Thousands of training samples are combined to build 40 sub-word models, one for each phoneme.]
ASR System Components
Feature Extraction
Framing and short-time spectral/cepstral analysis
Acoustic Modeling of Speech Units
Fundamental speech unit selection
Statistical pattern matching (HMM unit) modeling
Lexical Modeling
Pronunciation network
Syntactic and Semantic Modeling
Deterministic or stochastic finite state grammar
N-gram language model
Search and Decision Strategies
Best-first or depth-first, DP-based search
Modular vs. integrated decision strategies
ASR Terminology
Vocabulary
Words that can be recognized in an application
More words imply more errors and more computation
Word spotting
Listening for a few specific words within an utterance
Extraneous speech screening (rejection)
Capability to decide whether a candidate key word is a close
enough match to be declared a valid key word
Grammars
Syntax (word order) that can be used
The way words are put together to form phrases & sentences,
some are more likely than others
Can be deterministic or stochastic
Semantics
Usually not properly modeled or represented
ASR System and Technology Dimensions
Isolated word vs. continuous speech recognition
Isolated = pauses required between each word
Continuous = no pauses required
Speech unit selection: whole vs. sub-word
Whole word = requires data collection of the words to
be recognized
Sub-word = recognizing basic units of speech
(phonemes), so word recognition is easier for new
applications
Read vs. spontaneous (degree of fluency)
Small vs. medium or large vocabulary
Multilingual vs. dialect/accent variations
Basic ASR Formulation (Bayes Method)
[Diagram: the speaker's intended word sequence W passes through the speech production mechanisms to give the signal s(n); an acoustic processor extracts features X, and a linguistic decoder outputs the recognized sequence Ŵ.]

Ŵ = arg max_W P(W | X)
  = arg max_W P(X | W) P(W) / P(X)
  = arg max_W P_A(X | W) P_L(W)

Step 1: acoustic model P_A(X | W). Step 2: language model P_L(W). Step 3: search for the arg max.
Steps in Speech Recognition
Step 1 - acoustic modeling: assign probabilities to acoustic realizations of a sequence of words; compute P_A(X | W) using hidden Markov models of acoustic signals and words.
Step 2 - language modeling: assign probabilities to sequences of words in the language; train P_L(W) from generic text or from transcriptions of task-specific dialogues.
Step 3 - hypothesis search: find the word sequence with the maximum a posteriori probability; search through all possible word sequences to determine the arg max over W.
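A toy illustration of Step 3: given made-up acoustic scores P_A(X|W) and language scores P_L(W) for a handful of candidate word sequences, the MAP hypothesis is the arg max of their product, computed here in log space to avoid underflow:

```python
import math

# Illustrative (made-up) candidate scores for one utterance.
acoustic = {"to Roma": 0.30, "to roam a": 0.25, "two Roma": 0.20}   # P_A(X|W)
language = {"to Roma": 0.10, "to roam a": 0.01, "two Roma": 0.02}   # P_L(W)

def map_decode(acoustic, language):
    # arg max over W of P_A(X|W) * P_L(W), in log space
    return max(acoustic, key=lambda w: math.log(acoustic[w]) + math.log(language[w]))

best = map_decode(acoustic, language)
```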
Step 1 - The Acoustic Model
We build acoustic models by learning statistics of the acoustic features X from a training set, where we compute the variability of the acoustic features during the production of the sounds represented by the models.
It is impractical to create a separate acoustic model P_A(X | W) for every possible word in the language; it would require too much training data for words in every possible context.
Instead we build acoustic-phonetic models for the ~50 phonemes in the English language and construct the model for a word by concatenating (stringing together sequentially) the models for the constituent phones in the word.
We build sentences (sequences of words) by concatenating word models.
Step 2-The Language Model
The language model describes the probability of a sequence of words that forms a valid sentence in the language.
A simple statistical method works well, based on a Markovian assumption: the probability of a word in a sentence is conditioned only on the previous N words, i.e., an N-gram language model:

P_L(W) = P_L(w_1, w_2, ..., w_k)
       = Π_{n=1}^{k} P_L(w_n | w_{n-1}, w_{n-2}, ..., w_{n-N})
Speech Recognition Processes (I)
Choose the task => sounds, word vocabulary,
task syntax (grammar), task semantics
Speech Recognition Processes (II)
Train the models => create a method for building acoustic word models from a speech training data set, a word lexicon, a word grammar (language model), and a task grammar from a text training data set.
Speech training set: e.g., 100 people each speaking the 11 digits 10 times in isolation.
Text training data set: e.g., a listing of valid telephone numbers (or equivalently an algorithm that generates valid telephone numbers).
Speech Recognition Processes (III)
Evaluate performance => determine word error rate and task error rate for the recognizer.
Speech testing data set: 25 new people each speaking 10 telephone numbers as sequences of isolated digits.
Evaluate digit error rate and phone number error rate.
Testing algorithm => method for evaluating recognizer performance from the testing set of speech utterances.
Speech Recognition Process
[Block diagram: input speech s(n) → Feature Analysis (spectral analysis) → feature vectors X_n → Pattern Classification (decoding, search) → Confidence Scoring (utterance verification) → recognized sentence Ŵ, e.g. "Hello World" with word confidences (0.9) (0.8). Pattern classification draws on the acoustic model (HMM), the word lexicon, and the language model (N-gram).]
Feature Extraction
Goal: extract a feature representation for speech.

[Block diagram: s(n) → segmentation into frames of length M with shift N (blocking) → windowing with w(n) → frames s_m(n) → spectral analysis → cepstrum c_m(l), with bias removal or normalization; temporal derivatives give the delta and delta-delta cepstrum. Other measurements include pitch, formants, energy, and zero-crossing rate.]
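The segmentation and windowing stages can be sketched in a few lines; the frame length, hop size, and synthetic signal below are arbitrary illustrative choices, and a real front end would follow with spectral/cepstral analysis and delta features:

```python
import math

def frames(signal, frame_len, hop):
    # Segmentation (blocking): s_m(n) = s(m*hop + n), n = 0..frame_len-1
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def hamming(frame):
    # Windowing: w(n) = 0.54 - 0.46 * cos(2*pi*n / (L-1))
    L = len(frame)
    return [x * (0.54 - 0.46 * math.cos(2 * math.pi * n / (L - 1)))
            for n, x in enumerate(frame)]

# Toy signal: a sinusoid, blocked into 100-sample frames with 50-sample hop.
signal = [math.sin(0.1 * n) for n in range(400)]
windowed = [hamming(f) for f in frames(signal, frame_len=100, hop=50)]
```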
ASR: Statistical Methodology
We assume that the speech features are
generated according to the probability models
of the underlying linguistic units
During the training phase, the model
parameters are learned from labeled data
(features).
During the testing phase, the learned models
are used to find the hypothesis with the
maximum a posteriori (MAP) probability given
speech features
ASR Solution
Ŵ = arg max_W P(W) p(X | W)
Acoustic Model
Hidden Markov Models (HMMs)
Coin-toss Models
HMM: An Ideal Speech Model
Acoustic Modeling of Speech Units
and System Performance
[Figure: word accuracy of a speaker-independent HMM system increases with the amount of training data.]

Phone model for /s/: a left-to-right HMM with states s1 s2 s3.
Word model for "is" (ih-z): the phone HMMs concatenated, states i1 i2 i3 followed by s1 s2 s3.
Isolated Word HMM
[Figure: a five-state left-to-right HMM, states 1-5, with self-loop probabilities a11, a22, a33, a44, a55.]
Basic Problems in HMMs
Given an acoustic observation X and a model λ:
- Evaluation: compute p(X | λ).
- Decoding: find the most likely state sequence.
- Reestimation: adjust λ to maximize p(X | λ).
Evaluation: Forward-Backward Algorithm
Define the forward probability as

α_i(t) = p(X_1, ..., X_t, S_t = i)

Then

α_j(1) = a_{1j} b_j(X_1)
α_j(t) = [ Σ_{i=2}^{N-1} α_i(t-1) a_{ij} ] b_j(X_t)
α_N(T) = Σ_{i=2}^{N-1} α_i(T) a_{iN}
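The forward recursion transcribes directly into pure Python. The sketch below uses the slide's convention of non-emitting entry and exit states (here states 0 and 3 of a toy model with two emitting states and made-up parameters):

```python
# A[i][j]: transition probabilities, states 0..3 (0 = entry, 3 = exit)
A = [
    [0.0, 0.6, 0.4, 0.0],
    [0.0, 0.7, 0.2, 0.1],
    [0.0, 0.0, 0.8, 0.2],
    [0.0, 0.0, 0.0, 0.0],
]
B = {1: {"a": 0.9, "b": 0.1}, 2: {"a": 0.2, "b": 0.8}}  # emissions b_j(x)

def forward(X, A, B):
    emitting = sorted(B)
    # initialization: alpha_j(1) = a_{entry,j} * b_j(X_1)
    alpha = {j: A[0][j] * B[j][X[0]] for j in emitting}
    for x in X[1:]:
        # recursion: alpha_j(t) = [sum_i alpha_i(t-1) * a_ij] * b_j(X_t)
        alpha = {j: sum(alpha[i] * A[i][j] for i in emitting) * B[j][x]
                 for j in emitting}
    # termination: p(X) = sum_i alpha_i(T) * a_{i,exit}
    return sum(alpha[i] * A[i][-1] for i in emitting)

likelihood = forward(["a", "b", "b"], A, B)
```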
Evaluation: Forward-Backward Algorithm (cont.)
Define the backward probability as

β_i(t) = p(X_{t+1}, ..., X_T | S_t = i)

Then

β_i(T) = a_{iN}
β_i(t) = Σ_{j=2}^{N-1} a_{ij} b_j(X_{t+1}) β_j(t+1)
β_1(1) = Σ_{j=2}^{N-1} a_{1j} b_j(X_1) β_j(1)
Evaluation: Data Likelihood
The joint probability of S_t = j and X is

p(X, S_t = j) = α_j(t) β_j(t)
Evaluation: The Baum-Welch Algorithm
α_t(i) = P(X_1^t, s_t = i | λ) = [ Σ_{j=1}^{N} α_{t-1}(j) a_{ji} ] b_i(X_t),   2 ≤ t ≤ T; 1 ≤ i ≤ N

β_t(j) = P(X_{t+1}^T | s_t = j, λ) = Σ_{i=1}^{N} a_{ji} b_i(X_{t+1}) β_{t+1}(i),   t = T-1, ..., 1; 1 ≤ j ≤ N

ξ_t(i, j) = P(s_{t-1} = i, s_t = j | X_1^T, λ)
          = P(s_{t-1} = i, s_t = j, X_1^T | λ) / P(X_1^T | λ)
          = α_{t-1}(i) a_{ij} b_j(X_t) β_t(j) / Σ_{k=1}^{N} α_T(k)

[Trellis diagram: states s1 ... sN at times t-2, t-1, t, t+1; the arc from state i at time t-1 to state j at time t carries a_{ij} b_j(X_t).]
Decoding: Viterbi Algorithm
Find the optimal alignment between X and S:

V_t(j) = max_i V_{t-1}(i) a_{ij} b_j(x_t)

[Trellis diagram: states s1 ... sM vertically, frames x1 ... xT horizontally; V_t(j) is reached from V_{t-1}(j-1), V_{t-1}(j), or V_{t-1}(j+1) via a_{j-1,j}, a_{j,j}, a_{j+1,j}, weighted by b_j(x_t).]
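The Viterbi recursion with backpointers can be sketched on a toy two-state left-to-right model (all parameters below are illustrative):

```python
def viterbi(X, pi, A, B):
    # V_t(j) = max_i V_{t-1}(i) * a_ij * b_j(x_t), with backpointers
    states = list(pi)
    V = {j: pi[j] * B[j][X[0]] for j in states}
    back = []
    for x in X[1:]:
        prev, V, ptr = V, {}, {}
        for j in states:
            best = max(states, key=lambda i: prev[i] * A[i][j])
            V[j] = prev[best] * A[best][j] * B[j][x]
            ptr[j] = best
        back.append(ptr)
    # trace back the optimal state alignment
    last = max(V, key=V.get)
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path)), V[last]

pi = {"s1": 0.8, "s2": 0.2}
A = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.0, "s2": 1.0}}  # left-to-right
B = {"s1": {"lo": 0.9, "hi": 0.1}, "s2": {"lo": 0.2, "hi": 0.8}}
path, score = viterbi(["lo", "lo", "hi"], pi, A, B)
```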
Reestimation: Training Algorithm
Find model parameters λ = (A, B, π) that maximize the probability P(X | λ) of observing some data X:

P(X | λ) = Σ_S P(X, S | λ) = Σ_S Π_{t=1}^{T} a_{s_{t-1} s_t} b_{s_t}(X_t)

Q(λ, λ') = Σ_{all S} [ P(X, S | λ) / P(X | λ) ] log P(X, S | λ') = Σ_i Q_a(λ, a'_i) + Σ_j Q_b(λ, b'_j)
Iterative Training Procedure
Several iterations of the process:

[Flow diagram: from the input speech database and the old (initial) HMM model, estimate the posteriors γ_t(j,k) and ξ_t(i,j); maximize to obtain updated parameters a_ij, c_jk, μ_jk, Σ_jk; the updated HMM model feeds the next iteration.]
Continuous Observation Density HMMs
The most general form of pdf with a valid re-estimation procedure is the Gaussian mixture:

b_j(x) = Σ_{m=1}^{M} c_{jm} N(x, μ_{jm}, U_{jm}),   1 ≤ j ≤ N

Σ_{m=1}^{M} c_{jm} = 1,   1 ≤ j ≤ N

∫ b_j(x) dx = 1,   1 ≤ j ≤ N
Word Lexicon
Goal: map legal phone sequences into words according to phonotactic rules. For example:
David → /d/ /ey/ /v/ /ih/ /d/
Multiple pronunciations: several words may have multiple pronunciations.
Challenges: how do you generate a word lexicon automatically; how do you add new variant dialects and word pronunciations?
Lexical Modeling
Assume each HMM is a monophone model.
American English: 42 monophones → 42 HMMs.
A word model is a concatenation of phone models (phone HMMs).
Lexicon: /science/ = /s/+/ai/+/e/+/n/+/s/ or /s/+/ai/+/n/+/s/
Multiple pronunciations are represented as a pronunciation network.
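A lexicon with multiple pronunciations is naturally a mapping from words to lists of phone sequences; the entries below follow the slides' informal phone notation:

```python
# Toy word lexicon: each word maps to one or more pronunciations,
# each pronunciation being the chain of phone HMMs to concatenate.
lexicon = {
    "science": [["s", "ai", "e", "n", "s"], ["s", "ai", "n", "s"]],
    "david":   [["d", "ey", "v", "ih", "d"]],
}

def word_hmm_units(word):
    # Each pronunciation expands to a sequence of phone models.
    return lexicon[word]
```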
Word-Juncture Modeling
Co-articulation effect:
Simple concatenation of word models (word HMMs) ignores pronunciation changes across word boundaries.
Hard change: "did you" = /d/+/i/+/dzj/+/u/
Soft change: possible pronunciation variations.
A source of major errors in many ASR systems.
Easier to handle in syllabic languages with open syllables (vowel or nasal endings, e.g. Japanese, Mandarin).

[Diagram: pronunciation network for "did you": /d/ /i/ followed by either /d/ /j/ or /dzj/, then /u/, with optional silence /sil/.]
Triphone Modeling
Monophone modeling is too simple to capture the coarticulation phenomena ubiquitous in speech.
Model context-dependent phonemes: triphones.
American English: 42 x 42 x 42 triphones → 74,088 HMMs.
Examples for /s/: /a-s+ai/, /z-s+ai/, /n-s+a/, /n-s+z/.
Grammar Network Expansion: Monophones
Grammar Network Expansion: Triphones
Grammar Network Expansion: Cross-Word
Language Model
Challenges: How do you build a language model rapidly for a new task?
Chomsky Grammar Hierarchy
Types, their constraints, and the corresponding automata:
- Type 0: unrestricted grammars → Turing machines
- Type 1: context-sensitive grammars → linear-bounded automata
- Type 2: context-free grammars → pushdown automata
- Type 3: regular grammars → finite-state automata
Other Examples of Grammar Network
Word-loop grammar: any vocabulary word may follow any other, in a loop.
N-Grams
P(W) = P(w_1, w_2, ..., w_N)
     = P(w_1) P(w_2 | w_1) P(w_3 | w_1, w_2) ... P(w_N | w_1, w_2, ..., w_{N-1})
     = Π_{i=1}^{N} P(w_i | w_1, w_2, ..., w_{i-1})

Trigram estimation:

P(w_i | w_{i-2}, w_{i-1}) = C(w_{i-2}, w_{i-1}, w_i) / C(w_{i-2}, w_{i-1})
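The trigram estimate is just a ratio of counts; a minimal sketch on a made-up toy corpus:

```python
from collections import Counter

# P(w_i | w_{i-2}, w_{i-1}) = C(w_{i-2}, w_{i-1}, w_i) / C(w_{i-2}, w_{i-1})
# The tiny corpus below is illustrative only.
corpus = "i want a flight i want a ticket i need a flight".split()
tri = Counter(zip(corpus, corpus[1:], corpus[2:]))
bi = Counter(zip(corpus, corpus[1:]))

def p_trigram(w2, w1, w):
    # maximum-likelihood trigram estimate from counts
    return tri[(w2, w1, w)] / bi[(w2, w1)]

p = p_trigram("want", "a", "flight")  # C(want,a,flight)=1, C(want,a)=2
```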
Example: bi-grams
(Probability of a word given the previous word)
Generalization for N-grams (back-off)
When a higher-order N-gram has not been seen (or has an unreliable count), back off to the lower-order (N-1)-gram estimate, discounting observed counts to reserve probability mass for unseen events.
ASR: Viterbi Search
Assume we build the grammar network for the task, with all
trained HMMs attached in the network
An unknown utterance is a sequence of feature vectors Y.
Speech recognition is then nothing more than a Viterbi search:
The whole network is viewed as a composite HMM.
With Y as input data, find the optimal path (Viterbi path) S* traversing the whole network (from START to END):

S* = arg max_S Pr(S) p(Y, S | λ) = arg max_S Pr(W_S) p(Y, S | λ)

where W_S is the word sequence along path S.
ASR Issues
Training stage:
Acoustic modeling: how to select speech unit and
estimate HMMs reliably and efficiently from available
training data.
Language modeling: how to estimate n-gram model
from text training data; handle data sparseness
problem.
Test stage:
Search: given HMMs and n-gram model, how to
efficiently search the optimal path from the huge
grammar network.
Search space is extremely large
Calls for an efficient pruning strategy
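One common pruning strategy is beam pruning: at each frame, discard hypotheses whose log score falls more than a fixed beam below the current best, and extend only the survivors. A minimal sketch with made-up scores and path labels:

```python
def prune(hypotheses, beam):
    # hypotheses: {path_label: log_score}; keep everything within
    # `beam` log-probability of the best-scoring hypothesis
    best = max(hypotheses.values())
    return {h: s for h, s in hypotheses.items() if s >= best - beam}

# Illustrative active hypotheses at one frame (log scores are made up).
active = {"s-ai-e-n-s": -120.0, "s-ai-n-s": -121.5, "sh-ai-n-s": -140.2}
survivors = prune(active, beam=10.0)  # the -140.2 hypothesis is dropped
```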
Acoustic Modeling
Selection of speech units: each unit is modeled by an HMM.
Digit string recognition: one HMM per digit → 10-12 HMMs.
Large vocabulary: monophone → biphone → triphone.
HMM topology selection:
Phoneme: 3-state left-right without skipping states.
Digit/word: 6-12 states left-right, no state skipping.
HMM type selection:
Top choice: Gaussian mixture CDHMM.
Number of Gaussian mixtures in each state (e.g., 16).
HMM parameter estimation:
ML (Baum-Welch algorithm) and others.
Selecting Speech Segments for each HMM
[Figure: a waveform with reference segmentation into the phone sequence $-b+u, b-u+k, u-k+t, k-t+i, t-i+s, i-s+t, s-t+r, t-r+i, r-i+p, i-p+$. For monophone HMMs, each segment trains the corresponding monophone model (/b/, /u/, /t/, /i/, /p/, ...); for triphone HMMs, each segment trains the corresponding context-dependent model (/$-b+u/, /b-u+k/, /k-t+i/, /t-i+s/, /i-p+$/, ...).]
Reference Segmentation
Where does the segmentation information come from?
Human labeling: tedious and time-consuming; only a small amount is affordable; used for bootstrapping.
Automatic segmentation, if a simple HMM set is available.
Forced alignment: the Viterbi algorithm; needs only the transcription.
HMMs + transcription → segmentation information (the transcription is expanded into a phoneme network and compiled into a composite HMM).
Embedded Training
Only the transcription is needed for each utterance; no segmentation is required; training automatically tunes to the optimal segmentation.
Transcription: "This is a test." → word network → phoneme network → composite HMM.
Methods:
Traditional techniques for improving system robustness are based on signal enhancement, feature normalization, and/or model adaptation.
Perception approach:
Extract fundamental acoustic information in narrow bands of speech; robustly integrate features across time and frequency.
Robust Speech Recognition
A mismatch in the speech signal between the training
phase and testing phase results in performance degradation
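One widely used feature-normalization step against such mismatch is cepstral mean normalization (CMN), which removes a stationary channel effect (a constant cepstral offset) by subtracting the per-dimension mean over the utterance. A minimal sketch:

```python
def cmn(frames):
    # frames: list of cepstral vectors (lists of equal length);
    # subtract the per-dimension mean computed over the utterance
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / len(frames) for d in range(dims)]
    return [[f[d] - means[d] for d in range(dims)] for f in frames]

# Toy 2-frame, 2-dimensional example: the means are [2.0, 6.0].
normalized = cmn([[1.0, 4.0], [3.0, 8.0]])
```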
Rejection
Problem:
Extraneous acoustic events, noise,
background speech and out-of-domain speech
deteriorate system performance.
Measure of Confidence:
Associating word strings with a verification cost that provides an effective measure of confidence (utterance verification).
Effect:
Improvement in the performance of the recognizer,
understanding system and dialogue manager.
ECE6255 Spring 2010
State-of-the-Art Performance
[Block diagram: input speech → feature extraction → pattern classification (decoding, search) → confidence scoring → recognized sentence, drawing on the acoustic model, word lexicon, and language model.]

[Chart: benchmark word error rates by year for the Resource Management, ATIS, and NAB tasks.]
Accuracy for Speech Recognition
[Chart: word accuracy (roughly 30-80%) by year, 1996-2002.]
Operational Performance
- End-point detection
- User barge-in
- Utterance rejection
- Confidence scoring
Thank you!
Speech Recognition Capabilities
[Chart, after B. S. Atal: speaking style (isolated words, connected speech, read speech, fluent speech, spontaneous speech) versus vocabulary size (2 to 20,000 words). Applications plotted include voice commands, name dialing, speaker verification, digit strings, office dictation, personal assistant dialogue, word spotting, system-driven and user-driven dialogue, natural conversation, and transcription; markers indicate capabilities as of 1997 and 2001.]