
HTK Tutorial
Prepared using the HTK Book

K. Marasek, Multimedia Department
05.07.2005


Software architecture


toolkit for Hidden Markov Modeling
optimized for Speech Recognition
very flexible and complete
very good documentation (HTK Book)

Tool categories:
Data Preparation Tools
Training Tools
Recognition Tools
Analysis Tool

General concepts

Set of programs with command-line style interface


Each tool has a number of required arguments plus optional arguments. The latter are
always prefixed by a minus sign.


HFoo -T 1 -f 34.3 -a -s myfile file1 file2

Options whose names are a capital letter have the same meaning across all tools. For
example, the -T option is always used to control the trace output of an HTK tool.
In addition to command-line arguments, the operation of a tool can be controlled by
parameters stored in a configuration file. For example, if the command

HFoo -C config -f 34.3 -a -s myfile file1 file2

is executed, the tool HFoo will load the parameters stored in the configuration file config during
its initialisation procedures.
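As a rough sketch of what such a configuration file might contain (these are standard HTK configuration variables, but the particular values here are only illustrative):

# config - illustrative example
SOURCEFORMAT = NOHEAD        # input audio files have no header
TARGETKIND   = MFCC_D_A_0    # MFCCs with deltas, accelerations and C0
TARGETRATE   = 100000        # 10 ms frame shift, in 100 ns units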


The HTK data formats

audio: many common formats plus HTK binary
features: HTK binary
labels: HTK (single or Master Label Files) text
models: HTK (single or Master Macro Files) text or binary
other: HTK text

Data preparation tools


data manipulation tools:

HCopy - parameterize signals
HQuant - vector quantization
HLEd - label editor
HHEd - model editor (master model file)
HDMan - dictionary editor
HBuild - language model conversion
HParse - lattice file preparation (grammar conversion)

data visualization tools:

HSLab - speech label manipulation
HList - data display and manipulation (see the sketch below)
HSGen - generate sentences from a regular grammar
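A quick way to inspect a file in any of the HTK data formats listed earlier is HList; a minimal sketch, assuming a parameterized feature file named S0001.mfc (the -h option prints the file header, i.e. sample kind, sample period and number of samples):

HList -h S0001.mfc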


Training tools


The actual training process takes place in stages; it is illustrated in more detail
in Fig. 2.3 of the HTK Book. Firstly, an initial set of models must be created. If there
is some speech data available for which the location of the sub-word (i.e.
phone) boundaries has been marked, then this can be used as bootstrap
data. In this case, the tools HInit and HRest provide isolated-word style
training using the fully labelled bootstrap data. Each of the required HMMs
is generated individually. HInit reads in all of the bootstrap training data
and cuts out all of the examples of the required phone. It then iteratively
computes an initial set of parameter values using a segmental k-means
procedure. On the first cycle, the training data is uniformly segmented,
each model state is matched with the corresponding data segments and
then means and variances are estimated. If mixture Gaussian models are
being trained, then a modified form of k-means clustering is used. On the
second and successive cycles, the uniform segmentation is replaced by
Viterbi alignment. The initial parameter values computed by HInit are then
further re-estimated by HRest. Again, the fully labelled bootstrap data is
used, but this time the segmental k-means procedure is replaced by the
Baum-Welch re-estimation procedure (described in the HTK Book).

When no bootstrap data is available, a so-called flat start can be used. In
this case all of the phone models are initialised to be identical and have
state means and variances equal to the global speech mean and variance.
The tool HCompV can be used for this.

Once an initial set of models has been created, the tool HERest is used to
perform embedded training using the entire training set. HERest performs
a single Baum-Welch re-estimation of the whole set of HMM phone
models simultaneously. For each training utterance, the corresponding
phone models are concatenated and then the forward-backward algorithm
is used to accumulate the statistics of state occupation, means, variances,
etc., for each HMM in the sequence. When all of the training data has
been processed, the accumulated statistics are used to compute re-estimates
of the HMM parameters. HERest is the core HTK training tool. It
is designed to process large databases, it has facilities for pruning to
reduce computation and it can be run in parallel across a network of
machines.


Recognition and analysis tools


HVite performs Viterbi-based speech recognition.


HVite takes as input a network describing the allowable
word sequences, a dictionary defining how each word is
pronounced and a set of HMMs. It operates by
converting the word network to a phone network and
then attaching the appropriate HMM definition to each
phone instance. Recognition can then be performed on
either a list of stored speech files or on direct audio
input. HVite can support cross-word triphones and it can
run with multiple tokens to generate lattices containing
multiple hypotheses. It can also be configured to rescore
lattices and perform forced alignments.
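A minimal sketch of the forced-alignment mode just mentioned, with assumed file names (config, hmmdefs, words.mlf, test.scp, dict, hmmlist); -a makes HVite align against the word-level transcriptions in words.mlf instead of recognizing freely, and -m adds model-level (phone) boundaries to the output MLF:

HVite -a -m -C config -H hmmdefs -i aligned.mlf -I words.mlf -S test.scp dict hmmlist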
HResults uses dynamic programming to align the
reference and recognised transcriptions and then counts
substitution, deletion and insertion errors. Options are
provided to ensure that the algorithms and output
formats used by HResults are compatible with those
used by the US National Institute of Standards and
Technology (NIST). As well as global performance
measures, HResults can also provide speaker-by-speaker
breakdowns, confusion matrices and time-aligned
transcriptions. For word spotting applications, it can
also compute Figure of Merit (FOM) scores and
Receiver Operating Curve (ROC) information.


How to use HTK in 10 easy steps

Step 1 - Set the task


Prepare the grammar in the BNF format:

[.] optional
{.} zero or more
(.) block
<.> loop
<<.>> context dep. loop
.|. alternative

$location= where is | how to find | how to come to;
$ex= sorry | excuse me | pardon;
$intro= can you tell me | do you know;
$address= acton town | admirality arch | baker street | bond street | big ben | blackhorse road |
  buckingham palace | cambridge | canterbury | charing cross road | covent garden | downing street |
  ealing | edgware road | finchley road | gloucester road | greenwich | heathrow airport | high street |
  house of parliament | hyde park | kensington | king's cross | leicester square | marble arch |
  old street | paddington station | piccadilly circus | portobello market | regent's park | thames river |
  tower bridge | trafalgar square | victoria station | westminster abbey | whitehall | wimbledon | windsor;
$end= please;

(!ENTER{_SIL_}({$ex} {into} {$location} $address {$end}){_SIL_}!EXIT)

Compile the grammar to lattice format:

D:\htk-3.1\bin.win32\HParse location-grammar lg.lat



How to use HTK in 10 easy steps

Step 2 - Prepare the pronunciation dictionary


Find the list of words used in the task (lg.wlist)

Prepare the dictionary by hand, automatically, or using a standard pronunciation
dictionary (e.g. Beep for British English), or use the whole Beep dictionary
(an HDMan sketch follows the example entries below)

where [where] 1.0 w e@


where [where] 1.0 w e@ r
is [is] 1.0 I z
how [how] 1.0 h aU
admirality [admirality] 1.0 { d m @ r @ l i: t i:
palace [palace] 1.0 p { l I s
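One way to build the task dictionary from the full Beep dictionary is HDMan; a sketch, assuming the word list lg.wlist and producing lg.dic (the -n option also writes the list of phones used, and -l writes a log of the search):

HDMan -w lg.wlist -n lg.phones -l dlog lg.dic beep.dic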

How to use HTK in 10 easy steps


Step 3 - Record the Training and Test Data

HTK has a tool for recording prompts, HSLab, but it works under Linux only; usually other programs are used for this.
First generate the prompts, then record them:

D:\htk-3.1\bin.win32\HSGen -l -n 200 lg.lat beep.dic > lg.200

Example of the generated prompts (lg.200):

1. how to come to baker street _SIL_ !EXIT
2. ealing please _SIL_ !EXIT
3. heathrow airport !EXIT
4. leicester square _SIL_ !EXIT
5. king's cross please _SIL_ !EXIT
6. hyde park _SIL_ !EXIT
7. _SIL_ greenwich please _SIL_ _SIL_ _SIL_ _SIL_ _SIL_ !EXIT
8. old street !EXIT
9. high street _SIL_ _SIL_ _SIL_ _SIL_ !EXIT
10. whitehall !EXIT
11. old street !EXIT
12. canterbury please !EXIT
13. into edgware road !EXIT
14. whitehall _SIL_ !EXIT
15. whitehall _SIL_ !EXIT
16. finchley road please please please _SIL_ !EXIT

Record the prompts and store them in the chosen format: 16 kHz, 16-bit, headerless (?)

How to use HTK in 10 easy steps

Step 4 - Create the Transcription Files


In HTK all transcription files can be merged into one Master Label File (MLF).
Usually it is enough to have word-level transcripts.
If phone-level transcriptions are necessary, they can be generated automatically using HLEd (a sketch follows the example below).

#!MLF!#
"*/S0001.lab"
how
to
come
to
baker
street
.
"*/S0002.lab"
ealing
please
.
(etc...)
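A sketch of expanding the word-level MLF above to phone level with HLEd, assuming the dictionary beep.dic and an edit script mkphones.led whose only command is EX (expand each word into its dictionary pronunciation); -i names the output MLF and -l '*' sets the label-path pattern used inside it:

HLEd -l '*' -d beep.dic -i phones.mlf mkphones.led words.mlf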


How to use HTK in 10 easy steps

Step 5 - Parametrize the Data


Use HCopy to compute MFCC and delta parameters.
Use a config file (hcopy.conf) to set all the options:

HCopy -T 1 -C hcopy.conf -S file.list

(the format of file.list is sketched after the config below)

### hcopy.conf

### input file specific section
SOURCEFORMAT = NOHEAD
HEADERSIZE   = 0
# 16 kHz corresponds to 0.0625 msec
SOURCERATE   = 625
###

### analysis section
# windowing
TARGETRATE   = 100000
WINDOWSIZE   = 250000
USEHAMMING   = TRUE
# preemphasis
PREEMCOEF    = 0.97
# no DC offset correction
ZMEANSOURCE  = FALSE
# no random noise added
ADDDITHER    = 0.0
# fbank analysis
NUMCHANS     = 24
LOFREQ       = 80
HIFREQ       = 7500
# don't take the sqrt:
USEPOWER     = TRUE
# cepstrum calculation
NUMCEPS      = 12
CEPLIFTER    = 22
# energy
ENORMALISE   = FALSE
ESCALE       = 1.0
RAWENERGY    = FALSE
# delta and delta-delta
DELTAWINDOW  = 2
ACCWINDOW    = 2
SIMPLEDIFFS  = FALSE
###

### output file specific section
TARGETKIND     = MFCC_D_A_0
TARGETFORMAT   = HTK
SAVECOMPRESSED = TRUE
SAVEWITHCRC    = TRUE
###
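The file.list passed to HCopy with -S is a script file: each line names a source audio file and the target feature file to be created, e.g. (paths assumed):

data\S0001.wav   mfc\S0001.mfc
data\S0002.wav   mfc\S0002.mfc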

How to use HTK in 10 easy steps

Step 6 - Create Monophone HMMs

Define a prototype model and clone it for all phones (a sketch of the resulting master macro file follows the prototype):

~o <VecSize> 39 <MFCC_D_A_0> <StreamInfo> 1 39
~h "p"
<BeginHMM>
<NumStates> 5
<State> 2 <NumMixes> 1
<Stream> 1
<Mixture> 1 1.0000
<Mean> 39
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
<Variance> 39
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
<State> 3 <NumMixes> 1
<Stream> 1
<Mixture> 1 1.0000
<Mean> 39
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
<Variance> 39
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
<State> 4 <NumMixes> 1
<Stream> 1
<Mixture> 1 1.0000
<Mean> 39
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
<Variance> 39
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
<TransP> 5
0.000e+0 1.000e+0 0.000e+0 0.000e+0 0.000e+0
0.000e+0 6.000e-1 4.000e-1 0.000e+0 0.000e+0
0.000e+0 0.000e+0 6.000e-1 4.000e-1 0.000e+0
0.000e+0 0.000e+0 0.000e+0 6.000e-1 4.000e-1
0.000e+0 0.000e+0 0.000e+0 0.000e+0 0.000e+0
<EndHMM>
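After cloning, all of the monophone models are normally collected into a single master macro file, one ~h entry per phone with initially identical parameters; schematically (the phone names and the elided bodies here are only illustrative):

~o <VecSize> 39 <MFCC_D_A_0> <StreamInfo> 1 39
~h "p"
<BeginHMM> ... <EndHMM>
~h "e@"
<BeginHMM> ... <EndHMM>
~h "i:"
<BeginHMM> ... <EndHMM>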

How to use HTK in 10 easy steps

Step 7 - Initialize models

Use HInit:

HInit -S trainlist -H globals -M dir1 proto

Firstly, the Viterbi algorithm is used to find the most likely state sequence
corresponding to each training example; then the HMM parameters are estimated. As
a side-effect of finding the Viterbi state alignment, the log likelihood of the training data
can be computed. Hence, the whole estimation process can be repeated until no
further increase in likelihood is obtained.

If no bootstrap data is available, use HCompV for flat-start initialization:
it will scan a set of data files, compute the global mean and variance, and set
all of the Gaussians in a given HMM to have the same mean and variance
(a sketch follows below).
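A minimal HCompV call in the same style as the HInit example above (trainlist, dir1 and proto as before); -m updates the means as well as the variances, and -f 0.01 additionally writes a variance-floor macro set to 0.01 of the global variance:

HCompV -f 0.01 -m -S trainlist -M dir1 proto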


How to use HTK in 10 easy steps

Step 8 - Isolated Unit Re-Estimation using HRest


Its operation is very similar to HInit except that it expects
the input HMM definition to have been initialised, and it uses
Baum-Welch re-estimation in place of Viterbi training.
Whereas Viterbi training makes a hard decision as to which
state each training vector was ``generated'' by, Baum-Welch
takes a soft decision. This can be helpful when
estimating phone-based HMMs, since there are no hard
boundaries between phones in real speech and using a
soft decision may give better results.

HRest -S trainlist -H dir1/globals -M dir2 -l ih -L labs dir1/ih

This will load the HMM definition for /ih/ from dir1, re-estimate
the parameters using the speech segments
labelled with ih, and write the new definition to directory dir2.

How to use HTK in 10 easy steps

Step 9 - Embedded Training using HERest


HERest embedded training simultaneously updates all
of the HMMs in a system using all of the training data.

On startup, HERest loads in a complete set of HMM


definitions. Every training file must have an associated
label file which gives a transcription for that file. Only the
sequence of labels is used by HERest, however, and
any boundary location information is ignored. Thus,
these transcriptions can be generated automatically
from the known orthography of what was said and a
pronunciation dictionary.

HERest processes each training file in turn. After


loading it into memory, it uses the associated
transcription to construct a composite HMM which
spans the whole utterance. This composite HMM is
made by concatenating instances of the phone HMMs
corresponding to each label in the transcription. The
Forward-Backward algorithm is then applied and the
sums needed to form the weighted averages
accumulated in the normal way. When all of the training
files have been processed, the new parameter
estimates are formed from the weighted sums and the
updated HMM set is output.

HERest -t 120.0 60.0 240.0 -S trainlist -I labs \
       -H dir1/hmacs -M dir2 hmmlist

-t : beam limits

Can be used to prepare context-dependent models
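A sketch of the usual route to context-dependent models, with assumed file names: HLEd converts the phone transcriptions to triphones (the edit script mktri.led containing the TC command, with -n writing the list of triphones seen), and HHEd then clones the monophones into that triphone list (the edit script mktri.hed containing CL triphones):

HLEd -n triphones -l '*' -i tri.mlf mktri.led phones.mlf
HHEd -H dir2/hmacs -M dir3 mktri.hed monophones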


How to use HTK in 10 easy steps

Step 10 - Use HVite to recognize utterances and HResults to evaluate the recognition rate

D:\htk-3.1\bin.win32\HVite -g -w lg.lat -H wsjcam0.mmf -S test.list -C hvite.conf -i recresults.mlf beep.dic wsjcam0.mlist
Many other options can be set (beam width, scale factors, weights, etc.)

On-line, with direct audio input:

D:\htk-3.1\bin.win32\HVite -g -w lg.lat -H wsjcam0.mmf -C live.conf beep.dic wsjcam0.mlist

Statistics of results:

HResults -I testref.mlf tiedlist recout.mlf
====================== HTK Results Analysis ======================
  Ref : testrefs.mlf
  Rec : recout.mlf
------------------------- Overall Results ------------------------
SENT: %Correct=98.50 [H=197, S=3, N=200]
WORD: %Corr=99.77, Acc=99.65 [H=853, D=1, S=1, I=1, N=855]
===================================================================

N = total number, I = insertions, S = substitutions, D = deletions
H = N-S-D (correct), %Corr = H/N, Acc = (H-I)/N
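As a worked check of these formulas on the WORD line above: H = 853 and N = 855, so %Corr = 853/855 = 99.77%; with I = 1, Acc = (853 - 1)/855 = 99.65%, matching the printed output.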


Bye Bye


Thanks for your participation!
