Multimedia Department


HTK Tutorial
Prepared using HTKBook

Software architecture


toolkit for Hidden Markov Modeling

optimized for Speech Recognition
very flexible and complete
very good documentation (HTK Book)
Data Preparation Tools
Training Tools
Recognition Tools
Analysis Tool

General concepts

Set of programs with command-line style interface

Each tool has a number of required arguments plus optional arguments. The latter are
always prefixed by a minus sign.

HFoo -T 1 -f 34.3 -a -s myfile file1 file2

Options whose names are a capital letter have the same meaning across all tools. For
example, the -T option is always used to control the trace output of an HTK tool.
In addition to command line arguments, the operation of a tool can be controlled by
parameters stored in a configuration file. For example, if the command

HFoo -C config -f 34.3 -a -s myfile file1 file2

is executed, the tool HFoo will load the parameters stored in the configuration file config during
its initialisation procedures.
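
When HTK runs are scripted, the calling convention above can be wrapped in a small helper. The following is only a sketch: the function name htk_command is hypothetical, HFoo is the placeholder tool from the text, and the command is assembled but not executed.

```python
import shlex

def htk_command(tool, options=None, config=None, args=()):
    """Assemble an HTK-style command line as a list of strings.

    options maps an option letter to its value; None marks a flag
    that takes no value (such as -a in the HFoo example).
    """
    cmd = [tool]
    if config is not None:
        cmd += ["-C", config]          # configuration file loaded at start-up
    for name, value in (options or {}).items():
        cmd.append("-" + name)         # optional arguments carry a minus sign
        if value is not None:
            cmd.append(str(value))
    cmd += list(args)                  # required arguments come last
    return cmd

cmd = htk_command("HFoo", {"f": 34.3, "a": None, "s": "myfile"},
                  config="config", args=["file1", "file2"])
print(shlex.join(cmd))
```

Keeping the command as a list makes it easy to pass to a process launcher later without worrying about quoting.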


the HTK data formats

audio: many common formats plus HTK binary

features: HTK binary
labels: HTK (single or Master Label Files) text
models: HTK (single or Master Macro Files) text or binary
other: HTK text

Data preparation tools

data manipulation tools:

HCopy - parametrize signals
HQuant - vector quantization
HLEd - label editor
HHEd - model editor (master model file)
HDMan - dictionary editor
HBuild - language model conversion
HParse - lattice file preparation (grammar conversion)

data visualization tools:

HSLab - speech label manipulation
HList - data display and manipulation
HSGen - generate sentences out of a regular grammar


Training tools

The actual training process takes place in stages and it is illustrated in more
detail in Fig. 2.3. Firstly, an initial set of models must be created. If there
is some speech data available for which the location of the sub-word (i.e.
phone) boundaries have been marked, then this can be used as bootstrap
data. In this case, the tools HInit and HRest provide isolated word style
training using the fully labelled bootstrap data. Each of the required HMMs
is generated individually. HInit reads in all of the bootstrap training data
and cuts out all of the examples of the required phone. It then iteratively
computes an initial set of parameter values using a segmental k-means
procedure. On the first cycle, the training data is uniformly segmented,
each model state is matched with the corresponding data segments and
then means and variances are estimated. If mixture Gaussian models are
being trained, then a modified form of k-means clustering is used. On the
second and successive cycles, the uniform segmentation is replaced by
Viterbi alignment. The initial parameter values computed by HInit are then
further re-estimated by HRest. Again, the fully labelled bootstrap data is
used but this time the segmental k-means procedure is replaced by the
Baum-Welch re-estimation procedure described in the previous chapter.
When no bootstrap data is available, a so-called flat start can be used. In
this case all of the phone models are initialised to be identical and have
state means and variances equal to the global speech mean and variance.
The tool HCompV can be used for this.
Once an initial set of models has been created, the tool HERest is used to
perform embedded training using the entire training set. HERest performs
a single Baum-Welch re-estimation of the whole set of HMM phone
models simultaneously. For each training utterance, the corresponding
phone models are concatenated and then the forward-backward algorithm
is used to accumulate the statistics of state occupation, means, variances,
etc., for each HMM in the sequence. When all of the training data has
been processed, the accumulated statistics are used to compute
re-estimates of the HMM parameters. HERest is the core HTK training tool. It
is designed to process large databases, it has facilities for pruning to
reduce computation and it can be run in parallel across a network of


Recognition and analysis tools

HVite performs Viterbi-based speech recognition.

HVite takes as input a network describing the allowable
word sequences, a dictionary defining how each word is
pronounced and a set of HMMs. It operates by
converting the word network to a phone network and
then attaching the appropriate HMM definition to each
phone instance. Recognition can then be performed on
either a list of stored speech files or on direct audio
input. As noted at the end of the last chapter, HVite can
support cross-word triphones and it can run with multiple
tokens to generate lattices containing multiple
hypotheses. It can also be configured to rescore lattices
and perform forced alignments.
HResults uses dynamic programming to align the recognised and
reference transcriptions and then count substitution, deletion and
insertion errors. Options are provided to ensure that the
algorithms and output formats used by HResults are
compatible with those used by the US National Institute
of Standards and Technology (NIST). As well as global
performance measures, HResults can also provide
speaker-by-speaker breakdowns, confusion matrices
and time-aligned transcriptions. For word spotting
applications, it can also compute Figure of Merit (FOM)
scores and Receiver Operating Curve (ROC)


How to use HTK in 10 easy steps

Step 1 - Set the Task

Prepare the grammar in the BNF format:

[.] optional
{.} zero or more repetitions
(.) grouping (block)
<.> one or more repetitions (loop)
<<.>> context-dependent loop
.|. alternative

Compile grammar to lattice format

D:\htk-3.1\bin.win32\HParse location-grammar

$location= where is | how to find | how to come to;
$ex= sorry | excuse me | pardon;
$intro= can you tell me | do you know ;
$address= acton town | admirality arch | baker street | bond street| big ben | blackhorse road |
buckingham palace | cambridge | canterbury | charing cross road | covent garden | downing
street | ealing | edgware road | finchley road |
gloucester road | greenwich | heathrow airport | high street | house of parliament | hyde park |
kensington | king's cross | leicester square | marble arch | old street | paddington station |
piccadilly circus | portobello market | regent's park | thames river | tower bridge | trafalgar
square | victoria station | westminster abbey | whitehall | wimbledon | windsor;
$end= please;

(!ENTER{_SIL_}({$ex} {into} {$location} $address {$end}){_SIL_}!EXIT)
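
To see what kind of sentences such a grammar allows, the sketch below draws random paths through a reduced copy of it. This is an illustration only, not the HSGen algorithm: the address list is shortened, the silence and !ENTER/!EXIT symbols and the {into} term are left out, and the repetition probability p_more is an arbitrary assumption.

```python
import random

# Reduced copy of the variables from the grammar above (address list shortened).
VARS = {
    "location": ["where is", "how to find", "how to come to"],
    "ex": ["sorry", "excuse me", "pardon"],
    "address": ["baker street", "hyde park", "tower bridge"],
    "end": ["please"],
}

def repeat(choices, rng, p_more=0.3):
    """Expand {x}: zero or more repetitions of one of the alternatives."""
    out = []
    while rng.random() < p_more:       # p_more is an assumed continuation probability
        out.append(rng.choice(choices))
    return out

def sentence(rng):
    """One random path through {$ex} {$location} $address {$end}."""
    words = repeat(VARS["ex"], rng)
    words += repeat(VARS["location"], rng)
    words.append(rng.choice(VARS["address"]))   # $address is mandatory
    words += repeat(VARS["end"], rng)
    return " ".join(words)

rng = random.Random(0)
for _ in range(5):
    print(sentence(rng))
```

Because every term except $address sits inside {.}, a bare address such as "hyde park" is a valid sentence, and so is one with several repeated "please" words.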


How to use HTK in 10 easy steps

Step 2 - Prepare the Pronunciation Dictionary


Find the list of words used in the task (lg.wlist)

Prepare the dictionary by hand, automatically, or from a standard pronunciation
dictionary (e.g. Beep for British English)
Or use the whole Beep dictionary

where [where] 1.0 w e@

where [where] 1.0 w e@ r
is [is] 1.0 I z
how [how] 1.0 h aU
admirality [admirality] 1.0 { d m @ r @ l i: t i:
palace [palace] 1.0 p { l I s
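
The entries above follow the pattern "word [output symbol] probability phones". A minimal sketch of reading such lines into a lookup table is given below; the function names are hypothetical, and the handling of a missing output symbol or probability is an assumption about optional fields.

```python
def parse_dict_line(line):
    """Split 'word [output] prob phone phone ...' into its fields."""
    fields = line.split()
    word, rest = fields[0], fields[1:]
    output = word                       # assumed default: output symbol is the word
    if rest and rest[0].startswith("[") and rest[0].endswith("]"):
        output = rest[0][1:-1]
        rest = rest[1:]
    prob = 1.0                          # assumed default when no probability is given
    if rest:
        try:
            prob = float(rest[0])
            rest = rest[1:]
        except ValueError:
            pass                        # first field is already a phone
    return word, output, prob, rest

def load_lexicon(lines):
    """Map each word to the list of its pronunciations (a word may have several)."""
    lexicon = {}
    for line in lines:
        if line.strip():
            word, _, _, phones = parse_dict_line(line)
            lexicon.setdefault(word, []).append(phones)
    return lexicon

lines = [
    "where [where] 1.0 w e@",
    "where [where] 1.0 w e@ r",
    "is [is] 1.0 I z",
    "how [how] 1.0 h aU",
]
lexicon = load_lexicon(lines)
print(lexicon["where"])
```

Note that "where" ends up with two pronunciation variants, matching the two entries in the example.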

How to use HTK in 10 easy steps

Step 3 - Record the Training and Test Data

1. how to come to baker street _SIL_ !EXIT

2. ealing please _SIL_ !EXIT
3. heathrow airport !EXIT
4. leicester square _SIL_ !EXIT
5. king's cross please _SIL_ !EXIT
6. hyde park _SIL_ !EXIT
7. _SIL_ greenwich please _SIL_ _SIL_ _SIL_ _SIL_ _SIL_ !EXIT
8. old street !EXIT
9. high street _SIL_ _SIL_ _SIL_ _SIL_ !EXIT
10. whitehall !EXIT
11. old street !EXIT
12. canterbury please !EXIT
13. into edgware road !EXIT
14. whitehall _SIL_ !EXIT
15. whitehall _SIL_ !EXIT
16. finchley road please please please _SIL_ !EXIT


HTK has a tool for recording prompts, HSLab, but it works under Linux only
Usually other programs are used for recording
First generate the prompts, then record them
D:\htk-3.1\bin.win32\HSGen -l -n 200 beep.dic > lg.200

Record prompts and store in chosen format: 16 kHz, 16-bit, headerless (?)

How to use HTK in 10 easy steps

Step 4 - Create the Transcription Files

In HTK, all transcription files can be merged into one Master Label File (MLF)
Usually it is enough to have word-level transcripts
If phone-level transcripts are necessary, they can be generated automatically using HLEd
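
A word-level MLF is a plain text file: a "#!MLF!#" header, then for each utterance a quoted file pattern, one label per line and a closing period. The sketch below writes this layout; the function name and the label file name s0001.lab are hypothetical.

```python
def write_mlf(transcripts):
    """Return MLF text for a mapping {label file name: list of words}."""
    lines = ["#!MLF!#"]
    for name, words in transcripts.items():
        lines.append('"*/%s"' % name)   # pattern matches the file in any directory
        lines.extend(words)             # one label per line
        lines.append(".")               # a single period closes each entry
    return "\n".join(lines) + "\n"

mlf_text = write_mlf({"s0001.lab": "how to come to baker street".split()})
print(mlf_text)
```

One entry per recorded prompt is enough; the phone-level version can then be derived from it with HLEd and the dictionary.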

come to



How to use HTK in 10 easy steps

Step 5 - Parametrize the Data

Use HCopy: compute MFCC and delta

Use config file to set all the options
HCopy -T 1 -C hcopy.conf -S

### hcopy.conf
#fbank analysis
###input file specific section
= 24
= 80
= 7500
#16kHz corresponds to 0.0625 msec
#don't take the sqrt:
#cepstrum calculation
###analysis section
= 12
= 22
# no DC offset correction
# no random noise added
#delta and delta-delta
###output file specific section
= 1.0
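
Two numbers in this step can be checked by simple arithmetic: the sample period for 16 kHz audio, and the 39-element feature vector used by the model prototype. The sketch below assumes that the "= 12" entry of the analysis section is the number of cepstral coefficients; the variable names are illustrative only.

```python
SAMPLE_RATE_HZ = 16000

period_ms = 1000.0 / SAMPLE_RATE_HZ          # duration of one sample in milliseconds
period_htk = round(1e7 / SAMPLE_RATE_HZ)     # HTK expresses times in units of 100 ns

num_ceps = 12                    # assumption: the "= 12" value in the config
static = num_ceps + 1            # 12 cepstra plus C0 (the _0 qualifier)
vec_size = static * 3            # static + delta (_D) + acceleration (_A)

print(period_ms, period_htk, vec_size)       # 0.0625 625 39
```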






~o <VecSize> 39 <MFCC_D_A_0> <StreamInfo> 1 39

~h "p"

Step 6 Create Monophone


define a prototype model and clone

for all phones

<NumStates> 5
<State> 2 <NumMixes> 1
<Stream> 1
<Mixture> 1 1.0000
<Mean> 39

Multimedia Department

0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
<Variance> 39
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
<State> 3 <NumMixes> 1
<Stream> 1
<Mixture> 1 1.0000
<Mean> 39
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
<Variance> 39
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
<State> 4 <NumMixes> 1
<Stream> 1
<Mixture> 1 1.0000
<Mean> 39
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
<Variance> 39
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
<TransP> 5
0.000e+0 1.000e+0 0.000e+0 0.000e+0 0.000e+0
0.000e+0 6.000e-1 4.000e-1 0.000e+0 0.000e+0
0.000e+0 0.000e+0 6.000e-1 4.000e-1 0.000e+0
0.000e+0 0.000e+0 0.000e+0 6.000e-1 4.000e-1
0.000e+0 0.000e+0 0.000e+0 0.000e+0 0.000e+0
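
Writing such a prototype by hand for every phone is tedious, so the text can be generated. The sketch below reproduces the layout of the prototype with zero means, unit variances and the same 0.6/0.4 transition pattern, bracketed by the <BeginHMM> and <EndHMM> keywords that delimit an HTK model definition. The function name make_proto and the phone list are hypothetical.

```python
def make_proto(name, vec_size=39, num_states=5, kind="MFCC_D_A_0"):
    """Return the text of a single-Gaussian left-to-right prototype HMM."""
    zeros = " ".join(["0.0"] * vec_size)
    ones = " ".join(["1.0"] * vec_size)
    out = ["~o <VecSize> %d <%s> <StreamInfo> 1 %d" % (vec_size, kind, vec_size),
           '~h "%s"' % name,
           "<BeginHMM>",
           "<NumStates> %d" % num_states]
    for state in range(2, num_states):          # emitting states only
        out += ["<State> %d <NumMixes> 1" % state,
                "<Stream> 1",
                "<Mixture> 1 1.0000",
                "<Mean> %d" % vec_size, zeros,
                "<Variance> %d" % vec_size, ones]
    out.append("<TransP> %d" % num_states)
    for i in range(num_states):
        row = [0.0] * num_states
        if i == 0:
            row[1] = 1.0                         # entry state always moves on
        elif i < num_states - 1:
            row[i], row[i + 1] = 0.6, 0.4        # self-loop and step forward
        out.append(" ".join("%.3e" % p for p in row))
    out.append("<EndHMM>")
    return "\n".join(out) + "\n"

# Clone the prototype under a different name for each phone.
protos = {phone: make_proto(phone) for phone in ["w", "e@", "r"]}
print(protos["w"].splitlines()[1])
```

The clones differ only in the name after ~h; their parameters become phone-specific in the initialization and re-estimation steps that follow.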

How to use HTK in 10 easy steps

Step 7 - Initialize Models

Use HInit: HInit -S trainlist -H globals -M dir1 proto
Firstly, the Viterbi algorithm is used to find the most likely state sequence
corresponding to each training example, then the HMM parameters are estimated. As
a side-effect of finding the Viterbi state alignment, the log likelihood of the training data
can be computed. Hence, the whole estimation process can be repeated until no
further increase in likelihood is obtained.
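
Before the first Viterbi pass there is no alignment yet, so, as described in the training tools overview, the data is first segmented uniformly and each state is estimated from its own segment. The toy sketch below shows that first cycle only. It uses scalar frames instead of 39-element vectors, assumes at least as many frames as states, and the function names are illustrative rather than part of HTK.

```python
def uniform_segment(frames, num_emitting):
    """Split frames into num_emitting contiguous, nearly equal segments."""
    n = len(frames)
    bounds = [round(k * n / num_emitting) for k in range(num_emitting + 1)]
    return [frames[bounds[k]:bounds[k + 1]] for k in range(num_emitting)]

def mean_var(values):
    """Sample mean and (biased) variance of a non-empty list of numbers."""
    m = sum(values) / len(values)
    v = sum((x - m) ** 2 for x in values) / len(values)
    return m, v

# One toy training example with three visibly different regions.
frames = [0.1, 0.2, 0.0, 1.1, 0.9, 1.0, 2.1, 1.9, 2.0]
segments = uniform_segment(frames, 3)
stats = [mean_var(seg) for seg in segments]
for state, (m, v) in enumerate(stats, start=2):
    print("state", state, "mean %.2f var %.4f" % (m, v))
```

On later cycles the uniform boundaries would be replaced by the boundaries found by Viterbi alignment, and the same per-state estimation would be repeated.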

If no bootstrap data is available, use HCompV for flat-start initialization.
HCompV will scan a set of data files, compute
the global mean and variance and set
all of the Gaussians in a given HMM
to have the same mean and variance
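
The flat start amounts to pooling every frame of every file and copying the resulting statistics into each state. A minimal sketch with two-dimensional toy frames follows; the data and function names are made up for illustration.

```python
def global_stats(utterances):
    """Per-dimension mean and variance over all frames of all utterances."""
    frames = [frame for utt in utterances for frame in utt]
    dim = len(frames[0])
    n = len(frames)
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    var = [sum((f[d] - mean[d]) ** 2 for f in frames) / n for d in range(dim)]
    return mean, var

def flat_start(num_emitting, mean, var):
    """Give every emitting state its own copy of the global statistics."""
    return [{"mean": list(mean), "var": list(var)} for _ in range(num_emitting)]

# Two toy utterances, each a list of 2-dimensional feature frames.
utts = [[[0.0, 2.0], [2.0, 4.0]], [[4.0, 6.0]]]
mean, var = global_stats(utts)
states = flat_start(3, mean, var)
print(mean, var)
```

All states start out identical, so the first passes of embedded training are what make the models diverge.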


How to use HTK in 10 easy steps

Step 8 - Isolated Unit Re-Estimation using HRest


Its operation is very similar to HInit except that it expects
the input HMM definition to have been initialised and it uses
Baum-Welch re-estimation in place of Viterbi training.
Whereas Viterbi training makes a hard decision as to which
state each training vector was "generated" by, Baum-Welch
takes a soft decision. This can be helpful when
estimating phone-based HMMs since there are no hard
boundaries between phones in real speech and using a
soft decision may give better results.
HRest -S trainlist -H dir1/globals -M dir2 -l ih -L labs
This will load the HMM definition for /ih/ from dir1, re-estimate the parameters using the speech segments
labelled with ih and write the new definition to directory dir2.
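
The difference between a hard and a soft decision can be shown for a single frame lying between two states. The sketch below is a deliberate simplification: it compares only the Gaussian likelihoods of two one-dimensional states and ignores transitions and neighbouring frames, which the real Viterbi and forward-backward computations take into account.

```python
import math

def gauss(x, mean, var):
    """Density of a one-dimensional Gaussian at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def hard_assign(x, states):
    """Viterbi-style: the whole frame goes to the best-scoring state."""
    scores = [gauss(x, m, v) for m, v in states]
    best = scores.index(max(scores))
    return [1.0 if i == best else 0.0 for i in range(len(states))]

def soft_assign(x, states):
    """Baum-Welch-style: the frame is shared in proportion to the likelihoods."""
    scores = [gauss(x, m, v) for m, v in states]
    total = sum(scores)
    return [s / total for s in scores]

states = [(0.0, 1.0), (2.0, 1.0)]   # (mean, variance) of two neighbouring states
x = 0.9                             # a frame close to the boundary between them
hard = hard_assign(x, states)
soft = soft_assign(x, states)
print(hard, [round(w, 3) for w in soft])
```

With the hard decision the frame updates only the first state; with the soft decision it contributes to both, weighted by how well each state explains it.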

How to use HTK in 10 easy steps

Step 9 - Embedded Training using HERest

HERest embedded training simultaneously updates all
of the HMMs in a system using all of the training data.

On startup, HERest loads in a complete set of HMM
definitions. Every training file must have an associated
label file which gives a transcription for that file. Only the
sequence of labels is used by HERest, however, and
any boundary location information is ignored. Thus,
these transcriptions can be generated automatically
from the known orthography of what was said and a
pronunciation dictionary.

HERest processes each training file in turn. After
loading it into memory, it uses the associated
transcription to construct a composite HMM which
spans the whole utterance. This composite HMM is
made by concatenating instances of the phone HMMs
corresponding to each label in the transcription. The
Forward-Backward algorithm is then applied and the
sums needed to form the weighted averages
accumulated in the normal way. When all of the training
files have been processed, the new parameter
estimates are formed from the weighted sums and the
updated HMM set is output.
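
The construction of the composite model can be pictured as two simple steps: expand the word transcription into phones with the dictionary, then lay the emitting states of the phone models end to end. The sketch below illustrates only this bookkeeping, with a small lexicon taken from the Step 2 examples and three emitting states per phone as in the prototype; the names are hypothetical and no statistics are accumulated.

```python
LEXICON = {                      # first pronunciation of each word, as in Step 2
    "where": ["w", "e@"],
    "is": ["I", "z"],
    "how": ["h", "aU"],
}

def phone_sequence(words, lexicon):
    """Replace each word by its phones; boundary times are not needed."""
    phones = []
    for word in words:
        phones.extend(lexicon[word])
    return phones

def composite_states(phones, emitting_per_phone=3):
    """List the emitting states of the concatenated model as (position, phone, state)."""
    return [(i, p, s) for i, p in enumerate(phones)
            for s in range(2, 2 + emitting_per_phone)]

phones = phone_sequence(["where", "is"], LEXICON)
chain = composite_states(phones)
print(phones)
print(len(chain), "emitting states in the composite model")
```

Because a phone can occur several times in one utterance, each occurrence is a separate instance in the chain, but all instances of the same phone feed the same accumulators.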

HERest -t 120.0 60.0 240.0 -S trainlist -I labs \
    -H dir1/hmacs -M dir2 hmmlist
-t : pruning beam limits (initial threshold, increment, upper limit)

Can be used to prepare context-dependent models


How to use HTK in 10 easy steps

Step 10 - Use HVite to Recognize Utterances and HResults to Evaluate the Recognition Rate

D:\htk-3.1\bin.win32\HVite -g -w -H wsjcam0.mmf -S test.list -C hvite.conf -i recresults.mlf beep.dic wsjcam0.mlist
A lot of other options to be set (beam width, scale factors, weights, etc.)
Live recognition from direct audio input:
D:\htk-3.1\bin.win32\HVite -g -w -H wsjcam0.mmf -C live.conf beep.dic
Statistics of results:
HResults -I testref.mlf tiedlist recout.mlf
====================== HTK Results Analysis ==============
Ref : testrefs.mlf
Rec : recout.mlf
------------------------ Overall Results -----------------
SENT: %Correct=98.50 [H=197, S=3, N=200]
WORD: %Corr=99.77, Acc=99.65 [H=853, D=1, S=1, I=1, N=855]
==========================================================
N = total number, I = insertions, S = substitutions,
D = deletions
correct: H = N-S-D, %Corr=H/N, Acc=(H-I)/N
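
The counts and the two percentages can be reproduced with a small dynamic-programming alignment. The sketch below uses unit costs for substitutions, deletions and insertions, which is a simplification; HResults applies its own penalty weights, so on ambiguous alignments its counts may differ. The function names are illustrative.

```python
def align_counts(ref, rec):
    """Return (H, D, S, I) from a minimum-error alignment of two word lists."""
    rows, cols = len(ref) + 1, len(rec) + 1
    # best[i][j] = (errors, H, D, S, I) for ref[:i] aligned against rec[:j]
    best = [[None] * cols for _ in range(rows)]
    best[0][0] = (0, 0, 0, 0, 0)
    for i in range(rows):
        for j in range(cols):
            if i == 0 and j == 0:
                continue
            cands = []
            if i > 0:                                    # deletion: ref word missing
                e, h, d, s, ins = best[i - 1][j]
                cands.append((e + 1, h, d + 1, s, ins))
            if j > 0:                                    # insertion: extra rec word
                e, h, d, s, ins = best[i][j - 1]
                cands.append((e + 1, h, d, s, ins + 1))
            if i > 0 and j > 0:
                e, h, d, s, ins = best[i - 1][j - 1]
                if ref[i - 1] == rec[j - 1]:
                    cands.append((e, h + 1, d, s, ins))      # hit
                else:
                    cands.append((e + 1, h, d, s + 1, ins))  # substitution
            # fewest errors first; among ties prefer the alignment with more hits
            best[i][j] = min(cands, key=lambda c: (c[0], -c[1]))
    _, h, d, s, ins = best[-1][-1]
    return h, d, s, ins

def percent_correct(h, n):
    return 100.0 * h / n

def accuracy(h, ins, n):
    return 100.0 * (h - ins) / n

ref = "how to come to baker street".split()
rec = "how to baker street please".split()
print(align_counts(ref, rec))                  # (H, D, S, I)
print("%.2f %.2f" % (percent_correct(853, 855), accuracy(853, 1, 855)))
```

The last line recomputes the word-level figures of the report above from its H, I and N values. Acc is always at or below %Corr because insertions are subtracted from the hits.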


Bye Bye


Thanks for your participation!
