
Feature extractor

Mel-Frequency Cepstral Coefficients (MFCCs)


Feature vectors

Acoustic Observations
Hidden States
Acoustic Observation likelihoods


Constructs the HMMs for units of speech
Produces observation likelihoods
Sampling rate is critical! WSJ vs. WSJ_8k
TIDIGITS, RM1, AN4, HUB4

Word likelihoods

ARPA format Example:

1-grams:
-3.7839 board -0.1552
-2.5998 bottom -0.3207
-3.7839 bunch -0.2174
2-grams:
-0.7782 as the -0.2717
-0.4771 at all 0.0000
-0.7782 at the -0.2915
3-grams:
-2.4450 in the lowest
-0.5211 in the middle
-2.4450 in the on
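The numbers are base-10 log probabilities; the trailing value on a line is the back-off weight. A minimal sketch of turning one of the bigram entries above back into a plain probability (plain Python, not Sphinx4 code):

# ARPA entries store log10 probabilities; a trailing number is the back-off weight
log10_p = -0.7782                  # "as the" from the 2-grams section above
p = 10 ** log10_p                  # convert back to a plain probability
print(f"P(the | as) = {p:.4f}")    # ~0.1667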

public <basicCmd> = <startPolite> <command> <endPolite>;
public <startPolite> = (please | kindly | could you) *;
public <endPolite> = [ please | thanks | thank you ];
<command> = <action> <object>;
<action> = (open | close | delete | move);
<object> = [the | a] (window | file | menu);
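As a quick check of what these rules accept, here is a small sketch (plain Python, not the Sphinx4 grammar loader) that expands the <command> rule into every utterance it allows, ignoring the optional politeness wrappers:

from itertools import product

actions = ["open", "close", "delete", "move"]   # <action>
articles = ["", "the ", "a "]                   # [the | a] is optional
objects = ["window", "file", "menu"]            # <object>

# Every <command> = <action> <object> combination the grammar accepts
for action, article, obj in product(actions, articles, objects):
    print(action, article + obj)
# e.g. "open the window", "delete file", "move a menu", ...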

Maps words to phoneme sequences

Example from cmudict.06d


POULTICE    P OW L T AH S
POULTICES   P OW L T AH S IH Z
POULTON     P AW L T AH N
POULTRY     P OW L T R IY
POUNCE      P AW N S
POUNCED     P AW N S T
POUNCEY     P AW N S IY
POUNCING    P AW N S IH NG
POUNCY      P UW NG K IY
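A minimal sketch for loading a dictionary in this word-then-phonemes layout into a lookup table (plain Python; load_dict is just an illustrative helper, and the path assumes a local copy of the file):

def load_dict(path):
    pronunciations = {}
    with open(path, encoding="latin-1") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith(";;;"):   # skip blank and comment lines
                continue
            word, *phones = line.split()
            pronunciations[word] = phones
    return pronunciations

lexicon = load_dict("cmudict.06d")                   # local copy of the dictionary
print(lexicon["POULTRY"])                            # ['P', 'OW', 'L', 'T', 'R', 'IY']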

Constructs the search graph of HMMs from:
- Acoustic model
- Statistical Language model ~or~ Grammar
- Dictionary

Can be statically or dynamically constructed

FlatLinguist
DynamicFlatLinguist
LexTreeLinguist

Maps feature vectors to search graph

Searches the graph for the best fit
P(sequence of feature vectors | word/phone), aka P(O|W)

-> how likely is the input to have been generated by the word
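In the standard noisy-channel formulation (as in Jurafsky & Martin, listed in the references), the decoder picks the word sequence that maximizes this acoustic likelihood times the language model score:

W* = argmax over all W of P(O|W) * P(W)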

Possible alignments of the phones f, ay, v against the observation frames (one row per alignment, one column per time step O1, O2, O3, ...):

f ay ay ay ay v v v v v
f f ay ay ay ay v v v v
f f f ay ay ay ay v v v
f f f f ay ay ay ay v v
f f f f ay ay ay ay ay v
f f f f f ay ay ay ay v
f f f f f f ay ay ay v

Uses algorithms to weed out low scoring paths during decoding

Words!

Most common metric. Measures the number of modifications needed to transform the recognized sentence into the reference sentence.

Reference: This is a reference sentence.
Result: This is neuroscience.

Requires 2 deletions, 1 substitution


WER = 100 * (deletions + substitutions + insertions) / Length

Reference: This is a reference sentence.
Result:    This is neuroscience.
                   D S         D


WER = 100 * (2 + 1 + 0) / 5 = 100 * 3/5 = 60%
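A minimal sketch of the same calculation as word-level edit distance (standard dynamic programming, not the Sphinx4 aligner):

def wer(reference, result):
    """Word error rate: word-level edit distance / reference length * 100."""
    ref, hyp = reference.split(), result.split()
    # dp[i][j] = min edits to turn the first i reference words into the first j result words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return 100.0 * dp[len(ref)][len(hyp)] / len(ref)

print(wer("This is a reference sentence.", "This is neuroscience."))   # 60.0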

Limited Vocab Multi-Speaker
Extensive Vocab Single Speaker

Vocabulary      Sphinx4 WER
Digits 0-9      0.549%
100 Word        1.192%
1,000 Word      2.88%
5,000 Word      6.97%
64,000 Word     18.756%

* If you have noisy audio input, multiply the expected error rate by 2.

Other variables:
- Continuous vs. Isolated
- Conversational vs. Read
- Dialect

Questions?

[Backup slides: Viterbi trellis over observations O1, O2, O3. Path scores accumulate products of transition and observation probabilities, e.g. P(f|f) * P(O2|f) for staying in f, and P(O1) * P(ay|f) * P(O2|ay) for moving from f to ay.]

Common Sphinx4 FAQs can be found online: http://cmusphinx.sourceforge.net/sphinx4/doc/Sphinx4-faq.html What follows are some less frequently asked questions.

Q. Is a search graph created for every recognition result, or one for the recognition app? A. This depends on which Linguist is used. The FlatLinguist generates the entire search graph and holds it in memory; it is only useful for small-vocabulary recognition tasks. The LexTreeLinguist dynamically generates search states, allowing it to handle very large vocabularies.

Q. How does the Viterbi algorithm save computation over exhaustive search? A. The Viterbi algorithm saves memory and computation by reusing subproblems already solved within the larger solution. In this way, probability calculations that repeat in different paths through the search graph do not get calculated multiple times. Viterbi cost = n^2 to n^3; exhaustive search cost = 2^n to 3^n.
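A minimal sketch of that reuse for a generic HMM (plain Python, not the Sphinx4 search manager): each column of scores is computed once from the previous column, instead of re-scoring every path from the start.

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Best state path: each cell reuses the best score from the previous time step."""
    # V[t][s] = probability of the best path ending in state s at time t
    V = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    path = {s: [s] for s in states}
    for t in range(1, len(observations)):
        V.append({})
        new_path = {}
        for s in states:
            # Reuse V[t-1][prev]: the subproblem already solved at the previous step
            prob, prev = max((V[t - 1][p] * trans_p[p][s] * emit_p[s][observations[t]], p)
                             for p in states)
            V[t][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    best_prob, best_state = max((V[-1][s], s) for s in states)
    return best_prob, path[best_state]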

Q. Does the linguist use a grammar to construct the search graph if it is available? A. Yes, a grammar graph is created

Q. What algorithm does the Pruner use? A. Sphinx4 uses absolute and relative beam pruning
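A minimal sketch of what those two settings mean (plain Python, not the Sphinx4 Pruner): keep at most absoluteBeamWidth active paths, and drop any path whose score falls below relativeBeamWidth times the best score in the current frame.

def prune(paths, absolute_beam_width=5000, relative_beam_width=1e-120):
    # paths: list of (score, path) pairs for the current frame
    if not paths:
        return paths
    best = max(score for score, _ in paths)
    # Relative beam: drop paths far below the current best score
    survivors = [(s, p) for s, p in paths if s >= best * relative_beam_width]
    # Absolute beam: keep only the highest-scoring N paths
    survivors.sort(key=lambda sp: sp[0], reverse=True)
    return survivors[:absolute_beam_width]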

Absolute Beam Width - # active search paths


<property name="absoluteBeamWidth" value="5000"/>


Relative Beam Width - probability threshold


<property name="relativeBeamWidth" value="1E-120"/>


Word Insertion Probability - word break likelihood


<property name="wordInsertionProbability" value="0.7"/>


Language Weight - boosts language model scores


<property name="languageWeight" value="10.5"/>

Silence Insertion Probability - likelihood of inserting silence


<property name="silenceInsertionProbability" value=".1"/>


Filler Insertion Probability - likelihood of inserting filler words


<property name="fillerInsertionProbability" value="1E-10"/>

To call a Java example from Python:

import subprocess
subprocess.call(["java", "-mx1000m", "-jar",
                 "/Users/Username/sphinx4/bin/Transcriber.jar"])
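If the transcript needs to come back into Python rather than just print to the console, the same call works with subprocess.check_output (the jar path is the same placeholder as above):

import subprocess

# Capture whatever the transcriber writes to stdout
output = subprocess.check_output(
    ["java", "-mx1000m", "-jar", "/Users/Username/sphinx4/bin/Transcriber.jar"],
    text=True)
print(output)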

Speech and Language Processing, 2nd Ed. Daniel Jurafsky and James Martin. Pearson, 2009.
Artificial Intelligence, 6th Ed. George Luger. Addison Wesley, 2009.
Sphinx Whitepaper: http://cmusphinx.sourceforge.net/sphinx4/#whitepaper
Sphinx Forum: https://sourceforge.net/projects/cmusphinx/forums
