
Abstract

Speech recognition systems have come a long way in the last forty years, but there is still room for improvement. Although readily available, these systems are sometimes inaccurate and insufficient. In an effort to provide a more efficient representation of the speech signal, the application of wavelet analysis is considered. Here we present an effective and robust method for extracting features for speech processing. Based on the time-frequency multiresolution property of the wavelet transform, the input speech signal is decomposed into various frequency channels, and the original speech can then be recognized from this wavelet-domain representation. The major issues in the design of this wavelet-based speech recognition system are choosing optimal wavelets for speech signals, choosing the decomposition level in the DWT, and selecting the feature vectors from the wavelet coefficients.

Dynamic Time Warping (DTW) is a pattern matching approach that can be used for limited vocabulary speech recognition; it is based on a temporal alignment of the input signal with the template models. The main drawback of this method is its high computational cost when the length of the signals increases. The main aim of the project work is to provide a modified version of the DTW, based on the Discrete Wavelet Transform (DWT), which reduces its original complexity. The Daubechies wavelet family with decomposition levels 4 and 7 is experimented with and the corresponding results are reported. The proposed approaches are implemented in software and also on an FPGA.

Table of Contents

Abstract
List of Tables
List of Figures
1. INTRODUCTION
   1.1 Definition
   1.2 Application area, Features & Issues
       1.2.1 Features
       1.2.2 Issues
   1.3 Recognition Systems
       1.3.1 Speaker Dependent / Independent System
       1.3.2 Isolated Word Recognition
       1.3.3 Continuous Speech Recognition
       1.3.4 Vocabulary Size
       1.3.5 Keyword Spotting
   1.4 Objectives
   1.5 Outline
2. LITERATURE SURVEY
   2.1 Advancement in technology
3. THE SPEECH SIGNAL
   3.1 Speech production
   3.2 Speech Representation
       3.2.1 Three-state Representation
       3.2.2 Spectral Representation
       3.2.3 Parameterization of the Spectral Activity
   3.3 Technical Characteristics of the Speech Signal
       3.3.1 Bandwidth
       3.3.2 Fundamental Frequency
       3.3.3 Peaks in the Spectrum
       3.3.4 The Envelope of the Power Spectrum
   3.4 Speech perception process
4. WAVELET ANALYSIS
   4.1 Definition
   4.2 Fourier Analysis
       4.2.1 Limitations
   4.3 Short-Time Fourier analysis
       4.3.1 Limitations
   4.4 Types of Wavelets
       4.4.1 Haar Wavelet
       4.4.2 Daubechies-N wavelet family
       4.4.3 Advantages of Wavelet analysis over STFT
   4.5 Wavelet Transform
       4.5.1 Discrete Wavelet Transform
       4.5.2 Multilevel Decomposition of Signal
       4.5.3 Wavelet Reconstruction
5. FROM SPEECH TO FEATURE VECTORS
   5.1 Preprocessing
       5.1.1 Pre emphasis
       5.1.2 Voice Activation Detection (VAD)
   5.2 Frame blocking & Windowing
       5.2.1 Frame blocking
       5.2.2 Windowing
   5.3 Feature Extraction
6. DYNAMIC TIME WARPING
   6.1 DTW Algorithm
       6.1.1 DP-Matching Principle
       6.1.2 Restrictions on Warping Function
       6.1.3 Discussions on Weighting Coefficient
   6.2 Practical DP-Matching Algorithm
       6.2.1 DP-Equation
       6.2.2 Calculation Details
7. FPGA Implementation
8. SIMULATION & RESULTS
   8.1 Input Signal
   8.2 Pre emphasis
   8.3 Voice Activation & Detection
   8.4 De-noising
   8.5 Recognition Results
   8.6 FPGA Implementation
9. CONCLUSION
REFERENCES

List of Tables
Table 8.1: Recognition rates for English words using db8 & level 4 DWT.
Table 8.2: Recognition rates for English words using db8 & level 7 DWT.

List of Figures
Fig. 2.1 Literature survey
Fig. 3.1 Schematic diagram of the speech production/perception process
Fig. 3.2 Human Vocal Mechanism
Fig. 3.3 Discrete-Time Speech Production Model
Fig. 3.4 Three state representation of a speech signal
Fig. 3.5 Spectrogram using Welch's Method
Fig. 4.1 Fourier transform
Fig. 4.2 Short time Fourier transform
Fig. 4.3 Haar wavelet
Fig. 4.5 Daubechies wavelets
Fig. 4.6 Comparison of Wavelet analysis over STFT
Fig. 4.7 Filter functions
Fig. 4.8 Decomposition of DWT Co-efficients
Fig. 4.9 Decomposition using DWT
Fig. 4.10 Signal Reconstruction
Fig. 4.11 Signal Decomposition & Reconstruction
Fig. 5.1 Main steps in Feature Extraction
Fig. 5.2 Pre processing
Fig. 5.3 Pre emphasis filter
Fig. 5.4 Frame blocking & Windowing
Fig. 5.5 Frame blocking of a sequence
Fig. 5.6 Hamming Window
Fig. 6.1 Warping function & adjusting window definition
Fig. 6.2 Slope constraint on warping function
Fig. 6.3 Weighting coefficient W(k)
Fig. 7.1 Synthesis flow in AccelDSP
Fig. 8.1 Input speech signal
Fig. 8.2 Pre emphasis output
Fig. 8.3 Voice Activation & Detection
Fig. 8.4 Speech signal after Voice Activation & Detection
Fig. 8.5 Speech signal after de-noising
Fig. 8.6 Matlab output of Speech Recognition for word "FEDORA"
Fig. 8.7 FPGA results for word "FEDORA"

1. INTRODUCTION
1.1 Definition
Speech recognition is the process of automatically extracting and determining the linguistic information conveyed by a speech signal using computers or electronic circuits. Recent advances in soft computing techniques have given more importance to automatic speech recognition. Large variation in speech signals and other factors such as native accent and varying pronunciations make the task very difficult. ASR is hence a complex task and requires considerable intelligence to achieve a good recognition result. Speech recognition is a topic that is very useful in many applications and environments in our daily life. The fundamental purpose of speech is communication, i.e., the transmission of messages. According to Shannon's information theory, a message represented as a sequence of discrete symbols can be quantified by its information content in bits, and the rate of transmission of information is measured in bits/second (bps). In order for communication to take place, a speaker must produce a speech signal in the form of a sound pressure wave that travels from the speaker's mouth to a listener's ears. Although the majority of the pressure wave originates from the mouth, sound also emanates from the nostrils, throat, and cheeks. Speech signals are composed of a sequence of sounds that serve as a symbolic representation for a thought that the speaker wishes to relay to the listener. The arrangement of these sounds is governed by rules associated with a language. The scientific study of language and the manner in which these rules are used in human communication is referred to as linguistics. The science that studies the characteristics of human sound production, especially for the description, classification, and transcription of speech, is called phonetics.

1.2 Application area, Features & Issues


A different aspect of speech recognition is to assist people with functional disabilities or other kinds of handicap. To make their daily chores easier, voice control could be helpful. With their voice they could operate the light switch, turn the coffee machine on or off, or operate other domestic appliances. This leads to the discussion about intelligent homes, where these operations can be made available for the common man as well as for the handicapped.

1.2.1 Features
Speech input is easy to perform because it does not require a specialized skill as typing or pushbutton operations do. Information can be input even when the user is moving or doing other activities involving the hands, legs, eyes, or ears. Since a microphone or telephone can be used as an input terminal, inputting information is economical, with remote input possible over existing telephone networks and the Internet.

1.2.2 Issues
A lot of redundancy is present in the speech signal, which makes discriminating between the classes difficult. There is temporal and frequency variability, such as intra-speaker variability in the pronunciation of words and phonemes as well as inter-speaker variability, e.g. the effect of regional dialects. Pronunciation of the phonemes is context dependent (co-articulation). The signal is degraded by additive and convolutive noise present in the background or in the channel, and distorted by non-ideal channel characteristics.

1.3 Recognition Systems


Recognition systems may be designed in many modes to achieve specific objectives or performance criteria.

1.3.1 Speaker Dependent / Independent System


For speaker dependent systems, the user is asked to utter predefined words or sentences. These acoustic signals form the training data, which are used for recognition of the input speech. Since these systems are used by only a predefined speaker, their performance is higher compared to speaker independent systems.

1.3.2 Isolated Word Recognition


This is also called a discrete recognition system. In this system, there has to be a pause between uttered words. Therefore the system does not have to care about finding boundaries between words.

1.3.3 Continuous Speech Recognition


These systems are the ultimate goal of a recognition process. No matter how or when a word is uttered, it is recognized in real time and an action is performed accordingly. Changes in speaking rate, careless pronunciation, detecting word boundaries and real-time issues are the main problems for this recognition mode.

1.3.4 Vocabulary Size


The smaller the vocabulary of a recognition system, the higher the recognition performance. Specific tasks may use small vocabularies. However, a natural system should perform speaker independent continuous recognition over a large vocabulary, which is the most difficult case.

1.3.5 Keyword Spotting


These systems are used to detect a word in continuous speech. For this reason they may be as good as isolated word recognition while also having the capability to handle continuous speech. Speech recognition systems commonly carry out some kind of classification based on speech features which are usually obtained via Fourier Transforms (FTs), Short Time Fourier Transforms (STFTs), or Linear Predictive Coding techniques. However, these methods have some disadvantages. They assume that the signal is stationary within a given time frame and may therefore lack the ability to analyze localized events correctly. The wavelet transform copes with some of these problems. Other factors influencing the selection of Wavelet Transforms (WT) over conventional methods include their ability to capture localized features. The Discrete Wavelet Transform method is used for speech processing. The speech recognizer implemented in Matlab was used to simulate a speech recognizer operating in a real environment. Recordings are taken in an open environment to get real data.

In the future it could be possible to use this information to create a chip that could be used as a new interface to humans. For example it would be desired to get rid of all remote controls in the home and just tell the television, stereo or any desired device what to do with the voice.

1.4 Objectives
This project will cover speaker independent, small vocabulary speech recognition with the help of wavelet analysis using the Dynamic Time Warping method. The project comprises two phases: 1) Training phase: in this phase, a number of words are trained to extract a model for each word. 2) Recognition phase: in this phase, a sequence of connected words is entered via a microphone or an input file and the system tries to recognize these words.

1.5 Outline


The outline of this thesis is as follows. Chapter 2, Literature Survey, discusses the trends and technologies followed for improving speech recognition performance. Chapter 3, The Speech Signal, discusses how speech is produced and perceived; topics covered are speech production, speech representation, characteristics of the speech signal and perception. Chapter 4, Wavelet Analysis, discusses what a wavelet is, which types of wavelets are available and which are used here, why wavelets were introduced, and how signals are decomposed with wavelets; related topics are Fourier analysis, the STFT, types of wavelets and the wavelet transform.

Chapter 5, From Speech to Feature Vectors, covers the fundamental signal processing applied in a speech recognizer; topics are pre-processing, frame blocking and windowing, and feature extraction. Chapter 6, Dynamic Time Warping, covers the theory and implementation of the pattern matching technique referred to as Dynamic Time Warping; topics are the DTW algorithm and the DP matching algorithm. Chapter 7, FPGA Implementation, describes the FPGA implementation of the speech recognition system using the AccelDSP tool in Xilinx ISE. Chapter 8, Simulation & Results, uses the speech recognizer implemented in Matlab to test the recognizer in different cases and assess its efficiency. Chapter 9, Conclusions, summarizes the whole project.

2. LITERATURE SURVEY
Designing a machine that mimics human behavior, particularly the capability of speaking naturally and responding properly to spoken language, has intrigued engineers and scientists for centuries. Since the 1930s, when Homer Dudley of Bell Laboratories proposed a system model for speech analysis and synthesis, the problem of automatic speech recognition has been approached progressively, from a simple machine that responds to a small set of sounds to a sophisticated system that responds to fluently spoken natural language and takes into account the varying statistics of the language in which the speech is produced. Based on major advances in statistical modeling of speech in the 1980s, automatic speech recognition systems today find widespread application in tasks that require a human-machine interface, such as automatic call processing in the telephone network and query-based information systems that do things like provide updated travel information, stock price quotations, weather reports, etc. Speech is the primary means of communication between people. For reasons ranging from technological curiosity about the mechanisms for mechanical realization of human speech capabilities, to the desire to automate simple tasks inherently requiring human-machine interactions, research in automatic speech recognition (and speech synthesis) by machine has attracted a great deal of attention over the past five decades.

2.1 Advancement in technology


Fig. 2.1 shows a timeline of progress in speech recognition and understanding technology over the past several decades. We see that in the 1960s we were able to recognize small vocabularies (order of 10-100 words) of isolated words, based on simple acoustic-phonetic properties of speech sounds. The key technologies that were developed during this time frame were filter-bank analyses, simple time normalization methods, and the beginnings of sophisticated dynamic programming methodologies. In the 1970s we were able to recognize medium vocabularies (order of 100-1000 words) using simple template-based, pattern recognition methods [3]. The key technologies that were developed during this period were the pattern recognition models, the introduction of LPC methods for spectral representation, the pattern clustering methods for speaker-independent recognizers, and the introduction of dynamic programming methods for solving connected word recognition problems. In the 1980s we

started to tackle large vocabulary (1000 to an unlimited number of words) speech recognition problems based on statistical methods, with a wide range of networks for handling language structures. The key technologies introduced during this period were the hidden Markov model (HMM) [9] and the stochastic language model, which together enabled powerful new methods for handling virtually any continuous speech recognition problem efficiently and with high performance. In the 1990s we were able to build large vocabulary systems with unconstrained language models, and constrained task syntax models for continuous speech recognition and understanding. The key technologies developed during this period were the methods for stochastic language understanding, statistical learning of acoustic and language models, and the introduction of the finite state transducer framework (and the FSM Library) and the methods for their determinization and minimization for efficient implementation of large vocabulary speech understanding systems.

Fig. 2.1 Literature survey

Finally, in the last few years, we have seen the introduction of very large vocabulary systems with full semantic models, integrated with text-to-speech (TTS) synthesis systems, and multi-modal inputs (pointing, keyboards, mice, etc.). These systems enable spoken dialog systems with a range of input and output modalities for ease-of-use and flexibility in handling adverse environments where speech might not be as suitable as other input-output modalities. During this period we have seen the emergence of highly natural speech synthesis systems, the use of machine learning to improve both speech understanding and speech dialogs, and the introduction of mixed-initiative dialog systems to enable user control when necessary. After nearly five decades of research, speech recognition technologies have finally entered the marketplace, benefiting the users in a variety of ways. Throughout the course of development of such systems, knowledge of speech production and perception was used in establishing the technological foundation for the resulting speech recognizers. Major advances, however, were brought about in the 1960s and 1970s via the introduction of advanced speech representations based on LPC analysis and cepstral analysis methods, and in the 1980s through the introduction of rigorous statistical methods based on hidden Markov models [9]. All of this came about because of significant research contributions from academia, private industry and the government. As the technology continues to mature, it is clear that many new applications will emerge and become part of our way of life thereby taking full advantage of machines that are partially able to mimic human speech capabilities.

3. THE SPEECH SIGNAL


This chapter intends to discuss how the speech signal is produced and perceived by human beings. This is an essential subject that has to be considered before one can pursue and decide which approach to use for speech recognition.

3.1 Speech production


Human communication can be viewed as a comprehensive process from speech production to speech perception between the talker and listener, as in Fig. 3.1 [2].

Fig. 3.1 Schematic diagram of the speech production/perception process

Five different elements, A. Speech formulation, B. Human vocal mechanism, C. Acoustic air, D. Perception of the ear, E. Speech comprehension, will be examined more carefully in the following sections. The first element (A. Speech formulation) is associated with the formulation of the speech signal in the talker's mind. This formulation is used by the human vocal mechanism (B. Human vocal mechanism) to produce the actual speech waveform. The waveform is transferred via the air (C. Acoustic air) to the listener. During this transfer the acoustic wave can be affected by external sources, for example noise, resulting in a more complex waveform. When the wave reaches the listener's hearing system (the ears) the listener perceives the waveform (D. Perception of the ear) and the listener's mind (E. Speech comprehension) starts processing this waveform to comprehend its content, so the listener understands what the talker is trying to tell him or her.

Fig. 3.2 Human Vocal Mechanism

To be able to understand how the production of speech is performed, one needs to know how the human vocal mechanism is constructed, see Fig. 3.2.


The most important parts of the human vocal mechanism are the vocal tract together with the nasal cavity, which begins at the velum. The velum is a trapdoor-like mechanism that is used to form nasal sounds when needed. When the velum is lowered, the nasal cavity is coupled to the vocal tract to form the desired speech signal. The cross-sectional area of the vocal tract is determined by the tongue, lips, jaw and velum and varies from 0 to 20 cm². When humans produce speech, air is expelled from the lungs through the trachea. The air flowing from the lungs causes the vocal cords to vibrate, and by shaping the vocal tract, lips, tongue, jaw and possibly using the nasal cavity, different sounds can be produced. Important parts of the discrete-time speech production model, in the field of speech recognition and signal processing, are u(n), the gain b0 and H(z). The impulse generator acts like the lungs, exciting the glottal filter G(z) and resulting in u(n). G(z) can be regarded as the vocal cords in the human vocal mechanism. The signal u(n) can be seen as the excitation signal entering the vocal tract and the nasal cavity, and is formed by exciting the vocal cords with air from the lungs.

Fig. 3.3 Discrete-Time Speech Production Model

The gain b0 is a factor that is related to the volume of the speech being produced. A larger gain b0 gives louder speech and vice versa. The vocal tract filter H(z) is a model of the vocal tract and the nasal cavity. The lip radiation filter R(z) is a model of the formation of the human lips to produce different sounds.


3.2 Speech Representation


The speech signal and all its characteristics can be represented in two different domains, the time and the frequency domain. A speech signal is a slowly time varying signal in the sense that, when examined over a short period of time (between 5 and 100 ms), its characteristics are short-time stationary. This is not the case if we look at a speech signal under a longer time perspective (approximately T > 0.5 s). In this case the signal's characteristics are non-stationary, meaning that they change to reflect the different sounds spoken by the talker. To be able to use a speech signal and interpret its characteristics in a proper manner, some kind of representation of the speech signal is preferred. The speech representation can exist in either the time or frequency domain, and in three different ways: a three-state representation, a spectral representation, and a parameterization of the spectral activity.

3.2.1 Three-state Representation


The three-state representation is one way to classify events in speech. The events of interest for the three-state representation are:
- Silence (S): no speech is produced.
- Unvoiced (U): the vocal cords are not vibrating, resulting in an aperiodic or random speech waveform.
- Voiced (V): the vocal cords are tensed and vibrating periodically, resulting in a speech waveform that is quasi-periodic. Quasi-periodic means that the speech waveform can be seen as periodic over a short-time period (5-100 ms) during which it is stationary.


Fig. 3.4 Three state representation of a speech signal.

The upper plot Fig. 3.4(a) contains the whole speech sequence and in the middle plot Fig. 3.4(b) a part of the upper plot Fig. 3.4(a) is reproduced by zooming an area of the whole speech sequence. At the bottom of Fig. 3.4 the segmentation into a three-state representation, in relation to the different parts of the middle plot, is given.


3.2.2 Spectral Representation


Spectral representation of the speech intensity over time is widely used, and the most popular one is the sound spectrogram, see Fig. 3.5.

Fig. 3.5 Spectrogram using Welch's Method

Here the darkest (dark blue) parts represent the parts of the speech waveform where no speech is produced and the lighter (red) parts represent higher intensity where speech is produced.


3.2.3 Parameterization of the Spectral Activity


When speech is produced in the sense of a time-varying signal, its characteristics can be represented via a parameterization of the spectral activity. This representation is based on the model of speech production. The human vocal tract can (roughly) be described as a tube excited by air either at the end or at a point along the tube. From acoustic theory it is known that the transfer function of the energy from the excitation source to the output can be described in terms of the natural frequencies or resonances of the tube, better known as formants. Formants represent the frequencies that pass the most acoustic energy from the source to the output. This representation is highly efficient, but is more of theoretical than practical interest, because it is difficult to estimate the formant frequencies reliably in low-level speech and to define the formants for unvoiced (U) and silent (S) regions [3].

3.3 Technical Characteristics of the Speech Signal


A speech signal might be characterized as follows:
- The bandwidth of the signal is 4 kHz.
- The signal is periodic with a fundamental frequency between 80 Hz and 350 Hz.
- There are peaks in the spectral distribution of energy at (2n - 1) * 500 Hz, n = 1, 2, 3, ...
- The envelope of the power spectrum of the signal shows a decrease with increasing frequency (-6 dB per octave).
This is a very rough and technical description of the speech signal. But where do those characteristics come from?

3.3.1 Bandwidth
The bandwidth of the speech signal is actually much higher than 4 kHz. In fact, for the fricatives, there is still a significant amount of energy in the spectrum at high and even ultrasonic frequencies. However, as we all know from using the (analog) phone, a bandwidth of 4 kHz seems to contain all the information necessary to understand a human voice.

3.3.2 Fundamental Frequency


The time between successive vocal fold openings is called the fundamental period T0, while the rate of vibration is called the fundamental frequency of phonation, F0 = 1/T0. Voiced excitation of a speech sound results in a pulse train at this fundamental frequency. Voiced excitation is used when articulating vowels and some of the consonants. For fricatives (e.g., /f/ as in fish or /s/ as in mess), unvoiced excitation (noise) is used. In these cases, usually no fundamental frequency can be detected; on the other hand, the zero crossing rate of the signal is very high. Plosives (like /p/ as in put), which use transient excitation, are best detected in the speech signal by looking for the short silence needed to build up the air pressure before the plosive bursts out.

3.3.3 Peaks in the Spectrum


After passing the glottis, the vocal tract gives a characteristic spectral shape to the speech signal. If one simplifies the vocal tract to a straight pipe of length about 17 cm, one can see that the pipe shows resonances at the frequencies f_n = (2n - 1) c / (4L), which for c of about 340 m/s and L of about 17 cm lie near (2n - 1) * 500 Hz. Depending on the shape of the vocal tract (the diameter of the pipe changes along the pipe), the frequencies of the formants (especially of the 1st and 2nd formant) change and thereby characterize the vowel being articulated.

3.3.4 The Envelope of the Power Spectrum


The pulse sequence from the glottis has a power spectrum decreasing towards higher frequencies by -12 dB per octave. The emission characteristics of the lips show a high-pass characteristic of +6 dB per octave. Together, this results in an overall decrease of -6 dB per octave.

3.4 Speech perception process


The microphone.cs class is responsible for accepting input from a microphone and forwarding it to the feature extraction module. Before converting the signal into a suitable or desired form, it is important to identify the segments of the sound containing words. The audio.cs class deals with all tasks needed for converting a wave file to a stream of digits and vice versa. It also has a provision for saving the sound into WAV files.


4. WAVELET ANALYSIS
4.1 Definition
A wavelet is a wave-like oscillation with an amplitude that starts out at zero, increases, and then decreases back to zero. It can typically be visualized as a "brief oscillation" like one might see recorded by a seismograph or heart monitor. Generally, wavelets are purposefully crafted to have specific properties that make them useful for signal processing. Wavelets can be combined, using a "reverse, shift, multiply and sum" technique called convolution, with portions of an unknown signal to extract information from it. The fundamental idea behind wavelets is to analyze according to scale. The wavelet analysis procedure is to adopt a wavelet prototype function called an analyzing wavelet or mother wavelet. Any speech signal can then be represented by translated and scaled versions of the mother wavelet. Wavelet analysis is capable of revealing aspects of data that other speech signal analysis techniques miss; the extracted features are then passed to a classifier for the recognition of isolated words [4]. The integral wavelet transform is the integral transform defined as:

W(a, b) = (1/√a) ∫ x(t) ψ*((t - b)/a) dt    Equation 4.1

where a is positive and defines the scale and b is any real number and defines the shift. For decomposition of the speech signal, we can use different techniques such as Fourier analysis, the STFT (Short Time Fourier Transform) and wavelet transform techniques. Here, we explain the necessity and advantages of wavelet analysis by first considering Fourier analysis and its limitations, then its modification to the Short Time Fourier Transform and its limitations, and finally wavelet analysis.


4.2 Fourier Analysis


Fourier analysis breaks down a signal into constituent sinusoids of different frequencies. It is a mathematical technique for transforming a signal from a time-based representation to a frequency-based one. The Fourier transform of a sinusoidal signal is depicted in Fig. 4.1 below.

X(f) = ∫ x(t) e^(-j2πft) dt    Equation 4.2

Fig. 4.1 Fourier transform
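As an informal illustration of Equation 4.2 in discrete time (this is a sketch, not part of the project code), the magnitude spectrum of a recorded signal can be inspected in MATLAB with the built-in fft; the variable names x (speech samples) and fs (sampling rate) are assumed:

% Assumed: x is a vector of speech samples, fs its sampling rate in Hz
N = length(x);
X = fft(x);                                   % discrete Fourier transform of the whole signal
f = (0:N-1) * fs / N;                         % frequency axis in Hz
plot(f(1:floor(N/2)), abs(X(1:floor(N/2))));  % one-sided magnitude spectrum
xlabel('Frequency (Hz)'); ylabel('|X(f)|');
% Note that all timing information is lost: only the overall frequency content remains.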

4.2.1 Limitations
But Fourier analysis has a serious drawback. In transforming to the frequency domain, time information is lost. When looking at the Fourier transform of a signal, it is impossible to tell when a particular event took place. If a signal does not change much over time, i.e. if it is what is called a stationary signal, this drawback is not very important. However, most interesting signals contain numerous non-stationary or transitory characteristics: drift, trends, abrupt changes, and beginnings and ends of events. These characteristics are often the most important part of the signal, and Fourier analysis is not suited to detecting them.

4.3 Short-Time Fourier analysis


The Short-Time Fourier Transform (STFT) maps a signal into a two-dimensional function of time and frequency using a technique called windowing. Mathematically it is given by

X(m, ω) = Σ_n x[n] w[n - m] e^(-jωn)    Equation 4.3

where the signal is x[n] and the window is w[n].


Short-Time Fourier Transform of a random signal is shown in Fig. 4.2 below.

Fig. 4.2 Short time Fourier transform

The STFT represents a sort of compromise between the time- and frequency-based views of a signal. It provides some information about both when and at what frequencies a signal event occurs.
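For illustration only, a windowed analysis along the lines of Equation 4.3 can be produced with MATLAB's spectrogram function (Signal Processing Toolbox); the 256-sample Hamming window and 50% overlap are arbitrary example values, and x and fs are assumed as before:

% Assumed: x is the speech signal, fs the sampling rate
win  = hamming(256);                           % analysis window w[n]
nov  = 128;                                    % 50% overlap between adjacent windows
nfft = 512;                                    % FFT length per frame
spectrogram(x, win, nov, nfft, fs, 'yaxis');   % time-frequency picture of the STFT
title('STFT with a fixed 256-sample window');
% The same window is used at all frequencies, which is exactly the limitation discussed next.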

4.3.1 Limitations
However, you can only obtain this information with limited precision, and that precision is determined by the size of the window. While the STFT's compromise between time and frequency information can be useful, the drawback is that once you choose a particular size for the time window, that window is the same for all frequencies. If a wider window is chosen, it gives better frequency resolution but poor time resolution; a narrower window gives good time resolution but poor frequency resolution. Many signals require a more flexible approach, one where we can vary the window size to determine more accurately either time or frequency.

4.4 Types of Wavelets


Different types of wavelets are Haar wavelets, Daubechies wavelets, Biorthogonal wavelets, Coiflet wavelets, Symlet wavelets, Morlet wavelets, Mexican Hat wavelets and Meyer wavelets. The wavelets mainly used in speech recognition are discussed here.


4.4.1 Haar Wavelet


It is the first and simplest wavelet. Haar is discontinuous and resembles a step function. It represents the same wavelet as Daubechies db1. The Haar wavelet family for t in [0, 1] is defined as follows:

hi(t) = 1 for k/m <= t < (k + 0.5)/m, hi(t) = -1 for (k + 0.5)/m <= t < (k + 1)/m, and hi(t) = 0 otherwise in [0, 1]    Equation 4.4

The integer m = 2^j (j = 0, 1, ..., J) indicates the level of the wavelet; k = 0, 1, ..., m - 1 is the translation parameter. The maximal level of resolution is J.

Fig. 4.3 Haar wavelet

4.4.2 Daubechies-N wavelet family


The Daubechies wavelets are a family of orthogonal wavelets defining a discrete wavelet transform and characterized by a maximal number of vanishing moments for a given support. With each wavelet type of this class, there is a scaling function (also called the father wavelet) which generates an orthogonal multiresolution analysis. The Daubechies wavelet is one of the popular wavelets and has been used for speech recognition [4].


In general the Daubechies wavelets are chosen to have the highest number A of vanishing moments (this does not imply the best smoothness) for a given support width N = 2A, and among the 2^(A-1) possible solutions the one is chosen whose scaling filter has extremal phase. The wavelet transform is also easy to put into practice using the fast wavelet transform. Daubechies wavelets are widely used in solving a broad range of problems, e.g. self-similarity properties of a signal, fractal problems, signal discontinuities, etc. The properties of the Daubechies wavelets are [6]:
- The support length of the wavelet function ψ and of the scaling function φ is 2N - 1 (with N the order).
- The number of vanishing moments of ψ is N.
- Most dbN wavelets are not symmetrical.
- The regularity increases with the order: when N becomes very large, ψ and φ belong to C^(μN), where μ is approximately equal to 0.2.
The Daubechies-8 wavelet is used for decomposition of the speech signal as it needs the minimum support size for the given number of vanishing moments. The names of the Daubechies family wavelets are written dbN, where N is the order and db the "surname" of the wavelet. The db1 wavelet, as mentioned above, is the same as Haar. The next nine members of the family are shown in Fig. 4.5:

Fig. 4.5 Daubechies wavelets
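As a small illustration (assuming the MATLAB Wavelet Toolbox is available), the decomposition and reconstruction filters and the wavelet shape of the db8 wavelet used later in this work can be obtained as follows:

% Assumed: Wavelet Toolbox is available
[LoD, HiD, LoR, HiR] = wfilters('db8');   % decomposition and reconstruction filter pairs
disp(length(LoD));                        % db8 filters have 16 taps
[phi, psi, xval] = wavefun('db8', 8);     % scaling and wavelet functions after 8 iterations
plot(xval, psi); title('db8 wavelet');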


4.4.3 Advantages of Wavelet analysis over STFT


Wavelet analysis represents the next logical step: a windowing technique with variable-sized regions. Wavelet analysis allows the use of long time intervals where we want more precise low-frequency information, and shorter regions where we want high-frequency information.

Fig. 4.6 Comparison of Wavelet analysis over STFT

Fig. 4.6 compares the time-based, frequency-based and STFT views of a signal with that of wavelet analysis. One major advantage afforded by wavelets is the ability to perform local analysis, i.e., to analyze a localized area of a larger signal.

4.5 Wavelet Transform


The transform of a signal is just another form of representing the signal. It does not change the information content present in the signal. For many signals, the low-frequency part is the most important; it gives the signal its identity. Consider the human voice: if we remove the high-frequency components, the voice sounds different, but we can still tell what is being said. In wavelet analysis, we often speak of approximations and details. The approximations are the high-scale, low-frequency components of the signal. The details are the low-scale, high-frequency components. The continuous wavelet transform is defined as:

W(a, b) = (1/√a) ∫ x(t) ψ*((t - b)/a) dt    Equation 4.5

where ψ(t) is a time function with finite energy and fast decay called the mother wavelet.

4.5.1 Discrete Wavelet Transform


The Discrete Wavelet Transform (DWT) involves choosing scales and positions based on powers of two, the so-called dyadic scales and positions. The mother wavelet is rescaled, or dilated, by powers of two and translated by integers. Specifically, a function f(t) in L²(R) (the space of square integrable functions) can be represented as [1]:

f(t) = Σ_k a(L, k) φ_{L,k}(t) + Σ_{j=1..L} Σ_k d(j, k) ψ_{j,k}(t)    Equation 4.6

where φ_{L,k}(t) = 2^(-L/2) φ(2^(-L) t - k) and ψ_{j,k}(t) = 2^(-j/2) ψ(2^(-j) t - k). The function ψ(t) is known as the mother wavelet, while φ(t) is known as the scaling function. The set of functions {φ_{L,k}(t), ψ_{j,k}(t) : j <= L; j, k in Z}, where Z is the set of integers, is an orthonormal basis for L²(R). The numbers a(L, k) are known as the approximation coefficients at scale L, while d(j, k) are known as the detail coefficients at scale j. The approximation and detail coefficients can be expressed as:

a(L, k) = ∫ f(t) φ_{L,k}(t) dt    Equation 4.7

d(j, k) = ∫ f(t) ψ_{j,k}(t) dt    Equation 4.8

The DWT analysis can be performed using a fast, pyramidal algorithm related to multirate filter banks. As a multi-rate filter bank the DWT can be viewed as a constant-Q filter bank with octave spacing between the centers of the filters. Each sub-band contains half the samples of the neighboring higher-frequency sub-band. In the pyramidal algorithm the signal is analyzed at different frequency bands with different resolution by decomposing the signal into a coarse approximation and detail information. The coarse approximation is then further decomposed using the same wavelet decomposition step. This is achieved by successive high-pass and low-pass filtering of the time domain signal and is defined by the following equations:

y_low[n] = Σ_k x[k] g[2n - k]    Equation 4.9

y_high[n] = Σ_k x[k] h[2n - k]    Equation 4.10


Fig. 4.7 Filter functions

The signal x[n] is passed through low-pass and high-pass filters and then downsampled by 2:

y_low[n] = (x * g) ↓ 2    Equation 4.11
y_high[n] = (x * h) ↓ 2    Equation 4.12

In the DWT, each level is calculated by passing the previous approximation coefficients through high-pass and low-pass filters.
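A minimal sketch of one decomposition step of Equations 4.9-4.12, written directly as filtering followed by downsampling; the frame x is an assumed input and the filters come from wfilters (Wavelet Toolbox):

% Assumed: x is one frame of the speech signal
[g, h] = wfilters('db8', 'd');        % low-pass g and high-pass h decomposition filters
a = conv(x, g);  a = a(2:2:end);      % approximation: low-pass filter, keep every 2nd sample
d = conv(x, h);  d = d(2:2:end);      % detail: high-pass filter, keep every 2nd sample
% Up to border handling, the toolbox call [a, d] = dwt(x, 'db8') gives the same result.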

4.5.2 Multilevel Decomposition of Signal


A signal can be decomposed using wavelet analysis as shown below [11]:

Fig. 4.8 Decomposition of DWT Co-efficients

Fig. 4.9 Decomposition using DWT


The DWT is computed by successive low-pass and high-pass filtering of the discrete time-domain signal as shown in figures 4.8 and 4.9. This is called the Mallat algorithm or Mallat-tree decomposition.
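As a sketch of this multilevel (Mallat) decomposition with the settings used later in this work (db8, level 4), the Wavelet Toolbox performs the whole pyramid in one call; the frame vector x is assumed:

% Assumed: x is a windowed speech frame, Wavelet Toolbox available
level = 4;
[C, L] = wavedec(x, level, 'db8');        % C: all coefficients, L: bookkeeping vector
A4 = appcoef(C, L, 'db8', level);         % approximation coefficients at level 4
D = cell(1, level);
for j = 1:level
    D{j} = detcoef(C, L, j);              % detail coefficients at levels 1..4
end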

4.5.3 Wavelet Reconstruction


Getting the original signal back with no (or minimal) loss of information is called reconstruction. It can be done by the inverse discrete wavelet transform (IDWT). Whereas wavelet analysis involves filtering and downsampling, the wavelet reconstruction process consists of upsampling and filtering. Upsampling is the process of lengthening a signal component by inserting zeros between samples.
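A corresponding reconstruction sketch using the inverse DWT (again assuming the Wavelet Toolbox and the [C, L] pair produced by wavedec above); for an orthogonal wavelet the reconstruction error should be near machine precision:

% Assumed: C and L come from a previous wavedec(x, 4, 'db8') call
x_rec = waverec(C, L, 'db8');             % upsampling + filtering through the synthesis bank
err = max(abs(x(:) - x_rec(:)));          % reconstruction error, ideally ~1e-12 or smaller
% A single-level inverse step can likewise be done with idwt(cA, cD, 'db8').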

Fig. 4.10 Signal Reconstruction

Fig. 4.11 Signal Decomposition & Reconstruction


5. FROM SPEECH TO FEATURE VECTORS


The main objective of this stage is to extract the important features that are enough for the recognizer to recognize the words. This chapter describes how to extract information from a speech signal, which means creating feature vectors from the speech signal. A wide range of possibilities exist for parametrically representing a speech signal and its content. The main steps for extracting information are preprocessing, frame blocking & windowing and feature extraction [1].

Fig. 5.1 Main steps in Feature Extraction

5.1 Preprocessing
This is the first step in creating feature vectors. The objective of the pre-processing is to modify the speech signal, x(n), so that it is more suitable for the feature extraction analysis. The preprocessing operations, noise cancelling, pre-emphasis and voice activation detection, can be seen in Fig. 5.2 below.

Fig. 5.2 Pre processing

The first thing to consider is whether the speech, x(n), is corrupted by some noise, d(n), for example an additive disturbance x(n) = s(n) + d(n), where s(n) is the clean speech signal. There are several approaches to perform noise reduction on a noisy speech signal. Two commonly used noise reduction algorithms in the speech recognition context are spectral subtraction and adaptive noise cancellation. A low signal-to-noise ratio (SNR) decreases the performance of the recognizer in a real environment. Some changes to make the speech recognizer more noise robust will be presented later. Note that the order of the operations might be changed for some tasks. For example the noise reduction algorithm, spectral subtraction, is better placed last in the chain (it needs the voice activation detection).

5.1.1 Pre emphasis


There is a need to spectrally flatten the signal. The pre-emphasis, often implemented as a first-order high-pass FIR filter, is used to emphasize the higher frequency components. This stage of feature extraction boosts the amount of energy in the high frequencies. It turns out that if we look at the spectrum for voiced segments like vowels, there is more energy at the lower frequencies than at the higher frequencies. This drop in energy across frequencies (which is called spectral tilt) is caused by the nature of the glottal pulse. Boosting the high frequency energy makes information from these higher formants more available to the acoustic model and improves phone detection accuracy.

Fig. 5.3 Pre emphasis filter

The pre-emphasizer is used to spectrally flatten the speech signal. This is usually done by a high-pass filter. The most commonly used filter for this step is the first-order FIR filter described below:

H(z) = 1 - 0.95 z^(-1)    Equation 5.1

The filter response for this FIR filter can be seen in Fig. 5.3. The filter in the time domain is h(n) = {1, -0.95} and the filtering in the time domain gives the pre-emphasized signal s1(n):

s1(n) = s(n) - 0.95 s(n - 1)    Equation 5.2

The pre-emphasis filter is shown in Fig. 5.3.
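A one-line MATLAB sketch of Equations 5.1 and 5.2 (the coefficient 0.95 follows the filter above; x is the assumed input signal):

% Assumed: x is the input speech signal
s1 = filter([1 -0.95], 1, x);    % s1(n) = x(n) - 0.95*x(n-1), first-order pre-emphasis
freqz([1 -0.95], 1, 512, 8000);  % optional: plot the high-pass response, cf. Fig. 5.3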

5.1.2 Voice Activation Detection (VAD)


The problem of locating the endpoints of an utterance in a speech signal is a major problem for the speech recognizer. Inaccurate endpoint detection will decrease the performance of the speech recognizer. The problem of detecting endpoints seems relatively trivial, but it has been found to be very difficult in practice; only when a fair SNR is given does the task become easier. Some commonly used measurements for finding speech are the short-term energy estimate Es1, or the short-term power estimate Ps1, and the short-term zero crossing rate Zs1. For the speech signal s1(n) these measures are calculated as follows [1]:

Es1(m) = Σ_{n=(m-1)L+1}^{mL} s1(n)²    Equation 5.3

Ps1(m) = (1/L) Σ_{n=(m-1)L+1}^{mL} s1(n)²    Equation 5.4

Zs1(m) = (1/(2L)) Σ_{n=(m-1)L+1}^{mL} |sgn(s1(n)) - sgn(s1(n-1))|    Equation 5.5

where:

sgn(s1(n)) = +1 if s1(n) >= 0, and -1 if s1(n) < 0    Equation 5.6

For each block of L samples these measures compute one value. Note that the index for these functions is m and not n, because these measures do not have to be calculated for every sample (they can, for example, be calculated every 20 ms). The short-term energy estimate will increase when speech is present in s1(n). This is also the case with the short-term power estimate; the only thing that separates them is the scaling by 1/L in the short-term power estimate. The short-term zero crossing rate gives a measure of how many times the signal s1(n) changes sign; it tends to be larger during unvoiced regions. These measures need triggers for deciding where the utterances begin and end. To create a trigger, one needs some information about the background noise. This is done by assuming that the first 10 blocks are background noise. With this assumption the mean and variance for the measures are calculated. To make a more convenient approach, the following function is used:

Ws1(m) = Ps1(m) (1 - Zs1(m)) Sc    Equation 5.7

Using this function both the short-term power and the zero crossing rate are taken into account. Sc is a scale factor for avoiding small values; in a typical application Sc = 1000. The trigger for this function can be described as:

tW = μW + α δW    Equation 5.8

Here μW is the mean and δW is the variance of Ws1(m) calculated for the first 10 blocks. The term α is a constant that has to be fine-tuned according to the characteristics of the signal. After some testing, the following approximation of α gives fairly good voice activation detection at various levels of additive background noise:

Equation 5.9

The voice activation detection function, VAD(m), can now be found as:

VAD(m) = 1 if Ws1(m) >= tW, and 0 otherwise    Equation 5.10

VAD(n) is found as VAD(m) within each block of the measure. For example, if the measures are calculated every 320 samples (block length L = 320), which corresponds to 40 ms at a sampling rate of 8 kHz, the first 320 samples of VAD(n) take the value of VAD(m) with m = 1. Using these settings, VAD(n) is calculated for the speech signal containing the word, as shown in the results.
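A compact sketch of the block measures and the trigger of Equations 5.4-5.10 might look as follows; the block length L = 320, the first 10 noise blocks and Sc = 1000 are taken from the text, while alpha is an assumed placeholder value because Equation 5.9 is not reproduced here:

% Assumed: s1 is the pre-emphasized signal sampled at 8 kHz
L = 320;  Sc = 1000;  alpha = 0.2;             % alpha is an assumed placeholder value
M = floor(length(s1) / L);
W = zeros(1, M);
for m = 1:M
    blk  = s1((m-1)*L+1 : m*L);
    P    = sum(blk.^2) / L;                    % short-term power, Equation 5.4
    Z    = sum(abs(diff(sign(blk)))) / (2*L);  % zero crossing rate, Equation 5.5
    W(m) = P * (1 - Z) * Sc;                   % combined measure, Equation 5.7
end
tW   = mean(W(1:10)) + alpha * var(W(1:10));   % trigger from the first 10 blocks, Equation 5.8
VADm = W >= tW;                                % Equation 5.10: 1 where speech is detected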

5.2 Frame blocking & Windowing


The speech signal is a non-stationary signal, but it can be assumed stationary over 10-20 ms. Framing is used to cut the long speech signal into short-time segments in order to obtain relatively stable frequency characteristics. Features are extracted periodically. The time over which the signal is considered for processing is called a window, and the data acquired in a window is called a frame. Typically features are extracted once every 10 ms, which is called the frame rate. The window duration is typically 20 ms. Thus two consecutive frames have overlapping areas.


Fig. 5.4 Frame blocking & Windowing

5.2.1 Frame blocking


For each utterance of the word, a window duration (Tw) of 320 samples is used for processing at later stages. A frame is formed from the windowed data with a typical frame duration (Tf) of about 200 samples. Since the frame duration is shorter than the window duration there is an overlap of data, and the percentage overlap is given as:

%Overlap = ((Tw - Tf) * 100) / Tw    Equation 5.11

Each frame is K samples long, with adjacent frames being separated by P samples.


Fig. 5.5 Frame blocking of a sequence

By applying the frame blocking to the de-noised signal x(k), one gets M vectors of length K, which correspond to x(k; m) where k = 0, 1, ..., K - 1 and m = 0, 1, ..., M - 1.
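A frame-blocking sketch (assuming the Signal Processing Toolbox buffer function); the frame length K = 320 and the frame step P = 200 samples follow the values given in the text, and x denotes the de-noised, speech-only signal:

% Assumed: x is the speech-only signal
K = 320;  P = 200;                          % example values following the text
frames = buffer(x, K, K - P, 'nodelay');    % each column is one frame; consecutive frames overlap by K-P samples
[Ksamp, M] = size(frames);                  % M frames of K samples, i.e. x(k; m)
% The last column may be zero-padded if the signal length is not a multiple of the frame step.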

5.2.2 Windowing
Windowing concept is used to minimize the signal distortion by using the window to taper the signal to zero at the beginning and end of each frame i.e. to reduce signal discontinuity at either end of the block. The rectangular window (i.e. no window) can cause problems, when we do Fourier analysis; it abruptly cuts of the signal at its boundaries. A good window function has a narrow main lobe and low side lobe levels in their transfer functions, which shrinks the values of the signal toward zero at the window boundaries, avoiding discontinuities. ( ) { Equation 5.12

The most commonly used window function in speech processing is the Hamming window, defined as follows:

w(k) = 0.54 − 0.46 · cos(2πk / (K−1)),  0 ≤ k ≤ K−1        Equation 5.13

By applying w (k) to x (k; m) for all blocks, the windowed signal output is calculated.

The Hamming window function is shown in Fig. 5.6 below:

Fig. 5.6 Hamming Window

Multiplication of the signal by a window function in the time domain corresponds to convolution of their spectra in the frequency domain. The rectangular window gives maximum sharpness but large side-lobes (ripples); the Hamming window blurs in frequency but produces much less leakage.
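A minimal MATLAB sketch combining frame blocking and Hamming windowing is given below. The window length (K = 320) and frame step (200 samples) follow the values quoted above; the function name and interface are illustrative only:

```matlab
% x    : de-noised speech signal (column vector)
% K    : window length in samples (e.g. 320)
% step : frame step in samples (e.g. 200)
function frames = frame_and_window(x, K, step)
    M = floor((length(x) - K) / step) + 1;            % number of frames M
    w = 0.54 - 0.46 * cos(2*pi*(0:K-1)' / (K-1));     % Hamming window, Equation 5.13
    frames = zeros(K, M);
    for m = 1:M
        seg = x((m-1)*step + 1 : (m-1)*step + K);     % block x(k; m), k = 0 ... K-1
        frames(:, m) = seg .* w;                      % apply w(k) to each block
    end
end
```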

5.3 Feature Extraction


A feature extractor should reduce the pattern vector (i.e., the original waveform) to a lower dimension which retains most of the useful information of the original vector. Here we extract features of the input speech signal using Daubechies-8 wavelets at level 4 [4]. The extracted wavelet coefficients provide a compact representation that shows the energy distribution of the signal in time and frequency. In order to further reduce the dimensionality of the extracted feature vectors, statistics over the set of wavelet coefficients are used.


The following features are used in our system:
a) The mean of the absolute value of the coefficients in each sub-band. These features provide information about the frequency distribution of the audio signal.
b) The standard deviation of the coefficients in each sub-band. These features provide information about the amount of change of the frequency distribution.
c) The energy of each sub-band of the signal.
d) The kurtosis of each sub-band of the signal. This measures whether the data are peaked or flat relative to a normal distribution.
e) The skewness of each sub-band of the signal. This measures the symmetry, or lack of symmetry, of the coefficient distribution.

After frame blocking and windowing we obtain a number of frame vectors, i.e. several signals have to be processed for feature extraction at a time. Hence multisignal wavelet analysis is performed on the input frame vectors in MATLAB [13]; a minimal sketch of the per-frame feature computation is given below.
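As an illustration, the per-frame feature extraction with a level-4 db8 DWT could look as follows in MATLAB (Wavelet Toolbox and Statistics Toolbox functions are assumed; the exact ordering and grouping of features in the thesis implementation is not specified, so the layout here is only indicative):

```matlab
% frame : one windowed frame (vector); fv : row vector of sub-band statistics
function fv = wavelet_features(frame)
    level = 4;
    [C, L] = wavedec(frame, level, 'db8');      % level-4 Daubechies-8 decomposition
    fv = [];
    for j = 1:level + 1
        if j <= level
            cj = detcoef(C, L, j);              % detail coefficients of sub-band j
        else
            cj = appcoef(C, L, 'db8', level);   % final approximation coefficients
        end
        % mean |c|, standard deviation, energy, kurtosis and skewness per sub-band
        fv = [fv, mean(abs(cj)), std(cj), sum(cj.^2), kurtosis(cj), skewness(cj)];
    end
end
```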


6. DYNAMIC TIME WARPING


Dynamic time warping (DTW) is an algorithm for measuring similarity between two sequences which may vary in time or speed. For instance, similarities in walking patterns would be detected even if in one video the person was walking slowly and in another he or she was walking more quickly, or even if there were accelerations and decelerations during the course of one observation. DTW has been applied to video, audio, and graphics; indeed, any data which can be turned into a linear representation can be analyzed with DTW. A well-known application is automatic speech recognition, where it is used to cope with different speaking speeds [3]. In general, DTW is a method that allows a computer to find an optimal match between two given sequences (e.g. time series) under certain restrictions. The sequences are "warped" non-linearly in the time dimension to determine a measure of their similarity independent of certain non-linear variations in the time dimension; this sequence alignment method is often used for time series. The recognition process then consists of matching the incoming speech with stored templates. The template with the lowest distance measure from the input pattern is the recognized word, and the best match (lowest distance measure) is found by dynamic programming.

6.1 DTW Algorithm


Speech is a time-dependent process. Hence the utterances of the same word will have different durations, and utterances of the same word with the same duration will differ in the middle, due to different parts of the words being spoken at different rates. To obtain a global distance between two speech patterns (represented as a sequence of vectors) a time alignment must be performed.


6.1.1 DP-Matching Principle


General time-normalized distance definition: Speech can be expressed, by appropriate feature extraction, as a sequence of feature vectors:

A = a1, a2, ..., ai, ..., aI        Equation 6.1

B = b1, b2, ..., bj, ..., bJ        Equation 6.2

Consider the problem of eliminating timing differences between these two speech patterns. In order to clarify the nature of time-axis fluctuation, or timing differences, let us consider an i-j plane, shown in Fig. 6.1, where patterns A and B are developed along the i-axis and j-axis, respectively. When these speech patterns are of the same category, the timing differences between them can be depicted by a sequence of points c = (i, j):

F = c(1), c(2), ..., c(k), ..., c(K),  where c(k) = (i(k), j(k))        Equation 6.3

This sequence can be considered to represent a function which approximately realizes a mapping from the time axis of pattern A onto that of pattern B. Hereafter, it is called a warping function. When there is no timing difference between these patterns, the warping function coincides with the diagonal line j = i. It deviates further from the diagonal line as the timing difference grows [3].

Fig. 6.1 Warping function and adjustment window definition


As a measure of the difference between two feature vectors ai and bj, a distance

d(c) = d(i, j) = ||ai − bj||        Equation 6.4

is employed between them. Then, the weighted summation of distances along warping function F becomes

E(F) = Σ_{k=1}^{K} d(c(k)) · w(k)        Equation 6.5

(where w(k) is a nonnegative weighting coefficient, intentionally introduced to give the E(F) measure a flexible characteristic) and is a reasonable measure of the goodness of warping function F. It attains its minimum value when warping function F is determined so as to optimally adjust the timing difference. This minimum residual distance can be considered to be the distance between patterns A and B that remains after eliminating the timing differences between them, and it is naturally expected to be stable against time-axis fluctuation. Based on these considerations, the time-normalized distance between two speech patterns A and B is defined as follows:

D(A, B) = min_F [ Σ_{k=1}^{K} d(c(k)) · w(k) / Σ_{k=1}^{K} w(k) ]        Equation 6.6

where the denominator Σ_{k=1}^{K} w(k) is employed to compensate for the effect of K (the number of points on the warping function F). The above equation is no more than a fundamental definition of the time-normalized distance; the effective characteristics of this measure depend greatly on the warping function specification and the weighting coefficient definition. The desirable characteristics of the time-normalized distance measure will vary according to the speech pattern properties (especially the time-axis expression of the speech pattern) to be dealt with. Therefore, the present problem is restricted to the most general case, in which the following two conditions hold:
Condition 1: Speech patterns are time-sampled with a common and constant sampling period.
Condition 2: We have no a priori knowledge about which parts of the speech pattern contain linguistically important information. In this case, it is reasonable to consider each part of a speech pattern to contain an equal amount of linguistic information.


6.1.2 Restrictions on Warping Function


Warping function F is a model of time-axis fluctuation in a speech pattern. Accordingly, it should approximate the properties of actual time-axis fluctuation. In other words, function F, when viewed as a mapping from the time axis of pattern A onto that of pattern B, must preserve linguistically essential structures in the pattern A time axis, and vice versa. Essential speech pattern time-axis structures are continuity, monotonicity (or restriction of relative timing in speech), limitation on the acoustic parameter transition speed in speech, and so on. These conditions can be realized as the following restrictions on warping function F, or on the points c(k) = (i(k), j(k)):

1) Monotonic conditions:

i(k−1) ≤ i(k) and j(k−1) ≤ j(k)        Equation 6.7

2) Continuity conditions:

i(k) − i(k−1) ≤ 1 and j(k) − j(k−1) ≤ 1        Equation 6.8

As a result of these two restrictions, the following relation holds between two consecutive points:

c(k−1) = (i(k), j(k)−1), (i(k)−1, j(k)−1), or (i(k)−1, j(k))        Equation 6.9

3) Boundary conditions:

i(1) = 1, j(1) = 1, and i(K) = I, j(K) = J        Equation 6.10

4) Adjustment window condition:

|i(k) − j(k)| ≤ r        Equation 6.11

where r is an appropriate positive integer called the window length. This condition corresponds to the fact that time-axis fluctuation in usual cases never causes an excessively large timing difference.
5) Slope constraint condition:

Neither too steep nor too gentle a gradient should be allowed for warping function F, because such deviations may cause undesirable time-axis warping. Too steep a gradient, for example, causes an unrealistic correspondence between a very short pattern A segment and a relatively long pattern B segment; in such a case, a short segment in a consonant or phoneme transition part may happen to be in good coincidence with an entire steady vowel part. Therefore, a restriction called the slope constraint condition is set upon the slope of warping function F. The slope constraint condition is realized as a restriction on the possible relation among (or the possible configuration of) several consecutive points on the warping function, as is shown in Fig. 6.2(a) and (b). To put it concretely, if point c(k) moves forward in the direction of the i (or j) axis m consecutive times, then point c(k) is not allowed to step further in the same direction before stepping at least n times in the diagonal direction. The effective intensity of the slope constraint can be evaluated by the measure P = n/m.

Fig. 6.2 Slope constraint on warping function

The larger the P measure, the more rigidly the warping function slope is restricted. When P = 0, there is no restriction on the warping function slope. When P = ∞ (that is, m = 0), the warping function is restricted to the diagonal line j = i, and nothing more than conventional pattern matching, with no time normalization, takes place. Generally speaking, if the slope constraint is too severe, time normalization does not work effectively; if it is too lax, discrimination between speech patterns of different categories is degraded. Thus, setting neither a too large nor a too small value for P is desirable. The original study [3] investigated the optimum compromise for the P value through several experiments. In Fig. 6.2(c) and (d), two examples of permissible point c(k) paths under the slope constraint condition P = 1 are shown. The Fig. 6.2(c) type is directly derived from the above definition, while Fig. 6.2(d) is an approximated type with an additional constraint: the second derivative of warping function F is restricted, so that the point c(k) path does not change its direction orthogonally. This extra constraint reduces the number of paths to be searched. Therefore, the simpler Fig. 6.2(d) type is adopted in the following, except for the P = 0 case.

6.1.3 Discussions on Weighting Coefficient


Since the criterion function in Equation 6.6 is a rational expression, its minimization is an unwieldy problem. If the denominator in Equation 6.6,

N = Σ_{k=1}^{K} w(k)        Equation 6.12

(called the normalization coefficient) is independent of warping function F, it can be moved outside the minimization, simplifying the definition to:

D(A, B) = (1/N) · min_F [ Σ_{k=1}^{K} d(c(k)) · w(k) ]        Equation 6.13

This simplified problem can be effectively solved by use of the dynamic programming technique. With the symmetric weighting

w(k) = [i(k) − i(k−1)] + [j(k) − j(k−1)]        Equation 6.14

the normalization coefficient becomes N = I + J, where I and J are the lengths of speech patterns A and B, respectively.

If it is assumed that the time axes i and j are both continuous, then, in the symmetric form, the summation in Equation 6.6 corresponds to an integration along the axis l = i + j. As a result, the time-normalized distance is symmetric, that is D(A, B) = D(B, A), in the symmetric form. Another, more important consequence of the difference in integration axis is that, as shown in Fig. 6.3, the weighting coefficient w(k) reduces to zero in the asymmetric form when the point on the warping function steps in the direction of the j-axis, i.e. c(k) = c(k−1) + (0, 1). This means that some feature vectors bj may be excluded from the integration in the asymmetric form. On the contrary, in the symmetric form the minimum w(k) value is equal to 1, and no exclusion occurs. Since the discussion here is based on the assumption that each part of a speech pattern should be treated equally, the exclusion of any feature vector from the integration should be avoided as far as possible. It can therefore be expected that the symmetric form will give better recognition accuracy than the asymmetric form. However, it should be noted that the slope constraint reduces the situations in which the point on the warping function steps in the j-axis direction, so the difference in performance between the symmetric and asymmetric forms gradually vanishes as the slope constraint is intensified.

Fig. 6.3 Weighting coefficient w(k)

6.2 Practical DP-Matching Algorithm


6.2.1 DP-Equation
The simplified definition of the time-normalized distance D(A, B) given above is one of the typical problems to which the well-known DP principle can be applied. The basic algorithm for calculating Equation 6.13 is written as follows.

Initial condition:

g_1(c(1)) = d(c(1)) · w(1)        Equation 6.15

DP-equation:

g_k(c(k)) = min_{c(k−1)} [ g_{k−1}(c(k−1)) + d(c(k)) · w(k) ]        Equation 6.16

Time-normalized distance:

D(A, B) = (1/N) · g_K(c(K))        Equation 6.17


It is implicitly assumed here that c(0) = (0, 0). Accordingly, w(1) = 2 in the symmetric form and w(1) = 1 in the asymmetric form. By realizing the restrictions on the warping function described in Section 6.1.2 and substituting Equation 6.14 for the weighting coefficient w(k) in Equation 6.16, several practical algorithms can be derived. As one of the simplest examples, the algorithm for the symmetric form, in which no slope constraint is employed (that is, P = 0), is shown here.

Initial condition:

g(1, 1) = 2 · d(1, 1)        Equation 6.18

DP-equation:

g(i, j) = min [ g(i, j−1) + d(i, j),  g(i−1, j−1) + 2 · d(i, j),  g(i−1, j) + d(i, j) ]        Equation 6.19

Restricting condition (adjustment window):

j − r ≤ i ≤ j + r        Equation 6.20

Time-normalized distance:

D(A, B) = (1/N) · g(I, J),  where N = I + J        Equation 6.21

The algorithm, especially the DP-equation, should be modified when the asymmetric form is adopted or some slope constraint is employed. In Table I of [3], algorithms are summarized for both symmetric and asymmetric forms with various slope constraint conditions; there, the DP-equations for the asymmetric forms are given in a somewhat improved form. The first expression in the bracket of the asymmetric-form DP-equation for P = 1, that is [g(i−1, j−2) + d(i, j−1) + d(i, j)]/2, corresponds to the case where c(k−1) = (i(k), j(k)−1) and c(k−2) = (i(k−1)−1, j(k−1)−1). Accordingly, if the definition in Equation 6.14 is strictly obeyed, w(k) is equal to zero while w(k−1) is equal to 1, thus completely omitting d(c(k)) from the summation. In order to avoid this situation to a certain extent, the weighting coefficient w(k−1) = 1 is divided between the two weighting coefficients w(k−1) and w(k). Thus, (d(i, j−1) + d(i, j))/2 is substituted for d(i, j−1) + 0 · d(i, j) in this expression. Similar modifications are applied to the other asymmetric-form DP-equations. In fact, it has been established by a preliminary experiment that this modification significantly improves the asymmetric-form performance [12].
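For concreteness, the symmetric-form algorithm with no slope constraint (P = 0) and an adjustment window of width r can be sketched in MATLAB as follows. This is a simplified illustration rather than the exact implementation used in this work; A and B are feature matrices holding one feature vector per column, and r should be at least |I − J| so that the end point (I, J) stays inside the window:

```matlab
% Symmetric DP-matching (P = 0) with a Sakoe-Chiba adjustment window of width r.
function D = dtw_distance(A, B, r)
    I = size(A, 2);  J = size(B, 2);
    g = inf(I, J);
    d = @(i, j) norm(A(:, i) - B(:, j));        % local distance, Equation 6.4
    g(1, 1) = 2 * d(1, 1);                      % initial condition, Equation 6.18
    for i = 1:I
        for j = max(1, i - r):min(J, i + r)     % adjustment window, Equation 6.20
            if i == 1 && j == 1, continue; end
            best = inf;
            if j > 1,          best = min(best, g(i, j-1)   +     d(i, j)); end
            if i > 1 && j > 1, best = min(best, g(i-1, j-1) + 2 * d(i, j)); end
            if i > 1,          best = min(best, g(i-1, j)   +     d(i, j)); end
            g(i, j) = best;                     % DP-equation, Equation 6.19
        end
    end
    D = g(I, J) / (I + J);                      % time-normalized distance, Equation 6.21
end
```

In recognition, this distance is computed between the input feature sequence and every stored template, and the template giving the smallest distance is taken as the recognized word.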


6.2.2 Calculation Details


The DP-equation for g(i, j) must be recurrently calculated in ascending order with respect to coordinates i and j, starting from the initial condition at (1, 1) up to (I, J). The domain in which the DP-equation must be calculated is specified by

1 ≤ i ≤ I,  1 ≤ j ≤ J        Equation 6.22

and the adjustment window

j − r ≤ i ≤ j + r        Equation 6.23

The optimum DP-algorithm applied to speech recognition was investigated in [3]. The symmetric form was proposed along with the slope constraint technique, and the different variants were compared through theoretical and experimental investigations. The conclusions are as follows: the slope constraint is actually effective, and optimum performance is attained when the slope constraint condition is P = 1. The validity of these results was ensured by a good agreement between the theoretical discussion and the experimental results. The optimized algorithm was then experimentally compared with several other DP-algorithms applied to spoken word recognition by different research groups, and the superiority of the algorithm described in [3] was established.


7. FPGA Implementation
The AccelDSP Synthesis Tool is a product that allows a MATLAB floating-point design to be transformed into a hardware module that can be implemented in a Xilinx FPGA. The AccelDSP Synthesis Tool features an easy-to-use graphical user interface that controls an integrated environment with other design tools such as MATLAB, the Xilinx ISE tools, and other industry-standard HDL simulators and logic synthesizers. AccelDSP synthesis follows this implementation procedure:
a) Reading and analyzing a MATLAB floating-point design.
b) Automatically creating an equivalent MATLAB fixed-point design.
c) Invoking a MATLAB simulation to verify the fixed-point design.
d) Providing the power to quickly explore design trade-offs of algorithms that are optimized for the target FPGA architectures.
e) Creating a synthesizable RTL HDL model and a test bench to ensure bit-true, cycle-accurate design verification.
f) Providing scripts that invoke and control downstream tools such as HDL simulators, RTL logic synthesizers, and the Xilinx ISE implementation tools.
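As an illustration of the kind of MATLAB source such a flow typically starts from, the sketch below shows a sample-streaming function with its state held in a persistent variable, here applied to a simple pre-emphasis filter. This is only an assumed example of the coding style, not the actual design synthesized in this work, and the coefficient 0.95 is likewise an assumption:

```matlab
% Streaming-style MATLAB function: one input sample in, one output sample out,
% filter state carried across calls in a persistent variable.
function y = preemph_stream(x)
    persistent x_prev;
    if isempty(x_prev)
        x_prev = 0;
    end
    y = x - 0.95 * x_prev;   % first-order high-pass, assumed coefficient 0.95
    x_prev = x;
end
```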


The Synthesis flow in AccelDSP ISE can be observed from the following flow chart:

Fig. 7.1 Synthesis flow in AccelDSP


8. SIMULATION & RESULTS


This chapter presents the experimental results obtained with the proposed approach, namely wavelet analysis and Dynamic Time Warping applied to isolated word speech recognition. The effectiveness of the algorithms is assessed through an analysis of the results.

8.1 Input Signal

1) Input speech signal for the word "Speech":

Fig. 8.1 Input speech signal

The input speech signal, with a duration of 5 seconds and a sampling frequency of 8 kHz, is shown above.


8.2 Pre-emphasis

Pre-emphasis output for the word "Speech":

Fig. 8.2 Pre emphasis output

The output is obtained by passing the input speech signal through the pre-emphasis (first-order high-pass) filter. The output spectrum is significantly flatter than that of the input.
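A minimal MATLAB sketch of such a pre-emphasis filter is shown below; the coefficient 0.95 is a typical value and is assumed here, since the exact value used is not stated:

```matlab
% Pre-emphasis: first-order high-pass filter y(n) = x(n) - a*x(n-1)
a = 0.95;                   % assumed pre-emphasis coefficient
x = randn(8000, 1);         % stand-in for one second of speech sampled at 8 kHz
y = filter([1, -a], 1, x);  % pre-emphasised signal
```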


8.3 Voice Activation & Detection

1) Voice activation detection for the word "Speech":

Fig. 8.3 Voice Activation & Detection

The above plot shows the voice-activated region for the word "Speech". The output is 1 for voiced regions and 0 for unvoiced and silence regions. Hence, out of the total samples, only the voice-activated samples are passed on for further processing.


2) Speech signal after voice activation and Detection:

Fig. 8.4 Speech signal after Voice Activation & Detection

After obtaining the voice activation detection output, the regions for which VAD = 1 are extracted for further analysis.


8.4 De-noising

De-noising for the word "Speech":

Fig. 8.5 Speech signal after de-noising

The final de-noised signal is obtained after spectral subtraction; the noise components present in the signal are reduced.
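A very rough MATLAB sketch of magnitude spectral subtraction is given below, assuming the noise spectrum is estimated from the first few frames of the signal; the parameter values and frame handling are illustrative and do not reproduce the thesis implementation:

```matlab
% x           : noisy speech (column vector)
% frameLen    : samples per frame (e.g. 320)
% noiseFrames : leading frames assumed to contain only background noise
function y = spectral_subtract(x, frameLen, noiseFrames)
    nF = floor(length(x) / frameLen);
    y  = zeros(nF * frameLen, 1);
    N  = zeros(frameLen, 1);                      % noise magnitude spectrum estimate
    for m = 1:noiseFrames
        N = N + abs(fft(x((m-1)*frameLen+1 : m*frameLen)));
    end
    N = N / noiseFrames;
    for m = 1:nF
        X   = fft(x((m-1)*frameLen+1 : m*frameLen));
        mag = max(abs(X) - N, 0);                 % subtract noise magnitude, floor at zero
        Y   = mag .* exp(1i * angle(X));          % keep the noisy phase
        y((m-1)*frameLen+1 : m*frameLen) = real(ifft(Y));
    end
end
```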


8.5 Recognition Results

This section provides the experimental results for recognizing the isolated words. In the experiment, the database consists of 10 different words, and 25 utterances of each word are used. The calculation of the recognition rate is given in Equation 8.1 below.

Recognition rate (%) = (Number of times the word is correctly recognized / Total number of utterances of the word) · 100        Equation 8.1

a) The recognition rates for each word using the Daubechies-8 wavelet with level-4 DWT decomposition for the English words are shown in the following table:

Word to be recognized | Number of times correctly recognized | Recognition rate (%)
Matrix   | 24 | 96
Paste    | 24 | 96
Project  | 18 | 72
Speech   | 18 | 72
Window   | 24 | 96
Distance | 20 | 80
India    | 24 | 96
Ubuntu   | 19 | 76
Fedora   | 25 | 100
Android  | 24 | 96

Table 8.1: Recognition rates for English words using db8 and level-4 DWT.

The overall recognition rate for English words using the Daubechies-8 wavelet at level 4 is 88%.
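As a consistency check, the table totals agree with this figure: 24 + 24 + 18 + 18 + 24 + 20 + 24 + 19 + 25 + 24 = 220 correct recognitions out of 250 test utterances, and (220/250) · 100 = 88%.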


b) The recognition rates for each word using the Daubechies-8 wavelet with level-7 DWT decomposition for the English words are shown in the following table:

Word to be recognized | Number of times correctly recognized | Recognition rate (%)
Matrix   | 24 | 96
Paste    | 23 | 92
Project  | 21 | 84
Speech   | 23 | 92
Window   | 24 | 96
Distance | 22 | 88
India    | 25 | 100
Ubuntu   | 21 | 84
Fedora   | 25 | 100
Android  | 25 | 100

Table 8.2: Recognition rates for English words using db8 and level-7 DWT.

The overall recognition rate for English words using the Daubechies-8 wavelet at level 7 is 93.2%.

8.6 FPGA Implementation


The AccelDSP synthesis tool is used to transform the MATLAB design into a hardware module that can be implemented in a Xilinx FPGA. Fig. 8.6 shows the MATLAB result for the recognized word FEDORA. Fig. 8.7 shows the FPGA implementation result for the recognized word FEDORA, obtained with the AccelDSP tool in the Xilinx ISE platform.


Fig. 8.6 MATLAB output of speech recognition for the word FEDORA.


Fig. 8.7 FPGA implementation results for the word FEDORA.


9. CONCLUSION

From this study we could understand and experience the effectiveness of the discrete wavelet transform in feature extraction. The wavelet transform is a more powerful technique for speech processing than the earlier techniques, and the features obtained with it yield higher recognition rates when they are extracted properly. Wavelets are able to distinguish between high-frequency, low-amplitude spectral components and low-frequency, large-amplitude spectral components. In this work the Daubechies-8 mother wavelet was used for feature extraction, with level-4 and level-7 decompositions. The recognition rate for isolated words increases with the decomposition level. Only a limited number of samples were used in this experiment; the performance of the system could be improved by applying further noise reduction or removal algorithms and by training with a larger dataset. The optimized DTW algorithm was applied to the features extracted by wavelet analysis using the Discrete Wavelet Transform, and the slope constraint used in DTW gave effective results. The MATLAB code developed for the speech recognition system was also used for hardware implementation with the AccelDSP tool. The validity of the results obtained with Dynamic Time Warping was ensured by a good agreement between the theoretical discussion and the experimental results.


REFERENCES
[1] Trivedi, Saurabh, Sachin, and Raman, "Speech Recognition by Wavelet Analysis", International Journal of Computer Applications (0975-8887), Vol. 15, No. 8, February 2011.
[2] Lawrence Rabiner and Biing-Hwang Juang, "Fundamentals of Speech Recognition".
[3] Hiroaki Sakoe and Seibi Chiba, "Dynamic Programming Algorithm Optimization for Spoken Word Recognition", IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-26, No. 1, February 1978.
[4] Ingrid Daubechies, "Ten Lectures on Wavelets", SIAM, Philadelphia, 1992.
[5] Ian McLoughlin, "Audio Processing with Matlab Examples".
[6] I. Daubechies, "Orthonormal Bases of Compactly Supported Wavelets", Communications on Pure and Applied Mathematics, Vol. 41, pp. 909-996, November 1988.
[7] Murali Krishnan, Chris P. Neophytou, and Glenn Prescott, "Wavelet Transform Speech Recognition using Vector Quantization, Dynamic Time Warping and Artificial Neural Networks".
[8] George Tzanetakis, Georg Essl, and Perry Cook, "Audio Analysis using the Discrete Wavelet Transform", Organized Sound, Vol. 4(3), 2000.
[9] L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE, Vol. 77, No. 2, pp. 257-286, 1989.
[10] Michael Nilsson and Marcus Ejnarsson, "Speech Recognition using Hidden Markov Model".
[11] S. G. Mallat, "A Theory for Multiresolution Signal Decomposition: The Wavelet Representation", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 11, pp. 674-693, 1989.
[12] Sylvio Barbon Junior, Rodrigo Capobianco Guido, Shi-Huang Chen, Lucimar Sasso Vieira, and Fabricio Lopes Sanchez, "Improved Dynamic Time Warping Based on the Discrete Wavelet Transform", Ninth IEEE International Symposium on Multimedia, 2007.
[13] M. Misiti, Y. Misiti, G. Oppenheim, and J. Poggi, "MATLAB Wavelet Toolbox", The MathWorks, Inc., 2000.
[14] George Tzanetakis, Georg Essl, and Perry Cook, "Audio Analysis using the Discrete Wavelet Transform", Organized Sound, Vol. 4(3), 2000.

[15] Mike Brookes, "Voicebox: Speech Processing Toolbox for Matlab", Department of Electrical & Electronic Engineering, Imperial College, London SW7 2BT, UK.
[16] Daryl Ning, "Developing an Isolated Word Recognition System in Matlab", Matlab Digest, January 2010.

