Abstract

Speech recognition systems have come a long way in the last forty years, but there is still room for improvement. Although readily available, these systems are sometimes inaccurate and insufficient. In an effort to provide a more efficient representation of the speech signal, the application of wavelet analysis is considered. Here we present an effective and robust method for extracting features for speech processing. Based on the time-frequency multi-resolution property of the wavelet transform, the input speech signal is decomposed into various frequency channels, and the original speech can then be recognized using the wavelet transform. The major issues in the design of this wavelet-based speech recognition system are choosing optimal wavelets for speech signals, selecting the decomposition level in the DWT, and selecting the feature vectors from the wavelet coefficients.
Dynamic Time Warping (DTW) is a pattern-matching approach that can be used for limited-vocabulary speech recognition; it is based on a temporal alignment of the input signal with the template models. The main drawback of this method is its high computational cost as the length of the signals increases. The main aim of this project is to provide a modified version of DTW, based on the Discrete Wavelet Transform (DWT), which reduces the original complexity. The Daubechies wavelet family at decomposition levels 4 and 7 is experimented with, and the corresponding results are reported. The proposed approaches are implemented both in software and on an FPGA.
Table of Contents
Abstract
List of Tables
List of Figures
1. INTRODUCTION
   1.1 Definition
   1.2 Application Area, Features & Issues
       1.2.1 Features
       1.2.2 Issues
   1.3 Recognition Systems
       1.3.1 Speaker Dependent / Independent System
       1.3.2 Isolated Word Recognition
       1.3.3 Continuous Speech Recognition
       1.3.4 Vocabulary Size
       1.3.5 Keyword Spotting
   1.4 Objectives
   1.5 Outline
2. LITERATURE SURVEY
   2.1 Advancement in Technology
3. THE SPEECH SIGNAL
   3.1 Speech Production
   3.2 Speech Representation
       3.2.1 Three-state Representation
       3.2.2 Spectral Representation
       3.2.3 Parameterization of the Spectral Activity
   3.3 Technical Characteristics of the Speech Signal
       3.3.1 Bandwidth
       3.3.2 Fundamental Frequency
       3.3.3 Peaks in the Spectrum
       3.3.4 The Envelope of the Power Spectrum
   3.4 Speech Perception Process
4. WAVELET ANALYSIS
   4.1 Definition
   4.2 Fourier Analysis
       4.2.1 Limitations
   4.3 Short-Time Fourier Analysis
       4.3.1 Limitations
   4.4 Types of Wavelets
       4.4.1 Haar Wavelet
       4.4.2 Daubechies-N Wavelet Family
       4.4.3 Advantages of Wavelet Analysis over STFT
   4.5 Wavelet Transform
       4.5.1 Discrete Wavelet Transform
       4.5.2 Multilevel Decomposition of Signal
       4.5.3 Wavelet Reconstruction
5. FROM SPEECH TO FEATURE VECTORS
   5.1 Preprocessing
       5.1.1 Pre-emphasis
       5.1.2 Voice Activation Detection (VAD)
   5.2 Frame Blocking & Windowing
       5.2.1 Frame Blocking
       5.2.2 Windowing
   5.3 Feature Extraction
6. DYNAMIC TIME WARPING
   6.1 DTW Algorithm
       6.1.1 DP-Matching Principle
       6.1.2 Restrictions on Warping Function
       6.1.3 Discussions on Weighting Coefficient
   6.2 Practical DP-Matching Algorithm
       6.2.1 DP-Equation
       6.2.2 Calculation Details
7. FPGA IMPLEMENTATION
8. SIMULATION & RESULTS
   8.1 Input Signal
   8.2 Pre-emphasis
   8.3 Voice Activation & Detection
   8.4 De-noising
   8.5 Recognition Results
   8.6 FPGA Implementation
List of Tables
Table 8.1: Recognition rates for English words using db8 & level 4 DWT
Table 8.2: Recognition rates for English words using db8 & level 7 DWT
List of Figures
Fig. 2.1 Literature survey
Fig. 3.1 Schematic diagram of the speech production/perception process
Fig. 3.2 Human vocal mechanism
Fig. 3.3 Discrete-time speech production model
Fig. 3.4 Three-state representation of a speech signal
Fig. 3.5 Spectrogram using Welch's method
Fig. 4.1 Fourier transform
Fig. 4.2 Short-time Fourier transform
Fig. 4.3 Haar wavelet
Fig. 4.5 Daubechies wavelets
Fig. 4.6 Comparison of wavelet analysis over STFT
Fig. 4.7 Filter functions
Fig. 4.8 Decomposition of DWT coefficients
Fig. 4.9 Decomposition using DWT
Fig. 4.10 Signal reconstruction
Fig. 4.11 Signal decomposition & reconstruction
Fig. 5.1 Main steps in feature extraction
Fig. 5.2 Preprocessing
Fig. 5.3 Pre-emphasis filter
Fig. 5.4 Frame blocking & windowing
Fig. 5.5 Frame blocking of a sequence
Fig. 5.6 Hamming window
Fig. 6.1 Warping function & adjusting window definition
Fig. 6.2 Slope constraint on warping function
Fig. 6.3 Weighting coefficient W(k)
Fig. 7.1 Synthesis flow in AccelDSP
Fig. 8.1 Input speech signal
Fig. 8.2 Pre-emphasis output
Fig. 8.3 Voice Activation & Detection
Fig. 8.4 Speech signal after Voice Activation & Detection
Fig. 8.5 Speech signal after de-noising
Fig. 8.6 Matlab output of speech recognition for the word "FEDORA"
Fig. 8.7 FPGA results for the word "FEDORA"
1. INTRODUCTION
1.1 Definition
Speech recognition is the process of automatically extracting and determining the linguistic information conveyed by a speech signal using computers or electronic circuits. Recent advances in soft computing techniques have brought renewed attention to automatic speech recognition (ASR). Large variation in speech signals, together with other factors such as native accent and varying pronunciation, makes the task very difficult. ASR is hence a complex task and requires considerable intelligence to achieve a good recognition result. Speech recognition is useful in many applications and environments in our daily life. The fundamental purpose of speech is communication, i.e., the transmission of messages. According to Shannon's information theory, a message represented as a sequence of discrete symbols can be quantified by its information content in bits, and the rate of transmission of information is measured in bits per second (bps). In order for communication to take place, a speaker must produce a speech signal in the form of a sound pressure wave that travels from the speaker's mouth to a listener's ears. Although the majority of the pressure wave originates from the mouth, sound also emanates from the nostrils, throat, and cheeks. Speech signals are composed of a sequence of sounds that serve as a symbolic representation for a thought that the speaker wishes to relay to the listener. The arrangement of these sounds is governed by rules associated with a language. The scientific study of language and the manner in which these rules are used in human communication is referred to as linguistics. The science that studies the characteristics of human sound production, especially for the description, classification, and transcription of speech, is called phonetics.
1.2.1 Features
Speech input is easy to perform because it does not require a specialized skill, as typing or pushbutton operation does. Information can be input even when the user is moving or engaged in other activities involving the hands, legs, eyes, or ears. Since a microphone or telephone can be used as an input terminal, inputting information is economical, and remote input can be accomplished over existing telephone networks and the Internet.
1.2.2 Issues
A great deal of redundancy is present in the speech signal, which makes discriminating between the classes difficult. Further issues are temporal and frequency variability, such as intra-speaker variability in the pronunciation of words and phonemes as well as inter-speaker variability (e.g., the effect of regional dialects); context-dependent pronunciation of the phonemes (co-articulation); signal degradation due to additive and convolutional noise present in the background or in the channel; and signal distortion due to non-ideal channel characteristics.
In the future it could be possible to use this information to create a chip that serves as a new interface to humans. For example, it would be desirable to get rid of all remote controls in the home and simply tell the television, stereo, or any other device what to do by voice.
1.4 Objectives
This project covers speaker-independent, small-vocabulary speech recognition with the help of wavelet analysis, using the Dynamic Time Warping method. The project consists of two phases: 1) Training phase: a number of words are trained to extract a model for each word. 2) Recognition phase: a sequence of connected words is entered via microphone or an input file, and the system tries to recognize these words.
Chapter 5 - From Speech to Feature Vectors: This chapter covers the fundamental signal processing applied in a speech recognizer. Topics in this chapter are pre-processing, frame blocking and windowing, and feature extraction.
Chapter 6 - Dynamic Time Warping: This chapter covers the theory and implementation of the Dynamic Time Warping pattern-matching technique. Topics in this chapter are the DTW algorithm and the DP-matching algorithm.
Chapter 7 - FPGA Implementation: This chapter describes the FPGA implementation of the speech recognition system using the AccelDSP tool in Xilinx ISE.
Chapter 8 - Simulation & Results: In this chapter the speech recognizer implemented in Matlab is used to test the recognizer in different cases and determine its efficiency.
Chapter 9 - Conclusions: This chapter summarizes the whole project.
2. LITERATURE SURVEY
Designing a machine that mimics human behavior, particularly the capability of speaking naturally and responding properly to spoken language, has intrigued engineers and scientists for centuries. Since the 1930s, when Homer Dudley of Bell Laboratories proposed a system model for speech analysis and synthesis, the problem of automatic speech recognition has been approached progressively, from a simple machine that responds to a small set of sounds to a sophisticated system that responds to fluently spoken natural language and takes into account the varying statistics of the language in which the speech is produced. Based on major advances in statistical modeling of speech in the 1980s, automatic speech recognition systems today find widespread application in tasks that require a human-machine interface, such as automatic call processing in the telephone network and query-based information systems that do things like provide updated travel information, stock price quotations, weather reports, etc. Speech is the primary means of communication between people. For reasons ranging from technological curiosity about the mechanisms for mechanical realization of human speech capabilities, to the desire to automate simple tasks inherently requiring human-machine interactions, research in automatic speech recognition (and speech synthesis) by machine has attracted a great deal of attention over the past five decades.
In the 1980s, researchers started to tackle large-vocabulary (1,000 words to an unlimited number of words) speech recognition problems based on statistical methods, with a wide range of networks for handling language structures. The key technologies introduced during this period were the hidden Markov model (HMM) [9] and the stochastic language model, which together enabled powerful new methods for handling virtually any continuous speech recognition problem efficiently and with high performance. In the 1990s it became possible to build large-vocabulary systems with unconstrained language models, and constrained task-syntax models for continuous speech recognition and understanding. The key technologies developed during this period were the methods for stochastic language understanding, statistical learning of acoustic and language models, and the introduction of the finite-state transducer framework (and the FSM Library), together with the methods for its determinization and minimization for efficient implementation of large-vocabulary speech understanding systems.
Finally, in the last few years, we have seen the introduction of very large vocabulary systems with full semantic models, integrated with text-to-speech (TTS) synthesis systems, and multi-modal inputs (pointing, keyboards, mice, etc.). These systems enable spoken dialog systems with a range of input and output modalities for ease-of-use and flexibility in handling adverse environments where speech might not be as suitable as other input-output modalities. During this period we have seen the emergence of highly natural speech synthesis systems, the use of machine learning to improve both speech understanding and speech dialogs, and the introduction of mixed-initiative dialog systems to enable user control when necessary. After nearly five decades of research, speech recognition technologies have finally entered the marketplace, benefiting the users in a variety of ways. Throughout the course of development of such systems, knowledge of speech production and perception was used in establishing the technological foundation for the resulting speech recognizers. Major advances, however, were brought about in the 1960s and 1970s via the introduction of advanced speech representations based on LPC analysis and cepstral analysis methods, and in the 1980s through the introduction of rigorous statistical methods based on hidden Markov models [9]. All of this came about because of significant research contributions from academia, private industry and the government. As the technology continues to mature, it is clear that many new applications will emerge and become part of our way of life thereby taking full advantage of machines that are partially able to mimic human speech capabilities.
Five different elements, A. Speech formulation, B. Human vocal mechanism, C. Acoustic air, D. Perception of the ear, E. Speech comprehension, will be examined more carefully in the following sections. The first element (A. Speech formulation) is associated with the formulation of the speech signal in the talker's mind. This formulation is used by the human vocal mechanism (B. Human vocal mechanism) to produce the actual speech waveform. The waveform is transferred via the air (C. Acoustic air) to the listener. During this transfer the acoustic wave can be affected by external sources, for example noise, resulting in a more complex waveform. When the wave reaches the listener's hearing system (the ears), the listener perceives the waveform (D. Perception of the ear), and the listener's mind (E. Speech comprehension) starts processing this waveform to comprehend its content, so that the listener understands what the talker is trying to tell him or her.
To understand how the production of speech is performed, one needs to know how the human vocal mechanism is constructed, as shown in Fig. 3.2.
The most important parts of the human vocal mechanism are the vocal tract together with the nasal cavity, which begins at the velum. The velum is a trapdoor-like mechanism that is used to formulate nasal sounds when needed. When the velum is lowered, the nasal cavity is coupled to the vocal tract to formulate the desired speech signal. The cross-sectional area of the vocal tract is limited by the tongue, lips, jaw, and velum, and varies from 0 to 20 cm². When humans produce speech, air is expelled from the lungs through the trachea. The air flowing from the lungs causes the vocal cords to vibrate, and by shaping the vocal tract with the lips, tongue, and jaw, and possibly using the nasal cavity, different sounds can be produced. Important parts of the discrete-time speech production model, in the field of speech recognition and signal processing, are u(n), the gain b0, and H(z). The impulse generator acts like the lungs, exciting the glottal filter G(z) and producing u(n). G(z) is to be regarded as the vocal cords in the human vocal mechanism. The signal u(n) can be seen as the excitation signal entering the vocal tract and the nasal cavity, formed by exciting the vocal cords with air from the lungs.
The gain b0 is a factor related to the volume of the speech being produced: a larger gain b0 gives louder speech, and vice versa. The vocal tract filter H(z) is a model of the vocal tract and the nasal cavity. The lip radiation filter R(z) is a model of the formation of the human lips to produce different sounds.
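The chain described above, impulse generator, gain b0, and vocal-tract filter H(z), can be sketched in a few lines of code. The thesis itself works in Matlab; the following is a pure-Python sketch in which the sampling rate, pitch, single resonance, and gain values are illustrative assumptions, not values taken from this project:

```python
import math

def impulse_train(n_samples, period):
    """Excitation u(n): one impulse every `period` samples, mimicking glottal pulses."""
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]

def all_pole_filter(x, a):
    """Minimal all-pole vocal-tract model H(z) = 1/A(z): y(n) = x(n) - sum_k a_k y(n-k)."""
    y = []
    for n, xn in enumerate(x):
        acc = xn
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * y[n - k]
        y.append(acc)
    return y

fs = 8000                        # sampling rate (Hz), illustrative
f0 = 100                         # fundamental frequency (Hz), illustrative
r, f_res = 0.95, 700.0           # one resonance standing in for a formant
theta = 2 * math.pi * f_res / fs
a = [-2 * r * math.cos(theta), r * r]   # denominator coefficients of H(z)
b0 = 0.5                                # gain: larger b0 means louder speech

u = impulse_train(400, fs // f0)        # u(n), excitation entering the tract
speech = [b0 * s for s in all_pole_filter(u, a)]
```

The lip-radiation filter R(z) is omitted here for brevity; a fuller model would cascade it after H(z).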
The upper plot, Fig. 3.4(a), contains the whole speech sequence; the middle plot, Fig. 3.4(b), reproduces part of it by zooming into an area of the whole speech sequence. At the bottom of Fig. 3.4 the segmentation into a three-state representation is given, in relation to the different parts of the middle plot.
Here the darkest (dark blue) parts represent the portions of the speech waveform where no speech is produced, and the lighter (red) parts represent higher intensity where speech is present.
3.3.1 Bandwidth
The bandwidth of the speech signal is much higher than 4 kHz. In fact, for the fricatives there is still a significant amount of energy in the spectrum at high and even ultrasonic frequencies. However, as we all know from using the (analog) phone, a bandwidth of 4 kHz appears to contain all the information necessary to understand a human voice.
4. WAVELET ANALYSIS
4.1 Definition
A wavelet is a wave-like oscillation with amplitude that starts out at zero, increases, and then decreases back to zero. It can typically be visualized as a "brief oscillation" like one might see recorded by a seismograph or heart monitor. Generally, wavelets are purposefully crafted to have specific properties that make them useful for signal processing. Wavelets can be combined, using a "reverse, shift, multiply and sum" technique called convolution, with portions of an unknown signal to extract information from it. The fundamental idea behind wavelets is to analyze according to scale. The wavelet analysis procedure is to adopt a wavelet prototype function, called an analyzing wavelet or mother wavelet. Any speech signal can then be represented by translated and scaled versions of the mother wavelet. Wavelet analysis is capable of revealing aspects of data that other speech signal analysis techniques miss; the extracted features are then passed to a classifier for the recognition of isolated words [4]. The integral wavelet transform is the integral transform defined as:

W(a, b) = (1/√a) ∫ x(t) ψ((t − b)/a) dt        Equation 4.1
Where a is positive and defines the scale, and b is any real number and defines the shift. For decomposition of the speech signal, different techniques can be used, such as Fourier analysis, the Short-Time Fourier Transform (STFT), and wavelet transform techniques. Here, we explain the necessity and advantages of wavelet analysis by first considering Fourier analysis and its limitations, then its modification to the Short-Time Fourier Transform and its limitations, and finally wavelet analysis.
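A direct numerical reading of Equation 4.1 helps make the roles of a and b concrete. The sketch below approximates the integral by a Riemann sum, using the Mexican-hat function as an illustrative mother wavelet (the thesis uses Daubechies wavelets; this choice is an assumption made only because it has a simple closed form):

```python
import math

def mexican_hat(t):
    """An illustrative mother wavelet: second derivative of a Gaussian."""
    return (1 - t * t) * math.exp(-t * t / 2)

def wavelet_coeff(x, ts, a, b):
    """Riemann-sum approximation of W(a, b) = (1/sqrt(a)) * integral x(t) psi((t-b)/a) dt."""
    dt = ts[1] - ts[0]
    return sum(x(t) * mexican_hat((t - b) / a) for t in ts) * dt / math.sqrt(a)

ts = [i * 0.01 for i in range(-500, 500)]
bump = lambda t: math.exp(-t * t)            # a transient centred at t = 0

w_centred = wavelet_coeff(bump, ts, a=1.0, b=0.0)   # wavelet sits on the event
w_shifted = wavelet_coeff(bump, ts, a=1.0, b=4.0)   # wavelet sits away from it
```

Shifting b away from the transient makes the coefficient collapse toward zero, which is precisely the time localization that plain Fourier analysis lacks.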
4.2.1 Limitations
But Fourier analysis has a serious drawback: in transforming to the frequency domain, time information is lost. When looking at the Fourier transform of a signal, it is impossible to tell when a particular event took place. If a signal does not change much over time, i.e. if it is what is called a stationary signal, this drawback is not very important. However, most interesting signals contain numerous non-stationary or transitory characteristics: drift, trends, abrupt changes, and beginnings and ends of events. These characteristics are often the most important part of the signal, and Fourier analysis is not suited to detecting them.
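A tiny experiment makes this concrete: an impulse at sample 5 and an impulse at sample 50 have exactly the same Fourier magnitude spectrum, so the magnitude alone cannot say when the event occurred. A pure-Python DFT sketch (illustrative, not part of the thesis code):

```python
import cmath

def dft_mag(x):
    """Magnitude spectrum of a length-N DFT, computed directly from the definition."""
    N = len(x)
    return [abs(sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N)))
            for k in range(N)]

N = 64
early = [1.0 if n == 5 else 0.0 for n in range(N)]    # event at n = 5
late  = [1.0 if n == 50 else 0.0 for n in range(N)]   # same event at n = 50

# The two magnitude spectra are identical, so |X(k)| cannot tell *when*.
same = all(abs(a - b) < 1e-9 for a, b in zip(dft_mag(early), dft_mag(late)))
```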
The STFT represents a sort of compromise between the time- and frequency-based views of a signal. It provides some information about both when and at what frequencies a signal event occurs.
4.3.1 Limitations
However, this information can only be obtained with limited precision, and that precision is determined by the size of the window. While the STFT's compromise between time and frequency information can be useful, the drawback is that once a particular size for the time window is chosen, that window is the same for all frequencies. If a wider window is chosen, it gives better frequency resolution but poor time resolution; a narrower window gives good time resolution but poor frequency resolution. Many signals require a more flexible approach, one where the window size can be varied to determine more accurately either time or frequency.
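A minimal STFT makes the trade-off visible: a short window yields many frames (fine time resolution) but few frequency bins, while a long window does the reverse. A pure-Python sketch; the Hann window and the specific lengths here are illustrative assumptions:

```python
import cmath, math

def stft_mag(x, win_len, hop):
    """Magnitude STFT with a Hann window: one direct DFT per frame."""
    win = [0.5 - 0.5 * math.cos(2 * math.pi * n / (win_len - 1)) for n in range(win_len)]
    frames = []
    for start in range(0, len(x) - win_len + 1, hop):
        frame = [x[start + n] * win[n] for n in range(win_len)]
        frames.append([abs(sum(frame[n] * cmath.exp(-2j * cmath.pi * k * n / win_len)
                               for n in range(win_len))) for k in range(win_len // 2)])
    return frames

x = [math.sin(2 * math.pi * 0.1 * n) for n in range(256)]
narrow = stft_mag(x, win_len=32, hop=16)   # many frames, only 16 coarse bins
wide   = stft_mag(x, win_len=128, hop=64)  # few frames, 64 fine bins
```

The same signal analyzed both ways shows that no single window length gives good resolution in both axes at once.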
In general the Daubechies wavelets are chosen to have the highest number A of vanishing moments (this does not imply the best smoothness) for a given support width N = 2A, and among the 2^(A−1) possible solutions the one is chosen whose scaling filter has extremal phase. The wavelet transform is also easy to put into practice using the fast wavelet transform. Daubechies wavelets are widely used in solving a broad range of problems, e.g. self-similarity properties of a signal, fractal problems, signal discontinuities, etc. The properties of the Daubechies wavelets [6]: the support length of the wavelet function ψ and the scaling function φ is 2N − 1; the number of vanishing moments of ψ is N; most dbN are not symmetrical; the regularity increases with the order; when N becomes very large, ψ and φ belong to C^(μN), where μ is approximately equal to 0.2. The Daubechies-8 wavelet is used for decomposition of the speech signal, as it needs the minimum support size for the given number of vanishing moments. The names of the Daubechies family wavelets are written dbN, where N is the order and db is short for Daubechies. The db1 wavelet, as mentioned above, is the same as Haar. Here are the next nine members of the family:
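The vanishing-moment property quoted above can be checked numerically for db2 (A = 2, four filter taps): the high-pass filter g annihilates constants and linear ramps. A pure-Python sketch using the standard db2 scaling-filter coefficients as tabulated by Daubechies:

```python
import math

s3, s2 = math.sqrt(3), math.sqrt(2)

# db2 scaling (low-pass) filter coefficients.
h = [(1 + s3) / (4 * s2), (3 + s3) / (4 * s2),
     (3 - s3) / (4 * s2), (1 - s3) / (4 * s2)]

# Wavelet (high-pass) filter via the quadrature-mirror relation g[k] = (-1)^k h[3-k].
g = [((-1) ** k) * h[3 - k] for k in range(4)]

# A = 2 vanishing moments: the first two moments of g are (numerically) zero.
moment0 = sum(g)
moment1 = sum(k * gk for k, gk in enumerate(g))
```

Both moments come out as zero up to floating-point error, confirming that db2 suppresses constant and linearly varying portions of the signal, leaving the detail coefficients to capture faster variation.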
The time-based, frequency-based, and STFT views of a signal can be compared with the wavelet view. One major advantage afforded by wavelets is the ability to perform local analysis, i.e., to analyze a localized area of a larger signal.
W(a, b) = (1/√a) ∫ x(t) ψ((t − b)/a) dt        Equation 4.5

Where ψ(t) is a time function with finite energy and fast decay, called the mother wavelet.
$x(t) = \sum_{k} a(L,k)\, 2^{-L/2}\, \phi(2^{-L}t - k) + \sum_{j=1}^{L} \sum_{k} d(j,k)\, 2^{-j/2}\, \psi(2^{-j}t - k)$   Equation 4.6

The function $\psi(t)$ is known as the mother wavelet, while $\phi(t)$ is known as the scaling function. The set of functions $\{2^{-L/2}\phi(2^{-L}t-k),\; 2^{-j/2}\psi(2^{-j}t-k)\;:\; j \le L;\; j, k, L \in Z\}$, where Z is the set of integers, is an orthonormal basis for $L^2(R)$. The numbers $a(L,k)$ are known as the approximation coefficients at scale L, while $d(j,k)$ are known as the detail coefficients at scale j. The approximation and detail coefficients can be expressed as:

$a(L,k) = 2^{-L/2} \int x(t)\, \phi(2^{-L}t - k)\, dt$   Equation 4.7

$d(j,k) = 2^{-j/2} \int x(t)\, \psi(2^{-j}t - k)\, dt$   Equation 4.8
The DWT analysis can be performed using a fast, pyramidal algorithm related to multirate filter banks. As a multirate filter bank, the DWT can be viewed as a constant-Q filter bank with octave spacing between the centers of the filters. Each sub-band contains half the samples of the neighboring higher-frequency sub-band. In the pyramidal algorithm the signal is analyzed at different frequency bands with different resolutions by decomposing the signal into a coarse approximation and detail information. The coarse approximation is then further decomposed using the same wavelet decomposition step. This is achieved by successive high-pass and low-pass filtering of the time-domain signal and is defined by the following equations:

$y_{low}[n] = \sum_{k} x[k]\, g[2n - k]$   Equation 4.9

$y_{high}[n] = \sum_{k} x[k]\, h[2n - k]$   Equation 4.10
The signal x[n] is passed through low-pass and high-pass filters and is down-sampled by 2:

$y_{low}[n] = (x * g) \downarrow 2$   Equation 4.11

$y_{high}[n] = (x * h) \downarrow 2$   Equation 4.12
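A minimal sketch of this filter-and-downsample step, using the Haar (db1) filter pair for brevity (the thesis uses db8); the input vector is an invented toy example:

```python
import numpy as np

def analysis_step(x, g, h):
    """One DWT analysis step: filter, then keep every second sample."""
    ylow = np.convolve(x, g)[1::2]    # approximation
    yhigh = np.convolve(x, h)[1::2]   # detail
    return ylow, yhigh

# Haar (db1) filters, used here only to keep the sketch short.
g = np.array([1.0, 1.0]) / np.sqrt(2)   # low-pass
h = np.array([1.0, -1.0]) / np.sqrt(2)  # high-pass

a, d = analysis_step(np.array([1.0, 1.0, 2.0, 2.0]), g, h)
print(a)  # pairwise sums scaled by 1/sqrt(2)
print(d)  # pairwise differences: zero for this input of equal pairs
```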
In the DWT, each level is calculated by passing the previous level's approximation coefficients through high-pass and low-pass filters.
The DWT is computed by successive low-pass and high-pass filtering of the discrete time-domain signal, as shown in Figures 4.8 and 4.9. This is called the Mallat algorithm or Mallat-tree decomposition.
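The Mallat pyramid simply repeats the filter-and-downsample step on successive approximations. A self-contained sketch (Haar filters again for brevity; the function name mimics the MATLAB/PyWavelets convention but is defined here):

```python
import numpy as np

def wavedec(x, g, h, level):
    """Mallat pyramid: repeatedly split the approximation into (detail, coarser approximation)."""
    coeffs = []
    a = np.asarray(x, dtype=float)
    for _ in range(level):
        d = np.convolve(a, h)[1::2]   # detail at this level
        a = np.convolve(a, g)[1::2]   # coarser approximation
        coeffs.append(d)
    coeffs.append(a)
    return coeffs  # [d1, d2, ..., dL, aL]

g = np.array([1.0, 1.0]) / np.sqrt(2)
h = np.array([1.0, -1.0]) / np.sqrt(2)
out = wavedec(np.ones(8), g, h, level=3)
print([len(c) for c in out])  # sub-band lengths halve at each level: [4, 2, 1, 1]
```

For a constant input all detail sub-bands are zero and the energy collapses into the final approximation, which is exactly the "coarse approximation plus details" behavior described above.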
5.1 Preprocessing
This is the first step in creating feature vectors. The objective of pre-processing is to modify the speech signal, x(n), so that it is more suitable for the feature extraction analysis. The pre-processing operations noise cancelling, pre-emphasis, and voice activation detection can be seen in the figure below.
The first thing to consider is whether the speech, x(n), is corrupted by some noise, d(n), for example an additive disturbance x(n) = s(n) + d(n), where s(n) is the clean speech signal. There are several approaches to perform noise reduction on a noisy speech signal. Two commonly used noise reduction algorithms in the speech recognition context are spectral subtraction and adaptive noise cancellation. A low signal-to-noise ratio (SNR) decreases the performance of the recognizer in a real environment. Some changes to make the speech recognizer more noise robust will be presented later. Note that the order of the operations might be reordered for some tasks. For example, the noise reduction algorithm spectral subtraction is better placed last in the chain (it needs the voice activation detection).
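A minimal single-frame sketch of magnitude spectral subtraction (a real system operates per STFT frame with a running noise estimate; the function name, spectral floor, and toy signal here are illustrative assumptions):

```python
import numpy as np

def spectral_subtract(x, noise_mag, floor=0.01):
    """Subtract an estimated noise magnitude spectrum from the signal spectrum,
    keeping a small spectral floor, and resynthesize with the original phase."""
    X = np.fft.rfft(x)
    mag = np.maximum(np.abs(X) - noise_mag, floor * np.abs(X))
    return np.fft.irfft(mag * np.exp(1j * np.angle(X)), n=len(x))

# Sanity check: subtracting a zero noise estimate returns the signal unchanged.
x = np.sin(2 * np.pi * np.arange(256) / 16)
y = spectral_subtract(x, np.zeros(129))  # rfft of 256 samples gives 129 bins
print(np.allclose(x, y))  # True
```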
The pre-emphasizer is used to spectrally flatten the speech signal. This is usually done by a high-pass filter. The most commonly used filter for this step is the FIR filter described below:

$H(z) = 1 - 0.95\, z^{-1}$   Equation 5.1

The filter response for this FIR filter can be seen in the figure. The filter in the time domain is h(n) = {1, -0.95}, and filtering in the time domain gives the pre-emphasized signal $s_1(n)$:

$s_1(n) = s(n) - 0.95\, s(n-1)$   Equation 5.2
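The pre-emphasis difference equation is a one-liner; a sketch (the constant input is a toy example chosen to show that DC content is almost entirely suppressed):

```python
import numpy as np

def preemphasize(s, alpha=0.95):
    """s1(n) = s(n) - alpha * s(n-1); the first sample is passed through unchanged."""
    s = np.asarray(s, dtype=float)
    return np.append(s[0], s[1:] - alpha * s[:-1])

p = preemphasize(np.array([1.0, 1.0, 1.0, 1.0]))
print(p)  # a constant (DC) input is attenuated to 0.05 after the first sample
```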
To detect where the utterance begins and ends, some short-term measures are calculated on the pre-emphasized signal:

$E_{s_1}(m) = \sum_{n=m-L+1}^{m} s_1^2(n)$   Equation 5.3

$P_{s_1}(m) = \frac{1}{L} \sum_{n=m-L+1}^{m} s_1^2(n)$   Equation 5.4

$Z_{s_1}(m) = \frac{1}{2L} \sum_{n=m-L+1}^{m} \left|\, \mathrm{sgn}(s_1(n)) - \mathrm{sgn}(s_1(n-1)) \,\right|$   Equation 5.5

$\mathrm{sgn}(x) = \begin{cases} 1, & x \ge 0 \\ -1, & x < 0 \end{cases}$   Equation 5.6
For each block of L samples these measures calculate some value. Note that the index for these functions is m and not n; this is because the measures do not have to be calculated for every sample (they can, for example, be calculated every 20 ms). The short-term energy estimate will increase when speech is present in s1(n). This is also the case for the short-term power estimate; the only thing that separates them is the scaling by 1/L. The short-term zero-crossing rate gives a measure of how many times the signal s1(n) changes sign, and tends to be larger during unvoiced regions. These measures need a trigger for deciding where the utterances begin and end. To create a trigger, one needs some information about the background noise, which is obtained by assuming that the first 10 blocks are background noise. With this assumption the mean and variance of the measures are calculated. For a more convenient approach, the following function is used:

$W_{s_1}(m) = P_{s_1}(m)\,(1 - Z_{s_1}(m))\, S_c$   Equation 5.7

Using this function, both the short-term power and the zero-crossing rate are taken into account. $S_c$ is a scale factor for avoiding small values; in a typical application $S_c = 1000$. The trigger for this function can be described as:

$t_W = \mu_W + \alpha\, \delta_W$   Equation 5.8

where $\mu_W$ is the mean and $\delta_W$ is the variance of $W_{s_1}(m)$ calculated over the first 10 blocks. The term $\alpha$ is a constant that has to be fine-tuned according to the characteristics of the signal. After some testing, the following approximation of $\alpha$ gives fairly good voice activation detection at various levels of additive background noise:

$\alpha = 0.2\, \delta_W^{-0.8}$   Equation 5.9

The voice activation detection function, VAD(m), can now be found as:

$VAD(m) = \begin{cases} 1, & W_{s_1}(m) \ge t_W \\ 0, & W_{s_1}(m) < t_W \end{cases}$   Equation 5.10
VAD(n) is obtained from VAD(m) by extending each block's decision over the samples of that block. For example, if the measures are calculated every 320 samples (block length L = 320), which corresponds to 40 ms at a sampling rate of 8 kHz, then the first 320 samples of VAD(n) take the value of VAD(m) for m = 1. Using these settings, VAD(n) is calculated for the speech signal shown in the results.
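A rough block-wise sketch of this detector follows. It is a simplification, not the report's exact algorithm: the threshold uses the standard deviation with a hand-picked constant instead of the fine-tuned variance term of Equations 5.8 and 5.9, and the test signal, block length, and constants are invented for the example.

```python
import numpy as np

def vad(s1, L=320, sc=1000, noise_blocks=10, alpha=4.0):
    """Block-wise voice activation detection sketch.
    Simplification: trigger = mean + alpha * std over the first noise-only blocks."""
    nblocks = len(s1) // L
    w = np.empty(nblocks)
    for m in range(nblocks):
        blk = s1[m * L:(m + 1) * L]
        p = np.mean(blk ** 2)                           # short-term power estimate
        z = np.mean(np.abs(np.diff(np.sign(blk)))) / 2  # short-term zero-crossing rate
        w[m] = p * (1 - z) * sc                         # combined measure W(m)
    t = w[:noise_blocks].mean() + alpha * w[:noise_blocks].std()
    return (w > t).astype(int)

rng = np.random.default_rng(0)
sig = 0.01 * rng.standard_normal(20 * 320)            # 20 blocks of background noise
sig[12 * 320:16 * 320] += np.sin(np.arange(4 * 320))  # a loud "voiced" burst in blocks 12-15
flags = vad(sig)
print(flags[:10].sum(), flags[12:16].tolist())
```

The loud burst has far higher short-term power and a lower zero-crossing rate than the noise, so its blocks trip the trigger while the noise-only blocks used to form the trigger do not.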
By applying frame blocking to the de-noised signal x(k), one obtains M vectors of length K, which correspond to x(k; m), where k = 0, 1, ..., K-1 and m = 0, 1, ..., M-1.
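A sketch of frame blocking (the frame length, overlap, and toy input are illustrative; real systems typically use overlapping frames of 20 to 40 ms):

```python
import numpy as np

def frame_block(x, K, overlap=0):
    """Split x into M frames of length K with hop = K - overlap; trailing samples
    that do not fill a whole frame are dropped."""
    hop = K - overlap
    M = 1 + (len(x) - K) // hop
    return np.stack([x[m * hop:m * hop + K] for m in range(M)])

frames = frame_block(np.arange(10.0), K=4, overlap=2)
print(frames.shape)  # (4, 4)
print(frames[1])     # [2. 3. 4. 5.]
```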
5.2.2 Windowing
The windowing concept is used to minimize signal distortion by using a window to taper the signal to zero at the beginning and end of each frame, i.e., to reduce the signal discontinuity at either end of the block. The rectangular window (i.e., no window) can cause problems when we do Fourier analysis, because it abruptly cuts off the signal at its boundaries. A good window function has a narrow main lobe and low side-lobe levels in its transfer function, and shrinks the values of the signal toward zero at the window boundaries, avoiding discontinuities. The rectangular window is defined as:

$w(k) = \begin{cases} 1, & 0 \le k \le K-1 \\ 0, & \text{otherwise} \end{cases}$   Equation 5.12
The most commonly used window function in speech processing is the Hamming window, defined as follows:

$w(k) = 0.54 - 0.46\, \cos\!\left(\frac{2\pi k}{K-1}\right), \quad 0 \le k \le K-1$   Equation 5.13
By applying w (k) to x (k; m) for all blocks, the windowed signal output is calculated.
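Equation 5.13 evaluated directly (the short frame length K = 8 is chosen only to keep the printout small):

```python
import numpy as np

K = 8
k = np.arange(K)
w = 0.54 - 0.46 * np.cos(2 * np.pi * k / (K - 1))  # Hamming window, Eq. 5.13
windowed = w * np.ones(K)                          # applied to a constant frame
print(round(w[0], 2), round(w[K - 1], 2))          # 0.08 0.08: tapered at both ends
```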
Multiplication of the signal by a window function in the time domain is equivalent to convolving their spectra in the frequency domain. The rectangular window gives maximum sharpness but large side lobes (ripples); the Hamming window blurs in frequency but produces much less leakage.
The following features are used in our system: the mean of the absolute value of the coefficients in each sub-band, which provides information about the frequency distribution of the audio signal; the standard deviation of the coefficients in each sub-band, which provides information about the amount of change in the frequency distribution; the energy of each sub-band of the signal; the kurtosis of each sub-band, which measures whether the data are peaked or flat relative to a normal distribution; and the skewness of each sub-band, which measures the symmetry or lack of symmetry. After frame blocking and windowing we get different frame vectors, i.e., different signals that have to be loaded at a time to extract the features. Hence, multi-signal wavelet analysis is performed on the input frame vectors using MATLAB [13].
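A sketch of assembling such a five-statistics-per-sub-band feature vector (plain NumPy; the random stand-in coefficients are illustrative, whereas in the thesis they come from the db8 DWT of each frame):

```python
import numpy as np

def subband_features(coeffs):
    """Five statistics per wavelet sub-band: mean |c|, std, energy, kurtosis, skewness.
    Assumes each sub-band is non-constant (std > 0)."""
    feats = []
    for c in coeffs:
        c = np.asarray(c, dtype=float)
        mu, sd = c.mean(), c.std()
        feats += [
            np.abs(c).mean(),               # mean absolute value
            sd,                             # standard deviation
            np.sum(c ** 2),                 # sub-band energy
            np.mean(((c - mu) / sd) ** 4),  # kurtosis
            np.mean(((c - mu) / sd) ** 3),  # skewness
        ]
    return np.array(feats)

rng = np.random.default_rng(1)
coeffs = [rng.standard_normal(64), rng.standard_normal(32)]  # stand-ins for two sub-bands
feats = subband_features(coeffs)
print(feats.shape)  # (10,): 5 statistics for each of the 2 sub-bands
```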
Consider the problem of eliminating timing differences between these two speech patterns. In order to clarify the nature of time-axis fluctuation, or timing differences, let us consider an i-j plane, shown in Fig. 6.1, where patterns A and B are developed along the i-axis and j-axis, respectively. When these speech patterns are of the same category, the timing differences between them can be depicted by a sequence of points c = (i, j):

$F = c(1), c(2), \ldots, c(k), \ldots, c(K)$   Equation 6.3

where $c(k) = (i(k), j(k))$. This sequence can be considered to represent a function which approximately realizes a mapping from the time axis of pattern A onto that of pattern B. Hereafter, it is called a warping function. When there is no timing difference between these patterns, the warping function coincides with the diagonal line j = i. It deviates further from the diagonal line as the timing difference grows [3].
As a measure of the difference between two feature vectors $a_i$ and $b_j$, a distance

$d(c) = d(i, j) = \| a_i - b_j \|$   Equation 6.4
is employed between them. Then, the weighted summation of distances along warping function F becomes

$E(F) = \sum_{k=1}^{K} d(c(k))\, w(k)$   Equation 6.5
(where w(k) is a nonnegative weighting coefficient, intentionally introduced to allow the E(F) measure a flexible characteristic), and is a reasonable measure of the goodness of warping function F. It attains its minimum value when F is determined so as to optimally adjust the timing difference. This minimum residual distance can be considered a distance between patterns A and B, remaining after eliminating the timing differences between them, and is naturally expected to be stable against time-axis fluctuation. Based on these considerations, the time-normalized distance between two speech patterns A and B is defined as follows:
$D(A, B) = \min_{F} \left[ \frac{\sum_{k=1}^{K} d(c(k))\, w(k)}{\sum_{k=1}^{K} w(k)} \right]$   Equation 6.6
where the denominator $\sum_{k=1}^{K} w(k)$ is employed to compensate for the effect of K (the number of points on the warping function F). The above equation is no more than a fundamental definition of the time-normalized distance. The effective characteristics of this measure depend greatly on the warping function specification and the weighting coefficient definition. Desirable characteristics of the time-normalized distance measure will vary according to the speech pattern properties (especially the time-axis expression of the speech patterns) to be dealt with. Therefore, the present problem is restricted to the most general case, where the following two conditions hold. Condition 1: Speech patterns are time-sampled with a common and constant sampling period. Condition 2: We have no a priori knowledge about which parts of a speech pattern contain linguistically important information. In this case, it is reasonable to consider each part of a speech pattern to contain an equal amount of linguistic information.
As a result of these two restrictions (monotonicity and continuity), the following relation holds between two consecutive points:

$c(k-1) \in \{\, (i(k)-1,\; j(k)),\;\; (i(k)-1,\; j(k)-1),\;\; (i(k),\; j(k)-1) \,\}$   Equation 6.9
3) Boundary conditions: i(1) = 1, j(1) = 1, and i(K) = I, j(K) = J.
4) Adjustment window condition:

$|\, i(k) - j(k) \,| \le r$

where r is an appropriate positive integer called the window length. This condition corresponds to the fact that time-axis fluctuations in usual cases never cause too excessive a timing difference.
5) Slope constraint condition:
Neither too steep nor too gentle a gradient should be allowed for warping function F, because such deviations may cause undesirable time-axis warping. Too steep a gradient, for example, causes an unrealistic correspondence between a very short pattern A segment and a relatively long pattern B segment. Then, such a case occurs where a short segment in a consonant or phoneme-transition part happens to be in good coincidence with an entire steady vowel part. Therefore, a restriction called the slope constraint condition was set upon the warping function F, so that its first derivative, in discrete form, is bounded. The slope constraint condition is realized as a restriction on the possible relation among (or the possible configuration of) several consecutive points on the warping function, as shown in Fig. 6.2(a) and (b). Concretely, if point c(k) moves forward in the direction of the i- (or j-) axis m consecutive times, then point c(k) is not allowed to step further in the same direction before stepping at least n times in the diagonal direction. The effective intensity of the slope constraint can be evaluated by the measure P = n/m.
The larger the P measure, the more rigidly the warping function slope is restricted. When P = 0, there are no restrictions on the warping function slope. When P = infinity (that is, m = 0), the warping function is restricted to the diagonal line j = i, and nothing more occurs than conventional pattern matching with no time normalization. Generally speaking, if the slope constraint is too severe, time normalization does not work effectively; if it is too lax, discrimination between speech patterns in different categories is degraded. Thus, setting neither too large nor too small a value for P is desirable. An investigation of the optimum compromise on the P value through several experiments is reported in [3]. In Fig. 6.2(c) and (d), two examples of permissible point c(k) paths under slope constraint condition P = 1 are shown. The Fig. 6.2(c) type is directly derived from the above definition, while Fig. 6.2(d) is an approximated type with one further constraint: the second derivative of warping function F is restricted so that the point c(k) path does not orthogonally change its direction. This new constraint reduces the number of paths to be searched. Therefore, the simpler Fig. 6.2(d) type is adopted hereafter, except in the P = 0 case.
Since the denominator $N = \sum_{k=1}^{K} w(k)$ (called the normalization coefficient) is independent of warping function F, it can be put outside the brackets, simplifying the equation as follows:

$D(A, B) = \frac{1}{N} \min_{F} \left[ \sum_{k=1}^{K} d(c(k))\, w(k) \right]$   Equation 6.13
This simplified problem can be effectively solved by use of the dynamic programming technique. In the symmetric form, the weighting coefficient is

$w(k) = [\, i(k) - i(k-1) \,] + [\, j(k) - j(k-1) \,]$   Equation 6.14

Then N = I + J, where I and J are the lengths of speech patterns A and B, respectively.
If it is assumed that time axes i and j are both continuous, then, in the symmetric form, the summation in Equation 6.6 means an integration along the temporarily defined axis l = i + j. As a result, the time-normalized distance is symmetric in the symmetric form, i.e., D(A, B) = D(B, A). Another, more important consequence of the difference in the integration axis is that, as shown in Fig. 6.3, the weighting coefficient w(k) reduces to zero in the asymmetric form when the point on the warping function steps in the direction of the j-axis, i.e., c(k) = c(k-1) + (0, 1). This means that some feature vectors bj may be excluded from the integration in the asymmetric form. In the symmetric form, on the contrary, the minimum w(k) value is equal to 1, and no exclusion occurs. Since the discussion here is based on the assumption that each part of a speech pattern should be treated equally, the exclusion of any feature vectors from the integration should be avoided as far as possible. It can therefore be expected that the symmetric form will give better recognition accuracy than the asymmetric form. However, it should be noted that the slope constraint reduces the situations where the point on the warping function steps in the j-axis direction; the difference in performance between the symmetric and asymmetric forms gradually vanishes as the slope constraint is intensified.
In dynamic programming form, the minimization is computed recursively:

$g(c(1)) = d(c(1))\, w(1)$   Equation 6.15

$g(c(k)) = \min_{c(k-1)} \left[\, g(c(k-1)) + d(c(k))\, w(k) \,\right]$   Equation 6.16

and the time-normalized distance is then $D(A, B) = \frac{1}{N}\, g(c(K))$.
It is implicitly assumed here that c(0) = (0, 0). Accordingly, w(1) = 2 in the symmetric form and w(1) = 1 in the asymmetric form. By realizing the restrictions on the warping function described in Section 6.1.2 and substituting Equation 6.14 for the weighting coefficient w(k) in Equation 6.16, several practical algorithms can be derived. As one of the simplest examples, the algorithm for the symmetric form with no slope constraint (that is, P = 0) is shown here.

Initial condition:

$g(1, 1) = 2\, d(1, 1)$   Equation 6.18

DP-equation:

$g(i, j) = \min \begin{cases} g(i,\, j-1) + d(i, j) \\ g(i-1,\, j-1) + 2\, d(i, j) \\ g(i-1,\, j) + d(i, j) \end{cases}$   Equation 6.19

Restricting condition (adjustment window):

$j - r \le i \le j + r$   Equation 6.20

Time-normalized distance:

$D(A, B) = \frac{1}{N}\, g(I, J)$   Equation 6.21

where N = I + J. The algorithm, especially the DP-equation, should be modified when the asymmetric form is adopted or some slope constraint is employed. In Table I, algorithms are summarized for both symmetric and asymmetric forms, with various slope constraint conditions. In this table, the DP-equations for the asymmetric forms are shown in a somewhat improved form. The first expression in the bracket of the asymmetric-form DP-equation for P = 1, that is, [g(i-1, j-2) + d(i, j-1) + d(i, j)]/2, corresponds to the case where c(k-1) = (i(k), j(k)-1) and c(k-2) = (i(k-1)-1, j(k-1)-1). Accordingly, if the definition in Equation 6.14 is strictly obeyed, w(k) is equal to zero while w(k-1) is equal to 1, thus completely omitting d(c(k)) from the summation. In order to avoid this situation to a certain extent, the weighting coefficient w(k-1) = 1 is divided between the two weighting coefficients w(k-1) and w(k). Thus, (d(i, j-1) + d(i, j))/2 is substituted for d(i, j-1) + 0 * d(i, j) in this expression. Similar modifications are applied to the other asymmetric-form DP-equations. In fact, it has been established by a preliminary experiment that this modification significantly improves the asymmetric-form performance [12].
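The symmetric-form P = 0 algorithm translates almost directly into code. A sketch (feature sequences as arrays of vectors, Euclidean local distance; the function name and test signals are illustrative):

```python
import numpy as np

def dtw_distance(A, B, r=None):
    """Symmetric-form DTW with no slope constraint (P = 0).
    A, B: sequences of feature vectors; r: adjustment-window width."""
    I, J = len(A), len(B)
    if r is None:
        r = max(I, J)                                      # effectively no window
    d = lambda i, j: np.linalg.norm(A[i - 1] - B[j - 1])   # local distance
    g = np.full((I + 1, J + 1), np.inf)                    # row/col 0 act as boundaries
    g[1, 1] = 2 * d(1, 1)                                  # initial condition
    for i in range(1, I + 1):
        for j in range(max(1, i - r), min(J, i + r) + 1):  # window: |i - j| <= r
            if i == 1 and j == 1:
                continue
            g[i, j] = min(g[i, j - 1] + d(i, j),
                          g[i - 1, j - 1] + 2 * d(i, j),
                          g[i - 1, j])                     # placeholder, fixed below
            g[i, j] = min(g[i, j - 1] + d(i, j),
                          g[i - 1, j - 1] + 2 * d(i, j),
                          g[i - 1, j] + d(i, j))           # DP-equation
    return g[I, J] / (I + J)                               # time-normalized distance

A = np.sin(np.arange(20))[:, None]
print(dtw_distance(A, A))                                  # 0.0: identical patterns
print(dtw_distance(A, A[::-1], r=5) > dtw_distance(A, A))  # True: mismatch costs more
```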
The optimum DP algorithm applied to speech recognition was investigated. The symmetric form was proposed along with the slope constraint technique. These varieties were then compared through theoretical and experimental investigations. The conclusions are as follows: slope constraint is actually effective, and optimum performance is attained with slope constraint condition P = 1. The validity of these results was ensured by good agreement between the theoretical discussion and the experimental results. The optimized algorithm was then experimentally compared with several other DP algorithms applied to spoken word recognition by different research groups, and the superiority of the algorithm described in [3] was established.
7. FPGA Implementation
The AccelDSP Synthesis Tool is a product that allows a MATLAB floating-point design to be transformed into a hardware module that can be implemented in a Xilinx FPGA. The AccelDSP Synthesis Tool features an easy-to-use graphical user interface that controls an integrated environment with other design tools such as MATLAB, the Xilinx ISE tools, and other industry-standard HDL simulators and logic synthesizers. AccelDSP synthesis follows this implementation procedure: a) reading and analyzing a MATLAB floating-point design; b) automatically creating an equivalent MATLAB fixed-point design; c) invoking a MATLAB simulation to verify the fixed-point design; d) providing the power to quickly explore design trade-offs of algorithms optimized for the target FPGA architectures; e) creating a synthesizable RTL HDL model and a test bench to ensure bit-true, cycle-accurate design verification; f) providing scripts that invoke and control downstream tools such as HDL simulators, RTL logic synthesizers, and the Xilinx ISE implementation tools.
The Synthesis flow in AccelDSP ISE can be observed from the following flow chart:
8.1
Input Signal:
1) Input speech signal for the word "Speech":
The input speech signal, with a duration of 5 seconds and a sampling frequency of 8 kHz, is shown above.
8.2
Pre-emphasis:
The output is obtained after passing the input speech signal through the pre-emphasis (first-order high-pass) filter. The output is significantly spectrally flatter than the input.
8.3
Voice Activation Detection:
The above plot shows the voice-activated region for the word "Speech". The output is 1 for the voiced region and 0 for the unvoiced and silence regions. Hence, out of the total samples, only the voice-activated samples are filtered out.
After obtaining the voice activation detection output, the regions for which VAD = 1 are extracted for further analysis.
8.4
De-noising:
De-noising for Speech:
The final de-noised signal is obtained after spectral subtraction; the noise components present in the signal are reduced.
8.5
Recognition Results:
This section provides the experimental results in recognizing the isolated words. In the experiment, the database consists of 10 different words, and 25 utterances of each word are used. The calculation of the recognition rate is given in Equation 8.1 below.

$\text{Recognition rate (\%)} = \frac{\text{number of times the word is correctly recognized}}{\text{number of utterances tested}} \times 100$   Equation 8.1

For example, a word recognized correctly in 24 of its 25 utterances has a recognition rate of 96%.

a) The recognition rates for each word using the Daubechies-8 wavelet with level-4 DWT decomposition for English words are shown in the following table:

Word to be recognized    Correctly recognized (of 25)    Recognition rate (%)
Matrix                   24                              96
Paste                    24                              96
Project                  18                              72
Speech                   18                              72
Window                   24                              96
Distance                 20                              80
India                    24                              96
Ubuntu                   19                              76
Fedora                   25                              100
Android                  24                              96
Table 8.1: Recognition rates for English words using db8 & level-4 DWT. The overall recognition rate for English words using the Daubechies-8 wavelet at level 4 is 88%.
b) The recognition rates for each word using the Daubechies-8 wavelet with level-7 DWT decomposition for English words are shown in the following table:

Word to be recognized    Correctly recognized (of 25)    Recognition rate (%)
Matrix                   24                              96
Paste                    23                              92
Project                  21                              84
Speech                   23                              92
Window                   24                              96
Distance                 22                              88
India                    25                              100
Ubuntu                   21                              84
Fedora                   25                              100
Android                  25                              100
Table 8.2: Recognition rates for English words using db8 & level-7 DWT. The overall recognition rate for English words using the Daubechies-8 wavelet at level 7 is 93.2%.
9. CONCLUSION
From this study we could understand and experience the effectiveness of the discrete wavelet transform in feature extraction. The wavelet transform is a more powerful technique for speech processing than the earlier techniques, and the features obtained using it show higher recognition rates when extracted properly. Wavelets are able to distinguish between high-frequency, low-amplitude spectral components and low-frequency, large-amplitude spectral components. During the experiment we worked with the Daubechies-8 mother wavelet for feature extraction, using level-4 and level-7 decompositions. The efficiency of speech recognition for isolated words increases with the decomposition level. In this experiment only a limited number of samples was used; the performance of the system can be improved by utilizing various noise reduction or removal algorithms and by training with a larger dataset. The optimized DTW algorithm was experimentally applied to the features extracted from wavelet analysis using the discrete wavelet transform, and the slope constraint used in DTW gave effective results. The MATLAB code used for the speech recognition system was used for hardware implementation with the AccelDSP tool. The validity of these results using dynamic time warping was ensured by good agreement between theoretical discussion and experimental results.
REFERENCES
[1] Trivedi, Saurabh, Sachin and Raman, "Speech Recognition by Wavelet Analysis", International Journal of Computer Applications (0975-8887), Vol. 15, No. 8, February 2011.
[2] Lawrence Rabiner and Biing-Hwang Juang, "Fundamentals of Speech Recognition".
[3] Hiroaki Sakoe and Seibi Chiba, "Dynamic Programming Algorithm Optimization for Spoken Word Recognition", IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-26, No. 1, February 1978.
[4] Ingrid Daubechies, "Ten Lectures on Wavelets", SIAM, Philadelphia, 1992.
[5] Ian McLoughlin, "Applied Speech and Audio Processing: With MATLAB Examples".
[6] I. Daubechies, "Orthonormal Bases of Compactly Supported Wavelets", Communications on Pure and Applied Mathematics, Vol. 41, pp. 909-996, November 1988.
[7] Murali Krishnan, Chris P. Neophytou and Glenn Prescott, "Wavelet Transform Speech Recognition using Vector Quantization, Dynamic Time Warping and Artificial Neural Networks".
[8] George Tzanetakis, Georg Essl and Perry Cook, "Audio Analysis using the Discrete Wavelet Transform", Organised Sound, Vol. 4(3), 2000.
[9] L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE, Vol. 77, No. 2, pp. 257-286, 1989.
[10] Mikael Nilsson and Marcus Ejnarsson, "Speech Recognition using Hidden Markov Model".
[11] S. G. Mallat, "A Theory for Multiresolution Signal Decomposition: The Wavelet Representation", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 11, pp. 674-693, 1989.
[12] Sylvio Barbon Junior, Rodrigo Capobianco Guido, Shi-Huang Chen, Lucimar Sasso Vieira and Fabricio Lopes Sanchez, "Improved Dynamic Time Warping Based on the Discrete Wavelet Transform", Ninth IEEE International Symposium on Multimedia, 2007.
[13] M. Misiti, Y. Misiti, G. Oppenheim and J. Poggi, "MATLAB Wavelet Toolbox", The MathWorks, Inc., 2000.
[14] George Tzanetakis, Georg Essl and Perry Cook, "Audio Analysis using the Discrete Wavelet Transform", Organised Sound, Vol. 4(3), 2000.
[15] Mike Brookes, "VOICEBOX: Speech Processing Toolbox for MATLAB", Department of Electrical & Electronic Engineering, Imperial College, London, UK.
[16] Daryl Ning, "Developing an Isolated Word Recognition System in MATLAB", MATLAB Digest, January 2010.