
(IJCNS) International Journal of Computer and Network Security, Vol. 1, No. 2, November 2009

A Comprehensive Analysis of Voice Activity Detection Algorithms for Robust Speech Recognition System under Different Noisy Environment

C. Ganesh Babu1, Dr. P. T. Vanathi2, R. Ramachandran3, M. Senthil Rajaa4, R. Vengatesh5

1 Research Scholar (PSGCT), Associate Professor / ECE, Bannari Amman Institute of Technology, Sathyamangalam, India. E-mail: bits_babu@yahoo.co.in
2 Assistant Professor / ECE, PSGCT, Coimbatore, India. E-mail: pt_vani@yahoo.com
3,4,5 UG Scholar, Bannari Amman Institute of Technology, Sathyamangalam, India.

Abstract: Speech signal processing has not been widely used in the field of electronics and computers because of the complexity and variety of speech signals and sounds. Modern processes, algorithms, and methods, however, can process speech signals easily and recognize the spoken text. Demand for speech recognition technology is expected to rise dramatically over the next few years as people use their mobile phones as all-purpose lifestyle devices. In this paper, we implement a speech-to-text system using isolated word recognition with a vocabulary of ten words (the digits 0 to 9) and statistical modeling (the Hidden Markov Model, HMM) for machine speech recognition. In the training phase, the uttered digits are recorded using 8-bit Pulse Code Modulation (PCM) at a sampling rate of 8 kHz and saved as wave files using sound recorder software. The system performs speech analysis using the Linear Predictive Coding (LPC) method. From the LPC coefficients, the weighted cepstral coefficients and cepstral time derivatives are derived, and from these the feature vector for each frame is formed. The system then performs Vector Quantization (VQ) using a vector codebook, and the resulting index vectors form the observation sequence. For each word in the vocabulary, the system builds an HMM and trains it during the training phase. The training steps, beginning with Voice Activity Detection (VAD), are implemented as PC-based Matlab programs. Our current framework uses a speech processing module comprising a Subband Order Statistics Filter based Voice Activity Detector with HMM-based classification and noise modeling to achieve effective noise estimation.

Keywords: Hidden Markov Model, Vector Quantization, Subband OSF based Voice Activity Detection.

1. INTRODUCTION

Currently, Speech Recognition Systems face many technical barriers in modern applications. An important drawback affecting most of these applications is harmful environmental noise, which reduces system performance. Most noise compensation algorithms require a Voice Activity Detector (VAD) to estimate the presence or absence of the speech signal [1]. In this paper, we compare the performance of the VAD algorithm for Automatic Speech Recognition (ASR) in the presence of different types of noise: airport, babble, train, car, street, exhibition, restaurant and station. The proposed Speech Recognition System for robust noise environments is shown in Figure 1.

[Figure 1 block diagram: INPUT SPEECH → NOISE ESTIMATION → VAD → OUTPUT]
Figure 1. Proposed Robust Speech Recognition System

1.1 Speech Characteristics

Speech signals are composed of a sequence of sounds. Sounds can be classified into three distinct classes according to their mode of excitation:
(i) Voiced sounds are produced by forcing air through the glottis with the tension of the vocal cords adjusted so that they vibrate in a relaxation oscillation, thereby producing a quasi-periodic pulse of air which excites the vocal tract.
(ii) Fricative or unvoiced sounds are generated by forming a constriction at some point in the vocal tract and forcing air through the constriction at a high enough velocity to produce turbulence.
(iii) Plosive sounds result from making a complete closure and abruptly releasing it.

1.2 Overview of Speech Recognition

A Speech Recognition System is often degraded in performance when there is a mismatch between the acoustic conditions of the training and application environments. This mismatch may come from various sources, such as additive noise, channel distortion, different speaker characteristics and different speaking modes. Various robustness techniques have been proposed to reduce this mismatch and thus improve recognition performance [12]. In recent decades, many methods have been proposed to enable ASR systems to compensate for or adapt to mismatch due to inter-speaker differences, articulation effects and microphone characteristics [14].

The paper is organized as follows. Section 2 reviews the theoretical background of VAD algorithms. Section 2.1

presents the VAD decision rule. Section 3 explains the subband OSF based VAD implementation. Section 4 describes VAD using HMM. Results are discussed in Section 5, and the paper is concluded in Section 6.

2. Voice Activity Detection

Voice is differentiated into speech or silence based on speech characteristics. The signal is sliced into adjoining frames, and a real-valued nonnegative parameter is associated with each frame. If this parameter exceeds a certain threshold, the frame is classified as speech; otherwise it is classified as non-speech. The basic principle of a VAD device is that it extracts measured features or quantities from the input signal and then compares these values with thresholds. Voice activity (VAD = 1) is declared if the measured value exceeds the threshold; otherwise no speech activity (VAD = 0) is declared. In general, a VAD algorithm outputs a binary decision on a frame-by-frame basis, where a frame of the input signal is a short unit of time such as 20-40 ms.

2.1 VAD Decision Rule

Once the input speech has been de-noised, its spectrum magnitude Y(k, l) is processed by means of a (2N + 1)-frame window. Spectral changes around an N-frame neighborhood of the actual frame are computed using the N-order Long-Term Spectral Envelope (LTSE) as:

  LTSE_N(k, l) = max{ Y(k, l + j) : j = -N, ..., +N }    (1)

where l is the actual frame for which the VAD decision [12] is made and k = 0, 1, ..., NFFT - 1 is the spectral band. The noise suppression block has to perform the noise reduction

  Y(k, l) = H(k, l) X(k, l)    (2)

where X(k, l) is the input spectrum magnitude and H(k, l) the noise-reduction gain, before the LTSE at the l-th frame can be computed. This is carried out as follows. During the initialization, the noise suppression algorithm is applied to the first 2N + 1 frames and, in each iteration, the (l + N + 1)-th frame is de-noised, so that Y(k, l + N + 1) becomes available for the next iteration.

The VAD decision rule is formulated in terms of the Long-Term Spectral Divergence (LTSD) [1], calculated as the deviation of the LTSE with respect to the residual noise spectrum N(k) and defined by:

  LTSD_N(l) = 10 log10( (1/NFFT) Σ_{k=0}^{NFFT-1} LTSE_N^2(k, l) / N^2(k) )    (3)

If the LTSD is greater than an adaptive threshold γ, the actual frame is classified as speech; otherwise it is marked as non-speech. A hangover delays the speech to non-speech transition in order to prevent low-energy word endings being misclassified as silence. On the other hand, if the LTSD achieves a given threshold LTSD0, the hangover algorithm is turned off to improve non-speech detection accuracy in low noise environments. The VAD is made adaptive to time-varying noise environments, with the following algorithm used for updating the noise spectrum during non-speech periods:

  N(k) = α N(k) + (1 - α) N̄_K(k)    (4)

where N̄_K(k) is the average spectrum magnitude over a K-frame neighbourhood:

  N̄_K(k) = (1 / (2K + 1)) Σ_{j=-K}^{+K} Y(k, l + j)    (5)

3. Subband OSF Based VAD

This is an improved voice activity detection algorithm employing long-term signal processing and maximum spectral component tracking. It improves the speech/non-speech discriminability and the speech recognition performance in noisy environments. Two issues are addressed by the VAD: the first is its performance at low SNR, and the second is its behaviour in different background noise environments [1].

[Figure 2 block diagram: FFT → noise reduction (spectrum smoothing, noise update, WF design, frequency domain filtering) → VAD]
Figure 2. Block Diagram of Subband Order Statistics Filter based VAD

The subband based VAD uses two order statistics filters for the Multi-Band Quantile (MBQ) SNR estimation [3]. The implementation of both OSFs is based on a sequence of 2N + 1 log-energy values {E(m - N, k), ..., E(m, k), ..., E(m + N, k)} around the frame to be analyzed [14]. The block diagram of the subband based VAD is shown in Figure 2. The algorithm operates on the subband log-energies: noise reduction is performed first, and the VAD decision is formulated on the de-noised signal. The noisy speech signal is decomposed into 25-ms frames with a 10-ms window shift. Let X(m, l) be the spectrum magnitude for the m-th band at frame l. The design of the noise reduction block is based on Wiener filter theory, whereby the attenuation is a function of the signal-to-noise ratio (SNR) of the input signal. The VAD decision is formulated in terms of the de-noised signal, the subband log-energies being processed by means of order statistics filters [2].

The noise reduction block consists of four stages.
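The decision rule of Equations (1)-(5) can be sketched as follows. This is a minimal illustration rather than the authors' Matlab implementation; the window length N, the threshold gamma, the update constant alpha and the neighbourhood K are placeholder values, and the adaptive threshold and hangover logic are omitted.

```python
import numpy as np

def ltse_ltsd_vad(Y, noise_psd, N=6, gamma=6.0, alpha=0.95, K=6):
    """Frame-wise LTSE/LTSD decision over a de-noised magnitude spectrogram.

    Y         : (num_bins, num_frames) de-noised spectrum magnitudes Y(k, l)
    noise_psd : (num_bins,) initial residual noise magnitude estimate N(k)
    Returns a boolean array of per-frame speech decisions.
    """
    num_bins, num_frames = Y.shape
    noise = noise_psd.copy()
    decisions = np.zeros(num_frames, dtype=bool)
    for l in range(num_frames):
        lo, hi = max(0, l - N), min(num_frames, l + N + 1)
        # Eq. (1): N-order long-term spectral envelope around frame l
        ltse = Y[:, lo:hi].max(axis=1)
        # Eq. (3): long-term spectral divergence w.r.t. the noise spectrum
        ltsd = 10.0 * np.log10(np.mean(ltse**2 / (noise**2 + 1e-12)) + 1e-12)
        decisions[l] = ltsd > gamma
        if not decisions[l]:
            # Eqs. (4)-(5): update the noise spectrum during non-speech frames
            klo, khi = max(0, l - K), min(num_frames, l + K + 1)
            nbar = Y[:, klo:khi].mean(axis=1)
            noise = alpha * noise + (1.0 - alpha) * nbar
    return decisions
```

Because the LTSE takes a maximum over a (2N + 1)-frame window, a single loud frame marks its whole neighbourhood as speech, which is what gives the long-term decision its robustness against brief dips inside words.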

i) Spectrum smoothing: The power spectrum is averaged over two consecutive frames and two adjacent spectral bands.

ii) Noise estimation: The noise spectrum Ne(m, l) is updated by means of a first-order IIR filter on the smoothed spectrum Xs(m, l):

  Ne(m, l) = λ Ne(m, l - 1) + (1 - λ) Xs(m, l)    (6)

where λ = 0.99 and m = 0, 1, ..., NFFT/2.

iii) Wiener filter design: First, the clean signal Ŝ(m, l) is estimated by combining smoothing and spectral subtraction:

  Ŝ(m, l) = γ Ŝ(m, l - 1) + (1 - γ) max( Xs(m, l) - Ne(m, l), 0 )    (7)

where γ = 0.98. Then, the Wiener filter H(m, l) is designed as

  H(m, l) = η(m, l) / (1 + η(m, l))    (8)

where

  η(m, l) = max( Ŝ(m, l) / Ne(m, l), η_min )    (9)

and η_min is selected so that the filter frequency response yields a 20 dB maximum attenuation. S'(m, l), the spectrum of the cleaned speech signal, is assumed to be zero at the beginning of the process and is used for designing the Wiener filter through Equations (7) to (9). It is given by

  S'(m, l) = H(m, l) X(m, l)    (10)

The filter H(m, l) is smoothed in order to eliminate rapid changes between neighboring frequencies that may often cause musical noise. Thus, the variance of the residual noise is reduced and, consequently, the robustness when detecting non-speech is enhanced. The smoothing is performed by truncating the impulse response of the corresponding causal FIR filter to 17 taps, using a Hanning window applied to this time-domain operation. The frequency response of the Wiener filter is thereby smoothed and the performance of the VAD is improved.

iv) Frequency domain filtering: The smoothed filter Hs(m, l) is applied in the frequency domain to obtain the de-noised spectrum:

  Y(m, l) = Hs(m, l) X(m, l)    (11)

4. Hidden Markov Model

The basic theoretical strength of the HMM is that it combines modeling of stationary stochastic processes (for the short-time spectra) and the temporal relationship among the processes (via a Markov chain) together in a well-defined probability space. This combination allows us to study these two separate aspects of modeling a dynamic process (like speech) using one consistent framework. Another attractive feature of HMMs is that it is relatively easy and straightforward to train a model from a given set of labeled training data (one or more sequences of observations).

As mentioned above, the technique used to implement speech recognition is the Hidden Markov Model (HMM) [4][13]. The HMM is used to represent the utterance of a word and to calculate the probability that the model created the observed sequence of vectors. There are some challenges in designing an HMM for the analysis or recognition of a speech signal. The HMM broadly works in two phases: phase I is Linear Predictive Coding, and phase II consists of the Vector Quantization, training, and recognition phases.

The present Hidden Markov Model is represented by Equation (12):

  λ = (A, B, π)    (12)

where
  π = initial state distribution vector,
  A = state transition probability matrix,
  B = observation probability distribution matrix.

Given appropriate values of A, B and π as in Equation (12), the HMM can be used as a generator to give an observation sequence

  O = O1 O2 ... OT    (13)

(where each observation Ot is one of the symbols from the observation symbol set V, and T is the number of observations in the sequence) as follows:

1) Choose an initial state q1 = Si according to the initial state distribution π.
2) Set t = 1.
3) Choose Ot = vk according to the symbol probability distribution in state Si, i.e. bi(k).
4) Transit to a new state qt+1 = Sj according to the state transition probability distribution for state Si, i.e. aij.
5) Set t = t + 1 and return to step 3 if t < T; otherwise terminate the procedure.

The above procedure can be used both as a generator of observations and as a model for how a given observation sequence was generated by an appropriate HMM. After re-estimating the parameters, the model is represented with the following notation:

  λ̄ = (Ā, B̄, π̄)    (14)

The model is saved to represent that specific observation sequence, i.e. an isolated word.
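The five-step generation procedure can be sketched directly. The two-state model below (the A, B and π values) is a toy example invented for illustration, not one of the trained digit models.

```python
import numpy as np

def generate_observations(A, B, pi, T, rng):
    """Use an HMM lambda = (A, B, pi) as a generator of T observation symbols.

    A  : (S, S) state transition probabilities a_ij
    B  : (S, M) symbol probabilities b_i(k) per state
    pi : (S,)   initial state distribution
    """
    S = len(pi)
    obs = np.empty(T, dtype=int)
    # Step 1: choose the initial state q1 = Si according to pi
    q = rng.choice(S, p=pi)
    for t in range(T):                  # Steps 2 and 5: t = 1, ..., T
        # Step 3: emit a symbol Ot = vk according to b_q(k)
        obs[t] = rng.choice(B.shape[1], p=B[q])
        # Step 4: transit to the next state according to a_qj
        q = rng.choice(S, p=A[q])
    return obs

rng = np.random.default_rng(1)
A = np.array([[0.9, 0.1], [0.2, 0.8]])   # toy transition matrix
B = np.array([[0.7, 0.3], [0.1, 0.9]])   # toy symbol probabilities
pi = np.array([1.0, 0.0])                # always start in state 0
seq = generate_observations(A, B, pi, T=10, rng=rng)
```

Run forwards, this samples a symbol sequence; the same λ evaluated with the forward algorithm scores how likely a given sequence is under the model, which is how the digit models are compared during recognition.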

4.1 Linear Predictive Coding Analysis

One way to obtain the observation vectors O from speech samples is to perform a front-end spectral analysis. The type of spectral analysis most often used is linear predictive coding [5]-[9]. The processing steps involved are as follows:

i) Preemphasis: The digitized speech signal is processed by a first-order digital network in order to spectrally flatten the signal:

  s̃(n) = s(n) - a s(n - 1)    (15)

ii) Blocking into frames: Sections of NA consecutive speech samples are used as a single frame, with consecutive frames spaced MA samples apart:

  xl(n) = s̃(MA l + n),  n = 0, 1, ..., NA - 1    (16)

iii) Frame windowing: Each frame is multiplied by an NA-sample window (a Hamming window) so as to minimize the adverse effects of chopping an NA-sample section out of the running speech signal:

  x̃l(n) = xl(n) w(n),  w(n) = 0.54 - 0.46 cos( 2πn / (NA - 1) )    (17)

iv) Autocorrelation analysis: Each windowed set of speech samples is autocorrelated to give a set of (p + 1) coefficients, where p is the order of the desired LPC analysis:

  rl(m) = Σ_{n=0}^{NA-1-m} x̃l(n) x̃l(n + m),  m = 0, 1, ..., p    (18)

v) LPC/Cepstral analysis: A vector of LPC coefficients is computed from the autocorrelation vector using a Levinson or Durbin recursion. An LPC-derived cepstral vector is then computed up to the Q-th component:

  cm = am + Σ_{k=1}^{m-1} (k/m) ck am-k,  1 ≤ m ≤ Q    (19)

vi) Cepstral weighting: The Q-coefficient cepstral vector cl(m) at time frame l is weighted by a window Wc(m) [5][6]:

  Wc(m) = 1 + (Q/2) sin( πm / Q ),  1 ≤ m ≤ Q    (20)

  ĉl(m) = Wc(m) cl(m)    (21)

to find the weighted cepstral vector

  ĉl = ( ĉl(1), ĉl(2), ..., ĉl(Q) )    (22)

vii) Delta cepstrum: The time derivative of the sequence of weighted cepstral vectors is approximated by a first-order orthogonal polynomial over a finite-length window of (2K + 1) frames centered around the current vector [7][8]:

  Δĉl(m) = G Σ_{k=-K}^{+K} k ĉl+k(m),  1 ≤ m ≤ Q    (23)

where G is a gain term chosen to make the variances of ĉl(m) and Δĉl(m) equal:

  var( Δĉl(m) ) = var( ĉl(m) )    (24)

The feature vector for frame l is then the concatenation of the weighted cepstral vector and the delta cepstrum:

  ol = ( ĉl(1), ..., ĉl(Q), Δĉl(1), ..., Δĉl(Q) )    (25)

4.2 Vector Quantization, Training and Recognition Phases

To use an HMM with discrete observation symbol densities, a Vector Quantizer (VQ) is required to map each continuous observation vector into a discrete codebook index. The major issue in VQ is the design of an appropriate codebook for quantization. The procedure basically partitions the training vectors into M disjoint sets. The distortion steadily decreases as M increases; hence, HMMs with codebook sizes from M = 32 to 256 vectors have been used in speech recognition experiments [9][10].

During the training phase the system trains an HMM for each digit in the vocabulary. The weighted cepstrum matrices for the various samples of each digit are compared with the codebook, and the corresponding nearest codebook vector indices are sent to the Baum-Welch algorithm to train a model for the input index sequence. After training, there are three models for each digit, corresponding to the three samples in our vocabulary set. The averages of the A, B and π matrices over the samples are then calculated to generalize the models [11].

In the recognition phase, the input speech sample is preprocessed to extract the feature vectors. Then, the nearest codebook vector index for each frame is sent to the digit models, and the system chooses the model that has the maximum probability of a match.

5. Results and Discussion

Several experiments were conducted to evaluate the VAD algorithm. The analysis mainly focused on error probabilities. The proposed VAD was evaluated in terms of its ability to discriminate the speech signal from non-speech at different SNR values. The results are shown in Tables 1-10.

Table 1: Performance of VAD for digit '0' for various noise sources

NOISES       0dB   5dB   10dB  15dB
AIRPORT       83    11    18    75
EXHIBITION     0    10    89     9
TRAIN          0     0     0     0
RESTAURANT     8    50     2     9
STREET        94    86    75    95
BABBLE         0    30     6    28
STATION       10    27     3    31
CAR            4     6     1    12
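The LPC front end of Section 4.1 and the codebook lookup of Section 4.2 can be sketched as one pipeline. This is an illustrative sketch, not the authors' Matlab code; the frame length NA = 200, shift MA = 80, order p = 10, Q = 12 and preemphasis constant a = 0.95 are assumed example values, and the cepstral weighting and delta stages are omitted for brevity.

```python
import numpy as np

def lpc_cepstra(signal, NA=200, MA=80, p=10, Q=12, a=0.95):
    """Per-frame LPC-derived cepstral vectors, following Eqs. (15)-(19)."""
    s = np.append(signal[0], signal[1:] - a * signal[:-1])   # (15) preemphasis
    w = np.hamming(NA)                                       # (17) Hamming window
    frames = []
    for start in range(0, len(s) - NA + 1, MA):              # (16) frame blocking
        x = s[start:start + NA] * w
        # (18) autocorrelation, lags m = 0 .. p
        r = np.array([np.dot(x[:NA - m], x[m:]) for m in range(p + 1)])
        # Levinson-Durbin recursion for the LPC coefficients a_1 .. a_p
        a_lpc = np.zeros(p + 1)
        e = r[0] + 1e-9
        for i in range(1, p + 1):
            k = (r[i] - np.dot(a_lpc[1:i], r[i - 1:0:-1])) / e
            a_new = a_lpc.copy()
            a_new[i] = k
            a_new[1:i] = a_lpc[1:i] - k * a_lpc[i - 1:0:-1]
            a_lpc, e = a_new, (1 - k * k) * e
        # (19) LPC-to-cepstrum recursion up to the Q-th component
        c = np.zeros(Q + 1)
        for m in range(1, Q + 1):
            am = a_lpc[m] if m <= p else 0.0
            c[m] = am + sum((j / m) * c[j] * a_lpc[m - j]
                            for j in range(1, m) if m - j <= p)
        frames.append(c[1:])
    return np.array(frames)

def quantize(features, codebook):
    """Section 4.2: map each feature vector to its nearest codebook index."""
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)
```

The index sequence returned by `quantize` is exactly the discrete observation sequence O of Equation (13) that is fed to the Baum-Welch training and to the per-digit models at recognition time.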
(IJCNS) International Journal of Computer and Network Security, 87
Vol. 1, No. 2, November 2009

Table 2: Performance of VAD for digit '1' for various noise sources

NOISES       0dB   5dB   10dB  15dB
AIRPORT       18    22    39    37
EXHIBITION    57    60    47    64
TRAIN         11    52    34    57
RESTAURANT    23    35    51    51
STREET        26    41    49    49
BABBLE        21    46    49    35
STATION       54    44    51    50
CAR           25    37    47    32

Table 3: Performance of VAD for digit '2' for various noise sources

NOISES       0dB   5dB   10dB  15dB
AIRPORT       62    18    61    38
EXHIBITION    55    37    48    54
TRAIN         44    37    53    56
RESTAURANT    40    16    44    36
STREET        60    44    58    33
BABBLE        31    49    55    19
STATION       37    36    23    40
CAR           57    58    76    53

Table 4: Performance of VAD for digit '3' for various noise sources

NOISES       0dB   5dB   10dB  15dB
AIRPORT       37    45    48    37
EXHIBITION    54    19    41    30
TRAIN         14    26    51    37
RESTAURANT    27    28    18    43
STREET        48    27    36     0
BABBLE        12    24    16    23
STATION       35    38    30    24
CAR           37    23    40    41

Table 5: Performance of VAD for digit '4' for various noise sources

NOISES       0dB   5dB   10dB  15dB
AIRPORT       64    79    77    59
EXHIBITION    52    64    48    66
TRAIN         46    75    83    74
RESTAURANT    37    43    56    64
STREET        84    69    56    82
BABBLE        84    73    57    78
STATION       77    77    80    63
CAR           73    49    71    67

Table 6: Performance of VAD for digit '5' for various noise sources

NOISES       0dB   5dB   10dB  15dB
AIRPORT       20    32    43    31
EXHIBITION    34    23    16    50
TRAIN         28    27    21    30
RESTAURANT    13    13    31    30
STREET        16    23    36    39
BABBLE        14    26    37    23
STATION       24    14    14    19
CAR           30    27    31    31

Table 7: Performance of VAD for digit '6' for various noise sources

NOISES       0dB   5dB   10dB  15dB
AIRPORT       16    68    66    86
EXHIBITION    51    59    83    88
TRAIN         14     9    24    19
RESTAURANT    15    21    14     7
STREET         1     1     1     3
BABBLE        32    74    25    96
STATION        1     1     1     1
CAR           21    38    27     4

Table 8: Performance of VAD for digit '7' for various noise sources

NOISES       0dB   5dB   10dB  15dB
AIRPORT       27    40    49    44
EXHIBITION    31    38    84    14
TRAIN         35    46    47    50
RESTAURANT    38    43    36    39
STREET        40    43    49    44
BABBLE        28    42    53    36
STATION       33    40    38    36
CAR           53    57    40    41

Table 9: Performance of VAD for digit '8' for various noise sources

NOISES       0dB   5dB   10dB  15dB
AIRPORT       48    53    57    44
EXHIBITION    60    22    56    19
TRAIN         18    54    58    61
RESTAURANT    45    59    58    59
STREET        49    44    53    45
BABBLE        28    58    52    44
STATION       59    48    30    51
CAR           40    54    59    51

Table 10: Performance of VAD for digit '9' for various noise sources

NOISES       0dB   5dB   10dB  15dB
AIRPORT       17    16    15    18
EXHIBITION     7    12    18    10
TRAIN          4    23    12    21
RESTAURANT    14    16    18    16
STREET        13     6     6    15
BABBLE         4    11    13    15
STATION       13     1    10    11
CAR           12     9    15    19

6. Conclusion

The experimental results shown in Tables 1-10 indicate that the VAD algorithm produces better results for certain noises, and that the recognition system using VAD is more robust than the other algorithms considered. For digits '0' and '9', the VAD provides better results for airport and street noises. For digits '1' and '8', it gives better performance for exhibition and station noises. For digit '2', the better recognition occurs for airport, street and car noises. For digit '3', the recognition is good for street and car noises. For digit '4', the VAD performs good recognition for street and babble noises. For digit '5', it works well for exhibition and car noises. For digit '6', recognition works well in airport and exhibition environments. For digit '7', the performance of the VAD is better for exhibition and babble noises. Thus the VAD works well for utterances of the different digits and extracts the speech signal under different noisy environmental conditions. Further research is in the direction of Genetic Algorithms for Robust Speech Recognition in noisy environments.

Acknowledgement

The authors would like to thank the supervisor, Dr. P. T. Vanathi, Professor, Department of Electronics and Communication Engineering, PSG College of Technology, Coimbatore, India. The authors would also like to thank the Management and Principal of Bannari Amman Institute of Technology, Sathyamangalam, India, and all those who supported the preparation of this paper.

References

[1] J. Ramirez, J. C. Segura, C. Benitez, A. de la Torre and A. Rubio, "Voice activity detection with noise reduction and long-term spectral divergence estimation", IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 2, pp. 1093-1096, May 2004.
[2] S. Rangachari and P. C. Loizou, "A noise-estimation algorithm for highly non-stationary environments", Speech Communication, Vol. 48, pp. 220-231, 2006.
[3] J. Ramírez, J. C. Segura, C. Benítez, Á. de la Torre and A. Rubio, "An Effective Subband OSF-Based VAD With Noise Reduction for Robust Speech Recognition", IEEE Transactions on Speech and Audio Processing, Vol. 13, pp. 1119-1129, November 2005.
[4] L. R. Rabiner, "A tutorial on Hidden Markov Models and selected applications in speech recognition", Proceedings of the IEEE, Vol. 77, No. 2, February 1989.
[5] J. Makhoul, "Linear Prediction: A Tutorial Review", Proceedings of the IEEE, April 1975.
[6] J. D. Markel and A. H. Gray Jr., "Linear Prediction of Speech", New York, NY: Springer-Verlag, 1976.
[7] Y. Tohkura, "A weighted cepstral distance measure for speech recognition", IEEE Trans. Acoust., Speech, Signal Processing, Vol. ASSP-35, No. 10, pp. 1414-1422, October 1987.
[8] B. H. Juang, L. R. Rabiner and J. G. Wilpon, "On the Use of Bandpass Filtering in Speech Recognition", IEEE Trans. Acoust., Speech, Signal Processing, Vol. ASSP-35, No. 7, pp. 947-954, July 1987.
[9] J. Makhoul, S. Roucos and H. Gish, "Vector Quantization in Speech Coding", Proc. IEEE, Vol. 73, No. 11, pp. 1551-1558, November 1985.
[10] L. R. Rabiner, S. E. Levinson and M. M. Sondhi, "On the Application of Vector Quantization and Hidden Markov Models to Speaker-Independent, Isolated Word Recognition", Bell Syst. Tech. J., Vol. 62, No. 4, pp. 1075-1105, April 1983.
[11] M. T. Balamuragan and M. Balaji, "SOPC-Based Speech-to-Text Conversion", Embedded Processor Design Contest - Outstanding Designs, pp. 83-108, 2006.
[12] A. Davis, S. Nordholm and R. Togneri, "Statistical Voice Activity Detection Using Low-Variance Spectrum Estimation and an Adaptive Threshold", IEEE Transactions on Audio, Speech and Language Processing, Vol. 14, No. 2, March 2006.
[13] K. Yao, K. K. Paliwal and T.-W. Lee, "Generative factor analyzed HMM for automatic speech recognition", Speech Communication, Vol. 45, pp. 435-454, January 2005.
[14] K. Ishizuka and T. Nakatani, "A feature extraction method using subband based periodicity and aperiodicity decomposition with noise robust frontend processing for automatic speech recognition", Speech Communication, Vol. 48, pp. 1447-1457, July 2006.
