Speech Processing Research Paper 9

2010 International Conference on Cyberworlds
IMPROVEMENT OF SPEECH SOURCE LOCALIZATION IN NOISY ENVIRONMENT USING OVERCOMPLETE RATIONAL-DILATION WAVELET TRANSFORMS
Di Liu, Andy W. H. Khong

School of Electrical and Electronic Engineering
Nanyang Technological University
Singapore
Email: {LIUDI, andykhong}@ntu.edu.sg
Abstract: The generalized cross-correlation using the phase transform prefilter remains popular for the estimation of time-differences-of-arrival. However, it is not robust to noise and, as a consequence, the performance of direction-of-arrival algorithms is often degraded under low signal-to-noise conditions. We propose to address this problem through the use of a wavelet-based speech enhancement technique, since the wavelet transform can achieve good denoising performance. The overcomplete rational-dilation wavelet transform is exploited to process speech signals effectively due to its higher frequency resolution. In addition, we exploit the joint distribution of the speech in the wavelet domain and develop a novel local noise variance estimator based on the bivariate shrinkage function. As will be shown, our proposed algorithm achieves good direction-of-arrival performance in the presence of noise.
Figure 1. Variation of the mean and standard deviation of the bearing error (degrees) against SNR (dB) for DOA estimation using the PHAT-GCC algorithm.
of the bearing errors increase from 20° to 40° and from 40° to 60°, respectively, when the SNR reduces from 10 to 0 dB. As
can be seen, degradation in performance for DOA estimation
becomes more pronounced with lower SNR. A common
approach to this problem is to preprocess the noisy signals
by denoising. Although speech denoising has been an active
area of research, these efforts have mainly been focused
on improving the subjective quality or intelligibility of the
speech. In this work, however, we focus on denoising with
the aim of improving the performance of DOA estimation.
It has been shown that wavelet-based methods have become an important tool to address the difficult problem of
denoising [3], [4]. This is achieved by taking advantage of
the sparseness of signals in the wavelet domain. In this work,
we propose to incorporate such wavelet denoising techniques
to improve the DOA performance in the presence of noise.
The wavelet-based denoising algorithm will consist of three
steps: 1) computing the wavelet transform (WT) of the noisy
signal, 2) modifying the noisy wavelet coefficients and 3)
computing the inverse WT using the modified wavelets. It
is therefore important, in this work, to determine the type
of wavelet transform and the threshold selection method in
order to achieve good DOA estimation.
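The three steps listed above can be sketched as follows. This is only an illustration under simplifying assumptions: it uses a one-level orthonormal Haar transform and a fixed soft threshold in place of the rational-dilation WT and the bivariate shrinkage rule developed in this paper, and all function names are hypothetical.

```python
import numpy as np

def haar_forward(x):
    """One-level orthonormal Haar transform of an even-length signal."""
    x = np.asarray(x, dtype=float)
    approx = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return approx, detail

def haar_inverse(approx, detail):
    """Inverse of haar_forward (perfect reconstruction)."""
    x = np.empty(2 * approx.size)
    x[0::2] = (approx + detail) / np.sqrt(2.0)
    x[1::2] = (approx - detail) / np.sqrt(2.0)
    return x

def soft_threshold(coeffs, thr):
    """Shrink coefficient magnitudes by thr, setting small ones to zero."""
    return np.sign(coeffs) * np.maximum(np.abs(coeffs) - thr, 0.0)

def denoise(x, thr):
    approx, detail = haar_forward(x)        # step 1: wavelet transform
    detail = soft_threshold(detail, thr)    # step 2: modify noisy coefficients
    return haar_inverse(approx, detail)     # step 3: inverse transform
```

With `thr = 0` the signal is reconstructed unchanged; a very large `thr` removes all detail coefficients and leaves pairwise averages.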
We note that the speech and noise signals can better
be separated if an appropriate transform is selected. The
overcomplete rational dilation WT [5] is a recent enhancement where the frequency resolution can be varied. Due to
the fact that the speech spectrum varies significantly across
frequency bands, the rational dilation WT with high frequency resolution can be effective for processing the speech
Keywords: denoising, wavelet, speech source localization, DOA estimation
I. INTRODUCTION
Research into speech source localization has received much attention for cyberworld applications including automatic camera steering, online video surveillance and speaker tracking. One of the widely adopted approaches for speech source localization is the generalized cross-correlation (GCC) based time-differences-of-arrival (TDOA) estimation algorithm [1]. This algorithm computes the inter-channel delays by locating the maximum of the weighted cross-correlation between each pair of the received signals. While many different prefilters can be applied, the heuristic-based phase transform (PHAT) prefilter has been found to perform very well under practical conditions [2].
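A minimal sketch of GCC-PHAT delay estimation for one microphone pair is given below. It is not the implementation used in the experiments; the function name `gcc_phat` and the parameters `fs` and `max_delay` are hypothetical, and the small constant guarding the division is an assumption made for numerical safety.

```python
import numpy as np

def gcc_phat(x1, x2, fs=1.0, max_delay=None):
    """Estimate the delay of x1 relative to x2 (in seconds) using GCC-PHAT."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    n = x1.size + x2.size                      # zero-pad to avoid circular wrap-around
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross = X1 * np.conj(X2)                   # cross-power spectrum
    cross /= np.maximum(np.abs(cross), 1e-12)  # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)              # weighted cross-correlation
    if max_delay is None:
        max_shift = n // 2
    else:
        max_shift = max(1, min(int(round(max_delay * fs)), n // 2))
    # Reorder so that lags run from -max_shift to +max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(cc)) - max_shift     # lag of the correlation peak
    return shift / fs
```

A positive return value means `x1` lags `x2`; for a microphone array, one such delay is computed per microphone pair before the bearing is derived.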
As reported in [2], the PHAT prefilter is optimal in the maximum likelihood (ML) sense in the presence of reverberation. However, this prefilter is not robust to low signal-to-noise ratio (SNR) conditions and, as a result, the performance of direction-of-arrival (DOA) estimation algorithms degrades with reducing SNR. Figure 1 shows an illustrative example of this degradation where the mean and standard deviation
This work is supported by the Singapore National Research Foundation Interactive Digital Media R&D Program, under research grant
NRF2008IDM-IDM004-010.
978-0-7695-4215-7/10 $26.00 2010 IEEE

DOI 10.1109/CW.2010.69
Figure 2. Analysis and synthesis filter banks for the implementation of the rational-dilation wavelet transform (after [5]).
in the wavelet domain. In contrast, the poor frequency resolution of the dyadic WT limits its effectiveness for analyzing signals that are quasi-periodic in nature, including speech, electroencephalograms and signals arising from mechanical vibrations [6].
In addition, among a variety of nonlinear thresholding rules for wavelet-based denoising, the bivariate shrinkage thresholding [7] can improve SNR performance significantly. This is achieved by taking into account the statistical dependencies between wavelet coefficients and their parents using Bayesian estimation theory. As a priori knowledge, we will discuss the joint distribution of wavelet coefficients for a typical speech signal. In addition, we show that direct application of existing approaches will not address the noise robustness issue. This thresholding requires a noise variance estimator, which will be computed locally for each frequency subband, making it suited to the distribution of the speech spectrum.
Figure 3. (a) Frequency response, plotted against frequency (cycles/sample), and (b) wavelets at several scales, plotted per subband against time (samples) (after [5]).
independently which in turn determines the amount of noise

reduction in each subband.
III. WAVELET-BASED SPEECH DENOISING FOR DIRECTION-OF-ARRIVAL ESTIMATION
To describe the wavelet-based denoising problem for speech, we define $w_k(j)$ to be the kth wavelet coefficient in the high-pass (H) subband of scale j, where j = 1, . . . , J denotes the wavelet scale index and k = 1, . . . , K denotes the wavelet coefficient index. Here, J denotes the total number of wavelet scales and K denotes the total number of wavelet coefficients in each scale after resizing. We next define $y_k(j)$ as the noisy observation of $w_k(j)$ and $n_k(j)$ as the additive noise, giving $y_k(j) = w_k(j) + n_k(j)$. We also note that $w_k(j+1)$ is the wavelet coefficient at the next coarser scale to $w_k(j)$ and therefore we say $w_k(j+1)$ is the parent of $w_k(j)$.
Treating these quantities as random, we define $W_k(j)$, $Y_k(j)$ and $N_k(j)$ as the random variables corresponding to $w_k(j)$, $y_k(j)$ and $n_k(j)$, respectively. Using this notation, we can write
II. REVIEW OF OVERCOMPLETE RATIONAL-DILATION WAVELET TRANSFORMS
The overcomplete rational-dilation WTs [5] can achieve a class of WTs with constant quality (Q)-factor, where the Q-factor of a band-pass filter is the ratio of its center frequency to its bandwidth. We note that WTs with high Q-factors are desirable for processing quasi-periodic signals such as speech due to their higher frequency resolution compared to the dyadic WT with low Q-factor.
The iterated filter banks shown in Fig. 2 can be used to implement rational-dilation WTs [5], where p is an upsampling factor, q and s are the downsampling factors, and q/p is a rational dilation factor. These parameters affect the Q-factor, the redundancy of the WT and the time-bandwidth product; for a given q/p, there is often a trade-off between the Q-factor and the time-bandwidth product.
One generally requires higher frequency resolution when analyzing or filtering quasi-periodic signals like speech. In this work, we set p = 9, q = 10, s = 5, giving a dilation factor of q/p ≈ 1.11 and a redundancy of 2.0. Figure 3 illustrates the corresponding frequency response of the iterated filter bank and the wavelets. As can be seen from these figures, a good time-frequency localization with more band-pass filters covering the same frequency range is achieved. In addition, these parameters give rise to a high Q-factor and are able to avoid ringing with a modest redundancy factor of less than 3. This WT, set with higher frequency resolution, can better separate the speech and noise signals. In addition, the noise reduction filter on each subband can be manipulated
$$ \mathbf{y} = \mathbf{w} + \mathbf{n}, \tag{1} $$
where $\mathbf{w} = [W_k(j), W_k(j+1)]^T$, $\mathbf{y} = [Y_k(j), Y_k(j+1)]^T$ and $\mathbf{n} = [N_k(j), N_k(j+1)]^T$. Taking into account the statistical dependency between adjacent wavelets and employing the maximum a posteriori (MAP) estimator, we can estimate $\mathbf{w}$ of the clean speech given the noisy observation $\mathbf{y}$ using
$$ \hat{\mathbf{w}}(\mathbf{y}) = \arg\max_{\mathbf{w}} \left[ p_n(\mathbf{y} - \mathbf{w}) \cdot p_w(\mathbf{w}) \right], \tag{2} $$
where $p_n(\mathbf{y} - \mathbf{w})$ and $p_w(\mathbf{w})$ are the joint probability density functions (pdfs) of $\mathbf{n}$ and $\mathbf{w}$, respectively. Hence, to estimate the clean wavelets $\hat{\mathbf{w}}(\mathbf{y})$ using (2), both $p_w(\mathbf{w})$ and $p_n(\mathbf{n})$ must be computed. Here, the noise is assumed to be i.i.d. white Gaussian and we can express the noise pdf as
$$ p_n(\mathbf{n}) = \frac{1}{2\pi\sigma_n^2} \exp\left( -\frac{N_k^2(j) + N_k^2(j+1)}{2\sigma_n^2} \right), \tag{3} $$
where $\sigma_n^2$ is the variance of the additive noise.
kth coefficient in each wavelet scale j will be estimated in the ML sense using coefficients in the neighboring region $B(k)$,
$$ \hat{\sigma}_y^2 = \frac{1}{M} \sum_{y_k(j) \in B(k)} y_k^2(j), \tag{8} $$
where M is the size of the neighborhood $B(k)$ and $B(k)$ is defined as all coefficients within a window that is centered at the kth coefficient.
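The local estimates in (8) and (7) can be sketched as below, assuming the coefficients of one scale are held in a 1-D array. The function name, the default window length and the zero-padded edge handling are assumptions of this sketch, not choices stated in the paper.

```python
import numpy as np

def local_signal_std(y, noise_var, window=7):
    """Per-coefficient estimate of the clean-wavelet standard deviation.

    y         : noisy wavelet coefficients of one scale (1-D array)
    noise_var : estimated noise variance for that scale
    window    : odd neighbourhood size M centred on each coefficient
    """
    y = np.asarray(y, dtype=float)
    kernel = np.ones(window) / window
    # (8): average of squared noisy coefficients inside the window B(k).
    # mode="same" pads with zeros, so values near the edges are biased low.
    var_y = np.convolve(y ** 2, kernel, mode="same")
    # (7): subtract the noise variance, clip at zero, take the square root.
    return np.sqrt(np.maximum(var_y - noise_var, 0.0))
```

Where the local energy does not exceed the noise variance, the estimate is zero, so the shrinkage threshold for those coefficients becomes very large and they are suppressed.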
Although a typical speech signal occupies a wide frequency spectrum, it has significant energy within the range of 500 to 4000 Hz. The wavelets in the finest scale correspond to the highest frequency subband, denoted as $H_1$, and do not contain significant speech content. This assumption is valid since we utilize the high frequency resolution of the given rational-dilation WT. In addition, we assume that the noise is white with equivalent energy throughout the whole frequency band and, as a result, $y(H_1) \approx n(H_1)$. We can therefore estimate the overall noise variance from the finest scale wavelet coefficients, and a robust median estimator for the noise variance is [9]
$$ \hat{\sigma}_n^2 = \frac{\mathrm{median}(|y_k(1)|)}{0.6745}, \quad y_k(1) \in \text{subband } H_1. \tag{9} $$
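The median rule in (9) can be sketched in a few lines. The function name and the `scale` parameter are hypothetical. For zero-mean Gaussian coefficients, median(|y|)/0.6745 approximates the noise standard deviation, so this sketch returns that ratio as computed and leaves its mapping onto the quantity used in the threshold to the caller.

```python
import numpy as np

def median_noise_estimate(finest_coeffs, scale=0.6745):
    """Return median(|y_k(1)|) / scale over the finest-scale subband H1."""
    finest_coeffs = np.asarray(finest_coeffs, dtype=float)
    return np.median(np.abs(finest_coeffs)) / scale
```

The median is used because it is largely unaffected by the few large coefficients that speech may still contribute to the finest scale.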
We note that direct application of (9) is not suitable for our DOA application. Simulation using (9) exhibits a degradation in DOA performance and shows that the bearing errors are sensitive to the noise variance. This is because the energy of the speech spectrum varies significantly across different scales. A poor noise estimate can therefore result in an inappropriate threshold T. Accordingly, this can lead to additional unwanted high-frequency noise components. In view of the above, we should consider the degree of shrinkage for the wavelets of the speech signals and propose that the new estimator $\hat{\sigma}_n^2$ be given as
Figure 4. (a) Empirical joint parent-child histogram of wavelet coefficients from the speech signal database. (b) Bivariate pdf (4) used as the joint pdf of parent-child wavelet coefficient pairs. Both panels are plotted over the child and parent coefficient axes.
A. Bivariate shrinkage thresholding for speech signal
It is therefore important to determine an analytical expression for the joint pdf that models the wavelet distribution of typical speech. The empirical joint child-parent histogram can then be used to estimate $p_w(\mathbf{w})$. As presented in [7], a possible pdf model is given by
$$ p_w(\mathbf{w}) = \frac{3}{2\pi\sigma^2} \exp\left( -\frac{\sqrt{3}}{\sigma} \sqrt{W_k^2(j) + W_k^2(j+1)} \right), \tag{4} $$
where $\sigma^2$ is defined as the variance of the clean speech wavelet. To evaluate if this pdf model is suitable for speech signals, we performed the overcomplete rational-dilation WT as described in Section II using q/p = 10/9, s = 5 for a set of 30 speech signals extracted from the NOIZEUS database [8]. The joint histogram between $W_k(j)$ and $W_k(j+1)$ is plotted in Fig. 4(a), while the joint pdf model defined in (4) is plotted in Fig. 4(b). Comparing both plots, we note the close similarity between the analytical expression given by (4) and the histogram of the speech signals. We therefore propose to employ (4) for the estimation of $p_w(\mathbf{w})$.
Substituting (3) and (4) into (2), the MAP estimator in (2) can be rewritten as [7]
$$ \hat{W}_k(j) = \frac{\left( \sqrt{Y_k^2(j) + Y_k^2(j+1)} - \frac{\sqrt{3}\,\sigma_n^2}{\sigma} \right)_+}{\sqrt{Y_k^2(j) + Y_k^2(j+1)}} \; Y_k(j), \tag{5} $$
where the function $(g)_+$ in the numerator is defined as
$$ (g)_+ = \begin{cases} 0 & \text{if } g < 0 \\ g & \text{otherwise.} \end{cases} \tag{6} $$
$$ \hat{\sigma}_n^2 = \frac{\mathrm{median}(|y_k(1)|)}{c}, \quad y_k(1) \in \text{subband } H_1. \tag{10} $$
The performance of the DOA estimation algorithm is therefore dependent on the choice of c.
C. Factor c selection
We determine a suitable value of c that gives rise to good
DOA performance. This can be achieved empirically by
studying how c varies across different speech signals under
different SNR conditions. We first perform denoising using (10), (8) and (5) for 30 speech signals extracted from the
NOIZEUS database [8]. The DOA of the denoised speech
is subsequently estimated using GCC-PHAT. Figures 5(a)
and (b) show the variation of bearing error with c for the
case of SNR = 0 and 5 dB, respectively. As can be seen,
the bearing error first reduces with c after which it then
increases modestly. Accordingly, c = 1 is a good choice, i.e.,
$$ \hat{\sigma}_n^2 = \frac{\mathrm{median}(|y_k(1)|)}{1}, \quad y_k(1) \in \text{subband } H_1. \tag{11} $$
This is the bivariate shrinkage function in each wavelet scale

used for speech denoising.
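The shrinkage rule in (5) and (6) can be sketched as follows. The function name and the small constants guarding the divisions are assumptions of this sketch. Because the rational-dilation WT has a different number of coefficients per scale, the parent array is assumed to have already been resized to the length of the child array.

```python
import numpy as np

def bivariate_shrink(y_child, y_parent, noise_var, signal_std):
    """Apply the bivariate shrinkage rule (5) to the child coefficients.

    y_child    : noisy coefficients Y_k(j)
    y_parent   : noisy parent coefficients Y_k(j+1), same length as y_child
    noise_var  : noise variance estimate (sigma_n squared)
    signal_std : clean-wavelet standard deviation estimate (sigma), scalar or array
    """
    y_child = np.asarray(y_child, dtype=float)
    y_parent = np.asarray(y_parent, dtype=float)
    magnitude = np.sqrt(y_child ** 2 + y_parent ** 2)
    signal_std = np.maximum(np.asarray(signal_std, dtype=float), 1e-12)
    threshold = np.sqrt(3.0) * noise_var / signal_std          # T = sqrt(3) sigma_n^2 / sigma
    # (6): the (g)+ operator keeps only the positive part of magnitude - T.
    gain = np.maximum(magnitude - threshold, 0.0) / np.maximum(magnitude, 1e-12)
    return gain * y_child
```

A child coefficient survives only when the joint child-parent magnitude exceeds the threshold, which is how the parent influences the decision.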
B. Variance estimation for thresholding
Considering the wavelet shrinkage function in (5), we define $T = \sqrt{3}\,\sigma_n^2 / \sigma$ as the denoising threshold. It is therefore essential to estimate the noise variance $\sigma_n^2$ and the wavelet variance $\sigma^2$ for each wavelet scale.
In our algorithm, the variance $\sigma^2$ can be estimated as
$$ \hat{\sigma} = \sqrt{\left( \hat{\sigma}_y^2 - \hat{\sigma}_n^2 \right)_+}, \tag{7} $$
where $\sigma_y^2$ is the variance of the noisy wavelets. If one assumes that $Y_k(j)$ has Gaussian distribution, $\sigma_y^2$ for the
be seen, a good choice for c(1) that gives rise to good DOA performance for the GCC-PHAT is given by c(1) = 0.3 across the SNRs considered. In addition, we note that, for c(1) = 0.7, a low mean bearing error can be achieved while its standard deviation is modestly high compared to the case when c(1) = 0.3. We therefore conclude that c(1) = 0.3 and c(j) = 1, j = 2, . . . , J are good choices for DOA estimation.
Although a good choice of c(1) is given by 0.3, we further provide a means of estimating the SNR so that c(1) can be determined based on that shown in Fig. 6. We first define $\varepsilon_w(j)$, $\varepsilon_y(j)$ and $\varepsilon_n(j)$ as the energy of the clean signal wavelets, the received signal wavelets and the noise at scale j, respectively. We next define
$$ r_w(j) = \frac{\varepsilon_w(j)}{\sum_{i=1}^{J} \varepsilon_w(i)}, \quad r_y(j) = \frac{\varepsilon_y(j)}{\sum_{i=1}^{J} \varepsilon_y(i)}, \quad r_n(j) = \frac{\varepsilon_n(j)}{\sum_{i=1}^{J} \varepsilon_n(i)} $$
as the energy ratios for wavelets corresponding to the clean, received and noise signals. Since energy in the wavelet domain is equivalent to the time-domain energy, the SNR can be computed by
$$ \mathrm{SNR} = 10 \log_{10} \frac{\sum_{i=1}^{J} \varepsilon_w(i)}{\sum_{i=1}^{J} \varepsilon_n(i)}. \tag{12} $$
Figure 5. Variation of the mean bearing errors with c for (a) SNR = 0
dB and (b) SNR = 5 dB.
Figure 6. Variation of (a) mean and (b) standard deviation of the bearing error (degrees) with SNR (dB) for different factors c(1), with curves shown for c(1) = 0.3, 0.5 and 0.7.
The ratio ry (j) can be obtained using
Additional simulations show similarity in this variation for

different SNR conditions.
We propose to further improve the performance of DOA
estimation through c(j) which is level dependent. We
achieve this by noting that the ratio between clean and noisy
signals in each scale is different and that each scale may
be processed independently in order to estimate the noise
variance for each scale. We determine a good choice of
c(j) empirically for realistic applications through an iterative
procedure by first initializing c(j) = 1 for j = 2, . . . , J.
The value of c(1) is then set to a value which gives rise to the lowest DOA error using the GCC-PHAT algorithm. The
value of c(j + 1) is then subsequently obtained in a similar
manner after finding c(j) that gives rise to the lowest DOA
error. The same process is then applied to 30 speech signals
from the NOIZEUS database [8] under different SNRs.
Experiments conducted in this manner reveal that the performance of GCC-PHAT after denoising is relatively insensitive
to c(j), j = 2, . . . , J under different SNR conditions and
that c(j) = 1 can be considered as a good choice for
j = 2, . . . , J.
Figures 6(a) and (b) show the variation of mean and standard deviation of the bearing errors with SNR for different values of c(1). We note that the choice of c(1) affects
the DOA performance under different SNR conditions. This
can occur since, for the finest wavelet scale, corresponding
to the highest frequency subband, it is expected that noise
dominates the signal component under low SNR. Therefore,
compared with other scales, the noise energy in scale 1 is
more significant than the energy of the clean wavelet. Hence,
one should set a higher threshold for the finest scale. As can
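$$ \begin{aligned} r_y(j) &= \frac{\varepsilon_y(j)}{\sum_{i=1}^{J} \varepsilon_y(i)} = \frac{\varepsilon_w(j) + \varepsilon_n(j)}{\sum_{i=1}^{J} \varepsilon_y(i)} \\ &= \frac{r_w(j) \left( \sum_{i=1}^{J} \varepsilon_y(i) - \sum_{i=1}^{J} \varepsilon_n(i) \right)}{\sum_{i=1}^{J} \varepsilon_y(i)} + \frac{r_n(j) \sum_{i=1}^{J} \varepsilon_n(i)}{\sum_{i=1}^{J} \varepsilon_y(i)}, \end{aligned} \tag{13} $$
from which we obtain
$$ r_y(j) = r_w(j) + \alpha \left( r_n(j) - r_w(j) \right), \tag{14} $$
where
$$ \alpha = \frac{\sum_{i=1}^{J} \varepsilon_n(i)}{\sum_{i=1}^{J} \varepsilon_y(i)}. \tag{15} $$
We note that when the number of decomposition levels J is large, the signal energy in the coarsest scale approximates to zero. Hence, (14) can be rewritten as $r_y(J) = \alpha\, r_n(J)$, and $\alpha$ in (15) can then be expressed as
$$ \alpha = \frac{r_y(J)}{r_n(J)}. \tag{16} $$
Since white Gaussian noise should have a constant energy ratio across the scales, $r_n(j)$ can be computed given a WT. By using (15), (16) and (12), the SNR can now be rewritten as
$$ \mathrm{SNR} = 10 \log_{10} \left( \frac{1 - \alpha}{\alpha} \right) \text{ dB}, \tag{17} $$
from which we can now select a value of c(1) based on Fig. 6.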
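The SNR estimate in (16) and (17) can be sketched as below. Here `alpha` stands for the ratio of total noise energy to total received energy defined in (15); the variable name, the function name and the clipping that keeps the logarithm finite are assumptions of this sketch. The noise energy ratios are assumed to be known in advance for the chosen WT.

```python
import numpy as np

def estimate_snr_db(energy_y, noise_ratio):
    """Estimate the SNR in dB from per-scale energies of the received signal.

    energy_y    : energy of the received wavelets at scales 1..J (coarsest last)
    noise_ratio : energy ratios r_n(j) of white noise at scales 1..J
    """
    energy_y = np.asarray(energy_y, dtype=float)
    noise_ratio = np.asarray(noise_ratio, dtype=float)
    r_y = energy_y / energy_y.sum()               # received-signal energy ratios
    alpha = r_y[-1] / noise_ratio[-1]             # (16), using the coarsest scale J
    alpha = np.clip(alpha, 1e-12, 1.0 - 1e-12)    # keep the logarithm finite
    return 10.0 * np.log10((1.0 - alpha) / alpha)  # (17)
```

The estimate relies on the coarsest scale containing almost no speech energy, so it degrades when J is too small for that to hold.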
low SNR environment. In addition, the standard deviation for our proposed algorithm is reduced by approximately 8° over Berouti's denoising approach. This improvement is significantly higher than the improvement of the SS method over the GCC-PHAT processor without denoising. This shows that our approach based on wavelet denoising can improve DOA performance over that of the existing SS speech denoising method.
V. CONCLUSION
We presented a novel wavelet-based speech denoising algorithm for achieving high DOA performance for speech signals. We estimate the local noise variance, which can improve DOA performance further. Simulation results showed that our proposed method outperforms the spectral subtraction technique under low SNR, where the original PHAT algorithm is not robust.
Figure 7. DOA performance comparison of our proposed wavelet-based denoising with the approaches of [10], [11] and with no denoising under different SNRs (dB): (a) mean bearing errors and (b) standard deviation of bearing errors.
Using the above, we can therefore apply a MAP estimator using (2), and our proposed denoising algorithm for speech source localization is summarized as follows:
1) select c or c(1) using Fig. 6 or estimate SNR using (17);
2) compute the noise variance $\hat{\sigma}_n^2$ using (10);
3) for wavelet coefficients k = 1, . . . , K in each scale,
   a) calculate $\hat{\sigma}_y^2$ using (8);
   b) calculate $\hat{\sigma}^2$ using (7);
4) estimate each coefficient $\hat{W}_k(j)$ using (5);
5) estimate the DOA using the GCC-PHAT.
IV. EXPERIMENT RESULTS
We evaluate the performance of our proposed algorithm and compare its performance with that of two well-known denoising techniques [10], [11] in the context of DOA estimation. A virtual room of size 10 m × 10 m × 10 m is created using the method of images. A linear array of four microphones with spacing 0.05 m and centroid position (5, 5, 1.6) m is used. We evaluate the performance of the algorithms by varying the source bearing with a constant source-sensor distance of 3.6 m. We introduce white noise with different SNRs at each microphone. Speech signals used are obtained from the NOIZEUS database [8].
Bearing errors of our proposed wavelet-based algorithm and of the spectral-subtraction (SS) technique using Berouti's approach [10] and Martin's approach [11] are computed for 30 different speech signals, each using 100 independent trials under different SNR conditions. For our method, we have used factors c(j) = 1, j = 2, . . . , J, and c(1) is chosen using Fig. 6 based on different SNR conditions estimated using (17). The mean and standard deviation of the bearing errors are illustrated in Figs. 7(a) and (b), respectively.
As can be seen, the denoising approach of [11] does not give rise to good DOA estimation, although it is well known for offering better speech intelligibility. Using our proposed algorithm, the mean bearing errors are reduced by approximately 4° over Berouti's denoising approach under
REFERENCES
[1] C. Knapp and G. Carter, "The generalized correlation method for estimation of time delay," IEEE Trans. Acoust., Speech and Signal Process., vol. 24, no. 4, pp. 320-327, Aug. 1976.
[2] C. Zhang, D. Florencio, and Z. Y. Zhang, "Why does PHAT work well in low noise, reverberative environments?" IEEE Intl Conf. Acoust., Speech and Signal Process., pp. 2565-2568, Mar.-Apr. 2008.
[3] M. Miller and N. Kingsbury, "Image denoising using derotated complex wavelet coefficients," IEEE Trans. Image Process., vol. 17, no. 9, pp. 1500-1511, Nov. 2008.
[4] V. Bruni and D. Vitulano, "Wavelet-based signal denoising via simple singularities approximation," Signal Processing, vol. 86, no. 4, pp. 859-876, Apr. 2006.
[5] I. Bayram and I. W. Selesnick, "Frequency-domain design of overcomplete rational-dilation wavelet transforms," IEEE Trans. Signal Process., vol. 57, no. 8, pp. 2957-2972, Aug. 2009.
[6] C. S. Burrus, R. Gopinath, and H. Guo, Introduction to wavelets and wavelet transform: a primer, Prentice Hall, 1997.
[7] L. Sendur and I. W. Selesnick, "Bivariate shrinkage functions for wavelet-based denoising exploiting interscale dependency," IEEE Trans. Signal Process., vol. 50, no. 11, pp. 2744-2756, Nov. 2002.
[8] http://www.utdallas.edu/loizou/speech/noizeus/.
[9] D. Donoho and I. Johnstone, "Ideal spatial adaptation by wavelet shrinkage," Biometrika, vol. 81, no. 3, pp. 425-455, 1994.
[10] M. Berouti, R. Schwartz, and J. Makhoul, "Enhancement of speech corrupted by acoustic noise," in Proc. IEEE Intl Conf. Acoust., Speech and Signal Process., pp. 208-211, 1979.
[11] R. Martin, "Noise power spectral density estimation based on optimal smoothing and minimum statistics," IEEE Trans. Speech and Audio Process., vol. 9, no. 5, pp. 504-512, Jul. 2001.