LEAST-MEAN-SQUARE ADAPTIVE FILTERS

Edited by Simon Haykin

This book is dedicated to Bernard Widrow for inventing the LMS filter and investigating its theory and applications.
PREFACE
The earliest work on adaptive filters may be traced back to the late 1950s, during which time a number of researchers were working independently on theories and applications of such filters. From this early work, the least-mean-square (LMS) algorithm emerged as a simple, yet effective, algorithm for the design of adaptive transversal (tapped-delay-line) filters.
The LMS algorithm was devised by Widrow and Hoff in 1959 in their study of a pattern-recognition machine known as the adaptive linear element, commonly referred to as the Adaline [1, 2]. The LMS algorithm is a stochastic gradient algorithm in that it iterates each tap weight of the transversal filter in the direction of the instantaneous gradient of the squared error signal with respect to the tap weight in question.
Let $\hat{\mathbf{w}}(n)$ denote the tap-weight vector of the LMS filter, computed at iteration (time step) $n$. The adaptive operation of the filter is completely described by the recursive equation (assuming complex data)

$$\hat{\mathbf{w}}(n+1) = \hat{\mathbf{w}}(n) + \mu\,\mathbf{u}(n)\left[d(n) - \hat{\mathbf{w}}^{H}(n)\mathbf{u}(n)\right]^{*}, \qquad (1)$$

where $\mathbf{u}(n)$ is the tap-input vector, $d(n)$ is the desired response, and $\mu$ is the step-size parameter. The quantity enclosed in square brackets is the error signal. The asterisk denotes complex conjugation, and the superscript $H$ denotes Hermitian transposition (i.e., ordinary transposition combined with complex conjugation).
Equation (1) is testimony to the simplicity of the LMS filter. This simplicity, coupled with the desirable properties of the LMS filter (discussed in the chapters of this book) and its practical applications [3, 4], has made the LMS filter and its variants an important part of the adaptive signal processing toolkit, not just for the past 40 years but for many years to come. Simply put, the LMS filter has withstood the test of time.
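The recursion of Eq. (1) translates directly into a few lines of code. The following is a minimal sketch in Python/NumPy; the zero initial condition and the system-identification usage below are illustrative choices, not prescriptions from the text:

```python
import numpy as np

def lms_filter(u, d, M, mu):
    """Complex LMS recursion of Eq. (1).

    u  : input samples (1-D array, real or complex)
    d  : desired-response samples
    M  : number of taps of the transversal filter
    mu : step-size parameter
    Returns the final tap-weight vector and the error sequence.
    """
    w = np.zeros(M, dtype=complex)            # zero initial tap weights (assumed)
    e = np.zeros(len(u), dtype=complex)
    for n in range(M - 1, len(u)):
        un = u[n - M + 1:n + 1][::-1]         # tap-input vector [u(n), ..., u(n-M+1)]
        e[n] = d[n] - np.conj(w) @ un         # error signal d(n) - w^H(n) u(n)
        w = w + mu * un * np.conj(e[n])       # Eq. (1)
    return w, e
```

Feeding the filter the input and output of a known 3-tap system, for instance, drives the adaptive weights toward the true taps.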
Although the LMS filter is very simple in computational terms, its mathematical analysis is profoundly complicated because of its stochastic and nonlinear nature. Indeed, despite the extensive effort that has been expended in the literature to analyze the LMS filter, we still do not have a direct mathematical theory for its stability and steady-state performance, and probably we never will. Nevertheless, we do have a good understanding of its behavior in a stationary as well as a nonstationary environment, as demonstrated in the chapters of this book.
The stochastic nature of the LMS filter manifests itself in the fact that in a stationary environment, and under the assumption of a small step-size parameter, the filter executes a form of Brownian motion. Specifically, the small step-size theory of the LMS filter is almost exactly described by the discrete-time version of the Langevin equation¹ [3]:

$$\Delta\nu_k(n) = \nu_k(n+1) - \nu_k(n) = -\mu\lambda_k\nu_k(n) + \phi_k(n), \qquad k = 1, 2, \ldots, M, \qquad (2)$$

which is naturally split into two parts: a damping force $-\mu\lambda_k\nu_k(n)$ and a stochastic force $\phi_k(n)$. The terms used herein are defined as follows:

$M$ — order (i.e., number of taps) of the transversal filter around which the LMS filter is built
$\lambda_k$ — $k$th eigenvalue of the correlation matrix of the input vector $\mathbf{u}(n)$, which is denoted by $\mathbf{R}$
$\phi_k(n)$ — $k$th component of the vector $\mu\mathbf{Q}^{H}\mathbf{u}(n)e_o^{*}(n)$
$\mathbf{Q}$ — unitary matrix whose $M$ columns constitute an orthogonal set of eigenvectors associated with the eigenvalues of the correlation matrix $\mathbf{R}$
$e_o(n)$ — optimum error signal produced by the corresponding Wiener filter driven by the input vector $\mathbf{u}(n)$ and the desired response $d(n)$
To illustrate the validity of Eq. (2) as the description of small step-size theory of the LMS filter, we present the results of a computer experiment on a classic example of adaptive equalization. The example involves an unknown linear channel whose impulse response is described by the raised cosine [3]

$$h(n) = \begin{cases} \dfrac{1}{2}\left[1 + \cos\left(\dfrac{2\pi}{W}(n-2)\right)\right], & n = 1, 2, 3, \\ 0, & \text{otherwise,} \end{cases} \qquad (3)$$

where the parameter $W$ controls the amount of amplitude distortion produced by the channel, with the distortion increasing with $W$. Equivalently, the parameter $W$ controls the eigenvalue spread (i.e., the ratio of the largest eigenvalue to the smallest eigenvalue) of the correlation matrix of the tap inputs of the equalizer, with the eigenvalue spread increasing with $W$. The equalizer has $M = 11$ taps. Figure 1 presents the learning curves of the equalizer trained using the LMS algorithm with the step-size parameter $\mu = 0.0075$ and varying $W$. Each learning curve was obtained by averaging the squared value of the error signal $e(n)$ versus the number of iterations $n$ over an ensemble of 100 independent trials of the experiment.
¹ The Langevin equation is the engineer's version of stochastic differential (difference) equations.
Figure 1 Learning curves of the LMS algorithm applied to the adaptive equalization of a communication channel whose impulse response is described by Eq. (3), for varying eigenvalue spreads. Theory is represented by continuous, well-defined curves; experimental results are represented by fluctuating curves.
The continuous curves shown in Figure 1 are theoretical, obtained by applying Eq. (2). The curves with relatively small fluctuations are the results of experimental work. Figure 1 demonstrates close agreement between theory and experiment.
It should, however, be reemphasized that application of Eq. (2) is limited to small values of the step-size parameter $\mu$. Chapters in this book deal with cases in which $\mu$ is large.
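The equalization experiment can be reproduced in outline. The sketch below assumes Bernoulli ($\pm 1$) training symbols, an overall decision delay of 7 samples, and a small additive channel noise; those details are illustrative assumptions, not values stated in the text:

```python
import numpy as np

def raised_cosine_channel(W):
    # Impulse response of Eq. (3): h(n) = 0.5*(1 + cos(2*pi*(n-2)/W)), n = 1, 2, 3
    n = np.arange(1, 4)
    return 0.5 * (1.0 + np.cos(2.0 * np.pi * (n - 2) / W))

def lms_equalizer_learning_curve(W, mu=0.0075, M=11, n_iter=500, trials=100, seed=0):
    """Ensemble-averaged squared-error learning curve of an LMS equalizer."""
    rng = np.random.default_rng(seed)
    h = raised_cosine_channel(W)
    delay = 7                                   # assumed channel + equalizer decision delay
    mse = np.zeros(n_iter)
    for _ in range(trials):
        a = rng.choice([-1.0, 1.0], size=n_iter + M + delay)   # Bernoulli data (assumed)
        x = np.convolve(a, h)[:len(a)]
        x = x + 0.001 * rng.standard_normal(len(x))            # small channel noise (assumed)
        w = np.zeros(M)
        for n in range(n_iter):
            k = n + M - 1
            u = x[k - M + 1:k + 1][::-1]        # equalizer tap-input vector
            e = a[k - delay] - w @ u            # error against the delayed symbol
            w = w + mu * u * e                  # LMS update (real data)
            mse[n] += e * e
    return mse / trials
```

For a moderate eigenvalue spread the resulting curve decays from roughly unit MSE toward a small residual value, in the manner of Figure 1.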
REFERENCES

1. B. Widrow and M. E. Hoff, Jr. (1960). "Adaptive Switching Circuits," IRE WESCON Conv. Rec., Part 4, pp. 96-104.
2. B. Widrow (1966). "Adaptive Filters I: Fundamentals," Rep. SEL-66-126 (TR-6764-6), Stanford Electronic Laboratories, Stanford, CA.
3. S. Haykin (2002). Adaptive Filter Theory, 4th ed., Prentice-Hall.
4. B. Widrow and S. D. Stearns (1985). Adaptive Signal Processing, Prentice-Hall.
ON THE EFFICIENCY OF ADAPTIVE ALGORITHMS

1.1 INTRODUCTION
The basic component of most adaptive filtering and signal processing systems is the adaptive linear combiner [1-5] shown in Figure 1.1. The formed output signal is a weighted sum of a set of input signals. The output would be a simple linear combination of the inputs only if the weights were fixed. In actual practice, the weights are adjusted or adapted purposefully; the resulting weight values are signal dependent. This process causes the system behavior during adaptation to differ significantly from that of a linear system. However, after the adaptive process has converged and the weights have settled to essentially fixed values with only minor random fluctuations about the equilibrium solution, the converged system exhibits essentially linear behavior.

Adaptive linear combiners have been successfully used in the modeling of unknown systems [2, 6-8], linear prediction [2, 9-11], adaptive noise cancelling [4, 12], adaptive antenna systems [3, 13-15], channel equalization systems for high-speed digital communications [16-19], echo cancellation [20-23], systems for instantaneous frequency estimation [24], receivers of narrowband signals buried in noise (the adaptive line enhancer) [4, 25-30], adaptive control systems [31], and many other applications.
In Figure 1.1a, the interpretation of the input signal vector $\mathbf{u}(n) = [u_1(n), \ldots, u_K(n)]^T$ and the desired response $d(n)$ might vary, depending on how the adaptive linear combiner is used. In Figure 1.1b, an application to adaptive finite impulse response (FIR) filtering is shown. In turn, an application of adaptive FIR filtering to plant modeling or system identification is shown in Figure 1.2. Here, we can view the desired response $d(n)$ as a linear combination of the last $K$ samples of the input signal, corrupted by independent zero-mean plant noise $v(n)$. Our aim in this application is to estimate an unknown plant (represented by its transfer function $P(z)$).

Least-Mean-Square Adaptive Filters, Edited by Simon Haykin and Bernard Widrow. ISBN 0-471-21570-8 © 2003 John Wiley & Sons, Inc.
Figure 1.1 Adaptive linear combiner and its application in an adaptive filter: (a) linear combiner; (b) adaptive FIR filter.
Figure 1.2 Adaptive FIR filter applied to plant modeling (system identification).

The output of the adaptive FIR filter is

$$y(n) = \sum_{i=1}^{K} w_i u(n-i+1) = \mathbf{w}^T\mathbf{u}(n) = \mathbf{u}^T(n)\mathbf{w}. \qquad (1.3)$$
The input signal vector and the desired response are assumed to be wide-sense stationary. Denoting the desired response as $d(n)$, the error at the $n$th time is

$$e(n) = d(n) - y(n) = d(n) - \mathbf{w}^T\mathbf{u}(n) = d(n) - \mathbf{u}^T(n)\mathbf{w}. \qquad (1.4)$$

The mean-square error (MSE) is defined as

$$\xi = E[e^2(n)], \qquad (1.5)$$

which, for fixed weights, expands to

$$\xi = E[d^2(n)] - 2\mathbf{p}^T\mathbf{w} + \mathbf{w}^T\mathbf{R}\mathbf{w}, \qquad (1.6)$$

where the cross-correlation vector between the input signal and the desired response is defined as

$$E[d(n)\mathbf{u}(n)] = E\begin{bmatrix} d(n)u(n) \\ \vdots \\ d(n)u(n-K+1) \end{bmatrix} \triangleq \mathbf{p}, \qquad (1.7)$$

and the input autocorrelation matrix $\mathbf{R}$ is defined as
$$E[\mathbf{u}(n)\mathbf{u}^T(n)] = E\begin{bmatrix} u(n)u(n) & \cdots & u(n)u(n-K+1) \\ u(n-1)u(n) & \cdots & u(n-1)u(n-K+1) \\ \vdots & \ddots & \vdots \\ u(n-K+1)u(n) & \cdots & u(n-K+1)u(n-K+1) \end{bmatrix} \triangleq \mathbf{R}. \qquad (1.8)$$
It can be observed from Eq. (1.6) that with wide-sense stationary inputs, the MSE performance function is a quadratic function of the weights, a paraboloidal bowl. This function can be minimized by differentiating $\xi$ with respect to $\mathbf{w}$ and setting the derivative to zero. The minimal point is

$$\mathbf{w} = \mathbf{w}_o = \mathbf{R}^{-1}\mathbf{p}. \qquad (1.9)$$

The optimal weight vector $\mathbf{w}_o$ is known as the Wiener weight vector or the Wiener solution.
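Equations (1.7)-(1.9) suggest a direct numerical check: estimate $\mathbf{R}$ and $\mathbf{p}$ from data and solve for the Wiener weights. A minimal sketch, in which the plant weights and noise level are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
K, N = 4, 50_000
w_plant = np.array([1.0, -0.5, 0.25, 0.1])       # illustrative plant weights

# Training data: d(n) = w_plant^T u(n) + v(n), white input, zero-mean plant noise
U = rng.standard_normal((N, K))
d = U @ w_plant + 0.1 * rng.standard_normal(N)

R_hat = U.T @ U / N               # sample estimate of R, Eq. (1.8)
p_hat = U.T @ d / N               # sample estimate of p, Eq. (1.7)
w_o = np.linalg.solve(R_hat, p_hat)   # Wiener solution, Eq. (1.9): w_o = R^{-1} p
```

With enough data, the estimated Wiener weights recover the plant weights up to estimation noise.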
In practice, we would not know the exact statistics of $\mathbf{R}$ and $\mathbf{p}$. One way of finding an estimate of the optimal weight vector $\mathbf{w}_o$ would be to estimate $\mathbf{R}$ and $\mathbf{p}$ for the given input and desired response. This approach would lead to what is called an exact least-mean-square solution. This approach is optimal in the sense that the sum of square errors will be minimal for the given data samples. However, such solutions are generally somewhat complex from the computational point of view [32-37].

On the other hand, one can use one of the simpler gradient search algorithms, such as the least-mean-square (LMS) steepest descent algorithm of Widrow and Hoff [1]. However, this algorithm is sometimes associated with a certain deterioration in performance in problems for which there exists great spread among the eigenvalues of the autocorrelation matrix $\mathbf{R}$ (see, for instance, [32, 34, 37]).
In order to establish a bridge between the LMS and the exact least squares approaches mentioned above, we will introduce an idealized algorithm called LMS/Newton [5]. For the implementation of this algorithm, we will have to assume perfect knowledge of the autocorrelation matrix $\mathbf{R}$. Naturally, that means that this idealized algorithm cannot be used in practice. However, its performance provides a convenient theoretical benchmark for the sake of comparison.¹

¹ It should be noted that there are numerous algorithms in the literature that recursively estimate the autocorrelation matrix $\mathbf{R}$ and use this estimate for orthogonalizing the input data (see, for instance, [38-42]). These algorithms converge asymptotically to the idealized algorithm discussed here.
In the next section, we will briefly analyze the performance of the exact least squares solution when the weights are obtained with a finite data sample. Then, in Sections 1.3 and 1.4, we will analyze the idealized LMS/Newton algorithm and, in Section 1.5, show, at least heuristically, that its performance is equivalent to that of an exact least squares algorithm. Based on this heuristic argument, we will view the LMS/Newton process as an optimal gradient search algorithm.

In Section 1.6 we will define a class of nonstationary problems: problems in which an unknown plant $P(z)$ varies in a certain random way. Once again, the adaptive filter will perform a modeling task. For this class of frequently encountered problems, we will analyze and compare the performance of LMS/Newton with that of the conventional steepest descent LMS algorithm. We will show that both perform equivalently (in the mean-square sense) for this class of nonstationary problems. In Section 1.7, we will examine the MSE learning curves and the transient behavior of adaptive algorithms. The excess error energy will be defined to be the area under the excess MSE curve. The LMS and LMS/Newton algorithms will be shown to perform, on average, equivalently with respect to this important criterion if they both start learning from random initial conditions that have the same variance. In Sections 1.8 and 1.9, we will conclude this chapter by summarizing the various comparisons made between the LMS algorithm and the ideal LMS/Newton algorithm.
1.2

Suppose that the adaptive linear combiner in Figure 1.2 is fed $N$ independent zero-mean $K \times 1$ training vectors $\mathbf{u}(1), \mathbf{u}(2), \ldots, \mathbf{u}(N)$ and their respective scalar desired responses $d(1), d(2), \ldots, d(N)$, all drawn from a wide-sense stationary process. Keeping the weights fixed, a set of $N$ error equations can be written as

$$e(n) = d(n) - \mathbf{u}^T(n)\mathbf{w}, \qquad n = 1, 2, \ldots, N. \qquad (1.10)$$

The objective is to find a weight vector that minimizes the sum of the squares of the error values based on the finite sample of $N$ items of data.
Equation (1.10) can be written in matrix form as

$$\mathbf{e} = \mathbf{d} - \mathbf{U}\mathbf{w}, \qquad (1.11)$$

where

$$\mathbf{e} \triangleq [e(1), e(2), \ldots, e(N)]^T, \qquad (1.12)$$
$$\mathbf{d} \triangleq [d(1), d(2), \ldots, d(N)]^T, \qquad (1.13)$$
$$\mathbf{U} \triangleq [\mathbf{u}(1), \mathbf{u}(2), \ldots, \mathbf{u}(N)]^T. \qquad (1.14)$$

A unique solution of Eq. (1.11), a weight vector $\mathbf{w}$ that brings $\mathbf{e}$ to zero, exists only if $\mathbf{U}$ is square and nonsingular. However, the case of greatest interest is that of $N > K$. As such, Eq. (1.11) would typically be overconstrained, and one would generally seek a best least squares solution. The sum of the squares of the errors is

$$\mathbf{e}^T\mathbf{e} = \mathbf{d}^T\mathbf{d} + \mathbf{w}^T\mathbf{U}^T\mathbf{U}\mathbf{w} - 2\mathbf{d}^T\mathbf{U}\mathbf{w}. \qquad (1.15)$$
The small-sample-size MSE is defined as

$$\hat{\xi} \triangleq \frac{1}{N}\mathbf{e}^T\mathbf{e}, \qquad (1.16)$$

and

$$\lim_{N\to\infty} \hat{\xi} = \xi. \qquad (1.17)$$

Note that $\hat{\xi}$ is a quadratic function of the weights. The parameters of the quadratic form are related to properties of the $N$ data samples. $\mathbf{U}^T\mathbf{U}$ is square and is assumed to be positive definite. $\hat{\xi}$ is a small-sample-size MSE function; $\xi$ is the large-sample-size true MSE function, and it is also a quadratic function of the weights. Figure 1.3 shows a comparative sketch of these functions. Many small-sample-size data-dependent curves are possible, but there is only one large-sample-size curve. The unique large-sample-size curve is the average of the many small-sample-size curves.
Minimizing $\hat{\xi}$ with respect to the weights yields

$$\mathbf{w}_{LS} = (\mathbf{U}^T\mathbf{U})^{-1}\mathbf{U}^T\mathbf{d}. \qquad (1.18)$$

This is the exact least squares solution for the given data sample. The Wiener solution $\mathbf{w}_o$ is the expected value of $\mathbf{w}_{LS}$.
Each small-sample-size curve is an ensemble member. Let the ensemble be constructed in the following manner. Assume that the vectors $\mathbf{u}(1), \mathbf{u}(2), \ldots, \mathbf{u}(N)$ are the same for all ensemble members, but that the associated desired responses $d(1), d(2), \ldots, d(N)$ differ from one ensemble member to another because of the stochastic character of plant noise (refer to Fig. 1.2). Over this ensemble, therefore, the $\mathbf{U}$ matrix is constant, while the desired response vector $\mathbf{d}$ is stochastic. In order to evaluate the excess MSE due to adaptation with the finite amount of data available, we have to find

$$\xi_{\text{excess}} = \frac{1}{N}E[\boldsymbol{\varepsilon}^T\mathbf{U}^T\mathbf{U}\boldsymbol{\varepsilon}], \qquad (1.19)$$

where $\boldsymbol{\varepsilon} \triangleq \mathbf{w}_{LS} - \mathbf{w}_o$ is the weight-error vector. Using the properties of the trace operator,

$$\xi_{\text{excess}} = \frac{1}{N}\mathrm{Tr}\{E[\boldsymbol{\varepsilon}^T\mathbf{U}^T\mathbf{U}\boldsymbol{\varepsilon}]\} = \frac{1}{N}E\{\mathrm{Tr}[\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^T\mathbf{U}^T\mathbf{U}]\} = \frac{1}{N}\mathrm{Tr}\{E[\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^T]\mathbf{U}^T\mathbf{U}\}. \qquad (1.21)$$

Over this ensemble, the covariance of the weight-error vector is

$$E[\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^T] = \xi_{\min}(\mathbf{U}^T\mathbf{U})^{-1}, \qquad (1.23)$$
where $\xi_{\min}$ is the minimum MSE, the minimum of the true MSE function (see Fig. 1.3). Substitution of Eq. (1.23) into Eq. (1.21) yields

$$\xi_{\text{excess}} = \frac{K}{N}\xi_{\min}. \qquad (1.24)$$
It is important to note that this formula does not depend on U. The above-described
ensemble can be generalized to an ensemble of ensembles, each having its own U,
without changing Eq. (1.24). Hence, this formula is valid for a very wide class of
inputs.
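The prediction $\xi_{\text{excess}} = (K/N)\xi_{\min}$ of Eq. (1.24) can be checked by Monte Carlo: solve the exact least squares problem on many independent finite data samples and measure the average excess MSE. A sketch with white unit-variance inputs (so $\mathbf{R} = \mathbf{I}$); all numerical values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
K, N, trials = 5, 100, 2000
xi_min = 0.25                     # plant-noise variance = minimum MSE (illustrative)
w_o = np.zeros(K)                 # target (Wiener) weights, zero for simplicity

excess = 0.0
for _ in range(trials):
    U = rng.standard_normal((N, K))                  # white input: R = I
    d = U @ w_o + np.sqrt(xi_min) * rng.standard_normal(N)
    w_ls = np.linalg.lstsq(U, d, rcond=None)[0]      # exact least squares, Eq. (1.18)
    eps = w_ls - w_o
    excess += eps @ eps                              # eps^T R eps with R = I

M_hat = excess / trials / xi_min                     # measured misadjustment
M_theory = K / N                                     # prediction from Eq. (1.24)
```

The measured ratio lands close to $K/N = 0.05$, independently of the particular realizations of $\mathbf{U}$.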
It is useful to consider a dimensionless ratio between the excess MSE and the minimum MSE. This ratio is commonly called (see, e.g., [1, 2, 4]) the misadjustment, $M$. For the exact least squares solution based on learning with a finite data sample, we find the misadjustment from Eq. (1.24) as

$$M = \frac{K}{N} = \frac{\text{number of weights}}{\text{number of independent training samples}}. \qquad (1.25)$$

The method of steepest descent adjusts the weights iteratively along the negative gradient of the MSE surface,

$$\mathbf{w}(n+1) = \mathbf{w}(n) + \mu[-\nabla(n)], \qquad (1.26)$$

while the LMS algorithm replaces the true gradient by the instantaneous estimate

$$\hat{\nabla}(n) = -2e(n)\mathbf{u}(n). \qquad (1.27)$$
This is a noisy but unbiased estimate of the gradient [5, p. 101]. Using this instantaneous gradient in place of the true gradient in Eq. (1.26) yields the LMS algorithm of Widrow and Hoff:

$$\mathbf{w}(n+1) = \mathbf{w}(n) + 2\mu e(n)\mathbf{u}(n). \qquad (1.28)$$
The behavior of this algorithm has been analyzed extensively in the literature (see, e.g., [2-4, 44-51]). It was proved in [2] and [4] that if the adaptation constant $\mu$ were chosen such that

$$0 < \mu < \frac{1}{\mathrm{Tr}[\mathbf{R}]}, \qquad (1.29)$$
then the adaptive weights would relax from their initial condition to hover randomly about the Wiener solution $\mathbf{w}_o$. The weight-error vector

$$\boldsymbol{\varepsilon}(n) \triangleq \mathbf{w}(n) - \mathbf{w}_o \qquad (1.30)$$

will then converge to zero in the mean, and its variance will be stable [2-4, 47, 52, 53]. The relaxation process will be governed by the relation

$$E[\boldsymbol{\varepsilon}(n+1)] = (\mathbf{I} - 2\mu\mathbf{R})E[\boldsymbol{\varepsilon}(n)]. \qquad (1.31)$$
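The mean relaxation of Eq. (1.31) can be observed directly by averaging LMS weight errors over an ensemble of independent runs. A minimal sketch; the correlation matrix, step size, and initial error below are illustrative choices:

```python
import numpy as np

# Ensemble check of Eq. (1.31): E[eps(n+1)] = (I - 2*mu*R) E[eps(n)]
rng = np.random.default_rng(3)
K, mu, n_steps, members = 2, 0.02, 30, 4000
R = np.array([[1.0, 0.4], [0.4, 1.0]])
L = np.linalg.cholesky(R)                   # inputs u = L z satisfy E[u u^T] = R
eps = np.tile([0.5, 0.5], (members, 1))     # common initial weight error eps(0)

for _ in range(n_steps):
    u = rng.standard_normal((members, K)) @ L.T
    e = -(eps * u).sum(axis=1, keepdims=True)    # error e(n) = -eps^T u (no plant noise)
    eps = eps + 2 * mu * e * u                   # LMS update of Eq. (1.28), in error form

mean_err = eps.mean(axis=0)                 # ensemble-mean weight error after n_steps
theory = np.linalg.matrix_power(np.eye(K) - 2 * mu * R, n_steps) @ np.array([0.5, 0.5])
```

The ensemble mean tracks the deterministic recursion $( \mathbf{I} - 2\mu\mathbf{R})^n \boldsymbol{\varepsilon}(0)$ closely.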
The correlation matrix $\mathbf{R}$ can be diagonalized as

$$\mathbf{Q}\mathbf{Q}^T = \mathbf{I}, \qquad \mathbf{R} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^T, \qquad (1.32)$$

where

$$\boldsymbol{\Lambda} = \begin{bmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_K \end{bmatrix}. \qquad (1.33)$$

Assuming slow adaptation, $\mu\lambda_i \ll 1$, the time constants of the weight relaxation are

$$\tau_i \approx \frac{1}{2\mu\lambda_i}, \qquad 1 \le i \le K. \qquad (1.34a)$$
As the weights relax toward the Wiener solution, the MSE, a quadratic function of the weights, undergoes a geometric progression toward $\xi_{\min}$. The learning curve is a plot of MSE versus the number of adaptation cycles. The natural modes of the learning curve have time constants half as large as the corresponding time constants of the weights [2-4]. Accordingly, the MSE learning curve time constants are

$$\tau_i^{\text{MSE}} \approx \frac{1}{4\mu\lambda_i}, \qquad 1 \le i \le K. \qquad (1.34b)$$
After convergence has taken place, there remains noise in the weights due to the noise in the estimation of the gradient in Eq. (1.27). An approximate value of the covariance of the weight noise, valid for small $\mu$, was derived in [4, App. D]:

$$E[\boldsymbol{\varepsilon}(n)\boldsymbol{\varepsilon}^T(n)] \approx \mu\xi_{\min}\mathbf{I}. \qquad (1.35)$$

The noise in the weights will cause an excess error in the system output (in addition to $\xi_{\min}$, the Wiener error):

$$\xi_{\text{excess}} = E[(\boldsymbol{\varepsilon}^T(n)\mathbf{u}(n))^2] \qquad (1.36)$$
$$\approx \mu\xi_{\min}\mathrm{Tr}[\mathbf{R}]. \qquad (1.37)$$
Therefore, we can compute the misadjustment, defined as the ratio between the excess and the minimum MSE:

$$M \triangleq \frac{\xi_{\text{excess}}}{\xi_{\min}} = \mu\,\mathrm{Tr}[\mathbf{R}]. \qquad (1.38)$$
The adaptation constant $\mu$ should be kept low in order to keep the misadjustment low. However, low $\mu$ is associated with slow adaptation, in accordance with Eq. (1.34). Equations (1.29) to (1.38) illustrate the potential vulnerability of the steepest descent algorithm. The speed of convergence will depend on the choice of initial conditions. In the worst case, the convergence will be dominated by the lowest eigenvalue

$$\lambda_{\min} = \min(\lambda_1, \ldots, \lambda_K). \qquad (1.39)$$
This implies that even if we choose the maximal value allowable for the adaptation constant $\mu$ (due to the stability constraint in Eq. (1.29)), the slowest MSE time constant would be

$$\tau_{\max}^{\text{MSE}} = \frac{1}{4\mu\lambda_{\min}}. \qquad (1.40)$$
For the class of problems for which there exists a great spread of eigenvalues of the
autocorrelation matrix R, this number will be high, resulting in long convergence
times (at least in the worst case).
The LMS/Newton algorithm is

$$\mathbf{w}(n+1) = \mathbf{w}(n) + 2\mu\lambda_{\text{ave}}\mathbf{R}^{-1}e(n)\mathbf{u}(n). \qquad (1.41)$$

The gradient estimate is premultiplied by $\mathbf{R}^{-1}$ and in addition scaled by $\lambda_{\text{ave}}$, the average of the eigenvalues of $\mathbf{R}$. With this scaling, the LMS/Newton algorithm of Eq. (1.41) becomes identical to the steepest descent LMS algorithm of Eq. (1.28) when all of the eigenvalues are equal.
The LMS/Newton algorithm will be shown to be the most efficient of all adaptive algorithms. For a given number of weights and convergence speed, it has the lowest possible misadjustment. The LMS/Newton algorithm cannot be implemented physically, because perfect knowledge of the autocorrelation matrix $\mathbf{R}$ and its inverse usually does not exist. On the other hand, the LMS/Newton algorithm is very important from a theoretical point of view because of its optimality.
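Although LMS/Newton is idealized, it is easy to simulate when $\mathbf{R}$ is known by construction. A sketch of the recursion of Eq. (1.41); the correlation matrix, plant weights, and step size in the usage below are illustrative:

```python
import numpy as np

def lms_newton(u_vecs, d, R, mu):
    """Idealized LMS/Newton recursion of Eq. (1.41):
        w(n+1) = w(n) + 2*mu*lambda_ave * R^{-1} * e(n) * u(n),
    assuming the true autocorrelation matrix R is known."""
    K = R.shape[0]
    lam_ave = np.trace(R) / K           # average eigenvalue of R
    R_inv = np.linalg.inv(R)
    w = np.zeros(K)
    for un, dn in zip(u_vecs, d):
        e = dn - w @ un                 # instantaneous error
        w = w + 2.0 * mu * lam_ave * e * (R_inv @ un)
    return w
```

Driving it with colored inputs of known covariance (generated through a Cholesky factor of $\mathbf{R}$) and a noiseless plant, the weights converge to the plant weights; the step size must respect $0 < \mu < 1/\mathrm{Tr}[\mathbf{R}]$.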
The conditions for stability, as well as the learning time constant and misadjustment formulas, for the LMS/Newton algorithm can be readily obtained. The condition for convergence in the mean and in the variance for LMS/Newton is

$$0 < \mu < \frac{1}{\mathrm{Tr}[\mathbf{R}]}. \qquad (1.42)$$
This is identical to Eq. (1.29). The time constant of the MSE learning curve for LMS/Newton is

$$\tau_{\text{MSE}} = \frac{1}{4\mu\lambda_{\text{ave}}}. \qquad (1.43)$$
Comparing this to Eq. (1.34b), one can see that LMS has many time constants, while LMS/Newton has only one. When the eigenvalues are equal, both algorithms have only one time constant and these formulas become identical. The misadjustment of LMS/Newton is

$$M = \mu\,\mathrm{Tr}[\mathbf{R}]. \qquad (1.44)$$

Combining Eqs. (1.43) and (1.44),

$$M = \frac{\mathrm{Tr}[\mathbf{R}]}{4\tau_{\text{MSE}}\lambda_{\text{ave}}} = \frac{K}{4\tau_{\text{MSE}}}, \qquad (1.45)$$

that is,

$$M = \frac{\text{number of weights}}{4 \times (\text{MSE learning time constant})}. \qquad (1.46)$$
When learning with a finite data sample, the optimal weight vector is the best least squares solution for that data sample, and it is often called the exact least squares solution. This solution, given by Eq. (1.18), makes the best use of the finite number of data samples in the least squares sense. All of the data are weighted equally in affecting the solution. This solution will vary from one finite data sample to another. From Eq. (1.25), the misadjustment of the exact least squares solution is given by

$$M = \frac{\text{number of weights}}{\text{number of independent training samples}}. \qquad (1.47)$$

For the same consumption of data, it is apparent that LMS/Newton and exact least squares yield the same misadjustment. Although we are comparing apples with oranges by comparing a steady-flow algorithm with an algorithm that learns with a finite data sample, we nevertheless find that LMS/Newton is as efficient as exact least squares when we relate the quality of the weight-vector solution to the amount of data used in obtaining it. Since the exact least squares solution makes optimal use of the data, so does LMS/Newton.
1.6

Figure 1.4
corresponding weights of the adaptive filter and are designated as $\mathbf{w}_o(n)$, the time index indicating that the unknown target to be tracked is time-varying. The components of $\mathbf{w}_o(n)$ are generated by passing independent white noises of variance $\sigma^2$ through identical one-pole low-pass filters. The components of $\mathbf{w}_o(n)$ therefore vary as independent first-order Markov processes. The formation of $\mathbf{w}_o(n)$ is illustrated in Figures 1.4 and 1.5.
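Generating such a first-order Markov target is a one-line recursion per time step. A sketch, where the pole location `a` and the noise level `sigma` are illustrative choices not taken from the text:

```python
import numpy as np

def markov_target(K, n_steps, a, sigma, seed=0):
    """Time-varying target w_o(n): each of the K components is an independent
    first-order Markov process, i.e., white noise of variance sigma^2 passed
    through an identical one-pole low-pass filter with pole at z = a."""
    rng = np.random.default_rng(seed)
    w = np.zeros((n_steps, K))
    for n in range(1, n_steps):
        w[n] = a * w[n - 1] + sigma * rng.standard_normal(K)
    return w
```

The stationary variance of each component is $\sigma^2/(1 - a^2)$, which sets the scale of the tracking problem.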
Figure 1.5
According to the scheme of Figure 1.4, minimizing the MSE causes the adaptive weight vector $\mathbf{w}(n)$ to attempt to best match the unknown $\mathbf{w}_o(n)$ on a continual basis. The $\mathbf{R}$ matrix, dependent only on the statistics of $\mathbf{u}(n)$, is constant even as $\mathbf{w}_o(n)$ varies. The desired response of the adaptive filter, $d(n)$, is nonstationary, being the output of a time-varying system. The minimum MSE, $\xi_{\min}$, is constant. Thus the MSE function, a quadratic bowl, varies in position, while its eigenvalues, eigenvectors, and $\xi_{\min}$ remain constant.
In order to study this form of nonstationary adaptation both analytically and by computer simulation, a model comprising an ensemble of nonstationary adaptive processes has been defined and constructed, as illustrated in Figure 1.5. Throughout the ensemble, the unknown filters to be modeled are all identical and have the same time-varying weight vector $\mathbf{w}_o(n)$. Each ensemble member has its own independent input signal going to both the unknown system and the corresponding adaptive system. The effect of output noise in the unknown systems is obtained by the addition of independent noises of variance $\xi_{\min}$. All of the adaptive filters are assumed to start with the same initial weight vector $\mathbf{w}(0)$; each develops its own weight vector over time in attempting to pursue the moving Markovian target $\mathbf{w}_o(n)$.
For a given adaptive filter, the weight-vector tracking error at the $n$th instant is $\boldsymbol{\varepsilon}(n) \triangleq \mathbf{w}(n) - \mathbf{w}_o(n)$. This error is due to both the effects of gradient noise and weight-vector lag and may be expressed as

$$\boldsymbol{\varepsilon}(n) = \mathbf{w}(n) - \mathbf{w}_o(n) = \underbrace{\left[\mathbf{w}(n) - E[\mathbf{w}(n)]\right]}_{\text{weight-vector noise}} + \underbrace{\left[E[\mathbf{w}(n)] - \mathbf{w}_o(n)\right]}_{\text{weight-vector lag}}. \qquad (1.48)$$
The expectations are averages over the ensemble. Equation (1.48) identifies the two components of the error. Any difference between the ensemble mean of the adaptive weight vectors and the target value $\mathbf{w}_o(n)$ is due to lag in the adaptive process, while the deviation of the individual adaptive weight vectors about the ensemble mean is due to gradient noise.

Weight-vector error causes an excess MSE. The ensemble-average excess MSE at the $n$th instant is

$$\text{(average excess MSE)}(n) = E\left[(\mathbf{w}(n) - \mathbf{w}_o(n))^T\mathbf{R}(\mathbf{w}(n) - \mathbf{w}_o(n))\right]. \qquad (1.49)$$
Expanding about the ensemble mean,

$$\text{(average excess MSE)}(n) = E\left[(\mathbf{w}(n) - E[\mathbf{w}(n)])^T\mathbf{R}(\mathbf{w}(n) - E[\mathbf{w}(n)])\right] + E\left[(E[\mathbf{w}(n)] - \mathbf{w}_o(n))^T\mathbf{R}(E[\mathbf{w}(n)] - \mathbf{w}_o(n))\right] + 2E\left[(\mathbf{w}(n) - E[\mathbf{w}(n)])^T\mathbf{R}(E[\mathbf{w}(n)] - \mathbf{w}_o(n))\right]. \qquad (1.50)$$
Expanding the last term of Eq. (1.50) and simplifying, since $\mathbf{w}_o(n)$ is constant over the ensemble and $E[\mathbf{w}(n) - E[\mathbf{w}(n)]] = \mathbf{0}$, the cross term vanishes:

$$2E\left[(\mathbf{w}(n) - E[\mathbf{w}(n)])^T\mathbf{R}(E[\mathbf{w}(n)] - \mathbf{w}_o(n))\right] = 0, \qquad (1.51)$$

so that

$$\text{(average excess MSE)}(n) = E\left[(\mathbf{w}(n) - E[\mathbf{w}(n)])^T\mathbf{R}(\mathbf{w}(n) - E[\mathbf{w}(n)])\right] + E\left[(E[\mathbf{w}(n)] - \mathbf{w}_o(n))^T\mathbf{R}(E[\mathbf{w}(n)] - \mathbf{w}_o(n))\right]. \qquad (1.52)$$
The average excess MSE is thus a sum of components due to both gradient noise and lag:

$$\text{(average excess MSE due to lag)}(n) = E\left[(E[\mathbf{w}(n)] - \mathbf{w}_o(n))^T\mathbf{R}(E[\mathbf{w}(n)] - \mathbf{w}_o(n))\right] = E\left[(E[\mathbf{w}'(n)] - \mathbf{w}'_o(n))^T\boldsymbol{\Lambda}(E[\mathbf{w}'(n)] - \mathbf{w}'_o(n))\right], \qquad (1.53)$$

$$\text{(average excess MSE due to gradient noise)}(n) = E\left[(\mathbf{w}(n) - E[\mathbf{w}(n)])^T\mathbf{R}(\mathbf{w}(n) - E[\mathbf{w}(n)])\right] = E\left[(\mathbf{w}'(n) - E[\mathbf{w}'(n)])^T\boldsymbol{\Lambda}(\mathbf{w}'(n) - E[\mathbf{w}'(n)])\right], \qquad (1.54)$$

where primes denote quantities in the rotated coordinates of Eq. (1.32), e.g., $\mathbf{w}'(n) = \mathbf{Q}^T\mathbf{w}(n)$.
Dividing by $\xi_{\min}$, the total misadjustment is the sum of the two components:

$$M_{\text{sum}} = \left(\begin{array}{c}\text{misadjustment due to}\\ \text{gradient noise}\end{array}\right) + \left(\begin{array}{c}\text{misadjustment}\\ \text{due to lag}\end{array}\right) = \mu\,\mathrm{Tr}[\mathbf{R}] + \frac{K\sigma^2}{4\mu\xi_{\min}}. \qquad (1.55)$$
It is interesting to note that $M_{\text{sum}}$ in Eq. (1.55) depends on the choice of the parameter $\mu$ and on the statistical properties of the nonstationary environment, but does not depend on the spread of the eigenvalues of the $\mathbf{R}$ matrix. It is no surprise, therefore, that when the components of misadjustment are evaluated for the LMS/Newton algorithm operating in the very same environment, the expression for $M_{\text{sum}}$ for the LMS/Newton algorithm turns out to be

$$M_{\text{sum}} = \left(\begin{array}{c}\text{misadjustment due to}\\ \text{gradient noise}\end{array}\right) + \left(\begin{array}{c}\text{misadjustment}\\ \text{due to lag}\end{array}\right) = \mu\,\mathrm{Tr}[\mathbf{R}] + \frac{K\sigma^2}{4\mu\xi_{\min}}, \qquad (1.56)$$
which is the same as Eq. (1.55). From this we may conclude that the performance of the LMS algorithm is equivalent to that of the LMS/Newton algorithm when both are operating with the same choice of $\mu$ in the same nonstationary environment, wherein they are tracking a first-order Markov target. Since LMS/Newton is optimal, we may conclude that the conventional, physically realizable LMS algorithm is also optimal when operating in a first-order Markov nonstationary environment. And it is likely optimal, or close to it, when operating in many other types of nonstationary environments, although this has not yet been proven.
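The two components of $M_{\text{sum}}$ in Eq. (1.55) pull in opposite directions: the gradient-noise term grows with $\mu$ while the lag term shrinks with $\mu$. The text does not derive the minimizing step size, but setting the derivative of Eq. (1.55) to zero gives $\mu_{\text{opt}} = \sqrt{K\sigma^2/(4\xi_{\min}\mathrm{Tr}[\mathbf{R}])}$; a quick numerical check, with all parameter values illustrative:

```python
import numpy as np

# Illustrative values for the quantities appearing in Eq. (1.55)
K, tr_R, xi_min, sigma2 = 10, 10.0, 0.01, 1e-6

def m_sum(mu):
    # Eq. (1.55): gradient-noise misadjustment + lag misadjustment
    return mu * tr_R + K * sigma2 / (4.0 * mu * xi_min)

# Minimizer of Eq. (1.55) (a direct consequence, not stated in the text)
mu_opt = np.sqrt(K * sigma2 / (4.0 * xi_min * tr_R))

mus = np.linspace(0.1 * mu_opt, 10.0 * mu_opt, 10_000)
mu_grid = mus[np.argmin(m_sum(mus))]     # grid-search minimizer
```

For a sensible operating point, $\mu_{\text{opt}}$ must also respect the stability bound of Eq. (1.29).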
1.7

There are two properties of the learning curve decay that are more important than how long it takes to die out (in principle, forever): the amount of transient excess MSE and the length of time it has existed. In other words, we need to determine how much excess error energy there has been. Refer to Figure 1.6 and consider the area under the learning curve above the $\xi_{\min}$ line. Starting from the same initial condition, the convergence times of two different learning curves are hereby defined as being identical if their respective areas are equal.
1.7.1

Assuming that we have knowledge of the true MSE gradient, adaptation will take place without gradient noise. The weight vector $\mathbf{w}$ is then only a function of the second-order statistics of the input $\mathbf{u}$ and the desired signal $d$, and does not depend on the actual values that a particular realization of these random processes may take. That is, the weight vector is not a random variable and can be pulled out of the expectations in the MSE expression. We thus obtain that at any iteration $n$, the MSE can be expressed as

$$\xi(n) = E[d^2] - 2\mathbf{p}^T\mathbf{w}(n) + \mathbf{w}^T(n)\mathbf{R}\mathbf{w}(n), \qquad (1.57)$$
Figure 1.6 Idealized learning curve (no gradient noise). The shaded area represents the
excess error energy.
where $E[d^2] \triangleq E[d^2(n)]$ for all $n$, since the desired output $d$ is wide-sense stationary. When $\mathbf{w}(n) = \mathbf{w}_o = \mathbf{R}^{-1}\mathbf{p}$, we can obtain $\xi_{\min}$ as

$$\xi_{\min} = E[d^2] - \mathbf{p}^T\mathbf{R}^{-1}\mathbf{p}. \qquad (1.58)$$

In terms of the weight-error vector $\boldsymbol{\varepsilon}(n) = \mathbf{w}(n) - \mathbf{w}_o$, the MSE of Eq. (1.57) becomes

$$\xi(n) = \xi_{\min} + \boldsymbol{\varepsilon}^T(n)\mathbf{R}\boldsymbol{\varepsilon}(n) \triangleq \xi_{\min} + \beta(n), \qquad (1.59)$$

where $\beta(n)$ is the transient excess MSE.
For the idealized LMS/Newton algorithm adapting with the true gradient, the weight-error update equation is

$$\boldsymbol{\varepsilon}(n+1) = (1 - 2\mu\lambda_{\text{ave}})\boldsymbol{\varepsilon}(n), \qquad (1.60)$$

so that

$$\beta(n) = (1 - 2\mu\lambda_{\text{ave}})^{2n}\boldsymbol{\varepsilon}^T(0)\mathbf{R}\boldsymbol{\varepsilon}(0). \qquad (1.61)$$
Excess error energy is the area under the transient excess MSE curve. Following this definition,

$$\alpha \triangleq \sum_{n=0}^{\infty}\beta(n) \qquad (1.62)$$
$$= \frac{1}{1 - (1 - 2\mu\lambda_{\text{ave}})^2}\,\boldsymbol{\varepsilon}^T(0)\mathbf{R}\boldsymbol{\varepsilon}(0) \qquad (1.63)$$
$$\approx \frac{1}{4\mu\lambda_{\text{ave}}}\,\boldsymbol{\varepsilon}^T(0)\mathbf{R}\boldsymbol{\varepsilon}(0) = \frac{1}{4\mu\lambda_{\text{ave}}}\,\boldsymbol{\varepsilon}'^T(0)\boldsymbol{\Lambda}\boldsymbol{\varepsilon}'(0), \qquad (1.64)$$

where the approximation holds for slow adaptation ($\mu\lambda_{\text{ave}} \ll 1$) and $\boldsymbol{\varepsilon}'(n) = \mathbf{Q}^T\boldsymbol{\varepsilon}(n)$. Assuming that $\boldsymbol{\varepsilon}'(0)$ is a random vector with components each having a variance of $g^2$, the average excess error energy is

$$E[\alpha] = \frac{1}{4\mu\lambda_{\text{ave}}}\sum_{i=1}^{K}g^2\lambda_i = \frac{Kg^2}{4\mu}.$$
1.7.1.2 Exact Steepest Descent

Under the same conditions, analogous calculations can be made for the exact steepest descent algorithm. There is no gradient noise. The weight-error update equation is now

$$\boldsymbol{\varepsilon}(n+1) = \boldsymbol{\varepsilon}(n) - 2\mu\mathbf{R}\boldsymbol{\varepsilon}(n) = (\mathbf{I} - 2\mu\mathbf{R})^{n+1}\boldsymbol{\varepsilon}(0). \qquad (1.65)$$

The transient excess MSE is

$$\beta(n) = \boldsymbol{\varepsilon}^T(0)(\mathbf{I} - 2\mu\mathbf{R})^n\mathbf{R}(\mathbf{I} - 2\mu\mathbf{R})^n\boldsymbol{\varepsilon}(0) = \boldsymbol{\varepsilon}^T(0)\left[\mathbf{Q}(\mathbf{I} - 2\mu\boldsymbol{\Lambda})\mathbf{Q}^T\right]^n\mathbf{R}\left[\mathbf{Q}(\mathbf{I} - 2\mu\boldsymbol{\Lambda})\mathbf{Q}^T\right]^n\boldsymbol{\varepsilon}(0) = \boldsymbol{\varepsilon}'^T(0)(\mathbf{I} - 2\mu\boldsymbol{\Lambda})^{2n}\boldsymbol{\Lambda}\boldsymbol{\varepsilon}'(0). \qquad (1.66)$$
Then, once again exploiting the properties of the trace operator and assuming slow adaptation, the excess error energy is

$$\alpha = \sum_{n=0}^{\infty}\mathrm{Tr}\left[\boldsymbol{\varepsilon}'^T(0)(\mathbf{I} - 2\mu\boldsymbol{\Lambda})^{2n}\boldsymbol{\Lambda}\boldsymbol{\varepsilon}'(0)\right] = \mathrm{Tr}\left[\sum_{n=0}^{\infty}(\mathbf{I} - 2\mu\boldsymbol{\Lambda})^{2n}\boldsymbol{\Lambda}\boldsymbol{\varepsilon}'(0)\boldsymbol{\varepsilon}'^T(0)\right] = \mathrm{Tr}\left\{\left[\mathbf{I} - (\mathbf{I} - 2\mu\boldsymbol{\Lambda})^2\right]^{-1}\boldsymbol{\Lambda}\boldsymbol{\varepsilon}'(0)\boldsymbol{\varepsilon}'^T(0)\right\} \approx \mathrm{Tr}\left[(4\mu\boldsymbol{\Lambda})^{-1}\boldsymbol{\Lambda}\boldsymbol{\varepsilon}'(0)\boldsymbol{\varepsilon}'^T(0)\right] = \frac{1}{4\mu}\boldsymbol{\varepsilon}'^T(0)\boldsymbol{\varepsilon}'(0). \qquad (1.67)$$
Finally, again assuming that $\boldsymbol{\varepsilon}'(0)$ is a random vector with components each having a variance of $g^2$, we obtain the average excess error energy as

$$E[\alpha] = \frac{1}{4\mu}E\left[\boldsymbol{\varepsilon}'^T(0)\boldsymbol{\varepsilon}'(0)\right] = \frac{Kg^2}{4\mu}. \qquad (1.68)$$
Notice that the average excess error energy is once again independent of the eigenvalues of $\mathbf{R}$ and is identical to Eq. (1.64) for Newton's method. The average convergence time for steepest descent is therefore identical to the average convergence time for Newton's method, given that both algorithms adapt with the same value of $\mu$.
1.7.2
In practice, the true MSE gradient is generally unknown, and the LMS algorithm is
used to provide an estimate of the gradient based on the input u and the desired
output d. The weight vector is now stochastic and cannot be pulled out of the
expectations in the MSE expression.
Furthermore, gradient estimation results in gradient noise that prevents the MSE from converging to $\xi_{\min}$, as it does in the exact gradient case. Instead, the MSE, averaged over an ensemble of learning curves, now converges to $\xi_{\text{fin}} = \xi_{\min} + \xi_{\text{excess}}$, where $\xi_{\text{excess}}$ is the excess MSE due to gradient noise. This is illustrated in Figure 1.7, where the excess error energy is now the area below the transient MSE curve and above $\xi_{\text{fin}}$. It is useful to note that the misadjustment, in steady flow, after adaptive transients have died out, is given by

$$M = \frac{\xi_{\text{excess}}}{\xi_{\min}}. \qquad (1.69)$$
In order to derive expressions for the average excess error energy, we will use an approach similar to [52]. Let $e_o(n)$ be the error when the optimal weight vector $\mathbf{w}_o = \mathbf{R}^{-1}\mathbf{p}$ is used. Then the MSE at a particular iteration $n$ can be expressed as

$$\xi(n) = E[e^2(n)] = E\left[(e_o(n) + (e(n) - e_o(n)))^2\right] = E\left[e_o^2(n) + 2e_o(n)(e(n) - e_o(n)) + (e(n) - e_o(n))^2\right]. \qquad (1.70)$$

Figure 1.7 Sample learning curve with gradient noise. The shaded area represents the excess error energy.
We can now examine the three terms in Eq. (1.70) separately. By definition, $E[e_o^2(n)] \triangleq \xi_{\min}$. Also,

$$E[e_o(n)(e(n) - e_o(n))] = E\left[e_o(n)\left(d(n) - \mathbf{w}^T(n)\mathbf{u}(n) - d(n) + \mathbf{w}_o^T\mathbf{u}(n)\right)\right] = -E\left[e_o(n)\mathbf{u}^T(n)\boldsymbol{\varepsilon}(n)\right], \qquad (1.71)$$

and, using the independence of $\boldsymbol{\varepsilon}(n)$ from the current data together with the orthogonality $E[e_o(n)\mathbf{u}(n)] = \mathbf{0}$,

$$E[e_o(n)(e(n) - e_o(n))] = -\mathbf{p}^TE[\boldsymbol{\varepsilon}(n)] + \mathbf{w}_o^TE[\mathbf{u}(n)\mathbf{u}^T(n)]E[\boldsymbol{\varepsilon}(n)] = 0. \qquad (1.72)$$

For the third term,

$$E\left[(e(n) - e_o(n))^2\right] = E\left[(\boldsymbol{\varepsilon}^T(n)\mathbf{u}(n))^2\right] = E\left\{\mathrm{Tr}\left[\boldsymbol{\varepsilon}^T(n)\mathbf{u}(n)\boldsymbol{\varepsilon}^T(n)\mathbf{u}(n)\right]\right\} = \mathrm{Tr}\left\{\mathbf{R}E[\boldsymbol{\varepsilon}(n)\boldsymbol{\varepsilon}^T(n)]\right\} = \mathrm{Tr}[\boldsymbol{\Lambda}\mathbf{F}(n)], \qquad (1.73)$$

where

$$\mathbf{F}(n) \triangleq E\left[\boldsymbol{\varepsilon}'(n)\boldsymbol{\varepsilon}'^T(n)\right]. \qquad (1.74)$$
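The orthogonality $E[e_o(n)\mathbf{u}(n)] = \mathbf{0}$ invoked in Eq. (1.72) is easy to verify numerically: the Wiener error is uncorrelated with every tap input by construction. A sketch with illustrative plant weights, input correlation, and noise level:

```python
import numpy as np

rng = np.random.default_rng(4)
K, N = 3, 200_000
A = np.linalg.cholesky(np.array([[1.0, 0.3, 0.1],
                                 [0.3, 1.0, 0.3],
                                 [0.1, 0.3, 1.0]]))
U = rng.standard_normal((N, K)) @ A.T          # correlated inputs
d = U @ np.array([0.7, -0.4, 0.2]) + 0.3 * rng.standard_normal(N)

R_hat = U.T @ U / N
p_hat = U.T @ d / N
w_o = np.linalg.solve(R_hat, p_hat)            # (sample) Wiener solution
e_o = d - U @ w_o                              # optimum error sequence e_o(n)
corr = U.T @ e_o / N                           # sample estimate of E[e_o(n) u(n)]
```

The cross-correlation vector `corr` vanishes to numerical precision, and the residual power approaches the plant-noise variance $\xi_{\min}$.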
Substituting Eqs. (1.72) and (1.73) back into Eq. (1.70), we obtain

$$\xi(n) = \xi_{\min} + \mathrm{Tr}[\boldsymbol{\Lambda}\mathbf{F}(n)], \qquad (1.75)$$

and consequently,

$$E[\alpha] = \sum_{n=0}^{\infty}\mathrm{Tr}[\boldsymbol{\Lambda}\mathbf{F}(n)]. \qquad (1.76)$$

Thus, we need to examine the evolution of $\mathrm{Tr}[\boldsymbol{\Lambda}\mathbf{F}(n)]$ with $n$ for LMS/Newton and LMS.
In rotated coordinates, the LMS/Newton weight-error update driven by the instantaneous gradient is

$$\boldsymbol{\varepsilon}'(n+1) = \boldsymbol{\varepsilon}'(n) + 2\mu\lambda_{\text{ave}}e(n)\boldsymbol{\Lambda}^{-1}\mathbf{u}'(n), \qquad (1.77)$$

where $\mathbf{u}'(n) = \mathbf{Q}^T\mathbf{u}(n)$.
Forming the outer product of Eq. (1.77) with itself and taking expectations gives an expansion (Eq. (1.78)) whose cross terms, after adding and subtracting terms involving $\mathbf{w}_o$, cancel by virtue of the orthogonality $E[e_o(n)\mathbf{u}(n)] = \mathbf{0}$ (with $\mathbf{p}' = \mathbf{Q}^T\mathbf{p}$) and the small step-size independence assumptions, leaving

$$E\left[\boldsymbol{\varepsilon}'(n+1)\boldsymbol{\varepsilon}'^T(n+1)\right] \approx (1 - 4\mu\lambda_{\text{ave}})E\left[\boldsymbol{\varepsilon}'(n)\boldsymbol{\varepsilon}'^T(n)\right]. \qquad (1.79)$$

That is,

$$\mathbf{F}(n+1) = (1 - 4\mu\lambda_{\text{ave}})\mathbf{F}(n), \qquad (1.80)$$

and, iterating from the initial condition,

$$\mathbf{F}(n) = (1 - 4\mu\lambda_{\text{ave}})^n\mathbf{F}(0), \qquad (1.81)$$

so that
$$E[\alpha] = \sum_{n=0}^{\infty}\mathrm{Tr}[\boldsymbol{\Lambda}\mathbf{F}(n)] = \frac{1}{4\mu\lambda_{\text{ave}}}\mathrm{Tr}[\boldsymbol{\Lambda}\mathbf{F}(0)]. \qquad (1.82)$$
At this stage, define the $\mathrm{diag}(\mathbf{A})$ operator to return a column vector containing the main diagonal of a square matrix $\mathbf{A}$. Using this operator, we note that

$$\mathrm{Tr}[\mathbf{A}] = \mathbf{1}^T\mathrm{diag}(\mathbf{A}). \qquad (1.83)$$

Assuming, as before, that $\boldsymbol{\varepsilon}'(0)$ has components of variance $g^2$, so that $\mathbf{F}(0) = g^2\mathbf{I}$,

$$\mathrm{Tr}[\boldsymbol{\Lambda}\mathbf{F}(0)] = g^2\mathrm{Tr}[\boldsymbol{\Lambda}]. \qquad (1.84)$$

Substituting Eq. (1.84) back into Eq. (1.82), we finally obtain

$$E[\alpha] = \frac{g^2}{4\mu\lambda_{\text{ave}}}\mathrm{Tr}[\boldsymbol{\Lambda}] = \frac{Kg^2}{4\mu}. \qquad (1.85)$$
LMS

For LMS, the rotated weight-error update is

$$\boldsymbol{\varepsilon}'(n+1) = \boldsymbol{\varepsilon}'(n) + 2\mu e(n)\mathbf{u}'(n). \qquad (1.86)$$

Proceeding exactly as before (Eqs. (1.87)-(1.88)), the outer-product expectation now yields

$$\mathbf{F}(n+1) = \mathbf{F}(n) - 2\mu\boldsymbol{\Lambda}\mathbf{F}(n) - 2\mu\mathbf{F}(n)\boldsymbol{\Lambda}. \qquad (1.89)$$

Taking the main diagonal of both sides and iterating,

$$\mathrm{diag}[\boldsymbol{\Lambda}\mathbf{F}(n)] = (\mathbf{I} - 4\mu\boldsymbol{\Lambda})^n\mathrm{diag}[\boldsymbol{\Lambda}\mathbf{F}(0)]. \qquad (1.90)$$
Thus,

$$E[\alpha] = \sum_{n=0}^{\infty}\mathbf{1}^T\mathrm{diag}[\boldsymbol{\Lambda}\mathbf{F}(n)] = \mathbf{1}^T\sum_{n=0}^{\infty}(\mathbf{I} - 4\mu\boldsymbol{\Lambda})^n\mathrm{diag}[\boldsymbol{\Lambda}\mathbf{F}(0)] = \mathbf{1}^T(4\mu\boldsymbol{\Lambda})^{-1}\mathrm{diag}[\boldsymbol{\Lambda}\mathbf{F}(0)]. \qquad (1.91)$$
Once again assuming that $\boldsymbol{\varepsilon}'(0)$ is a random vector with components each having a variance of $g^2$, we finally obtain

$$E[\alpha] = \mathbf{1}^T(4\mu\boldsymbol{\Lambda})^{-1}\mathrm{diag}\left\{\boldsymbol{\Lambda}E\left[\boldsymbol{\varepsilon}'(0)\boldsymbol{\varepsilon}'^T(0)\right]\right\} = \mathbf{1}^T(4\mu\boldsymbol{\Lambda})^{-1}\mathrm{diag}[\boldsymbol{\Lambda}g^2\mathbf{I}] = \frac{Kg^2}{4\mu}. \qquad (1.92)$$

Note that this result is identical to Eq. (1.68) and to the result for LMS/Newton. This means that the average excess error energy is the same for LMS/Newton and LMS with random initial conditions.
1.7.3
It is important to point out that the preceding results do not imply that the excess MSE curves for LMS/Newton and LMS are identical when averaged over starting conditions. In fact, this is completely false, and it is important to illustrate why. Starting with Eq. (1.81) and making the same assumptions about $\boldsymbol{\varepsilon}'(0)$ as before, the excess MSE for LMS/Newton is

$$\beta_{\text{Newton}}(n) = \mathrm{Tr}[\boldsymbol{\Lambda}\mathbf{F}(n)] = (1 - 4\mu\lambda_{\text{ave}})^n\mathrm{Tr}[\boldsymbol{\Lambda}\mathbf{F}(0)] = \mathbf{1}^T(1 - 4\mu\lambda_{\text{ave}})^n\mathrm{diag}[\boldsymbol{\Lambda}]g^2 = \sum_{i=1}^{K}(1 - 4\mu\lambda_{\text{ave}})^ng^2\lambda_i. \qquad (1.93)$$
Similarly, we can use Eq. (1.90) to derive the equation for the excess MSE of LMS:

β_SD(n) = 1ᵀ diag(ΛF(n))
 = 1ᵀ(I − 4μΛ)ⁿ diag(ΛF(0))
 = 1ᵀ(I − 4μΛ)ⁿ diag(Λ)γ²  (1.94)
 = Σ_{i=1}^{K} (1 − 4μλ_i)ⁿ γ²λ_i.
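Equations (1.93) and (1.94) are easy to evaluate numerically. The following hedged sketch, with hypothetical eigenvalues and step size, shows that the two excess-MSE curves differ point by point while their areas (the excess error energies) coincide:

```python
import numpy as np

# Hedged sketch of Eqs. (1.93)-(1.94): the excess-MSE curves of LMS/Newton and
# LMS differ point by point, yet their areas (excess error energies) coincide.
# Eigenvalues, step size, and gamma^2 below are hypothetical.
lam = np.array([0.1, 0.4, 1.0, 2.5])
mu, gamma2 = 0.01, 1.0
lam_ave = lam.mean()
n = np.arange(50000)

# Eq. (1.93): single geometric decay governed by lam_ave
beta_newton = gamma2 * lam.sum() * (1 - 4 * mu * lam_ave) ** n
# Eq. (1.94): a mixture of modes, one per eigenvalue
beta_sd = gamma2 * (lam[:, None] * (1 - 4 * mu * lam[:, None]) ** n).sum(axis=0)

print(beta_newton.sum(), beta_sd.sum())   # both equal K*gamma^2/(4*mu) = 100
```

The equality of the two sums is exactly the statement that the average excess error energies agree, even though the curves themselves do not.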
Discussion
On the surface, it would seem that Section 1.7.3 contradicts the results that immediately precede it. On the one hand, the average excess error energies for LMS/Newton and LMS are the same. On the other hand, the excess MSE curves for the two algorithms are not the same. How can both of these assertions be true? More critically, if β_SD(n) can be smaller than β_Newton(n) for the same n, does it imply that LMS is actually a superior algorithm to LMS/Newton?

The answer lies in ascertaining the exact method of comparison between algorithms. A common mistake in comparing speed of convergence between two algorithms is to plot two sample excess MSE curves and claim that one algorithm is
superior to the other because its initial rate of convergence is faster for a specific starting weight vector. The inherent fallacy in such an approach is that the results may not hold for other starting weight vectors. In fact, the results will often be different, depending upon whether we compare worst-case, best-case, or average convergence. But even when we average over some reasonable set of starting weight vectors, it is not enough to look only at the initial rate of convergence. Even if the initial rate of convergence of LMS is faster than that of LMS/Newton (meaning that β_SD(n) is smaller than β_Newton(n) for small n), the fact that the average excess error energy of the two algorithms is the same implies that the final rate of convergence of LMS is slower than that of LMS/Newton (meaning that β_SD(n) is larger than β_Newton(n) for large n). Therefore, we cannot compare rates of convergence via a direct comparison of excess MSE curves unless we also specify that we are only interested in convergence to within a certain percentage of the final MSE. For example, direct comparison of two average excess MSE curves might reveal that, for a particular eigenvalue spread, LMS converges to within 50 percent of the final MSE faster than LMS/Newton, but the result may be reversed if we compare convergence to within 5 percent of the final MSE. Unfortunately, the exact excess MSE at which we can state that the algorithm has converged is usually problem-dependent. On the other hand, the elegance of the excess error energy metric is that it removes this constraint and thus makes the analysis problem-independent.
1.8 OVERVIEW
Using the same value of μ for both LMS/Newton and LMS ensures that the steady-state performance of both algorithms, after transients die out, will be statistically equivalent in terms of misadjustment. Further, with nonstationary inputs that cause the Wiener solution to be first-order Markov, the steady-state performance of LMS is equivalent (in terms of the misadjustment) to that derived for LMS/Newton, despite the spread in eigenvalues. Further yet, the average transient performance of LMS is equivalent (in terms of the average excess error energy) to that derived for LMS/Newton, despite the spread in eigenvalues. It is intuitively reasonable that since the average transient performances of LMS and LMS/Newton are the same, their average steady-state performances are also the same with certain nonstationary inputs. Transient decay with Newton's method is purely geometric (discrete exponential) with the single time constant τ_MSE = 1/(4μλ_ave). With Newton's method, the rate of convergence is not dependent on initial conditions, as it is with the method of steepest descent. Under worst-case conditions, adapting from a least-favorable set of initial conditions, the time constant of the steepest descent algorithm is τ_MSE = 1/(4μλ_min). With most-favorable initial conditions, this time constant is τ_MSE = 1/(4μλ_max). Therefore, with a large eigenvalue spread, it is possible that the steepest descent method could cause faster convergence in some cases, and slower convergence in others, than Newton's method. On average, starting with random initial conditions, they converge at effectively the same rate in the sense that transient convergence time is proportional to excess error energy.
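The time constants quoted above can be tabulated for an illustrative eigenvalue spread (all numbers below are hypothetical):

```python
# Hedged sketch of the time constants quoted above, for a hypothetical
# eigenvalue spread; tau_MSE = 1/(4*mu*lambda) in each case.
mu = 0.01
lam = [0.1, 0.4, 1.0, 2.5]
lam_ave = sum(lam) / len(lam)

tau_newton = 1 / (4 * mu * lam_ave)   # Newton's method: single time constant
tau_worst = 1 / (4 * mu * min(lam))   # steepest descent, least-favorable start
tau_best = 1 / (4 * mu * max(lam))    # steepest descent, most-favorable start
print(tau_newton, tau_best, tau_worst)
```

For this spread, steepest descent ranges from 2.5 times faster to 10 times slower than Newton's method, depending on the starting point.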
1.9 CONCLUSION
An adaptive algorithm is like an engine whose fuel is input data. Two algorithms adapting the same number of weights and operating with the same misadjustment can be compared in terms of their consumption of data. The more efficient algorithm consumes less data, that is, converges faster. On this basis, the LMS/Newton algorithm has the highest statistical efficiency that can be obtained. The LMS/Newton algorithm therefore can serve as a benchmark for statistical efficiency against which all other algorithms can be compared.
The role played by LMS/Newton in adaptive systems is analogous to that played
by the Carnot engine in thermodynamics. Neither one exists physically. But their
performances limit the performances of all practical systems, adaptive and
thermodynamic, respectively.
The LMS/Newton algorithm uses learning data most efficiently. No other learning algorithm can be more efficient. The LMS algorithm performs equivalently, on average, to LMS/Newton in nonstationary environments and under transient conditions.
Figure 1.8 Illustration of Newton's method versus steepest descent: (a) Newton's method, (b) steepest descent. The Wiener solution is indicated by . The three initial conditions are indicated by W.
Figure 1.9
REFERENCES
1. B. Widrow and M. E. Hoff, "Adaptive switching circuits," IRE WESCON Conv. Rec., vol. 4, pp. 96-104, Aug. 1960.
2. B. Widrow, "Adaptive filters," in Aspects of Network and System Theory, R. Kalman and N. DeClaris, eds., pp. 563-587, Holt, Rinehart and Winston, New York, 1971.
3. B. Widrow, P. Mantey, L. Griffiths, and B. Goode, "Adaptive antenna systems," Proc. IEEE, vol. 55, no. 12, pp. 2143-2159, Dec. 1967.
TRAVELING-WAVE MODEL
OF LONG LMS FILTERS
HANS J. BUTTERWECK
Eindhoven University of Technology
2.1 INTRODUCTION
In this section some characteristic properties of the long LMS filter are surveyed, particularly those that distinguish it from its short and medium-length counterparts.
f(n) = v(n)u(n),  (2.1)
2.2.1
During the adaptive process the difference ν(n) between the weight vector w(n) and the weight vector h of the reference filter is so large that the additive noise can be neglected, f(n) ≈ 0. Then (2.1) passes into the homogeneous update equation

ν(n + 1) = [I − 2μ u(n)uᵗ(n)] ν(n)  (2.2)
for the weight error ν(n). For sufficiently small step-sizes μ the variations of the weight error are much slower than those of the input signal, so that u(n)uᵗ(n) can be replaced with its time average. For an ergodic input signal this equals the ensemble average, so that (2.2) passes into (direct averaging [21])

ν(n + 1) = (I − 2μR)ν(n), μ → 0,  (2.3)
where R = E{u(n)uᵗ(n)} denotes the input correlation matrix. Using its eigenvalues λ_i and eigenvectors q_i such that

R = Σ_{i=1}^{M} λ_i q_i q_iᵗ,  (2.4)

we can rewrite (2.3) in normal coordinates and arrive at the difference equation

q_iᵗ ν(n + 1) = (1 − 2μλ_i) q_iᵗ ν(n), i = 1, …, M, μ → 0,  (2.5)

with the exponentially decaying modal solution

q_iᵗ ν(n) = (1 − 2μλ_i)ⁿ q_iᵗ ν(0).  (2.6)

Written out componentwise, (2.3) takes the form

ν_i(n + 1) = ν_i(n) − 2μ Σ_j U_{i−j} ν_j(n).  (2.7)
Thus, the new weight error at position i equals the previous weight error at i minus a weighted sum of neighboring previous weight errors, with the input correlation U_i = E{u_l(n)u_{l+i}(n)} as the weighting function. In the limiting case M → ∞ we read (2.7) as a partial difference equation with the two independent variables n (time) and i (position). Applying a spatial Fourier transform to (2.7) yields

F_space{ν_i(n + 1)} = [1 − F_space{2μU_i}] F_space{ν_i(n)},  (2.8)

with the solution

F_space{ν_i(n)} = [1 − F_space{2μU_i}]ⁿ F_space{ν_i(0)}.  (2.9)
Thus, the spatial Fourier transform of the weight error dies out exponentially, with a decay factor depending on the spatial frequency under consideration. The time constant n₀ is determined by [1 − F_space{2μU_i}]^{n₀} = e⁻¹, which in the limit μ → 0 passes into

n₀ = [F_space{2μU_i}]⁻¹.  (2.10)
Observing that F_space{U_i} equals the input power spectral density (where the temporal frequency is replaced with the spatial frequency), we arrive at the following conclusions:

- Spatial frequencies with high (low) spectral density are associated with fast (slow) decay.
- If certain frequencies are not excited (nonpersistent excitation), there is no decay at all.
- Small (large) step-sizes imply slow (fast) decay.
- No eigenanalysis of the input correlation matrix is involved.
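These conclusions can be illustrated numerically. The hedged sketch below evaluates the per-spatial-frequency time constant of Eq. (2.10) for a hypothetical exponentially decaying correlation sequence U_i:

```python
import numpy as np

# Hedged sketch of Eq. (2.10): the decay time constant n0 of each spatial
# frequency equals the inverse of F_space{2*mu*U_i}. The correlation sequence
# U_i and the step size are hypothetical.
mu = 0.001
i = np.arange(-20, 21)
U = 0.5 ** np.abs(i)                      # low-pass input: U_i decays with |i|
k = np.linspace(-np.pi, np.pi, 512)       # spatial frequency grid
# F_space{2*mu*U_i}: 2*mu times the input PSD with temporal frequency
# replaced by spatial frequency
R = 2 * mu * (U[None, :] * np.exp(-1j * np.outer(k, i))).sum(axis=1).real
n0 = 1.0 / R                              # Eq. (2.10)

# High spectral density (k near 0 here) -> fast decay; low density -> slow:
print(n0[len(k) // 2], n0[0])
```

For this low-pass input, the weight-error components near spatial frequency zero decay an order of magnitude faster than those near k = ±π, as the first conclusion predicts.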
2.2.2
In the steady state, after completion of the adaptive process, the long adaptive filter exhibits a still more noticeable behavior. This concerns particularly the weight-error correlations, which obey simple rules.

In contrast to the adaptive process, the additive noise cannot be neglected in the steady state. Thus we now have to solve (2.1), which, again under the assumption of a small step-size, can be approximated as

ν(n + 1) = (I − 2μR)ν(n) + 2μf(n), μ → 0.  (2.11)
Its solution can be determined with the aid of the representation (2.4) of the correlation matrix, again requiring the evaluation of the eigenvalues and eigenvectors
of the correlation matrix.
(2.13)

E{ν_i(n)ν_{i+ε}(n)} = μ E{v(n)v(n + ε)}.  (2.14)
Thus the correlation between two weight errors, ε taps apart, equals μ times the correlation between the noise at two instants, a distance ε apart. This result is remarkable in various respects. First, the correlations of the slow weight fluctuations are directly related to those of the fast noise fluctuations. Second, the input signal u_i(n) has no influence on the weight-error correlations. Neither its amplitude nor its spectral distribution enters (2.14). One should not wonder, then, that even the assumption of a stationary, stochastic character for the input signal can be abandoned, so that (2.14) holds true also for a limited class of deterministic inputs.
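The rule is easy to confirm by simulation. The following hedged Monte Carlo sketch checks the ε = 0 case, E{ν_i(n)²} ≈ μE{v²(n)}, for a white input and white noise; the filter length, step size, and run length are all hypothetical:

```python
import numpy as np

# Hedged Monte Carlo sketch of the epsilon = 0 case of the rule above:
# in the steady state, E{nu_i(n)^2} should approach mu * E{v(n)^2}.
# White input and white noise with unit power; parameters hypothetical.
rng = np.random.default_rng(0)
M, mu, n_iter = 32, 0.002, 120000
w = np.zeros(M)                     # weight error nu (reference filter h = 0)
u = np.zeros(M)                     # tapped-delay-line contents
acc = 0.0
count = 0
for n in range(n_iter):
    u = np.roll(u, 1)               # shift the delay line
    u[0] = rng.standard_normal()    # white input u(n)
    v = rng.standard_normal()       # white additive noise v(n)
    e = v - w @ u                   # error signal
    w += 2 * mu * e * u             # LMS update
    if n > n_iter // 2:             # collect statistics after convergence
        acc += (w ** 2).mean()
        count += 1

msq = acc / count
print(msq, mu)                      # mean-square weight error vs. mu * E{v^2}
```

For these settings the time-averaged E{ν_i²} comes out close to μ = 0.002, independently of any detail of the input statistics beyond its power.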
2.2.3 Stability
The last issue for which the long filter provides meaningful statements is stability. For a given general filter, short or long, let the statistics of the input signal be known (of course, the noise signal has no influence on stability). Then for a sufficiently large step-size μ > μ₁ instabilities occur, whereas for a sufficiently small step-size μ < μ₂ the filter remains stable. But there is a rather broad gray zone, μ₂ < μ < μ₁, where no stability statements are available. There the filter can be stable or unstable, and if it has been stable during a long period of observation, there is no guarantee that it will remain stable in the future.

Apparently, μ < μ₁ is a necessary stability condition, while μ < μ₂ is sufficient for stability. An example of the first type is provided by studying the approximate updating rule (2.3) and its modal solution (2.6). Clearly, all the exponential solutions (2.6) decay if [20]

μ < λ_max⁻¹,  (2.15)
so that already the simplified updating mechanism (2.3) (which, in due course, will serve as a starting point for an iterative solution) will be unstable if (2.15) is not satisfied. But that bound is far too optimistic. As can be concluded from our stability condition (2.16) for the long filter (containing the factor 1/M), μ must be substantially smaller than λ_max⁻¹.

For the long filter we derive in Section 2.7, Eq. (2.59), the necessary stability condition

μ < 1 / [M P_u(e^{jΩ})_max],  (2.16)

where M again denotes the filter length and P_u(e^{jΩ}) stands for the input power spectral density. Clearly, for a given P_u(e^{jΩ}) and thus a given P_u(e^{jΩ})_max, the maximum μ decreases with increasing filter length.
The right-hand side of (2.16) plays the role of μ₁. For μ > μ₁ the filter can be shown to become unstable, because then our iteration procedure, to be discussed later, diverges. But there are good reasons to suppose that (2.16) is not only necessary for stability but also sufficient. Numerous experiments support this conjecture. Then the long filter would be the only one without a gray μ-zone in which no statement about stability can be made.
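A short hedged sketch of the bound (2.16), using a hypothetical AR(1) input spectrum:

```python
import numpy as np

# Hedged sketch of the stability bound (2.16) for a hypothetical AR(1) input
# with P_u(e^{jOmega}) = sigma2 / (1 - 2*a*cos(Omega) + a^2).
def mu_max(M, a=0.7, sigma2=1.0, n_grid=4096):
    Omega = np.linspace(-np.pi, np.pi, n_grid)
    P_u = sigma2 / (1 - 2 * a * np.cos(Omega) + a * a)
    return 1.0 / (M * P_u.max())    # Eq. (2.16)

# The admissible step size shrinks as the filter length grows:
print(mu_max(64), mu_max(1024))
```

For this spectrum, P_u peaks at Ω = 0 with value 1/(1 − a)², so the bound scales as (1 − a)²/M, quantifying how strong input coloring and a long delay line each tighten the step-size budget.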
2.3
In this section, basic elements for a theory of the long LMS adaptive filter are developed. Emphasis is put on the weight fluctuations, particularly (1) their natural behavior during the adaptive process, (2) their forced steady-state behavior after the adaptive process, and (3) their possibly unlimited growth due to instability. The output signal and the error signal, including the concept of misadjustment, are viewed here as secondary quantities, closely related to and derivable from the weight fluctuations.
Under study is the question of whether the adaptive filter behaves in a characteristic and possibly simple manner in the limit M → ∞, that is, for an infinite length of the tapped-delay line. Such questions play an important role in numerous other structures exhibiting translational symmetry, such as cascades of equal two-ports, transmission lines, and antenna arrays. From a practical point of view, one need not necessarily study infinitely long structures. One can also formulate statements about long but finite arrangements; for these, local modifications have to be developed in the vicinity of the line endings.

The question formulated above has an affirmative answer: All infinitely long symmetrical structures are distinguished by characteristic, simple, and occasionally surprising behavior, and this is particularly true for the LMS adaptive filter. Common to such infinite structures is the occurrence of traveling waves, absorbed in sinks at infinity. On the long but finite-length line, the necessary modifications at the terminations then are referred to as reflections.
Our wave approach is characterized by a number of peculiarities. First, the vectors ν(n), u(n), f(n) are written in component form ν_i(n), u_i(n), f_i(n), where the space coordinate i denotes the tap number on the delay line. The common notation 1 ≤ i ≤ M for the finite-length filter is now replaced by −∞ < i < ∞ for the infinite line, and the updating rule

ν_i(n + 1) = ν_i(n) − 2μ u_i(n) Σ_j u_j(n)ν_j(n) + 2μ f_i(n)
           = ν_i(n) − 2μ Σ_j u_i(n)u_{i−j}(n)ν_{i−j}(n) + 2μ f_i(n),  (2.17)

into which the vector difference equation (2.1) passes, is now read as a partial difference equation with the two independent variables n (time) and i (position). The tapped-delay mechanism finds expression in the basic input relation

u_i(n) = u(n − i + 1),  (2.18)
where u(n) denotes the input signal of the delay line. Unfortunately, in our notation −∞ < i < ∞ nonpositive i values (−∞ < i ≤ 0) imply unrealizable negative delays. Since, however, our wave theory deals only with delay differences (occurring in correlation expressions), a huge imaginary dummy delay can be added in (2.18) without affecting any further results but now satisfying physical causality requirements.
A further peculiarity of our wave approach is that special weight distributions propagate as wave modes in either direction to imaginary sinks at i = −∞ and i = +∞. For the limiting case of a vanishing step-size μ → 0 these have the form of complex exponentials; using spatial Fourier transformations, more general weight distributions can be decomposed into such wave modes.

What distinguishes wave theory from the classical approach is the shift invariance or stationarity in time and space. Stationarity in time, already an ingredient of classical adaptive theory, states equivalence of all time instants in the sense that any probability and any correlation depend only on distances in time. What is new is spatial stationarity, stating that, moreover, any probability and any correlation depend only on distances in space.
Requiring temporal and spatial stationarity for the external signals u_i(n), f_i(n), we have to assume that

U^ε(δ) = E{u_i(n)u_{i+ε}(n + δ)},  F^ε(δ) = E{f_i(n)f_{i+ε}(n + δ)},  (2.19)

that is, that the correlations are independent of time n and position i. For the tapped-delay line satisfying (2.18), spatial shift invariance follows from the temporal shift invariance of the input signal (Eqs. (2.20), (2.21)); in particular, the weight-error correlation

A^ε(δ) = E{ν_i(n)ν_{i+ε}(n + δ)}  (2.22)

is independent of i and n.
Thus the weight-error correlations depend only on the time shift δ (which in due course will be set to zero) and the space shift ε. For a finite-length line the latter is not true in the vicinity of the terminations. The well-known weight-error correlation matrix K = E{ν(n)νᵗ(n)} then assumes an almost Toeplitz form with local aberrations in the vicinity of the matrix borders.
Finally, we use an iterational technique to solve the updating equation (2.17). This technique has been developed for the classical vectorial treatment of adaptive filtering [7] but is also applicable to our scalar wave approach. It reads as

ν_i(n) = α_i(n) + β_i(n) + γ_i(n) + ⋯,  (2.23)
where α_i(n) represents the zeroth-order solution of (2.17) for the limiting case μ → 0, and β_i(n), γ_i(n), … are higher-order corrections for μ > 0. At first glance, (2.23) suggests a Taylor expansion of the weight-error distribution in terms of μ. However, the situation turns out to be slightly more complicated: Ultimately we find α_i(n) = Σ_{l=0}^{∞} a_l μ^l = O(1), β_i(n) = Σ_{l=1}^{∞} b_l μ^l = O(μ), and so on, so that α_i(n) has a Taylor expansion beginning with μ⁰, that of β_i(n) begins with μ¹, and so on.
For μ → 0 the time variations of ν_i(n) are slow compared with those of the factor u_i(n)u_{i−j}(n) in (2.17), so that the latter can be replaced with its (time or ensemble) average (direct averaging [21]):

α_i(n + 1) = α_i(n) − 2μ Σ_j U_{i−j} α_j(n) + 2μ f_i(n),  (2.24)

that is, written as a spatial convolution,

α_i(n + 1) = α_i(n) − 2μ U_i ∗ α_i(n) + 2μ f_i(n).  (2.25)
γ_i(n + 1) = γ_i(n) − 2μ U_i ∗ γ_i(n) − 2μ Σ_j P_{i,i−j}(n) β_{i−j}(n),  (2.27)

and so on, the sum (2.23) satisfies the p.d.e. (2.17), provided that the iteration converges. Here

P_{i,i−j}(n) = u_i(n) u_{i−j}(n) − U_j.  (2.28)
2.4
Based on an iteration procedure, we learned in the previous section that the update equation (2.17) of the LMS algorithm is equivalent to the set of equations (2.25), and so on. The zeroth-order solution α_i(n) is determined by f_i(n) (cf. (2.25)), whereupon the first-order correction β_i(n) follows from α_i(n) (cf. (2.26)), the second-order correction γ_i(n) follows from β_i(n) (cf. (2.27)), and so on. Thus we proceed according to the scheme f_i(n) → α_i(n) → β_i(n) → γ_i(n) → ⋯, where for sufficiently small μ the terms in the chain decrease to any wanted degree.

This procedure is attractive in that it transforms the difference equation (2.17) with stochastically time-varying parameters into a set of constant-coefficient linear difference equations (2.25), and so on, now with the stochastic excitations f_i(n), Σ_j P_{i,i−j}(n)α_{i−j}(n), Σ_j P_{i,i−j}(n)β_{i−j}(n), and so on. Thus the original problem is reduced to a study of the passage of stationary stochastic signals through a linear time-space-invariant system. Observe that the same operator L{·} applies in all steps of the above scheme:
α_i(n) = L{2μ f_i(n)},
β_i(n) = L{−2μ Σ_j P_{i,i−j}(n) α_{i−j}(n)},
γ_i(n) = L{−2μ Σ_j P_{i,i−j}(n) β_{i−j}(n)},

and so on. Viewed in the time domain, this operator has the character of a low-pass filter with a vanishing cutoff frequency for μ → 0 (cf. (2.38)).
In this section we study the partial difference equation (2.25) for the zeroth-order solution α_i(n), in which the stochastic character of u_i(n) has been removed through ensemble averaging of u_i(n)u_{i−j}(n). The result is a constant-coefficient linear partial difference equation, whose solution can be written as the double (time-space) convolution

α_i(n) = L{2μ f_i(n)} = Σ_{j=−∞}^{∞} Σ_{l=−∞}^{∞} h_{i−j}(n − l) 2μ f_j(l),  (2.29)

or, in shorthand,

α_i(n) = h_i(n) ∗∗ 2μ f_i(n),  (2.30)

where the impulse response h_i(n) satisfies

h_i(n + 1) = h_i(n) − 2μ U_i ∗ h_i(n) + δ_i δ(n),  (2.31)

h_i(n) = 0 for n < 0,  (2.32)

h_{−i}(n) = h_i(n).  (2.33)
The first condition reflects causality; the second follows from symmetry with respect to the origin i = 0 (left and right are equivalent). With (2.32) and (2.33) we can solve (2.31) stepwise: h_i(0) = 0; h_i(1) = δ_i; h_i(2) = δ_i − 2μU_i; h_i(3) = (δ_i − 2μU_i) ∗ (δ_i − 2μU_i); and, generally,

h_i(n) = (δ_i − 2μU_i) ∗ ⋯ (n − 1 terms) ⋯ ∗ (δ_i − 2μU_i).  (2.34)

Thus, with increasing time n, the impulse response gradually spreads over the whole line and is ultimately absorbed at i = ±∞. In this sense we can talk of a wave propagating to infinity.
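The spreading of the impulse response can be visualized by iterating Eq. (2.34) directly; the correlation sequence U_i and step size below are hypothetical:

```python
import numpy as np

# Hedged sketch of Eq. (2.34): h_i(n) is the (n-1)-fold spatial convolution of
# (delta_i - 2*mu*U_i) with itself, so the response spreads over ever more taps.
# The correlation sequence U_i and the step size are hypothetical.
mu = 0.05
U = np.array([0.25, 0.5, 1.0, 0.5, 0.25])   # U_i for i = -2..2 (symmetric)
kernel = -2 * mu * U
kernel[2] += 1.0                            # delta_i - 2*mu*U_i

h = np.array([1.0])                         # h_i(1) = delta_i
widths = []
for n in range(2, 40):
    h = np.convolve(h, kernel)              # h_i(n) from h_i(n-1), Eq. (2.34)
    i = np.arange(len(h)) - (len(h) - 1) / 2
    widths.append(np.sqrt((i**2 * h**2).sum() / (h**2).sum()))

# The rms width of the impulse response grows with n:
print(widths[0], widths[-1])
```

The rms width used here anticipates the "radius of inertia" of Section 2.8; it grows steadily with n, which is the discrete picture of a wave propagating outward.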
While the impulse response represents the operator L{·} in the time/space domain, the system function, as the double Fourier transform of the impulse response, provides a useful frequency-domain equivalent:

H(z, ξ) = Σ_i Σ_n h_i(n) z^{−n} ξ^{−i}.  (2.35)

With the abbreviation

R(ξ) = Σ_i 2μ U_i ξ^{−i} = F_space{2μU_i},  (2.36)

which, by the symmetry of U_i, satisfies

R(ξ) = R(ξ⁻¹),  (2.37)

the transform of the update equation (2.31) yields

H(z, ξ) = 1 / [z − 1 + R(ξ)].  (2.38)
2.5 WEIGHT-ERROR CORRELATIONS
The steady-state weight-error correlation A^ε(δ) = E{α_i(n)α_{i+ε}(n + δ)} now follows from (2.30) as (2.39)

A^ε(δ) = h̃^ε(δ) ∗∗ 4μ² F^ε(δ), with h̃^ε(n) = h_i(n) ∗ h_{i+ε}(n) = F⁻¹_time F⁻¹_space{|H(e^{jΩ}, e^{jk})|²}.  (2.40)

The temporal part of this transform follows from (2.38):

(1/2π) ∫_{−π}^{π} |H(e^{jΩ}, e^{jk})|² dΩ
 = (1/2π) ∫_{−π}^{π} dΩ / [(cos Ω − 1 + R(ξ))² + sin²Ω]
 = (1/2π) ∫_{−π}^{π} dΩ / [2(1 − cos Ω)(1 − R(ξ)) + R²(ξ)]
 = 1 / [R(ξ)(2 − R(ξ))]
 ≈ 1 / (2R(ξ)), μ → 0,  (2.41)

with ξ = e^{jk}.
Combining these results, the steady-state weight-error correlation for zero time shift becomes

A^ε(0) = μ V^ε, V^ε = E{v(n)v(n + ε)}.  (2.42)
This main result, valid for the combination μ → 0, M → ∞, directly relates the spatial weight-error correlation to the temporal noise correlation (although the two signals fluctuate on completely different time scales). Surprisingly enough, the input signal has no influence on the weight correlations; neither its amplitude nor its spectral distribution enters (2.42).

With ε = 0 the mean squared weight error equals the step-size times the noise power: E{α_i(n)²} = μ E{v²(n)}. Further, for white noise the weight fluctuations are uncorrelated. Notice that both results also are valid for a finite-length delay line [4] under white noise; for that case they are also found with the aid of the independence assumption [20, 7]. Why this illegitimate assumption succeeded in the special situation under consideration has been elucidated in [5].
With the aid of (2.42) we can determine the misadjustment, defined [6] as the ratio E{(νᵗ(n)u(n))²}/E{v²(n)} of the powers of the output signal due to the weight fluctuations and of the additive output noise. In our notation and for μ → 0 the first signal reads as Σ_i α_i(n)u(n − i + 1) so that, using (2.42), the numerator in the misadjustment becomes

E{(νᵗ(n)u(n))²} = E{Σ_i Σ_j α_i(n) u(n − i + 1) u(n − j + 1) α_j(n)}
 ≈ Σ_i Σ_j E{α_i(n)α_j(n)} U_{j−i}
 = μ Σ_i Σ_j V^{j−i} U^{j−i}
 = μM Σ_ε V^ε U^ε.

The approximation is justified due to the extremely different time scales on which u(n) and α_i(n) fluctuate. So we arrive at

misadjustment = (μM / E{v²(n)}) Σ_ε E{v(n)v(n + ε)} E{u(n)u(n + ε)},  (2.43)
valid for small step-sizes μ. Due to Parseval's theorem, the sum can be rewritten as the average over the product of the spectra of the input and the noise signal [6]. In [4] it has been shown that (2.43) holds true for lines of any length, but the pertinent proof is far more complicated than that for the long line. Observe that the misadjustment vanishes if the input and the noise spectrum do not overlap.
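Equation (2.43) can be evaluated for illustrative correlation sequences; the hedged sketch below also shows the reduction in misadjustment when the input and noise spectra overlap only weakly (all sequences hypothetical):

```python
import numpy as np

# Hedged sketch of Eq. (2.43): misadjustment = (mu*M/E{v^2}) * sum_e V^e U^e,
# with hypothetical correlation sequences over lags e = -L..L.
def misadjustment(mu, M, U, V):
    return mu * M * np.sum(U * V) / V[len(V) // 2]   # center entry V[L] = E{v^2}

L = 10
e = np.arange(-L, L + 1)
U = 0.8 ** np.abs(e)                 # low-pass input correlation
V_white = (e == 0).astype(float)     # white noise
V_hp = (-0.8) ** np.abs(e)           # high-pass noise: weak spectral overlap with input

mu, M = 1e-3, 100
m_w = misadjustment(mu, M, U, V_white)
m_hp = misadjustment(mu, M, U, V_hp)
print(m_w, m_hp)                     # less spectral overlap -> smaller misadjustment
```

For white noise the sum collapses to U⁰ = E{u²}, recovering the classical μM·E{u²} result, while the high-pass noise yields a markedly smaller value, consistent with the remark that non-overlapping spectra drive the misadjustment toward zero.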
Above we determined the weight-error correlation A^ε(0) for a zero time shift. Often the generalized weight-error correlation A^ε(δ) will be desirable; due to the small step-size, it will slowly decrease as a function of the time shift δ. The expression for A^ε(δ) is rather complicated (see below), but we can derive a simple expression for its sum Σ_δ A^ε(δ) over all time shifts, which can be interpreted as the low-frequency spectral density of the weight-error fluctuations. First, we determine its spatial transform:

F_space{Σ_δ A^ε(δ)} = F_time F_space{A^ε(δ)}|_{z=1} = |H(1, ξ)|² F_space{4μ² V^ε U^ε} = R⁻²(ξ) 2μ R(ξ) F_space{V^ε};

thus (1/2μ) R(ξ) Σ_δ F_space{A^ε(δ)} = F_space{V^ε}, which, after application of an inverse spatial Fourier transform, yields the interesting result

Σ_δ U^ε ∗ A^ε(δ) = V^ε.  (2.44)

Notice that in this relation the step-size μ does not occur. In this respect it is the counterpart of (2.42), where U^ε does not occur. Combination of (2.42) and (2.44) eliminates V^ε, yielding

μ Σ_δ U^ε ∗ A^ε(δ) = A^ε(0).  (2.45)
(2.46)

G^ε(δ) = (δ^ε − 2μU^ε) ∗ ⋯ (|δ| terms) ⋯ ∗ (δ^ε − 2μU^ε).  (2.47)
2.6
In this section we concentrate on the adaptive process, that is, the transient phase, in which the additive noise plays a negligible role, f_i(n) ≈ 0. The adaptive process ultimately passes into the steady state, in which the weight fluctuations assume a stationary stochastic character and where the noise becomes essential, f_i(n) ≠ 0. In Section 2.2 we reviewed the two phenomena in exactly this order, but here we choose the inverse treatment, guided by didactic considerations: While the weight-error correlations can be sufficiently modeled as a zeroth-order effect (the higher-order corrections do not create basically new aspects), the simple zeroth-order theory of the adaptive process merely predicts a deterministic exponential decay of the weight errors, as represented by α_i(n). In a following step, the superimposed stochastic fluctuations are described by the first-order corrections β_i(n). Thus the present section represents a first exercise in the iterative solution of the filter's update equation, as proposed in Section 2.3. In Section 2.7, treating stability, we will profit by the full iterative solution using all higher-order corrections.

In the adaptive process with f_i(n) = 0, the zeroth-order solution α_i(n) satisfies the homogeneous partial difference equation (cf. (2.7))

α_i(n + 1) = α_i(n) − 2μ U_i ∗ α_i(n),  (2.48)

whose spatial Fourier transform obeys

F_space{α_i(n + 1)} = [1 − R(ξ)] F_space{α_i(n)},  (2.49)

with the solution

F_space{α_i(n)} = [1 − R(ξ)]ⁿ F_space{α_i(0)}.  (2.50)
Thus the spatial transform of the weight-error distribution decays exponentially with a decay factor dependent on the spatial frequency ξ = e^{jk}. Compare (2.50) with the classical theory (cf. Section 2.2), where the eigenvalues of the input correlation matrix E{u(n)uᵗ(n)} determine the decay factors, while its eigenvectors determine the pertinent spatial modes of the adaptive process. For the infinitely long LMS filter, as discussed above, we have a continuum of spatially sinusoidal modes, which can also be found from the asymptotic behavior of large Toeplitz matrices [18].
1. For a white input signal, R(ξ) reduces to the constant R(1), and (2.50) yields

α_i(n) = [1 − R(1)]ⁿ α_i(0),  (2.51)

with R(1) = 2μ E{u²(n)}, so that the spatial structure of the weight errors is preserved during the adaptive process.

2. The same result (2.51) is found for a colored input and a smooth initial distribution α_i(0), containing only small spatial frequencies, so that R(ξ) ≈ R(1) = 2μ Σ_δ E{u(n)u(n + δ)}.
For an exact treatment of the adaptive process we have to solve the complete set of equations (2.48) and (2.26), (2.27), and so on. As we have shown, the solution α_i(n) of (2.48) (zeroth-order solution) has a deterministic character, which for a white input signal is given by the exponential decrease (2.51). Again for a white input, we now consider the higher-order corrections, particularly the first-order term β_i(n). Since the excitation term of (2.26) is a mixture of deterministic and stochastic signals, the same is true for the solution β_i(n), so that stochastic fluctuations are superimposed on the exponential α_i(n), whose amplitudes we now determine. With (2.51) and the whiteness assumption U_i = U(0)δ_i, R(ξ) = R(1) = 2μU(0), the partial difference equation (2.26) reads as

β_i(n + 1) = [1 − R(1)] β_i(n) − 2μ Σ_l P_{i,l}(n) α_l(n)
           = [1 − R(1)] β_i(n) − 2μ [1 − R(1)]ⁿ ε(n) g_i(n),  (2.52)

where ε(n) = 0 for n < 0, ε(n) = 1 for n ≥ 0, and g_i(n) = Σ_l P_{i,l}(n) α_l(0). The right-hand term of (2.52) is a product of two factors: 2μ[1 − R(1)]ⁿ ε(n) is a deterministic signal starting at n = 0, while g_i(n) is a stationary, zero-mean stochastic signal. The solution of (2.52) has the form
β_i(n) = −2μ Σ_j h(j) [1 − R(1)]^{n−j} ε(n − j) g_i(n − j),

with h(j) the temporal impulse response of the recursion (2.52), so that

E{β_i(n)²} = 4μ² Σ_j Σ_δ h(j) h(j + δ) [1 − R(1)]^{n−j} [1 − R(1)]^{n−j−δ} ε(n − j) ε(n − j − δ) G_i(δ),

where

G_i(δ) = E{g_i(n) g_i(n + δ)} = Σ_l Σ_k α_l(0) α_k(0) E{P_{i,l}(n) P_{i,k}(n + δ)} = Σ_k Σ_p α_{k+p}(0) α_k(0) T(i, k, p, δ).

For a white Gaussian input the fourth-order moments T(i, k, p, δ) can be evaluated explicitly, and the double sum reduces to

E{β_i(n)²} ≈ R²(1) [1 − R(1)]^{2n−2} [ n Σ_k α_k(0)² + Σ_{δ=−n}^{n} (n − |δ|) α_{i+δ}(0) α_{i−δ}(0) ].  (2.54)
The maximum over n of this expression is attained near half the time constant of the adaptive process and equals

max_n E{β_i(n)²} ≈ (R(1)/2e) [ Σ_k α_k(0)² + Σ_δ α_{i+δ}(0) α_{i−δ}(0) ],  (2.55)
valid for sufficiently small step-sizes. For the special case of a uniform initial weight-error distribution, α_i(0) = α(0), and an observation at the line center, it assumes the value (2e)⁻¹ [U(0)/U(0)_max] α²(0), where U(0)_max = 1/(μM) denotes the stability bound (cf. Section 2.7).

Summarizing, we conclude that the amplitude of the fluctuations superimposed on the exponential weight-error decay equals zero at the beginning and the end of the adaptive process and reaches its maximum at half the time constant. That maximum amplitude depends on the step-size μ: It vanishes for μ → 0 but assumes considerable values near the stability bound. Although we have explicitly studied only white Gaussian input signals, similar statements also apply in more general situations.
2.7 STABILITY
The iteration for the weight errors provides a useful tool to address the stability issue. If the iteration diverges, the system is certainly unstable. Conversely, we only conjecture stability if the iteration converges. In this case stability is not guaranteed, because we refer to the class of stationary, stochastic processes and thus exclude instabilities involving other signal classes. However, there is strong evidence, theoretical and experimental, that our stability condition (2.59) is necessary and sufficient.
In Section 2.3 we iteratively determined the weight errors in an adaptive filter excited by a stationary stochastic signal f_i(n). In particular, we derived the steady-state weight-error correlation A^ε(0) = E{α_i(n)α_{i+ε}(n)} in the limit μ → 0 (cf. (2.42)). Here we return to that steady-state problem, but now we reckon with the higher-order corrections. We derive an upper bound for the step-size beyond which the adaptive filter becomes unstable. This maximum step-size turns out to be rather small for long filters, so that throughout low-μ approximations are justified.

First, we determine the autocorrelation of the first-order correction β_i(n) in terms of the autocorrelation of the zeroth-order solution α_i(n). Replacing β_i(n) with its convolution representation (cf. (2.30)), we obtain
B^ε(δ) = E{β_i(n)β_{i+ε}(n + δ)} = h̃^ε(δ) ∗∗ 4μ² G^ε(δ),  (2.56)

B^ε(0) = h̃^ε(0) ∗ 4μ² Σ_δ G^ε(δ).  (2.57)
Here G^ε(δ) involves a double sum over l whose terms are proportional to A^{ε+m}(0). Now the right-hand sum over l is unbounded for an infinitely long filter (M → ∞) if the expression between brackets does not vanish for l → ±∞. If it approaches a nonzero constant there, the sum over l approximately becomes M times this constant. Using the tapped-delay line constraint (2.18), the bracketed expression is found for l → ±∞ to approach products of the form U^γ U^{m−γ}, so that

Σ_δ G^ε(δ) ≈ M Σ_m A^{ε+m}(0) Σ_γ U^γ U^{m+γ} = M A^ε(0) ∗ U^ε ∗ U^{−ε},

and hence

B^ε(0) = h̃^ε(0) ∗ 4μ² M A^ε(0) ∗ U^ε ∗ U^{−ε}.  (2.58)
For the iteration to converge, we must thus require

μ < 1 / [M P_u(e^{jΩ})] for all Ω, i.e., μ < 1 / (M P_u,max),  (2.59)
where P_u(e^{jΩ}) = F{U^ε} denotes the input power spectrum. Then we have for all spatial frequencies

F_space{B^ε(0)} < F_space{A^ε(0)},  (2.60)
so that the iteration converges. A weaker requirement is that the pertinent difference equation (2.48) yield a decaying solution for all spatial frequencies. In the z-domain this reads such that for any ξ = e^{jk} the poles of the system function H(z, ξ) must remain within the unit circle |z| = 1. Following (2.38), this amounts to R(ξ) = 2μ F_space{U_i} < 2 or, transformed into the temporal frequency domain,

μ < 1 / P_u(e^{jΩ}) for all Ω.  (2.61)

Fulfillment of this condition guarantees stability of the zeroth-order solution. Obviously, our condition (2.59) guaranteeing convergence of the iterational procedure is much stronger and implies (2.61).
Another stability condition (2.114) has been established by Clarkson and White [9], which is based upon a transfer-function approach to LMS adaptive filtering. In Appendix A it is shown that condition (2.114) can be derived from but is weaker than (2.59). But it is stronger than (2.61), which does not contain the crucial factor M⁻¹.
2.8
The simple wave theory applies where spatial stationarity is guaranteed. This is the case on a hypothetical tapped-delay line of infinite length, but on an actual, albeit long, line stationarity is violated in the vicinity of the terminations. The boundary conditions (vanishing weight errors beyond the terminations) require local perturbations of the wave modes called reflections. Here we investigate the size of the regions in which they occur and their influence upon the filter's steady-state and transient behavior. Only in exceptional situations (short tapped-delay line, strong coloring of the input signal) do the wave reflections appear to deserve explicit consideration; in most cases, the simple wave theory applies.

We assume the tapped-delay line to be so long that the reflected waves set up at the two terminations do not interact (no multiple reflections); so we can concentrate on one of the terminations and apply the final results mutatis mutandis to the other termination. We arbitrarily choose the beginning (feeding point) of the line, where the line taps are conveniently renumbered as i = 0, …, M − 1, so that i = 0 denotes the beginning of the line. Further, on the long line the reflected waves do not see the line end, so that the sequence i = 0, 1, 2, 3, … can be viewed as unterminated.
We now imagine a continuation of the line toward i < 0 while assuming the validity of the original zeroth-order update equation (2.25) for all i. Then the response to a delta excitation at i ≥ 0 penetrates into the virtual region i < 0, whereas the response to an imaginary excitation at i < 0 (to be required below) penetrates into the region i ≥ 0. Ultimately the total response α_i(n) has to vanish for i < 0. For a given excitation in the region i ≥ 0 this will be accomplished by applying imaginary point excitations at i < 0. Just as the boundary condition for the electric field of a
2:62
2:63
To satisfy (2.62), for any given z the poles of the system function $H(z,\xi)$ outside the unit circle $|\xi| = 1$ have to be counterbalanced by zeros of $F(z,\xi)$. Let the input signal of the adaptive filter have a finite correlation length L, such that U(i) vanishes for |i| > L; then, with $R(\xi) = R(\xi^{-1})$ in (2.37) assuming the form of $\xi^{-L}$ times a 2L-th-order polynomial in $\xi$, the denominator of $H(z,\xi)$ in (2.38) can be cast in the form

$$z - 1 + R(\xi) = G(z,\xi)\,G(z,\xi^{-1}), \tag{2.64}$$

where $G(z,\xi)$ collects the L zero factors lying inside the unit circle,

$$G(z,\xi) \propto \prod_{l=1}^{L}(\xi - q_l), \qquad |q_l| < 1. \tag{2.65}$$

The second causality condition (2.63) is concerned with the behavior of $A(z,\xi)$ for $\xi \to \infty$, where $F(z,\xi) = O(1)$, $R(\xi) = O(\xi^{L})$, and $H(z,\xi) = O(\xi^{-L})$. To obtain $A(z,\xi) = O(1)$, it is required that $F(z,\xi) = O(\xi^{L})$.
This is accomplished by the image source term, which, with excitations at the virtual taps i = -1, ..., -L, takes the form

$$F(z,\xi) \;\hat=\; \sum_{j=1}^{L} B_{-j}(z)\,\xi^{\,j}, \qquad f_i(n) \;\hat=\; \sum_{j=1}^{L} b_{-j}(n)\,\delta_{i,-j}. \tag{2.67}$$
To estimate the size of the reflection region, the spatial width or radius of inertia $\Delta i$ of $h_i(n)$ can be used, whose square we (rather arbitrarily) define as

$$(\Delta i)^2 = \frac{\displaystyle\sum_i \sum_n i^2\,[h_i(n)]^2}{\displaystyle\sum_i \sum_n [h_i(n)]^2}. \tag{2.68}$$
Application of Parseval's theorem to (2.68), with $H(e^{j\Omega}, e^{jk})$ denoting the double transform of $h_i(n)$, yields

$$(\Delta i)^2 = \frac{\displaystyle\frac{1}{4\pi^2}\int_{-\pi}^{\pi}\!\!\int_{-\pi}^{\pi}\big|H'(e^{j\Omega},e^{jk})\big|^2\,d\Omega\,dk}{\displaystyle\frac{1}{4\pi^2}\int_{-\pi}^{\pi}\!\!\int_{-\pi}^{\pi}\big|H(e^{j\Omega},e^{jk})\big|^2\,d\Omega\,dk}
\approx \frac{\displaystyle\frac{1}{2\pi}\int_{-\pi}^{\pi}\frac{[R'(e^{jk})]^2}{4R^3(e^{jk})}\,dk}{\displaystyle\frac{1}{2\pi}\int_{-\pi}^{\pi}\frac{dk}{R(e^{jk})}}, \tag{2.69}$$

where $H'(e^{j\Omega}, e^{jk}) = dH(e^{j\Omega}, e^{jk})/dk$ and $R'(e^{jk}) = dR(e^{jk})/dk$. From (2.69) it can easily be concluded that the width $\Delta i$ of the impulse response significantly exceeds unity only if the input power spectrum $R(e^{j\Omega})$ varies strongly as a function of $\Omega$ (which can occur only for a large input correlation length). In most practical situations this width, and hence the size of the reflection region, is confined to only a few taps.
Summarizing, it can safely be stated that for LMS adaptive filters of moderate or great length (such as those used for acoustic echo cancellation) the simple wave theory applies with sufficient accuracy.
2.9
In previous sections we developed a wave theory for long LMS adaptive filters containing tapped-delay lines. Here we generalize the theory for a structure with cascaded identical all-pass sections, as considered, for example, in [1] in the context of Laguerre filters.
2.9.1 Steady State
First, we consider the weight-error correlations in the steady state, that is, after completion of the adaptive process. To begin with, we modify (2.18) for a cascade of identical all-pass sections:

$$u_i(n) = g_i(n) * u(n),\qquad g_i(n) = \underbrace{g(n) * \cdots * g(n)}_{i\ \text{terms}}, \tag{2.70}$$

where g(n) denotes the impulse response of the elementary all-pass section. Then the input correlation (2.19) becomes

$$U_\varepsilon(\delta) = E\{u_i(n)\,u_{i+\varepsilon}(n+\delta)\} = E\{[g_i(n)*u(n)]\,[g_{i+\varepsilon}(n)*u(n+\delta)]\}. \tag{2.71}$$
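As an illustration of the structure (2.70), the tap signals of such a filter can be generated by repeatedly passing the input through one first-order all-pass section. This is only a sketch under assumed conditions: the section coefficient and the signal parameters below are invented, not taken from the text.

```python
import numpy as np

def allpass(x, a=0.5):
    # First-order all-pass section H(z) = (-a + z^{-1}) / (1 - a z^{-1}):
    # y[n] = -a*x[n] + x[n-1] + a*y[n-1]
    y = np.zeros_like(x)
    for n in range(len(x)):
        y[n] = -a * x[n]
        if n > 0:
            y[n] += x[n - 1] + a * y[n - 1]
    return y

rng = np.random.default_rng(0)
u = rng.standard_normal(20000)

# Tap signals per (2.70): u_0(n) = u(n), u_i(n) = g(n) * u_{i-1}(n)
taps = [u]
for _ in range(4):
    taps.append(allpass(taps[-1]))

# The all-pass property |G(e^{j Omega})| = 1 preserves the signal power at every tap
print([round(float(np.var(t)), 2) for t in taps])
```

Replacing the unit delays of a TDL by such sections changes only the phase function $b(\Omega)$ of (2.72)-(2.73), which is the sole quantity through which the structure enters the wave theory below.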
The elementary section is characterized by its all-pass property and by the associated phase function $b(\Omega)$:

$$|G(e^{j\Omega})| = 1 \quad\text{for all }\Omega, \tag{2.72}$$

$$G(e^{j\Omega}) = e^{-jb(\Omega)}. \tag{2.73}$$
Observing that g(-n) represents the inverse impulse response of the elementary all-pass section and that going back on the delay line corresponds to system inversion, we have

$$g_{-i}(n) = g_i(-n). \tag{2.74}$$
With (2.70), the expectation in (2.71) can be elaborated as

$$U_\varepsilon(\delta) = \sum_{n'}\sum_{n''} g_i(n')\,g_{i+\varepsilon}(n'')\,U(\delta+n'-n'') = \sum_{n}\Big[\sum_{n'} g_i(n')\,g_{i+\varepsilon}(n'+n)\Big]\,U(\delta-n), \tag{2.75}$$

where, using (2.74),

$$\sum_{n'} g_i(n')\,g_{i+\varepsilon}(n'+n) = g_i(-n)*g_{i+\varepsilon}(n) = g_\varepsilon(n), \tag{2.76}$$

so that

$$U_\varepsilon(\delta) = \sum_n g_\varepsilon(n)\,U(\delta-n) = g_\varepsilon(\delta)*U(\delta). \tag{2.77}$$

An analogous relation holds for the noise correlation:

$$V_\varepsilon(\delta) = g_\varepsilon(\delta)*V(\delta). \tag{2.78}$$
In Section 2.5, before (2.42), an expression for the weight-error correlation was derived:

$$A_\varepsilon(0) = E\{a_i(n)\,a_{i+\varepsilon}(n)\} = \tilde h_\varepsilon(0) = 4\mu^2\sum_\delta F_\varepsilon(\delta). \tag{2.79}$$

Here

$$F_\varepsilon(\delta) = V(\delta)\,U_\varepsilon(\delta),\qquad \sum_\delta F_\varepsilon(\delta) = \frac{1}{2\pi}\int_{-\pi}^{\pi} F_{\rm time}\{U_\varepsilon(\delta)\}\;F^*_{\rm time}\{V(\delta)\}\,d\Omega. \tag{2.80}$$

While $F_{\rm time}\{V(\delta)\} = \tilde V(\Omega)$ can readily be interpreted as the noise power spectral density (notice the different meaning of the tilde in $\tilde h_\varepsilon(0)$ and in $\tilde V(\Omega)$!), the term $F_{\rm time}\{U_\varepsilon(\delta)\}$ follows from the convolution (2.77) as the product $G_\varepsilon(e^{j\Omega})\,\tilde U(\Omega)$.
Thus (2.79) passes into

$$\tilde h_\varepsilon(0) = \frac{4\mu^2}{2\pi}\int_{-\pi}^{\pi} G_\varepsilon(e^{j\Omega})\,\tilde U(\Omega)\,\tilde V(\Omega)\,d\Omega. \tag{2.81}$$

As in Section 2.5, the weight-error correlation follows after normalization by the spatially transformed input correlation:

$$F_{\rm space}\{A_\varepsilon(0)\} = \frac{F_{\rm space}\{\tilde h_\varepsilon(0)\}}{2\,F_{\rm space}\{2\mu\,U_i(0)\}}, \tag{2.82}$$

$$2\,F_{\rm space}\{2\mu\,U_i(0)\} = 4\mu\,F_{\rm space}\Big\{\sum_n g_i(n)\,U(n)\Big\} = \frac{4\mu}{2\pi}\int_{-\pi}^{\pi}\hat G(k;\Omega)\,\tilde U(\Omega)\,d\Omega, \tag{2.83}$$

where $\hat G(k;\Omega)$ denotes the spatial Fourier transform of $G_\varepsilon(e^{j\Omega}) = e^{-j\varepsilon b(\Omega)}$. Hence

$$F_{\rm space}\{A_\varepsilon(0)\} = \mu\,\frac{\displaystyle\int_{-\pi}^{\pi}\hat G(k;\Omega)\,\tilde U(\Omega)\,\tilde V(\Omega)\,d\Omega}{\displaystyle\int_{-\pi}^{\pi}\hat G(k;\Omega)\,\tilde U(\Omega)\,d\Omega}. \tag{2.84}$$

The kernel $\hat G(k;\Omega)$ concentrates at $\Omega = b^{-1}(k)$; each of the two integrals in (2.84) picks up its integrand at this frequency with the common weight

$$\frac{1}{[db/d\Omega]_{\Omega=b^{-1}(k)}}. \tag{2.85}$$

Inserting this result into (2.84) shows that the Fourier transform of the weight-error correlation is independent of the input signal (of both its amplitude and its spectral distribution):

$$F_{\rm space}\{A_\varepsilon(0)\} = \mu\,\tilde V(b^{-1}(k)), \tag{2.86}$$
thus solely determined by the noise power spectrum. We are acquainted with such a result from the TDL structure, where $b(\Omega) = \Omega$, $b^{-1}(k) = k$, and $A_\varepsilon(0) = \mu V(\varepsilon)$ (cf. (2.42)). In our generalized situation, we have a simple relation only in the spatial frequency domain, which, however, contains a nonlinear frequency transformation $\Omega = b^{-1}(k)$. In the spatial domain the weight-error correlation $A_\varepsilon(0)$ is determined by the noise correlation $V(\delta)$ such that, for a certain $\varepsilon$, $A_\varepsilon(0)$ depends on $V(\delta)$ for all values of $\delta$. The dependence is linear and can formally be described by an (infinite) matrix (this item is not elaborated here).
2.9.2 Transients
Now we discuss the adaptive process, in which the additive noise can be neglected. It is governed by the homogeneous difference equation

$$a_i(n+1) = a_i(n) - 2\mu\,U_i * a_i(n), \tag{2.87}$$

with

$$U_i = \sum_n g_i(n)\,U(n). \tag{2.88}$$

Let $a(n; e^{jk})$ denote the spatial Fourier transform of $a_i(n)$; then spatial transformation of (2.87) yields

$$a(n+1;\,e^{jk}) = \big[1 - R(e^{jk})\big]\,a(n;\,e^{jk}). \tag{2.89}$$
Our main task now is to determine $R(e^{jk})$ for a cascade of identical all-pass sections:

$$R(e^{jk}) = 2\mu\,F_{\rm space}\{U_i\} = 2\mu\,F_{\rm space}\Big\{\sum_n g_i(n)\,U(n)\Big\}
= \frac{2\mu}{2\pi}\int_{-\pi}^{\pi}\hat G(k;\Omega)\,\tilde U(\Omega)\,d\Omega
= 2\mu\left[\frac{\tilde U(\Omega)}{\tau(\Omega)}\right]_{\Omega=b^{-1}(k)}, \tag{2.90}$$

where $\tau(\Omega) = db/d\Omega$.
2.9.3 Stability

Finally, we derive a necessary stability condition for the filter under consideration, following the reasoning of Section 2.7. First, $B_\varepsilon(0) = \tilde h_\varepsilon(0) + 4\mu^2\sum_\delta G_\varepsilon(\delta)$, in which fourth-order moments of the tap signals occur. For $l \to \pm\infty$ these moments factorize,

$$E\{u_i(n)\,u_{i+l+m}(n)\,u_{i+\varepsilon}(n+\varepsilon')\,u_{i+\varepsilon+l}(n+\varepsilon')\}
\to E\{u_i(n)\,u_{i+\varepsilon}(n+\varepsilon')\}\,E\{u_{i+l+m}(n)\,u_{i+\varepsilon+l}(n+\varepsilon')\},$$

since $U(l+m) \to 0$ and $U(l) \to 0$. Using (2.77) we find

$$\sum_\delta G_\varepsilon(\delta) = M\sum_m A_{\varepsilon+m}(0)\,q(m) = M\,\big[A_\varepsilon(0)*q(\varepsilon)\big],\qquad
q(m) \;\hat=\; \frac{1}{2\pi}\int_{-\pi}^{\pi} e^{jmb(\Omega)}\,\tilde U^2(\Omega)\,d\Omega,$$

so that

$$B_\varepsilon(0) = h_\varepsilon(0) + 4\mu^2 M\,\big[A_\varepsilon(0)*q(\varepsilon)\big],$$

which, after a spatial Fourier transformation, passes into

$$F_{\rm space}\{B_\varepsilon(0)\} = \mu\,M\,\tilde U(b^{-1}(k))\,F_{\rm space}\{A_\varepsilon(0)\}, \tag{2.91}$$

because $F_{\rm space}\{q_\varepsilon\} = [\tilde U^2(\Omega)/\tau(\Omega)]_{\Omega=b^{-1}(k)}$ and $F_{\rm space}\{h_\varepsilon(0)\}$ combine in such a way that a factor $\tau(b^{-1}(k))\cdot 4\mu\,\tilde U(b^{-1}(k))$ cancels; as before, $\tau(\Omega) = db/d\Omega$. In order that the iteration procedure converges, we have to satisfy

$$\mu\,M\,\tilde U_{\max} < 1. \tag{2.92}$$

Comparing this result with (2.59) for the TDL structure, we do not observe any difference: also for the general all-pass structure, the upper bound on $\mu$ is determined only by the maximum input spectral density.
2.10 EXPERIMENTS

2.10.1 Steady State
For a sufficiently small step-size and for an infinitely long delay line, the weight-error correlations have been shown to satisfy (2.42). For a line of moderate or small length, deviations from (2.42) have to be expected, particularly in the vicinity of the terminations. However, this occurs only if the input signal and the additive noise are nonwhite: the weight-error correlation matrix then satisfies the Lyapunov equation (2.12), whose solution exactly agrees with (2.42) if at least one of the two signals is white. In that case, no reflections occur at the terminations.

Therefore, let u(n), v(n) both be colored, for example, U(0) = V(0) = 2, U(1) = U(-1) = 0.8, V(1) = V(-1) = 1, U(i) = V(i) = 0 for |i| > 1. For the weight-error correlation between two taps i, j on the infinitely long delay line, (2.42) yields $E\{a_i(n)a_j(n)\} = \mu V(i-j)$. However, in the vicinity of the terminations, the (i, j) element $E\{a_i(n)a_j(n)\}$ of the weight-error correlation matrix no longer depends only on the tap distance (i - j). In other words, in the vicinity of the borders, the weight-error correlation matrix deviates from the Toeplitz form. We illustrate that for a delay line of length 6, for which, apart from a multiplicative factor $\mu$, the exact
weight-error correlation matrix equals

$$\begin{pmatrix}
2.44 & 0.90 & 0.03 & 0.01 & 0.00 & 0.00\\
0.90 & 2.05 & 0.98 & 0.01 & 0.00 & 0.00\\
0.03 & 0.98 & 2.01 & 0.99 & 0.01 & 0.01\\
0.01 & 0.01 & 0.99 & 2.01 & 0.98 & 0.03\\
0.00 & 0.00 & 0.01 & 0.98 & 2.05 & 0.90\\
0.00 & 0.00 & 0.01 & 0.03 & 0.90 & 2.44
\end{pmatrix}.$$
Particularly in the corners (left above, right below), deviations are observed from what (2.42) predicts, viz., a Toeplitz matrix T with $T_{ii} = 2$, $T_{i,i+1} = T_{i,i-1} = 1$, $T_{ij} = 0$ elsewhere. The above result has been supported experimentally in a run of $5\cdot 10^7$ cycles with $\mu = 0.782\cdot 10^{-3}$. None of the measured correlations deviates more than $\pm 0.02$ from the theoretical results.
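The run described above can be reproduced in miniature. The sketch below is an illustrative assumption, not the original experiment: it uses the chapter's $2\mu$ update convention, shapes white noise into the stated correlations with hypothetical MA(1) coefficients, and scales the run length down, so only the rough Toeplitz structure of the measured matrix, not the exact corner values, should be expected.

```python
import numpy as np

rng = np.random.default_rng(1)
M, T, mu = 6, 150_000, 1e-3

# Colored input with U(0) ~ 2, U(1) ~ 0.8 via an MA(1) model (approximate coefficients)
b0, b1 = 1.265, 0.632                 # b0^2 + b1^2 ~ 2, b0*b1 ~ 0.8
wn = rng.standard_normal(T + M)
u_sig = b0 * wn[1:] + b1 * wn[:-1]
# Colored noise with V(0) = 2, V(1) = 1
zn = rng.standard_normal(T + M)
v = zn[1:] + zn[:-1]

w_opt = rng.standard_normal(M)
w = np.zeros(M)
acc, count = np.zeros((M, M)), 0
for n in range(M, T):
    u = u_sig[n - M + 1:n + 1][::-1]          # tap-input vector
    e = (u @ w_opt + v[n]) - (u @ w)          # desired response minus filter output
    w = w + 2 * mu * u * e                    # LMS with the 2*mu convention
    if n > T // 4:                            # discard the adaptation transient
        a = w - w_opt                         # weight-error vector
        acc += np.outer(a, a)
        count += 1

A = acc / count
print(np.round(A / mu, 2))   # roughly Toeplitz: diagonal near V(0) = 2, first off-diagonal near V(1) = 1
```

The printed matrix, divided by $\mu$, approximates the weight-error correlation matrix discussed above; with this short run the entries fluctuate by roughly ten percent.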
2.10.2 Transients

2.10.3 Stability
2.11 CONCLUSIONS

2.11.1 The Long LMS Filter
In previous sections we studied the transient and steady-state behavior of the long LMS adaptive filter. Further, we discussed the stability problem and derived an upper bound for the step-size. Now we combine these studies and are led to a number of interesting conclusions concerning the global properties of the long LMS filter.
First, consider the stability bound (2.59) for the step-size $\mu$, which we rewrite in the form

$$\mu = \eta\,\mu_{\max} = \eta\,\frac{1}{M\,\big[F_{\rm space}\{U_i\}\big]_{\max}},\qquad 0<\eta<1. \tag{2.93}$$
Inserted into (2.37), this yields $R(\xi)_{\max} = 2\mu\,[F_{\rm space}\{U_i\}]_{\max} = 2\eta/M$. Further, writing the system function in (2.38) in the form $H(z,\xi) = (z - z_0)^{-1}$ with the pole $z_0 = 1 - R(\xi)$, we have

$$z_{0,\min} = 1 - R(\xi)_{\max} = 1 - \frac{2\eta}{M}. \tag{2.94}$$

Thus, the pole remains in the vicinity of +1, just inside the unit circle |z| = 1. Associated with the pole, a time constant $n_0$ can be defined satisfying $z_0^{n_0} = e^{-1}$, which for $z_0$ in the vicinity of 1 can be approximated by $n_0 \approx (1-z_0)^{-1}$, yielding

$$n_{0,\min} = \frac{M}{2\eta}. \tag{2.95}$$
Figure 2.2 Natural behavior of a noise-free LMS adaptive filter for two different step-sizes. The filter length equals M = 50, the weight error is observed at the center of the delay line (tap 25), and the input signal is white.
Figure 2.3 Natural behavior of a noise-free LMS adaptive filter for two different step-sizes. The filter length equals M = 50, and the weight error is observed at the center of the delay line (tap 25). The input signal is colored according to $P_u(e^{j\Omega}) = \mathrm{const}\cdot(1 + 0.8\cos\Omega)$.
With respect to the last item we conclude that, in accordance with (2.43) and the relations ahead of it, the misadjustment can be determined as

$$\text{misadjustment} = \frac{E\{[\mathbf n^t(n)\,\mathbf u(n)]^2\}}{E\{v^2(n)\}}
= \frac{\displaystyle\sum_i\sum_j K_{ij}\,U(i-j)}{E\{v^2(n)\}}
= \frac{\displaystyle M\sum_\varepsilon K_\varepsilon\,U(\varepsilon)}{E\{v^2(n)\}}. \tag{2.96}$$

2.11.2
2.11.2
Now we investigate which modifications of the wave theory are required to adapt it to the normalized least-mean-square (NLMS) algorithm, governed by the updating relation

$$\mathbf n(n+1) = \left[I - 2\tilde\mu\,\frac{\mathbf u(n)\,\mathbf u^t(n)}{\mathbf u^t(n)\,\mathbf u(n)}\right]\mathbf n(n) + 2\tilde\mu\,\frac{\mathbf u(n)}{\mathbf u^t(n)\,\mathbf u(n)}\,f(n). \tag{2.97}$$
For a long tapped-delay line we make the basic observation that, due to ergodicity, the normalizing quantity $\mathbf u^t(n)\mathbf u(n)$ becomes (almost) independent of time,

$$\mathbf u^t(n)\,\mathbf u(n) = u^2(n)+u^2(n-1)+u^2(n-2)+\cdots+u^2(n-M+1)\approx M\,E\{u^2(n)\}, \tag{2.98}$$

so that the NLMS filter is equivalent to an LMS filter with a step-size $\mu$ equal to $\tilde\mu/(M\,E\{u^2(n)\})$. In particular, the weight-error correlation (2.42) passes into

$$A_\varepsilon(0) = E\{a_i(n)\,a_{i+\varepsilon}(n)\} = \frac{\tilde\mu}{M\,E\{u^2(n)\}}\,V(\varepsilon) = \frac{\tilde\mu}{M\,E\{u^2(n)\}}\,E\{v(n)\,v(n+\varepsilon)\}, \tag{2.99}$$

and the misadjustment becomes

$$\frac{\tilde\mu}{E\{v^2(n)\}\,E\{u^2(n)\}}\sum_\varepsilon E\{v(n)\,v(n+\varepsilon)\}\,E\{u(n)\,u(n+\varepsilon)\}, \tag{2.100}$$

which, in contrast to (2.43), is symmetric with respect to the input and noise signals and independent of M. Similarly, expressions can be derived for the adaptive process, again with $\mu$ replaced by $\tilde\mu/(M\,E\{u^2(n)\})$.
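The asserted equivalence between NLMS with step-size $\tilde\mu$ and LMS with $\mu = \tilde\mu/(M\,E\{u^2(n)\})$ is easy to check numerically. The sketch below is an illustration under assumed conditions: it uses the standard single-step-size update convention (without the factor 2 of (2.97)), white input with $E\{u^2\} = 1$, and invented parameter values.

```python
import numpy as np

rng = np.random.default_rng(2)
M, N, mu_tilde, sv = 50, 20_000, 0.5, 0.01
w_opt = rng.standard_normal(M) / np.sqrt(M)
x = rng.standard_normal(N + M)          # white input, E{u^2} = 1
mu = mu_tilde / (M * 1.0)               # equivalent LMS step-size mu~/(M E{u^2})

def run(normalized):
    w = np.zeros(M)
    msd = np.empty(N)
    for n in range(N):
        u = x[n:n + M][::-1]            # tap-input vector
        e = (u @ w_opt + sv * rng.standard_normal()) - u @ w
        step = mu_tilde / (u @ u) if normalized else mu
        w = w + step * u * e
        msd[n] = np.sum((w - w_opt) ** 2)   # mean-square weight deviation
    return msd

msd_nlms, msd_lms = run(True), run(False)
print(msd_nlms[-1000:].mean(), msd_lms[-1000:].mean())   # comparable steady-state levels
```

For a long line the normalization $u^t(n)u(n)$ hardly fluctuates around $M\,E\{u^2\}$, so the two learning curves nearly coincide; for short filters or strongly colored inputs the equivalence degrades, as the text goes on to discuss.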
Using the same reasoning as above and using (2.59), we would arrive at the stability bound

$$\tilde\mu < \frac{E\{u^2(n)\}}{P_u(e^{j\Omega})} = \frac{\operatorname{average}\{P_u(e^{j\Omega})\}}{P_u(e^{j\Omega})}\qquad\text{for all }\Omega,$$

which is more restrictive than the well-known NLMS stability bound [20]

$$\tilde\mu < 1. \tag{2.101}$$
Only for the special case of a white input are both bounds identical. Which bound is correct in the case of a colored input? Following the reasoning cited in [20], the NLMS filter is stable under the condition (2.101), because then the homogeneous updating equation (without an excitation term) is associated with a nonincreasing energy function. This simple reasoning is convincing. Moreover, the bound (2.101) is confirmed by simulations.

What, then, is wrong with our own reasoning? Apparently, the approximation (2.98) can fail from time to time, in that the length of the input vector can deviate considerably from the value predicted by (2.98), and local instabilities can occur. Thus, the bound (2.101) cannot be derived from the stability bound (2.59) for the LMS filter. In passing, we note that from a stability point of view, NLMS obviously deserves preference over LMS.
2.11.3

2.12 APPENDIXES
In terms of the factor $\Gamma(n,j) \hat= \mathbf u^t(n)\,\mathbf u(j)$, the error dynamics take the form

$$y(n) = -2\mu\sum_{j=0}^{n-1}\Gamma(n,j)\,y(j) + 2\mu\sum_{j=0}^{n-1}\Gamma(n,j)\,v(j). \tag{2.103}$$

The factor $\Gamma(n,j)$ deserves particular consideration. For an extremely long delay line, this quantity loses its stochastic character and can be approximated by a constant. To show that, elaborate the inner product

$$\Gamma(n,j) = \mathbf u^t(n)\,\mathbf u(j) = \sum_{i=0}^{M-1} u(n-i)\,u(j-i) \tag{2.104}$$
and exploit ergodicity of u(n) (time averaging = ensemble averaging). Then the sum becomes approximately M times the autocorrelation of the input signal:

$$\Gamma(n,\,n-l) \approx M\,U(l),\qquad U(l) = E\{u(n)\,u(n-l)\},\qquad l = n-j. \tag{2.105}$$

Notice that even for large M, this relation has an approximate character. On the constant determined in (2.105), an (albeit small) oscillatory stochastic term is superimposed (cf. (2.110)), which has to be taken into account throughout when interpreting (2.103). Below we demonstrate that for increasing values of l the approximate value (2.105) becomes smaller and smaller, whereas the oscillatory contribution does not decrease. Thus the relative error of (2.105) is large for large l.

From (2.104) we conclude that

$$E\{\Gamma(n,j)\} = M\,U(n-j), \tag{2.106}$$
$$E\{\Gamma^2(n,j)\} = E\Big\{\Big[\sum_{i=0}^{M-1}u(n-i)\,u(j-i)\Big]^2\Big\}
= \sum_{i_1=0}^{M-1}\sum_{i_2=0}^{M-1} E\{u(n-i_1)\,u(n-i_2)\,u(j-i_1)\,u(j-i_2)\}. \tag{2.107}$$

For a Gaussian input signal the right-hand expectation can be expanded as follows:

$$\begin{aligned}
E\{u(n-i_1)\,&u(n-i_2)\,u(j-i_1)\,u(j-i_2)\}\\
&= E\{u(n-i_1)u(j-i_1)\}\,E\{u(n-i_2)u(j-i_2)\}\\
&\quad + E\{u(n-i_1)u(n-i_2)\}\,E\{u(j-i_1)u(j-i_2)\}\\
&\quad + E\{u(n-i_1)u(j-i_2)\}\,E\{u(n-i_2)u(j-i_1)\}\\
&= U^2(n-j) + U^2(i_2-i_1) + U(n-j+i_1-i_2)\,U(n-j-i_1+i_2).
\end{aligned}$$

Then we have (after minor elementary manipulations)

$$E\{\Gamma^2(n,j)\} = M^2U^2(n-j) + M\sum_{l=-M}^{M}\Big(1-\frac{|l|}{M}\Big)\big[U^2(l)+U(n-j+l)\,U(n-j-l)\big].$$

The first term equals $[E\{\Gamma(n,j)\}]^2$ (cf. (2.106)), so the sum $\sum_l$ can be interpreted as

$$\sigma^2 = \text{variance of }\Gamma(n,j) = M\sum_{l=-M}^{M}\Big(1-\frac{|l|}{M}\Big)\big[U^2(l)+U(n-j+l)\,U(n-j-l)\big]. \tag{2.108}$$

Furthermore, if $M \gg 1$ and the input correlation length is finite, we can use the approximation

$$\sigma^2 \approx M\sum_{l=-\infty}^{\infty}\big[U^2(l)+U(n-j+l)\,U(n-j-l)\big]. \tag{2.109}$$

So, the RMS value of $\Gamma(n,j)$ increases with $\sqrt M$, while its mean increases with M (in accordance with a basic statistical law regarding the uncertainty in averaged independent observations).
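The $\sqrt M$-versus-$M$ scaling can be illustrated with a quick Monte Carlo check for a white, unit-variance input, using the two extreme cases $|n-j| \ge M$ (disjoint windows) and $n = j$; the sample sizes below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)
M, K = 100, 20_000

# |n - j| >= M: the two length-M windows contain independent samples
gamma_off = np.array([rng.standard_normal(M) @ rng.standard_normal(M) for _ in range(K)])
# n = j: Gamma(n, n) = ||u(n)||^2
gamma_diag = np.array([np.sum(rng.standard_normal(M) ** 2) for _ in range(K)])

print(np.mean(gamma_off))     # near 0: the mean M*U(n-j) vanishes for distant taps
print(np.std(gamma_off))      # near sqrt(M): the fluctuation does not vanish
print(np.mean(gamma_diag))    # near M
```

This is exactly the point made in the text: for distant pairs (n, j) the mean of $\Gamma(n,j)$ is zero while its RMS value stays at $\sqrt M$, so replacing $\Gamma(n,j)$ by its mean incurs a large relative error.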
Now consider (2.106) and (2.109) for a white input signal. For the mean of $\Gamma(n,j)$ we find that $E\{\Gamma(n,j)\} = M\,\delta(n-j)$, while the variance becomes

$$\sigma^2 = M\sum_l\big[\delta(l)+\delta(n-j+l)\,\delta(n-j-l)\big] = M + M\,\delta(n-j).$$

Thus the variance assumes a nonzero value for any pair (n, j), while the mean vanishes for all $n \ne j$. Here we have the key to the illegitimacy of the replacement of $\Gamma(n,j)$ by its mean value: even for taps n, j with a large mutual distance |n - j| we have a nonvanishing $\Gamma(n,j)$, while the simple averaging yields a zero value. Hence, for most pairs (n, j) the relative error in the approximation is large. Similar reasoning applies to a colored input signal.

We now decompose $\Gamma(n,j)$ into its mean value and a time-varying part $\gamma(n,j)$ with zero mean:

$$\Gamma(n,j) = E\{\Gamma(n,j)\} + \gamma(n,j) = M\,U(n-j) + \gamma(n,j). \tag{2.110}$$
Neglecting $\gamma(n,j)$, one derives from (2.103)

$$y(n) = -2\mu M\sum_{j=0}^{n-1}U(n-j)\,y(j) + 2\mu M\sum_{j=0}^{n-1}U(n-j)\,v(j).$$

If, instead of $n_0 = 0$, we choose $n_1 = -\infty$ as the initial condition, we deal with the steady state and find

$$y(n) = -2\mu M\sum_{j=-\infty}^{n-1}U(n-j)\,y(j) + 2\mu M\sum_{j=-\infty}^{n-1}U(n-j)\,v(j). \tag{2.111}$$

In the low-$\mu$ approximation (zeroth-order solution) the first right-hand term can be neglected, so that we arrive at

$$y(n) = 2\mu M\,\big[U(1)\,v(n-1)+U(2)\,v(n-2)+U(3)\,v(n-3)+\cdots\big]. \tag{2.112}$$
2.12.1.1 Stability. The concomitant homogeneous equation reads

$$y(n) = -2\mu M\sum_{j=-\infty}^{n-1}U(n-j)\,y(j). \tag{2.113}$$
Although we doubt the correctness of this linear equation, we can wonder whether a stable filter at least satisfies the concomitant stability condition

$$1 + 2\mu M\,G(z) \ne 0 \quad\text{for}\ |z| > 1; \tag{2.114}$$

in other words, all zeros of $1 + 2\mu M\,G(z)$ lie inside the unit circle |z| = 1. Here G(z) is defined as

$$G(z) = U(1)\,z^{-1} + U(2)\,z^{-2} + U(3)\,z^{-3} + \cdots, \tag{2.115}$$

$$\sum_{i=-\infty}^{\infty} U(i)\,z^{-i} = U(0) + G(z) + G(z^{-1}). \tag{2.116}$$
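Condition (2.114) is easy to test numerically for a finite correlation length: multiplying $1 + 2\mu M\,G(z)$ by $z^L$ turns it into a polynomial whose roots must all lie inside the unit circle. The correlation values below are hypothetical, chosen only to illustrate the check.

```python
import numpy as np

# Hypothetical input autocorrelation with correlation length L = 2
U = {1: 0.8, 2: 0.3}
M, mu = 50, 1e-3
L = max(U)

# 1 + 2*mu*M*(U(1)/z + U(2)/z^2) = 0  <=>  z^2 + 2*mu*M*U(1)*z + 2*mu*M*U(2) = 0
coeffs = [1.0] + [2 * mu * M * U[l] for l in range(1, L + 1)]
roots = np.roots(coeffs)
print(np.abs(roots))   # all moduli below 1: (2.114) is satisfied for this step-size
```

Sweeping $\mu$ upward in such a script shows the roots migrating toward the unit circle, which locates the step-size at which the necessary condition (2.114) fails.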
Since G(z) is regular for |z| > 1 (cf. (2.115)), it assumes its maximum and minimum real parts on the boundary |z| = 1; then

$$-1 < 2\mu M\,\Re\{G(z)\} < 1 \tag{2.121}$$

holds also for |z| > 1. Thus $1 + 2\mu M\,G(z)$ cannot vanish for |z| > 1, because then its real part would also vanish there. We therefore conclude that (2.59) implies (2.114). On the other hand, with the aid of simple examples, one can show that the converse is not true. Thus (2.114) is necessary for stability but not sufficient. This conjecture can already be found in [9].
2.12.2

Here the double transform of the weight-error correlation involves the kernel

$$\frac{1}{\big(1-R(\xi)\big)\big(2-z^{-1}-z\big)+R^2(\xi)} \approx \frac{1}{2-z^{-1}-z+R^2(\xi)}, \tag{2.123}$$
where, with $\xi$ fixed and $z = e^{j\Omega}$,

$$F_{\rm time}^{-1}\Big\{\frac{1}{2-z^{-1}-z+R^2(\xi)}\Big\}
= \frac{1}{2\pi}\int_{-\pi}^{\pi}\frac{z^{|\delta|}}{2-z^{-1}-z+R^2(\xi)}\,d\Omega
= \frac{1}{2\pi j}\oint\frac{z^{|\delta|}\,dz}{-(z-z_1)(z-z_2)}
= \frac{z_1^{|\delta|}}{z_2-z_1} \approx \frac{z_1^{|\delta|}}{2R(\xi)}, \tag{2.124}$$

with the poles

$$z_{1,2} = 1+\tfrac{1}{2}R^2(\xi)\mp R(\xi)\sqrt{1+\tfrac{1}{4}R^2(\xi)} \approx 1\mp R(\xi). \tag{2.125}$$
On the long line, the spatial transform of the input correlation reads

$$F_{\rm space}\{U_\varepsilon(\delta)\} = \frac{1}{2\mu}\,R(\xi)\,\xi^{\delta}, \tag{2.126}$$

so that

$$F_{\rm space}\{A_\varepsilon(\delta)\} = \frac{z_1^{|\delta|}}{2R(\xi)}\cdot 4\mu^2\cdot\frac{1}{2\mu}\,R(\xi)\,\xi^{\delta}\,V(\delta)
= \mu\,z_1^{|\delta|}\,V(\delta)\,\xi^{\delta} = \mu\,G_s(\delta;\xi)\,V(\delta)\,\xi^{\delta}, \tag{2.127}$$

where

$$G_s(\delta;\xi) \;\hat=\; z_1^{|\delta|} \approx \big(1-R(\xi)\big)^{|\delta|} = F_{\rm space}\{G_\varepsilon(\delta)\}, \tag{2.128}$$

the latter equality defining the weighting function

$$G_\varepsilon(\delta) \;\hat=\; F_{\rm space}^{-1}\big\{z_1^{|\delta|}\big\}. \tag{2.129}$$
Its time transform and the double transform are also of interest:

$$G_t(z;\varepsilon) = F_{\rm time}\{G_\varepsilon(\delta)\}, \tag{2.130}$$

$$G_{ts}(z;\xi) = F_{\rm time}F_{\rm space}\{G_\varepsilon(\delta)\}. \tag{2.131}$$
Further temporal transformation of (2.127) yields ($\tilde V$ denotes the Fourier transform of $V(\delta)$)

$$F_{\rm time}F_{\rm space}\{A_\varepsilon(\delta)\} = \mu\,G_{ts}(z;\xi)\,F_{\rm time}\{V(\delta)\,\xi^{\delta}\},$$

which, after an inverse spatial transformation, yields

$$F_{\rm time}\{A_\varepsilon(\delta)\} = \mu\sum_{\varepsilon'} G_t(z;\,\varepsilon-\varepsilon')\,V(\varepsilon')\,z^{-\varepsilon'}. \tag{2.132}$$

Inverse time transformation then leads to the desired weight-error correlation:

$$A_\varepsilon(\delta) = \frac{\mu}{2\pi}\int_{-\pi}^{\pi}\sum_{\varepsilon'}G_t(z;\,\varepsilon-\varepsilon')\,V(\varepsilon')\,z^{\delta-\varepsilon'}\,d\Omega
= \mu\sum_{x=-\infty}^{\infty}V(x)\,G_{\varepsilon-x}(\delta-x). \tag{2.133}$$
In passing we note that, due to (2.128), the weighting function $G_\varepsilon(\delta)$ can also be written in the form (the approximation uses $R(\xi) \ll 1$)

$$G_\varepsilon(\delta) = \frac{1}{2\pi}\int_{-\pi}^{\pi}\big[1-R(e^{jk})\big]^{|\delta|}\,e^{j\varepsilon k}\,dk
\approx \frac{1}{2\pi}\int_{-\pi}^{\pi}e^{-|\delta|\,R(e^{jk})}\,e^{j\varepsilon k}\,dk. \tag{2.134}$$
REFERENCES
1. H. J. W. Belt and H. J. Butterweck, Cascaded all-pass sections for LMS adaptive filtering. Proc. European Conference on Signal Processing, Trieste (1996) (ed. G. Ramponi), pp. 1219-1222.
2. N. J. Bershad, Analysis of the normalized LMS algorithm with Gaussian inputs. IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP-34 (1986), pp. 793-806.
3. H. J. Butterweck, An approach to LMS adaptive filtering without use of the independence assumption. Proc. European Conference on Signal Processing, Trieste (1996) (ed. G. Ramponi), pp. 1223-1226.
4. H. J. Butterweck, Iterative analysis of the steady-state weight fluctuations in LMS-type adaptive filters. Eindhoven University of Technology, Report 96-E-299, ISBN 90-6144-299-0, June 1996.
5. H. J. Butterweck, The independence assumption: a dispensable tool in adaptive filter theory. Signal Processing, vol. 57 (1997), pp. 305-310.
6. H. J. Butterweck, A new interpretation of the misadjustment in adaptive filtering. Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Atlanta (1996) (ed. M. H. Hayes), pp. 1641-1643.
7. H. J. Butterweck, Iterative analysis of the steady-state weight fluctuations in LMS-type adaptive filters. IEEE Trans. Signal Processing, vol. 47 (1999), pp. 2558-2561.
8. H. J. Butterweck, A wave theory of long adaptive filters. IEEE Trans. Circuits and Systems I, vol. 48 (2001), pp. 739-747.
9. P. M. Clarkson and P. R. White, Simplified analysis of the LMS adaptive filter using a transfer function approximation. IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP-35 (1987), pp. 987-993.
10. P. M. Clarkson, Optimal and Adaptive Signal Processing. Boca Raton: CRC Press, 1993.
11. S. C. Douglas and W. Pan, Exact expectation analysis of the LMS adaptive filter. IEEE Trans. Signal Processing, vol. 43 (1995), pp. 2863-2871.
12. S. C. Douglas and T. H.-Y. Meng, Exact expectation analysis of the LMS adaptive filter without the independence assumption. Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, San Francisco, vol. IV (Mar. 1992), pp. 61-64.
13. R. Feldtkeller, Vierpoltheorie. Stuttgart: S. Hirzel, 1959.
14. A. Fettweis, Digital filters related to classical filter networks. Arch. Elektr. Uebertr., vol. 25 (1971), pp. 79-89.
15. A. Feuer and E. Weinstein, Convergence analysis of LMS filters with uncorrelated Gaussian data. IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP-33 (1985), pp. 222-230.
16. S. Florian and A. Feuer, Performance analysis of the LMS algorithm with a tapped-delay line (two-dimensional case). IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP-34 (1986), pp. 1542-1549.
17. W. A. Gardner, Learning characteristics of stochastic-gradient-descent algorithms: a general study, analysis, and critique. Signal Processing, vol. 6 (1984), pp. 113-133.
18. R. M. Gray, On the asymptotic eigenvalue distribution of Toeplitz matrices. IEEE Trans. Information Theory, vol. IT-18 (1972), pp. 725-730.
19. L. Guo, L. Ljung, and G. Wang, Necessary and sufficient conditions for stability of LMS. IEEE Trans. Automatic Control, vol. 42 (1997), pp. 761-770.
20. S. Haykin, Adaptive Filter Theory (fourth edition). London: Prentice-Hall, 2001.
21. H. J. Kushner, Approximation and Weak Convergence Methods for Random Processes with Applications to Stochastic System Theory. Cambridge, Mass.: MIT Press, 1984.
22. O. Macchi, Adaptive Processing: The Least-Mean-Square Approach with Applications in Transmission. Chichester, UK: Wiley, 1995.
23. J. E. Mazo, On the independence theory of equalizer convergence. Bell System Tech. J., vol. 58 (1979), pp. 963-993.
24. M. Reuter and J. Zeidler, Non-Wiener effects in LMS-implemented adaptive equalizers. Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Munich, vol. 3 (Apr. 1997), pp. 2509-2512.
25. D. T. M. Slock, On the convergence behavior of the LMS and the normalized LMS algorithms. IEEE Trans. Signal Processing, vol. 41 (1993), pp. 2811-2825.
26. V. Solo, The stability of LMS. IEEE Trans. Signal Processing, vol. 45 (1997), pp. 3017-3026.
27. V. Solo and X. Kong, Adaptive Signal Processing Algorithms. Englewood Cliffs, NJ: Prentice-Hall, 1995.
28. V. Solo, The error variance of LMS with time-varying weights. IEEE Trans. Signal Processing, vol. 40 (1992), pp. 803-813.
29. V. Solo, The limiting behavior of LMS. IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP-37 (1989), pp. 1909-1922.
30. M. Tarrab and A. Feuer, Convergence and performance analysis of the normalized LMS algorithm with uncorrelated Gaussian data. IEEE Trans. Information Theory, vol. 34 (1988), pp. 680-691.
31. B. Widrow and S. D. Stearns, Adaptive Signal Processing. Englewood Cliffs, NJ: Prentice-Hall, 1985.
32. B. Widrow, J. M. McCool, M. G. Larimore, and C. R. Johnson, Jr., Stationary and nonstationary learning characteristics of the LMS adaptive filter. Proc. IEEE, vol. 64 (1976), pp. 1151-1162.
ENERGY CONSERVATION
AND THE LEARNING
ABILITY OF LMS ADAPTIVE
FILTERS
ALI H. SAYED
Electrical Engineering Department, University of California, Los Angeles
VITOR H. NASCIMENTO
Department of Electronic Systems Engineering, University of Sao Paulo, Brazil
3.1 INTRODUCTION
Adaptive filters are prominent examples of systems that are designed to adjust to variations in their environments in order to meet certain performance criteria. The learning curve of an adaptive filter is a widely used tool to evaluate how fast and how well an adaptive filter meets (or learns to meet) its objectives. This learning process has been extensively studied in the literature for slowly adapting systems, that is, for systems that employ infinitesimally small step-sizes. This chapter highlights several
1. This material was based on work supported in part by the National Science Foundation under awards CCR-9732376 and ECS-9820765. The work of V. H. Nascimento was also supported by Grant 2000/09569-6 from FAPESP, Brazil.
Least-Mean-Square Adaptive Filters, Edited by Simon Haykin and Bernard Widrow.
ISBN 0-471-21570-8 (c) 2003 John Wiley & Sons, Inc.
phenomena that characterize the learning capabilities of adaptive filters when larger step-sizes are used. The phenomena actually occur even for slowly adapting systems but are less pronounced, which explains why they may go unnoticed.

The purpose of the chapter is to provide a straightforward exposition of the topic so that it can provide motivation for further study and analysis. For this reason, the discussion focuses in some detail on a special case that helps illustrate and explain the desired phenomena in their simplest forms. Readers interested in more advanced cases, and in additional details, are referred to the article [1] and the textbook [20].

Among other results, it is argued here that after an initial learning phase, an adaptive filter generally learns at a rate that is higher than that predicted by mean-square theory. It is also argued that even single-tap adaptive filters can exhibit two very distinct rates of convergence; they learn at a slower rate initially and at a faster rate later. Several examples are provided to illustrate these and other effects.
3.2

The adaptive filters considered in this chapter update their weight estimates according to a recursion of the general form

$$w(n+1) = w(n) + \mu\,\frac{u(n)}{g(u(n))}\,e(n),\qquad n\ge 0,\tag{3.1}$$

$$e(n) = d(n) - u(n)^T w(n),\tag{3.2}$$

where $g(\cdot)$ denotes a data nonlinearity. The choice $g(u(n))\equiv 1$ results in the least-mean-squares (LMS) algorithm

$$w(n+1) = w(n) + \mu\,u(n)\,e(n),\qquad n\ge 0,\tag{3.3}$$

$$e(n) = d(n) - u(n)^T w(n),$$

while the choice $g(u(n))\equiv \delta + \|u(n)\|^2$ results in the normalized least-mean-squares (NLMS) algorithm

$$w(n+1) = w(n) + \mu\,\frac{u(n)}{\delta + \|u(n)\|^2}\,e(n),\qquad n\ge 0,\tag{3.4}$$

$$e(n) = d(n) - u(n)^T w(n),$$

where $\delta$ is a small positive number and $\|\cdot\|$ denotes the Euclidean norm of its vector argument.
3.3

B. While such assumptions are not valid in most practical cases (e.g., when the regressors u(n) arise from a tapped-delay line implementation, in which case assumption (a) is violated), there is ample evidence in the literature (e.g., [2-7]) to support the premise that conclusions obtained under these conditions are sufficiently realistic for slow adaptation scenarios (i.e., for infinitesimally small step-sizes).
C. Ensemble Averaging. The third method of evaluation is the most practical and also the most widely used. It relies on controlled simulation or experimentation. In this technique, an adaptive filter is trained repeatedly, and the resulting squared-error curves are averaged to approximate the variance curve. More specifically, one performs several independent experiments or simulations, say L of them. In each experiment, the adaptive filter is applied for a duration of N iterations, always starting from the same initial condition and under the same statistical conditions for the sequences {d(n), u(n), v(n)}. From each experiment i, a sample error curve is obtained:

$$\text{sample error curve} = \{e_i(n),\ 0\le n\le N\}.$$

After all L experiments are completed, an approximation for the learning curve is computed by averaging as follows:

$$\text{Ensemble-average curve} \;\triangleq\; \frac{1}{L}\sum_{i=1}^{L}[e_i(n)]^2,\qquad 0\le n\le N. \tag{3.5}$$

This method of evaluation is useful for complex filter updates for which closed-form expressions for learning curves are difficult to obtain even under the independence conditions. The method is also useful even for simple filter structures, e.g., when an analysis by the independence theory is not possible or even reliable due, for example, to faster adaptation (a situation that corresponds to non-infinitesimal step-sizes).
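The averaging procedure (3.5) can be implemented in a few lines. The sketch below is a minimal illustration; the filter length, signal statistics, and the counts M, N, L, and the step-size are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(4)
M, N, L, mu = 5, 200, 200, 0.05
w_opt = rng.standard_normal(M)
curves = np.zeros((L, N))

for i in range(L):                       # L independent experiments
    w = np.zeros(M)                      # same initial condition each time
    for n in range(N):
        u = rng.standard_normal(M)
        e = (u @ w_opt + 0.1 * rng.standard_normal()) - u @ w
        curves[i, n] = e ** 2            # sample error curve [e_i(n)]^2
        w = w + mu * u * e               # LMS update (3.3)

ensemble_avg = curves.mean(axis=0)       # the ensemble-average curve (3.5)
print(ensemble_avg[0], ensemble_avg[-1]) # decays toward the noise floor
```

Increasing L smooths the curve toward the underlying learning curve, which is the behavior discussed in the comparisons that follow.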
3.4
To begin with, it is helpful to introduce a few error measures and to derive a useful energy relation that will be called upon later in the arguments.

A. Error Measures. It is common to associate with every adaptive scheme of the form (3.1)-(3.2) two estimation errors: the so-called a priori and a posteriori errors,

$$e_a(n) \;\triangleq\; u(n)^T\tilde w(n),\qquad e_p(n)\;\triangleq\; u(n)^T\tilde w(n+1),\tag{3.6}$$

where $\tilde w(n) \triangleq w^o - w(n)$ denotes the weight-error vector. With the noise assumption A.1, the error variances are related by

$$E\,e^2(n) = E\,e_a^2(n) + \sigma_v^2,\tag{3.7}$$

so that either of the two curves
can be used to describe the learning behavior of an adaptive filter, since they differ only by a constant factor (equal to $\sigma_v^2$). The noise assumption A.1 stated above will be enforced throughout this chapter, and the discussions will therefore focus on studying the behavior of the curve $E\,e_a^2(n)$.
B. Energy Relation. Subtracting $w^o$ from both sides of (3.1) leads to the weight-error recursion

$$\tilde w(n+1) = \tilde w(n) - \mu\,\frac{u(n)}{g(u(n))}\,e(n). \tag{3.8}$$

Multiplying by $u(n)^T$ from the left, one finds that the errors $\{e_p(n), e_a(n), e(n)\}$ are related via

$$e_p(n) = e_a(n) - \mu\,\frac{\|u(n)\|^2}{g(u(n))}\,e(n). \tag{3.9}$$

Substituting this relation back into (3.8), one obtains, for nonzero u(n), a recursion that relates all four error quantities $\{\tilde w(n+1), \tilde w(n), e_a(n), e_p(n)\}$:

$$\tilde w(n+1) = \tilde w(n) - \frac{u(n)}{\|u(n)\|^2}\,\big[e_a(n) - e_p(n)\big]. \tag{3.10}$$

Observe that the data nonlinearity function g does not appear explicitly in this relation. Evaluating the energies of both sides of this equation leads to the following energy conservation relation:

$$\|\tilde w(n+1)\|^2 + \frac{1}{\|u(n)\|^2}\,e_a^2(n) = \|\tilde w(n)\|^2 + \frac{1}{\|u(n)\|^2}\,e_p^2(n). \tag{3.11}$$
When u(n) = 0, the update (3.1) leaves the weight estimate unchanged, so that

$$\|\tilde w(n+1)\|^2 = \|\tilde w(n)\|^2. \tag{3.12}$$

Both results (3.11) and (3.12) can be grouped together into a single equation by defining

$$\bar\mu(n) = \begin{cases} 0 & \text{if } u(n) = 0,\\[4pt] \dfrac{1}{\|u(n)\|^2} & \text{otherwise,}\end{cases}$$

namely,

$$\|\tilde w(n+1)\|^2 + \bar\mu(n)\,e_a^2(n) = \|\tilde w(n)\|^2 + \bar\mu(n)\,e_p^2(n). \tag{3.13}$$
This energy conservation relation holds for all adaptive filters whose recursions are of the form (3.1)-(3.2), and it was originally developed in [8] in the context of robustness analysis of adaptive filters. No approximations or assumptions are needed to establish (3.13); it is an exact relation that shows how the energies of the weight-error vectors at two successive time instants are related to the energies of the a priori and a posteriori estimation errors. Thus the energy relation provides a convenient and powerful framework for carrying out different kinds of performance analysis, both stochastic and deterministic, for a wide range of adaptive filters (see, e.g., [8-14] and the textbook [20]). It will be used in the sequel to shed some light on the learning behavior of adaptive filters.
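Because the energy relation is an algebraic identity, it can be checked to machine precision along any single realization. The sketch below does this for LMS (g(u) = 1), with $\tilde w(n) = w^o - w(n)$; all signal parameters are invented for the check.

```python
import numpy as np

rng = np.random.default_rng(5)
M, N, mu = 4, 500, 0.05
w_opt = rng.standard_normal(M)
w = np.zeros(M)

for n in range(N):
    u = rng.standard_normal(M)
    d = u @ w_opt + 0.1 * rng.standard_normal()
    e = d - u @ w                          # output error (3.2)
    ea = u @ (w_opt - w)                   # a priori error e_a(n)
    w_next = w + mu * u * e                # LMS update, g(u(n)) = 1
    ep = u @ (w_opt - w_next)              # a posteriori error e_p(n)
    mu_bar = 1.0 / (u @ u)                 # u(n) != 0 almost surely here
    lhs = np.sum((w_opt - w_next) ** 2) + mu_bar * ea ** 2
    rhs = np.sum((w_opt - w) ** 2) + mu_bar * ep ** 2
    assert abs(lhs - rhs) < 1e-10          # the energy relation, exact per iteration
    w = w_next

print("energy relation verified on", N, "iterations")
```

No statistical averaging is involved: the two sides agree at every single step, for any step-size and any data, which is exactly what makes the relation a useful analysis tool.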
3.5 TRANSIENT ANALYSIS

A recursion for the transient behavior of an adaptive filter can be deduced from the energy conservation relation (3.13). For this purpose, one reworks (3.13) so as to express it in an equivalent form that eliminates $e_p(n)$ and keeps only the a priori error $e_a(n)$, whose variance, as indicated earlier, is of interest to the learning behavior of an adaptive filter. Using (3.6) in (3.9) leads to

$$e_p(n) = \left(1-\mu\,\frac{\|u(n)\|^2}{g(u(n))}\right)e_a(n) - \mu\,\frac{\|u(n)\|^2}{g(u(n))}\,v(n). \tag{3.14}$$

Substituting this equality into the energy relation (3.13) and expanding terms, one finds, after some straightforward algebra, the equivalent representation

$$\|\tilde w(n+1)\|^2 = \|\tilde w(n)\|^2
- \left(\frac{2\mu}{g(u(n))} - \frac{\mu^2\|u(n)\|^2}{g^2(u(n))}\right)e_a^2(n)
+ \frac{\mu^2\|u(n)\|^2}{g^2(u(n))}\,v^2(n)
- \frac{2\mu}{g(u(n))}\left(1-\frac{\mu\|u(n)\|^2}{g(u(n))}\right)e_a(n)\,v(n), \tag{3.15}$$

where, for a weighting matrix P, the weighted squared-norm notation

$$\|x\|_P^2 \;\triangleq\; x^T P\,x \tag{3.16}$$

will be used below.
3.6

Consider now the case in which the regressors u(n) are zero-mean, independent and identically distributed Gaussian vectors with covariance matrix

$$R \;\triangleq\; E\,u(n)\,u(n)^T.$$

Assume further that the reference sequence d(n) is independent of {u(m), m != n}. These conditions correspond to a situation in which the independence assumptions are satisfied and, in addition, the learning curve of the adaptive filter can be evaluated in closed form.

Indeed, it follows from the independence assumptions that u(n) is independent of $\tilde w(n)$ and that v(n) and $e_a(n)$ are also independent. Taking expectations of both sides of (3.15) with $g(u(n)) \equiv 1$, and using the independence of v(n) and $e_a(n)$, leads to the recursion
$$E\|\tilde w(n+1)\|^2 = E\|\tilde w(n)\|_A^2 + \mu^2\sigma_v^2\,{\rm Tr}(R), \tag{3.17}$$

with

$$A = I - \mu\big(2-\mu\|u(n)\|^2\big)\,u(n)\,u(n)^T.$$

Observe that the weighting matrix A is a random variable since it depends on u(n). However, the independence of u(n) and $\tilde w(n)$ permits the replacement of A by a constant matrix (namely, by its mean value). To see this, note that

$$E\|\tilde w(n)\|_A^2 = E\,\tilde w(n)^T A\,\tilde w(n)
= E\big[E\{\tilde w(n)^T A\,\tilde w(n)\,|\,\tilde w(n)\}\big]
= E\,\tilde w(n)^T\,E\{A\,|\,\tilde w(n)\}\,\tilde w(n)
= E\|\tilde w(n)\|_F^2,$$

where

$$F \;\triangleq\; E A = I - 2\mu R + \mu^2\big(2R^2 + {\rm Tr}(R)\,R\big),$$

and the above value for F follows from the fact that for real-valued Gaussian regressors it holds that

$$E\,\|u(n)\|^2\,u(n)\,u(n)^T = 2R^2 + {\rm Tr}(R)\,R.$$

In this case, recursion (3.17) is seen to be equivalent to

$$E\|\tilde w(n+1)\|^2 = E\|\tilde w(n)\|_F^2 + \mu^2\sigma_v^2\,{\rm Tr}(R), \tag{3.18}$$

with A replaced by F.
Now consider the choice $R = \sigma_u^2 I$. In this case, ${\rm Tr}(R) = M\sigma_u^2$ and F becomes a constant multiple of the identity,

$$F = \big[1 - 2\mu\sigma_u^2 + \mu^2(M+2)\sigma_u^4\big]\,I,$$

so that the weight-error variance relation (3.18) can be rewritten more directly as

$$E\|\tilde w(n+1)\|^2 = \big[1-2\mu\sigma_u^2+\mu^2(M+2)\sigma_u^4\big]\,E\|\tilde w(n)\|^2 + \mu^2 M\sigma_u^2\sigma_v^2.$$

Now using (3.7) and the fact that

$$E\,e_a^2(n) = E|u(n)^T\tilde w(n)|^2 = E\,\tilde w(n)^T R\,\tilde w(n) = \sigma_u^2\,E\|\tilde w(n)\|^2,$$

one finds that the learning curve for this example is described in closed form by the recursion

$$E\,e^2(n+1) = \big[1-2\mu\sigma_u^2+\mu^2(M+2)\sigma_u^4\big]\,E\,e^2(n) + 2\mu\sigma_u^2(1-\mu\sigma_u^2)\,\sigma_v^2,$$

with initial condition $E\,e^2(0) = E\,d^2(0) \triangleq \sigma_d^2$.

It is clear that in this example the learning curve has a single mode and that it will be decaying (i.e., convergent) if, and only if, the step-size $\mu$ is chosen to satisfy

$$\big|1-2\mu\sigma_u^2+\mu^2(M+2)\sigma_u^4\big| < 1. \tag{3.19}$$

Observe that the value of the mode is positive since

$$1-2\mu\sigma_u^2+\mu^2(M+2)\sigma_u^4 = (1-\mu\sigma_u^2)^2 + \mu^2(M+1)\sigma_u^4 > 0.$$
Condition (3.19) is satisfied for step-sizes in the range

$$0 < \mu < \frac{2}{\sigma_u^2(M+2)}.$$

When this is the case, the filter is said to be mean-square stable. In addition, the fastest convergence rate occurs at the value of $\mu$ that minimizes the magnitude of the corresponding mode, which happens to be

$$\mu^o = \frac{1}{\sigma_u^2(M+2)}.$$

Figure 3.1 shows a plot of the learning curve for the numerical values $\mu = 0.1429$, M = 5, $\sigma_u^2 = 1$, and $\sigma_v^2 = 0.01$. For these numerical values, the filter is mean-square stable for step-sizes satisfying

$$0 < \mu < \frac{2}{7} \approx 0.2857.$$
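The closed-form recursion can be tabulated directly. The sketch below iterates the single mode with the numerical values used here (M = 5, $\sigma_u^2 = 1$, $\sigma_v^2 = 0.01$, $\mu = 0.1429$); the initial value assumes $\|w^o\|^2 = 1$, which is an invented choice.

```python
M, su2, sv2, mu = 5, 1.0, 0.01, 0.1429
lam = 1 - 2 * mu * su2 + mu ** 2 * (M + 2) * su2 ** 2   # the single mode of the learning curve
assert abs(lam) < 1                                     # mean-square stable: mu < 2/7

drive = 2 * mu * su2 * (1 - mu * su2) * sv2             # constant driving term
Ee2 = 1.0 + sv2                                         # E e^2(0) = sigma_d^2 (with ||w_o||^2 = 1)
curve = [Ee2]
for n in range(500):
    Ee2 = lam * Ee2 + drive
    curve.append(Ee2)

steady = drive / (1 - lam)   # fixed point: the theoretical steady-state MSE
print(lam, steady)
```

Plotting `curve` on a logarithmic scale reproduces the straight-line decay of the theoretical learning curve in Figure 3.1, against which the ensemble-average curves of the following figures are compared.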
Figure 3.1 Theoretical learning curve for the LMS algorithm with Gaussian iid regressors, M = 5, $\sigma_u^2 = 1$, $\sigma_v^2 = 0.01$, and $\mu = 0.1429$.

Figure 3.2 Four sample squared-error curves for LMS with Gaussian iid regressors, M = 5, $\sigma_u^2 = 1$, $\sigma_v^2 = 0.01$, and $\mu = 0.1429$.
Figure 3.3 Theoretical and ensemble-average learning curves for LMS with Gaussian iid regressors, M = 5, $\sigma_u^2 = 1$, $\sigma_v^2 = 0.01$, and $\mu = 0.1429$.
Figure 3.4 Theoretical and ensemble-average learning curves for LMS with Gaussian iid regressors, M = 5, $\sigma_u^2 = 1$, $\sigma_v^2 = 0.01$, and $\mu = 0.275$.
5. There is even a difference in behavior between the two ensemble-average curves themselves: the higher the number of averaging experiments, the closer the resulting ensemble-average curve is to the theoretical curve. The analysis in a later section will reveal that even if the number of experiments is increased significantly, there will continue to exist a discrepancy between the theoretical curve and the experimental curve.
It should be mentioned that although the earlier discussion was restricted to
an example with independence assumptions on the data, these assumptions
have actually been enforced in the analysis and in the simulations and are
therefore valid. Thus the differences in behavior that one sees between the
theoretical learning curve and the experimental ones are not due to
assumptions that are made on the theoretical level and that are not valid on
the practical level. In this way, one can conclude that even under these
controlled conditions, the differences still exist. Actually, the differences
occur even for situations where the independence assumptions are not
satised (see [1]).
D. Example 4. There is one more phenomenon to highlight before moving on to a justification of the results observed so far. Thus consider again the numerical values used in Example 1, viz., $M = 5$, $\sigma_u^2 = 1$, and $\sigma_v^2 = 0.01$. For these values, the filter was seen to be mean-square stable for step-sizes satisfying $\mu < 0.2857$. The diverging graph in Figure 3.5 confirms this fact. However, the figure also shows a plot of the ensemble-average curve that is obtained for a larger step-size, $\mu = 0.29$, by averaging over 500 experiments. Mean-square theory predicts instability for this value of $\mu$, while the ensemble-average curve does not seem to diverge. Averaging over a larger number of experiments reveals a similar behavior. An explanation for this behavior is provided later by showing that, for larger step-sizes, there is a noticeable distinction between the mean-square and the almost-sure convergence behaviors of an adaptive filter.
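This phenomenon is easy to reproduce in simulation. The sketch below (illustrative code, not from the chapter; the run length, number of experiments, and random seed are arbitrary choices) runs LMS with Gaussian iid regressors at the mean-square-unstable step-size $\mu = 0.29$: the theoretical mode exceeds one, so $E e^2(n)$ diverges, yet the ensemble-average curve typically remains bounded.

```python
import numpy as np

rng = np.random.default_rng(0)
M, su2, sv2, mu = 5, 1.0, 0.01, 0.29
N, L = 800, 100   # time steps and number of experiments (arbitrary)

# Theoretical mode of the mean-square recursion: exceeds 1 for mu = 0.29 > 2/7
mode = 1 - 2 * mu * su2 + mu**2 * (M + 2) * su2**2

avg = np.zeros(N)   # ensemble-average squared-error curve
for _ in range(L):
    wo = rng.standard_normal(M)     # unknown weight vector (assumed model)
    w = np.zeros(M)
    for n in range(N):
        u = rng.standard_normal(M)                        # regressor, variance 1
        d = u @ wo + np.sqrt(sv2) * rng.standard_normal()  # desired response
        e = d - u @ w
        w += mu * u * e                                    # LMS update
        avg[n] += e**2 / L

print(mode > 1)                 # mean-square theory predicts divergence
print(np.isfinite(avg).all())   # yet the simulated curves stay bounded
```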
E. Example 5. The earlier examples were concerned with data that satisfy the independence assumptions. Now consider a tapped-delay-line implementation with two taps, so that the regression vector at time $n$ has the form
$$ u^T(n) = [\,u(n)\ \ u(n-1)\,]. $$
Observe that, due to the shift structure, two successive regressors cannot be independent and that, therefore, this is a situation where the independence assumptions are not valid. Assume further that the entries $\{u(n)\}$ are iid and uniform in the interval $[-0.5, 0.5]$, so that
$$ R = E\,u(n)u^T(n) = \sigma_u^2 I, \qquad \sigma_u^2 = \tfrac{1}{12}. $$
Figure 3.6 shows the ensemble-average curves that are obtained by averaging over $L = 100$ and $L = 1000$ experiments for $\mu = 7.9$. It is seen that in both cases the averaged curves converge, as opposed to the theoretical curve, which is divergent for this value of the step-size (see [1]). Observe in addition that the larger the value of $L$, the longer the averaged curve stays close to the theoretical curve before ultimately departing from it.

Figure 3.5 A comparison of the theoretical learning curve and the ensemble-average learning curve for a case that is mean-square unstable. The theoretical curve is seen to diverge in both cases, while the experimental curves converge. The plot on the left assumes zero noise, while the plot on the right uses $\sigma_v^2 = 10^{-4}$. The ensemble-average curves were obtained by averaging over 500 experiments, with step-size $\mu = 0.29$.

Figure 3.6 A comparison of the theoretical learning curve and the ensemble-average learning curves for an unstable tapped-delay-line implementation with uniform input. The ensemble-average curves were obtained by averaging over 100 and 10000 experiments with step-size $\mu = 7.9$.
3.7 MEAN-SQUARE CONVERGENCE

The examples in the previous section indicate that the behavior of the ensemble-average curves may show significant differences in relation to the behavior of the theoretical learning curve. An explanation for the origin of these differences is pursued in the following sections, which focus in some detail on the case of a single-tap adaptive filter.
Thus assume that $M = 1$, in which case $w(n)$ and $u(n)$ become scalars. Assume further that the noise signal $v(n)$ is negligible, so that its effect can be ignored. In this case, the energy recursion (3.15) collapses to
$$ \tilde e^2(n+1) = A(n)\,\tilde e^2(n), \qquad (3.20) $$
where $A(n)$ is the scalar random variable
$$ A(n) \triangleq 1 - \frac{2\mu u^2(n)}{g[u(n)]} + \frac{\mu^2 u^4(n)}{g^2[u(n)]} = \left(1 - \frac{\mu u^2(n)}{g[u(n)]}\right)^2. \qquad (3.21) $$
In other words, the dynamics of the mean-square behavior of the filter are determined by $E A(n)$, which is the mode of the above first-order recursion. Moreover, since the output error is given by
$$ e(n) = d(n) - u(n)w(n) = u(n)\tilde e(n), $$
we find that
$$ E e^2(n) = \sigma_u^2\, E \tilde e^2(n), $$
so that studying the evolution of $E\tilde e^2(n)$ is equivalent to studying the learning curve of the filter. Hence, the analysis in the sequel focuses on the behavior of $\tilde e^2(n)$.

Now it is clear from (3.21) that the filter will be mean-square stable if, and only if, the step-size $\mu$ is chosen such that
$$ E A < 1 \iff E\left(1 - \frac{\mu u^2}{g[u]}\right)^2 < 1. $$
Observe that, since all variables are stationary by assumption, the time index $n$ is dropped for compactness of notation, with $\{A, u\}$ written instead of $\{A(n), u(n)\}$. The expectation of $A$ is fully characterized in terms of the second and fourth moments of the normalized random variable
$$ \bar u \triangleq \frac{u}{\sqrt{g[u]}}. $$
Indeed, let
$$ \sigma_{\bar u}^2 \triangleq E \bar u^2, \qquad \rho_{\bar u}^4 \triangleq E \bar u^4. $$
Then $E A = 1 - 2\mu\sigma_{\bar u}^2 + \mu^2\rho_{\bar u}^4$, and the condition $E A < 1$ is equivalent to
$$ 0 < \mu < \frac{2\sigma_{\bar u}^2}{\rho_{\bar u}^4}. $$
(When $g[u] = 1$, these moments reduce to $\sigma_{\bar u}^2 = \sigma_u^2$ and $\rho_{\bar u}^4 = \rho_u^4$.) For ease of comparison with a later condition (see (3.27)), it is convenient to rewrite the requirement $E A < 1$ in the equivalent form (in terms of the natural logarithm) $\ln E A < 0$, where
$$ A \triangleq \left(1 - \frac{\mu u^2}{g[u]}\right)^2. \qquad (3.22) $$
3.8 ALMOST-SURE CONVERGENCE

In order to account for the differences between the theoretical and the experimental learning curves, this section now examines the behavior of a single (or typical) squared-error curve.

Starting from (3.20) and iterating it from time 0 up to time $n$, one arrives at the expression
$$ \tilde e^2(n) = A(n-1)A(n-2)\cdots A(0)\,\tilde e^2(0), \qquad (3.23) $$
or, equivalently, upon taking logarithms,
$$ \ln \tilde e^2(n) = \ln \tilde e^2(0) + \sum_{m=0}^{n-1} \ln A(m). \qquad (3.24) $$
Assuming that the variance of the random variable $\ln A$ is bounded, one can invoke the strong law of large numbers to conclude that, as $n \to \infty$,
$$ \frac{\ln \tilde e^2(n)}{n} \xrightarrow{\text{a.s.}} E \ln A, \qquad (3.25) $$
where a.s. denotes almost-sure convergence. In other words, for large enough $n$, the curve $\ln \tilde e^2(n)/n$ converges almost surely to the constant value $E \ln A$. But what about the sample curve $\tilde e^2(n)$ itself? The answer also follows from the strong law of large numbers, which guarantees that, with probability 1, for each experiment $\Omega$ there exists a finite integer $K(\Omega)$ (dependent on the experiment) such that, for all $n \geq K(\Omega)$, the sample curve $\tilde e^2(n)$ will be upper bounded by the curve
$$ \bar e^2(n) \triangleq \tilde e^2(0)\, \exp\big(n\, E \ln A\big)\, \exp\Big(\sqrt{2n \ln\ln n}\; \sigma_{\ln A}\Big), \qquad (3.26) $$
where $\sigma_{\ln A}$ denotes the standard deviation of $\ln A$ and, as before,
$$ A \triangleq \left(1 - \frac{\mu u^2}{g[u]}\right)^2, \qquad (3.27) $$
where $u$ again is an iid random variable. The bounding curve (3.26) decays to zero when $E \ln A < 0$, so the sample curves converge almost surely under this requirement. This leads to a different condition on $\mu$ than the one derived in (3.22) for mean-square stability.
3.9

Comparing the conditions (3.22) and (3.27) for mean-square and almost-sure convergence, one sees that there is a clear distinction between them. The two conditions are not equivalent; in fact, one always implies the other, since for any nonnegative random variable $A$ for which $E A$ and $E \ln A$ both exist, it holds (by Jensen's inequality) that
$$ E \ln A \leq \ln E A. $$
Therefore, values of the step-size $\mu$ for which mean-square convergence occurs always guarantee almost-sure convergence, while the converse is not true: a value for which $\ln E A > 0$ (and thus for which mean-square divergence occurs) can still satisfy $E \ln A < 0$ and hence guarantee almost-sure convergence, which explains the phenomenon in Figure 3.5.
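This gap between the two conditions can be checked numerically. The sketch below (illustrative code; it assumes the single-tap setting with $g[u] = 1$ and $u(n)$ uniform on $[-0.5, 0.5]$, as in the examples of this chapter) evaluates $\ln E A$ and $E \ln A$ by quadrature. Mean-square stability requires $\mu < 2\sigma_u^2/\rho_u^4 = 40/3 \approx 13.3$, yet at $\mu = 14$ one finds $\ln E A > 0$ (mean-square divergence) while $E \ln A < 0$ (almost-sure convergence).

```python
import numpy as np

# Midpoint quadrature over the density of u ~ Uniform[-0.5, 0.5]; A = (1 - mu*u^2)^2
N = 2_000_000
u = (np.arange(N) + 0.5) / N - 0.5

def ln_EA(mu):
    return np.log(np.mean((1.0 - mu * u**2) ** 2))

def E_lnA(mu):
    return np.mean(np.log((1.0 - mu * u**2) ** 2))

mu = 14.0   # just beyond the mean-square stability bound 40/3
print(ln_EA(mu) > 0 and E_lnA(mu) < 0)                       # m.s. unstable, a.s. convergent
print(all(E_lnA(m) <= ln_EA(m) for m in (0.5, 2.0, 14.0)))   # Jensen's inequality
```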
However, these distinctions disappear for infinitesimally small step-sizes, which explains why the phenomena described before can pass unnoticed at this level of adaptation. This is a consequence of the fact that, under some reasonable assumptions about the probability density function of the random variable $u$, it holds that (see, e.g., [1])
$$ E \ln A = \ln E A + o(\mu). \qquad (3.28) $$
Figure 3.7 plots $E \ln A$ and $\ln E A$ as functions of $\mu$ for the Gaussian case, for which $\sigma_{\bar u}^2 = 1$ and $\rho_{\bar u}^4 = 3$. Note that both plots are close together for small $\mu$, but that they become significantly different as $\mu$ increases. Observe also that $E \ln A$ is negative well beyond the point where $\ln E A$ becomes positive. This implies that there is a range of step-sizes for which a typical curve $\tilde e^2(n)$ converges to zero with probability 1, but $E \tilde e^2(n)$ diverges. This explains the simulations in Figures 3.5 and 3.6. This is not a paradox: since the convergence is not uniform, there is a small (but nonzero) probability that a sample curve $\tilde e^2(n)$ will assume very large values for a long interval of time before converging to zero. Finally, note that the value of $\mu$ that achieves the fastest mean-square convergence is noticeably smaller than the step-size that achieves the fastest almost-sure convergence.
The above results can thus be used to understand the differences between theoretical and simulated learning curves for large $n$ and for larger step-sizes. In other words, the almost-sure analysis clarifies what happens when $L$ (the number of experiments) is fixed and $n$ (the time dimension) is increased: the ensemble-average curve tends to separate from the true average curve for increasing $n$, due to the difference in the convergence rates.
3.10 VARIANCE ANALYSIS

While the almost-sure analysis provides an explanation for the behavior of the ensemble-average curves for large $n$, one observes from the curves of Figure 3.4 that for small $n$, i.e., close to the beginning of the curves, there is usually good agreement between the learning curve and the ensemble-average curves. This initial behavior can be explained by resorting to a variance analysis, which focuses on evaluating the variance of the sample curves.

Figure 3.7 $E \ln A$ and $\ln E A$ as functions of the step-size $\mu$.

Recall Chebyshev's inequality: for any random variable $z$ with variance $\sigma_z^2$ and any $k > 0$,
$$ \mathrm{Prob}\{|z - Ez| \geq k\} \leq \frac{\sigma_z^2}{k^2}. $$
Define now the relative standard deviation
$$ \gamma(n) \triangleq \frac{\sqrt{\mathrm{var}\,\tilde e^2(n)}}{E \tilde e^2(n)}, \qquad (3.29) $$
which is time dependent in general. It then follows from the above Chebyshev inequality (with $k = \tfrac{1}{2}E\tilde e^2(n)$) that
$$ \mathrm{Prob}\Big\{\big|\tilde e^2(n) - E\tilde e^2(n)\big| \geq \tfrac{1}{2}E\tilde e^2(n)\Big\} \leq 4\gamma^2(n). $$
For example, the bound evaluates to 0.01 for $\gamma(n) = 0.05$. This means that there is a 99 percent probability that $\tilde e^2(n)$ will be close to its mean (and, more specifically, lie within the interval $[0.5\,E\tilde e^2(n),\ 1.5\,E\tilde e^2(n)]$). Therefore, the smaller the value of $\gamma(n)$, the closer one expects the sample curve $\tilde e^2(n)$ to be to the theoretical learning curve at that time instant.
Now recall that the ensemble-average learning curve is constructed by averaging together several sample curves $\tilde e^2(n)$ to obtain, say,
$$ \hat D(n) \triangleq \frac{1}{L} \sum_{i=1}^{L} \big[\tilde e^2(n)\big]_i. $$
Assuming that the $L$ experiments are independent, the expected value of the averaged curve $\hat D(n)$ is still equal to $E \tilde e^2(n)$. However, the ratio $\gamma(n)$ that is associated with $\hat D(n)$ will be smaller and given by
$$ \gamma'(n) \triangleq \frac{\sqrt{\mathrm{var}\,\hat D(n)}}{E \tilde e^2(n)} = \frac{\gamma(n)}{\sqrt{L}}. $$
That is, the process of constructing ensemble-average curves reduces the value of $\gamma(n)$ by a factor of $\sqrt{L}$.
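This $\sqrt{L}$ reduction is the usual variance-of-the-mean effect and is easily checked. The sketch below (illustrative code; a skewed chi-squared variable stands in for a sample squared-error value, and the sample sizes are arbitrary) compares the spread of individual samples with the spread of $L$-point averages.

```python
import numpy as np

rng = np.random.default_rng(1)
L, trials = 100, 20_000

# A skewed positive variable standing in for a sample squared-error value
x = rng.standard_normal((trials, L)) ** 2    # chi-squared(1) samples

means = x.mean(axis=1)                       # one L-experiment average per trial
ratio = means.std() / x.std()                # expected to be about 1/sqrt(L)

print(abs(ratio - 1 / np.sqrt(L)) < 0.01)    # close to 1/10 for L = 100
```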
Although a small $\gamma(n)$ is desirable in order to conclude that $\tilde e^2(n)$ or $\hat D(n)$ is close to $E \tilde e^2(n)$, it turns out that $\gamma(n)$ increases with $n$ (and thus $\hat D(n)$ approximates $E \tilde e^2(n)$ less effectively for larger $n$, which is consistent with the results of the almost-sure analysis). To see this, define again the moments (assumed finite)
$$ \sigma_{\bar u}^2 \triangleq E \bar u^2, \qquad \rho_{\bar u}^4 \triangleq E \bar u^4, \qquad \xi_{\bar u}^6 \triangleq E \bar u^6, \qquad \eta_{\bar u}^8 \triangleq E \bar u^8, $$
where $\bar u$ denotes the normalized variable $u/\sqrt{g[u]}$. Then
$$ E \tilde e^4(n) = \big(E A^2\big)^n \tilde e^4(0) = \big(1 - 4\mu\sigma_{\bar u}^2 + 6\mu^2\rho_{\bar u}^4 - 4\mu^3\xi_{\bar u}^6 + \mu^4\eta_{\bar u}^8\big)^n\, \tilde e^4(0). \qquad (3.30) $$
Define further the coefficients
$$ r_4 \triangleq E A^2 = 1 - 4\mu\sigma_{\bar u}^2 + 6\mu^2\rho_{\bar u}^4 - 4\mu^3\xi_{\bar u}^6 + \mu^4\eta_{\bar u}^8, \qquad r_2 \triangleq E A = 1 - 2\mu\sigma_{\bar u}^2 + \mu^2\rho_{\bar u}^4. $$
It holds that $r_4 \geq r_2^2$ (with equality only if $u^2(n)$ is a constant with probability 1). With these definitions, $\gamma(n)$ is given by
$$ \gamma(n) = \frac{\sqrt{r_4^n - r_2^{2n}}}{r_2^n} = \sqrt{\left(\frac{r_4}{r_2^2}\right)^n - 1}. \qquad (3.31) $$
Therefore, except for the trivial case of a constant $u^2(n)$, $\gamma(n)$ is strictly increasing, and thus
$$ \lim_{n\to\infty} \gamma(n) = \infty. $$
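For the uniform-input example the coefficients $r_2$ and $r_4$ are simple polynomials in $\mu$, and (3.31) then exhibits the growth of $\gamma(n)$ explicitly. The sketch below (illustrative code; $g[u] = 1$ and $u$ uniform on $[-0.5, 0.5]$ are assumptions carried over from the earlier examples) evaluates $\gamma(n)$ for $\mu = 0.1$.

```python
import numpy as np

# Even moments of u ~ Uniform[-0.5, 0.5] (g[u] = 1): E u^(2k) = (0.5)^(2k)/(2k+1)
s2, r4m, x6, e8 = 1/12, 1/80, 1/448, 1/2304
mu = 0.1

r2 = 1 - 2*mu*s2 + mu**2 * r4m                            # r2 = E A
r4 = 1 - 4*mu*s2 + 6*mu**2*r4m - 4*mu**3*x6 + mu**4*e8    # r4 = E A^2

n = np.arange(1, 5001)
gamma = np.sqrt((r4 / r2**2) ** n - 1)    # relative standard deviation (3.31)

print(r4 > r2**2)                          # strict, since u^2 is not constant
print(bool(np.all(np.diff(gamma) > 0)))    # gamma(n) strictly increasing
```

Even though $\gamma(1)$ is tiny here, the exponential factor $(r_4/r_2^2)^n$ eventually dominates, so $\gamma(n)$ grows without bound.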
3.11

The variance analysis in the previous section shows that, in the initial adaptation steps, $\tilde e^2(n)$ tends to stay close to its mean, $E \tilde e^2(n)$, since the variance of $\tilde e^2(n)$ is small. As time progresses, the variance grows, and one expects $\tilde e^2(n)$ to wander farther and farther away from its mean. In principle, this could mean that $\tilde e^2(n)$ will assume large and small values equally often. However, this is usually not the case. As time increases, $\tilde e^2(n)$ assumes small values more often than large values, and its probability density function becomes more and more asymmetric.

To explain this behavior, return to (3.23) and rewrite it as
$$ \ln \frac{\tilde e^2(n)}{\tilde e^2(0)} = \sum_{m=0}^{n-1} \ln A(m). \qquad (3.32) $$
Define also
$$ v \triangleq E \ln A(m), \qquad s^2 \triangleq E\big(\ln A(m) - v\big)^2, $$
which are constants since $\{u(n)\}$ is assumed stationary. Assuming that both $v$ and $s^2$ are finite, one can use the Central Limit Theorem [19] to conclude that, as $n \to \infty$,
$$ \frac{1}{s\sqrt{n}} \left(\ln \frac{\tilde e^2(n)}{\tilde e^2(0)} - nv\right) \to \mathcal{N}(0, 1), $$
that is, the quantity on the left-hand side tends to a normal distribution with zero mean and unit variance. It then follows that, as $n$ increases, the distribution of $\tilde e^2(n)$ can be well approximated by the following probability density function:
$$ p_{\tilde e^2}(x) = \frac{1}{x s \sqrt{2\pi n}} \exp\left(-\frac{\big(\ln(x/\tilde e^2(0)) - nv\big)^2}{2ns^2}\right), \qquad x > 0. $$
Figure 3.8 shows $v$ and $s^2$ for $u(n)$ uniformly distributed between $-0.5$ and 0.5. Note the behavior similar to that seen in Figure 3.7, where $u(n)$ is Gaussian. The next figures show $p_{\tilde e^2}(x)$ for several situations (in all cases, the vertical bar indicates the position of $E \ln(\tilde e^2(n)/\tilde e^2(0))$).
The plots in Figure 3.9 show the probability density function (pdf) for $\mu = 0.1$, $n = 10$ (left plot) and $n = 500$ (right plot). In this case, one has from Figure 3.8 that $v = -1.679 \times 10^{-2}$, $\ln E A = -1.668 \times 10^{-2}$, and $s^2 = 2.271 \times 10^{-4}$. Since $v \approx \ln E A$ and $s^2$ is small, one expects the learning curve to approximate well the

Figure 3.9 Left: graph of $p_{\tilde e^2}(x)$ for $u(n)$ uniformly distributed between $-0.5$ and 0.5, $\mu = 0.1$, $n = 10$. Right: graph of $p_{\tilde e^2}(x)$ for $u(n)$ uniformly distributed between $-0.5$ and 0.5, $\mu = 0.1$, $n = 500$.

behavior of a single run of the filter. This expectation is confirmed by the pdfs of $\tilde e^2(10)$ and $\tilde e^2(500)$, which show that $\tilde e^2(n)$ tends to stay close to its mean.

On the other hand, one can see from Figure 3.10 that for $\mu = 2.0$ the behavior is quite different (now one has $v = -0.4005$, $\ln E A = -0.3331$, and $s^2 = 0.1521$; that is, $v$ differs significantly from $\ln E A$, and the variance is large). Even for $n = 10$ the pdf of $\tilde e^2(n)$ is already quite asymmetric, a characteristic that becomes more pronounced as $n$ increases. In this situation, $\tilde e^2(n)$ is much more likely to be smaller, rather than larger, than its average.
Figure 3.10 Left: graph of $p_{\tilde e^2}(x)$ for $u(n)$ uniformly distributed between $-0.5$ and 0.5, $\mu = 2.0$, $n = 10$. Right: graph of $p_{\tilde e^2}(x)$ for $u(n)$ uniformly distributed between $-0.5$ and 0.5, $\mu = 2.0$, $n = 20$.
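The numerical values quoted above for $v$, $s^2$, and $\ln E A$ can be reproduced by direct quadrature, and the asymptotic density can then be evaluated. The sketch below is illustrative code (the uniform single-tap setting with $g[u] = 1$ is assumed); the helper `pdf` is a hypothetical name for the log-normal-type approximation given earlier.

```python
import numpy as np

# u ~ Uniform[-0.5, 0.5], g[u] = 1, A = (1 - mu*u^2)^2; midpoint quadrature grid
Nq = 2_000_000
u = (np.arange(Nq) + 0.5) / Nq - 0.5

def moments(mu):
    lnA = np.log((1.0 - mu * u**2) ** 2)
    EA = np.mean((1.0 - mu * u**2) ** 2)
    return lnA.mean(), lnA.var(), np.log(EA)   # v, s^2, ln(EA)

v1, s21, lnEA1 = moments(0.1)   # ~ (-1.679e-2, 2.271e-4, -1.668e-2)
v2, s22, lnEA2 = moments(2.0)   # ~ (-0.4005, 0.152, -0.3331)

def pdf(x, n, v, s2, e0=1.0):
    """Asymptotic log-normal-type density of the sample curve at time n."""
    return np.exp(-(np.log(x / e0) - n * v) ** 2 / (2 * n * s2)) / (x * np.sqrt(2 * np.pi * n * s2))

print(abs(v2 + 0.4005) < 1e-3, abs(lnEA2 + 0.3331) < 1e-3)
```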
3.12

The above discussion can be used to compare the values of $E \tilde e^2(n)$ and $\hat D(n)$ when $n$ is fixed but $L$ is allowed to vary. Indeed, it follows from the expression for $\gamma'(n)$ that the larger the value of $L$, the smaller the value of $\gamma'(n)$ will be. Hence, the more experiments one averages, the closer the value of $\hat D(n)$ will be to $E \tilde e^2(n)$.

Another conclusion that follows from the almost-sure and variance analyses is that an adaptive filter recursion exhibits two different rates of convergence (even for single-tap adaptive filters). At first, for small $n$, a sample curve $\tilde e^2(n)$ is close to $E \tilde e^2(n)$ and therefore converges at a rate that is determined by $\ln E A$. For larger $n$, the sample curve $\tilde e^2(n)$ will converge at a rate that is determined by $E \ln A$.

A final remark: the knowledge that an adaptive filter is almost-surely convergent does not necessarily guarantee satisfactory performance! Assume that a filter is almost-surely stable but mean-square unstable. It follows from the earlier analysis that a sample error curve will tend to diverge in the first iterations (by following the divergent mean-square learning curve), and only after an unknown interval of time will the learning curve start to converge.
3.13
CONCLUDING REMARKS
REFERENCES
1. V. H. Nascimento and A. H. Sayed, "On the learning mechanism of adaptive filters," IEEE Trans. Signal Process., Vol. 48, No. 6, p. 1609, 2000.
2. J. E. Mazo, "On the independence theory of equalizer convergence," The Bell System Technical Journal, Vol. 58, p. 963, 1979.
3. O. Macchi and E. Eweda, "Second-order convergence analysis of stochastic adaptive linear filtering," IEEE Trans. Automatic Control, Vol. 28, No. 1, p. 76, 1983.
4. A. Feuer and E. Weinstein, "Convergence analysis of LMS filters with uncorrelated Gaussian data," IEEE Trans. Acoust. Speech Signal Process., Vol. 33, No. 1, p. 222, 1985.
5. V. Solo and X. Kong, Adaptive Signal Processing Algorithms, Prentice Hall, NJ, 1995.
6. H. J. Kushner and G. G. Yin, Stochastic Approximation Algorithms and Applications, Springer, 1997.
7. H. J. Butterweck, "A wave theory of long adaptive filters," IEEE Trans. Circuits and Systems I, Vol. 48, p. 739, 2001.
8. A. H. Sayed and M. Rupp, "A time-domain feedback analysis of adaptive algorithms via the small gain theorem," Proc. SPIE, Vol. 2563, p. 458, San Diego, CA, 1995.
9. A. H. Sayed and M. Rupp, "Robustness issues in adaptive filtering," in DSP Handbook, Chapter 20, CRC Press, 1998.
10. J. Mai and A. H. Sayed, "A feedback approach to the steady-state performance of fractionally-spaced blind adaptive equalizers," IEEE Trans. Signal Process., Vol. 48, No. 1, p. 80, 2000.
11. N. R. Yousef and A. H. Sayed, "A unified approach to the steady-state and tracking analyses of adaptive filters," IEEE Trans. Signal Process., Vol. 49, No. 2, p. 314, 2001.
12. M. Rupp and A. H. Sayed, "A time-domain feedback analysis of filtered-error adaptive gradient algorithms," IEEE Trans. Signal Process., Vol. 44, No. 6, p. 1428, 1996.
13. A. H. Sayed and T. Y. Al-Naffouri, "Mean-square analysis of normalized leaky adaptive filters," Proc. ICASSP, Vol. 6, p. 3873, Salt Lake City, Utah, 2001.
14. T. Y. Al-Naffouri and A. H. Sayed, "Transient analysis of data-normalized adaptive filters," IEEE Trans. Signal Process., Vol. 51, No. 3, pp. 639-652, March 2003.
15. R. R. Bitmead and B. D. O. Anderson, "Adaptive frequency sampling filters," IEEE Trans. Circuits and Systems, Vol. 28, No. 6, p. 524, 1981.
16. R. R. Bitmead, B. D. O. Anderson, and T. S. Ng, "Convergence rate determination for gradient-based adaptive estimators," Automatica, Vol. 22, p. 185, 1986.
17. H. J. Kushner and F. J. Vazquez-Abad, "Stochastic approximation methods for systems over an infinite horizon," SIAM Journal on Control and Optimization, Vol. 34, No. 2, p. 712, 1996.
18. R. Durrett, Probability: Theory and Examples, 2nd edition, Duxbury Press, 1996.
19. D. Williams, Probability with Martingales, Cambridge University Press, 2000.
20. A. H. Sayed, Fundamentals of Adaptive Filtering, Wiley, New York, 2003.
ON THE ROBUSTNESS OF
LMS FILTERS
BABAK HASSIBI
California Institute of Technology
4.1
INTRODUCTION
$$ y(n) \approx h^T(n)\, w, \qquad (4.1) $$
for some fixed weight vector $w$ (Fig. 4.1). Despite its apparent simplicity, the linear model has broad implications and applies to many different problems and applications. Most often, the crucial step in writing the model (4.1) is to determine the input-output pairs $(h(n), y(n))$ and the nature of the approximation $\approx$. For example, if we are presented with scalar input-output sequences, or time series, $u(n)$ and $y(n)$, then a possible model could be
$$ y(n) \approx w_1 u(n) + w_2 u(n-1) + \cdots + w_m u(n-m+1). \qquad (4.2) $$

In this chapter we will assume, for simplicity, that the output is a scalar. The more general problem of a vector output can be handled without much further difficulty.
Figure 4.1

filters; however, we shall not go into details here. We only mention in passing that the model (4.1) is also of central importance because it can be regarded as the linearization of more general nonlinear models (such as neural networks) around a suitable operating point.

In any event, once the model (4.1) has been constructed, the problem is to determine the best weight vector that describes the relationship between the inputs $\{h(n)\}$ and the outputs $\{y(n)\}$. The question, of course, is, in what sense do we mean best? A reasonable choice is to have $h^T(n)w$ match $y(n)$ in a least-mean-squares sense, that is, to choose $w$ according to the criterion
$$ \min_w\ E\big(y(n) - h^T(n)w\big)^2, \qquad (4.3) $$
where, assuming stationarity, we define the second-order statistics
$$ E\,h(n)h^T(n) \triangleq R, \qquad E\,h(n)y(n) \triangleq p, \qquad E\,y^2(n) \triangleq r_y. \qquad (4.4) $$

Figure 4.2

Setting the gradient of the cost in (4.3) to zero yields the celebrated Wiener solution
$$ w^o = R^{-1} p. \qquad (4.5) $$
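To make the connection between the statistics (4.4) and the Wiener solution (4.5) concrete, the sketch below (illustrative code, not from the chapter; the data model, dimensions, and noise level are assumptions) estimates $R$ and $p$ from simulated stationary data and verifies that $\hat R^{-1}\hat p$ recovers the underlying weight vector.

```python
import numpy as np

rng = np.random.default_rng(2)
m, N = 4, 50_000
wo = rng.standard_normal(m)                  # underlying weight vector (assumption)

H = rng.standard_normal((N, m))              # regressors h(n); iid Gaussian => R = I
y = H @ wo + 0.1 * rng.standard_normal(N)    # outputs with a small disturbance

R_hat = H.T @ H / N                          # sample estimate of R = E h h^T
p_hat = H.T @ y / N                          # sample estimate of p = E h y
w_hat = np.linalg.solve(R_hat, p_hat)        # empirical Wiener solution R^{-1} p

print(np.linalg.norm(w_hat - wo) < 0.05)
```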
4.1.1

Therefore the pioneering work of Widrow and Hoff came as a breakthrough, since it provided a recursive way of approximately solving (4.3) without knowledge of the statistics of the signals involved [3, 1, 2]. Since the statistics of the signals are not known, the expectation in (4.3) cannot be explicitly evaluated, nor can the gradient of the cost function be computed. However, the key observation of Widrow and Hoff was that, by using the instantaneous value of the squared error, $(y(n) - h^T(n)w)^2$, rather than its unknown mean, one can come up with an estimate of the gradient via differentiation with respect to $w$. This so-called instantaneous gradient is given by $-2h(n)(y(n) - h^T(n)w)$, and so the algorithm updates the estimate of the weight vector along the negative of the instantaneous gradient (absorbing the factor of 2 into the step-size $\mu$):
$$ \hat w(n+1) = \hat w(n) + \mu\, h(n)\big(y(n) - h^T(n)\hat w(n)\big). \qquad (4.6) $$
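A minimal implementation of the update (4.6) can be sketched as follows (illustrative code; the simulated data model, step-size, and dimensions are arbitrary choices, not from the chapter). With a small step-size the estimate settles into a neighborhood of the true weight vector, hovering there because of gradient noise.

```python
import numpy as np

rng = np.random.default_rng(3)
m, N, mu = 4, 20_000, 0.01
wo = rng.standard_normal(m)                      # unknown weight vector (assumption)

w = np.zeros(m)                                  # LMS estimate of w
for n in range(N):
    h = rng.standard_normal(m)                   # input vector h(n)
    y = h @ wo + 0.05 * rng.standard_normal()    # y(n) = h^T(n) w + v(n)
    w += mu * h * (y - h @ w)                    # update (4.6)

print(np.linalg.norm(w - wo) < 0.1)   # close to wo, but never exactly convergent
```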
4.1.2

When the statistics of the signals are not known, a natural alternative is to replace $R$ and $p$ in the Wiener solution by their data-based sample averages, which leads to the estimate
$$ \hat w = \left(\sum_{n=1}^{N} h(n)h^T(n)\right)^{-1} \sum_{n=1}^{N} h(n)y(n). \qquad (4.7) $$
Alternatively, one may replace the mean-squared error in (4.3) by its data-based average and consider the deterministic criterion
$$ \min_w\ \frac{1}{N}\sum_{n=1}^{N} \big(y(n) - h^T(n)w\big)^2. \qquad (4.8) $$
Note that, compared to (4.3), which was a stochastic least-squares problem, (4.8) is a
deterministic least-squares problem. Thus, there is no need to assume random
processes or to take expectations. Problem (4.8) can be readily solved via a
straightforward differentiation, or completion of squares. The reassuring, and not
altogether unexpected, result is that the solution to (4.8) is also given by (4.7). Thus,
replacing the unknown statistics R and p by their data-based averages and replacing
the mean-squared error by its data-based average are essentially the same thing and
lead to the same solution (4.7).
By all accounts, the solution (4.7) appears to be a better approximation to the solution of (4.3) than that provided by the LMS filter. However, compared to the LMS algorithm, which yields a recursive solution, it has the drawback that one needs access to the entire data set in order to compute the solution. Although this is not an issue in some applications (such as system identification), it is crucial in many others, such as control and communications, where certain decisions must be made in real time and so depend on the current estimate of the weight vector. In such applications a recursive solution is a must.
Fortunately, the situation is easily remedied. Note that the solution to (4.8) at time $m \leq i \leq N$ (obtained by setting the upper limits in the sums to $i$) is given by
$$ \hat w(i) = \left(\sum_{n=1}^{i} h(n)h^T(n)\right)^{-1} \sum_{n=1}^{i} h(n)y(n), \qquad (4.9) $$
where we have assumed that the matrix appearing in parentheses is invertible (which, incidentally, is also why we need $i \geq m$). It is convenient to define the matrix $P(i) \triangleq \big(\sum_{n=1}^{i} h(n)h^T(n)\big)^{-1}$, so that
$$ \hat w(i) = P(i) \sum_{n=1}^{i} h(n)y(n) $$
$$ = \Big[P^{-1}(i-1) + h(i)h^T(i)\Big]^{-1} \left(\sum_{n=1}^{i-1} h(n)y(n) + h(i)y(i)\right) $$
$$ = \left(P(i-1) - \frac{P(i-1)h(i)h^T(i)P(i-1)}{1 + h^T(i)P(i-1)h(i)}\right) \left(\sum_{n=1}^{i-1} h(n)y(n) + h(i)y(i)\right) $$
$$ = \hat w(i-1) - \frac{P(i-1)h(i)h^T(i)}{1 + h^T(i)P(i-1)h(i)}\,\hat w(i-1) + P(i-1)h(i)y(i) - \frac{P(i-1)h(i)\,h^T(i)P(i-1)h(i)}{1 + h^T(i)P(i-1)h(i)}\,y(i), $$
where in the third step we used the matrix inversion lemma
$$ (A + BCD)^{-1} = A^{-1} - A^{-1}B\big(C^{-1} + DA^{-1}B\big)^{-1}DA^{-1}. $$
Now the last expression readily yields
$$ \hat w(i) = \hat w(i-1) + \frac{P(i-1)h(i)}{1 + h^T(i)P(i-1)h(i)}\,\big(y(i) - h^T(i)\hat w(i-1)\big), \qquad (4.10) $$
which is the recursion we were pursuing. All that is needed is a recursion for $P(i)$. But the matrix inversion lemma can again be used to obtain
$$ P(i) = P(i-1) - \frac{P(i-1)h(i)h^T(i)P(i-1)}{1 + h^T(i)P(i-1)h(i)}. \qquad (4.11) $$
Together, the recursions (4.10)-(4.11) constitute what is known as the recursive least-squares (RLS) algorithm. It has a long history, and its inception goes back to Gauss and Legendre. Due to its similarity to a differential equation first studied by Riccati, the recursion (4.11) is referred to as a Riccati recursion. (For an interesting review of these, see [7].) Like all recursions, (4.10)-(4.11) must be initialized.
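The recursions (4.10)-(4.11) can be checked directly against the batch solution (4.9). The sketch below (illustrative code; the data and dimensions are assumptions) initializes at time $i = m$ from the first $m$ samples and verifies that the RLS iterates reproduce the batch least-squares estimate.

```python
import numpy as np

rng = np.random.default_rng(4)
m, N = 3, 60
H = rng.standard_normal((N, m))      # rows are h^T(n); 0-based indexing here
y = H @ rng.standard_normal(m) + 0.1 * rng.standard_normal(N)

# Initialize at time i = m with the batch quantities w^(m), P(m)
P = np.linalg.inv(H[:m].T @ H[:m])
w = P @ (H[:m].T @ y[:m])

for i in range(m, N):                # process the remaining samples recursively
    h, yi = H[i], y[i]
    g = P @ h / (1 + h @ P @ h)      # gain vector appearing in (4.10)
    w = w + g * (yi - h @ w)         # weight update (4.10)
    P = P - np.outer(g, h @ P)       # Riccati recursion (4.11)

# The final estimate must equal the batch solution (4.9) at i = N
w_batch = np.linalg.solve(H.T @ H, H.T @ y)
print(np.allclose(w, w_batch, atol=1e-6))
```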
The RLS algorithm just described gives, at each time instant $i$, the exact solution to the deterministic least-squares problem $\min_w \sum_{n=1}^{i} (y(n) - h^T(n)w)^2$. We also saw that, under some mild mixing conditions, it converges to the Wiener solution, which is the optimal solution to the stochastic least-squares problem (4.3). However, one may wonder whether the estimate provided by the RLS algorithm at any time instant $i$, and not just its limiting value, has a stochastic interpretation in its own right. It turns out that this is indeed the case.

To this end, recall from (4.1) that the linear model we have assumed is approximate. However, we can always make it exact by adding an appropriate disturbance signal $v(n)$, so that
$$ y(n) = h^T(n)w + v(n). \qquad (4.12) $$
Suppose now that the disturbance sequence $v(n)$ in (4.12) is iid Gaussian with zero mean and variance $\sigma^2$. Then the conditional density of the observations is
$$ p\big(y(1), \ldots, y(N) \mid w, h(1), \ldots, h(N)\big) = \frac{1}{\sqrt{(2\pi\sigma^2)^N}}\, \exp\left(-\frac{1}{2\sigma^2} \sum_{n=1}^{N} \big(y(n) - h^T(n)w\big)^2\right), \qquad (4.13) $$
since, conditioned on $w$ and the $h(n)$, the $y(n)$ are independent Gaussian with mean $h^T(n)w$ and variance $\sigma^2$. The above conditional density is often referred to as the likelihood function. Any estimator that maximizes it according to the criterion
$$ \max_w\ p\big(y(1), \ldots, y(N) \mid w, h(1), \ldots, h(N)\big) \qquad (4.14) $$
is referred to as a maximum-likelihood (ML) estimator. Since maximizing (4.13) over $w$ amounts to minimizing $\sum_{n=1}^{N} (y(n) - h^T(n)w)^2$, the ML estimate coincides with the least-squares solution (4.7).
Some remarks on the stochastic assumptions that lead to the above observation are in order. Most importantly, they differ from the stochastic assumptions we made when obtaining the Wiener solution in two ways. First, we require the disturbance sequence to be iid Gaussian (which we did not need for the Wiener solution). And second, we do not need any stochastic assumption on the input vectors $h(n)$ (as we did in the Wiener case), since we treat them as known and condition on their values.
If the weight vector $w$ is itself modeled as random, one may instead consider the a posteriori density
$$ p\big(w \mid y(1), \ldots, y(N), h(1), \ldots, h(N)\big) = \frac{p\big(y(1), \ldots, y(N) \mid w, h(1), \ldots, h(N)\big)\, p\big(w \mid h(1), \ldots, h(N)\big)}{p\big(y(1), \ldots, y(N) \mid h(1), \ldots, h(N)\big)}, \qquad (4.15) $$
where we have used Bayes' rule. Any estimator that maximizes the above a posteriori probability is referred to as a maximum a posteriori (MAP) estimator. Since the denominator is independent of $w$, and since we assume that $w$ is independent of the regressor vectors $h(n)$ (so that $p(w \mid h(1), \ldots, h(N)) = p(w)$), it follows that MAP estimators satisfy the criterion
$$ \max_w\ p\big(y(1), \ldots, y(N) \mid w, h(1), \ldots, h(N)\big)\, p(w). \qquad (4.16) $$
To obtain an explicit solution, we need to assume a certain model for $w$ and, again, the standard one is that it is a zero-mean Gaussian random vector with covariance matrix $E\,ww^T = \Pi_0$, independent of all the other signals involved. In this case, we have
$$ p\big(y(1), \ldots, y(N) \mid w, h(1), \ldots, h(N)\big)\, p(w) = \frac{1}{K} \exp\left(-\frac{1}{2\sigma^2}\left[\sum_{n=1}^{N}\big(y(n) - h^T(n)w\big)^2 + w^T \sigma^2 \Pi_0^{-1} w\right]\right), \qquad (4.17) $$
where $K = \sqrt{(2\pi)^{N+m}\, \sigma^{2N}\, \det \Pi_0}$. Therefore the MAP estimator is one that solves the following regularized least-squares problem:
$$ \min_w\ \left[w^T \sigma^2 \Pi_0^{-1} w + \sum_{n=1}^{N} \big(y(n) - h^T(n)w\big)^2\right]. \qquad (4.18) $$
This is identical to the least-squares problem (4.8), except for the regularization term (often called the prior) $w^T \sigma^2 \Pi_0^{-1} w$. The solution is identical to that of RLS (4.10)-(4.11), except that now we should initialize the recursions with
$$ \hat w(0) = 0 \qquad \text{and} \qquad P(0) = \sigma^{-2}\Pi_0. $$
In fact, this makes the regularized solution more convenient to use than the nonregularized one, for which we could only start the recursions from time $m$ and needed to explicitly compute $\hat w(m)$ and $P(m)$.
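The MAP/regularized variant can be verified in the same way as before. The sketch below (illustrative code; the choice $\Pi_0 = \pi_0 I$ and the noise level are assumptions) runs (4.10)-(4.11) from $\hat w(0) = 0$, $P(0) = \sigma^{-2}\Pi_0$ and compares the result with the closed-form minimizer of (4.18).

```python
import numpy as np

rng = np.random.default_rng(5)
m, N, sigma2, pi0 = 3, 40, 0.01, 2.0
H = rng.standard_normal((N, m))
y = H @ rng.standard_normal(m) + np.sqrt(sigma2) * rng.standard_normal(N)

# Regularized RLS: w^(0) = 0, P(0) = sigma^{-2} Pi_0, with Pi_0 = pi0 * I
w = np.zeros(m)
P = (pi0 / sigma2) * np.eye(m)
for i in range(N):
    h, yi = H[i], y[i]
    g = P @ h / (1 + h @ P @ h)      # gain vector (4.10)
    w = w + g * (yi - h @ w)         # weight update (4.10)
    P = P - np.outer(g, h @ P)       # Riccati recursion (4.11)

# Closed-form minimizer of (4.18): (sigma^2 Pi_0^{-1} + H^T H)^{-1} H^T y
w_map = np.linalg.solve((sigma2 / pi0) * np.eye(m) + H.T @ H, H.T @ y)
print(np.allclose(w, w_map))
```

Because the regularization keeps $P^{-1}(0)$ nonzero, no matrix inversion of data is needed at start-up, which is precisely the convenience noted above.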
4.1.3.3 Least-Mean-Squares Estimation. Let us return to the original least-mean-squares criterion (4.3), but now apply the above stochastic assumptions. Thus, using $\hat w$ to denote our estimate of the weight vector and $w$ to denote its true unknown value, the mean-square error (4.3) becomes
$$ E\big(y(n) - h^T(n)\hat w\big)^2 = E\big(v(n) + h^T(n)(w - \hat w)\big)^2 = \sigma^2 + h^T(n)\, E(w - \hat w)(w - \hat w)^T\, h(n), $$
where in the second step we used the fact that $v(n)$ is independent of $w$, and where we used our assumption that $h(n)$ is deterministic to pull it outside the expectation. This implies that the criterion becomes that of minimizing $E\big|h^T(n)w - h^T(n)\hat w\big|^2$. It is often customary to define the uncorrupted output $h^T(n)w$ as the desired signal, $d(n) \triangleq h^T(n)w$. In this case, the criterion is to choose $\hat w$ so as to best match the desired signal $d(n)$ in a least-mean-squares sense.

In any event, at time $n$, the optimal estimate of the desired signal is given by the conditional mean [8]
$$ \hat d(n) = E\big[h^T(n)w \mid y(1), \ldots, y(n)\big] = h^T(n)\, E\big[w \mid y(1), \ldots, y(n)\big]. $$
This implies that, irrespective of the input vector $h(n)$, we may define the optimal estimate of the weight vector at time $n$ as
$$ \hat w(n) = E\big[w \mid y(1), \ldots, y(n)\big]. \qquad (4.19) $$
The fact that the optimal estimate of $w$ here does not depend on $h(n)$ is significant and follows from the linearity of the conditional mean. We remark that this is not necessarily true of other estimation criteria; more on this later.

When $v(n)$ is a stationary white Gaussian process, it is well known that the conditional mean is given by the MAP estimator [9]. Thus the solution to the regularized RLS problem (4.18) yields the least-mean-squares solution. When $v(n)$ is not Gaussian, however, the conditional mean does not coincide with the MAP estimator, and the solution is not given by (4.18).

If we insist that our estimator be linear in the observations, that is, that $\hat d(n)$ be a linear function of $\{y(1), \ldots, y(n)\}$, then it turns out that the optimal estimator depends only on the first- and second-order statistics (the mean and covariance functions) of the signals involved. In this case the optimal estimator is known as the linear least-mean-squares estimator.
Some Questions

In the past few sections we have provided the RLS algorithm with a plethora of properties and optimality criteria. However, at this stage, all we have provided for the LMS algorithm is a heuristic argument for its being an approximation to the Wiener solution.

We have argued that, under some mild mixing conditions, the RLS solution converges to the optimal Wiener solution. The convergence of the RLS algorithm can be seen from the fact that, as $n \to \infty$, the matrix $P(i) = \big(\sum_{n=1}^{i} h(n)h^T(n)\big)^{-1}$ approaches zero² and therefore so does the gain vector $P(i-1)h(i)/\big(1 + h^T(i)P(i-1)h(i)\big)$ in (4.10), implying that $\hat w(i) - \hat w(i-1) \to 0$ and $\hat w(i) \to w^o$. The LMS algorithm (4.6), on the other hand, has no chance of converging, since its gain vector $\mu h(n)$ is nonzero for all time, meaning that, as long as there is a nonzero error signal $y(n) - h^T(n)\hat w(n)$ (which is always the case when we have a disturbance signal $v(n)$), the value of $\hat w(i+1)$ can never approach $\hat w(i)$.³

We have also shown that, under a Gaussian disturbance model and assuming that the input vectors are deterministic, depending on whether or not we include a regularizing term, RLS recursively yields the MAP and ML estimates of the weight vector. It also yields the least-mean-squares solution under the Gaussian assumption, and the linear least-mean-squares solution under the assumption of a zero-mean white disturbance signal. For LMS, on the other hand, we have no such stochastic interpretations.
With all that has been said, it appears that RLS should be the method of choice for adaptive filtering. Nonetheless, a survey of the applications of adaptive filtering over the past few decades reveals that the LMS algorithm and its variants are more widely used than RLS and its variants. It is therefore natural to ask why this has been the case. Apart from the performance and optimality issues just discussed, there are other criteria that determine the applicability of a certain algorithm or methodology to different practical problems. These may be listed as follows.

²This requires that for any scalar $\Delta > 0$ there exist a time instant $i$ such that $\sum_{n=1}^{i} h(n)h^T(n) > \Delta I_m$, where the latter inequality is in the sense of positive definite matrices. This condition is referred to as persistence of excitation and is a very reasonable assumption.

³We should mention that the above argument holds when the LMS algorithm has a constant step-size $\mu > 0$. There also exist variants of LMS with a time-varying step-size $\mu(i) > 0$. If $\mu(i) \to 0$, then the LMS algorithm will converge. However, we will not be considering vanishing step-sizes in this chapter.
1. Simplicity. Simpler solutions are often preferred in practice, and the LMS
algorithm is certainly simple. However, the RLS algorithm is not really so
complex. It has a structure quite similar to that of LMS; the weight vector is
updated according to the error signal along the direction of a certain gain
vector. The only difference is that computing the gain vector in RLS requires
the propagation of a Riccati recursion.
2. Computational complexity. Algorithms that require fewer computations are
preferred in practice. The LMS algorithm clearly requires Om computations
per iteration. The RLS algorithm, as depicted in (4.10 4.11), requires Om2
computations per iteration, essentially because the Riccati recursion (4.11)
requires a matrix-vector product as well as a vector-vector outerproduct.
However, in many applications the input vectors possess certain structure. The
most common of these is the time series structure hnT un un 1
un m 1 of (4.2), where there is a great deal of redundancy between
hn and hn 1. There exist several clever techniques to exploit this
redundancy and thereby reduce the computations to Om per iteration [10].
These are generally referred to as fast RLS algorithms.
3. Numerical stability. Imprecision and round-off errors are unavoidable in the numerical implementation of any algorithm. Algorithms that suffer from numerical instability in the face of such errors are not suitable in practice. If the learning rate is not too large, so that the weight vector estimates do not diverge, then it can be shown that the LMS algorithm is numerically stable. In other words, round-off errors cannot lead to divergence and other problems. With RLS, if implemented according to (4.10)–(4.11), numerical instability can be an issue, since round-off errors and finite precision can lead to the matrix P(i) of the Riccati recursion (4.11) losing its positive definiteness. (When this happens, RLS may not yield meaningful estimates.) The problem of losing positive definiteness of the Riccati variable has been known for a long time, especially in the context of Kalman filtering, and has been resolved by employing certain square-root algorithms [11, 8]. Rather than propagate the Riccati variable P(i), these algorithms propagate its square-root factor, that is, an m × m matrix P(i)^{1/2}, such that P(i)^{1/2} P(i)^{T/2} = P(i). With this trick, loss of positivity is no longer an issue. Moreover, square-root algorithms make extensive use of unitary transformations that, since they do not change the norm of vectors, are the most numerically stable matrix operations that can be performed.
In their most general form, square-root algorithms for RLS require O(m²) operations. Under the time-series structure h(n)^T = [u(n) u(n−1) ... u(n−m+1)], there exist fast square-root algorithms that require O(m) operations per iteration.
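To make the square-root idea concrete, here is a minimal sketch (not the chapter's algorithm; the dimensions and signals are arbitrary) that propagates a triangular factor of P through the growing-window RLS measurement update by QR-triangularizing a pre-array. The product of the factor with its own transpose then tracks the conventional Riccati recursion while staying positive semidefinite by construction.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 4
L = np.eye(m)                       # square-root factor of P(0) = I

def sqrt_update(L, h):
    # Pre-array for the RLS/Kalman measurement update; a unitary rotation
    # that lower-triangularizes it yields the updated factor of P.
    pre = np.block([[np.ones((1, 1)), (h @ L)[None, :]],
                    [np.zeros((m, 1)), L]])
    # QR of the transpose realizes the required unitary transformation.
    q, r = np.linalg.qr(pre.T)
    post = r.T                      # lower-triangular post-array
    return post[1:, 1:]             # updated square-root factor of P

P = np.eye(m)
for _ in range(50):
    h = rng.standard_normal(m)
    L = sqrt_update(L, h)
    P = P - np.outer(P @ h, P @ h) / (1.0 + h @ P @ h)  # conventional Riccati

# The factored form reproduces P, and L @ L.T is PSD by construction.
print(np.max(np.abs(L @ L.T - P)))
```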
4.2
The answer to the question of why LMS is more often used in adaptive filtering can be found in the assumptions required for the various optimality properties of RLS. When the inputs and outputs are stationary random processes, the Wiener solution (4.5) is meaningful and the RLS algorithm converges to it. However, what if the signals involved are not stationary? Although the Wiener solution is no longer meaningful, under a persistence-of-excitation assumption the RLS algorithm still converges, but what does it converge to? Moreover, what if the unknown weight vector w, which we have so far assumed to be constant, itself varies with time? The RLS algorithm (4.10)–(4.11) has a vanishing gain vector and so cannot track a time-varying w.⁴ The LMS algorithm, with its nonvanishing gain vector, on the other hand, may be able to perform such tracking.
The stochastic optimality properties of being the ML and/or MAP estimator require that the additive noise term v(n) be zero-mean, white, and Gaussian. The optimality property of being the linear least-mean-squares estimator requires that v(n) be zero-mean and white. But what if the v(n) are not Gaussian? What if they are not white? In many adaptive filtering applications the additive noise term includes modeling errors: The true model may be an IIR filter, and so by assuming the FIR model (4.2), we are neglecting the tail of the filter and relegating it to v(n). Or the model may have some nonlinearities that we have ignored and included in v(n). In any event, with the inclusion of modeling errors, the v(n) are no longer Gaussian, or white for that matter, and the stochastic optimality properties do not hold. So what happens to these algorithms and what happens to their performance?
⁴ There exist certain variations of the RLS algorithm, such as the exponentially windowed RLS algorithm, that have a nonvanishing gain vector and so may be suitable for tracking. More on these later.
Of course, no matter what the statistics and distributions of the noise term may be, the RLS algorithm is always optimal in the sense of minimizing the least-squares cost (4.8), since this is a deterministic cost. Now this may very well be a reasonable optimality property to have. However, insofar as deterministic least-squares costs go, it may also be reasonable to consider the weighted least-squares criterion

    min_w (1/N) Σ_{n=1}^N q(n) (y(n) − h(n)^T w)²,    (4.20)

where {q(n) > 0} is a set of weights. More generally, we can consider the weighted least-squares problem

    min_w (1/N) Σ_{n=1}^N Σ_{k=1}^N (y(n) − h(n)^T w) q(n,k) (y(k) − h(k)^T w),    (4.21)

where Q = [q(n,k)] is an N × N positive definite matrix. This latter cost, especially, can be obtained by first applying the error sequence {y(n) − h(n)^T w} to a linear filter and then applying the output of the filter to a standard least-squares problem of the form (4.8).
We thus conclude that there is nothing special about the least-squares criterion (4.8). Which of the three alternatives (4.8), (4.20), and (4.21) one should use depends on the application at hand.
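Both weighted criteria reduce to linear normal equations. A small NumPy sketch (the sizes, weights, and noise level are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(2)
N, m = 100, 3
H = rng.standard_normal((N, m))          # row n is h(n)^T
w_true = rng.standard_normal(m)
y = H @ w_true + 0.1 * rng.standard_normal(N)

# (4.20): diagonal weighting q(n) > 0
q = rng.uniform(0.5, 2.0, size=N)
w_diag = np.linalg.solve(H.T @ (q[:, None] * H), H.T @ (q * y))

# (4.21): full positive definite weighting matrix Q = [q(n,k)]
A = rng.standard_normal((N, N))
Q = A @ A.T + N * np.eye(N)              # symmetric positive definite
w_full = np.linalg.solve(H.T @ Q @ H, H.T @ Q @ y)

print(w_diag, w_full)   # both are consistent estimates of w_true
```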
4.2.1
The above discussion should bother us on two counts. First, we are left with the question, what will the performance of RLS be when the stochastic assumptions are not met? In other words, how robust is RLS with respect to modeling errors and lack of statistical knowledge of the exogenous signals? Second, if we indeed make certain stochastic assumptions (such as Gaussianity, whiteness, etc.), then the problem of adaptive filtering reduces to a statistical estimation problem. This really takes the wind out of the sails of adaptive filtering. Instead of being an independent field in itself, it is relegated to being a subset of statistical estimation theory. More importantly, the word adaptive becomes vacuous. What are we adapting to? Nothing, really, since we have assumed perfect stochastic models for all the signals involved.
In reality, adaptive algorithms should be able to operate under different stochastic assumptions and tolerate different types of modeling errors. In other words, they should be able to adapt to the stochastic (or otherwise) environment that they are in. This clearly implies that adaptive algorithms must be robust to variations of the system model and underlying statistical assumptions.
Therefore adaptation is much more related to robustness with respect to statistical variation than it is to optimality with respect to a specific statistical model. This is a very important point that is not nearly as well recognized as it should be. Of course, due to their robustness properties, adaptive algorithms will be conservative and will not perform as well as the optimal algorithm for any particular statistical model. However, we do expect them to perform reasonably well over a wide range of statistical models and environments; indeed, very much like the way the LMS algorithm performs in different environments.
In order to begin to quantify what we mean by robustness, it is helpful to pose the basic robustness question for any estimation algorithm:
Is it possible that small disturbances and modeling errors may lead to large estimation errors?
Note that in the above question we have made no reference to the statistics of the disturbances. All we are asking is that, if the disturbances and modeling errors are small, then the estimation errors be small, no matter what the statistics of the disturbances may be. In other words, as long as we set up a model that reasonably describes our data, that is, one in which the disturbances and modeling errors are small, then a robust estimator will guarantee that the estimation errors will be small.
The above comments imply that any approach to robust estimation requires a notion of largeness and smallness for the signals involved. For this there exist many possibilities. For example, one can consider the peak of the absolute value of the signals as one such measure. In control-theoretic jargon this is referred to as l₁ theory [13]. A perhaps more physical measure, widely used in practice and one that allows for more analytic tractability, is the energy of the signal. This is what leads to H∞ theory.
4.2.2 The H∞ Approach
The first systematic study of robustness, within the framework described above, was done in the context of control theory and was introduced by Zames in 1981 [14]. Zames' H∞ theory was concerned not with estimation problems but with the design of controllers that were robust with respect to model uncertainty and lack of statistical knowledge of the exogenous signals. H∞ control theory can be regarded as the outgrowth and extension of classical linear-quadratic-Gaussian (LQG) control theory, developed in the 1950s and 1960s, which assumed perfect models and complete statistical knowledge [15]. Indeed, the development of H∞ theory in the 1980s and 1990s is considered one of the significant achievements in control theory.
Now in the H∞ context, a robust estimator is one for which disturbances with small energy lead to estimation errors with small energy. Therefore the natural object to study is the energy gain from the disturbances to the estimation errors. In particular, since we are interested in having small estimation error energies for all small disturbance energies, we need to focus on the worst-case energy gain. This is what is referred to as the H∞ norm.⁵
⁵ The norm defined here is really what is known as a 2-induced norm (since it is the maximum of the ratio of energies, or 2-norms). For historical reasons, in the control theory literature the 2-induced norm is referred to as the H∞ norm. Here H stands for Hardy space, the space of all causal and stable functions of a complex variable, that is, the space of all functions analytic outside the unit circle. The superscript ∞ refers to the fact that H∞ is the space of functions analytic outside the unit circle and with finite magnitude on the unit circle [16]. In any event, the term H∞ norm is a misnomer since, strictly speaking, the 2-induced norm becomes an ∞-norm (in the frequency domain) only when we consider infinite-horizon linear time-invariant (LTI) systems, which is the context in which H∞ control was originally introduced. (As is well known, the maximum energy gain of any stable LTI system with transfer function K(z) is given by max_ω |K(e^{jω})|².) Nonetheless, we too shall be guilty of using this loose terminology.
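For instance, the maximum energy gain of an FIR filter can be approximated by evaluating |K(e^{jω})|² on a dense frequency grid. A small sketch with arbitrary illustrative coefficients:

```python
import numpy as np

# FIR transfer function K(z) = 0.5 + 0.3 z^{-1} - 0.2 z^{-2} (illustrative)
k = np.array([0.5, 0.3, -0.2])

# Maximum energy gain of a stable LTI system = sup over frequency of
# |K(e^{jw})|^2, approximated on a dense grid via the zero-padded FFT.
K = np.fft.fft(k, 4096)
gain = np.max(np.abs(K)) ** 2
print(gain)
```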
Figure 4.3 The H∞ norm is the maximum energy gain from the disturbances to the estimation errors.

    γ² = sup_{v ∈ l₂, v ≠ 0} ‖e‖² / ‖v‖²,    (4.22)

where ‖v‖² = Σ_n v(n)², ‖e‖² = Σ_n e(n)², l₂ denotes the space of square-summable sequences, and sup refers to supremum.
In H∞ estimation we seek the estimator that minimizes the H∞ norm (see Fig. 4.3).

4.2.2.2 Problem 1 (Optimal H∞ Estimation Problem) Find an estimator that minimizes the H∞ (or 2-induced) norm from the disturbances {v(n)} to the estimation errors, that is,

    inf_{all estimators} sup_{{v(n)} ∈ l₂} ‖e‖² / ‖v‖²,    (4.23)

where inf refers to infimum. Moreover, find the resulting optimal value γ_opt = inf γ.
The minimax nature of H∞-optimal estimation is evident from the above problem formulation. H∞ estimation is essentially a game problem: Nature (the opponent) has access to the unknown disturbance {v(n)} and chooses it to maximize the energy gain in (4.23), whereas we choose the estimator to minimize it.
H∞-optimal estimators safeguard against the worst-case disturbance that maximizes the energy gain to the estimation errors. Since this worst-case disturbance is a deterministic sequence, such estimators do not require any statistical assumptions. Moreover, since the infimization in (4.23) is taken over all possible disturbances, the resulting estimator will be robust with respect to disturbance variation. It can, on the other hand, be quite conservative.
We should mention that H∞ theory is a very rich area and that there exist many different approaches to solving it. The original methods were operator- and function-theoretic and made use of interpolation theory [17, 18], but there also exist state-space [19], circuit-theoretic [20], and game-theoretic [21] approaches. We shall not go into any of these here, though the interested reader may consult any of the aforementioned references.

4.2.3 A First Attack
Let us begin to consider how we can apply the H∞ approach to the adaptive filtering problem at hand. Suppose we are given the input-output pairs {h(n), y(n)}_{n=1}^N and we want to estimate the unknown weight vector w. How should we set up the problem?
First, let us look at the disturbances. Clearly, there are two unknowns: the weight vector itself and the unknown additive noise

    v(n) = y(n) − h(n)^T w.

We can therefore define the energy of the disturbances as⁶

    μ⁻¹ w^T w + Σ_{n=1}^N v(n)² = μ⁻¹ w^T w + Σ_{n=1}^N (y(n) − h(n)^T w)².    (4.24)

The energy gain from the disturbances to the prediction error d̂(N+1) − h(N+1)^T w is then⁷

    (d̂(N+1) − h(N+1)^T w)² / ( μ⁻¹ w^T w + Σ_{n=1}^N (y(n) − h(n)^T w)² ).    (4.25)
⁶ If we have an initial estimate of the weight vector, ŵ(0) say, then it is more appropriate to define the disturbance energy as

    μ⁻¹ (w − ŵ(0))^T (w − ŵ(0)) + Σ_{n=1}^N v(n)².

However, without loss of generality we shall assume that ŵ(0) = 0.
⁷ We remark that in the prediction problem under consideration d̂(N+1) is allowed to be a function of h(N+1) but not of y(N+1).
To facilitate solving this problem, let us first look at the suboptimal problem of guaranteeing that the maximum energy gain is bounded by the value γ². In other words, for all possible w, we would like to have

    (d̂(N+1) − h(N+1)^T w)² / ( μ⁻¹ w^T w + Σ_{n=1}^N (y(n) − h(n)^T w)² ) < γ²,

or, equivalently,

    μ⁻¹ w^T w + Σ_{n=1}^N (y(n) − h(n)^T w)² − γ⁻² (d̂(N+1) − h(N+1)^T w)² > 0.    (4.26)

Note that due to the minus sign on the last term, this is an indefinite quadratic form in the unknown w. Defining

    y = [y(1), ..., y(N)]^T,   H = [h(1) ... h(N)],   h = h(N+1),   d̂ = d̂(N+1)

allows us to rewrite (4.26) as

    [w^T  y^T  d̂] M [w ; y ; d̂] > 0,    (4.27)

where

    M = [ μ⁻¹I + HH^T − γ⁻²hh^T    −H      γ⁻²h
          −H^T                      I       0
          γ⁻²h^T                    0      −γ⁻²  ].

Now the above indefinite quadratic form is positive for all w if, and only if:
1. It has a minimum in the unknown variable w (otherwise its value could approach −∞).
2. d̂ can be chosen such that the value at the minimum is positive.
The condition for having a minimum can be readily found by computing the Jacobian (or second derivative) with respect to w and insisting that it be positive definite, that is,

    μ⁻¹I + HH^T − γ⁻² hh^T > 0.

Some simple algebra shows that this is equivalent to

    γ² > h^T (μ⁻¹I + HH^T)⁻¹ h.    (4.28)
The smallest achievable value of γ² is therefore

    γ²_opt = h(N+1)^T [ μ⁻¹I + Σ_{n=1}^N h(n)h(n)^T ]⁻¹ h(N+1).    (4.29)

Some further algebra shows that once the minimization over w has been done, the value of the quadratic form at its minimum is

    y^T (I + μ H^T H)⁻¹ y − ( d̂ − h^T H (μ⁻¹I + H^T H)⁻¹ y )² / ( γ² − h^T (μ⁻¹I + HH^T)⁻¹ h ).    (4.30)
Note that due to (4.28), the second term in the above equation is always nonpositive. Therefore it is clear that one choice that guarantees the value at the minimum to be positive is

    d̂ = h^T H (μ⁻¹I + H^T H)⁻¹ y = h^T (μ⁻¹I + HH^T)⁻¹ H y.

But this is nothing but the estimate obtained from the solution to the regularized least-squares problem (4.18) with P₀ = μI:

    d̂(N+1) = h(N+1)^T [ μ⁻¹I + Σ_{n=1}^N h(n)h(n)^T ]⁻¹ ( Σ_{n=1}^N h(n)y(n) ) = h(N+1)^T ŵ_ls.    (4.31)

In other words, after all the trouble of defining robustness with respect to disturbance variation and introducing the H∞ estimation problem, it turns out that the optimal solution is still given by the regularized least-squares solution (4.18), a strange predicament indeed. So was all this worthwhile? Let us probe further.
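The claim is easy to check numerically. The sketch below (arbitrary illustrative sizes and data) computes γ²_opt from (4.29) and the regularized least-squares prediction from (4.31), then samples many candidate weight vectors w to confirm that the energy gain (4.25) never exceeds γ²_opt:

```python
import numpy as np

rng = np.random.default_rng(3)
N, m, mu = 20, 4, 0.1
H = rng.standard_normal((N, m))            # row n is h(n)^T
h_new = rng.standard_normal(m)             # h(N+1)
y = rng.standard_normal(N)                 # arbitrary observed outputs

A = np.eye(m) / mu + H.T @ H               # mu^{-1} I + sum_n h(n) h(n)^T
g2_opt = h_new @ np.linalg.solve(A, h_new)        # (4.29)
d_hat = h_new @ np.linalg.solve(A, H.T @ y)       # (4.31): h(N+1)^T w_ls

worst = 0.0
for _ in range(5000):
    w = rng.standard_normal(m) * rng.uniform(0.1, 10.0)
    num = (d_hat - h_new @ w) ** 2
    den = w @ w / mu + np.sum((y - H @ w) ** 2)
    worst = max(worst, num / den)

print(g2_opt, worst)    # the sampled energy gain never exceeds g2_opt
```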
4.2.4 A Prediction Problem
What we have shown above is that if we are given the data {h(n), y(n)}_{n=1}^N, then, for any new input vector h(N+1), the best predictor in the H∞ sense is given by the regularized least-squares estimate. Of course, the regularized solution we have presented is off-line: we need all the data before we can predict the output for h(N+1). As mentioned earlier, in many applications real-time constraints are crucial, and at any time n we will need to predict the output at time n+1 using our past observations. In other words, we need an optimal solution that is recursive.
For the least-squares problem we saw that a recursive solution was readily available via the RLS algorithm. In other words, at each time i the solution to the (regularized) RLS algorithm solves the following problem:

    min_w [ μ⁻¹ w^T w + Σ_{n=1}^i (y(n) − h(n)^T w)² ].

Moreover, by the argument of the previous section, the RLS prediction d̂(i+1) = h(i+1)^T ŵ(i) solves

    min_{d̂(i+1)} max_w (d̂(i+1) − h(i+1)^T w)² / ( μ⁻¹ w^T w + Σ_{n=1}^i (y(n) − h(n)^T w)² ),

and also, since future values of the disturbance have no effect on the prediction error at time i,

    min_{d̂(i+1)} max_w (d̂(i+1) − h(i+1)^T w)² / ( μ⁻¹ w^T w + Σ_{n=1}^N (y(n) − h(n)^T w)² ).
From these arguments, we conclude that the RLS algorithm recursively solves the following minimax estimation problem:

    min_{d̂(1),...,d̂(N+1)} [ max_w (d̂(1) − h(1)^T w)² / ( μ⁻¹ w^T w + Σ_{n=1}^N (y(n) − h(n)^T w)² ) + ...
                            + max_w (d̂(N+1) − h(N+1)^T w)² / ( μ⁻¹ w^T w + Σ_{n=1}^N (y(n) − h(n)^T w)² ) ].    (4.32)

But this is not quite the problem we would like to solve. What we would like to solve is the following:

    min_{d̂(1),...,d̂(N+1)} max_w Σ_{n=1}^{N+1} (d̂(n) − h(n)^T w)² / ( μ⁻¹ w^T w + Σ_{n=1}^N (y(n) − h(n)^T w)² );    (4.33)

that is, we would like to minimize the maximum energy gain from the disturbances to the prediction errors {d̂(n) − h(n)^T w}. The problem with (4.32) is that at each time instant the maximization over w is performed separately, whereas in (4.33) a single maximization over w is performed on the total prediction error energy.
4.2.4.1 We shall first show that

    γ²_opt = min_{d̂(1),...,d̂(N+1)} max_w [ Σ_{n=1}^{N+1} (d̂(n) − h(n)^T w)² / ( μ⁻¹ w^T w + Σ_{n=1}^N (y(n) − h(n)^T w)² ) ]

cannot be less than 1. To this end, suppose, without loss of generality, that the initial guess of the estimator is ŵ(0) = 0. Then one may conceive of a disturbance sequence v(n) that yields an output signal that is zero for all times and therefore coincides with the output expected from ŵ(0) = 0. Thus,

    v(n) = −h(n)^T w

and

    y(n) = h(n)^T w + v(n) = 0.

In this case, any permissible estimator will not change its estimate of w, so that ŵ(n) = ŵ(0) = 0 for all n.⁸ Moreover, the prediction error becomes

    d̂(n+1) − h(n+1)^T w = 0 − h(n+1)^T w = v(n+1),
and so the energy gain is

    Σ_{n=1}^{N+1} (h(n)^T w)² / ( μ⁻¹ w^T w + ‖v‖² ) = Σ_{n=1}^{N+1} (h(n)^T w)² / ( μ⁻¹ w^T w + Σ_{n=1}^N (h(n)^T w)² ).
⁸ Note that if an estimator changes its estimate of w when confronted with an all-zero output (y(n) = 0 for all n), then in the disturbance-free case (w = 0 and v(n) = 0 for all n) the denominator of (4.33) will be zero but the numerator nonzero, which makes the energy gain infinite. This is clearly not permissible.
Let us now assume that the {h(n)} have the property that

    lim_{N→∞} Σ_{n=1}^N h(n)^T h(n) = ∞.    (4.34)

Then, given any ε > 0, for N large enough,

    Σ_{n=1}^{N+1} (h(n)^T w)² / ( μ⁻¹ w^T w + Σ_{n=1}^N (h(n)^T w)² ) ≥ Σ_{n=1}^N (h(n)^T w)² / ( μ⁻¹ w^T w + Σ_{n=1}^N (h(n)^T w)² ) ≥ 1 − ε,

which can be made arbitrarily close to one. We thus conclude that γ_opt ≥ 1.
This implies that, for all estimators, in the worst case the prediction error energy can be no less than the disturbance energy. In other words, in the worst case it is not possible to obtain disturbance attenuation.
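The adversarial construction above can be simulated directly. In the sketch below (arbitrary illustrative sizes), the disturbance v(n) = −h(n)^T w zeroes every output, so the estimator never moves, and the resulting energy gain approaches 1 as the excitation energy grows:

```python
import numpy as np

rng = np.random.default_rng(8)
m, mu = 4, 0.1
w = rng.standard_normal(m)

ratios = []
for N in (10, 100, 1000, 10000):
    h = rng.standard_normal((N + 1, m))
    # Adversarial choice v(n) = -h(n)^T w makes every output zero, so any
    # permissible estimator keeps w_hat = 0 and predicts d_hat(n) = 0.
    num = np.sum((h @ w) ** 2)                    # prediction error energy
    den = w @ w / mu + np.sum((h[:N] @ w) ** 2)   # disturbance energy
    ratios.append(num / den)

print(ratios)   # the energy gain creeps up toward 1 as N grows
```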
The question now is whether a worst-case energy gain of unity is achievable. In other words, is γ_opt = 1? And if so, what is the optimal estimator?
4.2.5
Although this volume is devoted to the LMS algorithm, we have spent most of our time studying the RLS algorithm and considering robustness issues. It is now time to return to the LMS algorithm (4.6). If we define the estimation error of the weight vector as w̃(n) = w − ŵ(n), then it is straightforward to see that

    μ^{−1/2} w̃(n) = μ^{−1/2} w̃(n−1) − μ^{1/2} h(n) (y(n) − h(n)^T ŵ(n−1)).    (4.35)

(The reason for premultiplying by μ^{−1/2} will become clear in a moment.) Moreover, we have

    v(n) = y(n) − h(n)^T w = (y(n) − h(n)^T ŵ(n−1)) − h(n)^T w̃(n−1).    (4.36)

Squaring both sides of (4.35) and (4.36) and subtracting the results yields

    μ⁻¹ |w̃(n−1)|² + v(n)² = μ⁻¹ |w̃(n)|² + (h(n)^T w̃(n−1))² + (1 − μ|h(n)|²) (y(n) − h(n)^T ŵ(n−1))²,    (4.37)

where for any column vector a we have defined |a|² = a^T a.
If we now add up all the equations (4.37) from time n = 1 to time n = N+1, all but the first and last terms of the form μ⁻¹|w̃(n)|² cancel out, and we are left with

    [ μ⁻¹|w̃(N+1)|² + Σ_{n=1}^{N+1} (h(n)^T w̃(n−1))² + Σ_{n=1}^{N+1} (1 − μ|h(n)|²)(y(n) − h(n)^T ŵ(n−1))² ]
    / [ μ⁻¹|w̃(0)|² + Σ_{n=1}^{N+1} v(n)² ] = 1.    (4.38)

Note that the second term in the numerator is just the energy of the prediction errors, since h(n)^T w̃(n−1) = h(n)^T w − d̂(n). Moreover, if we assume that

    μ ≤ 1 / (h(n)^T h(n))   for all n,    (4.39)

then the third term in the numerator of (4.38) is nonnegative, and so we have

    Σ_{n=1}^{N+1} (d̂(n) − h(n)^T w)² / ( μ⁻¹ w^T w + Σ_{n=1}^{N+1} v(n)² ) ≤ 1,    (4.40)

where we have used the fact that |w̃(0)|² = w^T w.
The result in (4.40) is significant, since it shows that if the learning rate satisfies the bound (4.39), then LMS guarantees an energy gain no greater than 1. Since we previously argued that γ_opt ≥ 1, this implies that LMS is H∞-optimal!
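The bound (4.40) can be verified numerically. The following sketch (arbitrary illustrative dimensions and signals) runs LMS with a step size chosen to satisfy (4.39) and computes the realized energy gain from the disturbances to the a priori prediction errors:

```python
import numpy as np

rng = np.random.default_rng(4)
m, T = 5, 200
w = rng.standard_normal(m)                   # unknown "true" weight vector
h = rng.standard_normal((T, m))
v = rng.standard_normal(T)                   # arbitrary disturbance sequence
y = h @ w + v
mu = 1.0 / np.max(np.sum(h * h, axis=1))     # enforces (4.39) over the horizon

w_hat = np.zeros(m)
pred_err_energy = 0.0
for n in range(T):
    d_hat = h[n] @ w_hat                     # a priori prediction h(n)^T w_hat(n-1)
    pred_err_energy += (d_hat - h[n] @ w) ** 2
    w_hat = w_hat + mu * h[n] * (y[n] - d_hat)   # LMS update

gain = pred_err_energy / (w @ w / mu + v @ v)
print(gain)    # bounded by 1 whenever (4.39) holds
```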
4.2.5.1 The Condition on the Learning Rate We have shown that if the learning rate satisfies (4.39), then for prediction errors LMS is H∞-optimal and achieves γ_opt = 1. But what if (4.39) is not satisfied? Is it still true that γ_opt = 1?
To answer this question, suppose that γ_opt = 1, so that for any time i there exists an estimator such that for all disturbances

    Σ_{n=1}^{i+1} (d̂(n) − h(n)^T w)² / ( μ⁻¹ w^T w + Σ_{n=1}^i (y(n) − h(n)^T w)² ) ≤ 1.
Equivalently, for all w,

    μ⁻¹ w^T w + Σ_{n=1}^i (y(n) − h(n)^T w)² − Σ_{n=1}^{i+1} (d̂(n) − h(n)^T w)² ≥ 0.    (4.41)

For this indefinite quadratic form to have a minimum over w, its second derivative with respect to w must be positive semidefinite:

    μ⁻¹ I + Σ_{n=1}^i h(n)h(n)^T − Σ_{n=1}^{i+1} h(n)h(n)^T = μ⁻¹ I − h(i+1)h(i+1)^T ≥ 0.

Since the above matrix has only one eigenvalue that differs from μ⁻¹, this latter condition is simply

    μ⁻¹ − h(i+1)^T h(i+1) ≥ 0.

But since i was an arbitrary time instant, this is precisely the condition (4.39).
We have thus shown that if γ_opt = 1, then (4.39) must hold. Therefore it follows that if (4.39) does not hold, then γ_opt ≠ 1, which implies γ_opt > 1.
4.2.6

We can summarize the results obtained so far in the following theorem.

Theorem 1 Consider the model y(n) = h(n)^T w + v(n), n ≥ 0, and the problem of minimizing the worst-case energy gain from the disturbances to the prediction errors d̃(n) = d̂(n) − h(n)^T w,

    γ² = sup_{w, v ∈ l₂} ‖d̃‖² / ( μ⁻¹ w^T w + ‖v‖² ).

1. If μ ≤ inf_n 1/(h(n)^T h(n)), then the minimum value of γ² is given by γ²_opt = 1, and an H∞-optimal predictor is given by d̂(n) = h(n)^T ŵ(n−1), where ŵ(n) is found from the LMS algorithm (4.6).
2. If μ > inf_n 1/(h(n)^T h(n)), then γ²_opt > 1; an explicit expression for γ²_opt, as the supremum over n of the largest eigenvalue of a matrix built from μ⁻¹I and the quantities Σ_{i=0}^n h(i)h(i)^T, is given in [22]. In this case an H∞-optimal predictor is d̂(n) = h(n)^T ŵ(n−1), where

    ŵ(n) = ŵ(n−1) + P(n)h(n)(y(n) − h(n)^T ŵ(n−1)),    (4.42)

and P(n) satisfies the Riccati recursion

    P(n+1) = P(n) − P(n)h(n)h(n)^T P(n) / ( γ²_opt/(γ²_opt − 1) + h(n)^T P(n)h(n) ),   P(0) = μI.    (4.43)
We should remark that the proof of part 1 of the above theorem has already been given. Proving part 2 requires knowledge of H∞ theory, and so we omit it and refer the interested reader to [22]. Nonetheless, we have included the statement of part 2 for completeness and only mention that (4.42) is identical to the LMS algorithm (4.6) except that the learning rate μI has been replaced by the Riccati variable P(n).
Theorem 1 solves the long-standing problem of finding a rigorous basis for the LMS algorithm. Moreover, it confirms the robustness of the algorithm and gives theoretical justification for its widespread use in adaptive filtering. More to the point, the LMS algorithm is widely used not because it is an approximate least-squares solution (the exact solution, RLS, is readily available), or because it is simple, computationally efficient, or numerically stable (RLS can be made competitive with LMS on all these counts), but rather because it is an algorithm that is robust with respect to disturbance variation, a property of which RLS, for example, cannot boast. In fact, for prediction errors and in the H∞ setting, it is the optimal (hence most robust) algorithm in existence.
Since it is a robust algorithm, LMS exhibits reasonable to good performance over a wide range of environments and operating conditions. However, it cannot hope to compete with algorithms that know the exact statistics of the signals involved and are optimized for them. The point is that LMS will invariably perform within reason no matter what the statistics and modeling errors are.
Finally, the LMS algorithm can be viewed as providing a contractive mapping from the disturbances to the prediction errors. (This is true since the prediction error energy is always less than the disturbance energy.) This property turns out to have significant implications for studying the stability of a wide class of adaptive algorithms. The idea is to represent any adaptive algorithm as the feedback connection of the LMS algorithm and a secondary system and to apply the small-gain theorem from control theory [23, 24]. Since the LMS algorithm is a contraction, the loop gain is equal to the gain of the secondary system, so stability is guaranteed if this gain is less than unity. This approach to stability analysis is expounded in [25].
4.2.7
In the statement of Theorem 1 we were careful to mention that LMS is an H∞-optimal estimator, since we have not yet determined whether or not the H∞ problem has a unique solution. Let us now explore this issue.
We are interested in determining all predictors that yield

    Σ_{n=1}^i (d̂(n) − h(n)^T w)² / ( μ⁻¹ w^T w + Σ_{n=1}^i (y(n) − h(n)^T w)² ) ≤ 1,

that is, all predictors for which

    μ⁻¹ w^T w + Σ_{n=1}^i [d̂(n) − h(n)^T w,  y(n) − h(n)^T w] [ −1  0 ; 0  1 ] [ d̂(n) − h(n)^T w ; y(n) − h(n)^T w ] ≥ 0.
As mentioned several times earlier, the above indefinite quadratic form is nonnegative for all w if, and only if, it has a minimum over w and the d̂(n) can be chosen such that the value at the minimum is nonnegative. Due to our condition on the learning rate (4.39), we always have a minimum over w. Minimizing over w, the value at the minimum is

    Σ_{n=1}^i [d̂(n) − h(n)^T ŵ(n−1),  y(n) − h(n)^T ŵ(n−1)]
        [ −(1 − μ|h(n)|²)   μ|h(n)|² ; μ|h(n)|²   1 + μ|h(n)|² ]⁻¹
        [ d̂(n) − h(n)^T ŵ(n−1) ; y(n) − h(n)^T ŵ(n−1) ],    (4.44)

where ŵ(n) satisfies the recursion

    ŵ(n) = ŵ(n−1) + μ h(n) (y(n) − d̂(n)),   ŵ(0) = 0.    (4.45)

(For a proof of this result and a more general discussion of the minimization of such indefinite quadratic forms, see [22], Theorem 3.4.2 and Lemmas 3.4.1 to 3.4.3.)
Completing the square, the requirement that (4.44) be nonnegative becomes

    −Σ_{n=1}^i (d̂(n) − h(n)^T ŵ(n−1))² / (1 − μ|h(n)|²)
    + Σ_{n=1}^i ( y(n) − h(n)^T ŵ(n−1) − μ|h(n)|² (y(n) − d̂(n)) )² / (1 − μ|h(n)|²) ≥ 0.    (4.46)

Note that due to the learning rate constraint (4.39), the first summation in the above expression is nonpositive. Clearly, one choice that renders (4.46) nonnegative is d̂(n) = h(n)^T ŵ(n−1). Plugging this back into (4.45) readily gives the LMS algorithm. However, is this the only choice that renders (4.46) nonnegative? Obviously not. As long as the sequences

    { (d̂(n) − h(n)^T ŵ(n−1)) / √(1 − μ|h(n)|²) }_{n=1}^i

and

    { (y(n) − h(n)^T ŵ(n−1) − μ|h(n)|² (y(n) − d̂(n))) / √(1 − μ|h(n)|²) }_{n=1}^i

are related by a strictly causal contractive mapping, (4.46) holds.⁹
We thus have shown the following result.
Theorem 2 Consider the setting of Theorem 1 and assume that μ ≤ inf_n 1/|h(n)|². Then all H∞-optimal predictors that achieve γ_opt = 1 are given by

    d̂(n) = h(n)^T ŵ(n−1) + √(1 − μ|h(n)|²) f_n( s(n−1), s(n−2), ... ),    (4.47)

where

    s(k) = ( y(k) − h(k)^T ŵ(k−1) − μ|h(k)|² (y(k) − d̂(k)) ) / √(1 − μ|h(k)|²),

f_n(·, ·, ...) is a strictly causal contractive mapping, and ŵ(n) satisfies the recursion (4.45).
An illustration of the parametrization of Theorem 2 is given in Figure 4.4.
The simplest strictly causal contraction is f_n(·, ·, ...) = 0 for all n, which gives the LMS algorithm. Another simple strictly causal contraction is the identity map
⁹ Two sequences {a(n)} and {b(n)} are said to be related via a strictly causal contraction if, and only if, a(i) = f_i(b(i−1), b(i−2), ...) and Σ_{n=1}^i a(n)² ≤ Σ_{n=1}^i b(n)² for all i.
Figure 4.4
f_n(a(n−1), a(n−2), ...) = a(n−1). This gives

    d̂(n) = h(n)^T ŵ(n−1) + √(1 − μ|h(n)|²) · ( y(n−1) − h(n−1)^T ŵ(n−2) − μ|h(n−1)|² (y(n−1) − d̂(n−1)) ) / √(1 − μ|h(n−1)|²),    (4.48)
which is a perfectly valid H∞-optimal predictor, but one that can behave quite differently from LMS with respect to criteria other than robustness. It can be shown, for example, that the filter (4.48) has particularly poor average performance [26].
In fact, the nonuniqueness of the H∞-optimal filters has led many researchers to attempt to optimize other criteria over the family of these filters. We may refer to the superoptimal criterion, as well as to the mixed H²/H∞ criterion that attempts to find the filter with the best average performance among all those that guarantee a prescribed worst-case bound [26].
However, we shall not go into any of these here. Instead, we will focus on the question of whether there is anything special about the LMS algorithm or whether it is just an arbitrary member of the family of H∞-optimal filters, not worthy of any further distinction. To answer this question, we will now turn our attention to finding a stochastic interpretation of the LMS algorithm.
4.3 A STOCHASTIC INTERPRETATION
Recall that even though the RLS algorithm can be considered as an algorithm that minimizes the deterministic quadratic forms (4.8) or (4.18), under suitable Gaussian assumptions on the signals involved it also yields the ML or MAP estimates. The reason is that the deterministic quadratic form that RLS minimizes can be considered to be the (negative of the) exponent of a suitably chosen Gaussian probability density function (cf. (4.13)).
The LMS algorithm is related to the deterministic quadratic form

    μ⁻¹ w^T w + Σ_{n=1}^i [d̂(n) − h(n)^T w,  y(n) − h(n)^T w] [ −1  0 ; 0  1 ] [ d̂(n) − h(n)^T w ; y(n) − h(n)^T w ].    (4.49)

Indeed, referring to our derivation of all H∞-optimal filters in Section 4.2.7, we first minimized the above quadratic form to obtain the quadratic form (4.44), or equivalently (4.46). Inspection of (4.46) shows that the choice d̂(n) = h(n)^T ŵ(n−1), which leads to the LMS algorithm, recursively maximizes this quadratic form.
In other words, LMS performs the following optimization:

    max_{d̂(1),...,d̂(i)} min_w ( μ⁻¹ w^T w + Σ_{n=1}^i [d̂(n) − h(n)^T w,  y(n) − h(n)^T w] [ −1  0 ; 0  1 ] [ d̂(n) − h(n)^T w ; y(n) − h(n)^T w ] ),    (4.50)

where the maximization is done recursively, that is, d̂(1) = 0, d̂(2) depends only on y(1), d̂(3) depends only on y(1), y(2), and so on.
Now at first sight it appears that (4.50) cannot be related to a stochastic problem, since the quadratic form is indefinite and so cannot be the exponent of a Gaussian probability density function. Moreover, we have a minimization over w but a maximization over the d̂(n). However, if we define
    J ≜ μ⁻¹ w^T w + Σ_{n=1}^i (y(n) − h(n)^T w)² − Σ_{n=1}^i (d̂(n) − h(n)^T w)²,

then we can identify the first two terms as the (negative of the) exponent of a Gaussian density. More specifically, using (4.17), we can write

    e^{−J/2} = (2π)^{(i+m)/2} μ^{m/2} p( w, y(1), ..., y(i) | h(1), ..., h(i) ) exp( ½ Σ_{n=1}^i (d̂(n) − h(n)^T w)² ),

where we have assumed that w and the v(n) are zero-mean independent Gaussian random variables with variance μI and unity, respectively. Therefore we may also write

    ∫ e^{−J/2} dw = (2π)^{(i+m)/2} μ^{m/2} p( y(1), ..., y(i) | h(1), ..., h(i) ) · E[ exp( ½ Σ_{n=1}^i (d̂(n) − h(n)^T w)² ) | y(1), ..., y(i), h(1), ..., h(i) ].    (4.51)
Now it is not hard to show that, for any p × p matrix A > 0,

    ∫ exp( −½ [a^T  b^T] [ A  B ; B^T  C ] [a ; b] ) da = (2π)^{p/2} (det A)^{−1/2} exp( −½ min_a [a^T  b^T] [ A  B ; B^T  C ] [a ; b] ).

Applying this identity to the integral in (4.51), whose relevant Hessian with respect to w is positive definite by virtue of (4.39), yields

    E[ exp( ½ Σ_{n=1}^i (d̂(n) − h(n)^T w)² ) | y(1), ..., y(i), h(1), ..., h(i) ] = c · exp( −½ min_w J ),    (4.52)

where the constant c depends on the data and on A = μ⁻¹ I + Σ_{n=1}^i h(n)h(n)^T, but not on the d̂(n). Since the left-hand side of (4.52) depends on the d̂(n) only through J, we conclude from (4.50) that the LMS algorithm recursively solves
    min_{d̂(1),...,d̂(i)} E[ exp( ½ Σ_{n=1}^i (d̂(n) − h(n)^T w)² ) | y(1), ..., y(i), h(1), ..., h(i) ].    (4.53)

4.3.1 Risk-Sensitive Optimality
We can now formalize the result we have obtained in the following theorem.

Theorem 3 Consider the model y(n) = h(n)^T w + v(n), n ≥ 0, where w and the v(n) are zero-mean independent Gaussian random variables with variance μI and unity, respectively. Assume further that (4.39) holds. Then the LMS algorithm (4.6) recursively solves the problem

    min_{d̂(1),...,d̂(i)} E[ exp( ½ Σ_{n=1}^i (d̂(n) − h(n)^T w)² ) | y(1), ..., y(i), h(1), ..., h(i) ].

In other words, rather than minimizing the mean-square prediction error

    E[ Σ_{n=1}^i (d̂(n) − h(n)^T w)² ],    (4.54)

which is what the RLS algorithm does, the LMS algorithm minimizes the mean-exponential-square prediction error (4.53).
The exponential quadratic cost (4.53) was first introduced in the control theory context by Jacobson [27]. It has also been championed in statistics by Whittle, who calls it the risk-sensitive criterion [28]. In fact, Whittle introduces a whole family of such criteria, parametrized by γ > 0:

    min_{d̂(1),...,d̂(i)} E[ exp( (1/(2γ²)) Σ_{n=1}^i (d̂(n) − h(n)^T w)² ) ].
Whittle refers to estimators that minimize this criterion as risk-averse. The reason is that, compared to the mean-square criterion (4.54), the risk-sensitive criterion puts a much larger penalty (in fact, an exponentially larger penalty) on large values of the prediction error. In other words, the criterion is more concerned with the occasional occurrence of large values of prediction error than with the frequent occurrence of moderate values of error. Some further intuition regarding the criterion (4.53) can be obtained by expanding the exponential function and noting that the criterion penalizes all the moments of the prediction error, not just the second moment.
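The difference between the two criteria is easy to illustrate: two error sequences with identical variance, one Gaussian and one dominated by rare large spikes, receive essentially the same mean-square cost but very different exponential costs. A small sketch (the distributions are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200000

# Two zero-mean prediction-error sequences with identical variance 0.25:
e_gauss = 0.5 * rng.standard_normal(n)
spikes = rng.uniform(size=n) < 0.04              # rare large errors
e_heavy = np.where(spikes, 2.5 * np.sign(rng.standard_normal(n)), 0.0)

mse = (np.mean(e_gauss**2), np.mean(e_heavy**2))
risk = (np.mean(np.exp(e_gauss**2 / 2)), np.mean(np.exp(e_heavy**2 / 2)))
print(mse)     # essentially equal
print(risk)    # the rare-spike sequence pays a much larger exponential cost
```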
The smaller the parameter γ is, the more risk-averse the estimator, since the exponential in the criterion grows more steeply. However, it turns out that γ cannot be made arbitrarily small, and there exists a critical value of γ (namely γ_opt) below which no estimator renders the risk-sensitive cost finite. In our problem, the critical value is γ_opt = 1, since if γ < 1 were possible, it would mean that an H∞ estimator with γ < 1 is possible, which as we know is not the case. This is essentially the second statement of Theorem 3.
We should remark that the risk-sensitive optimality of the LMS algorithm fits very nicely with the robustness properties we described earlier. Essentially, not tolerating large values of error is a way of saying that the algorithm is robust. In any event, the risk-sensitive optimality of LMS is a very interesting property that is not shared by any of the other H∞-optimal filters of Theorem 2. It is important for two reasons: first, it gives a nonobvious stochastic interpretation to the LMS algorithm; second, it further emphasizes its special nature.
4.4
We have seen that the LMS algorithm outperforms the RLS algorithm when we have nonstationary signals and need to track time variations of the weight vector w. At first sight, one may argue that comparison of the tracking abilities of these two algorithms is not fair, since the LMS algorithm (4.6) has a constant gain vector, whereas the RLS algorithm (4.10)–(4.11) has a vanishing-to-zero gain vector and so can have no hope of tracking a time-varying w. However, the comparison is fair if we consider that both algorithms deal with the same time-invariant model

    y(n) = h(n)^T w + v(n),

with the only difference being that the RLS algorithm assumes that the disturbance sequence v(n) is stationary and white (also Gaussian) and finds the linear least-mean-squares (also least-mean-squares) estimate, whereas the LMS algorithm assumes only that the disturbance sequence is unknown and finds the H∞-optimal estimate.
The point is that the RLS algorithm explicitly uses the fact that w is constant and so
forces the gain vector to go to zero. The LMS algorithm, on the other hand, by virtue
of its robustness to modeling errors and disturbance variation, safeguards us against
a time-varying w by enforcing a nonzero gain vector.
4.4.1
Exponential Windowing
The fact that the vanishing-to-zero gain vector of RLS leads to poor tracking performance is well recognized in the literature, and so various modifications to RLS that circumvent this shortcoming have been proposed. The most common one is to use an exponential window and to replace the deterministic cost function (4.18) with

    \min_w \Big[ w^T (\sigma^2 \Pi_0)^{-1} w + \sum_{n=1}^{N} \lambda^{N-n} \big( y(n) - h(n)^T w \big)^2 \Big],        (4.55)

where 0 < λ ≤ 1 is the exponential forgetting factor. Minimizing (4.55) over w yields

    \hat{w}(i) = \hat{w}(i-1) + \frac{P(i-1) h(i)}{1 + h(i)^T P(i-1) h(i)} \big( y(i) - h(i)^T \hat{w}(i-1) \big), \quad \hat{w}(0) = 0,        (4.56)

which is the recursion we were pursuing. P(i) itself satisfies a Riccati recursion that can be computed to be

    P(i) = \lambda^{-1} \Big[ P(i-1) - \frac{P(i-1) h(i) h(i)^T P(i-1)}{1 + h(i)^T P(i-1) h(i)} \Big], \quad P(0) = \sigma^2 \Pi_0.        (4.57)
The reason the gain vector in (4.56) does not go to zero, unlike that of (4.10), is that, due to the factor λ^{-1} > 1 in (4.57), the matrix P(i) does not go to zero.
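As a concrete illustration of this point, the recursions (4.56)–(4.57) can be sketched in a few lines of code. This is a minimal numerical sketch, not part of the text's derivation; NumPy, the Gaussian regressors, the sizes, and the function name `ewrls` are all illustrative assumptions.

```python
import numpy as np

def ewrls(H, y, lam=0.9, P0=None):
    """Exponentially windowed RLS, cf. (4.56)-(4.57).

    H : (n_samples, dim) array of regressors h(i)^T
    y : (n_samples,) observations y(i) = h(i)^T w + v(i)
    """
    n_samples, dim = H.shape
    P = np.eye(dim) if P0 is None else P0.copy()  # P(0) = sigma^2 * Pi_0
    w = np.zeros(dim)                             # w_hat(0) = 0
    for i in range(n_samples):
        h = H[i]
        g = P @ h / (1.0 + h @ P @ h)             # gain vector in (4.56)
        w = w + g * (y[i] - h @ w)
        # Riccati recursion (4.57): the factor 1/lam > 1 keeps P away from zero
        P = (P - np.outer(g, h @ P)) / lam
    return w, P

rng = np.random.default_rng(0)
w_true = np.array([1.0, -0.5, 0.25])
H = rng.standard_normal((500, 3))
y = H @ w_true + 0.01 * rng.standard_normal(500)

w_win, P_win = ewrls(H, y, lam=0.9)   # exponential window
w_rls, P_rls = ewrls(H, y, lam=1.0)   # plain RLS for comparison
# For lam < 1, trace(P) settles at a nonzero level, so the gain stays alive;
# for lam = 1, P (and hence the gain) decays roughly like 1/i.
print(np.trace(P_win) > 5 * np.trace(P_rls))
print(np.allclose(w_win, w_true, atol=0.05))
```

Both runs recover w on this stationary example; the difference shows up in P, and hence in the ability to keep adapting should w later drift.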
Therefore using an exponential window in (4.55) alleviates the problem of a vanishing-to-zero gain vector and improves the tracking performance. However, it is also possible to apply the exponential window to the H∞ setting and to obtain a robust version of (4.56)–(4.57). All one needs to do is replace problem (4.33) with
    \gamma_{opt}^2 = \min_{\hat{d}(1), \ldots, \hat{d}(N+1)} \; \max_{w} \; \frac{\sum_{n=1}^{N+1} \lambda^{N+1-n} \big( \hat{d}(n) - h(n)^T w \big)^2}{\mu^{-1} w^T w + \sum_{n=1}^{N} \lambda^{N-n} \big( y(n) - h(n)^T w \big)^2}.        (4.58)
It turns out that finding an explicit formula for γ_opt in (4.58) is not possible. However, in [29] (Section 11.3.1), the following bound for γ_opt is obtained. Let

    h \stackrel{\Delta}{=} \sup_i h(i)^T h(i), \qquad R_\lambda(i) = \sum_{j=1}^{i} \lambda^{i-j} h(j) h(j)^T.        (4.59)
Then

    \gamma_{opt}^2 \le \sup_i \frac{h + \bar{\sigma}\big(R_\lambda(i)\big)}{\lambda^i / \mu + \underline{\sigma}\big(R_\lambda(i)\big)},        (4.60)

where \bar{\sigma}(\cdot) and \underline{\sigma}(\cdot) denote the largest and smallest singular values, respectively, and the corresponding estimator is

    \hat{w}(i) = \hat{w}(i-1) + \frac{P(i-1) h(i)}{1 + h(i)^T P(i-1) h(i)} \big( y(i) - h(i)^T \hat{w}(i-1) \big), \quad \hat{w}(0) = 0,        (4.61)

where P(i) satisfies a Riccati recursion analogous to (4.57).        (4.62)
The above estimator is one of many possible level-γ estimators. But, as with LMS, it has the distinction of being the risk-sensitive optimal solution.
4.4.2
The exponential windowing scheme just described is really an ad hoc way of dealing with a time-varying w. In effect, what we are doing is assuming a constant w but assigning (exponentially) higher weight to more recent observations. A more fundamental approach would be to introduce a time-varying weight vector into the model directly. In other words, we should assume that the observed sequence is given by

    y(n) = h(n)^T w(n) + v(n), \qquad n > 0.        (4.63)

The question that then arises is how to describe the time variation of w(n). A reasonable assumption is that w(n) itself satisfies the recursion

    w(n+1) = \alpha \, w(n) + \sqrt{\mu (1 - \alpha^2)} \, u(n), \qquad n > 0,        (4.64)

where 0 < α < 1 and u(n) is a disturbance vector often referred to as process noise. The reason for the coefficient \sqrt{\mu(1-\alpha^2)} is that, if we assume that

    E\{w(0) w(0)^T\} = \mu I, \qquad E\{u(n) u(m)^T\} = I \, \delta_{mn},        (4.65)

where δ_{mn} is the Kronecker delta, then the covariance matrix of the weight vector w(n) is constant for all time:

    E\{w(n) w(n)^T\} = \mu I, \qquad n > 0.

In other words, even though the weight vector is time-varying, its covariance matrix is constant for all time. The parameter α clearly determines the rate of the time variation of w(n): the smaller it is, the faster the time variation.
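The constant-covariance claim is easy to check by Monte Carlo simulation. The sketch below (NumPy; the dimension, number of trials, and horizon are illustrative assumptions) propagates an ensemble through (4.64) and verifies that the sample covariance of w(n) stays near μI:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, alpha, dim, trials, steps = 0.5, 0.9, 4, 100_000, 30

# Initial ensemble with E{w(0) w(0)^T} = mu * I, as in (4.65)
w = np.sqrt(mu) * rng.standard_normal((trials, dim))
for _ in range(steps):
    u = rng.standard_normal((trials, dim))             # E{u u^T} = I, white
    w = alpha * w + np.sqrt(mu * (1 - alpha**2)) * u   # recursion (4.64)

cov = w.T @ w / trials        # sample covariance after `steps` iterations
print(np.allclose(cov, mu * np.eye(dim), atol=0.02))
```

Any other scaling of the process noise would make the weight power grow or shrink over time; the coefficient \sqrt{\mu(1-\alpha^2)} is exactly the choice that balances the leak α.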
For the time-varying model (4.63)–(4.64), the estimator of the weight vector takes the form

    \hat{w}(i) = \alpha \Big[ \hat{w}(i-1) + \frac{P(i-1) h(i)}{1 + h(i)^T P(i-1) h(i)} \big( y(i) - h(i)^T \hat{w}(i-1) \big) \Big], \quad \hat{w}(0) = 0,        (4.66)

    P(i) = \alpha^2 \Big[ P(i-1) - \frac{P(i-1) h(i) h(i)^T P(i-1)}{1 + h(i)^T P(i-1) h(i)} \Big] + \mu (1 - \alpha^2) I, \quad P(0) = \mu I.        (4.67)

(For a detailed discussion of the derivation of the above equations see, for example, [8, 2].) Note that, due to the term μ(1 − α²)I in the Riccati recursion, P(i) is positive definite for all times, so the gain vector in (4.66) cannot go to zero. Therefore our introduction of the time-varying model (4.64) automatically leads to an algorithm capable of tracking w(n).
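The role of the injection term μ(1 − α²)I is easy to see numerically. In the sketch below (NumPy; the constant true weight vector and all sizes are illustrative assumptions), P(i) from (4.67) keeps all its eigenvalues bounded away from zero, so the gain in (4.66) never collapses:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, alpha, dim = 0.1, 0.95, 3
P = mu * np.eye(dim)                  # P(0) = mu * I
w_hat = np.zeros(dim)                 # w_hat(0) = 0
for i in range(1000):
    h = rng.standard_normal(dim)
    y = h @ np.ones(dim) + 0.01 * rng.standard_normal()   # illustrative data
    g = P @ h / (1.0 + h @ P @ h)
    w_hat = alpha * (w_hat + g * (y - h @ w_hat))         # recursion (4.66)
    # Riccati recursion (4.67): the mu*(1 - alpha^2)*I term re-inflates P
    P = alpha**2 * (P - np.outer(g, h @ P)) + mu * (1 - alpha**2) * np.eye(dim)

eigs = np.linalg.eigvalsh(P)
print(eigs.min() >= mu * (1 - alpha**2))   # P(i) stays positive definite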
4.4.2.1 The Leaky LMS Algorithm  In the H∞ setting, the disturbance signal u(n) in (4.64) is assumed to be unknown, and the predicted values of the uncorrupted output h(n)^T w(n) are determined via the criterion

    \gamma_{opt}^2 = \min_{\hat{d}(1), \ldots, \hat{d}(N+1)} \; \max_{w(0), \, u \in l_2} \; \frac{\sum_{n=1}^{N+1} \big( \hat{d}(n) - h(n)^T w(n) \big)^2}{\mu^{-1} w(0)^T w(0) + \sum_{n=1}^{N} u(n)^T u(n) + \sum_{n=1}^{N} \big( y(n) - h(n)^T w(n) \big)^2}.        (4.68)

Note that we now have three disturbances (w(0), the u(n), and the v(n)), which is why we have three terms in the denominator of the above energy gain.
Using H∞ theory, one can show the following result.
Theorem 4  Consider the model (4.63)–(4.64) and assume that μ ≤ 1/(h(i)^T h(i)) for all i. Then the optimal prediction error energy gain γ_opt², found by solving (4.68), satisfies

    \gamma_1^2 \le \gamma_{opt}^2 \le 1,        (4.69)

where γ₁² is the infimum, over all i, of the largest positive solution of a certain quadratic equation in γ².

Note that the above theorem implies that the optimal energy gain can be less than 1.^{10} Although it is possible to give the expression for an arbitrary level-γ predictor, we shall not do so here. Rather, we will give an interesting, though slightly suboptimal, predictor that achieves γ = 1.
To this end, a simple variation of the LMS algorithm that has been proposed for tracking applications is the leaky LMS algorithm,

    \hat{w}(i) = \alpha \big[ \hat{w}(i-1) + \mu \, h(i) \big( y(i) - h(i)^T \hat{w}(i-1) \big) \big], \qquad \hat{w}(0) = 0.        (4.70)

Note that, compared to the LMS algorithm (4.6), the leaky LMS algorithm attenuates the weight vector estimate by the factor 0 < α < 1. This allows the algorithm to forget earlier data, and thereby to track better.
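A small simulation shows the leaky LMS algorithm tracking the drifting weights of (4.63)–(4.64). Everything numerical here is an illustrative assumption: ±1 regressors (so that h(n)^T h(n) equals the dimension and the step-size condition is easy to enforce), α close to 1, and low-level measurement noise.

```python
import numpy as np

rng = np.random.default_rng(3)
dim, steps, alpha = 4, 5000, 0.999
mu = 0.1                               # mu < 1/(h^T h) = 1/dim holds below

w = np.sqrt(mu) * rng.standard_normal(dim)   # drifting true weights, cf. (4.65)
w_hat = np.zeros(dim)
err = []
for n in range(steps):
    h = rng.choice([-1.0, 1.0], size=dim)    # h^T h = dim = 4, so mu < 1/4
    v = 0.01 * rng.standard_normal()
    y = h @ w + v                            # observation model (4.63)
    w_hat = alpha * (w_hat + mu * h * (y - h @ w_hat))   # leaky LMS (4.70)
    err.append(np.sum((w - w_hat) ** 2))
    u = rng.standard_normal(dim)
    w = alpha * w + np.sqrt(mu * (1 - alpha**2)) * u     # weight drift (4.64)

# After the transient, the squared tracking error is small compared with
# the weight power E|w|^2 = mu * dim
print(np.mean(err[1000:]) < mu * dim)
```

The estimate never settles (the target keeps moving), but it stays locked onto the drifting weight vector, which is exactly the behavior the leak is designed to preserve.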
Let us now study the consequences of algorithm (4.70) for the time-varying model (4.63)–(4.64). If we define the prediction error of the weight vector at time i+1, using the observations up to time i, as \tilde{w}(i) = w(i+1) - \hat{w}(i), then subtracting (4.70) from (4.64) allows us to write

    \tilde{w}(i) = \alpha \big( I - \mu \, h(i) h(i)^T \big) \tilde{w}(i-1) - \alpha \mu \, h(i) v(i) + \sqrt{\mu (1 - \alpha^2)} \, u(i),        (4.71)

where we have made use of the fact that y(i) - h(i)^T \hat{w}(i-1) = h(i)^T \tilde{w}(i-1) + v(i). This now allows us to write the mapping from the variables \{\mu^{-1/2} \tilde{w}(i-1), v(i), u(i)\} to the variables \{\mu^{-1/2} \tilde{w}(i), h(i)^T \tilde{w}(i-1)\} as

    \begin{bmatrix} \mu^{-1/2} \tilde{w}(i) \\ h(i)^T \tilde{w}(i-1) \end{bmatrix}
    = \underbrace{\begin{bmatrix} \alpha \big( I - \mu \, h(i) h(i)^T \big) & -\alpha \mu^{1/2} h(i) & \sqrt{1 - \alpha^2} \, I \\ \mu^{1/2} h(i)^T & 0 & 0 \end{bmatrix}}_{A(i)}
    \begin{bmatrix} \mu^{-1/2} \tilde{w}(i-1) \\ v(i) \\ u(i) \end{bmatrix}.        (4.72)
We will now show that the mapping A(i) is a contraction. Indeed, from the above equation it follows that

    I - A(i) A(i)^* = \begin{bmatrix} \alpha^2 \mu \big( 1 - \mu |h(i)|^2 \big) h(i) h(i)^T & -\alpha \mu^{1/2} \big( 1 - \mu |h(i)|^2 \big) h(i) \\ -\alpha \mu^{1/2} \big( 1 - \mu |h(i)|^2 \big) h(i)^T & 1 - \mu |h(i)|^2 \end{bmatrix} \ge 0,

where the last inequality follows from the fact that the (2, 2) block entry satisfies 1 - \mu |h(i)|^2 \ge 0 and its Schur complement is zero.
^{10} The reason for this is the existence of the exponential decay factor 0 < α < 1.
The fact that A(i) is a contraction implies that the norm of the output variables is less than the norm of the input variables, that is,

    \mu^{-1} |\tilde{w}(i)|^2 + |h(i)^T \tilde{w}(i-1)|^2 \le \mu^{-1} |\tilde{w}(i-1)|^2 + |v(i)|^2 + |u(i)|^2.        (4.73)
Adding all of the above inequalities from time i = 1 to time i = N implies that

    \mu^{-1} |\tilde{w}(N)|^2 + \sum_{n=1}^{N} |h(n)^T \tilde{w}(n-1)|^2 \le \mu^{-1} |\tilde{w}(0)|^2 + \sum_{n=1}^{N} |u(n)|^2 + \sum_{n=1}^{N} |v(n)|^2,        (4.74)

so that

    \frac{\sum_{n=1}^{N} |h(n)^T \tilde{w}(n-1)|^2}{\mu^{-1} |\tilde{w}(0)|^2 + \sum_{n=1}^{N} |u(n)|^2 + \sum_{n=1}^{N} |v(n)|^2} \le 1        (4.75)

for all possible disturbances \tilde{w}(0), u(n), and v(n). In other words, we have shown that the leaky LMS algorithm guarantees a worst-case prediction error energy gain of unity.
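The bound (4.75) is deterministic and can be checked directly in code: whatever disturbances are drawn, the accumulated prediction-error energy never exceeds the accumulated disturbance energy as long as μ < 1/(h(n)^T h(n)). The random draws and sizes below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
dim, steps, alpha, mu = 3, 400, 0.95, 0.2      # mu < 1/dim = 1/3

ratios = []
for trial in range(20):
    w = rng.standard_normal(dim)               # w(1); w~(0) = w(1) since w_hat(0) = 0
    w_hat = np.zeros(dim)
    num = 0.0                                  # sum of |h(n)^T w~(n-1)|^2
    den = np.sum(w ** 2) / mu                  # mu^{-1} |w~(0)|^2
    for n in range(steps):
        h = rng.choice([-1.0, 1.0], size=dim)  # h^T h = dim
        v = rng.standard_normal()
        y = h @ w + v                          # observation (4.63)
        num += (h @ (w - w_hat)) ** 2          # prediction error energy
        w_hat = alpha * (w_hat + mu * h * (y - h @ w_hat))  # leaky LMS (4.70)
        u = rng.standard_normal(dim)
        den += np.sum(u ** 2) + v ** 2         # disturbance energy
        w = alpha * w + np.sqrt(mu * (1 - alpha**2)) * u    # drift (4.64)
    ratios.append(num / den)

print(max(ratios) <= 1.0)                      # energy gain of at most unity
```

No averaging is involved: every single trial obeys the bound, which is the hallmark of a worst-case (H∞-type) guarantee.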
Theorem 5  Consider the model (4.63)–(4.64) and assume that μ < 1/(h(i)^T h(i)) for all i. Then the leaky LMS algorithm (4.70) guarantees

    \frac{\sum_{n=1}^{N} |h(n)^T \tilde{w}(n-1)|^2}{\mu^{-1} |\tilde{w}(0)|^2 + \sum_{n=1}^{N} |u(n)|^2 + \sum_{n=1}^{N} |v(n)|^2} \le 1.

Moreover,

    \min_{\hat{d}(1), \ldots, \hat{d}(N+1)} \; \max_{w(0), \, u \in l_2} \; \frac{\sum_{n=1}^{N+1} \big( \hat{d}(n) - h(n)^T w(n) \big)^2}{\mu^{-1} w(0)^T w(0) + \sum_{n=1}^{N} u(n)^T u(n) + \sum_{n=1}^{N} \big( y(n) - h(n)^T w(n) \big)^2} \le 1,

and the leaky LMS predictions solve the risk-sensitive problem

    \min_{\hat{d}(1), \ldots, \hat{d}(N+1)} E_{| y(1), \ldots, y(N); \, h(1), \ldots, h(N)} \exp\Big( \sum_{n=1}^{N+1} \big( \hat{d}(n) - h(n)^T w(n) \big)^2 \Big).

The first statement of the above theorem follows from the arguments preceding the theorem. The second statement follows from (4.69). The third statement can be proven using an argument similar to the one we presented for the risk-sensitive optimality of the LMS algorithm.
In any event, Theorem 5 demonstrates the robustness of the leaky LMS algorithm (4.70) for applications where the unknown weight vector varies with time. It also demonstrates the robustness of the LMS algorithm itself for such applications, provided that the time variation of the weight vector is slow, that is, α ≈ 1, since in this case there is little difference between the LMS algorithm (4.6) and its leaky version (4.70).
4.5
FURTHER REMARKS
In addition to yielding a new interpretation of the LMS algorithm and providing it with a rigorous basis, the results described so far have lent themselves to various generalizations and to several new results. We close this chapter by listing some of these.
4.5.1
In this chapter we have focused on prediction errors, that is, on predicting the uncorrupted output h(n)^T w. It is also possible to look at the filtered errors

    h(n)^T w - h(n)^T \hat{w}(n),        (4.76)

that is, at the error in estimating the uncorrupted output h(n)^T w using observations up to and including the current time instant n. In this case, the H∞-optimal algorithm turns out to be the normalized LMS algorithm

    \hat{w}(n) = \hat{w}(n-1) + \frac{\mu}{1 + \mu |h(n)|^2} \, h(n) \big( y(n) - h(n)^T \hat{w}(n-1) \big),        (4.77)

which is a commonly used variant of the LMS algorithm. It turns out that the optimal energy gain is γ_opt² = 1 and that this is true irrespective of the learning rate μ. Results such as the nonuniqueness of the H∞-optimal estimators, the risk-sensitive optimality, tracking properties, and so on, all extend to the normalized LMS algorithm. For a proof of these results the reader may refer to [30].
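For completeness, here is the normalized LMS recursion (4.77) in code on a simple system-identification task; the setup (NumPy, Gaussian regressors, the particular μ) is an illustrative assumption, not part of the text.

```python
import numpy as np

rng = np.random.default_rng(5)
dim, steps, mu = 4, 2000, 0.5
w_true = rng.standard_normal(dim)
w_hat = np.zeros(dim)
for n in range(steps):
    h = rng.standard_normal(dim)
    y = h @ w_true + 0.01 * rng.standard_normal()
    # normalized LMS (4.77): the effective step is mu / (1 + mu * |h(n)|^2)
    w_hat = w_hat + (mu / (1.0 + mu * (h @ h))) * h * (y - h @ w_hat)

print(np.allclose(w_hat, w_true, atol=0.05))
```

The normalization makes the effective step size shrink automatically on large regressors, which is why the γ = 1 guarantee here holds for any μ, unlike the plain LMS case.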
4.5.2
In many applications, the LMS algorithm is used with a time-varying step size (or learning rate), that is,

    \hat{w}(n) = \hat{w}(n-1) + \mu(n) \, h(n) \big( y(n) - h(n)^T \hat{w}(n-1) \big).        (4.78)

In this case, it is straightforward to show that if the vectors μ(n)^{1/2} h(n) are exciting and μ(n) h(n)^T h(n) ≤ 1 for all n, then the above LMS algorithm with a time-varying
learning rate guarantees

    \min_{\hat{d}(1), \ldots, \hat{d}(N+1)} \; \max_{w} \; \frac{\sum_{n=1}^{N+1} \mu(n) \big( \hat{d}(n) - h(n)^T w \big)^2}{w^T w + \sum_{n=1}^{N} \mu(n) \big( y(n) - h(n)^T w \big)^2} \le 1.        (4.79)

4.5.3
The robustness of the RLS algorithm can also be examined in this framework. The optimal prediction error energy gain for RLS,

    \gamma_{opt}^2 = \min_{\hat{d}(1), \ldots, \hat{d}(N+1)} \; \max_{w} \; \frac{\sum_{n=1}^{N+1} \big( \hat{d}(n) - h(n)^T w \big)^2}{\mu^{-1} w^T w + \sum_{n=1}^{N} \big( y(n) - h(n)^T w \big)^2},

satisfies the bounds

    \Big( \sqrt{1 + \mu h} - 1 \Big)^2 \le \gamma_{opt}^2 \le \Big( \sqrt{1 + \mu h} + 1 \Big)^2,        (4.80)

where h = sup_n h(n)^T h(n). Thus, unlike the LMS algorithm, where the optimal energy gain was independent of μ and the h(n), for RLS it depends strongly on them. Moreover, for large values of μ, the upper and lower bounds in (4.80) grow as √μ. This is reminiscent of the robustness properties of LMS, where the learning rate had to be small enough to guarantee H∞ optimality. More importantly, it shows that the unregularized least-squares problem (4.8) (corresponding to μ = ∞) can be highly nonrobust with respect to prediction errors.
4.5.4
Mixed H²/H∞ Problems
One can also consider mixed H²/H∞ problems, in which one solves

    \min_{\hat{d}(1), \ldots, \hat{d}(N)} \sum_{n=1}^{N} \big( \hat{d}(n) - h(n)^T \hat{w}_{rls}(n-1) \big)^2,
subject to

    \sum_{n=1}^{N} \begin{bmatrix} \hat{d}(n) - h(n)^T \hat{w}(n-1) \\ y(n) - h(n)^T \hat{w}(n-1) \end{bmatrix}^T \begin{bmatrix} 1 - \mu |h(n)|^2 & \mu |h(n)|^2 \\ \mu |h(n)|^2 & 1 - \mu |h(n)|^2 \end{bmatrix}^{-1} \begin{bmatrix} \hat{d}(n) - h(n)^T \hat{w}(n-1) \\ y(n) - h(n)^T \hat{w}(n-1) \end{bmatrix} \ge 0,        (4.81)
where \hat{w}_{rls}(n) denotes the RLS estimate of the weight vector. The above problem is a quadratic program and can be readily solved.
4.5.5
Nonlinear Problems
The results presented in this chapter are for linear adaptive filters. They can be generalized, to some extent, to nonlinear adaptive filters (such as neural networks) if one linearizes these models around some suitable operating point. Using this approach, it can be shown (see [33]) that, for nonlinear problems, instantaneous-gradient-based methods (such as backpropagation [4]) are locally H∞-optimal. This means that if the initial estimate of the weight vector is close enough to its true value, and if the disturbances are small enough, then the maximum energy gain from the disturbances to the output prediction errors is arbitrarily close to 1. Global H∞-optimal filters can also be found in the nonlinear case, but they have the drawback of being infinite-dimensional (see [34]).
4.6
CONCLUSION
In this chapter we showed that the LMS algorithm is H∞-optimal. This result solves the long-standing problem of finding a rigorous basis for the LMS algorithm and also confirms its robustness. We have argued that, compared to exact least-squares solutions, the wide use of the LMS algorithm over a broad range of applications is best explained by its robustness to modeling errors and disturbance variation, rather than by its simplicity, computational efficiency, or numerical stability (for all of which
REFERENCES
13. M. Dahleh and J. Pearson, l¹-optimal compensators for continuous-time systems, IEEE Transactions on Automatic Control, vol. 32, pp. 889–895, 1987.
14. G. Zames, Feedback and optimal sensitivity: Model reference transformations, multiplicative semi-norms and approximate inverses, IEEE Transactions on Automatic Control, vol. 26, pp. 301–320, 1981.
15. A. Saberi, P. Sannuti, and B. Chen, H² Optimal Control. Prentice-Hall, Englewood Cliffs, NJ, 1995.
16. P. Duren, Theory of H^p Spaces. Dover, New York, 2000.
17. B. Francis, A Course in H∞ Control Theory. Springer-Verlag, New York, 1987.
18. A. Feintuch, Robust Control Theory in Hilbert Space. Springer-Verlag, New York, 1998.
19. J. Doyle, K. Glover, P. Khargonekar, and B. Francis, State-space solutions to standard H² and H∞ control problems, IEEE Transactions on Automatic Control, vol. 34, pp. 831–847, 1989.
20. H. Kimura, Chain-Scattering Approach to H∞ Control. Birkhäuser, Boston, 1997.
21. T. Basar and P. Bernhard, H∞-Optimal Control and Related Minimax Design Problems: A Dynamic Game Approach. Birkhäuser, Boston, 1991.
22. B. Hassibi, A. Sayed, and T. Kailath, Indefinite-Quadratic Estimation and Control: A Unified Approach to H² and H∞ Theories. SIAM, Philadelphia, 1999.
23. H. Khalil, Nonlinear Systems. Prentice-Hall, Englewood Cliffs, NJ, 2001.
24. M. Vidyasagar, Nonlinear System Analysis. SIAM, Philadelphia, 2002.
25. A. Sayed and M. Rupp, Error-energy bounds for adaptive gradient algorithms, IEEE Transactions on Signal Processing, vol. 44, pp. 1982–1989, 1996.
26. B. Halder, B. Hassibi, and T. Kailath, Mixed H²/H∞ estimation: Preliminary analytic characterization and a numerical solution, Proceedings of the 13th World Congress, International Federation of Automatic Control, Vol. J: Identification II, Discrete Event Systems, pp. 37–42, 1997.
27. D. Jacobson, Optimal stochastic linear systems with exponential performance criteria and their relation to deterministic games, IEEE Transactions on Automatic Control, vol. 18, pp. 124–131, 1973.
28. P. Whittle, Risk-Sensitive Optimal Control. Wiley, New York, 1990.
29. B. Hassibi, Indefinite Metric Spaces in Estimation, Control and Adaptive Filtering. Ph.D. thesis, Stanford University, 1996.
30. B. Hassibi, A. Sayed, and T. Kailath, H∞-optimality of the LMS algorithm, IEEE Transactions on Signal Processing, vol. 44, pp. 267–280, 1996.
31. B. Hassibi and T. Kailath, H∞ bounds for least-squares estimators, IEEE Transactions on Automatic Control, vol. 46, pp. 309–314, 2001.
32. B. Hassibi and T. Kailath, On adaptive filtering with combined least-mean-squares and H∞ criteria, Conference Record of the Thirty-First Asilomar Conference on Signals, Systems and Computers, vol. 2, pp. 1570–1574, 1998.
33. B. Hassibi, A. Sayed, and T. Kailath, H∞-optimality criteria for LMS and backpropagation, Advances in Neural Information Processing Systems, vol. 6, pp. 351–359, 1994.
34. B. Hassibi and T. Kailath, H∞-optimal training algorithms and their relation to backpropagation, Advances in Neural Information Processing Systems, vol. 7, pp. 191–199, 1995.
JOHN HOMER
Department of Computer Science and Electrical Engineering, The University of Queensland, Brisbane, Australia

IVEN MAREELS
Department of Electrical and Electronic Engineering, The University of Melbourne, Melbourne, Australia

and

ROBERT R. BITMEAD
Department of Mechanical and Aerospace Engineering, University of California, San Diego
5.1
PREAMBLE
For ease of reference, some of the notations and definitions that are used in this chapter are listed here.
5.1.1
Notation
I_m       m × m identity matrix
E{·}      expectation operator
Ē{·}      time-average operator, Ē{x} = lim_{N→∞} (1/N) Σ_{n=1}^{N} E{x(n)}
Most of this chapter was written when Iven Mareels was visiting the Department of Electrical and Computer Engineering at the National University of Singapore. The support and hospitality of the Department are hereby gratefully acknowledged. John Homer is with the Department of Computer Science and Electrical Engineering, The University of Queensland, Brisbane, Qld 4072, Australia, homerj@csee.uq.edu.au; Iven Mareels is with the Department of Electrical and Electronic Engineering, The University of Melbourne, Vic 3010, Australia, i.mareels@unimelb.edu.au; and Robert Bitmead is with the Department of Mechanical and Aerospace Engineering, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0411, USA, rbitmead@ucsd.edu.
Least-Mean-Square Adaptive Filters, Edited by Simon Haykin and Bernard Widrow. ISBN 0-471-21570-8 © 2003 John Wiley & Sons, Inc.
n = 1, 2, …    discrete time index
u(n)      input signal
y(n)      adaptive filter output
d(n)      desired signal
δ(n)      disturbance signal
e(n)      error signal
w(n)      adaptive weight vector
ū(n)      regressor vector, ū(n) = [u(n)  u(n−1)  ⋯]^T
R         input autocorrelation matrix
w^o       optimal (Wiener) weight vector
μ > 0     adaptation step size
5.1.2
Definitions
In the discussions frequent use is made of o(·) and O(·) estimates; refer to [31] for detailed definitions. Concisely, for two sequences u₁(n) and u₂(n) defined on n = 0, 1, …, it is said that u₁ is of the order of u₂, denoted u₁(n) = O(u₂(n)), provided that there exist a constant C > 0 and a time instant n₀ > 0 such that ‖u₁(n)‖ ≤ C‖u₂(n)‖ for all n ≥ n₀. The notation u₁(n) = o(u₂(n)) indicates that u₁(n) = O(u₂(n)) and lim_{n→∞} ‖u₁(n)‖ / ‖u₂(n)‖ = 0.
5.2
INTRODUCTION
the input signal u. This dependence is investigated in this chapter. The main tools
used in the analysis are first- and second-order averaging techniques [2, 31–33].
A standing assumption is
Assumption 1  The input u and disturbance δ signals are wide-sense stationary and possess well-defined mean, autocorrelation, and cross-correlation functions.
This assumption is elaborated upon in the sequel.
5.2.1
The typical situation is depicted in Figure 5.1. An FIR filter with ℓ taps and adaptively adjusted weight vector w is used to approximate the response of an unknown but stable filter. The stationary input signal is u. The adaptive filter's output is denoted y, with y(n) = w(n)^T ū(n), where ū(n) = [u(n)  u(n−1)  ⋯  u(n−ℓ+1)]^T is the corresponding regressor vector. The desired signal d is the output of the unknown filter. The latter may be disturbed by a (stationary) disturbance signal δ. The weight vector w of the adaptive FIR filter is adjusted so as to minimize the error

    e(n) = d(n) + \delta(n) - y(n)        (5.1)

in the mean-square sense. The weights are updated using the least-mean-square (LMS) update rule (equivalent to a stochastic gradient approximation with constant step size):

    w(n+1) = w(n) + \mu \, \bar{u}(n) \, e(n).        (5.2)
Figure 5.1
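The setup of Figure 5.1 translates into a few lines of code. The sketch below (NumPy; the filter length ℓ = 8, the step size, and the white signals are illustrative assumptions) runs the LMS update (5.2) against an unknown FIR system and recovers its weights:

```python
import numpy as np

rng = np.random.default_rng(6)
ell, steps, mu = 8, 20_000, 0.01
w_o = rng.standard_normal(ell)          # unknown system (the Wiener solution here)
w = np.zeros(ell)                       # adaptive weight vector, w(0) = 0
u = rng.standard_normal(steps + ell)    # white, stationary input

for n in range(ell, steps + ell):
    u_bar = u[n:n - ell:-1]             # regressor [u(n) u(n-1) ... u(n-ell+1)]^T
    d = w_o @ u_bar                     # desired signal from the unknown filter
    e = d + 0.01 * rng.standard_normal() - w @ u_bar   # error (5.1), with disturbance
    w = w + mu * u_bar * e              # LMS update (5.2)

print(np.allclose(w, w_o, atol=0.05))   # weights approach the Wiener solution
```

With a white input the autocorrelation matrix is the identity, so this is the most benign case; the dimension and correlation effects analyzed below show up when the input is colored or ℓ grows.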
The task of the LMS algorithm is to find the best (in a least-squares sense) linear approximation of the desired signal d using the regressor vector ū. Because of its computational simplicity and excellent robustness characteristics, it is an extremely widely used algorithm [1, 2].
The signal environment is considered to be an open loop signal environment when the input signal is independent of the error signal, so that E{u(n)δ(n−k)} = 0 for all k. A feedback signal environment is one where the error signal may leak back into the input signal, a situation that occurs, for example, in acoustic or telephony echo cancellation applications. The main difficulty encountered in a feedback signal environment is (closed loop) stability. In the open loop signal environment, there is no stability issue for the FIR filter as long as the FIR filter coefficients themselves are finite. This stability property is one of the main attractions of FIR filters in an adaptive context.
An in-depth analysis of stability properties of LMS algorithms under open loop signal conditions can be found in [4–6]. There the LMS algorithm is analyzed under weak conditions restricting the interdependence of the regressor vectors and the tail of their distribution. Most importantly, stationarity is not assumed, an assumption that is made in this chapter. A form of exponential stability, which entails good robustness properties, is established using ideas akin to averaging. As is typical in averaging results, the results are established under the condition that the step-size parameter is sufficiently small. In [4–6] the tracking performance of LMS algorithms is also analyzed under various scenarios. Because the assumptions are very weak, the analysis presented in [4–6] is unable to reveal dimension dependencies in performance and/or tracking characteristics. The latter is precisely the topic of this chapter. The study of LMS algorithms under feedback conditions is not as well developed; for some first results, refer to [30, 25].
5.2.2
The dimension parameter ℓ and its influence on the behavior of LMS algorithms have played an important role in the literature dealing with LMS filters from the very first references on the topic. Understanding this dependence becomes even more important as demanding applications such as adaptive acoustic echo cancellation and acoustic equalization require FIR filters of very high dimension in order to achieve good (filter) performance.
Most of the early literature (see, e.g., [35–39, 41, 42]) deals with the convergence rate of LMS filters in terms of the second-order characteristics of the input signal, namely, the eigenvalues of the correlation matrix R. This observation itself reveals a link between convergence speed and the dimension via the second-order moment of the input signal. Such dimension dependence is further supported by the very general (and hence conservative) theory of Vapnik [34]. The results in [34] predict, under very mild signal conditions, a penalty on the convergence rate with an increase in filter parameter dimension.
Most of the literature dealing with the convergence aspects of LMS filters attempts to find a best step size μ (under a variety of input conditions) so as to
result is a factor of 2 tighter than the estimate of [36] in the white Gaussian signal case. This estimate clearly shows a dimension effect as well as the influence of the input signal's autocorrelation function. The wider the spread of the eigenvalues of the autocorrelation matrix (which is the more likely the larger the dimension), the slower the expected initial convergence.
In [3] the behavior of the expected squared error E{e(n)²} is analyzed for small n as a measure of transient performance. The assumptions that the input is i.i.d. and that the regressor vector ū is independent of w (not unreasonable for small μ) are imposed. It is shown that the initial convergence rate is optimized by choosing μ = 1/σ²_{e0}, where σ²_{e0} is the variance of the expected initial parameter error. With this choice, it is further shown that the expected squared error E{e(n)²} converges like (1 − 1/ℓ)ⁿ for small n. This shows that the length of the FIR filter penalizes the LMS algorithm's convergence. The actual dependence on the signal's autocorrelation function is, of course, absent, because the signal was assumed to be i.i.d. over time. Similarly, [4] considers an adaptive normalized LMS algorithm under the condition that both input and disturbance signals are white and uncorrelated. The authors consider the expected squared error and establish that the initial convergence rate decreases as 1/ℓ. A closer analysis of the result in [4] reveals that it is slightly different from [3]: the estimates of the convergence speed differ by an O(μ²) term. Given that both [3] and [4] use first-order averaging techniques [31] to establish their results, this is completely acceptable, as all estimates are at best o(μ) correct.
More recently in [5], for normalized LMS, the inverse of the condition number of
the normalized (unit variance) input autocorrelation matrix R has been studied. It is
assumed that the input is Gaussian. A heuristic argument is mounted indicating that
the convergence speed is inversely proportional to this condition number. The main
result establishes that this condition number grows with the length of the FIR lter.
A cost function that captures the transient performance of the LMS algorithm more adequately is inspired by [38]:

    C_e(w^o) = \lim_{N \to \infty} \frac{1}{N} \sum_{n=1}^{N} \frac{\| E\{e(n)\} \|^2}{\| w^o \|^2}.        (5.4)
autocorrelation function. See, for example, [12], where the step-size parameter is different for each of the tap estimates and, moreover, adaptively adjusted based on the size of the total update for the particular weight. These ideas are not pursued here. In the same vein, [13, 16] may be mentioned, in which variants of an algorithm originally proposed by [14, 15], and traceable back to [17], are analyzed. At each sample interval, the algorithm updates only those m < ℓ tap weights for which the corresponding regressor entries are largest. The algorithm requires as overhead a sorting of the regressor vector in descending order of magnitude, which can be efficiently implemented. The computational cost is O(m log m), which has to be compared with O(ℓ) for the classical algorithm. Through a simulation analysis, it is shown that the penalty on convergence and performance is minimal as long as m is selected appropriately. Reference [16] provides further theoretical justification for the algorithm's performance. An analogous analysis is performed in [18], where the update is combined with an affine projection to provide improved performance. These algorithms are somewhat akin to the active tap algorithms proposed in the context of acoustic echo canceling [23].
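The selective-update idea just described can be sketched as follows: at each step only the m taps with the largest regressor magnitudes are adapted. The use of `argpartition` for the selection and all sizes are illustrative assumptions; a real implementation would maintain the ordering incrementally at the O(m log m) cost mentioned above.

```python
import numpy as np

rng = np.random.default_rng(7)
ell, m, steps, mu = 16, 4, 30_000, 0.02
w_o = rng.standard_normal(ell)          # unknown system
w = np.zeros(ell)
u = rng.standard_normal(steps + ell)    # white input

for n in range(ell, steps + ell):
    u_bar = u[n:n - ell:-1]
    e = w_o @ u_bar + 0.01 * rng.standard_normal() - w @ u_bar
    idx = np.argpartition(np.abs(u_bar), -m)[-m:]   # the m largest |entries|
    w[idx] += mu * u_bar[idx] * e                   # update only those m taps

print(np.allclose(w, w_o, atol=0.1))    # still converges, just more slowly
```

Each tap is updated only a fraction of the time, so the transient is longer than for full LMS, but the per-sample update cost drops from ℓ to m multiply-accumulates.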
In acoustic echo canceling applications the effective length of the FIR filter can be large compared to the actual number of required (nonzero) taps. This can be intuitively attributed to the way the signal is constructed: travel delay and reflections. The situation is as depicted in Figure 5.2; the shaded regions indicate where tap weights are important. In such circumstances it pays not only to identify the tap weights, but also to determine which taps should be identified. In view of the fact that dimension adversely affects the LMS learning performance, this strategy promises a significant improvement in transient performance as compared to the brute-force estimation of all taps over the entire effective length of the FIR filter. It is therefore no surprise that the literature dealing with acoustic echo canceling is preoccupied with reducing the number of adaptively adjusted FIR filter
Figure 5.2
weights. The key issue is how to determine which of the possible taps should be updated.
A few authors consider block processing of data, either in the time domain or in a linear transform domain. In [6] large FIR filters are adaptively updated not in the time domain, but after a linear transformation such as a discrete cosine transformation. Data are block processed, where the block length is larger than the maximum FIR delay. In the transformed domain the FIR coefficients that are considered most active are updated using a normalized LMS-like algorithm. The authors consider various options for reducing the computational cost of the updates in the transform domain. The computational cost is linear in the number of taps to be updated (which is much less than the FIR's total delay) and the data block length. It is shown through a simulation study that the convergence rate compares favorably with time-domain-based normalized LMS algorithms. In [7] a block data method is considered in the time domain. In every block of N data points the algorithm determines the P most significant taps from a possible maximum of M FIR taps. The integers satisfy P < M < N. The first most significant tap is the tap with the largest weight, as determined through a projection (the regressor vector most aligned with the output vector). This process is then repeated on the residual (what remains of the output vector after removal of the most aligned regressor vectors) until either P taps are determined or the residual is deemed sufficiently small. The computational cost is O(MP) per sample interval, which should be compared with the O(M) computational cost of a normal LMS algorithm.
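The block method of [7] is essentially a greedy, matching-pursuit-style selection, which can be sketched as below. The sparse system, block length, and selection loop are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)
M, P, N = 32, 3, 400                    # P < M < N, as in the text
w_o = np.zeros(M)                       # sparse unknown system: 3 active taps
w_o[[2, 11, 25]] = [1.0, -0.7, 0.4]

u = rng.standard_normal(N + M)
U = np.stack([u[n:n - M:-1] for n in range(M, N + M)])   # N x M regressor matrix
d = U @ w_o + 0.01 * rng.standard_normal(N)              # block of outputs

residual, selected = d.copy(), []
for _ in range(P):
    # pick the tap (column) most aligned with the current residual
    scores = np.abs(U.T @ residual) / np.linalg.norm(U, axis=0)
    scores[selected] = -np.inf          # never pick the same tap twice
    selected.append(int(np.argmax(scores)))
    # refit on all selected taps, then recompute the residual
    coef, *_ = np.linalg.lstsq(U[:, selected], d, rcond=None)
    residual = d - U[:, selected] @ coef

print(sorted(selected))                 # the three active taps chosen above
```

The refit-and-residual step is what makes the selection robust: a tap correlated with an already-chosen one will not be picked again, since its contribution has been removed from the residual.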
In the context of decision feedback equalizers, the tap selection issue is also considered; see [9–11]. In [9] a simple feedforward decision feedback equalizer is considered. In [10] the feedback decision equalizer is sparse. The method appears to rely on some rather strong prior information about the signal environment in order to determine which taps to update.
In [19–21] the more conventional LMS or normalized LMS is considered, with a tap-activity measure based on input-output cross-correlation estimates. A heuristic argument indicates that this correlation analysis allows one to rank the most important taps, which are then updated using the normal LMS algorithm.
The disadvantage of all these two-stage approaches is that the tap selection mechanism is essentially divorced from the optimization task to be performed by the LMS algorithm. In contrast, the approach expounded in the sequel directly selects as active those taps that contribute most to the minimization of the least-squares cost function. This approach is advocated in [23, 26, 27, 29].
5.2.3
Chapter Organization
The remainder of the chapter is organized as follows. First, the effect of the dimension and the correlation properties of the input signal on the convergence properties of standard LMS adaptive filters is studied. The basic assumptions are formulated, the averaging analysis is performed, and a particular measure of the quality of the adaptive filter's behavior is proposed. The main theorem, which characterizes how dimension and correlation properties affect the performance measure, follows. The result is illustrated with a number of representative simulations.
The next section deals with LMS adaptive lters with a variable number of
nonzero or active taps. This situation is analyzed under the condition that the input is
required to be white. A measure for detecting active taps is introduced. Based on this
measure, an algorithm that combines detection of active taps with standard LMS
adaptation is then proposed. The results are illustrated with some simulation studies.
A modification valid for mildly correlated signals is argued heuristically and presented.
Pointers to open questions and further reading conclude the chapter.
5.3
The open loop signal environment is studied. The main result is obtained through first- and second-order averaging techniques, without necessarily imposing a stochastic framework on the signals. Basic Cesàro-mean assumptions for first- and second-order moments suffice to derive the results. First, the assumptions are introduced. The basic averaged equations are then derived. Next, the performance measure that captures both transient and asymptotic behavior of the LMS algorithm is introduced. In order to make the results independent of any particular filtering situation, which is necessary in order to discuss the inherent algorithmic properties, it is assumed that the orientation of the desired Wiener solution is drawn from a uniform distribution.
5.3.1
To quantify how the dimension affects the convergence rate of the LMS algorithm in the open loop signal case, the following assumptions are imposed.

Assumption 2 (i) The input, u(n), and disturbance, δ(n), signals are zero mean, bounded, and stationary, so that the following limits exist (uniformly in the starting index k):

    R = \lim_{N \to \infty} \frac{1}{N} \sum_{n=k}^{N-1+k} \bar{u}(n) \bar{u}(n)^T,

    \sigma_u^2 = \lim_{N \to \infty} \frac{1}{N} \sum_{n=k}^{N-1+k} u(n)^2,

    \sigma_\delta^2 = \lim_{N \to \infty} \frac{1}{N} \sum_{n=k}^{N-1+k} \delta(n)^2.

(ii) The input and disturbance signals are uncorrelated with each other over time:

    \lim_{N \to \infty} \frac{1}{N} \sum_{n=k}^{N-1+k} u(n) \delta(n-m) = 0, \qquad \forall m.
(iii) The autocorrelation function of the input,

    r(n) = \lim_{N \to \infty} \frac{1}{N} \sum_{k=1}^{N} u(k) u(k-n), \qquad n = \ldots, -2, -1, 0, 1, 2, \ldots,

is absolutely summable: \sum_n |r(n)| < \infty. This guarantees the existence of the power spectrum of the input signal.
(iv) The power spectrum Φ_uu(ω) of the input signal is positive definite: Φ_uu(ω) > 0, 0 ≤ ω ≤ 2π. This implies that the input signal covariance matrix R is positive definite for all ℓ.
(v) The LMS step size satisfies μ ≤ 1/(σ_u² ℓ) = 1/trace(R).
(vi) The normalized unknown Wiener solution w^o/‖w^o‖ is independent of the input signal and has a probability distribution that is uniform in direction in ℓ-dimensional space (or, equivalently, the unknown channel vector has equal probability of pointing in any direction in the ℓ-dimensional space).
(vii) The LMS initial estimate w(0) is zero.
Remark 1  Assumption 2(ii) implies, among other things, that the Wiener solution is the stationary point of the LMS algorithm. Condition (iv) ensures that the Wiener solution is an attractive point for the LMS algorithm, regardless of the dimension ℓ. One says that the input signal is persistently exciting of any order.
Finally, condition (vi) allows one to average out the effects of any particular filter situation and concentrate solely on the LMS dynamics itself. It could be argued that divorcing the Wiener solution from the input signal is a strong assumption. Indeed, in general, the Wiener solution may depend on the input signal, although this is, of course, not a very desirable situation. It is a most convenient assumption, as without it the calculations for the performance indicator become rather tedious and uninformative.
Condition (vii) is a natural consequence of (vi); there is simply no prior knowledge to justify any other choice.
5.3.2
Averaging
Rather than discussing the convergence properties of the original LMS algorithm
(5.2), which requires one to study a time-varying linear equation, an intermediate,
averaged time-invariant equation, which closely captures the behavior of the LMS
algorithm, is obtained first. Assumption 2, in particular conditions (i), (ii), and (iii),
enables the following approximation.
Consider the averaged equation

w_{av}(n+1) = (I - \mu R)\, w_{av}(n) + \mu p, \qquad w_{av}(0) = w(0) = 0. \qquad (5.6)
Then, under Assumption 2, conditions (i)-(iv), standard averaging results guarantee
that the solution w_{av}(n) of (5.6) is a \nu(\mu) approximation for w(n), the solution of
(5.2), uniformly over time, because R > 0. More precisely, for all \mu sufficiently
small (at least satisfying condition (v) from Assumption 2), the following bound
holds:

\|w_{av}(n) - w(n)\| \le \nu(\mu). \qquad (5.7)

Under Assumption 2 it can be deduced that \nu(\mu) = o(1) as \mu \to 0, for any choice of
compact domain D and any choice of horizon parameter L. \qquad (5.8)
Remark 2 Under the particular condition (iii) imposed by Assumption 2, one can
actually estimate that \nu(\mu) = O(\sqrt{\mu}).
Remark 3 In essence, the above conclusion allows one to study equation (5.6)
rather than the LMS equation (5.2) in order to describe both transient and
asymptotic properties. This is the power of time-based averaging analysis. It is an
approximation result, which here, thanks to the asymptotic stability of the averaged
equation, is valid over the entire time axis [31, 2].
Remark 4 Note that the stationary point of (5.6) is the Wiener solution. It follows
from equations (5.6) and (5.7) that the LMS algorithm's solution converges
geometrically to an O(\nu(\mu)) neighborhood of the Wiener solution.
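The averaging approximation is easy to observe numerically. The sketch below (a minimal illustration, not code from the chapter; the two-tap system, step-size, and noise level are arbitrary choices) runs a scalar-step LMS filter on white Gaussian data alongside the averaged recursion (5.6), which for white input reduces to w_av(n+1) = (1 - mu) w_av(n) + mu w_o per tap, and records the worst-case gap between the two trajectories:

```python
import random, math

random.seed(1)

w_o = [1.0, -0.5]        # unknown Wiener solution (illustrative choice)
mu, steps = 0.005, 6000
sigma_d = 0.1            # disturbance standard deviation (assumed)

# White input => R = I and p = R w_o = w_o in the averaged recursion (5.6).
w = [0.0, 0.0]           # LMS iterate
w_av = [0.0, 0.0]        # averaged iterate, eq. (5.6)
u_buf = [0.0, 0.0]       # regressor u(n) = [u(n), u(n-1)]
max_gap = 0.0

for n in range(steps):
    u_buf = [random.gauss(0, 1), u_buf[0]]
    d = sum(wi * ui for wi, ui in zip(w_o, u_buf)) + random.gauss(0, sigma_d)
    e = d - sum(wi * ui for wi, ui in zip(w, u_buf))
    # LMS: w(n+1) = w(n) + mu * e(n) * u(n)
    w = [wi + mu * e * ui for wi, ui in zip(w, u_buf)]
    # Averaged: w_av(n+1) = (I - mu R) w_av(n) + mu p
    w_av = [(1 - mu) * wi + mu * pi for wi, pi in zip(w_av, w_o)]
    gap = math.sqrt(sum((a - b) ** 2 for a, b in zip(w, w_av)))
    max_gap = max(max_gap, gap)

final_err = math.sqrt(sum((a - b) ** 2 for a, b in zip(w, w_o)))
```

With a small step-size the recorded gap stays small over the whole trajectory, in line with the uniform bound (5.7), and the LMS iterate ends in a small neighborhood of the Wiener solution.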
5.3.3
In order to study the transient performance of the LMS algorithm, consider the
following cost functional:
\hat{C}_e = E_{w_o} \lim_{N\to\infty} \sum_{n=0}^{N} \frac{\|w_{av}(n) - w_o\|^2}{\|w_o\|^2}. \qquad (5.9)
In view of the stability of the Wiener solution for equation (5.6), the sum above can
be seen to be bounded, and hence \hat{C}_e is well defined. It clearly captures the
transient performance, not the asymptotic performance, which is characterized by
(5.8).
Since w_{av}(n) - w_o = -(I - \mu R)^n w_o, the cost (5.9) evaluates to

\hat{C}_e = E_{w_o} \sum_{n=0}^{\infty} \frac{\|(I - \mu R)^n w_o\|^2}{\|w_o\|^2}
    = \sum_{n=0}^{\infty} \sum_{j=1}^{N} (1 - \mu\lambda_j)^{2n}\, E_{w_o}\!\left[\frac{g_j(w_o)^2}{\|w_o\|^2}\right], \qquad (5.10)

where \lambda_1, \ldots, \lambda_N are the eigenvalues of R and g_j(w_o) is the component of
w_o along the jth eigenvector of R. Under condition (vi) of Assumption 2,
E_{w_o}[g_j(w_o)^2/\|w_o\|^2] = 1/N, so that, summing the geometric series,

\hat{C}_e = \frac{1}{N} \sum_{n=0}^{\infty} \sum_{j=1}^{N} (1 - \mu\lambda_j)^{2n} \qquad (5.11)

    = \frac{1}{N} \sum_{j=1}^{N} \frac{1}{\mu\lambda_j (2 - \mu\lambda_j)}. \qquad (5.12)
Clearly, the constant terms are irrelevant when compared to the O(1/\mu) terms. This
suggests that an appropriate measure for the transient learning cost of the LMS
algorithm is given by the expression

C_e(N) = \frac{1}{2\mu N}\, \mathrm{trace}(R^{-1}). \qquad (5.13)

5.3.4
In the previous section it was argued that the convergence cost, or transient learning
cost, for the LMS algorithm in a typical situation (typical because the Wiener-solution
dependence is averaged out) is determined by (5.13). In this section, some
analytic results on how this convergence cost C_e(N) depends on signal properties and
on the dimension N are provided. Of course, in any particular signal environment, it is
actually feasible to compute the cost functional C_e(N) for different parameters and
simply observe the dimensional dependence.
The following result holds for any signal environment conforming to
Assumption 2.
Theorem 1 Let the signal environment conform to Assumption 2. Then:
1. If the input signal is discrete white, then C_e(N) = C_e(1) for all N.
2. C_e(N) is nondecreasing in the dimension N:

C_e(N+1) - C_e(N) \ge 0, \qquad (5.14)

where 1/r(N) is the (1,1) element of R_N^{-1} and b(k,N)/r(N) is the (k,1) element of
R_N^{-1}.
3. In the limit of large dimension,

\lim_{N\to\infty} C_e(N) = \frac{1}{2\mu}\, \frac{1}{2\pi} \int_{-\pi}^{\pi} \Phi_{uu}^{-1}(\omega)\, d\omega =: C_e(\infty). \qquad (5.15)

A detailed proof of this result can be found in [22, 29]. Parts 1 and 2
effectively follow from the Levinson algorithm applied to the inverse of the input's
correlation matrix, exploiting its Hermitian and Toeplitz structure. Part 3 is a
standard result from [44].
The results encapsulated in Theorem 1 may be paraphrased as follows:

If the input signal u is discrete white, the dimension N does not affect the
convergence speed of the LMS algorithm. This is a clear pointer for the
prewhitening filters advocated in conjunction with LMS algorithms in, for
example, echo-cancellation applications.

When the input signal u is not discrete white, the effect of dimension is more
pronounced the more u deviates from discrete white. The expression C_e(\infty) can
be effectively interpreted as measuring the filter power required to whiten the
signal u. It can also be observed that, under the constraint of unity signal
power, C_e(\infty) attains its minimum (and thus becomes a tight bound) when the
input signal is discrete white, that is, for signals with a constant power
spectrum [29].
To appreciate the effect of the input signal not being discrete white, Figure 5.3
represents the convergence cost function C_e(\infty) against the filter pole a \in [0,1) for
an input signal which is first-order filtered white noise, u(n+1) = a\,u(n) + \varepsilon(n+1).
Here the variance \sigma_\varepsilon^2 of the white noise \varepsilon is scaled such that u has unity total power:

\frac{1}{2\pi} \int_{-\pi}^{\pi} \Phi_{uu}(\omega)\, d\omega = \frac{\sigma_\varepsilon^2}{2\pi} \int_{-\pi}^{\pi} \frac{d\omega}{|e^{i\omega} - a|^2} = 1.

As Figure 5.3 clearly illustrates, the more the signal u is correlated (a closer to 1), the worse the transient
performance becomes (compared to the white noise case).
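For the first-order case the dimensional dependence of (5.13) can be evaluated in closed form, because for a unit-power AR(1) input with pole a the covariance matrix is R_ij = a^|i-j|, whose inverse is the well-known tridiagonal matrix with trace(R_N^{-1}) = [2 + (N-2)(1+a^2)]/(1-a^2). The sketch below (illustrative only; the pole values and step-size are arbitrary) computes C_e(N) = trace(R_N^{-1})/(2 mu N) and reproduces both observations above: for a = 0 the cost is independent of N, while for a near 1 it grows with N toward the limit C_e(inf) = (1+a^2)/(2 mu (1-a^2)) of (5.15):

```python
def conv_cost(N, a, mu):
    """Transient learning cost C_e(N) of eq. (5.13) for a unit-power
    AR(1) input with pole a: R_ij = a^|i-j|, whose inverse has
    trace [2 + (N-2)(1 + a*a)] / (1 - a*a)."""
    if N == 1:
        trace_rinv = 1.0
    else:
        trace_rinv = (2.0 + (N - 2) * (1.0 + a * a)) / (1.0 - a * a)
    return trace_rinv / (2.0 * mu * N)

mu = 0.01
# White input (a = 0): cost does not depend on the dimension N.
white = [conv_cost(N, 0.0, mu) for N in (1, 10, 100, 1000)]
# Strongly correlated input (a = 0.9): cost grows with N.
colored = [conv_cost(N, 0.9, mu) for N in (1, 10, 100, 1000)]
ce_inf = (1 + 0.9 ** 2) / (2 * mu * (1 - 0.9 ** 2))  # limit (5.15) for AR(1)
```

The white-input costs all equal 1/(2 mu), while the correlated costs increase monotonically in N and approach ce_inf, matching Theorem 1.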
More generally, for unit-power input signals u described by autoregressively
filtered white noise, the convergence cost function can be expressed as
follows.

Theorem 2 Suppose that the unit power signal, u(n), is described by the mth-order
autoregressive (AR) model

a_0 u(n) + a_1 u(n-1) + a_2 u(n-2) + \cdots + a_m u(n-m) = \varepsilon(n), \qquad (5.16)

with m \le N. Then

C_e(N) = \frac{a_0^2\, N + a_1^2 (N-1) + a_2^2 (N-2) + \cdots + a_m^2 (N-m)}{2\mu N}. \qquad (5.17)
Remark 5 Models of the form (5.16) are typically used for voiced speech. In such
circumstances the typical maximal delay m is of the order of 10. In acoustic echo
cancellation, typical FIR filter orders are of the order of 1000, so the stated
restriction on m is not a limiting factor in this context.
Remark 6 The unit power constraint for u in (5.16) imposes the following
constraint on the autoregressive filter A(z^{-1}) = a_0 + a_1 z^{-1} + a_2 z^{-2} + \cdots + a_m z^{-m}:

\frac{1}{2\pi} \int_{-\pi}^{\pi} \frac{1}{|A(e^{j\omega})|^2}\, d\omega = 1.
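The constraint in Remark 6 is easy to check numerically for any candidate coefficient set. The sketch below (a standalone illustration; the first-order filter and its normalization are chosen for this example, not taken from Table 5.1) evaluates the integral by a Riemann sum and verifies that A(z^{-1}) = (1 - a z^{-1}) / sqrt(1 - a^2) yields a unit-power AR(1) process:

```python
import math

def unit_power_integral(coeffs, K=4096):
    """Approximate (1/2pi) * integral of 1/|A(e^{jw})|^2 dw over [-pi, pi)
    by a Riemann sum on K equispaced frequencies."""
    total = 0.0
    for k in range(K):
        w = -math.pi + 2 * math.pi * k / K
        re = sum(c * math.cos(-j * w) for j, c in enumerate(coeffs))
        im = sum(c * math.sin(-j * w) for j, c in enumerate(coeffs))
        total += 1.0 / (re * re + im * im)
    return total / K

a = 0.8
# AR(1) whitening filter scaled so that the output process has unit power:
coeffs = [1.0 / math.sqrt(1 - a * a), -a / math.sqrt(1 - a * a)]
power = unit_power_integral(coeffs)
```

Because the integrand is smooth and periodic, the equispaced Riemann sum converges very quickly, and the computed power is 1 to within numerical precision.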
Remark 7 As already observed in the general case, for signals described by (5.16)
the convergence cost function also increases with the filter dimension N. In
particular, it follows from (5.17) that in this special case

C_e(1) \le C_e(N) \le C_e(\infty), \qquad (5.18)

and the same ordering holds for the normalized cost functions C_e^{norm}(N).
By way of illustration, Figure 5.4 represents C_e(N) (on the vertical axis) for various
AR models (with m = 10) against the dimension N (on the horizontal axis). Table 5.1 includes
the three AR coefficient sets A1, A2, and A3 used to construct Figure 5.4. The AR
models, corresponding to equation (5.16), are obtained through application of the
Yule-Walker method [44] to segments of unit variance voiced speech. Also included
in Table 5.1 is the equivalent AR coefficient set A0 for a unit variance white signal.
The limit in the convergence cost function, or the correlation level, for each of these
signals is, respectively, \mu C_e(\infty) = 1.0 (A0), 122.9 (A1), 306.4 (A2), 195.1 (A3). This
implies that the (normalized) LMS convergence cost function for voiced speech
inputs typically is more than 100 times greater than that for white inputs of the same
variance. For filter lengths greater than 40, the same is also true for the
unnormalized LMS cost function, as indicated in Figure 5.4. Note that the graph for
A0 is not discernible from the horizontal axis.

Figure 5.4 The effect of autocorrelation and filter dimension on transient performance. The
figure represents C_e(N) (vertical axis) for various AR models against the dimension N (horizontal axis).

Figure 5.4 clearly suggests that in applications such as acoustic echo cancellation, which involve speech input signals
and filter lengths of 100 up to the order of 1000, input signal whitening techniques
should improve the convergence speed by more than 100-fold.
5.3.5
Step-size Selection
The performance of a typical LMS algorithm consists not only of the transient
performance but also of the asymptotic performance. As indicated in the averaging
TABLE 5.1 Voiced Speech AR Coefficient Sets (Used for Figure 5.4). The table lists the
AR coefficients (m = 10) of the unit-variance white reference set A0 (1.0, 0.0, ..., 0.0) and
of the three voiced-speech sets A1, A2, and A3 obtained via the Yule-Walker method.
The AR filters are designed to satisfy the unit power constraint; see Remark 6.
result, asymptotically the adaptive FIR filter approximates the ideal Wiener filter,
with an error of the order \nu(\mu) given in equation (5.8). Under the signal
conditions imposed in Assumption 2, it follows that the least-squares performance
error of the adaptive filter in steady state is O(\mu) in excess of the Wiener filter
performance. (Here use is made of the estimate \nu(\mu) = O(\sqrt{\mu}), as indicated in
Remark 2.)
It is natural to propose a step-size selection that tries to achieve good transient
performance as well as good asymptotic performance. This would lead to a criterion
for step-size selection of the form

J(\mu) = \frac{1}{2\mu}\, \frac{1}{2\pi} \int_{-\pi}^{\pi} \Phi_{uu}^{-1}(\omega)\, d\omega + O(\mu), \qquad (5.19)

which is minimized by a step-size of the order

\mu = O\!\left( \left( \frac{1}{4\pi} \int_{-\pi}^{\pi} \Phi_{uu}^{-1}(\omega)\, d\omega \right)^{1/2} \right). \qquad (5.20)
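The square-root scaling in (5.20) follows from balancing the two terms of (5.19): a cost of the form a/mu + b*mu is minimized at mu = sqrt(a/b). The toy sweep below (with invented constants a and b, not derived from any particular signal) checks this on a grid of candidate step-sizes:

```python
import math

a = 2.0    # transient-cost coefficient (assumed value)
b = 50.0   # steady-state excess-error coefficient (assumed value)

def J(mu):
    """Combined cost of eq. (5.19): transient term a/mu plus O(mu) term."""
    return a / mu + b * mu

# Grid search over candidate step-sizes 1e-4 ... 1 (log-spaced).
grid = [10 ** (k / 100.0) for k in range(-400, 1)]
mu_best = min(grid, key=J)
mu_theory = math.sqrt(a / b)   # balancing the two terms
```

The grid minimizer lands on the theoretical balance point to within the grid spacing, illustrating why the optimal step-size scales with the square root of the whitening-power integral.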
5.4
VARIABLE-DIMENSION FILTERS
In applications like acoustic echo cancellation, FIR filters with large total delay are
required to obtain adequate echo suppression. In the presence of colored input
signals, this is particularly bad news for LMS algorithms, but even in the white input
case this causes performance difficulties.
In this section, a particular method of detecting the active taps (see Fig. 5.2) in
conjunction with a typical LMS algorithm is discussed. The selection of the active
taps is geared toward achieving good asymptotic performance. The detection
mechanism in conjunction with a normal LMS algorithm provides an estimation
approach with greatly enhanced asymptotic performance compared to the direct
estimation of an FIR filter with as many taps as the total delay requires. The price to
be paid for this enhanced performance is a marginal increase in computational cost.
The proposed detection method is shown to be structurally consistent; that is, it
identifies correctly which taps are active when the input signal is white. No
structural consistency results are available for colored input signals, but the method
is shown to be robust with respect to deviations from white input signals.
5.4.1
In the case of sparse FIR filter estimation (a filter like the one in Fig. 5.2), detection
of the active taps is important even in the case of white input signals. Although in
this case the convergence speed of the LMS algorithm is not affected by dimension,
the final asymptotic performance of the adaptive filter is greatly affected. Indeed, if
all N taps were adaptively estimated, then because an LMS estimate is never exact
but only O(\mu) accurate, each tap estimate contributes an O(\mu) error to the final
adaptive FIR filter performance. With N taps estimated, this leads to an excess error
of the order of O(\mu N) for the adaptive FIR filter compared to the ideal Wiener
filter. If, on the other hand, only m \ll N taps actually contributed to the Wiener filter
solution, and only those m taps were LMS adaptively estimated, then the final
adaptive FIR filter would have an excess error of only O(\mu m) \ll O(\mu N) over the ideal
Wiener filter. Clearly, it pays to detect those taps that actively contribute to the FIR
filter's performance.
In the presence of colored input signals, there is a second, equally compelling
reason to consider detection of active taps based on the LMS convergence
performance. As is clear from the previous section, the dimension of the regressor
vector and the autocorrelation properties of the input signal influence the convergence
properties of LMS algorithms in a nontrivial and detrimental manner.
5.4.2
Signal Environment
In order to focus the ideas, consider the following signal environment assumptions
in addition to the standing Assumptions 1 and 2, which are consistent with the
intuition behind Figures 5.1 and 5.2.

Assumption 3
(i) The measured signal is generated by a sparse Wiener filter:

y(n) = \sum_{j=1}^{m} w_o(t_j)\, u(n - t_j) + d(n), \qquad (5.21)

where it is expected that m \ll N and where the indices of the nonzero Wiener
filter weights, t_j \le N for j = 1, \ldots, m, are unknown. Denote the collection
\{t_j,\ j = 1, \ldots, m\} as J_o.
(ii) The input signal is discrete white, with variance \sigma_u^2 and thus R = \sigma_u^2 I.
(iii) The disturbance is discrete white and uncorrelated with the input.
5.4.3
Under Assumptions 1, 2, and 3, the performance of an LMS filter without active tap
detection is given by

\lim_{n\to\infty} E\{e(n)^2\} = \sigma_d^2 + \sigma_u^2 \sum_{j=1}^{N} E\{(w_o(j) - w(j))^2\}. \qquad (5.22)
For the standard LMS algorithm, in which all N taps are adapted, this evaluates to

\lim_{n\to\infty} E\{e(n)^2\} = \sigma_d^2 \left( 1 + \frac{\mu \sigma_u^2 N}{2} \right). \qquad (5.23)
The excess in asymptotic performance (as compared to the Wiener filter) is entirely
due to the variance error in the estimated tap weights. (There is no bias error.)
Now consider the case where only a portion of the tap weights are being
estimated and the others are simply set at zero. Let the estimated weights have
indices \hat{t}_j, j = 1, \ldots, k, with k < N. Denote this collection of indices as J and its
complement with respect to the full set of indices as J^c. Let J_1 = J_o \cap J be the set of
the indices of Wiener coefficients that are being estimated. Denote by J_2 = J_o \cap J^c the
collection of the indices of those Wiener coefficients that are not estimated. Let
J_3 = J \cap J_o^c be the set of indices of those coefficients that are estimated but have
no corresponding nonzero Wiener coefficient. The asymptotic filter performance is
then

\lim_{n\to\infty} E\{e(n)^2\} = \sigma_d^2 \left( 1 + \frac{\mu \sigma_u^2}{2} |J_1 \cup J_3| \right) + \sigma_u^2 \sum_{j \in J_2} w_o(j)^2. \qquad (5.24)
Clearly, the asymptotic performance can be further reduced by making J_3 the empty
set. If the Wiener coefficient is zero, it should not be estimated. The contribution of
the summation over J_2 to the asymptotic performance is the bias error. The adapted
model set does not include the actual Wiener filter, hence the bias terms in the
asymptotic performance. The contribution of the bias error to the overall performance
can be minimized by removing from J_2 every index j \in J_2 for which 2 w_o(j)^2 > \mu \sigma_d^2.
From the above, it follows that if a tap weight contributes less than the expected
parameter variance, it should not be estimated. This observation will guide the tap
selection procedure. It is clear that, to implement the procedure, it will be necessary
to estimate the variance of the disturbance \sigma_d^2. Moreover, in general, it transpires that
the best adaptive filter performance may be achieved by a filter that has
fewer taps than the Wiener filter. Structural consistency is therefore not an essential
property to aim for, although structural consistency is definitely better than having
too many parameters estimated.
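The trade-off just described can be made concrete with the steady-state expressions above. The sketch below is a numerical illustration with arbitrarily chosen parameters; it assumes the steady-state forms read here, a variance cost of mu*sigma_d^2*sigma_u^2/2 per adapted tap plus a bias cost of sigma_u^2*w_o(j)^2 per omitted active tap, compares a full N-tap LMS filter with a filter adapting only the active taps, and applies the 2*w_o(j)^2 > mu*sigma_d^2 rule to decide whether a small tap is worth estimating:

```python
# Steady-state MSE bookkeeping for LMS with and without tap selection.
sigma_d2 = 0.01      # disturbance variance (assumed)
sigma_u2 = 1.0       # input variance (assumed)
mu = 0.005           # step-size (assumed)
N = 300              # total number of taps
w_o = {3: 0.9, 40: -0.4, 120: 0.004}   # sparse Wiener solution (illustrative)

def mse(estimated):
    """Asymptotic MSE: variance cost for every adapted tap plus bias
    cost for every non-adapted nonzero Wiener tap."""
    variance = sigma_d2 * mu * sigma_u2 / 2 * len(estimated)
    bias = sigma_u2 * sum(w * w for j, w in w_o.items() if j not in estimated)
    return sigma_d2 + variance + bias

mse_full = mse(set(range(N)))      # all N taps adapted
mse_oracle = mse(set(w_o))         # only the active taps adapted
# Selection rule: adapt tap j only if 2 w_o(j)^2 > mu * sigma_d^2.
selected = {j for j, w in w_o.items() if 2 * w * w > mu * sigma_d2}
mse_rule = mse(selected)
```

The tiny third tap fails the rule: its bias cost is smaller than the variance cost of estimating it, so dropping it improves even on the oracle sparse filter.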
same: Either the asymptotic performance and/or the convergence cost benefits from
eliminating all estimation of tap weights that contribute less than the expected noise
floor. Structural consistency does not lead to optimal LMS filter performance.
5.4.4
The previous discussion suggests detecting the active tap locations by considering the
following indicator:

X_n(j) = \frac{\left( \sum_{k=j+1}^{n} y(k)\, u(k-j) \right)^2}{(n-j) \sum_{k=j+1}^{n} u^2(k-j)}, \qquad (5.25)

where y(k) denotes the measured signal (the Wiener filter output plus disturbance).
Indeed, because of the discrete white noise character of u and the fact that the
disturbance and the input u are uncorrelated, it follows that X_n(j) converges in
probability as n \to \infty to w_o(j)^2 \sigma_u^2. It follows that, for sufficiently large n, the most
active taps can be simply ordered according to the size of X_n(j). This is formally
established in [26].
Clearly, this activity measure does not provide a means of detecting how many
taps are to be used in an LMS adapted filter. It does, though, provide a means of
detecting the m most active taps over a total FIR horizon of N, given both integers m
and N.
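The convergence of X_n(j) toward w_o(j)^2 sigma_u^2 is easy to observe in simulation. The sketch below is a self-contained toy example (the sparse filter, noise level, and sample size are arbitrary choices); it computes the indicator for every tap of a short white-input system and checks that the active taps dominate the ordering:

```python
import random

random.seed(7)

N = 12                       # total FIR horizon (illustrative)
w_o = {3: 0.8, 7: -0.5}      # sparse Wiener taps (illustrative)
n = 20000                    # number of samples
u = [random.gauss(0, 1) for _ in range(n)]            # white input, var 1
y = [sum(w * (u[k - j] if k - j >= 0 else 0.0) for j, w in w_o.items())
     + random.gauss(0, 0.5) for k in range(n)]        # measured signal

def activity(j):
    """Activity indicator X_n(j) in the spirit of eq. (5.25)."""
    num = sum(y[k] * u[k - j] for k in range(j, n)) ** 2
    den = (n - j) * sum(u[k - j] ** 2 for k in range(j, n))
    return num / den

X = [activity(j) for j in range(N)]
active = sorted(range(N), key=lambda j: -X[j])[:len(w_o)]
```

For the active taps the indicator settles near w_o(j)^2 (here sigma_u^2 = 1), while for inactive taps it decays like 1/n, so the ordering recovers the true support.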
Using a consistency argument, [26] argues and proves that declaring a tap j active
whenever X_n(j) exceeds a threshold of the order of \sigma_d^2 (\log n)/n provides a
consistent activity test. \qquad (5.26)
Combining the active tap detection result with an LMS update algorithm may be
achieved as follows:
Step 1. Detection at time n:
(a) Construct, for j \in [0, N-1],

X_n(j) = \frac{\left( \sum_{k=j+1}^{n} y(k)\, u(k-j) \right)^2}{(n-j) \sum_{k=j+1}^{n} u^2(k-j)}.

(b) Construct

K_n = \frac{1}{n} \sum_{j=1}^{n} y(j)^2.

(c) Construct

E_n = \frac{1}{n} \sum_{j=1}^{n} e(j)^2,

where y denotes the measured signal and e the LMS error signal.
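A minimal end-to-end version of this scheme can be sketched as follows. This is an illustration of the idea only, not the exact procedure of [26]: in particular, the detection threshold below is an ad hoc fraction of the largest indicator value rather than the consistent test (5.26). A first batch of data drives the activity indicator, the taps that clear the threshold are retained, and LMS adaptation then runs on the retained taps alone:

```python
import random

random.seed(11)

N, n_detect, n_adapt, mu = 16, 8000, 4000, 0.02
w_o = {2: 1.0, 9: -0.6}                    # sparse unknown system (illustrative)
u = [random.gauss(0, 1) for _ in range(n_detect + n_adapt)]
def measured(k):
    return sum(w * (u[k - j] if k - j >= 0 else 0.0)
               for j, w in w_o.items()) + random.gauss(0, 0.1)
y = [measured(k) for k in range(n_detect + n_adapt)]

# --- Step 1: activity detection on the first n_detect samples ----------
def indicator(j):
    num = sum(y[k] * u[k - j] for k in range(j, n_detect)) ** 2
    den = (n_detect - j) * sum(u[k - j] ** 2 for k in range(j, n_detect))
    return num / den

scores = [indicator(j) for j in range(N)]
threshold = 0.05 * max(scores)             # ad hoc threshold (assumption)
active = [j for j in range(N) if scores[j] > threshold]

# --- Step 2: LMS adaptation restricted to the detected taps ------------
w = {j: 0.0 for j in active}
for k in range(n_detect, n_detect + n_adapt):
    e = y[k] - sum(wj * u[k - j] for j, wj in w.items())
    for j in w:
        w[j] += mu * e * u[k - j]

err = sum((w_o.get(j, 0.0) - w.get(j, 0.0)) ** 2 for j in range(N)) ** 0.5
```

Only the truly active taps survive detection, so the adaptive filter carries two parameters instead of sixteen, with the correspondingly smaller asymptotic excess error discussed in Section 5.4.3.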
5.4.6
The above analysis critically depends on the whiteness of the input signal. In case
the input signal is not white, one could consider introducing an input prewhitening
filter before detection and LMS updates take place. This is common in the acoustic
echo-cancellation situation [28, 29]. As the introduction of prewhitening filters
significantly increases the computational complexity, a more direct approach with a
modified active tap test may be advantageous in other applications.
The main difficulty with the activity measure X_n(j) in the colored input case is
that the detection threshold (as defined in Theorem 3) is too low. Indeed, the
threshold has to be raised to

\frac{2\sigma_d^2 \log n}{n} \cdot \frac{\sum_{j=1}^{L} R(1,j)^2}{\sigma_u^4}.

Here L is the effective length of the autocorrelation function of the input signal. It
amounts to assuming that E\{u(n)u(n+L+j)\} = 0, or is negligible, for all integers
j \ge 0. Note that in case the signal is white, the above threshold is identical to the
threshold discussed in Section 5.4.5.
Unfortunately, with the new threshold, some inactive taps will necessarily be
labeled as active (for a signal with an autocorrelation length of the order of N, all taps
would be labeled active). Structural consistency is lost. In order to combat this, the
LMS estimated FIR filter weights can be used to obtain a better activity measure by
essentially eliminating the cross-correlation in the detection phase (this assumes that
the LMS estimation works, despite the extra estimated weights). Such a bootstrap
process appears to work in practice, but no formal result indicating structural
consistency is available. Extensive simulations are reported in [28].
5.4.7
Simulation Examples
The following examples are based on the algorithm suggested in Section 5.4.6,
compared with a standard LMS algorithm without active tap detection.
The design parameters in the active tap detection LMS algorithm are the
forgetting factor \alpha = 0.9 and the step-size \mu = 0.001.
The unknown FIR filter is represented in Figure 5.5.
The performance of the active tap detection LMS algorithm is represented in
Figure 5.6. Figure 5.6a corresponds to a signal environment in which both the input
Figure 5.5 Unknown FIR filter's parameters: N = 300; number of active taps m = 11.
Figure 5.6 \|\varepsilon(n)\|^2 for the active tap detection LMS algorithm applied to the sparse FIR
filter of Figure 5.5.
Figure 5.7 \|\varepsilon(n)\|^2 for the standard LMS algorithm applied to the sparse FIR filter of
Figure 5.5.
u and the disturbance signal are discrete white, uncorrelated, zero mean, unit
variance Gaussian processes. In Figure 5.6b the disturbance signal is a Gaussian
first-order AR process with AR coefficient 0.8, driven by a unit variance white noise.
Figure 5.6 displays the evolution of the parameter estimation error \|\varepsilon(n)\|^2. This
figure should be compared with Figure 5.7, which displays the same information for
the standard LMS algorithm. Clearly, the asymptotic performance is significantly
worse in the standard LMS case. Although not directly measurable, the asymptotic
performance is about m/N \approx 1/30 times better for the algorithm with detection. This
is in line with the theory. Observe also that the convergence time is about the same
for both algorithms. This clearly illustrates the observation that the dimension does
not affect the convergence cost in the white input case.
The active tap detection part of the algorithm is illustrated in Figure 5.8. For the
same filter circumstances as before, Figure 5.8a corresponds to a signal-to-noise ratio
of 1, while Figure 5.8b corresponds to a signal-to-noise ratio of 10. In both cases the
input and the disturbance are zero mean white Gaussian signals: \sigma_u^2 = 1, with
\sigma_d^2 = 1 in Figure 5.8a and \sigma_d^2 = 0.1 in Figure 5.8b. Reasonably quickly, the correct
number of taps is estimated. As illustrated in Figure 5.6, the parameters converge
quickly to the correct values as well.
Figure 5.8 Estimated number of active taps for the detection-enhanced LMS algorithm
applied to the sparse FIR filter of Figure 5.5.
5.5
DISCUSSION
REFERENCES
1. B. Widrow, S. Stearns, Adaptive Signal Processing, Englewood Cliffs, NJ, Prentice-Hall,
1985.
2. V. Solo, X. Kong, Adaptive Signal Processing Algorithms: Stability and Performance,
Englewood Cliffs, NJ, Prentice-Hall, 1995.
3. K. Wesolowski, C. M. Zhao, W. Rupprecht, Adaptive LMS transversal filters with
controlled length, IEE Proceedings-F, Vol. 139, pp. 233-239, 1992.
4. K. Fujii, J. Ohga, Equation for brief evaluation of the convergence rate of the normalised
LMS algorithm, IEICE Trans. Fundamentals, Vol. E76-A, pp. 2048-2051, 1993.
5. P. E. An, M. Brown, C. J. Harris, On the convergence rate performance of the
normalised least-mean-square adaptation, IEEE Trans. on Neural Networks, Vol. 8, pp.
1211-1214, 1997.
6. T. E. Hunter, D. A. Linebarger, An alternative formulation for low rank transform
domain adaptive filtering, Proceedings of ICASSP 2000, Vol. 1, pp. 29-32, Piscataway,
NJ, 2000.
7. S. F. Cotter, B. D. Rao, Matching pursuit based decision-feedback equalisers,
Proceedings of ICASSP 2000, Vol. 5, pp. 2713-2716, Piscataway, NJ, 2000.
8. S. Gollamudi, S. Nagaraj, S. Kapoor, Y. F. Huang, Set-membership filtering with a
set-membership normalised LMS algorithm with an adaptive step-size, IEEE Signal
Processing Letters, Vol. 5, pp. 111-114, 1998.
9. S. Ariyavisitakul, N. R. Sollenberger, L. J. Greenstein, Tap-selectable decision feedback
equaliser, Proceedings of ICC '97, Vol. 3, pp. 1521-1526, New York, 1997.
47. D. C. Farden, Stochastic approximation with correlated data, IEEE Trans. Information
Theory, Vol. 27, pp. 105-113, 1981.
48. S. K. Jones, R. K. Cavin, W. M. Reed, Analysis of error gradient adaptive linear
estimators for a class of stationary dependent processes, IEEE Trans. on Information
Theory, Vol. 28, pp. 318-329, 1982.
49. M. R. Leadbetter, G. Lindgren, H. Rootzen, Extremes and Related Properties of Random
Sequences and Processes, New York, Springer-Verlag, 1982.
CONTROL OF LMS-TYPE
ADAPTIVE FILTERS
EBERHARD HÄNSLER
Signal Theory Group, Darmstadt University of Technology, Darmstadt, Germany
6.1
INTRODUCTION
Adaptive filtering is a very powerful tool in modern signal processing, and its
importance is still increasing. In the past few decades, several algorithms like fast
recursive least squares, fast Newton, and affine projection algorithms have been
developed in order to achieve fast convergence at low or moderate computational
complexity. Nevertheless, due to its simplicity and its numerical robustness, the
least-mean-square (LMS) algorithm, especially its normalized version, the NLMS
algorithm, is still one of the most important adaptive algorithms.
In this chapter we will focus on control aspects of LMS-type adaptive filters. In
most real implementations the desired signal is distorted by measurement noise.
Depending on the application, the signal-to-noise ratio can even drop below 0 dB.
In order to achieve a high speed of convergence and a small steady-state
error in the presence of measurement noise, control is absolutely necessary. The
chapter is organized as follows:
In Section 1 we will briefly mention the relation of system design and its
impact on control. In particular, the choice of the processing and control
structure enables or disables several degrees of freedom, which can be
exploited for control purposes.
6.1.1
Notation
Among the enormous number of applications where the LMS or NLMS algorithm
can be utilized, we will address only system identification problems according to
[12] in this chapter. In Figure 6.1 the general setup as well as some notation issues
are depicted.
Several assumptions will be made in this chapter. Firstly, we will assume that the
system to be identified can be modeled with sufficient accuracy as a
linear finite impulse response (FIR) filter. Its impulse response will be denoted by
h_i(n). The subscript i addresses the ith coefficient of the impulse response at time
index n. We will not assume that we have time-invariant systems; therefore, we need
a time index as well as a coefficient index. The impulse response of the FIR filter can
be written as a vector:

h(n) = [h_0(n), h_1(n), \ldots, h_{N-1}(n)]^T. \qquad (6.1)
The output of the unknown system y(n) consists of the desired signal d(n) and
additional measurement noise n(n). We will distinguish here between stationary
measurement noise n_s(n) and nonstationary noise n_n(n):

y(n) = d(n) + n(n) = d(n) + n_s(n) + n_n(n). \qquad (6.2)

The desired signal is the convolution of the excitation signal with the impulse
response of the unknown system:

d(n) = \sum_{i=0}^{N-1} h_i(n)\, u(n-i) = h^T(n)\, u(n) = u^T(n)\, h(n). \qquad (6.3)

In the last line of Eq. (6.3), vector notation was also used for the excitation signal:

u(n) = [u(n), u(n-1), \ldots, u(n-N+1)]^T. \qquad (6.4)
In Table 6.1 the most important symbols as well as their meanings are listed.
TABLE 6.1 Notation

Symbol        Meaning
d(n)          Desired signal
d̂(n)          Output signal of the adaptive filter
e(n)          Error signal
h(n)          Impulse response vector of the unknown system
n(n)          Measurement noise
n_n(n)        Nonstationary part of the measurement noise
n_s(n)        Stationary part of the measurement noise
u(n)          Excitation signal
w(n)          Impulse response vector of the adaptive filter
w_o           Wiener solution
y(n)          Distorted output signal of the unknown system
Δ, Δ(n)       Regularization parameter
ε(n)          System mismatch vector
μ, μ(n)       Step-size
6.1.2
Control Structures
In this chapter, we will mention two possibilities for control purposes: weighting the
filter update by multiplication with a step-size and increasing the denominator of the
update term by regularization. Due to the scalar normalization within the NLMS
update, both forms of control can easily be exchanged (see Section 6.3.3).
Nevertheless, their practical implementations often differ. For this reason, we will
deal with both possibilities here. Furthermore, we will distinguish between a scalar
step-size and a step-size matrix.
6.1.2.1 Scalar Control Parameters For computing the filter update according to
the NLMS algorithm, the error signal e(n) is required:

e(n) = y(n) - \hat{d}(n) = h^T(n)\, u(n) + n(n) - w^T(n)\, u(n). \qquad (6.5)

For the impulse response of the adaptive filter w_i(n) the same (vector) notation as for
the impulse response of the unknown system is used:

w(n) = [w_0(n), w_1(n), \ldots, w_{N-1}(n)]^T. \qquad (6.6)
The NLMS update with scalar step-size \mu and regularization parameter \Delta then reads

w(n+1) = w(n) + \mu\, \frac{e(n)\, u(n)}{\|u(n)\|^2 + \Delta}. \qquad (6.7)
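A direct implementation of the scalar-controlled update (6.7) is short. The sketch below is a minimal system-identification demo (the unknown system, step-size, and regularization are arbitrarily chosen values, not parameters from the chapter); it adapts an N-tap filter toward an unknown FIR system and tracks the system mismatch:

```python
import random, math

random.seed(3)

N = 8
h = [0.5, -0.3, 0.2, 0.1, -0.1, 0.05, 0.02, -0.02]  # unknown system (assumed)
mu, delta = 0.5, 1e-3       # step-size and regularization (assumed values)
w = [0.0] * N               # adaptive filter
u_buf = [0.0] * N           # excitation vector u(n)

mismatch = []
for n in range(4000):
    u_buf = [random.gauss(0, 1)] + u_buf[:-1]
    y = sum(hi * ui for hi, ui in zip(h, u_buf)) + random.gauss(0, 0.01)
    e = y - sum(wi * ui for wi, ui in zip(w, u_buf))       # eq. (6.5)
    norm2 = sum(ui * ui for ui in u_buf)
    # NLMS update, eq. (6.7)
    w = [wi + mu * e * ui / (norm2 + delta) for wi, ui in zip(w, u_buf)]
    mismatch.append(math.sqrt(sum((a - b) ** 2 for a, b in zip(h, w))))
```

The normalization by the regressor energy makes the effective step independent of the excitation power, which is the practical appeal of NLMS over plain LMS.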
If each coefficient is to be controlled individually, a step-size matrix can be used
instead of the scalar step-size:

w(n+1) = w(n) + \mathrm{diag}\{\mu_0, \mu_1, \ldots, \mu_{N-1}\}\, \frac{e(n)\, u(n)}{\|u(n)\|^2 + \Delta}. \qquad (6.8)
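The matrix-controlled update (6.8) differs from (6.7) only in the per-coefficient weighting. The sketch below (again an illustrative toy with an ad hoc step-size profile, not a tuning from the chapter) gives larger step-sizes to the early coefficients, where the assumed impulse response concentrates its energy, and smaller ones to the tail:

```python
import random, math

random.seed(5)

N = 8
h = [0.8, -0.5, 0.3, 0.0, 0.0, 0.0, 0.0, 0.0]  # energy in the first taps (assumed)
mu_vec = [0.5] * 3 + [0.05] * (N - 3)          # per-coefficient step-sizes (ad hoc)
delta = 1e-3
w = [0.0] * N
u_buf = [0.0] * N

for n in range(4000):
    u_buf = [random.gauss(0, 1)] + u_buf[:-1]
    y = sum(hi * ui for hi, ui in zip(h, u_buf)) + random.gauss(0, 0.01)
    e = y - sum(wi * ui for wi, ui in zip(w, u_buf))
    norm2 = sum(ui * ui for ui in u_buf)
    # Matrix step-size update, eq. (6.8): diag{mu_i} replaces the scalar mu.
    w = [wi + mi * e * ui / (norm2 + delta)
         for wi, mi, ui in zip(w, mu_vec, u_buf)]

final_mismatch = math.sqrt(sum((a - b) ** 2 for a, b in zip(h, w)))
```

The energetic taps converge quickly under the large step-sizes, while the tail taps, updated cautiously, accumulate very little gradient noise; this is the "delay resolution" advantage discussed below.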
6.1.3 Processing Structures
Besides selecting different control structures, the system designer also has the
possibility of choosing between different processing structures: fullband processing,
block processing, and subband processing.
TABLE 6.2 Overview of the control possibilities of the different processing structures.
For fullband (Sec. 6.1.3.1), block (Sec. 6.1.3.2), and subband (Sec. 6.1.3.3) processing,
the table rates the achievable time, frequency, and delay resolution, both for a scalar
step-size and for a vector (matrix) step-size.
Before these three processing structures and the related control possibilities are
described in more detail in the next three subsections, Table 6.2 gives an overview of
the advantages and disadvantages of the different structures. The possibility of using
different control parameters for each coefficient w_i(n) of the filter vector w(n) is
referred to by the term delay resolution.
6.1.3.1 Fullband Processing Fullband processing structures, according to
Figure 6.1, offer the possibility of adjusting the control parameters \mu(n) and \Delta(n)
differently in each iteration. For this reason, fullband processing has the best time
resolution of all processing structures. If a matrix step-size is utilized (Eq. 6.8), the
additional degree of freedom can be exploited to adapt each coefficient w_i(n) of the
filter vector w(n) individually. Especially for impulse responses which concentrate
their energy on only a few coefficients (see Fig. 6.2), this is an important advantage
for control purposes.
In the left part of Figure 6.3 an impulse response of a loudspeaker-enclosure-microphone
system is depicted. Details of this kind of system are explained in
Section 6.4. We will use the impulse response in this and the next two subsections to
demonstrate the advantages and disadvantages of the different processing structures.
The two diagrams in the right part of Figure 6.3 show the control freedoms in the
delay-frequency domain. The term delay in this context represents the coefficient
index i of the impulse response h_i(n). If only scalar control parameters are used,
neither frequency-selective nor delay-selective control is possible. For this reason,
the delay-frequency domain is not segmented in the upper diagram. If a matrix
step-size is applied, selectivity in the delay direction is possible. The lower diagram
is therefore segmented vertically. Even if fullband processing structures do not have
the possibility of frequency-selective control, they have the very basic advantage of
not introducing any artificial delay into the signal paths. For some applications this is
a necessary feature.
Figure 6.3
6.1.3.2 Block Processing Long time-domain adaptive filters require huge
processing power due to their large number of coefficients. For many applications,
such as acoustic echo or noise control, algorithms with low numerical complexity
are necessary. To solve the complexity problem, adaptive filters based on block
processing [34, 35] can be used.
In general, most block processing algorithms collect B input signal samples
before they calculate a block of B output signal samples. Consequently, the filter is
adapted only once every B sampling instants. To reduce the computational complexity,
the convolution and the adaptation are performed in the frequency domain (see
Fig. 6.4).
Besides the advantage of reduced computational complexity, block processing
also has disadvantages. Because only one adaptation is computed every B
samples, the time resolution for control purposes is reduced. If the signal-to-noise
ratio changes in the middle of a signal block, for example, the control parameters
can only be adjusted to the mean signal-to-noise ratio (averaged over the block
length). Especially for a large block length B, and therefore a large reduction of
computational complexity, the impact of the reduced time resolution becomes clearly
apparent.
If a vector step-size is chosen and the filter update is performed in the frequency
domain, a new degree of freedom arises. Each frequency bin of the update of the
transformed filter vector W_b(e^{j2\pi m/B}, n) can be weighted individually. Especially if
the system has low-pass, bandpass, or high-pass character and the involved signals
are stationary, the convergence speed can be increased. In the left part of Figure 6.5,
the magnitude of the Fourier transform of the impulse response of Figure 6.3 is
depicted. The dark area represents the basic control area if a matrix step-size is used.
Figure 6.4 Block processing structure. To reduce computational complexity, the convolution and the adaptation are performed in the frequency domain.
Figure 6.5
As in Figure 6.3, two delay-frequency areas are depicted in the right part
of Figure 6.5. If the matrix step-size is applied in the frequency domain, the
delay-frequency area is split horizontally, showing the control freedom for individual
control possibilities over frequency.
Besides all the advantages of block processing, another inherent disadvantage of this
processing structure should also be mentioned. Due to the collection of B samples, a
significant delay is introduced into the signal paths.
6.1.3.3 Subband Processing In subband processing structures (see Fig. 6.6),
analysis filter banks split the involved signals into M subband channels,
with m \in \{0 \ldots M-1\}.
Figure 6.6
Subband structure.
184
In Figure 6.6 the subband signals are grouped in vectors. For example, the vector

u_{sb}(n) = [u_0(n), u_1(n), \ldots, u_{M-1}(n)]^T

collects all subband excitation signals (channel 0 to M-1) at the subsampled
time index n.
In contrast to block (frequency domain) processing, the subband structure offers
the system designer an additional degree of freedom. Detectors and control
mechanisms can be implemented separately for each channel. If matrix step-sizes
are applied in each channel, delay-selective control is also possible. In Figure 6.7 a
delay-frequency analysis of the impulse response of Figure 6.3 is depicted.
Even without using matrix step-sizes, it is possible to control each subband
individually. For this reason, the delay-frequency area is even for the case of scalar
control parameters segmented horizontally. Also, the orders of the adaptive lters
can be adjusted individually in each channel according to the statistical properties of
the excitation signal, the measurement noise, and the impulse response of the system
to be identied. If matrix step-size control is applied, the delay-frequency area can
also be segmented vertically.
Using subsampled signals leads to a reduction of computational complexity. All
necessary forms of control and detection can operate independently in each channel.
The price to be paid for these advantages is a significant delay introduced into the signal path by the analysis and synthesis filter banks.
Figure 6.7 Delay-frequency analysis of the impulse response of Figure 6.3.
6.1.4 Control Principles
Besides the different processing and control structures, the system designer can also choose between different control principles. In contrast to the processing and control structure (which should be matched to the application), the authors strongly recommend the use of a state-dependent control strategy (as described in Subsection 6.1.4.2 as well as in the rest of this chapter). Nevertheless, in a very few applications a binary (on/off) control can be applied.
6.1.4.1 Binary Control Strategy  One basic control principle that is often used in real implementations is simply to switch the step-size or the regularization parameter between two values: 0 and μ_fix, or ∞ and Δ_fix, respectively. Even though this method does not require explicit calculation of optimal control parameters, one should have a reliable indicator for the choice between the two values. An adaptation with the nonzero step-size μ_fix or the non-infinite regularization parameter Δ_fix should be performed in those iterations where the distance between the unknown and the adaptive system can be decreased on average; in all other cases, the filter should not be adapted (see Sec. 6.2). We will see in Section 6.3 that an adaptation step is successful according to the criterion mentioned above if the fixed control parameters are smaller than twice the optimal values for state-dependent control:
\[ 0 < \mu_{\mathrm{fix}} < 2\mu_{\mathrm{opt}}(n), \]
\[ 0 < \Delta_{\mathrm{fix}} < 2\Delta_{\mathrm{opt}}(n). \tag{6.9} \]
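As a minimal sketch of this principle (the function name and the boolean detector input are our illustrative choices; only the safety check of Eq. (6.9) comes from the text):

```python
def binary_step_size(adapt_now: bool, mu_fix: float, mu_opt: float) -> float:
    """Binary (on/off) control: return mu_fix when a detector signals that an
    adaptation step would decrease the system distance on average, and 0
    otherwise. The inequality 0 < mu_fix < 2 * mu_opt implements the
    success condition of Eq. (6.9)."""
    if adapt_now and 0.0 < mu_fix < 2.0 * mu_opt:
        return mu_fix
    return 0.0
```

Note that the detector itself (how `adapt_now` is derived) is application-specific; robust examples are discussed in Section 6.5.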
6.1.5 Concluding Remarks
The aim of this section was to provide a basic introduction to control and processing
structures for the NLMS algorithm. The system designer should be aware of the
alternatives that he or she can choose from, even at an early stage of the design
process. The optimal choice depends crucially on the application. Therefore, we will
give an application example and discuss the impacts of the choice of the processing
and control structure in Section 6.4. In the next sections, we will provide a general
understanding of the adaptation process in dependence on the control parameters
and the measurement noise.
6.2
In this section, convergence and stability issues for LMS-type adaptive filters are discussed. A variety of excellent convergence analyses and LMS derivations can be found in the literature (e.g., [9, 13, 23, 43]). Most of them are based on eigenvalue analyses of the autocorrelation matrix of the excitation signal and provide insight into the LMS (and NLMS) algorithm.
Our aim is to provide a basic understanding and a general overview of the convergence properties in the presence of measurement noise. In particular, the dependence of the convergence speed and the final misadjustment on the control parameters μ and Δ is investigated. For a better understanding, we will consider only scalar, time-invariant control parameters. Furthermore, only system identification problems according to [12] will be addressed here.
6.2.1
In the left part of Figure 6.8 the general structure of a classical system identification is depicted. Minimizing the expectation of the squared error signal,
\[ E\{e^2(n)\} \rightarrow \min, \tag{6.10} \]
leads to the Wiener solution
\[ W_{\mathrm{o}}(e^{j\Omega}) = \frac{S_{uy}(\Omega)}{S_{uu}(\Omega)}. \tag{6.11} \]
Figure 6.8 General structure of a classical system identification (left) and the definition of the system mismatch vector (right).
The quantities S_uu(Ω) and S_uy(Ω) denote the auto power spectral density and the cross power spectral density, respectively. If the measurement noise n(n) and the excitation signal u(n) are orthogonal, the frequency response of the optimal solution will be
\[ W_{\mathrm{o}}(e^{j\Omega}) = \frac{S_{ud}(\Omega)}{S_{uu}(\Omega)} = H(e^{j\Omega}). \tag{6.12} \]
In this case, the optimal solution for the adaptive filter will be an ideal copy of the unknown system. Furthermore, it should be mentioned that stationarity of the signals and time-invariant systems were assumed in the Wiener approach.
Even if the average power of the error signal e(n) is a valuable criterion for minimization purposes, it is not very useful if information about the convergence state (which is very important for control purposes) is wanted. A large power of the error signal may be due to poor system identification or may stem from a large measurement noise power. Therefore, a better procedure is to estimate the power of the undistorted error. This signal is defined as (see Fig. 6.8)
\[ e_{\mathrm{u}}(n) = e(n) - n(n) = d(n) - \hat{d}(n). \tag{6.13} \]
If the power of the undistorted error is zero, the output of the adaptive filter d̂(n) will be equal to the desired signal d(n). Besides the fact that the undistorted error power cannot be measured directly, a zero undistorted error does not necessarily mean that both filters are identical. They may differ from each other at frequencies that are not excited by u(n).
Another possible way of judging the convergence state is to monitor the system mismatch vector, which is defined as the difference of the impulse responses of the unknown and the adaptive filter:
\[ \boldsymbol{\varepsilon}(n) = \mathbf{h}(n) - \mathbf{w}(n). \tag{6.14} \]
In order to derive a scalar cost function, only the squared norm of this vector (called the system distance) is utilized. This quantity is independent of the properties of the excitation signal. Therefore, in the following derivations, we will try to minimize the expected system distance:
\[ E\{\|\boldsymbol{\varepsilon}(n)\|^2\} \rightarrow \min. \tag{6.15} \]

6.2.2
Using the definition of the system mismatch vector and assuming that the system is time-invariant, h(n + 1) = h(n), the equation for the NLMS filter update can be used to derive an iteration of the system mismatch vector:
\[ \boldsymbol{\varepsilon}(n+1) = \boldsymbol{\varepsilon}(n) - \mu\, \frac{e(n)\,\mathbf{u}(n)}{\|\mathbf{u}(n)\|^2 + \Delta}. \tag{6.16} \]
The aim of an adaptive algorithm should be to minimize the expected squared norm of the system mismatch vector. Using Eq. (6.16), the squared norm can be computed recursively:
\[ \|\boldsymbol{\varepsilon}(n+1)\|^2 = \boldsymbol{\varepsilon}^{\mathrm{T}}(n+1)\,\boldsymbol{\varepsilon}(n+1) \]
\[ = \boldsymbol{\varepsilon}^{\mathrm{T}}(n)\boldsymbol{\varepsilon}(n) - 2\mu\, \frac{e(n)\,\boldsymbol{\varepsilon}^{\mathrm{T}}(n)\,\mathbf{u}(n)}{\|\mathbf{u}(n)\|^2 + \Delta} + \mu^2\, \frac{e^2(n)\,\mathbf{u}^{\mathrm{T}}(n)\,\mathbf{u}(n)}{(\|\mathbf{u}(n)\|^2 + \Delta)^2} \]
\[ = \|\boldsymbol{\varepsilon}(n)\|^2 - 2\mu\, \frac{e(n)\,e_{\mathrm{u}}(n)}{\|\mathbf{u}(n)\|^2 + \Delta} + \mu^2\, \frac{e^2(n)\,\|\mathbf{u}(n)\|^2}{(\|\mathbf{u}(n)\|^2 + \Delta)^2}. \tag{6.17} \]
The error signal e(n) consists of its undistorted part e_u(n) and the measurement noise n(n):
\[ e(n) = e_{\mathrm{u}}(n) + n(n) = \boldsymbol{\varepsilon}^{\mathrm{T}}(n)\,\mathbf{u}(n) + n(n). \tag{6.18} \]
Using this definition and assuming further that u(n) and n(n) are statistically independent and zero-mean (m_u = 0, m_n = 0), the expected squared norm of the system mismatch vector can be written as
\[ E\{\|\boldsymbol{\varepsilon}(n+1)\|^2\} = E\{\|\boldsymbol{\varepsilon}(n)\|^2\} - 2\mu\, E\!\left\{\frac{\boldsymbol{\varepsilon}^{\mathrm{T}}(n)\,\mathbf{u}(n)\,\mathbf{u}^{\mathrm{T}}(n)\,\boldsymbol{\varepsilon}(n)}{\|\mathbf{u}(n)\|^2 + \Delta}\right\} + \mu^2\, E\!\left\{\frac{\left(\boldsymbol{\varepsilon}^{\mathrm{T}}(n)\,\mathbf{u}(n)\,\mathbf{u}^{\mathrm{T}}(n)\,\boldsymbol{\varepsilon}(n) + n^2(n)\right)\|\mathbf{u}(n)\|^2}{(\|\mathbf{u}(n)\|^2 + \Delta)^2}\right\}. \tag{6.19} \]
For large filter orders, N − 1 ≫ 1, the squared norm of the excitation vector can be approximated by a constant which is equal to N times the variance of the signal:
\[ \|\mathbf{u}(n)\|^2 \approx N\sigma_u^2. \tag{6.20} \]
With this approximation, recursion (6.19) becomes
\[ E\{\|\boldsymbol{\varepsilon}(n+1)\|^2\} \approx E\{\|\boldsymbol{\varepsilon}(n)\|^2\} - \left(\frac{2\mu}{N\sigma_u^2 + \Delta} - \frac{\mu^2 N\sigma_u^2}{(N\sigma_u^2 + \Delta)^2}\right) E\{e_{\mathrm{u}}^2(n)\} + \frac{\mu^2 N\sigma_u^2}{(N\sigma_u^2 + \Delta)^2}\, \sigma_n^2. \tag{6.21} \]
For white-noise excitation, the power of the undistorted error can be approximated as
\[ E\{e_{\mathrm{u}}^2(n)\} \approx \sigma_u^2\, E\{\|\boldsymbol{\varepsilon}(n)\|^2\}, \tag{6.22} \]
which leads to
\[ E\{\|\boldsymbol{\varepsilon}(n+1)\|^2\} \approx \left(1 - \frac{2\mu\sigma_u^2}{N\sigma_u^2 + \Delta} + \frac{\mu^2 N\sigma_u^4}{(N\sigma_u^2 + \Delta)^2}\right) E\{\|\boldsymbol{\varepsilon}(n)\|^2\} + \frac{\mu^2 N\sigma_u^2}{(N\sigma_u^2 + \Delta)^2}\, \sigma_n^2. \tag{6.23} \]
The first row in Eq. (6.23) shows the contraction due to the undistorted adaptation process. The factor
\[ A(\mu, \Delta, \sigma_u^2, N) = 1 - \frac{2\mu\sigma_u^2}{N\sigma_u^2 + \Delta} + \frac{\mu^2 N\sigma_u^4}{(N\sigma_u^2 + \Delta)^2} \tag{6.24} \]
will be called the contraction parameter and should always be smaller than 1. The second row in Eq. (6.23) describes the influence of the measurement noise. This signal disturbs the adaptation process. After introducing the abbreviation
\[ B(\mu, \Delta, \sigma_u^2, N) = \frac{\mu^2 N\sigma_u^4}{(N\sigma_u^2 + \Delta)^2}, \tag{6.25} \]
which is called the expansion parameter, Eq. (6.23) can be written in a shorter form:
\[ E\{\|\boldsymbol{\varepsilon}(n+1)\|^2\} \approx \underbrace{A(\mu,\Delta,\sigma_u^2,N)}_{\text{contraction parameter}}\, E\{\|\boldsymbol{\varepsilon}(n)\|^2\} + \underbrace{B(\mu,\Delta,\sigma_u^2,N)}_{\text{expansion parameter}}\, \frac{\sigma_n^2}{\sigma_u^2}. \tag{6.26} \]
The contraction and expansion parameters are dimensionless quantities, and both depend on the control variables μ and Δ as well as on the filter order N − 1 and the excitation power σ_u². If the influence of the measurement noise is to be eliminated completely, the expansion parameter B(μ, Δ, σ_u², N) has to be zero. This can be achieved by setting the step-size to zero or the regularization parameter to infinity. Unfortunately, with these choices the filter will no longer be adapted. In Figure 6.9 the values of the contraction and expansion parameters for a filter length of N = 100 and an input power of σ_u² = 1 are depicted.
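The two parameter functions are easy to evaluate directly; the following sketch (function names are ours) reproduces Eqs. (6.24) and (6.25):

```python
def contraction(mu, delta, sigma_u2, N):
    """Contraction parameter A(mu, Delta, sigma_u^2, N), Eq. (6.24)."""
    d = N * sigma_u2 + delta
    return 1.0 - 2.0 * mu * sigma_u2 / d + mu**2 * N * sigma_u2**2 / d**2

def expansion(mu, delta, sigma_u2, N):
    """Expansion parameter B(mu, Delta, sigma_u^2, N), Eq. (6.25)."""
    d = N * sigma_u2 + delta
    return mu**2 * N * sigma_u2**2 / d**2

# Uncontrolled adaptation (mu = 1, Delta = 0) with N = 100, sigma_u^2 = 1:
A = contraction(1.0, 0.0, 1.0, 100)   # 1 - 2/N + 1/N = 0.99
B = expansion(1.0, 0.0, 1.0, 100)     # 1/N = 0.01
```

Setting μ = 0 makes the expansion parameter vanish, but it also makes A = 1, i.e., the filter is no longer adapted, exactly as stated above.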
Figure 6.9 Contraction and expansion parameters. The upper two diagrams show surface plots of both parameter functions for a fixed excitation power σ_u² = 1 and a fixed filter length of N = 100. In the lower two diagrams the corresponding plots are depicted. For fast convergence, a compromise between fastest contraction (A(μ, Δ, σ_u², N) → min) and no influence of the measurement noise (B(μ, Δ, σ_u², N) → 0) has to be found. The optimal compromise will depend on the convergence state of the filter (see Sect. 6.3).
For fast convergence, a compromise between the fastest contraction,
\[ A(\mu, \Delta, \sigma_u^2, N) \rightarrow \min, \tag{6.27} \]
and no influence of the measurement noise,
\[ B(\mu, \Delta, \sigma_u^2, N) \rightarrow 0, \tag{6.28} \]
has to be found. The optimal compromise will depend on the convergence state of the filter. In Section 6.3 this question will be answered. Here we will first investigate the convergence of the filter for fixed control parameters. We can therefore solve the
recursion (6.26) explicitly:
\[ E\{\|\boldsymbol{\varepsilon}(n)\|^2\} \approx A^n(\mu,\Delta,\sigma_u^2,N)\, E\{\|\boldsymbol{\varepsilon}(0)\|^2\} + B(\mu,\Delta,\sigma_u^2,N)\, \frac{\sigma_n^2}{\sigma_u^2} \sum_{m=0}^{n-1} A^m(\mu,\Delta,\sigma_u^2,N). \tag{6.29} \]
For stable adaptation (A(μ, Δ, σ_u², N) < 1), the system distance converges toward
\[ \lim_{n\to\infty} E\{\|\boldsymbol{\varepsilon}(n)\|^2\} = \frac{B(\mu,\Delta,\sigma_u^2,N)}{1 - A(\mu,\Delta,\sigma_u^2,N)}\, \frac{\sigma_n^2}{\sigma_u^2} = \frac{\mu\, N\, \sigma_n^2}{(2-\mu)\, N\sigma_u^2 + 2\Delta}. \tag{6.30} \]
For the uncontrolled adaptation (μ = 1, Δ = 0), the final system distance will be equal to the inverse of the signal-to-noise ratio:
\[ \lim_{n\to\infty} E\{\|\boldsymbol{\varepsilon}(n)\|^2\}\big|_{\mu=1,\,\Delta=0} = \frac{\sigma_n^2}{\sigma_u^2}. \tag{6.31} \]
Finally, two simulation examples are presented in Figure 6.10 to indicate the validity of approximation (6.29). In both cases, white noise was chosen for the excitation and the measurement noise, with a signal-to-noise ratio of 30 dB. In the first simulation, the parameter set μ1 = 0.7 and Δ1 = 400 has been used. According to approximation (6.30), a final misadjustment of about −35 dB should be achieved. For the second parameter set the values are μ2 = 0.4 and Δ2 = 800. With this set a final misadjustment of about −39 dB is achievable.
In the lowest diagram of Figure 6.10, the measured system distance as well as its theoretical progression are depicted. The theoretical and measured curves mostly overlay; the maximal (logarithmic) difference is about 3 dB.
6.2.3
The convergence speed is of special importance at the start of an adaptation and after changes of the system h(n). In these situations, we will assume that the influence of the measurement noise can be neglected. Therefore, the recursive computation of the expected system distance can be simplified to
\[ E\{\|\boldsymbol{\varepsilon}(n+1)\|^2\}\big|_{\sigma_n^2=0} = A(\mu,\Delta,\sigma_u^2,N)\, E\{\|\boldsymbol{\varepsilon}(n)\|^2\}, \tag{6.32} \]
which is solved by
\[ E\{\|\boldsymbol{\varepsilon}(n)\|^2\}\big|_{\sigma_n^2=0} = A^n(\mu,\Delta,\sigma_u^2,N)\, E\{\|\boldsymbol{\varepsilon}(0)\|^2\}. \tag{6.33} \]
Figure 6.10 Simulation examples and theoretical convergence. In order to validate approximation (6.29), two simulations are presented. White noise was used for the excitation (depicted in the upper diagram) and the measurement noise (presented in the middle diagram), with a signal-to-noise ratio of 30 dB. In the lower diagram the measured system distance as well as its theoretical progression are depicted. The theoretical and measured curves mostly overlay. Also, the predicted final misadjustments according to approximation (6.30) coincide with the measured ones very well.
The contraction parameter is smaller than 1, i.e., the adaptation process is stable, only if the step-size satisfies
\[ \mu < 2 + \frac{2\Delta}{N\sigma_u^2}. \tag{6.34} \]
For the fastest convergence, the contraction parameter has to be minimized with respect to both control parameters:
\[ \frac{\partial A(\mu,\Delta,\sigma_u^2,N)}{\partial \mu}\bigg|_{\mu=\mu_{\mathrm{opt}}} = 0, \qquad \frac{\partial A(\mu,\Delta,\sigma_u^2,N)}{\partial \Delta}\bigg|_{\Delta=\Delta_{\mathrm{opt}}} = 0. \tag{6.35} \]
After inserting the definition of the contraction parameter (Eq. 6.24) and equating both derivatives to zero, we finally get maximal convergence speed if the step-size and the regularization parameter are related as
\[ \mu = 1 + \frac{\Delta}{N\sigma_u^2}. \tag{6.36} \]
With this relation, the contraction parameter takes its minimal value A(μ_opt, Δ_opt, σ_u², N) = 1 − 1/N, and the fastest possible decrease of the system distance can be stated as
\[ E\{\|\boldsymbol{\varepsilon}(n+2N)\|^2\} \approx \left(1 - \frac{1}{N}\right)^{2N} E\{\|\boldsymbol{\varepsilon}(n)\|^2\} \approx e^{-2}\, E\{\|\boldsymbol{\varepsilon}(n)\|^2\}. \tag{6.41} \]
This means that after a number of iterations equal to two times the filter order, a decrease of the system distance of about 10 dB can be achieved. We can also learn from approximation (6.39) that short filters converge much faster than long ones. In order to elucidate this relationship, two convergences with different filter orders (N = 500 and N = 1000) are presented in Figure 6.12.
To show the validity of approximation (6.41), triangles with edge lengths of 10 dB and 2N are added to the convergence plots. Especially at the start of the convergence, the theoretical decreases of the system distance fit the measured ones very well.
6.2.4
Figure 6.12 Two convergences with different filter orders are depicted (upper diagram: N = 500; lower diagram: N = 1000). White noise was used for the excitation signal as well as for the measurement noise, with a signal-to-noise ratio of about 30 dB. To show the validity of approximation (6.41), triangles with edge lengths of 10 dB and 2N are added to the convergence plots. Especially at the start of the convergence, the theoretical decreases of the system distance fit the measured ones very well.
If only step-size control is applied (Δ = 0), recursion (6.23) simplifies to
\[ E\{\|\boldsymbol{\varepsilon}(n+1)\|^2\}\big|_{\Delta=0} \approx \left(1 - \frac{\mu(2-\mu)}{N}\right) E\{\|\boldsymbol{\varepsilon}(n)\|^2\}\big|_{\Delta=0} + \frac{\mu^2\,\sigma_n^2}{N\sigma_u^2}. \tag{6.42} \]
For 0 < μ < 2 the system distance converges for n → ∞ toward
\[ \lim_{n\to\infty} E\{\|\boldsymbol{\varepsilon}(n)\|^2\}\big|_{\Delta=0} = \frac{\mu\,\sigma_n^2}{(2-\mu)\,\sigma_u^2}. \tag{6.43} \]
In Figure 6.13 three convergences with different choices for the step-size (μ = 1, μ = 0.5, and μ = 0.25) are presented. The same boundary conditions (white noise for the excitation and the measurement noise, 30 dB signal-to-noise ratio, N = 1000) were used as in the first simulation series of Subsection 6.2.2.
According to approximation (6.43), a final misadjustment of the filter of
−30 dB for the choice μ = 1,
−34.8 dB for the choice μ = 1/2, and
−38.5 dB for the choice μ = 1/4
should be achieved. The speed of the initial convergence can be computed as
\[ 10 \log_{10}\!\left(1 - \frac{\mu(2-\mu)}{N}\right) \frac{\mathrm{dB}}{\mathrm{iteration}}. \tag{6.44} \]
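The three misadjustment values quoted above follow directly from approximation (6.43); a minimal check (the function name is ours):

```python
import math

def final_misadjustment_db(mu, snr_db):
    """Final system distance in dB for step-size-only control, Eq. (6.43):
    mu / (2 - mu) times the inverse signal-to-noise ratio."""
    inv_snr = 10.0 ** (-snr_db / 10.0)      # sigma_n^2 / sigma_u^2
    return 10.0 * math.log10(mu / (2.0 - mu) * inv_snr)

for mu in (1.0, 0.5, 0.25):
    print(mu, round(final_misadjustment_db(mu, 30.0), 1))
    # yields -30.0, -34.8, and -38.5 dB for the three step-sizes
```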
Figure 6.13 Three convergences without regularization but with different choices for the step-size (μ = 1, μ = 0.5, and μ = 0.25) are depicted. The same boundary conditions (white noise for the excitation and the measurement noise, 30 dB signal-to-noise ratio, N = 1000) were used as in the first simulation series of Subsection 6.2.2. A final misadjustment of the adaptive filter smaller than the signal-to-noise ratio can be achieved only if the step-size is smaller than 1. This leads to a decreased speed of convergence.
6.2.5
By analogy with the previous subsection, the recursive approximation of the mean squared norm of the system mismatch vector can be simplified if only regularization control (μ = 1) is applied:
\[ E\{\|\boldsymbol{\varepsilon}(n+1)\|^2\}\big|_{\mu=1} \approx \left(1 - \frac{N\sigma_u^4 + 2\sigma_u^2\Delta}{(N\sigma_u^2 + \Delta)^2}\right) E\{\|\boldsymbol{\varepsilon}(n)\|^2\}\big|_{\mu=1} + \frac{N\sigma_u^4}{(N\sigma_u^2 + \Delta)^2}\, \frac{\sigma_n^2}{\sigma_u^2}. \tag{6.45} \]
For Δ ≥ 0 the system distance converges for n → ∞ toward
\[ \lim_{n\to\infty} E\{\|\boldsymbol{\varepsilon}(n)\|^2\}\big|_{\mu=1} = \frac{N\sigma_u^2}{N\sigma_u^2 + 2\Delta}\, \frac{\sigma_n^2}{\sigma_u^2}. \tag{6.46} \]
As in the previous section, three simulations with only regularization control are presented. In Figure 6.14 the convergences with the same boundary conditions (except for the choice of the control parameters) as in the previous example are depicted.
The simulation runs were performed with the choices Δ = 0, Δ = 1000, and Δ = 4000. According to these values and approximation (6.46), final misadjustments of the adaptive filter of −30 dB, −34.8 dB, and −39.5 dB should be achievable. The initial convergence speed (see recursion (6.45)) can be stated as
\[ 10 \log_{10}\!\left(1 - \frac{N\sigma_u^4 + 2\sigma_u^2\Delta}{(N\sigma_u^2 + \Delta)^2}\right) \frac{\mathrm{dB}}{\mathrm{iteration}}. \tag{6.47} \]
As in Figure 6.13, the expected final misadjustments and the convergence speeds are also depicted in Figure 6.14.
6.2.6 Concluding Remarks
In this section the convergence properties of the NLMS algorithm in the presence of measurement noise were investigated. Approximations for the speed of convergence, as well as for the final misadjustment of the adaptive filter as a function of the control parameters, were derived.

Figure 6.14 Three convergences without step-size control but with different choices for the regularization parameter (Δ = 0, Δ = 1000, and Δ = 4000) are depicted. The same boundary conditions (white noise for the excitation and the measurement noise, 30 dB signal-to-noise ratio, N = 1000) as in Figure 6.13 were used. A final misadjustment of the adaptive filter smaller than the signal-to-noise ratio can be achieved only if the regularization parameter is chosen larger than 0. As in the case of step-size control, this leads to a decreased speed of convergence.

It was shown that with parameter sets which lead to an optimal initial convergence speed (e.g., μ = 1, Δ = 0), the final misadjustment cannot be smaller than the inverse of the signal-to-noise ratio.
Especially for small signal-to-noise ratios, this might not be sufficient for adequate system identification. Reducing the step-size or increasing the regularization parameter leads to smaller final misadjustments, but at the same time to a reduced convergence speed.
A fast speed of convergence as well as good steady-state behavior can be achieved only if time-variant control parameters are used. If only step-size control (Δ = 0) is implemented, one should start with a large step-size (μ = 1). The smaller the system distance becomes, the more the step-size should be reduced. Similar strategies should be applied if only regularization control or hybrid control is implemented. In the next section, optimal choices for the step-size and the regularization parameter in dependence on the convergence state and the signal-to-noise ratio will be derived.
6.3
In the previous section, two competing demands were made on the step-size and the regularization parameter:
In order to achieve a fast initial convergence or a fast readaptation after system changes, a large step-size (μ = 1) and a small regularization parameter (Δ = 0) should be used.
To achieve a small final misadjustment, ||ε(n)||² → 0, a small step-size (μ → 0) and/or a large regularization parameter (Δ → ∞) is necessary.
Both requirements cannot be fulfilled with fixed (time-invariant) control parameters. Therefore, a time-variant step-size and a time-variant regularization parameter,
\[ \mu \rightarrow \mu(n), \qquad \Delta \rightarrow \Delta(n), \]
as well as optimal choices for them are introduced in this section. In most system designs, only step-size or only regularization control is used. For this reason, in the first two subsections optimal choices for both control parameters are derived. As suggested in the last section, the optimization criterion will be the minimization of the expected system distance.
Both control strategies can easily be exchanged, as shown in Subsection 6.3.3. Nevertheless, their practical implementations often differ. Even if in most cases only step-size control is implemented, in some situations control by regularization or a mixture of both is a good choice as well. Therefore, in the last part of this subsection, some hints concerning the choice of the control structure are given.
6.3.1

If only step-size control is applied (Δ(n) = 0), the NLMS filter update is
\[ \mathbf{w}(n+1) = \mathbf{w}(n) + \mu(n)\, \frac{e(n)\,\mathbf{u}(n)}{\|\mathbf{u}(n)\|^2}. \tag{6.48} \]
Supposing that the system is time-invariant (h(n + 1) = h(n)), the recursion of the expected system distance can be denoted as follows (see Eq. 6.17, Sec. 6.2.2, for Δ = 0):
\[ E\{\|\boldsymbol{\varepsilon}(n+1)\|^2\} = E\{\|\boldsymbol{\varepsilon}(n)\|^2\} - 2\mu(n)\, E\!\left\{\frac{e(n)\,e_{\mathrm{u}}(n)}{\|\mathbf{u}(n)\|^2}\right\} + \mu^2(n)\, E\!\left\{\frac{e^2(n)}{\|\mathbf{u}(n)\|^2}\right\}. \tag{6.49} \]
For determining an optimal step-size [47], the cost function, that is, the squared norm of the system mismatch vector, should decrease (to be precise, should not increase) on average in every iteration step:
\[ E\{\|\boldsymbol{\varepsilon}(n+1)\|^2\} - E\{\|\boldsymbol{\varepsilon}(n)\|^2\} \le 0. \tag{6.50} \]
Inserting the recursive expression for the expected system distance (Eq. 6.49) into relation (6.50) leads to
\[ \mu^2(n)\, E\!\left\{\frac{e^2(n)}{\|\mathbf{u}(n)\|^2}\right\} - 2\mu(n)\, E\!\left\{\frac{e(n)\,e_{\mathrm{u}}(n)}{\|\mathbf{u}(n)\|^2}\right\} \le 0. \tag{6.51} \]
6:51
eneu n
E
kunk2
:
0 m n 2 2
e n
E
kunk2
6:52
The largest decrease of the system distance is achieved in the middle of the defined interval. To prove this statement, we differentiate the expected system distance at time index n + 1. Setting this derivative to zero leads (due to the quadratic form of the cost function) to the optimal step-size:
\[ \frac{\partial E\{\|\boldsymbol{\varepsilon}(n+1)\|^2\}}{\partial \mu(n)}\bigg|_{\mu(n)=\mu_{\mathrm{opt}}(n)} = 0. \tag{6.53} \]
We assume that the step-sizes at different time instants are uncorrelated. Therefore the expected system distance at time index n does not depend on the step-size μ(n) (only on μ(n − 1), μ(n − 2), . . .). Using this assumption and the recursive expression (6.49), we obtain
\[ 2\mu_{\mathrm{opt}}(n)\, E\!\left\{\frac{e^2(n)}{\|\mathbf{u}(n)\|^2}\right\} - 2\, E\!\left\{\frac{e(n)\,e_{\mathrm{u}}(n)}{\|\mathbf{u}(n)\|^2}\right\} = 0, \]
\[ \mu_{\mathrm{opt}}(n)\, E\!\left\{\frac{e^2(n)}{\|\mathbf{u}(n)\|^2}\right\} = E\!\left\{\frac{e(n)\,e_{\mathrm{u}}(n)}{\|\mathbf{u}(n)\|^2}\right\}, \]
\[ \mu_{\mathrm{opt}}(n) = \frac{E\!\left\{\dfrac{e(n)\,e_{\mathrm{u}}(n)}{\|\mathbf{u}(n)\|^2}\right\}}{E\!\left\{\dfrac{e^2(n)}{\|\mathbf{u}(n)\|^2}\right\}}. \tag{6.54} \]
If we assume that the squared norm of the excitation vector ||u(n)||² can be approximated by a constant, and that the signals n(n) and u(n) (and therefore also n(n) and e_u(n)) are uncorrelated, the optimal step-size can be simplified as follows:
\[ \mu_{\mathrm{opt}}(n) \approx \frac{E\{e_{\mathrm{u}}^2(n)\}}{E\{e^2(n)\}}. \tag{6.55} \]
In the absence of measurement noise (n(n) = 0), the distorted error signal e(n) equals the undistorted error signal e_u(n), and the optimal step-size is 1. If the adaptive filter is well adjusted to the impulse response of the unknown system, the power of the residual error signal e_u(n) is very small. In the presence of measurement noise, the power of the distortion n(n), and therefore also the power of the distorted error signal e(n), increases. In this case, the numerator of Eq. (6.54) is visibly smaller than the denominator, resulting in a step-size close to 0, so the filter is not changed, or only marginally, in such situations. Both examples show that Eq. (6.55) is (at least for these two boundary cases) a useful approximation.
To show the advantages of a time-variant step-size control, the simulation example of Section 6.2.4 (see Fig. 6.13) is repeated. This time a fourth convergence curve, where the step-size is estimated by
\[ \hat{\mu}_{\mathrm{opt}}(n) = \frac{\overline{e_{\mathrm{u}}^2}(n)}{\overline{e^2}(n)}, \tag{6.56} \]
is added. The terms \(\overline{e_{\mathrm{u}}^2}(n)\) and \(\overline{e^2}(n)\) denote short-term smoothed powers (first-order IIR filters) of the squared error signals:
\[ \overline{e_{\mathrm{u}}^2}(n) = \gamma\, \overline{e_{\mathrm{u}}^2}(n-1) + (1-\gamma)\, e_{\mathrm{u}}^2(n), \tag{6.57} \]
\[ \overline{e^2}(n) = \gamma\, \overline{e^2}(n-1) + (1-\gamma)\, e^2(n). \tag{6.58} \]
The time constant was set to γ = 0.995. The resulting step-size μ̂_opt(n) is depicted in the lower diagram of Figure 6.15. At the beginning of the simulation the short-term power of the undistorted error is much larger than that of the measurement noise. Therefore the step-size μ̂_opt(n) is close to 1, and a very fast initial convergence (comparable to the case Δ = 0 and μ = 1) can be achieved (see the upper diagram of Fig. 6.15). The better the filter converges, the smaller the undistorted error becomes. With decreasing error power, the influence of the measurement noise in Eq. (6.58) increases. This leads to a decrease of the step-size parameter μ(n).
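The estimator (6.56)–(6.58) can be sketched as a small class. The class name, the floor value, and the clipping at 1 are our illustrative choices; note also that in practice e_u(n) itself is not measurable and must be estimated (see Sect. 6.5):

```python
class StepSizeEstimator:
    """Pseudo-optimal step-size, Eq. (6.56), with first-order IIR smoothing
    of the squared error signals, Eqs. (6.57)/(6.58)."""

    def __init__(self, gamma=0.995, floor=1e-12):
        self.gamma = gamma
        self.floor = floor      # avoids division by zero at start-up
        self.p_eu = 0.0         # smoothed power of the undistorted error
        self.p_e = 0.0          # smoothed power of the distorted error

    def update(self, e, e_u):
        g = self.gamma
        self.p_eu = g * self.p_eu + (1.0 - g) * e_u * e_u
        self.p_e = g * self.p_e + (1.0 - g) * e * e
        return min(self.p_eu / max(self.p_e, self.floor), 1.0)
```

Without measurement noise (e = e_u) the estimate tends to 1; when the distorted error is dominated by noise it tends toward 0, matching the two boundary cases discussed for Eq. (6.55).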
Figure 6.15 Convergence using a time-variant step-size. To show the advantages of time-variant step-size control, the simulation example of Section 6.2.4 is repeated. This time a fourth convergence curve, where the step-size is estimated as proposed in Eq. (6.56), is added. The resulting step-size is depicted in the lower diagram. If we compare the three curves with fixed step-sizes (dotted lines) and the convergence curve with the time-variant step-size (solid line), the advantage of a time-variant step-size control is clearly visible.
Due to the reduction of the step-size, the system distance can be further reduced. If we compare the convergence curves with fixed step-sizes and the curve with a time-variant step-size (see the upper diagram of Fig. 6.15), the advantage of a time-variant control is clearly visible. A step-size control based on Eq. (6.55) is able to achieve a fast initial convergence (and also a fast readaptation after system changes) as well as a good steady-state performance.
Unfortunately, neither the undistorted error signal e_u(n) nor its power is accessible in most real system identification problems. In Section 6.5, methods for estimating the power of this signal, and therefore also the optimal step-size, are presented for the application of acoustic echo control.
6.3.2

If only regularization control is applied (μ(n) = 1), the NLMS filter update is
\[ \mathbf{w}(n+1) = \mathbf{w}(n) + \frac{e(n)\,\mathbf{u}(n)}{\|\mathbf{u}(n)\|^2 + \Delta(n)}, \tag{6.59} \]
and the iteration of the system mismatch vector becomes
\[ \boldsymbol{\varepsilon}(n+1) = \boldsymbol{\varepsilon}(n) - \frac{e(n)\,\mathbf{u}(n)}{\|\mathbf{u}(n)\|^2 + \Delta(n)}. \tag{6.60} \]
6:60
As in the previous section, we approximate the squared norm of the excitation vector ||u(n)||² by a constant, and we suppose the signals n(n) and u(n) (and therefore also n(n) and e_u(n)) to be orthogonal. The derivative of the expected system distance with respect to the regularization parameter then simplifies to
\[ \frac{\partial E\{\|\boldsymbol{\varepsilon}(n+1)\|^2\}}{\partial \Delta(n)} = \frac{2\, E\{e_{\mathrm{u}}^2(n)\}}{(\|\mathbf{u}(n)\|^2 + \Delta(n))^2} - \frac{2\, \|\mathbf{u}(n)\|^2\, E\{e^2(n)\}}{(\|\mathbf{u}(n)\|^2 + \Delta(n))^3}. \tag{6.63} \]
Setting this derivative to zero,
\[ E\{e_{\mathrm{u}}^2(n)\}\, \left(\|\mathbf{u}(n)\|^2 + \Delta_{\mathrm{opt}}(n)\right) = \|\mathbf{u}(n)\|^2\, E\{e^2(n)\}, \tag{6.64} \]
and solving for the regularization parameter yields
\[ \Delta_{\mathrm{opt}}(n) = \frac{\left(E\{e^2(n)\} - E\{e_{\mathrm{u}}^2(n)\}\right)\|\mathbf{u}(n)\|^2}{E\{e_{\mathrm{u}}^2(n)\}}. \tag{6.65} \]
Due to the orthogonality of the distortion n(n) and the excitation signal u(n), the difference of the two expectations can be simplified:
\[ \Delta_{\mathrm{opt}}(n) = \frac{E\{n^2(n)\}\, \|\mathbf{u}(n)\|^2}{E\{e_{\mathrm{u}}^2(n)\}}. \tag{6.66} \]
If the excitation signal is white noise and the excitation vector and the system mismatch vector are assumed to be uncorrelated, the power of the undistorted error signal can be simplified:
\[ E\{e_{\mathrm{u}}^2(n)\} = E\{\boldsymbol{\varepsilon}^{\mathrm{T}}(n)\, \mathbf{u}(n)\, \mathbf{u}^{\mathrm{T}}(n)\, \boldsymbol{\varepsilon}(n)\} \tag{6.67} \]
\[ \approx E\{\boldsymbol{\varepsilon}^{\mathrm{T}}(n)\, E\{\mathbf{u}(n)\,\mathbf{u}^{\mathrm{T}}(n)\}\, \boldsymbol{\varepsilon}(n)\} = \sigma_u^2\, E\{\|\boldsymbol{\varepsilon}(n)\|^2\}. \tag{6.68} \]
With ||u(n)||² ≈ Nσ_u², the pseudo-optimal regularization parameter finally becomes
\[ \Delta_{\mathrm{opt}}(n) \approx N\, \frac{E\{n^2(n)\}}{E\{\|\boldsymbol{\varepsilon}(n)\|^2\}}. \tag{6.69} \]
As in the last subsection, the control based on approximation (6.69) will be compared with fixed regularization control approaches via a simulation. The power of the measurement noise was estimated using a first-order infinite impulse response (IIR) smoothing filter:
\[ \overline{n^2}(n) = \gamma\, \overline{n^2}(n-1) + (1-\gamma)\, n^2(n). \tag{6.70} \]
The time constant γ was set to γ = 0.995. Using this power estimation, an estimate of the optimal regularization parameter was computed as
\[ \hat{\Delta}_{\mathrm{opt}}(n) = N\, \frac{\overline{n^2}(n)}{\|\boldsymbol{\varepsilon}(n)\|^2}. \tag{6.71} \]
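A sketch of this regularization control (the class name is ours; the system distance ||ε(n)||² is passed in as an argument because its estimation is deferred to Section 6.5):

```python
class RegularizationEstimator:
    """Pseudo-optimal regularization, Eqs. (6.70)/(6.71)."""

    def __init__(self, N, gamma=0.995):
        self.N = N              # filter length
        self.gamma = gamma      # IIR smoothing constant
        self.noise_pow = 0.0    # smoothed measurement-noise power, Eq. (6.70)

    def update(self, n_sample, system_distance):
        g = self.gamma
        self.noise_pow = g * self.noise_pow + (1.0 - g) * n_sample ** 2
        return self.N * self.noise_pow / system_distance   # Eq. (6.71)
```

Only the measurement-noise power has to be tracked here, which is exactly the practical advantage over step-size control discussed in Subsection 6.3.3.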
6.3.3

If only step-size control or only regularization control is applied (see Eq. 6.48 and Eq. 6.59), the control parameters can easily be exchanged. The comparison of both update terms,
\[ \mu(n)\, \frac{e(n)\,\mathbf{u}(n)}{\|\mathbf{u}(n)\|^2} = \frac{e(n)\,\mathbf{u}(n)}{\|\mathbf{u}(n)\|^2 + \Delta(n)}, \tag{6.72} \]
leads to the equivalent step-size
\[ \mu(n) = \frac{\|\mathbf{u}(n)\|^2}{\|\mathbf{u}(n)\|^2 + \Delta(n)} \tag{6.73} \]
and, conversely, to the equivalent regularization parameter
\[ \Delta(n) = \|\mathbf{u}(n)\|^2\, \frac{1 - \mu(n)}{\mu(n)}. \tag{6.74} \]
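Equations (6.73) and (6.74) are inverse mappings of each other, which a quick sketch makes explicit (function names and the example values are ours):

```python
def mu_from_delta(delta, u_norm2):
    """Equivalent step-size for a given regularization, Eq. (6.73)."""
    return u_norm2 / (u_norm2 + delta)

def delta_from_mu(mu, u_norm2):
    """Equivalent regularization for a given step-size, Eq. (6.74)."""
    return u_norm2 * (1.0 - mu) / mu

# round trip, e.g. ||u(n)||^2 = 1000, Delta(n) = 400:
mu = mu_from_delta(400.0, 1000.0)     # 1000 / 1400
back = delta_from_mu(mu, 1000.0)      # 400 again
```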
these algorithms for numerically stabilizing the solution. As a second step, this stabilization can be utilized for control purposes. But also for the scalar inversion case, as in NLMS control, regularization might be superior to step-size control in some situations.
For the pseudo-optimal step-size, the short-term power of two signals has to be estimated: σ_e²(n) and σ_{e_u}²(n). If first-order IIR smoothing of the squared values, as presented in Subsection 6.3.1, or a rectangular window over the squared values is utilized for estimating the short-term power, the estimation process has an inherent inertia. After a sudden increase or decrease of the signal power, the estimation methods follow only with a certain delay.
On the other hand, control by regularization, as proposed in Eq. (6.71), requires only the power of the measurement noise. In applications with time-invariant measurement noise power but time-variant excitation power, control by regularization should be preferred. In Figure 6.17 a simulation with stationary measurement noise is presented. The excitation signal changes its power every 1000 iterations in order to have signal-to-noise ratios of −20 dB and 20 dB, respectively. The excitation signal and the measurement noise are depicted in the upper two diagrams of Figure 6.17.
The impulse responses of the adaptive filter and of the unknown system are causal and finite. The order of both was N − 1 = 999. After 1000 iterations the power of the excitation signal is enlarged by 40 dB. Both power estimations of the step-size control need to follow this increase. The power of the distorted error starts from a larger initial value than the power estimation of the undistorted error. During the first few iterations the resulting step-size does not reach values close to 1 (the optimal value). As a consequence, the convergence speed is reduced. Due to the stationary behavior of the measurement noise, the regularization control (especially the power estimation for the measurement noise) does not have inertia problems.
After 2000 iterations the power of the excitation signal is decreased by 40 dB. Both power estimations utilized in the step-size computation decrease their values during the first few following iterations by nearly the same amount. This leads to a step-size which is a little too large, resulting again in reduced convergence speed.
In the lowest diagram of Figure 6.17 the system distances resulting from step-size as well as from regularization control are depicted. The superior behavior of regularization control in this example is clearly visible. The advantages will disappear if the measurement noise also shows nonstationary behavior.
6.3.4 Concluding Remarks
Figure 6.17 Regularization versus step-size control. The excitation signal and the measurement noise used in this simulation are depicted in the upper two diagrams. The excitation signal varies its power every 1000 iterations in order to have signal-to-noise ratios of −20 dB and 20 dB, respectively. In the lowest diagram, the system distances resulting from step-size and regularization control according to Eq. (6.56) and Eq. (6.71) are depicted. The superior behavior of regularization control in this example is clearly visible. The advantages will disappear if the measurement noise also shows nonstationary behavior.
In the next section, we will derive control methods based on different estimation and
detection principles for the application of acoustic echo control. This should serve as
an example of how signal and system properties can be exploited to develop robust
control mechanisms for the NLMS algorithm. Before we start with the description of
the control methods, a brief outline of acoustic echo control and the related signals
and systems is presented in this section.
6.4.1
The problem of acoustic echo arises wherever a loudspeaker and a microphone are placed such that the microphone picks up the signal radiated by the loudspeaker and its reflections at the boundaries of the enclosure [2, 4, 16]. In the case of telecommunication systems, the users are annoyed by listening to their own speech delayed by the round-trip time of the system. If both conversation partners use telephones with hands-free capabilities, the electro-acoustic circuit may furthermore become unstable and produce howling.
To avoid these problems, an adaptive filter can be placed parallel to the loudspeaker-enclosure-microphone (LEM) system (see Fig. 6.18). If one succeeds in matching the impulse response w(n) of the filter exactly with the impulse response h(n) of the LEM system, the signals u(n) and e(n) are perfectly decoupled, without any disturbing effects for the users of the electro-acoustic system.
For the application of hands-free telephones, the measurement noise consists of two components: the signal produced by the local speaker and background noise. Sources of background noise in offices can be air-conditioning systems or computer fans. In contrast to the speech component of the measurement noise, the latter type of signal can be modeled as a stationary signal.
6.4.2 LEM Systems
the enclosure, the reflection properties of its boundaries, and the positions of objects (especially the loudspeaker and the microphone) within the enclosure. Depending on the application, it may be possible to design this system such that the reverberation time (defined as the time necessary for a 60 dB decay of the sound energy after switching off the sound source) is small, resulting in a short impulse response. Examples of this solution are telecommunication studios. On the other hand, electronic means are the only tools to provide hands-free communication out of ordinary office rooms or cars, for example.
In general, the acoustic coupling within an enclosure is formed by a direct path between the loudspeaker and the microphone and a very large number of echo paths. The impulse response can be described by a sequence of delta impulses delayed proportionally to the geometrical lengths of the related paths. The reflectivity of the boundaries of the enclosure and the path length determine the impulse amplitude [1]. The reverberation time of an office is typically on the order of a few hundred milliseconds; that of the interior of a car is a few tens of milliseconds. The upper two parts of Figure 6.19 show impulse responses of LEM systems measured in an office and in a car. The microphone signals have been sampled at a rate of 8 kHz. These impulse responses are highly sensitive to any changes within the LEM system. This is explained by the fact that, assuming a sound velocity of 343 m/s and an 8 kHz sampling frequency, the distance traveled between two sampling instants is 4.3 cm. Therefore, a 4.3 cm change in the length of an echo path moves the related impulse by one sampling interval. Thus, the need for an adaptive echo cancellation filter is evident.
The order N − 1 of the filter should be chosen in dependence on the expected reverberation time of the LEM system. If the coefficients of the adaptive filter match
Figure 6.19 Measured impulse responses and maximum echo reduction. In the two upper diagrams, two impulse responses of LEM systems are depicted. The top one was measured in an office with a reverberation time of about 300 ms. The middle one is the impulse response of the passenger cabin of a car (BMW 520) with a reverberation time of about 60 ms. In the lowest diagram the maximal echo reduction according to Eq. (6.76) is depicted.
the first N coefficients of the impulse response of the LEM system,
\[ w_i(n) = h_i(n) \quad \text{for } i \in \{0, \ldots, N-1\},\; n \ge n_0, \tag{6.75} \]
the maximal echo attenuation can be computed as a function of the filter order N − 1:
\[ \frac{E\{e^2(n,N)\}}{E\{y^2(n)\}}\bigg|_{n \ge n_0,\; w_i(n)=h_i(n),\; i \in \{0,\ldots,N-1\}} = \frac{E\!\left\{\left(\sum_{i=0}^{\infty} h_i(n)\,u(n-i) - \sum_{i=0}^{N-1} h_i(n)\,u(n-i)\right)^{\!2}\right\}}{E\!\left\{\left(\sum_{i=0}^{\infty} h_i(n)\,u(n-i)\right)^{\!2}\right\}} = \frac{E\!\left\{\left(\sum_{i=N}^{\infty} h_i(n)\,u(n-i)\right)^{\!2}\right\}}{E\!\left\{\left(\sum_{i=0}^{\infty} h_i(n)\,u(n-i)\right)^{\!2}\right\}}. \tag{6.76} \]
For white-noise excitation, this ratio reduces to the tail energy of the impulse response divided by its total energy:
\[ \frac{E\{e^2(n,N)\}}{E\{y^2(n)\}}\bigg|_{n \ge n_0} \approx \frac{\sum_{i=N}^{\infty} h_i^2(n)}{\sum_{i=0}^{\infty} h_i^2(n)}. \tag{6.77} \]
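For a synthetic, exponentially decaying impulse response (an illustrative stand-in for the measured LEM responses of Figure 6.19), the tail-energy ratio (6.77) can be evaluated as follows:

```python
import numpy as np

def max_echo_attenuation_db(h, N):
    """Maximal echo reduction, Eq. (6.77): energy of the uncancelled tail
    h[N:] relative to the total energy, in dB (white-noise excitation)."""
    h = np.asarray(h, dtype=float)
    return 10.0 * np.log10(np.sum(h[N:] ** 2) / np.sum(h ** 2))

# illustrative impulse response with exponential decay
h = 0.97 ** np.arange(2000)
att_500 = max_echo_attenuation_db(h, 500)
att_1000 = max_echo_attenuation_db(h, 1000)
```

Longer cancellers leave less tail energy, so the achievable attenuation grows with the filter order, as in the lowest diagram of Figure 6.19.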
In the lowest part of Figure 6.19 this quantity is depicted. To achieve an echo attenuation of 45 dB, as recommended by the International Telecommunication Union (see Sec. 6.4.5), a filter order of about 1600 in offices and 500 in cars is required. The adaptation of such high-order filters places very high demands on the computational power of the utilized hardware.
The logarithmic decay of the impulse responses can be exploited when a step-size matrix is used. In [28, 29, 40], exponentially weighted step-size matrices are investigated. For the first coefficients, large step-sizes are used, while the updates for coefficients with large indices are weighted with a small parameter. Especially during the initial convergence and after room changes, a significant increase of the adaptation speed can be achieved using these types of matrix step-sizes.
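The exponentially weighted step-size idea can be sketched as an NLMS variant with a diagonal step-size matrix whose entries decay along the filter. This is only a minimal sketch: the function name, the decay constant, and the synthetic impulse response below are illustrative choices, not taken from [28, 29, 40].

```python
import numpy as np

def ew_nlms_update(w, u_vec, d, mu0=0.8, decay=0.999, eps=1e-6):
    """One NLMS update with an exponentially weighted (diagonal) step-size
    matrix: large steps for the leading coefficients, smaller steps for the
    decaying tail of the room impulse response."""
    e = d - np.dot(w, u_vec)                   # a priori error signal
    mu = mu0 * decay ** np.arange(len(w))      # exponential step-size profile
    w_new = w + mu * e * u_vec / (np.dot(u_vec, u_vec) + eps)
    return w_new, e

# Illustrative system identification run with a synthetic decaying impulse response.
rng = np.random.default_rng(0)
N = 64
h = np.exp(-0.05 * np.arange(N)) * rng.standard_normal(N)
w = np.zeros(N)
u = rng.standard_normal(5000)
for n in range(N - 1, len(u)):
    u_vec = u[n - N + 1:n + 1][::-1]           # [u(n), u(n-1), ..., u(n-N+1)]
    d = np.dot(h, u_vec)                       # noise-free echo signal
    w, _ = ew_nlms_update(w, u_vec, d)
print(np.linalg.norm(h - w) < 0.1 * np.linalg.norm(h))  # converged
```

With the decay constant set close to 1, the profile only mildly de-emphasizes the tail; smaller values put almost all adaptation effort into the early coefficients, which pays off after room changes.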
In most offices and cars, high-frequency absorbing materials are used (e.g., carpets, curtains), leading to a faster decay of the high-frequency components of the echo signal. In Figure 6.20, a time-frequency analysis of the impulse response of an office is shown.

If subband processing is used, these properties can be exploited. In the high-frequency subbands, the filter orders can be reduced, and the unused memory and computing power can be used for enlarged echo cancellation filters in the low-frequency subbands.
Figure 6.20 Time-frequency analysis of the impulse response of an office. Dark colors belong to time-frequency areas with large energy; light colors represent low-energy areas. Due to high-frequency absorbing materials, which can be found in most offices, echoes decay faster at high than at low frequencies.
6.4.3
Speech Signals
Figure 6.21 Example of a speech sequence. In the top part a 5-s sequence of a speech signal
is depicted in the time domain. The signal was sampled at 8 kHz. The middle diagram shows
the mean power spectral density of the entire sequence (periodogram averaging). In the
bottom diagram, a time-frequency analysis is depicted. Dark colors represent areas with high
energy; light colors display areas with low energy.
6.4.4
Background Noise
Besides the signal of the local speaker and the echo signal, the microphone also picks up local background noise. The background noise can be interpreted as a second component of measurement noise. In the case of a hands-free telephone used in an office, the noise of personal computer (PC) fans or air conditioning might disturb the system identification process. If someone phones from a car, engine, wind, or rolling noise might be sources of the disturbing signal.
In contrast to speech signals, most background noises show nearly stationary behavior. To show the typical power spectral densities of background noises, two signals were analyzed. The results are depicted in Figure 6.22. The first analyzed signal was background noise measured in a car travelling on a motorway at 100 km/h. The upper curve in Figure 6.22 shows the average power spectral density of this car noise. Secondly, the noise of a PC fan and an air conditioning system, both recorded in an office, was measured. The estimated average power spectral density is also depicted.
Unless the far-end excitation signal u(n) has the same spectral envelope as the background noise, an adaptive filter structure which allows the use of different control parameters at different frequencies should be preferred.
6.4.5
Regulations
Besides the physical boundary conditions mentioned above, there are some administrative restrictions. The characteristics of hands-free telephone systems are specified by the International Telecommunication Union (ITU) as well as by the European Telecommunication Standards Institute (ETSI). The most severe restrictions for signal processing are the tolerable delays for front-end processing: only 2 ms [26] are allowed for stationary telephones and only 39 ms [11] for mobile telephones (GSM). Furthermore, an echo attenuation of about 45 dB in
Figure 6.22 Average power spectral densities of typical background noises. The upper curve shows the average power spectral density of background noise measured in a car on a motorway at 100 km/h. The lower curve shows the power spectral density of noise produced by a PC fan and an air conditioning system.
the case of single-talk and 30 dB in the case of double-talk (remote and local speakers are talking) is required.
Due to the severe delay restriction for stationary phones, only fullband structures or hybrid structures are applicable. Filter bank systems (consisting of an analysis and a synthesis part) as well as overlapping discrete Fourier transforms (DFTs) introduce a delay considerably larger than 2 ms. Therefore, at least the convolution part has to be implemented in fullband. Hybrid structures allow the adaptation to be performed in a domain other than the convolution. In these mixed processing structures the adaptation (but not the convolution) process is delayed by a few sampling instants. The fullband filter impulse response is computed via dedicated transformations [10, 33].
6.4.6
Concluding Remarks
The aim of this section was to introduce general aspects of acoustic echo control with emphasis on the statistical properties of the involved signals and systems. It was stated that these properties should strongly influence the choice of the processing and control structure.
If only the characteristics of speech signals are considered, subband or frequency-domain implementations should be preferred. These structures are also the best choice from the point of view of computational complexity if a large system order is required. For hands-free telephones used in office or car environments this is certainly true. The only drawback of these structures is the delay they introduce.
In the next section several detection and estimation methods will be presented.
Most of them use the signal and system properties presented in this section.
6.5
According to the adaptation rule of the NLMS algorithm, the coefficients w_i(n) are modified intensively if the error signal e(n) is rather large. In acoustic echo cancellation a large error signal can have two reasons:

- After an abrupt change of the system h(n), the adaptive filter w(n) is no longer matched. Therefore, a good estimate d̂(n) of the desired signal d(n) is not possible, leading to a large error signal. In these situations the filter w(n) should be readapted as quickly as possible, using a large step-size and a small regularization parameter.

- An increase in measurement noise due to activity of the local speaker also leads to an increase in the error signal. In those situations, the adaptation steps should be reduced in order to preserve the convergence state already reached. A small step-size or a large regularization parameter should be used.
Distinguishing between these two situations is a very challenging task. For estimation of the optimal step-size according to Eq. 6.55, it is necessary to estimate the power of the undisturbed error signal, which is not accessible. Since the LEM system and the adaptive FIR filter form a parallel structure, the signal e_u(n) (see Eq. 6.18) can be written as

$$e_u(n) = u^T(n)\,\varepsilon(n). \tag{6.78}$$

Assuming that the excitation signal and the system mismatch vector ε(n) = h(n) − w(n) are statistically independent, the power of the undisturbed error signal can be approximated by

$$E\{e_u^2(n)\} \approx E\{u^2(n)\}\, E\{\|\varepsilon(n)\|^2\}. \tag{6.79}$$

The second factor, the expected system distance,

$$\beta(n) = E\{\|\varepsilon(n)\|^2\}, \tag{6.80}$$

indicates the echo coupling
Figure 6.23 Model of the parallel arrangement of the LEM system and the echo cancellation filter. To estimate the power of the undisturbed error signal, the parallel structure of the adaptive filter plus the LEM system is modeled as a coupling factor.
after the echo cancellation. Figure 6.23 illustrates the idea of replacing the parallel
structure by a coupling factor. With this notation, the optimal step-size can be
written as:
$$\mu(n) = \frac{E\{e_u^2(n)\}}{E\{e^2(n)\}} = \frac{\beta(n)\, E\{u^2(n)\}}{E\{e^2(n)\}}. \tag{6.81}$$

The power of the error signal consists of the power of the undisturbed error signal and the power of the local signals:

$$E\{e^2(n)\} = E\{e_u^2(n)\} + E\{n^2(n)\}. \tag{6.82}$$
The sum of all local signals n(n) consists of a (nearly) stationary background noise n_s(n) and nonstationary local speech n_n(n). If orthogonality of the latter two signals is assumed, E{n²(n)} should be at least as large as the power of the local background noise E{n_s²(n)}. This power can easily be estimated using techniques known from noise reduction (see Sec. 6.5.3). To reduce the influence of estimation errors, a maximum function can be applied:
$$E\{n^2(n)\} \approx \max\big\{E\{n_s^2(n)\},\; E\{e^2(n)\} - E\{u^2(n)\}\,\beta(n)\big\}. \tag{6.83}$$
The optimal regularization parameter can be expressed in terms of the coupling factor and the power of the local signals as

$$\Delta(n) = N\, \frac{E\{n^2(n)\}}{\beta(n)}. \tag{6.84}$$

6.5.1
For approximation of the optimal control parameters according to Eq. 6.81 and Eq. 6.84, several detection and estimation methods are introduced in this section. Even if not all of them can be utilized only for acoustic echo control (see the previous section), a significant number exploit the statistical and physical properties of the signals and systems involved in acoustic echo control. All of the detectors and estimators presented here can be grouped into five classes:
- Schemes for short-term power estimation
- Schemes for estimating the local background noise
- Basic principles for estimating the power of the undisturbed error signal e_u(n) and a coupling factor β(n)
- Principles for the detection of local speech activity
- Principles for detecting enclosure dislocations, called rescue detectors
Details about real-time implementation, computational complexity, and reliability
can be found in the corresponding references.
6.5.2
For estimation of the short-term power of a signal, a first-order IIR filter can be utilized, as indicated in Figure 6.24.
To be able to detect rising signal powers (especially of the error signal e(n)) very fast, different smoothing constants for rising and falling signal edges (γ_r < γ_f) have been proposed:

$$\overline{u^2}(n) = (1 - \gamma(n))\, u^2(n) + \gamma(n)\, \overline{u^2}(n-1) \tag{6.85}$$

with

$$\gamma(n) = \begin{cases} \gamma_r & \text{if } u^2(n) > \overline{u^2}(n-1), \\ \gamma_f & \text{otherwise}. \end{cases} \tag{6.86}$$
Figure 6.24 First-order IIR filter.
It should be mentioned that the above estimation contains a bias due to the different
smoothing constants. When comparing the powers of different signals or computing
their ratio, the knowledge of this bias is not necessary if the same method was used
for both power estimations.
Besides taking the squared input signal [6], the absolute value [25, 39] may be used. Its advantage is the reduced dynamic range. Especially for fixed-point implementations with only a limited amount of processing power and memory, smoothing of the magnitude is often preferred.
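The smoothing with separate attack and release constants described above can be sketched as follows; the constants γ_r and γ_f below are illustrative, not values from the text.

```python
import numpy as np

def short_term_power(u, gamma_r=0.6, gamma_f=0.995):
    """Short-term power estimate with different smoothing constants for rising
    and falling signal edges (cf. Eqs. 6.85 and 6.86): gamma_r < gamma_f, so
    rising powers are tracked quickly and falling powers slowly."""
    p = np.zeros(len(u))
    prev = 0.0
    for n, x in enumerate(u):
        g = gamma_r if x * x > prev else gamma_f   # Eq. 6.86
        prev = (1.0 - g) * x * x + g * prev        # Eq. 6.85
        p[n] = prev
    return p

# A burst of unit power: the estimate rises fast and decays slowly.
sig = np.concatenate([np.zeros(100), np.ones(100), np.zeros(100)])
p = short_term_power(sig)
print(p[120] > 0.9, p[210] > 0.1)
```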
Besides other methods for estimating the short-term power of a signal, the weighted squared norm of the excitation vector should be mentioned here. The squared norm ‖u(n)‖² can also be computed recursively:

$$\frac{1}{N}\, \|u(n)\|^2 = \frac{1}{N}\, \|u(n-1)\|^2 + \frac{1}{N}\big(u^2(n) - u^2(n-N)\big). \tag{6.87}$$
The squared norm of the excitation vector is already computed within the NLMS algorithm. To use this method for other signals, however, the memory demand seems prohibitive. For speech signals and appropriately chosen time constants γ_r, γ_f and filter orders N, the presented short-term power estimators do not really differ.
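The recursive norm update of Eq. 6.87 can be sketched as follows; up to rounding, it reproduces the mean of the last N squared samples exactly.

```python
import numpy as np

def recursive_norm_power(u, N):
    """Normalized squared norm ||u(n)||^2 / N, updated recursively as in
    Eq. 6.87: add the newest squared sample, drop the one leaving the window."""
    p = np.zeros(len(u))
    acc = 0.0
    for n, x in enumerate(u):
        acc += x * x / N
        if n >= N:
            acc -= u[n - N] ** 2 / N
        p[n] = acc
    return p

rng = np.random.default_rng(1)
u = rng.standard_normal(500)
p = recursive_norm_power(u, N=32)
# Cross-check against the direct windowed mean at n = 100 (samples 69..100).
print(abs(p[100] - np.mean(u[69:101] ** 2)) < 1e-10)
```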
Figure 6.25 shows a speech sequence in the top diagram as well as the three
mentioned short-term power estimations in the lower three diagrams. The maximal
difference between the estimation methods is below 4 dB.
6.5.3
Figure 6.25 Examples of short-term power estimations. In the top diagram a typical
example of a speech sequence is depicted. The lower three diagrams show the results of
different short-term power estimations: smoothing the squared signal (second diagram),
smoothing the absolute value of the signal (third diagram), and the squared norm of the signal
vector (vector length 1000).
Typically, the background noise can be modeled as a weakly stationary process for at least the duration of the observation interval.

In practice, two basic schemes are often applied. The first approach smoothes either the short-term power estimate or the instantaneous values y²(n) or |y(n)|, respectively, with a first-order IIR filter. The time constants are set according to the result of a local speech activity detector. If local speech is detected, the background noise estimation is stopped. Otherwise, the smoothing is performed in such a way that a decrease of the short-term power is followed much faster than an increase.
The second scheme is called the minimum statistic [31, 32]. In this approach a minimum search over the last N_MS values of the short-term power of the microphone signal is performed. As in the first approach, the search length N_MS is chosen such that an interval of a few seconds is covered. Usually a minimum search requires a large amount of memory and the application of sorting algorithms. The method described in [31] varies the interval order slowly over time, leading to a reduced computational load and a large reduction of the required memory.
We will not go further into the details of background noise estimation. The
interested reader is referred to the cited references.
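The core of the minimum statistic, a minimum search over the last N_MS short-term power values, can be sketched with a monotonic queue, which avoids sorting entirely; bias compensation and the slowly varying interval order of [31] are omitted in this sketch.

```python
from collections import deque
import numpy as np

def minimum_statistic(power, n_ms=200):
    """Noise-floor estimate as the minimum of the short-term power over the
    last n_ms values. A monotonic deque gives O(1) work per sample instead of
    sorting; bias compensation as in [31, 32] is omitted."""
    out = np.empty(len(power))
    q = deque()                          # (index, value), values increasing
    for n, v in enumerate(power):
        while q and q[-1][1] >= v:
            q.pop()                      # dominated entries can never be the minimum
        q.append((n, v))
        if q[0][0] <= n - n_ms:
            q.popleft()                  # entry has left the search window
        out[n] = q[0][1]
    return out

# Speech bursts raise the short-term power, but the minimum tracks the noise floor.
p = np.full(1000, 0.1)
p[300:350] = 1.0
floor = minimum_statistic(p)
print(floor[340] == 0.1, floor[600] == 0.1)
```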
6.5.4
For the determination of the control parameters according to Eqs. 6.81 and 6.84, knowledge of the coupling factor or the system distance, respectively, is necessary. In general, the system distance

$$\|\varepsilon(n)\|^2 = \|h(n) - w(n)\|^2 \tag{6.88}$$

cannot be computed directly, since the LEM impulse response h(n) is not known. If an artificial delay is introduced in front of the LEM system, the leading coefficients of the LEM impulse response are zero, and the corresponding components of the system mismatch vector become directly observable:

$$\varepsilon_i(n) = -w_i(n), \quad i \in \{0, \ldots, N_D - 1\}. \tag{6.89}$$

Assuming that the adaptive algorithm spreads the filter mismatch evenly over all coefficients, the system distance can be estimated by extrapolating the known part of the mismatch vector:

$$\|\varepsilon(n)\|^2 \approx \frac{N}{N_D}\, \|w_D(n)\|^2 = \frac{N}{N_D} \sum_{i=0}^{N_D - 1} w_i^2(n), \tag{6.90}$$
where N denotes the length of the adaptive filter. The vector w_D(n) consists of the first N_D coefficients of the adaptive filter vector:

$$w_D(n) = [w_0(n), w_1(n), \ldots, w_{N_D-1}(n)]^T. \tag{6.91}$$
The general structure of this estimation method is depicted in Figure 6.26. A step-size or regularization control based on the estimation of the system distance with the delay coefficients generally shows good performance. When the power of the error signal increases due to local speech activity, the step-size is reduced and the regularization parameter is increased, as the denominator in Eq. 6.81 increases. Thus divergence of the filter is avoided.
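The delay-coefficients estimate of Eqs. 6.90 and 6.91 can be sketched as follows; the sizes are illustrative, and the mismatch is chosen perfectly even so that the extrapolation is exact.

```python
import numpy as np

def system_distance_estimate(w, n_d):
    """Estimate ||h - w||^2 from the first n_d coefficients of the adaptive
    filter (cf. Eqs. 6.90 and 6.91): the artificial delay makes the leading
    LEM coefficients zero, so w_0 .. w_{n_d - 1} expose the local mismatch,
    which is extrapolated to all N coefficients."""
    w_d = w[:n_d]                              # delay coefficients, Eq. 6.91
    return (len(w) / n_d) * np.dot(w_d, w_d)   # Eq. 6.90

N, n_d = 256, 32
eps = 0.01 * np.ones(N)                        # evenly spread mismatch h - w
h = np.concatenate([np.zeros(n_d), np.ones(N - n_d)])  # leading zeros: artificial delay
w = h - eps
est = system_distance_estimate(w, n_d)
print(abs(est - np.dot(eps, eps)) < 1e-12)     # exact for an even mismatch
```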
If the additional delay in the signal path is not tolerable, a two-filter structure according to [5] may be used instead. However, the determination of the control parameters according to this method may lead to a freezing of the adaptation when the LEM system changes. This phenomenon can be observed in the lowest diagram of Figure 6.27, after 60,000 iterations. Freezing occurs because a change in the LEM system also leads to an increase in the power of the error signal and consequently to a reduced step-size or, respectively, an increased regularization parameter. Thus, a new adaptation of the filter and the delay coefficients is prevented and the system freezes. To avoid this, an additional detector for LEM changes is required that resets either the delay coefficients or the control parameters such that the filter can readapt.
Figure 6.26 General structure of the system distance estimation based on delay coefficients. If an additional artificial delay is introduced into the LEM system, this delay is also modeled by the adaptive filter. Utilizing the known property of adaptive algorithms to spread the filter mismatch evenly over all filter coefficients, the known part of the system mismatch vector can be extrapolated according to Eq. 6.90 for estimating the system distance.
Figure 6.27 Simulation example for the estimation of the system distance or the coupling factor. White noise was used for the excitation as well as for the local signal (see the two upper diagrams). Double-talk took place between iterations 30,000 and 40,000. After 60,000 iterations, an enclosure dislocation was simulated (a book was placed midway between the loudspeaker and the microphone).
The coupling factor can be estimated as the ratio of the power of the undisturbed error signal and the power of the excitation signal,

$$\beta_P(n) = \frac{\overline{e_u^2}(n)}{\overline{u^2}(n)}, \tag{6.92}$$

and updated recursively whenever remote single-talk is detected:

$$\beta_P(n) = \begin{cases} \gamma\, \beta_P(n-1) + (1 - \gamma)\, \dfrac{e^2(n)}{\overline{u^2}(n)} & \text{during remote single-talk}, \\[1mm] \beta_P(n-1) & \text{otherwise}. \end{cases} \tag{6.93}$$
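The gated recursive update of Eq. 6.93 is tiny; a minimal sketch, where the single-talk flag would come from one of the detectors described in this section and the power values are illustrative:

```python
def update_coupling(beta_prev, e_pow, u_pow, remote_single_talk, gamma=0.99):
    """Recursive coupling-factor estimate (cf. Eq. 6.93): update the ratio of
    error power to excitation power during remote single-talk, freeze it
    otherwise (e.g., during double-talk)."""
    if remote_single_talk:
        return gamma * beta_prev + (1.0 - gamma) * e_pow / u_pow
    return beta_prev

beta = 1.0
for _ in range(2000):                           # remote single-talk: adapt
    beta = update_coupling(beta, e_pow=0.5, u_pow=2.0, remote_single_talk=True)
frozen = update_coupling(beta, e_pow=50.0, u_pow=2.0, remote_single_talk=False)
print(abs(beta - 0.25) < 1e-3, frozen == beta)  # converged, then frozen
```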
This measure has to be calculated for different delays l due to the time delay of the loudspeaker-microphone path. The parameter L_C has to be chosen such that the time delay of the direct path between the loudspeaker and the microphone falls into the interval [0, L_C]. Based on the assumption that the direct echo signal is maximally correlated with the excitation signal, the open-loop correlation measure has its maximum at that delay. In contrast, no delay has to be considered for the closed-loop
correlation measure:
$$\rho_{CL}(n) = \frac{\sum_{m=0}^{N_C-1} \hat d(n-m)\, y(n-m)}{\sum_{m=0}^{N_C-1} \big|\hat d(n-m)\, y(n-m)\big|} = \frac{\sum_{m=0}^{N_C-1} \hat d(n-m)\, \big[d(n-m) + n(n-m)\big]}{\sum_{m=0}^{N_C-1} \big|\hat d(n-m)\, \big[d(n-m) + n(n-m)\big]\big|}. \tag{6.94}$$
This is due to the fact that both signals are synchronous if a sufficiently adjusted echo-cancelling filter is present. Both correlation values have to be calculated over a limited number of samples N_C, where a larger number ensures better estimation quality. However, there is a trade-off between the estimation quality and the detection delay. The latter can lead to instability.
A decision for remote single-talk can be easily generated by comparing the
correlation value with a predetermined threshold. In Figure 6.28, simulation results
for the correlation values are shown. It is clear that the closed-loop structure ensures
more reliable detection. However, in cases of misadjusted adaptive filters, this
detector provides false estimations (e.g., at the beginning, or after local dislocation
at sample 60,000).
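The closed-loop correlation measure of Eq. 6.94 normalizes the cross-correlation of d̂(n) and y(n) to the interval [−1, 1]; a sketch on synthetic frames, with frame length and signal powers chosen purely for illustration:

```python
import numpy as np

def closed_loop_correlation(d_hat, y):
    """Closed-loop correlation measure over one frame (cf. Eq. 6.94): close
    to 1 during remote single-talk with a well-adjusted echo canceller, since
    d_hat and y are then nearly synchronous. Assumes the frame is nonzero."""
    prod = d_hat * y
    return np.sum(prod) / np.sum(np.abs(prod))

rng = np.random.default_rng(3)
d = rng.standard_normal(256)                          # echo component
rho_st = closed_loop_correlation(d, d + 0.05 * rng.standard_normal(256))
rho_dt = closed_loop_correlation(d, d + 3.0 * rng.standard_normal(256))
print(rho_st > 0.9, rho_dt < 0.9)                     # single-talk vs. double-talk
```

Thresholding this value gives the simple remote single-talk decision described above.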
Another possible way to detect remote single-talk is to compare the complex cepstra of two signals. The complex cepstrum x̂(n) of a signal x(n) is defined as the inverse z-transform of the logarithm of the normalized z-transform of the signal x(n) [37]:
$$\log \frac{X(z)}{X(0)} = \sum_{i=1}^{\infty} \hat x(i)\, z^{-i}, \tag{6.95}$$

$$X(z) = \sum_{i=1}^{\infty} x(i)\, z^{-i}. \tag{6.96}$$
The cepstrum exists if the quantity log(X(z)/X(0)) fulfills all conditions of a z-transform of a stable series.
The cepstral distance measure is defined in [18] with a focus on the problem of determining the similarity between two signals. A modified, truncated version adapted to acoustic echo control problems can also be applied:
$$d_c^2(n) = \sum_{i=0}^{N_{cep}-1} \big(\hat c_y(i, n) - \hat c_{\hat d}(i, n)\big)^2. \tag{6.97}$$
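A sketch of the truncated cepstral distance of Eq. 6.97. For simplicity it uses the real cepstrum (inverse FFT of the log magnitude spectrum) instead of the complex cepstrum of Eqs. 6.95 and 6.96; this is a common practical substitution, not the book's exact definition.

```python
import numpy as np

def truncated_cepstral_distance(x, y, n_cep=16):
    """Truncated cepstral distance (cf. Eq. 6.97) between two signal frames,
    using the real cepstrum as a practical stand-in for the complex cepstrum."""
    def cep(s):
        spec = np.abs(np.fft.rfft(s)) + 1e-12   # avoid log(0)
        return np.fft.irfft(np.log(spec))[:n_cep]
    diff = cep(x) - cep(y)
    return np.dot(diff, diff)

rng = np.random.default_rng(4)
w = rng.standard_normal(1024)
colored = np.convolve(w, [1.0, 0.9], mode="same")   # spectrally shaped copy
print(truncated_cepstral_distance(w, w) == 0.0,
      truncated_cepstral_distance(w, colored) > 0.0)
```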
Figure 6.28 Simulation example for detecting remote single-talk with distance measure principles. Three methods (a closed- and an open-loop correlation analysis as well as a cepstral analysis) are depicted in the lower three diagrams. Speech signals were used for the excitation as well as for the local signal. Double-talk occurred between iterations 30,000 and 40,000. At iteration 60,000, the impulse response of the LEM system was changed, leading to detection problems in the closed-loop correlation analysis.
Figure 6.29 Two-filter scheme (reference and shadow) for detecting enclosure dislocations. Both filters are controlled independently. If one filter produces an error power much smaller than that of the other, either the filter coefficients can be exchanged or the parameters of the control mechanism can be reinitialized to enable convergence.
The different convergence behavior of the two filters can be used to develop a detection mechanism: if the error signal of the shadow filter falls below the error signal of the reference filter for several iterations, an enclosure dislocation is detected (in Fig. 6.29, t_s(n) describes this detection result). The step-size is enlarged to enable the adaptation of the reference filter toward the new LEM impulse response.
In Figure 6.30, simulation results for this detector are shown. In the top graph, the powers of the error signals of both the reference and the shadow filter are pictured. Due to the fact that the number of coefficients of the shadow filter is smaller than that of the reference filter, a faster convergence of the shadow filter is evident. However, a drawback of the decreased number of coefficients is the lower level of echo attenuation. After 60,000 iterations, when an enclosure dislocation takes place, fast convergence of the shadow filter can be observed, whereas the reference filter converges only slowly. Therefore an enclosure dislocation is detected (second graph in Fig. 6.30), which leads to a readjustment of the reference filter. At the beginning of the simulation, enclosure dislocations are also detected. However, this conforms with the requirements of the detector, because the beginning of the adaptation can also be interpreted as an enclosure dislocation due to the misadjustment of the filter.
A second detection scheme analyzes power ratios separately in different frequency bands. The aim of this detector is to distinguish between two reasons for increasing echo signal power: changes of the LEM impulse response or local speech activity. In [30], it was shown that a typical change of the room impulse response (e.g., caused by movements of the local speaker) mainly affects the higher frequencies of the difference transfer function H(e^{jΩ}) − W(e^{jΩ}) corresponding to the system mismatch vector ε(n) = h(n) − w(n). The reason for this characteristic is that movements of the local speaker may cause phase shifts of up to 180 degrees for
Figure 6.30 Simulation examples for the detection of enclosure dislocations. Stationary noise with the same spectral characteristics as speech (linear predictive analysis of order 40) was used for the excitation signal as well as for the local signal. Double-talk takes place during iterations 30,000 and 40,000. At iteration 60,000 the impulse response of the LEM system was changed. For both methods (shadow filter and separate highpass and lowpass coupling analyses), the detection results as well as the main analysis signals are depicted.
the high-frequency components, whereas activity of the local speaker also affects the low-frequency range. To distinguish the two causes, the lowpass and highpass power ratios

$$q_{LP}(n) = \frac{\int_0^{\Omega_g} S_{ee}(\Omega, n)\, d\Omega}{\int_0^{\Omega_g} S_{yy}(\Omega, n)\, d\Omega} \qquad \text{and} \qquad q_{HP}(n) = \frac{\int_{\Omega_g}^{\pi} S_{ee}(\Omega, n)\, d\Omega}{\int_{\Omega_g}^{\pi} S_{yy}(\Omega, n)\, d\Omega} \tag{6.98}$$
are analyzed, where the short-time power spectral densities are calculated by recursive averaging of the squared short-time spectra. The cutoff frequency Ω_g should be chosen close to 700 Hz. A structure for the detector is proposed in Figure 6.31.
Figure 6.31 Highpass and lowpass coupling analyses for detection of enclosure
dislocations. Movements of persons mostly change the high-frequency characteristics of
the LEM system, whereas activity of the local speaker also affects the low-frequency range.
This relationship can be used to differentiate between increasing error powers due to double-talk or to enclosure dislocations.
There are different ways to finally generate the information about local dislocations. In [30], local dislocations are detected by processing differential values of q_LP(n) and q_HP(n) to detect a change of the LEM transfer function. However, if the peak indicating this change is not detected clearly, the detection of the LEM change is missed entirely. Another approach is based only on the current value of a slightly smoothed quotient q_LP(n) [3]. Our approach is to average the quotient q_HP(n)/q_LP(n) by summing over the last 8000 samples. This procedure considerably increases the reliability of the detector but introduces a delay in the detection of enclosure dislocations.
Simulation results are depicted in Figure 6.30. In the third graph, the lowpass and highpass power ratios, q_LP(n) and q_HP(n), respectively, are shown. It can be observed that both ratios rise by close to 5 dB at 30,000 iterations during the double-talk period. In contrast, when a local dislocation occurs after 60,000 samples, there is a clear increase in the highpass power ratio, whereas the lowpass power ratio is subject to only a small increase. The fourth graph shows the detection result of the sliding window for the quotient of the two power ratios. The enclosure dislocation is detected reliably, but with a small delay.
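The band-wise power ratios of Eq. 6.98 can be sketched per frame with periodograms standing in for recursively smoothed power spectral densities (a simplification for illustration; the 700 Hz cutoff follows the text):

```python
import numpy as np

def band_power_ratios(e_frame, y_frame, fs=8000.0, f_g=700.0):
    """Lowpass and highpass power ratios q_LP and q_HP (cf. Eq. 6.98) for one
    frame, with periodograms standing in for the smoothed spectral densities."""
    see = np.abs(np.fft.rfft(e_frame)) ** 2
    syy = np.abs(np.fft.rfft(y_frame)) ** 2
    lp = np.fft.rfftfreq(len(e_frame), d=1.0 / fs) <= f_g
    return np.sum(see[lp]) / np.sum(syy[lp]), np.sum(see[~lp]) / np.sum(syy[~lp])

# A residual error concentrated at 3 kHz (as after an enclosure dislocation)
# raises the highpass ratio far more than the lowpass ratio.
rng = np.random.default_rng(5)
y = rng.standard_normal(2048)
e = 0.3 * np.cos(2 * np.pi * 3000.0 / 8000.0 * np.arange(2048))
q_lp, q_hp = band_power_ratios(e, y)
print(q_hp > 10 * q_lp)
```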
6.5.7
Having described some of the most important detection principles in the previous
section, we will now present an overview of the possibilities for combining these
detectors into an entire step-size or regularization control unit.
In Figure 6.32, possible combinations for building a complete step-size control
unit are depicted. The system designer has several choices, which differ in
computational complexity, memory requirements, reliability, dependence on some
types of input signals, and robustness in the face of finite word length effects.
Most of the proposed step-size control methods are based on estimates of the short-term power of the excitation and error signals. Estimating these quantities is relatively simple. The estimation of the current amount of echo attenuation is much more complicated. This quantity was introduced at the beginning of this section as the echo coupling β(n), which is an estimate of the norm of the system mismatch vector ‖ε(n)‖². Reliable estimation of this quantity is required not only for estimating the power of the undisturbed error signal e_u²(n) but also for the interaction of the echo cancellation with other echo-suppressing parts of a hands-free telephone, that is, loss control and postfiltering [20].
Using the delay coefficients method for estimating the system distance has the advantage that no remote single-talk detection is required. Furthermore, the tail of the LEM impulse response, which is not cancelled because of the limited order of the adaptive filter, does not affect this method. The disadvantage of this method is the artificial delay which is necessary to generate the zero-valued coefficients of the LEM impulse response. If ITU-T or ETSI recommendations [11, 26] concerning the delay have to be fulfilled, the coupling factor estimation should be preferred or a two-filter scheme has to be implemented. A second drawback of the delay
6.5.8
Concluding Remarks
The aim of this section was to show how the specific properties of the system to be identified and of the involved signals can be exploited to build a robust and reliable adaptation control. For all necessary estimation and detection schemes, the system designer has several possibilities to choose from. A compromise between reliability, computational complexity, memory requirements, and signal delay always has to be found.
Figure 6.34 Simulation example of an entire adaptation control unit. Speech signals were used for the excitation as well as for the local distortion (see the top two diagrams). After 62,000 iterations a book was placed between the loudspeaker and the microphone. In the third diagram, the real and estimated system distances are depicted. The lowest two diagrams show the step-size and the regularization parameter.
With the delay-coefficient estimate β_D(n) of the coupling factor and the estimated background noise power, the step-size and the regularization parameter can be computed as

$$\mu(n) = \frac{\overline{u^2}(n)\, \beta_D(n)}{\max\big\{\overline{e^2}(n) - \overline{n_s^2}(n),\; \overline{u^2}(n)\, \beta_D(n)\big\}} \tag{6.99}$$

and

$$\Delta(n) = N\, \frac{\overline{n_s^2}(n)}{\beta_D(n)}. \tag{6.100}$$
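Equations 6.99 and 6.100 combine the estimates of this section into the final control parameters; a minimal sketch with illustrative power values:

```python
def control_parameters(u_pow, e_pow, ns_pow, beta_d, N):
    """Step-size and regularization parameter (cf. Eqs. 6.99 and 6.100) from
    short-term powers of excitation and error, the background-noise power
    estimate, and the delay-coefficient coupling estimate beta_d."""
    est_eu = u_pow * beta_d                    # estimated undisturbed error power
    mu = est_eu / max(e_pow - ns_pow, est_eu)  # Eq. 6.99, step-size in (0, 1]
    delta = N * ns_pow / beta_d                # Eq. 6.100
    return mu, delta

# Quiet local end: full step-size. Local speech active: step-size shrinks.
mu_quiet, delta = control_parameters(u_pow=1.0, e_pow=0.02, ns_pow=0.01,
                                     beta_d=0.01, N=1024)
mu_talk, _ = control_parameters(u_pow=1.0, e_pow=1.0, ns_pow=0.01,
                                beta_d=0.01, N=1024)
print(mu_quiet == 1.0, mu_talk < 0.02, delta == 1024.0)
```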
6.7
REFERENCES
1. J. B. Allen and D. A. Berkley, Image Method for Efficiently Simulating Small-Room Acoustics, J. Acoust. Soc. Am., vol. 65, pp. 943–950, 1979.
2. J. Benesty, T. Gänsler, D. R. Morgan, M. M. Sondhi, and S. L. Gay, Advances in Network and Acoustic Echo Cancellation, Springer, Berlin, 2001.
3. C. Breining, A Robust Fuzzy Logic-Based Step Gain Control for Adaptive Filters in Acoustic Echo Cancellation, IEEE Trans. on Speech and Audio Processing, vol. 9, no. 2, pp. 162–167, Feb. 2001.
4. C. Breining, P. Dreiseitel, E. Hänsler, A. Mader, B. Nitsch, H. Puder, T. Schertler, G. Schmidt, and J. Tilp, Acoustic Echo Control, IEEE Signal Processing Magazine, vol. 16, no. 4, pp. 42–69, 1999.
5. T. Burger and U. Schultheiss, A Robust Acoustic Echo Canceller for a Hands-Free Voice-Controlled Telecommunication Terminal, Proc. of the EUROSPEECH 93, Berlin, vol. 3, pp. 1809–1812, Sept. 1993.
6. T. Burger, Practical Application of Adaptation Control for NLMS-Algorithms Used for Echo Cancellation with Speech Signals, Proc. IWAENC 95, Røros, Norway, 1995.
7. R. E. Crochiere and L. R. Rabiner, Multirate Digital Signal Processing, Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1983.
8. J. Deller, J. Hansen, and J. Proakis, Discrete-Time Processing of Speech Signals, IEEE Press, New York, 1993.
9. P. S. R. Diniz, Adaptive Filtering: Algorithms and Practical Implementations, Kluwer Academic Publishers, Boston, 1997.
10. M. Dörbecker and P. Vary, Reducing the Delay of an Acoustic Echo Canceller with Subband Adaptation, Proc. of the IWAENC 95, International Workshop on Acoustic Echo and Noise Control, Røros, Norway, pp. 103–106, 1995.
11. ETS 300 903 (GSM 03.50), Transmission Planning Aspects of the Speech Service in the GSM Public Land Mobile Network (PLMN) System, ETSI, France, March 1999.
12. P. Eykhoff, System Identification: Parameter and State Estimation, John Wiley & Sons, Chichester, England, 1974.
13. A. Feuer and E. Weinstein, Convergence Analysis of LMS Filters with Uncorrelated Gaussian Data, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-33, no. 1, pp. 222–230, Feb. 1985.
14. R. Frenzel and M. Hennecke, Using Prewhitening and Stepsize Control to Improve the Performance of the LMS Algorithm for Acoustic Echo Compensation, Proc. of the ISCAS-92, IEEE International Symposium on Circuits and Systems, vol. 4, pp. 1930–1932, San Diego, CA, 1992.
15. T. Gänsler, M. Hansson, C.-J. Ivarsson, and G. Salomonsson, A Double-Talk Detector Based on Coherence, IEEE Trans. on Communications, vol. 44, no. 11, pp. 1421–1427, 1996.
16. S. L. Gay and J. Benesty (eds.), Acoustic Signal Processing for Telecommunications, Kluwer, Boston, MA, 2000.
17. G. Glentis, K. Berberidis, and S. Theodoridis, Efficient Least Squares Adaptive Algorithms for FIR Transversal Filtering, IEEE Signal Processing Magazine, vol. 16, no. 4, pp. 13–41, July 1999.
18. A. H. Gray and J. D. Markel, Distance Measures for Speech Processing, IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. ASSP-24, no. 5, pp. 380–391, 1976.
19. Y. Haneda, S. Makino, J. Kojima, and S. Shimauchi, Implementation and Evaluation of an Acoustic Echo Canceller Using Duo-Filter Control System, Proc. EUSIPCO 96, Trieste, Italy, vol. 2, pp. 1115–1118, 1996.
20. E. Hänsler and G. Schmidt, Hands-Free Telephones: Joint Control of Echo Cancellation and Postfiltering, Signal Processing, vol. 80, no. 11, pp. 2295–2305, Nov. 2000.
21. E. Hänsler, The Hands-Free Telephone Problem: An Annotated Bibliography, Signal Processing, vol. 27, no. 3, pp. 259–271, 1992.
22. E. Hänsler, The Hands-Free Telephone Problem: An Annotated Bibliography Update, Annales des Télécommunications, Special Issue on Acoustic Echo Control, no. 49, pp. 360–367, 1994.
23. S. Haykin, Adaptive Filter Theory, 3rd Edition, Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1996.
24. P. Heitkämper and M. Walker, Adaptive Gain Control and Echo Cancellation for Hands-Free Telephone Systems, Proc. EUROSPEECH 93, Berlin, pp. 1077–1080, Sept. 1993.
25. P. Heitkämper, An Adaptation Control for Acoustic Echo Cancellers, IEEE Signal Processing Letters, vol. 4, no. 6, pp. 170–172, 1997.
26. ITU-T Recommendation G.167, General Characteristics of International Telephone Connections and International Telephone Circuits: Acoustic Echo Controllers, Helsinki, Finland, March 1993.
27. A. Mader, H. Puder, and G. U. Schmidt, Step-Size Control for Acoustic Echo Cancellation Filters: An Overview, Signal Processing, vol. 80, no. 9, pp. 1697–1719, Sept. 2000.
28. S. Makino and Y. Kaneda, Exponentially Weighted Step-Size Projection Algorithm for Acoustic Echo Cancellers, IEICE Trans. Fundamentals, vol. E75-A, no. 11, pp. 1500–1507, 1992.
29. S. Makino, Y. Kaneda, and N. Koizumi, Exponentially Weighted Step-Size NLMS Adaptive Filter Based on the Statistics of a Room Impulse Response, IEEE Trans. Speech and Audio Processing, vol. 1, no. 1, pp. 101–108, 1993.
30. J. Marx, Akustische Aspekte der Echokompensation in Freisprecheinrichtungen, VDI-Fortschritt-Berichte, Reihe 10, no. 400, Düsseldorf, 1996.
31. R. Martin, An Efficient Algorithm to Estimate the Instantaneous SNR of Speech Signals, Proc. EUROSPEECH 93, Berlin, pp. 1093–1096, Sept. 1993.
32. R. Martin, Spectral Subtraction Based on Minimum Statistics, Signal Processing VII: Theories and Applications, Conference Proceedings, pp. 1182–1185, 1994.
33. R. Merched, P. Diniz, and M. Petraglia, A New Delayless Subband Adaptive Filter Structure, IEEE Trans. on Signal Processing, vol. 47, no. 6, pp. 1580–1591, June 1999.
34. W. Mikhael and F. Wu, Fast Algorithms for Block FIR Adaptive Digital Filtering, IEEE Trans. on Circuits and Systems, vol. 34, pp. 1152–1160, Oct. 1987.
35. B. Nitsch, The Partitioned Exact Frequency Domain Block NLMS Algorithm, a Mathematically Exact Version of the NLMS Algorithm Working in the Frequency Domain, International Journal of Electronics and Communications, vol. 52, pp. 293–301, Sept. 1998.
36. K. Ochiai, T. Araseki, and T. Ogihara, Echo Canceler with Two Echo Path Models, IEEE Trans. on Communications, vol. COM-25, no. 6, pp. 589–595, 1977.
37. A. V. Oppenheim and R. W. Schafer, Digital Signal Processing, Prentice-Hall, Inc., London, 1975.
38. H. Puder, Single Channel Noise Reduction Using Time-Frequency Dependent Voice Activity Detection, Proc. IWAENC 99, Pocono Manor, Pennsylvania, pp. 68–71, Sept. 1999.
39. T. Schertler and G. U. Schmidt, Implementation of a Low-Cost Acoustic Echo Canceller, Proc. IWAENC 97, London, pp. 49–52, 1997.
40. T. Schertler, Selective Block Update of NLMS Type Algorithms, 32nd Annual Asilomar Conf. on Signals, Systems, and Computers, Conference Proceedings, pp. 399–403, Pacific Grove, California, Nov. 1998.
41. G. U. Schmidt, Step-Size Control in Subband Echo Cancellation Systems, Proc. IWAENC 99, Pocono Manor, Pennsylvania, pp. 116–119, 1999.
42. W.-J. Song and M.-S. Park, A Complementary Pair LMS Algorithm for Adaptive Filtering, Proc. ICASSP 97, Munich, vol. 3, pp. 2261–2264, 1997.
43. B. Widrow and S. Stearns, Adaptive Signal Processing, Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1985.
44. P. P. Vaidyanathan, Multirate Digital Filter Banks, Polyphase Networks, and Applications: A Tutorial, Proc. of the IEEE, vol. 78, no. 1, pp. 56–93, Jan. 1990.
45. P. P. Vaidyanathan, Multirate Systems and Filter Banks, Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1992.
46. S. Yamamoto, S. Kitayama, J. Tamura, and H. Ishigami, An Adaptive Echo Canceller with Linear Predictor, Trans. of the IECE of Japan, vol. 62, no. 12, pp. 851–857, 1979.
47. S. Yamamoto and S. Kitayama, An Adaptive Echo Canceller with Variable Step Gain Method, Trans. of the IECE of Japan, vol. E-65, no. 1, pp. 1–8, 1982.
48. H. Yasukawa and S. Shimada, An Acoustic Echo Canceller Using Subband Sampling and Decorrelation Methods, IEEE Trans. Signal Processing, vol. 41, pp. 926–930, 1993.
49. H. Ye and B.-X. Wu, A New Double-Talk Detection Algorithm Based on the Orthogonality Theorem, IEEE Trans. on Communications, vol. 39, no. 11, pp. 1542–1545, 1991.
AFFINE PROJECTION ALGORITHMS
STEVEN L. GAY
Bell Labs, Lucent, Murray Hill, New Jersey
7.1 INTRODUCTION
The APA and its more efficient implementations have been applied to many problems. It is especially useful in applications involving speech and acoustics, because acoustic problems are often modeled by long finite impulse response (FIR) filters and are often excited by speech, which can be decorrelated with a relatively low-order prediction filter. The most natural application is in the acoustic echo cancellation of voice [6, 38, 14, 41]. More recently, the APA and its descendants have debuted in multichannel acoustic echo cancellation [50, 37, 38, 23]. It is also useful in network echo cancellation [47], a problem that also has long adaptive FIR filters.

The APA has also been used in equalizers for data communications applications [19, 54], active noise control [28], and neural network training algorithms [18].
7.2 THE APA
The APA is an adaptive FIR filter. An adaptive filter attempts to predict the most recent outputs, {d(n), d(n-1), ..., d(n-N+1)}, of an unknown system, w_sys, from the most recent system inputs, {u(n), u(n-1), ..., u(n-L-N+2)}, and the previous system estimate, w(n-1). This arrangement is shown in Figure 7.1.

The two equations that define a relaxed and regularized APA are as follows. First, the system prediction error is calculated:

    e(n) = d(n) - U^t(n) w(n-1),                                    (7.1)

and then the coefficient vector is updated:

    w(n) = w(n-1) + μ U(n)[U^t(n)U(n) + δI]^(-1) e(n),              (7.2)

where the superscript t denotes transpose, I denotes the identity matrix, and the following definitions are made:
1. u(n) is the excitation signal and n is the time index.

Figure 7.1  The adaptive filtering arrangement.

2.

    u(n) = [u(n), u(n-1), ..., u(n-L+1)]^t                          (7.3)

is the L-length excitation or tap-delay-line vector.

3.

    a(n) = [u(n), u(n-1), ..., u(n-N+1)]^t                          (7.4)

is the N-length excitation vector.

4.

    U(n) = [u(n), u(n-1), ..., u(n-N+1)] = [ a(n)^t
                                             a(n-1)^t
                                             ...
                                             a(n-L+1)^t ]           (7.5)

is the L-by-N excitation matrix.

5.

    w(n) = [w_0(n), w_1(n), ..., w_{L-1}(n)]^t                      (7.6)

is the L-length adaptive coefficient vector, where w_i(n) is the ith adaptive tap weight or coefficient at time n.

6.

    w_sys = [w_{0,sys}, w_{1,sys}, ..., w_{L-1,sys}]^t              (7.7)

is the L-length system impulse response vector, where w_{i,sys} is the ith tap weight or coefficient.

7. y(n) is the measurement noise signal. In the language of echo cancellation, it is the near-end signal, which consists of the near-end talker's voice and/or background noise.

8.

    y(n) = [y(n), y(n-1), ..., y(n-N+1)]^t                          (7.8)

is the N-length measurement noise vector.

9.

    d(n) = [d(n), d(n-1), ..., d(n-N+1)]^t                          (7.9)

is the N-length desired or system output vector. Its elements consist of the echo plus any additional signal added in the echo path.
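As a concrete illustration of these definitions, the short Python sketch below (the helper names are ours, not the chapter's) builds u(n), a(n), and U(n) from a sample sequence and checks the structure of the excitation matrix: its rows are the vectors a(n-i)^t while its columns are the vectors u(n-j).

```python
def tap_vector(u, n, length):
    """Return [u(n), u(n-1), ..., u(n-length+1)]; samples before time 0 are zero."""
    return [u[n - k] if n - k >= 0 else 0.0 for k in range(length)]

def excitation_matrix(u, n, L, N):
    """L-by-N matrix U(n) whose columns are u(n), u(n-1), ..., u(n-N+1) (eq. 7.5)."""
    cols = [tap_vector(u, n - j, L) for j in range(N)]
    # transpose the list of columns into a list of rows
    return [[cols[j][i] for j in range(N)] for i in range(L)]

u = [float(k + 1) for k in range(20)]   # toy excitation signal u(k) = k + 1
L, N, n = 4, 3, 10
U = excitation_matrix(u, n, L, N)

# Row i of U(n) is a(n-i)^t, the N-length excitation vector delayed by i.
for i in range(L):
    assert U[i] == tap_vector(u, n - i, N)
# Column j of U(n) is u(n-j), the L-length tap-delay-line vector.
for j in range(N):
    assert [U[i][j] for i in range(L)] == tap_vector(u, n - j, L)
```

The double role of U(n), as a bank of delayed tap vectors and as a stack of short excitation vectors, is what the fast algorithms later in the chapter exploit.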
10.

    e(n) = d(n) - u^t(n) w(n)                                       (7.10)

is the error signal.

The APA update is derived by minimizing the norm of the coefficient update vector,

    r(n) = w(n) - w(n-1),                                           (7.11)

under the constraint that the new coefficients yield an N-length a posteriori error vector, defined as

    e_1(n) = d(n) - U^t(n) w(n),                                    (7.13)

that is element by element a factor of (1-μ) smaller than the N-length a priori error vector,

    e(n) = d(n) - U^t(n) w(n-1):                                    (7.14)

    e_1(n) = (1-μ) e(n) = e(n) - U^t(n) r(n).                       (7.15)

Equivalently,

    U^t(n) r(n) = μ e(n),                                           (7.16)

whose minimum-norm solution is

    r(n) = μ U(n)[U^t(n) U(n)]^(-1) e(n).                           (7.17)
For N = 1, the matrix U(n) reduces to the single vector u(n), and (7.1) and (7.2) become

    e(n) = d(n) - u^t(n) w(n-1)                                     (7.18)

    w(n) = w(n-1) + μ u(n) e(n) / [u^t(n) u(n) + δ],                (7.19)

which is the familiar NLMS algorithm. Thus, we see that APA is a generalization of NLMS.
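The N = 1 special case is easy to exercise numerically. The following Python sketch (a minimal illustration we add here, not code from the chapter) runs NLMS on a known 4-tap system with white excitation and no measurement noise, and checks that the coefficient estimate converges to the true system:

```python
import random

def nlms_identify(w_sys, n_iter=2000, mu=0.5, delta=1e-3, seed=1):
    """Identify w_sys with NLMS: w(n) = w(n-1) + mu*u(n)*e(n)/(u^t(n)u(n)+delta)."""
    rng = random.Random(seed)
    L = len(w_sys)
    w = [0.0] * L
    buf = [0.0] * L                          # tap-delay line u(n), ..., u(n-L+1)
    for _ in range(n_iter):
        buf = [rng.gauss(0.0, 1.0)] + buf[:-1]
        d = sum(ws * x for ws, x in zip(w_sys, buf))           # desired signal (noiseless)
        e = d - sum(wi * x for wi, x in zip(w, buf))           # a priori error, eq. (7.18)
        norm = sum(x * x for x in buf) + delta
        w = [wi + mu * x * e / norm for wi, x in zip(w, buf)]  # update, eq. (7.19)
    return w

w_sys = [1.0, -0.5, 0.25, 0.1]
w_hat = nlms_identify(w_sys)
err = sum((a - b) ** 2 for a, b in zip(w_sys, w_hat)) ** 0.5
assert err < 1e-6
```

With colored excitation (e.g., speech) NLMS converges much more slowly, which is precisely the gap the APA closes by projecting over N directions at once.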
7.3
We now show that the APA as expressed in (7.1) and (7.2) indeed represents a projection onto an affine subspace. In Figure 7.2a we show the projection of a vector, w(n-1), onto a linear subspace, S, where we have a space dimension of L = 3 and a subspace dimension of L - N = 2. Note that an (L - N)-dimensional linear subspace is a subspace spanned by any linear combination of L - N vectors. One of those combinations is where all of the coefficients are 0; so, a linear subspace always includes the origin. Algebraically, we represent the projection as

    g = Q w(n-1),                                                   (7.20)

where

    Q = V [ 0 0 0
            0 1 0
            0 0 1 ] V^t                                             (7.21)

and V is a unitary matrix (i.e., a rotation matrix). In general, the diagonal matrix in (7.21) has N 0s and L - N 1s along its diagonal.

Figure 7.2  (a) Projection onto an affine subspace. (b) Relaxed projection onto an affine subspace.
Figure 7.2b shows a relaxed projection. Here, g ends up only partway between w(n-1) and S. The relaxed projection is still represented by (7.20), but with

    Q = V [ 1-μ 0 0
            0   1 0
            0   0 1 ] V^t.                                          (7.22)

An affine subspace is a linear subspace offset from the origin, and projection onto it may be written as

    g = Q w(n-1) + f,                                               (7.23)

where f is in the null space of Q; that is, Qf equals an all-zero vector. Figure 7.3b shows a relaxed projection onto the affine subspace. As before, μ = 1/3.

Manipulating (7.1), (7.2), and (7.9) and assuming that y(n) = 0 and δ = 0, we can express the APA tap update as

    w(n) = [I - μ U(n)(U^t(n)U(n))^(-1) U^t(n)] w(n-1)
           + μ U(n)(U^t(n)U(n))^(-1) U^t(n) w_sys.                  (7.24)

Define

    Q(n) = I - μ U(n)(U^t(n)U(n))^(-1) U^t(n).                      (7.25)

Figure 7.3  (a) Projection onto an affine subspace. (b) Relaxed projection onto an affine subspace.
Q(n) has the decomposition

    Q(n) = V(n) diag(1-μ, ..., 1-μ, 1, ..., 1) V(n)^t,

and if μ = 1,

    Q(n) = V(n) diag(0, ..., 0, 1, ..., 1) V(n)^t,                  (7.26)

where there are N 0s and L - N 1s in the diagonal matrix. Similarly, define
    P(n) = μ U(n)(U^t(n)U(n))^(-1) U^t(n)                           (7.27)
         = V(n) diag(μ, ..., μ, 0, ..., 0) V(n)^t,                  (7.28)

and if μ = 1,

    P(n) = V(n) diag(1, ..., 1, 0, ..., 0) V(n)^t,                  (7.29)
where there are N 1s and L - N 0s in the diagonal matrix. That is, Q(n) and P(n) represent projection matrices onto orthogonal subspaces when μ = 1 and relaxed projection matrices when 0 < μ < 1. Note that the matrix Q(n) in (7.26) has the same form as in (7.21). Using (7.25) and (7.27) in (7.24), the APA coefficient vector update becomes

    w(n) = Q(n) w(n-1) + P(n) w_sys,                                (7.30)

which is the same form as the affine projection defined in (7.23), where now Q = Q(n) and f = P(n) w_sys. Thus, (7.1) and (7.2) represent the relaxed projection of the system impulse response estimate onto an affine subspace which is determined by (1) the excitation matrix U(n) (according to (7.25) and (7.27)) and (2) the true system impulse response, w_sys (according to (7.30)).
7.4 REGULARIZATION
Equation (7.30) gives us an intuitive feel for the convergence of w(n) to w_sys. Let us assume that μ = 1. We see that as N increases from 1 toward L, the contribution to w(n) from w(n-1) decreases because the nullity of Q(n) is increasing, while the contribution from w_sys increases because the rank of P(n) is increasing. In principle, when N = L, w(n) should converge to w_sys in one step, since Q(n) has a rank of 0 and P(n) a rank of L. In practice, however, we usually find that as N approaches L, the condition number of the matrix U^t(n)U(n) begins to grow. As a result, the inverse of U^t(n)U(n) becomes more and more dubious and must be replaced with either a regularized inverse or a pseudo-inverse. Either way, the useful (i.e., signal-based) rank of P(n) ends up being somewhat less than L. Still, for moderate values of N, even when the inverse of U^t(n)U(n) is regularized, the convergence of w(n) is quite impressive, as we shall demonstrate.

The inverse of U^t(n)U(n) can be regularized by adding the matrix δI prior to taking the inverse. The matrix I is the N-by-N identity matrix and δ is a small positive scalar. Where U^t(n)U(n) may have eigenvalues close to zero, creating problems for the inverse, U^t(n)U(n) + δI has δ as its smallest possible eigenvalue, which, if large enough, yields a well-behaved inverse. The regularized APA tap update is then

    w(n) = w(n-1) + μ U(n)[U^t(n)U(n) + δI]^(-1) e(n).              (7.31)

Defining the coefficient error vector as

    Δw(n) = w(n) - w_sys                                            (7.32)

and using (7.1) and (7.27), we can express the coefficient error update as

    Δw(n) = [I - P(n)] Δw(n-1) + μ U(n)[U^t(n)U(n) + δI]^(-1) y(n).  (7.33)
The excitation matrix U(n) can be expanded using the singular value decomposition (SVD) to

    U(n) = F(n) S(n) V(n)^t,                                        (7.37)

where F(n) and V(n) are unitary and S(n) carries the singular values ρ_i of U(n) along its diagonal. Multiplying (7.39) from the left by F(n)^t and defining a rotated coefficient error vector,

    Δg(n) = F(n)^t Δw(n),                                           (7.40)
Each element of Δg(n) has its own convergence gain factor, the ith one being

    τ_Δg(ρ_i) = 1 - μ ρ_i²/(ρ_i² + δ) = [(1-μ)ρ_i² + δ]/(ρ_i² + δ),  (7.42)

while the noise term of the update,

    μ U(n)[U^t(n)U(n) + δI]^(-1) y(n),                              (7.43)

has, for each mode, the noise amplification factor

    τ_z(ρ_i) = μ ρ_i/(ρ_i² + δ).                                    (7.46)

The maximum of τ_z is 1/(2√δ), occurring at ρ_i = √δ. For ρ_i ≫ √δ, τ_z ≈ 1/ρ_i, and for ρ_i ≪ √δ, τ_z ≈ ρ_i/δ.
Figures 7.4 and 7.5 show the shape of τ_Δg and τ_z as a function of ρ_i for a fixed regularization, δ. In Figure 7.5, δ = 300, and in Figure 7.4, δ = 0. In both figures the step-size is μ = 0.98.

In both figures τ_Δg is at or approaches 1 - μ, and τ_z behaves as 1/ρ_i for large ρ_i. A τ_Δg < 0 dB means that the coefficient error would decrease for this mode if the noise were sufficiently small. We will return to this thought in the next section.

In Figure 7.4, where there is no regularization (δ = 0), the noise amplification factor, τ_z, approaches infinity, and the coefficient error convergence factor, τ_Δg, remains very small as the excitation singular value, ρ_i, approaches zero. This means that for modes with little excitation, the effect of noise on the coefficient error increases without bound as the modal excitation singular value approaches zero. Contrast this with the behavior of τ_z and τ_Δg when δ = 300, as in Figure 7.5. The noise amplification factor, τ_z, becomes much smaller, and τ_Δg approaches 0 dB as ρ_i drops below √δ ≈ 17.3. This means that for modes with little excitation, the effect of noise on the coefficient error is suppressed, as is any change in the coefficients.
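The two modal factors in (7.42) and (7.46) are simple enough to tabulate directly. The Python sketch below (an illustration we add here, not code from the chapter) evaluates them for μ = 0.98 and δ = 300 and confirms the limiting behavior just described:

```python
def tau_dg(rho, mu, delta):
    """Modal coefficient error convergence factor, eq. (7.42)."""
    return ((1.0 - mu) * rho**2 + delta) / (rho**2 + delta)

def tau_z(rho, mu, delta):
    """Modal noise amplification factor, eq. (7.46)."""
    return mu * rho / (rho**2 + delta)

mu, delta = 0.98, 300.0
sqrt_delta = delta ** 0.5                      # about 17.3

# tau_z peaks at rho = sqrt(delta) with value mu / (2*sqrt(delta)).
peak = tau_z(sqrt_delta, mu, delta)
assert abs(peak - mu / (2.0 * sqrt_delta)) < 1e-12
assert tau_z(sqrt_delta / 2, mu, delta) < peak
assert tau_z(sqrt_delta * 2, mu, delta) < peak

# For rho >> sqrt(delta): tau_dg -> 1 - mu and tau_z ~ mu / rho.
assert abs(tau_dg(1e4, mu, delta) - (1.0 - mu)) < 1e-4
assert abs(tau_z(1e4, mu, delta) - mu / 1e4) < 1e-8

# For rho << sqrt(delta): tau_dg -> 1 (0 dB), so poorly excited modes barely move.
assert tau_dg(0.1, mu, delta) > 0.999
```

For μ near unity the peak noise amplification is approximately 1/(2√δ), which is the value quoted in the text.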
Figure 7.4  The nonregularized modal convergence and noise amplification factors as a function of the modal input signal magnitude.

Figure 7.5  The regularized modal convergence and noise amplification factors as a function of the modal input signal magnitude.
7.5
One may prove the convergence of an adaptive filtering algorithm if it can be shown that each iteration, or coefficient update, is a contraction mapping on the norm of the coefficient error vector. That is, the norm of Δw(n) should always be less than or equal to the norm of Δw(n-1). In this section we show that this indeed is a property of APA when there is no noise [6, 32], and, when noise is present, we show the conditions under which the contraction mapping continues to hold [6]. We begin by rewriting the coefficient error update, Δw(n), as the sum of two mutually orthogonal parts,

    Δw(n) = [I - P̃(n)] Δw(n-1)
            + [P̃(n) - P(n)] Δw(n-1) + μ U(n)[U^t(n)U(n) + δI]^(-1) y(n),   (7.47)

where

    P̃(n) = U(n)[U^t(n)U(n)]^(-1) U^t(n)                             (7.48)

is the unrelaxed, unregularized projection matrix.
Multiplying from the left by F(n)^t and applying (7.41), we can write the ith element of Δg(n), Δg_i(n), as

    Δg_i(n) = [1 - k_i μ ρ_i(n)²/(ρ_i(n)² + δ)] Δg_i(n-1)
              + k_i μ [ρ_i(n)/(ρ_i(n)² + δ)] z_i(n),                (7.50)

where z_i(n) is the ith element of the rotated noise vector z(n) = V^t(n) y(n), and

    k_i = 1 for 1 ≤ i ≤ N,   k_i = 0 for N < i ≤ L.                 (7.51)
A contraction mapping requires that the inequality

    0 ≤ ‖Δg_i(n-1)‖ - ‖Δg_i(n)‖                                     (7.53)

holds. It will be instructive, however, to also consider the case where a slight amount of expansion, denoted by a small positive number, C_G (the G stands for growth), is allowed. This is expressed by the inequality

    -C_G < ‖Δg_i(n-1)‖ - ‖Δg_i(n)‖.                                 (7.54)

We will refer to this as an expansion control mapping. This approach will allow us to investigate the behavior of the contraction mapping of APA when the excitation singular value is very small. Note that by simply setting C_G = 0 we once again get the contraction mapping requirement. Using (7.50) and (7.52), we write the requirement for the expansion control for mode i as
    -C_G < ‖Δg_i(n-1)‖ - ‖[1 - k_i μ ρ_i(n)²/(ρ_i(n)² + δ)] Δg_i(n-1)
                           + k_i μ [ρ_i(n)/(ρ_i(n)² + δ)] z_i(n)‖.   (7.55)
From now on, we will drop the use of the k_i with the understanding that we are only concerned with the ith mode, where 1 ≤ i ≤ N. Assuming that Δg_i(n-1) > 0 (assuming otherwise yields the same result) and then manipulating (7.55), we obtain

    -μ ρ_i(n) Δg_i(n-1) - C_G [ρ_i(n)² + δ]/ρ_i(n)
        < μ z_i(n) <
    [(2-μ) ρ_i(n)² + 2δ] Δg_i(n-1)/ρ_i(n) + C_G [ρ_i(n)² + δ]/ρ_i(n).   (7.56)

First, let us consider the case where ρ_i(n)² ≫ δ. By dropping small terms, we may simplify inequality (7.56) to

    -μ ρ_i(n) Δg_i(n-1) - C_G ρ_i(n) < μ z_i(n)
        < (2-μ) ρ_i(n) Δg_i(n-1) + C_G ρ_i(n).                      (7.57)
Concentrating on the right-hand inequality, the more restrictive of the two, and considering the case where C_G = 0, we see that as long as the noise signal magnitude is smaller than the residual echo magnitude for a given mode, the inequality is upheld, implying that there is a contraction mapping on the coefficient error for that mode. Allowing some expansion, C_G > 0, we see that the noise can be larger than the residual error by C_G ρ_i(n)/μ. Since we have assumed that ρ_i(n) is large and we know that 0 < μ ≤ 1, then for a little bit of expansion we gain a great deal of leeway in additional noise.

The inequalities of (7.57) also provide insight into the situation where there is no regularization and the modal excitation is very small. Then the noise power must also be very small so as not to violate either the contraction mapping or expansion control constraints.
If, however, we allow regularization and ρ_i(n)² ≪ δ, inequality (7.56) becomes

    -C_G δ/(μ ρ_i(n)) < z_i(n) < [2 Δg_i(n-1) + C_G] δ/(μ ρ_i(n)),   (7.58)

or, equivalently,

    -C_G < [μ ρ_i(n)/δ] z_i(n) < 2 Δg_i(n-1) + C_G.                 (7.59)

In the inequalities of (7.59), as ρ_i(n) gets smaller, the noise term in the middle becomes vanishingly small, meaning that C_G, the noise expansion control constant, may also become arbitrarily small.
Of course, one may look at both sets of inequalities, (7.57) and (7.58), and conclude that decreasing μ would have the same effect as increasing δ. But if one also observes the coefficient error term, one sees that there is a greater price paid in terms of slowing the coefficient error convergence when μ is lowered as opposed to increasing δ. Recall the modal coefficient error reduction factor of (7.42),

    τ_Δg(ρ_i) = [(1-μ)ρ_i² + δ]/(ρ_i² + δ).                         (7.60)

For modes where ρ_i(n)² ≫ δ, a small μ will slow the modal coefficient error convergence by making τ_Δg ≈ 1. On the other hand, a μ close to unity will speed the convergence by making τ_Δg ≈ δ/ρ_i(n)², a very small value, given our assumption.
The inequalities (7.57) and (7.58) show that the regularization parameter plays little part in the noise term for those modes with large singular values but heavily influences the noise term for those modes with small singular values. So, in analyzing the effect of the regularization parameter, it is useful to focus attention on the lesser excited modes. Accordingly, we observe that the maximum allowable noise magnitude is directly proportional to the regularization parameter, δ:

    max |z_i(n)| = C_G δ/(μ ρ_i(n)).                                (7.61)

Therefore, if the noise level increases, the regularization level should increase by the same factor to maintain the same degree of regularization.
7.6
Using the matrix inversion lemma, we can show the connection between APA and RLS. The matrix inversion lemma states that if the nonsingular matrix A can be written as

    A = B + CD,                                                     (7.62)

then its inverse is

    A^(-1) = B^(-1) - B^(-1) C (I + D B^(-1) C)^(-1) D B^(-1).      (7.63)

Consider the a priori error vector at time n,

    e(n) = d(n) - U^t(n) w(n-1),                                    (7.66)

and the a posteriori error vector at time n-1,

    e_1(n-1) = d(n-1) - U^t(n-1) w(n-1).                            (7.67)

Using (7.2), the a posteriori error vector can be written as

    e_1(n-1) = [I - μ U^t(n-1)U(n-1)(U^t(n-1)U(n-1) + δI)^(-1)] e(n-1).   (7.68)
From (7.68), when δ is much smaller than the smallest eigenvalue of U^t(n-1)U(n-1), we may use the approximation

    e_1(n-1) ≈ (1-μ) e(n-1).                                        (7.69)

Recognizing that the lower N-1 elements of (7.66) are the same as the upper N-1 elements of (7.67), we see that we can use (7.69) to express e(n) as

    e(n) = d(n) - U^t(n) w(n-1)
         = [ d(n) - u^t(n) w(n-1)
             (1-μ) ẽ(n-1)         ]
         = [ e(n)
             (1-μ) ẽ(n-1) ],                                        (7.70)

where ẽ(n-1) denotes the upper N-1 elements of e(n-1). Then, for μ = 1,

    e(n) = [ e(n)
             0    ].                                                (7.71)

Substituting (7.71) into (7.2) with μ = 1 gives the tap update

    w(n) = w(n-1) + U(n)[U^t(n)U(n) + δI]^(-1) [ e(n)
                                                 0    ].            (7.72)

Equation (7.72) is very similar to RLS. The difference is that the matrix which is inverted is a regularized, rank-deficient form of the usual estimated autocorrelation matrix. If we let δ = 0 and N = n, (7.72) becomes the growing windowed RLS.
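The property behind (7.71), namely that with μ = 1 the a posteriori errors over the N-sample window are (essentially) zero, can be checked with a single exact APA step. The sketch below (our own Python illustration, N = 2, a closed-form 2-by-2 solve, and an almost-zero δ) verifies it on fixed data:

```python
# One APA step (mu = 1, tiny delta) on fixed data; afterwards the a posteriori
# error vector e1(n) = d(n) - U^t(n) w(n) should be (nearly) zero.
u0 = [1.0, 0.5, -0.3, 0.8]            # u(n)
u1 = [0.5, -0.3, 0.8, 0.2]            # u(n-1)
w_sys = [0.9, -0.2, 0.4, 0.05]
w = [0.0, 0.0, 0.0, 0.0]              # w(n-1)
d = [sum(a * b for a, b in zip(w_sys, c)) for c in (u0, u1)]
e = [di - sum(a * b for a, b in zip(w, c)) for di, c in zip(d, (u0, u1))]

delta = 1e-12
r00 = sum(a * a for a in u0) + delta
r11 = sum(a * a for a in u1) + delta
r01 = sum(a * b for a, b in zip(u0, u1))
det = r00 * r11 - r01 * r01
z0 = (r11 * e[0] - r01 * e[1]) / det
z1 = (r00 * e[1] - r01 * e[0]) / det
w = [wi + a * z0 + b * z1 for wi, a, b in zip(w, u0, u1)]   # mu = 1 update

e1 = [di - sum(a * b for a, b in zip(w, c)) for di, c in zip(d, (u0, u1))]
assert all(abs(x) < 1e-9 for x in e1)
```

With a nonnegligible δ the a posteriori errors would no longer be exactly zero, which is the trade-off the regularization section analyzed.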
7.7
First, we write the relaxed and regularized affine projection algorithm in a slightly different form:

    e(n) = d(n) - U^t(n) w(n-1)                                     (7.73)

    z(n) = [U^t(n) U(n) + δI]^(-1) e(n)                             (7.74)

    w(n) = w(n-1) + μ U(n) z(n).                                    (7.75)
For the matrix solve in (7.74), K_inv is about 7. One way to reduce this computational complexity is to update the coefficients only once every N sample periods [9], reducing the average complexity (over N sample periods) to 2L + K_inv N multiplies per sample period. This is known as the partial rank algorithm (PRA). Simulations indicate that when very highly colored excitation signals are used, the convergence of PRA is somewhat inferior to that of APA. For speech excitation, however, we have found that PRA achieves almost the same convergence as APA. The main disadvantage of PRA is that its computational complexity is bursty. So, depending on the speed of the implementing technology, there is often a delay in the generation of the error vector, e(n). As will be shown below, FAP performs a complete N-dimensional APA update each sample period with 2L + O(N) multiplies per sample, without delay.
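The three update equations (7.73)-(7.75) can be exercised directly. The self-contained Python sketch below (our own illustration, not the chapter's code) runs a regularized APA with projection order N = 2, using a closed-form 2-by-2 solve for (7.74), on a colored AR(1) excitation of the kind that slows NLMS down:

```python
import random

def apa_n2_identify(w_sys, n_iter=3000, mu=1.0, delta=1e-4, seed=7):
    """Relaxed/regularized APA with projection order N = 2 (eqs. 7.73-7.75)."""
    rng = random.Random(seed)
    L = len(w_sys)
    w = [0.0] * L
    hist = [0.0] * (L + 1)                 # enough history for u(n) and u(n-1)
    x = 0.0
    for _ in range(n_iter):
        x = 0.9 * x + rng.gauss(0.0, 1.0)  # AR(1) colored excitation
        hist = [x] + hist[:-1]
        u0, u1 = hist[:L], hist[1:]        # columns of U(n): u(n), u(n-1)
        d0 = sum(a * b for a, b in zip(w_sys, u0))
        d1 = sum(a * b for a, b in zip(w_sys, u1))
        # (7.73): a priori error vector e(n) = d(n) - U^t(n) w(n-1)
        e0 = d0 - sum(a * b for a, b in zip(w, u0))
        e1 = d1 - sum(a * b for a, b in zip(w, u1))
        # (7.74): z(n) = (U^t U + delta I)^(-1) e(n), closed form for N = 2
        r00 = sum(a * a for a in u0) + delta
        r11 = sum(a * a for a in u1) + delta
        r01 = sum(a * b for a, b in zip(u0, u1))
        det = r00 * r11 - r01 * r01
        z0 = (r11 * e0 - r01 * e1) / det
        z1 = (r00 * e1 - r01 * e0) / det
        # (7.75): w(n) = w(n-1) + mu * U(n) z(n)
        w = [wi + mu * (a * z0 + b * z1) for wi, a, b in zip(w, u0, u1)]
    return w

w_sys = [0.8, -0.4, 0.2, -0.1]
w_hat = apa_n2_identify(w_sys)
err = sum((a - b) ** 2 for a, b in zip(w_sys, w_hat)) ** 0.5
assert err < 1e-4
```

The per-sample cost of this direct form grows with N²; FAP's contribution, developed next, is to reach the same update with only O(N) work beyond the two length-L convolutions.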
7.7.1
Earlier, we justified the approximation in relation (7.69) on the assumption that the regularization factor δ would be much smaller than the smallest eigenvalue of U^t(n)U(n). In this section we examine the situation where that assumption does not hold, yet we would like to use relation (7.69) anyway. This case arises, for instance, when N is selected to be in the neighborhood of 50, speech is the excitation signal, and the near-end background noise signal energy is larger than the smaller eigenvalues of U^t(n)U(n).
We begin by rewriting (7.68) slightly:

    e_1(n-1) = [I - μ U^t(n-1)U(n-1)(U^t(n-1)U(n-1) + δI)^(-1)] e(n-1).   (7.76)

The matrix U^t(n-1)U(n-1) has the similarity decomposition

    U^t(n-1) U(n-1) = V(n-1) Λ(n-1) V(n-1)^t.                       (7.77)

Defining the a priori and a posteriori modal error vectors,

    e'(n-1) = V(n-1)^t e(n-1)                                       (7.78)

and

    e'_1(n-1) = V(n-1)^t e_1(n-1),                                  (7.79)

respectively, we can multiply (7.76) from the left by V(n-1)^t and show that the ith a posteriori modal error vector element, e'_{1,i}(n-1), can be found from the ith a priori element as

    e'_{1,i}(n-1) = [1 - μ λ_i(n-1)/(λ_i(n-1) + δ)] e'_i(n-1),      (7.80)

so that

    e'_{1,i}(n-1) ≈ (1-μ) e'_i(n-1)   for λ_i(n-1) ≫ δ,
    e'_{1,i}(n-1) ≈ e'_i(n-1)          for λ_i(n-1) ≪ δ.            (7.81)
Assume that δ is chosen to be approximately equal to the power of y(n). Then, for those modes where λ_i(n-1) ≪ δ, e'_i(n-1) is mainly dominated by the background noise, and little can be learned about w_sys from it. So, suppressing these modes by multiplying them by (1-μ) will attenuate somewhat the background noise's effect on the overall echo path estimate. Applying this to (7.81) and multiplying from the left by V(n-1), we have

    e_1(n-1) ≈ (1-μ) e(n-1),                                        (7.82)

and from this, (7.70). From (7.76) we see that approximation (7.70) becomes an equality when δ = 0, but then the inverse in (7.76) is not regularized. Simulations show that by making adjustments in δ, the convergence performance of APA with and without approximation (7.76) can be equated. We call (7.82) the FAP approximation, as it is key to providing the algorithm's low complexity. Further justification of it is given in Section 7.7.7.

The complexity of (7.76) is L operations to calculate e(n) and N-1 operations to update (1-μ)e(n-1). For the case where μ = 1, the N-1 operations are obviously unnecessary.
7.7.2
In many problems of importance, the overall system output that is observed by the user is the error signal. In such cases, it is permissible to maintain any form of w(n) that is convenient as long as the first sample of e(n) is not modified in any way. This is the basis of FAP. The fidelity of e(n) is maintained at each sample period, but w(n) is not. Another vector, ŵ(n), is maintained, where only the last column of U(n) is weighted and accumulated into ŵ(n) in each sample period [10]. Thus, the computational complexity of the tap weight update process is no more complex than NLMS: L multiplications.
One can express the current echo path estimate, w(n), in terms of the original echo path estimate, w(0), and the subsequent U(i)'s and z(i)'s:

    w(n) = w(0) + μ Σ_{i=0}^{n-1} U(n-i) z(n-i)                     (7.83)

         = w(0) + μ Σ_{i=0}^{n-1} Σ_{j=0}^{N-1} u(n-j-i) z_j(n-i)   (7.84)

         = w(0) + μ Σ_{i=0}^{n-1} Σ_{j=0}^{N-1} u(n-j-i) Π_1(j+i) z_j(n-i),   (7.85)

where

    Π_1(j+i) = 1 for 0 ≤ j+i ≤ n-1, and 0 elsewhere.                (7.86)
Substituting k = j + i gives

    w(n) = w(0) + μ Σ_{j=0}^{N-1} Σ_{k=j}^{n-1+j} u(n-k) Π_1(k) z_j(n-k+j)   (7.87)

         = w(0) + μ Σ_{j=0}^{N-1} Σ_{k=j}^{n-1} u(n-k) z_j(n-k+j).            (7.88)
Now we break the second summation into two parts, one from k = j to k = N-1 and one from k = N to k = n-1, with the result

    w(n) = w(0) + μ Σ_{j=0}^{N-1} Σ_{k=j}^{N-1} u(n-k) z_j(n-k+j)
                + μ Σ_{k=N}^{n-1} Σ_{j=0}^{N-1} u(n-k) z_j(n-k+j),   (7.89)
where we have also changed the order of summations in the second double sum. Directing our attention to the first double sum, let us define a second window as

    Π_2(k-j) = 1 for 0 ≤ k-j, and 0 elsewhere.                      (7.90)

Without altering the result, we can use this window in the first double sum and begin the second summation in it at k = 0 rather than k = j:

    Σ_{j=0}^{N-1} Σ_{k=0}^{N-1} u(n-k) Π_2(k-j) z_j(n-k+j)
        = Σ_{j=0}^{N-1} Σ_{k=j}^{N-1} u(n-k) z_j(n-k+j).            (7.91)
Now we again exchange the order of summations and use the window, Π_2(k-j), to change the end of the second summation to j = k rather than j = N-1:

    Σ_{k=0}^{N-1} Σ_{j=0}^{k} u(n-k) z_j(n-k+j)
        = Σ_{k=0}^{N-1} Σ_{j=0}^{N-1} u(n-k) Π_2(k-j) z_j(n-k+j).   (7.92)
Using (7.91) and (7.92) in (7.89), we obtain

    w(n) = w(0) + μ Σ_{k=0}^{N-1} u(n-k) Σ_{j=0}^{k} z_j(n-k+j)
                + μ Σ_{k=N}^{n-1} u(n-k) Σ_{j=0}^{N-1} z_j(n-k+j).   (7.93)
We define the first term and the second pair of summations on the right side of (7.93) as

    ŵ(n) = w(0) + μ Σ_{k=N}^{n-1} u(n-k) Σ_{j=0}^{N-1} z_j(n-k+j),   (7.94)

so that

    w(n) = ŵ(n) + μ Σ_{k=0}^{N-1} u(n-k) Σ_{j=0}^{k} z_j(n-k+j).    (7.95)

The inner sums of (7.95) can be collected into the N-length vector

    E(n) = [ z_0(n)
             z_1(n) + z_0(n-1)
             ...
             z_{N-1}(n) + z_{N-2}(n-1) + ... + z_0(n-N+1) ],        (7.96)

whose kth element is E_k(n) = Σ_{j=0}^{k} z_j(n-k+j).
E(n) can be maintained recursively:

    E(n) = z(n) + [ 0
                    Ẽ(n-1) ],                                       (7.97)

where Ẽ(n-1) denotes the upper N-1 elements of E(n-1). The last element of E(n),

    E_{N-1}(n) = Σ_{j=0}^{N-1} z_j(n-N+1+j),                        (7.99)

is the weight used in the recursive update of ŵ(n):

    ŵ(n) = ŵ(n-1) + μ u(n-N+1) E_{N-1}(n).                          (7.100)

Writing (7.95) at time n-1 in matrix form gives

    w(n-1) = ŵ(n-1) + μ Ū(n-1) Ē(n-1),                              (7.101)

where Ū(n-1) = [u(n-1), ..., u(n-N+1)] and Ē(n-1) contains the first N-1 elements of E(n-1). The a priori error vector is then

    e(n) = d(n) - U^t(n) w(n-1) = [ e(n)
                                    (1-μ) ẽ(n-1) ].                 (7.102)
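The equivalence between the direct double sum for the accumulated weights and the shift-and-add recursion (7.97) can be checked numerically. The Python sketch below (our own illustration; the z_j(n) are filled with random values) builds the vector with elements E_k(n) = Σ_{j=0}^{k} z_j(n-k+j) both ways and confirms agreement:

```python
import random

def direct_E(z_hist, n, N):
    """E_k(n) = sum_{j=0}^{k} z_j(n-k+j), computed directly from the double sum."""
    return [sum(z_hist[n - k + j][j] for j in range(k + 1)) for k in range(N)]

def recursive_E(z_hist, n, N):
    """E(n) = z(n) + [0; upper N-1 elements of E(n-1)], iterated from time 0."""
    E = [0.0] * N
    for t in range(n + 1):
        shifted = [0.0] + E[:-1]          # [0; E_0(t-1), ..., E_{N-2}(t-1)]
        E = [z + s for z, s in zip(z_hist[t], shifted)]
    return E

random.seed(3)
N, n = 4, 12
z_hist = [[random.random() for _ in range(N)] for _ in range(n + 1)]
d_E = direct_E(z_hist, n, N)
r_E = recursive_E(z_hist, n, N)
assert all(abs(a - b) < 1e-12 for a, b in zip(d_E, r_E))
```

This is why FAP can carry the whole window of projection weights forward in O(N) work per sample instead of recomputing the sums.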
Unfortunately, w(n-1) is not readily available to us. But we can use (7.101) in the first element of (7.102) to get

    e(n) = d(n) - u^t(n) ŵ(n-1) - μ u^t(n) Ū(n-1) Ē(n-1)            (7.103)
         = ê(n) - μ r̃^t(n) Ē(n-1),                                  (7.104)

where

    ê(n) = d(n) - u^t(n) ŵ(n-1)                                     (7.105)

and where the correlation vector r̃(n) = Ū^t(n-1) u(n) can be updated recursively as

    r̃(n) = r̃(n-1) + u(n) ã(n) - u(n-L) ã(n-L),                      (7.106)

with ã(n) = [u(n-1), u(n-2), ..., u(n-N+1)]^t.
7.7.3

Define

    z(n) = R(n)^(-1) e(n),                                          (7.107)

where

    R(n) = U^t(n) U(n) + δI,                                        (7.108)

and let a(n) and b(n) denote the respective optimum forward and backward linear predictors for R(n), and let E_a(n) and E_b(n) denote their respective prediction error energies. Also, define R̄(n) and R̃(n) as the upper-left and lower-right N-1 by N-1 matrices within R(n), respectively. Then, given the identities

    R(n)^(-1) = [ 0  0^t
                  0  R̃(n)^(-1) ] + (1/E_a(n)) a(n) a(n)^t           (7.109)

and

    R(n)^(-1) = [ R̄(n)^(-1)  0
                  0^t         0 ] + (1/E_b(n)) b(n) b(n)^t,          (7.110)

and defining

    z̃(n) = R̃(n)^(-1) ẽ(n),                                          (7.112)

multiplying (7.109) from the right by e(n) and using (7.107) gives

    z(n) = [ 0
             z̃(n) ] + (1/E_a(n)) a(n) a(n)^t e(n).                   (7.113)

Similarly, multiplying (7.110) from the right by e(n) and using (7.107) and (7.112),

    z(n) = [ z̄(n)
             0    ] + (1/E_b(n)) b(n) b(n)^t e(n),                    (7.114)

or, rearranging,

    [ z̄(n)
      0    ] = z(n) - (1/E_b(n)) b(n) b(n)^t e(n).                    (7.115)
7.7.4 FAP
The FAP algorithm with regularization and relaxation is given in Table 7.1. Step 1 is of complexity 10N when the FTF (fast transversal filter, an FRLS technique) is used. Steps 3 and 9 are both of complexity L; steps 2, 6, and 7 are each of complexity 2N; and steps 4, 5, 8, and 10 are each of complexity N, for a total complexity of 2L + 20N.
TABLE 7.1  FAP with Regularization and Relaxation

Step    Computation                                                 Equation Reference
0       Initialization: E_a(0) = E_b(0) = δ,
        a(0) = [1, 0^t]^t, b(0) = [0^t, 1]^t
1       Use sliding windowed FRLS to update
        E_a(n), E_b(n), a(n), and b(n)                              See Appendix
2       r̃(n) = r̃(n-1) + u(n)ã(n) - u(n-L)ã(n-L)                    (7.106)
3       ê(n) = d(n) - u^t(n)ŵ(n-1)                                  (7.105)
4       e(n) = ê(n) - μ r̃^t(n)Ē(n-1)                                (7.103)
5       e(n) = [e(n); (1-μ)ẽ(n-1)]                                  (7.70)
6       z(n) = [0; z̃(n)] + (1/E_a(n)) a(n)a(n)^t e(n)               (7.113)
7       [z̄(n); 0] = z(n) - (1/E_b(n)) b(n)b(n)^t e(n)               (7.114)
8       E(n) = z(n) + [0; Ẽ(n-1)]                                    (7.97)
9       ŵ(n) = ŵ(n-1) + μ u(n-N+1)E_{N-1}(n)                        (7.100)
10      z̃(n+1) = (1-μ) z̄(n)                                         (7.117)
For μ = 1, steps 5 through 8 of Table 7.1 collapse into the single update

    E(n) = [ 0
             Ẽ(n-1) ] + [e(n)/E_a(n)] a(n).                          (7.118)
FAP without relaxation is shown in Table 7.2. Here, steps 3 and 6 are still of complexity L, step 2 is of complexity 2N, and steps 4 and 5 are of complexity N. Taking into account the sliding windowed FTF, we now have a total complexity of 2L + 14N.
7.7.5 Simulations
Figure 7.6 shows a comparison of the convergence of NLMS, FTF, and FAP coefficient error magnitudes. The excitation signal was speech sampled at 8 kHz; the system impulse response, of length L = 1000, was fixed; and the white Gaussian additive noise, y(n), was 30 dB down from the system output. Soft initialization was used for both algorithms. For FTF, E_a(0) and E_b(0) were both set to 2σ_u² (where σ_u² is the average power of u(n)), and λ, the forgetting factor, was set to (3L-1)/(3L). For FAP, E_a(0) and E_b(0) were set to δ = 20σ_u², and N was 50. FAP converges at roughly the same rate as FTF, with about 2L complexity versus 7L complexity, respectively. Both FAP and FTF converge faster than NLMS.
TABLE 7.2  FAP Without Relaxation (μ = 1)

Step    Computation                                                 Equation Reference
0       Initialization: E_a(0) = E_b(0) = δ,
        a(0) = [1, 0^t]^t, b(0) = [0^t, 1]^t
1       Use sliding windowed FRLS to update
        E_a(n), E_b(n), a(n), and b(n)                              See Appendix
2       r̃(n) = r̃(n-1) + u(n)ã(n) - u(n-L)ã(n-L)                    (7.106)
3       ê(n) = d(n) - u^t(n)ŵ(n-1)                                  (7.105)
4       e(n) = ê(n) - r̃^t(n)Ē(n-1)                                  (7.103)
5       E(n) = [0; Ẽ(n-1)] + (e(n)/E_a(n)) a(n)                     (7.118)
6       ŵ(n) = ŵ(n-1) + u(n-N+1)E_{N-1}(n)                          (7.100)
Figure 7.6  Comparison of coefficient error for FAP, FTF, and NLMS with speech as excitation.
7.7.6 Numerical Considerations
FAP uses the sliding window technique to update and downdate data in its implicit regularized sample correlation matrix and cross-correlation vector. Errors introduced by finite arithmetic in practical implementations of the algorithm therefore cause the correlation matrix and cross-correlation vector to take random walks with respect to their infinite precision counterparts. A stabilized sliding windowed FRLS algorithm [11] has been introduced, with complexity 14N multiplications per sample period (rather than 10N for nonstabilized versions). However, even this algorithm is stable only for stationary signals, a class of signals which certainly does not include speech. Another approach, which is very straightforward and rather elegant for FAP, is to periodically start a new sliding window in parallel with the old sliding window and, when the data are the same in both processes, replace the old sliding-window-based parameters with the new ones. Although this increases the sliding-window-based parameter calculations by about 50 percent on average (assuming that the restarting is done every L - N sample periods), the overall cost is small, since only those parameters with computational complexity proportional to N are affected. The overall complexity is only 2L + 21N for FAP without relaxation and 2L + 30N for FAP with relaxation. Since this approach is basically a periodic restart, it is numerically stable for all signals.

Figure 7.7
7.7.7
We now explore the effect of the FAP approximation of (7.82) on the noise term of the coefficient update. Returning to the noise term of the APA update as expressed in (7.43), we have

    T_{y,APA} = μ U(n)[U^t(n)U(n) + δI]^(-1) y(n).                   (7.119)

FAP has a similar update, except that the noise vector is weighted with the diagonal matrix

    D_μ = diag{1, (1-μ), ..., (1-μ)^(N-1)},                          (7.120)
which gives

    T_{y,FAP} = μ U(n)[U^t(n)U(n) + δI]^(-1) D_μ y(n).               (7.121)

The norm of T_{y,FAP} can be upper bounded by using the Schwartz inequality,

    ‖T_{y,FAP}‖ ≤ μ ‖U(n)[U^t(n)U(n) + δI]^(-1)‖ ‖D_μ y(n)‖.         (7.122)

Bounding each element of y(n) by its mean absolute value, the sum of the weights in D_μ is the geometric series

    Σ_{i=0}^{N-1} (1-μ)^i = [1 - (1-μ)^N]/μ.                         (7.124)

Taking the ratio of the FAP to the APA noise term upper bounds, we get

    ‖T_{y,FAP}‖_MA / ‖T_{y,APA}‖_MA = [1 - (1-μ)^N]/(Nμ).            (7.126)

This expression represents the proportional decrease in noise due to the FAP approximation compared to APA. As mentioned above, to maintain the same level of regularization, the FAP regularization must be multiplied by the same factor. Thus,

    δ_F = {[1 - (1-μ)^N]/(Nμ)} δ_A,                                  (7.127)

where δ_F and δ_A denote the FAP and APA regularization parameters, respectively.
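The scaling factor in (7.126) and (7.127) is easy to sanity-check numerically. In the Python sketch below (our own illustration), the ratio equals 1/N when μ = 1 and approaches 1 as μ → 0, so the FAP approximation attenuates the noise term most strongly for μ near unity:

```python
def fap_apa_noise_ratio(mu, N):
    """Ratio of FAP to APA noise-term upper bounds, eq. (7.126)."""
    return (1.0 - (1.0 - mu) ** N) / (N * mu)

N = 50
# mu = 1: D_mu keeps only the first noise sample, so the ratio is 1/N.
assert abs(fap_apa_noise_ratio(1.0, N) - 1.0 / N) < 1e-15
# mu -> 0: D_mu approaches the identity and the ratio approaches 1.
assert abs(fap_apa_noise_ratio(1e-6, N) - 1.0) < 1e-3
# The matched FAP regularization of eq. (7.127) scales the same way.
delta_A = 300.0
delta_F = fap_apa_noise_ratio(1.0, N) * delta_A
assert abs(delta_F - delta_A / N) < 1e-9
```

In other words, for μ = 1 the FAP regularization δ_F should be set about N times smaller than the APA value δ_A to obtain matching behavior.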
Figure 7.8

Figure 7.9  The eigenvalues of R_uu, the N-by-N excitation covariance matrix of the experiment of Figure 7.8, along with the noise and regularization levels.
7.8
This section discusses block exact methods for FAP [33] and APA. Block exact methods were first introduced by Benesty and Duhamel [55]. They are designed to give the so-called block adaptive filtering algorithms, whose coefficients are updated only once every M samples (the block length), the same convergence properties as per-sample algorithms, those whose coefficients are updated every sample period. The advantage of block algorithms is that, since the coefficients remain stationary over the block length, fast convolution techniques may be used in both the error calculation and the coefficient update. The disadvantage of block algorithms is that, because the coefficients are updated less frequently, they are slower to converge. The block exact methods eliminate this disadvantage.

In this section we consider block exact FAP updates for a block size of length M. The goal of the block exact version is to produce the same joint-process error sequence, e(n) = d(n) - u^t(n)w(n), as the per-sample version of Table 7.1. First, consider the calculation of the FAP joint-process error signal, ê(n), and begin with an example block size of M = 3. At sample period n - 2,
    ê(n-2) = d(n-2) - u^t(n-2) ŵ(n-3).                              (7.128)

At sample period n - 1,

    ê(n-1) = d(n-1) - u^t(n-1) ŵ(n-2)                               (7.129)
           = d(n-1) - u^t(n-1) ŵ(n-3)
             - μ u^t(n-1) u(n-N-1) E_{N-1}(n-2).                    (7.130)

The FAP coefficient vector update from sample period n-3 to n can be written as

    ŵ(n) = ŵ(n-3) + μ [u(n-N+1) E_{N-1}(n)
                       + u(n-N) E_{N-1}(n-1)
                       + u(n-N-1) E_{N-1}(n-2)].                    (7.131)
Stacking the M = 3 errors and using (7.128)-(7.131), we have

    ê_3(n) = [ d(n)
               d(n-1)
               d(n-2) ] - [ u^t(n)
                            u^t(n-1)
                            u^t(n-2) ] ŵ(n-3)
             - μ [ 0  u^t(n)u(n-N)  u^t(n)u(n-N-1)
                   0  0             u^t(n-1)u(n-N-1)
                   0  0             0                ] [ E_{N-1}(n)
                                                         E_{N-1}(n-1)
                                                         E_{N-1}(n-2) ].   (7.133)

Defining e_3(n) as the first two terms on the right side of (7.133) and

    r_i(n) = u^t(n) u(n-i),                                          (7.134)

we can write

    ê_3(n) = e_3(n) - μ [ 0  r_N(n)  r_{N+1}(n)
                          0  0       r_N(n-1)
                          0  0       0          ] [ E_{N-1}(n)
                                                    E_{N-1}(n-1)
                                                    E_{N-1}(n-2) ].   (7.135)
Collecting the block of update weights into the vector

    F(n) = [E_{N-1}(n), E_{N-1}(n-1), ..., E_{N-1}(n-M+1)]^t,        (7.138)

define the M-by-M selection matrix J_i by

    [J_i]_{jk} = 1 if j = k ≥ i, and 0 otherwise, for 0 ≤ j, k < M.   (7.139)

So, J_i has its ith through (M-1)th diagonal elements equal to unity and all others zero. For M = 3,

    J_0 = diag(1, 1, 1),   J_1 = diag(0, 1, 1),                      (7.140)

and

    J_2 = diag(0, 0, 1).                                             (7.141)

Also, define

    d_M(n) = [d(n), d(n-1), ..., d(n-M+1)]^t,                        (7.142)

    U_M(n) = [u(n), u(n-1), ..., u(n-M+1)],                          (7.143)

    a_{N,M}(n) = [u(n-N), u(n-N-1), ..., u(n-N-M+2)]^t.              (7.144)
Now we can write the block algorithm for arbitrary block size, M. It is shown in Table 7.3.

TABLE 7.3  Block Exact FAP

Step    Computation
1       Initially, ŵ(n-M), r_{N,M}(n-M), and E_{N-1}(n-M) are
        available from the previous block
2       e_M(n) = d_M(n) - U_M^t(n) ŵ(n-M)
3       for i = M-1 down to 0
4           r_{N,M}(n-i) = r_{N,M}(n-i-1) + u(n-i) a_{N,M}(n-i)
                           - u(n-i-L) a_{N,M}(n-i-L)
5           ê(n-i) = e(n-i) - μ r^t_{N,M}(n-i) J_{i+1} F(n)
6           Calculate E(n-i) and e(n-i) using Table 7.1, steps 1, 2,
            4 through 8, and 10
7       end of for-loop
8       ŵ(n) = ŵ(n-M) + μ U_M(n-N+1) F(n)
Note that, in step 5, that part of F(n) which has yet to be calculated at step i lies in the null space of J_{i+1}, so there is no problem of needing a value that is not yet available. The complexity of steps 2 and 8 is each about 2ML multiply/adds. Steps 3 through 7 have about 2.5M² + 20MN multiplies and/or adds. So the average complexity per sample period is 2L + 2.5M + 20N multiplies and/or adds. We can reduce the complexity of steps 2 and 8 by applying fast convolution techniques using either fast FIR filtering (FFF) or the FFT. For example, consider the use of the FFF method; then, using the complexity formulas given by Benesty and Duhamel (with r = log_2 M and R = L/M), if L = 1024 and M = 32, the BE-FAP average complexity for steps 2 and 8 would be 577 multiplications and 996 additions, compared to 2048 multiplications and additions for the comparable FAP calculation. Letting N = M, the remaining calculations (steps 3 through 7) of BE-FAP amount to an average of about 720 multiplies per sample. For standard FAP, the remaining complexity is about 640 multiplications. So, whereas FAP would have a complexity of 2048 + 640 = 2688 multiplies per sample, BE-FAP can achieve a lower complexity of 577 + 720 = 1297 multiplies per sample.

Rombouts and Moonen [27] have introduced sparse block exact FAP and APA algorithms. The idea is to change the constraint on the optimization problem from making the N most recent sample periods' a posteriori errors zero to making N of every kth sample period's a posteriori errors zero. So, instead of dealing with sample periods {n, ..., n-N+1}, one deals with sample periods {n, n-k, ..., n-k(N-1)}. The advantage is that, since speech is only correlated over a relatively short time, the excitation vectors of the new X(n) are less correlated with each other, so X^t(n)X(n) needs less regularization and the algorithm will achieve faster convergence.
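The sparse selection of sample periods is just a strided gather over the recent past. A tiny Python sketch (the helper is hypothetical, not from [27]) shows the index pattern:

```python
def sparse_periods(n, N, k):
    """Sample periods {n, n-k, ..., n-k(N-1)} used by sparse block exact FAP/APA."""
    return [n - k * i for i in range(N)]

assert sparse_periods(100, 4, 1) == [100, 99, 98, 97]   # ordinary APA window
assert sparse_periods(100, 4, 5) == [100, 95, 90, 85]   # sparse window, stride k = 5
```

Spacing the constraint times k samples apart lowers the correlation between the excitation vectors entering X(n), which is exactly why the sparse variant tolerates less regularization.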
7.9
Very often, many channels of network echo cancellers are grouped together at VoIP (voice over Internet protocol) gateways. Because VoIP increases the round-trip delay of the voice signal, greater echo canceller performance, as measured by ERLE (echo return loss enhancement), is required to prevent the user from being annoyed by echo. In addition, price pressure requires that the cost in memory and multiply/accumulate cycles be lower than in previous implementations. In [13] both of these requirements are addressed. The complexity of the coefficient updates is lowered by updating only part of the coefficients in each sample period. On the other hand, the convergence is accelerated by using affine projections to update the selected coefficients.
Let us break up the coefcient vector, wn, and the coefcient update vector,
rn, into M blocks of length N, where we dene M and N such that L MN,
wn w0 nt ; w1 nt ; . . . ; wM1 nt t
7:145
rn r0 nt ; r1 nt ; . . . ; rM1 nt t
7:146
and
0 ; . . . ; 0 ; ri n ; 0 ; . . . ; 0 ;
t
t t
7:147
where in (7.147) we use the fact that the update vector is zero the ith block, the one
that is to be updated. It is also useful to dene data blocks:
Un U0 nt ; U1 nt ; . . . ; UM1 nt t :
7:148
Recall that in APA we minimized the length of the coefficient update vector, r(n),
under the constraint that the a posteriori error vector was zero. We do the same here,
but we restrict ourselves to updating only the block of coefficients that yields the
smallest update subvector, r_i(n). First, let us derive the update vector for an
arbitrary block i. The ith cost function is

C_i = d r_i(n)^t r_i(n) + ||e_1(n)||^2,    (7.149)

where

e_1(n) = e(n) - U(n)^t r(n)    (7.150)
       = e(n) - U_i(n)^t r_i(n),    (7.151)

and in the last step we have used (7.147).
Using (7.151) in (7.149) we may take the derivative of C_i with respect to r_i(n), set
it equal to 0, and solve for r_i(n), yielding

r_i(n) = U_i(n)[U_i(n)^t U_i(n) + d I]^{-1} e(n).    (7.152)

As we stated earlier, we update the block that has the smallest update size. That is,
we seek

i = arg min_{0<=j<M} ||r_j(n)||    (7.153)
  = arg min_{0<=j<M} e(n)^t [U_j(n)^t U_j(n) + d I]^{-1} e(n),    (7.154)

where in the last step we assumed that d was small enough to ignore. The coefficient
update can be expressed as

w_i(n) = w_i(n-1) + m U_i(n)[U_i(n)^t U_i(n) + d I]^{-1} e(n).    (7.155)
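The selection-and-update step of (7.152)-(7.155) can be sketched numerically. The following is a minimal illustration with our own toy dimensions and variable names (it is not the code from [13]); with step size 1 and a tiny regularization, the a posteriori error of the chosen block is driven nearly to zero.

```python
import numpy as np

# Illustrative sketch of the selective-partial-update block affine projection
# step (7.152)-(7.155). All sizes and names are our own toy choices:
# L = M*N coefficients split into M blocks, projection order P.
rng = np.random.default_rng(0)
L, M, P = 8, 4, 2
N = L // M
delta, mu = 1e-4, 1.0

w = np.zeros(L)                            # stacked coefficient blocks
U = rng.standard_normal((L, P))            # data matrix U(n), rows in blocks
d = rng.standard_normal(P)                 # desired-signal vector
e = d - U.T @ w                            # a priori error vector e(n)

# Candidate updates r_i(n) = U_i [U_i^t U_i + delta I]^{-1} e(n), (7.152)
blocks = [U[i * N:(i + 1) * N, :] for i in range(M)]
r = [Ui @ np.linalg.solve(Ui.T @ Ui + delta * np.eye(P), e) for Ui in blocks]

# Update only the block with the smallest update norm, (7.153)-(7.155)
i = int(np.argmin([np.linalg.norm(ri) for ri in r]))
w[i * N:(i + 1) * N] += mu * r[i]
```

Note that minimizing the update norm tends to select a well-excited block, which is also why the a posteriori error it leaves behind is small.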
7.10

CONCLUSIONS

This chapter has discussed the APA and its fast implementations, including FAP.
We have shown that APA is an algorithm that bridges the well-known NLMS and
RLS adaptive filters. We discussed APA's convergence properties and its
performance in the presence of noise. In particular, we discussed appropriate
methods of regularization.
When the length of the adaptive filter is L and the dimension of the affine
projection (performed each sample period) is N, FAP's complexity is either 2L +
14N or 2L + 20N, depending on whether the relaxation parameter is 1 or smaller,
respectively. Usually N << L. We showed that even though FAP entails an
approximation that is not entirely valid under regularization, the same convergence
as for APA may be obtained by adjusting the regularization factor by a
predetermined scalar value. Simulations demonstrate that FAP converges as fast as
the more complex and memory-intensive FRLS methods when the excitation signal
is speech. The implicit correlation matrix inverse of FAP is regularized, so the
algorithm is easily stabilized for even highly colored excitation.
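As a quick arithmetic check of the complexity figures just quoted (the function name and example sizes are our own), the per-sample multiply counts for typical echo-canceller dimensions are:

```python
# Quick arithmetic check of the FAP per-sample multiply counts quoted above
# (illustrative only; the function name and the sample sizes are ours).
def fap_multiplies(L, N, relaxation_is_one=True):
    """2L + 14N when the relaxation parameter is 1, else 2L + 20N."""
    return 2 * L + (14 if relaxation_is_one else 20) * N

L, N = 1024, 10
c_relax_one = fap_multiplies(L, N)          # 2*1024 + 14*10
c_relax_lt1 = fap_multiplies(L, N, False)   # 2*1024 + 20*10
```

Because N << L, both counts are dominated by the 2L term.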
7.11
APPENDIX
In this appendix we derive an N-length sliding windowed fast recursive-least-squares
algorithm (SW-FRLS). The FRLS algorithms usually come in two parts.
One is the Kalman gain part, and the other is the joint process estimation part. For
FAP we only need to consider the Kalman gain part since that gives us the forward
and backward prediction vectors and energies. However, for completeness, we
derive both parts in this appendix. Therefore, let us say that a desired signal, d_N(n), is
generated from

d_N(n) = h_sys^t u(n) + y(n).    (7.156)

With an exponential window, the sample correlation matrix and cross-correlation
vector obey the rank-one recursions

R(n) = SUM_{i=0}^{inf} l^i u(n-i)u(n-i)^t = l R(n-1) + u(n)u(n)^t    (7.158)

and

r_du(n) = SUM_{i=0}^{inf} l^i d(n-i)u(n-i) = l r_du(n-1) + d(n)u(n).    (7.159)
If a rectangular window is used, then one can apply the sliding window technique to
update the matrix using a rank-two approach. That is,

R(n) = SUM_{i=0}^{L-1} u(n-i)u(n-i)^t = R(n-1) + u(n)u(n)^t - u(n-L)u(n-L)^t    (7.160)

and

r_du(n) = SUM_{i=0}^{L-1} d(n-i)u(n-i) = r_du(n-1) + d(n)u(n) - d(n-L)u(n-L).    (7.161)

Let

B(n) = [u(n), u(n-L)],    (7.162)

d(n) = [d(n), d(n-L)]^t,    (7.163)

and

J = [ 1   0
      0  -1 ].    (7.164)

Then we can write (7.160) as

R(n) = R(n-1) + B(n)JB(n)^t    (7.165)

and (7.161) as

r_du(n) = r_du(n-1) + B(n)Jd(n).    (7.166)

Let

P(n) = R(n)^{-1}.    (7.167)
Using this in (7.165) and applying the matrix inversion lemma, we have
P(n) = P(n-1) - P(n-1)B(n)[I + JB(n)^t P(n-1)B(n)]^{-1} JB(n)^t P(n-1).    (7.168)

We now define the two-by-two likelihood matrices. The first is found in the
denominator of (7.168):

V(n) = JB(n)^t P(n-1)B(n)    (7.169)
     = JB(n)^t K_0(n)    (7.170)
     = JK_0(n)^t B(n),    (7.171)
where
K_0(n) = P(n-1)B(n) = [k_0(n), k_0(n-L)].    (7.172)

Here the a priori Kalman gain matrix, K_0(n), has been used. It is composed of two a
priori Kalman gain vectors defined as

k_0(n) = P(n-1)u(n)    (7.173)

and

k_0(n-L) = P(n-1)u(n-L).    (7.174)
The notation in (7.174) is slightly misleading in that one may think that k_0(n-L)
should equal P(n-L-1)u(n-L) in order to maintain complete consistency with
(7.173). We permit this inconsistency, however, for the sake of simplified notation
and trust that it will not cause a great deal of difficulty. In a similar fashion, the a
posteriori Kalman gain vectors are

k_1(n) = P(n)u(n)    (7.175)

and

k_1(n-L) = P(n)u(n-L),    (7.176)

and the a posteriori Kalman gain matrix is

K_1(n) = P(n)B(n) = [k_1(n), k_1(n-L)].    (7.177)
The second likelihood variable matrix takes into account the entire inverted matrix
in (7.168):

Q(n) = [I + JB(n)^t P(n-1)B(n)]^{-1}    (7.178)
     = [I + V(n)]^{-1}.    (7.179)

From (7.179),

Q(n)^{-1} = I + V(n),    (7.180)

so that

V(n) = Q(n)^{-1} - I    (7.181)

and

I = Q(n)Q(n)^{-1} = Q(n) + Q(n)V(n),    (7.182)

or

Q(n)V(n) = V(n)Q(n) = I - Q(n).    (7.183)
Thus, (7.179) through (7.183) show the relationships between the two likelihood
matrices.
We now examine the relationship between the a priori and a posteriori Kalman
gain matrices. From (7.168), (7.172), and (7.178) it is clear that
P(n) = P(n-1) - K_0(n)Q(n)JK_0(n)^t.    (7.184)

Multiplying from the right by B(n) and using (7.171), (7.172), and (7.177), we get

K_1(n) = K_0(n) - K_0(n)Q(n)V(n)    (7.185)
       = K_0(n)Q(n).    (7.186)
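The identities (7.165), (7.168)/(7.184), and (7.186) are easy to verify numerically. The sketch below is our own sanity check with toy dimensions: it slides a rectangular window by one sample via the rank-two update, propagates the inverse with the matrix-inversion-lemma form, and confirms that the resulting K_1 equals K_0 Q.

```python
import numpy as np

# Numerical sanity check (illustrative, not from the book) of the sliding-
# window identities: R(n) = R(n-1) + B J B^t, the inverse update
# (7.168)/(7.184), and the gain relation K1 = K0 Q of (7.186).
rng = np.random.default_rng(1)
N, Lwin = 4, 12
X = rng.standard_normal((N, Lwin + 1))        # columns play the role of u(n-i)

R_old = X[:, 1:] @ X[:, 1:].T + 1e-3 * np.eye(N)   # window ending at n-1
P_old = np.linalg.inv(R_old)

B = np.column_stack((X[:, 0], X[:, Lwin]))    # B(n) = [u(n), u(n-L)], (7.162)
J = np.diag([1.0, -1.0])                      # (7.164)

R_new = R_old + B @ J @ B.T                   # (7.165): slide the window
V = J @ B.T @ P_old @ B                       # (7.169)
Q = np.linalg.inv(np.eye(2) + V)              # (7.179)
P_new = P_old - P_old @ B @ Q @ J @ B.T @ P_old    # (7.168)/(7.184)

K0 = P_old @ B                                # (7.172)
K1 = P_new @ B                                # (7.177)
err_inv = np.abs(P_new @ R_new - np.eye(N)).max()
err_gain = np.abs(K1 - K0 @ Q).max()
```

Both residuals are at machine-precision level, confirming that the rank-two inverse update and the gain relation are exact algebraic identities.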
Now we explore the methods of efficiently updating the a posteriori and a priori
Kalman gain vectors from sample period to sample period. We start with the
identities

P(n) = [ 0     0^t
         0   P~(n) ] + (1/E_a(n)) a(n)a(n)^t    (7.187)

     = [ P-(n)  0
         0^t    0 ] + (1/E_b(n)) b(n)b(n)^t,    (7.188)

where

a(n) is the N-length forward prediction vector,
E_a(n) is the forward prediction error energy,
b(n) is the N-length backward prediction vector, and
E_b(n) is the backward prediction error energy.

We recognize that

P~(n) = P-(n-1).    (7.189)
In the same manner, the tilde and bar quantities derived below provide the
bridge from sample period n-1 to n. First, we derive a few additional definitions.
Implicitly define B~(n) and B-(n) as

B(n) = [ u(n), u(n-L)
         B~(n)        ] = [ B-(n)
                            u(n-N+1), u(n-N+1-L) ].    (7.190)

This naturally leads us to define the tilde and bar Kalman gain matrices,

K~_0(n) = P~(n-1)B~(n),    (7.191)

K-_0(n) = P-(n-1)B-(n),    (7.192)

K~_1(n) = P~(n)B~(n),    (7.193)

and

K-_1(n) = P-(n)B-(n).    (7.194)
Multiplying P(n-1) from the right with B(n) and then using (7.187) and (7.188) at
sample n-1 rather than n, we get the relationship between the a priori Kalman gain
matrix and its tilde and bar versions:

K_0(n) = P(n-1)B(n)
       = [ 0, 0
           K~_0(n) ] + (1/E_a(n-1)) a(n-1)e_{0,a}(n)^t    (7.195)

       = [ K-_0(n)
           0, 0    ] + (1/E_b(n-1)) b(n-1)e_{0,b}(n)^t,    (7.196)

where

e_{0,a}(n) = B(n)^t a(n-1)    (7.197)

and

e_{0,b}(n) = B(n)^t b(n-1)    (7.198)

are the a priori forward and backward linear prediction errors, respectively. The a
posteriori prediction errors are

e_{1,a}(n) = B(n)^t a(n)    (7.199)

and

e_{1,b}(n) = B(n)^t b(n).    (7.200)
Relationships similar to (7.195) and (7.196) can be found for the a posteriori Kalman
gain matrix using identities (7.187) and (7.188) for P(n), yielding

K_1(n) = P(n)B(n)
       = [ 0, 0
           K~_1(n) ] + (1/E_a(n)) a(n)e_{1,a}(n)^t    (7.201)

       = [ K-_1(n)
           0, 0    ] + (1/E_b(n)) b(n)e_{1,b}(n)^t.    (7.202)

We can see the relationships between the linear prediction errors, the expected
squared prediction errors, and the first and last Kalman gain matrix elements by first
equating the first coefficients in (7.195) and (7.201), yielding
[k_{0,1}(n), k_{0,1}(n-L)] = (1/E_a(n-1)) e_{0,a}(n)^t    (7.203)

and

[k_{1,1}(n), k_{1,1}(n-L)] = (1/E_a(n)) e_{1,a}(n)^t,    (7.204)

and then equating the last coefficients in (7.196) and (7.202), yielding

[k_{0,N}(n), k_{0,N}(n-L)] = (1/E_b(n-1)) e_{0,b}(n)^t    (7.205)

and

[k_{1,N}(n), k_{1,N}(n-L)] = (1/E_b(n)) e_{1,b}(n)^t.    (7.206)
The likelihood matrices also have tilde and bar counterparts. Starting with (7.169) in
a straightforward manner, we define

V~(n) = JB~(n)^t P~(n-1)B~(n) = JB~(n)^t K~_0(n),    (7.207)

V-(n) = JB-(n)^t P-(n-1)B-(n) = JB-(n)^t K-_0(n),    (7.208)

Q~(n) = [I + V~(n)]^{-1},    (7.209)

and

Q-(n) = [I + V-(n)]^{-1}.    (7.210)
Also, (7.180) through (7.183) hold for their tilde and bar counterparts. For example,
the counterparts for (7.183) are

Q~(n)V~(n) = V~(n)Q~(n) = I - Q~(n)    (7.211)

and

Q-(n)V-(n) = V-(n)Q-(n) = I - Q-(n).    (7.212)

In addition, (7.186) holds true for the tilde and bar versions. For example,

K~_1(n) = K~_0(n)Q~(n).    (7.213)
The relationship between V(n) and its tilde and bar variants can be seen by first
multiplying (7.195) and (7.196) from the left by JB(n)^t, yielding

V(n) = V~(n) + (1/E_a(n-1)) Je_{0,a}(n)e_{0,a}(n)^t    (7.214)

and

V(n) = V-(n) + (1/E_b(n-1)) Je_{0,b}(n)e_{0,b}(n)^t.    (7.216)

Since Q(n)^{-1} = I + V(n) and Q-(n)^{-1} = I + V-(n), (7.216) gives

Q(n)^{-1} = Q-(n)^{-1} + (1/E_b(n-1)) Je_{0,b}(n)e_{0,b}(n)^t    (7.219)

or

Q-(n)^{-1} = Q(n)^{-1} - (1/E_b(n-1)) Je_{0,b}(n)e_{0,b}(n)^t.    (7.220)

Inverting (7.220),

Q-(n) = [I - (1/E_b(n-1)) Q(n)Je_{0,b}(n)e_{0,b}(n)^t]^{-1} Q(n),    (7.221)

giving us a useful relationship between Q(n) and Q-(n).
We now find a relationship between Q(n) and Q~(n). Multiplying (7.201) by
JB(n)^t from the left and using (7.170), (7.186), and (7.183) gives

I - Q(n) = [I - Q~(n)] + (1/E_a(n)) Je_{1,a}(n)e_{1,a}(n)^t,    (7.222)

so that

Q(n) = Q~(n) - (1/E_a(n)) Je_{1,a}(n)e_{1,a}(n)^t,    (7.223)

the relationship between Q(n) and Q~(n).
Similarly, we can show another relationship between Q(n) and Q-(n), starting
from (7.202) and using the same steps we used to derive (7.223):

Q(n) = Q-(n) - (1/E_b(n)) Je_{1,b}(n)e_{1,b}(n)^t.    (7.225)
The expected forward prediction error energy, E_a(n), update can be derived by first
multiplying (7.160) from the right by a(n-1):

R(n)a(n-1) = [R(n-1) + B(n)JB(n)^t]a(n-1)
           = E_a(n-1)[1, 0, ..., 0]^t + B(n)Je_{0,a}(n).    (7.227)

Multiplying from the left by P(n) and using the fact, from (7.187), that the first
column of P(n) is a(n)/E_a(n), we get

a(n-1) = (E_a(n-1)/E_a(n)) a(n) + K_1(n)Je_{0,a}(n).    (7.228)

Equating the first elements on both sides yields

1 = E_a(n-1)/E_a(n) + [k_{1,1}(n), k_{1,1}(n-L)]Je_{0,a}(n).    (7.229)

Using (7.204) in (7.229) and rearranging gives the energy update

E_a(n) = E_a(n-1) + e_{0,a}(n)^t Je_{1,a}(n).    (7.230)
We now derive the update for the forward linear predictor, a(n), using the a priori
prediction errors and the a posteriori tilde Kalman gain matrix. Using (7.229) solved
for E_a(n-1)/E_a(n) in (7.228) yields

a(n-1) = (1 - [k_{1,1}(n), k_{1,1}(n-L)]Je_{0,a}(n)) a(n) + K_1(n)Je_{0,a}(n)    (7.231)

       = a(n) + [ 0, 0
                  K~_1(n) ] Je_{0,a}(n),    (7.232)

where we have used (7.204) and (7.201). Solving for a(n), we have the result

a(n) = a(n-1) - [ 0, 0
                  K~_1(n) ] Je_{0,a}(n).    (7.233)
The a posteriori forward linear prediction errors can be found from the a priori
forward prediction errors using Q~(n). First, using (7.207), (7.213), and (7.211), we
have

JB~(n)^t K~_1(n) = I - Q~(n).    (7.234)

Multiplying (7.233) from the left by JB(n)^t and using (7.234), we find

Je_{1,a}(n) = Je_{0,a}(n) - [I - Q~(n)]Je_{0,a}(n) = Q~(n)Je_{0,a}(n).    (7.235)

We can find another relation between Q(n) and Q~(n). From (7.235) we write

Je_{0,a}(n) = Q~(n)^{-1}Je_{1,a}(n) = [I + V~(n)]Je_{1,a}(n).    (7.236)
Combining (7.236) with (7.214) and (7.180),

Q(n)^{-1} = Q~(n)^{-1}[I + (1/E_a(n-1)) Je_{1,a}(n)e_{0,a}(n)^t],    (7.237)

so that

Q(n) = [I + (1/E_a(n-1)) Je_{1,a}(n)e_{0,a}(n)^t]^{-1} Q~(n).    (7.238)

The forward linear prediction vector can also be updated using the a posteriori
prediction errors and the a priori tilde Kalman gain matrix. Using (7.213) we can
write (7.233) as follows:

a(n) = a(n-1) - [ 0, 0
                  K~_0(n) ] Q~(n)Je_{0,a}(n)    (7.239)

     = a(n-1) - [ 0, 0
                  K~_0(n) ] Je_{1,a}(n).    (7.240)
The backward predictor relations follow in the same fashion:

b(n-1) = (E_b(n-1)/E_b(n)) b(n) + K_1(n)Je_{0,b}(n)    (7.241)

and, equating the last elements,

1 = E_b(n-1)/E_b(n) + [k_{1,N}(n), k_{1,N}(n-L)]Je_{0,b}(n),    (7.242)

which yield the energy update

E_b(n) = E_b(n-1) + e_{1,b}(n)^t Je_{0,b}(n).    (7.243)

The backward predictor updates are

b(n) = b(n-1) - [ K-_1(n)
                  0, 0    ] Je_{0,b}(n)    (7.245)

     = b(n-1) - [ K-_0(n)
                  0, 0    ] Je_{1,b}(n),    (7.247)

where Je_{1,b}(n) = Q-(n)Je_{0,b}(n), or, solving (7.241) with (7.242),

b(n) = (1 - [k_{1,N}(n), k_{1,N}(n-L)]Je_{0,b}(n))^{-1} [b(n-1) - K_1(n)Je_{0,b}(n)].    (7.248)
We now relate the a posteriori residual echo to the a priori residual echo. This is
done merely for completeness; the FAP algorithm generates its own residual echo
based on the longer vector u(n). We begin by writing the a priori desired signal
estimate:

d^_0(n)^t = [d^_0(n), d^_0(n-L)]    (7.249)
          = w(n-1)^t B(n)    (7.250)
          = r(n-1)^t P(n-1)B(n)    (7.251)
          = r(n-1)^t K_0(n).    (7.252)

Proceeding in the same way for the a posteriori estimate and subtracting from
d(n)^t, one finds

d^_1(n)^t = d(n)^t - e_0(n)^t Q(n),    (7.256)

where

e_0(n) = d(n) - B(n)^t w(n-1)    (7.257)

is the a priori residual echo vector.
Noting that

Q(n)^t = JQ(n)J,    (7.260)

we can write

Je_1(n) = Q(n)Je_0(n).    (7.261)

The echo canceller coefficient update can be found from the solution of the
least-squares problem:

w(n) = P(n)r(n).    (7.262)
TABLE 7.4 The Rectangular Windowed Fast Kalman Algorithm

Part 1: Kalman Gain Calculations                             Equation Reference
1.  e_{0,a}(n) = B(n)^t a(n-1)                               7.197
2.  a(n) = a(n-1) - [0, 0; K~_1(n)] Je_{0,a}(n)              7.233
3.  e_{1,a}(n) = B(n)^t a(n)                                 7.199
4.  E_a(n) = E_a(n-1) + e_{0,a}(n)^t Je_{1,a}(n)             7.230
5.  K_1(n) = [0, 0; K~_1(n)] + (1/E_a(n)) a(n)e_{1,a}(n)^t   7.201
6.  e_{0,b}(n) = B(n)^t b(n-1)                               7.198
7a. extract the last coefficients, [k_{1,N}(n), k_{1,N}(n-L)]
7b. x = (1 - [k_{1,N}(n), k_{1,N}(n-L)]Je_{0,b}(n))^{-1}     7.248
8.  b(n) = x[b(n-1) - K_1(n)Je_{0,b}(n)]                     7.248
9.  K_0(n) = [K-_0(n); 0, 0] + (1/E_b(n-1)) b(n-1)e_{0,b}(n)^t   7.196

Part 2: Joint Process Extension
10. e_0(n) = d(n) - B(n)^t w(n-1)                            7.257
11. w(n) = w(n-1) + K_1(n)Je_0(n)                            7.265
Substituting (7.166) and (7.184) into (7.262) and simplifying leads to the recursion

w(n) = w(n-1) + K_1(n)Je_0(n).    (7.265)

Using (7.186) and (7.261), we can express the coefficient update alternatively as

w(n) = w(n-1) + K_0(n)Je_1(n).    (7.266)
We are now ready to write the FRLS algorithms. The rectangular windowed fast
Kalman algorithm is shown in Table 7.4, and the sliding windowed stabilized fast
transversal filter algorithm is shown in Table 7.5. The algorithms are separated
TABLE 7.5 The Sliding Windowed Stabilized Fast Transversal Filter Algorithm

Part 1: Kalman Gain Calculations                             Equation Reference
1.  e_{0,a}(n) = B(n)^t a(n-1)                               7.197
2.  Je_{1,a}(n) = Q~(n)Je_{0,a}(n)                           7.235
3.  [k_{0,1}(n), k_{0,1}(n-L)] = (1/E_a(n-1)) e_{0,a}(n)^t   7.203
4.  E_a(n) = E_a(n-1) + e_{0,a}(n)^t Je_{1,a}(n)             7.230
5.  K_0(n) = [0, 0; K~_0(n)] + (1/E_a(n-1)) a(n-1)e_{0,a}(n)^t   7.195
6.  e_{0,b}(n)^t = E_b(n-1)[k_{0,N}(n), k_{0,N}(n-L)]        7.205
7.  [K-_0(n); 0, 0] = K_0(n) - (1/E_b(n-1)) b(n-1)e_{0,b}(n)^t   7.196
8.  Q(n) = Q~(n) - (1/E_a(n)) Je_{1,a}(n)e_{1,a}(n)^t        7.223
9.  Q-(n) = [I - (1/E_b(n-1)) Q(n)Je_{0,b}(n)e_{0,b}(n)^t]^{-1} Q(n)   7.221
10. Je_{1,b}(n) = Q-(n)Je_{0,b}(n)
11. E_b(n) = E_b(n-1) + e_{1,b}(n)^t Je_{0,b}(n)             7.243
12. a(n) = a(n-1) - [0, 0; K~_0(n)] Je_{1,a}(n)              7.240
13. b(n) = b(n-1) - [K-_0(n); 0, 0] Je_{1,b}(n)              7.246

Part 2: Joint Process Extension
14. e_0(n) = d(n) - B(n)^t w(n-1)                            7.257
15. Je_1(n) = Q(n)Je_0(n)                                    7.261
16. w(n) = w(n-1) + K_0(n)Je_1(n)                            7.266
into their Kalman gain and joint process extension parts. Only the Kalman gain
parts are used in the FAP algorithms. The joint process extensions are given for
completeness.
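Before the fast O(N) forms of Tables 7.4 and 7.5 are trusted, it helps to see the slow O(N^2) sliding-window RLS they accelerate. The sketch below is our own toy example (system, window length, and names are assumptions, not the book's code): it slides the window with the rank-two inverse update and recovers the system via the joint-process solution w(n) = P(n)r(n).

```python
import numpy as np

# Illustrative slow-form sliding-window RLS, exercising the identities the
# SW-FRLS derivation above accelerates: the window slides via the rank-two
# updates (7.165)-(7.166) and (7.184), and the joint-process solution is
# w(n) = P(n)r(n), (7.262). Setup is our own toy example.
rng = np.random.default_rng(2)
N, Lwin, T = 3, 10, 60
h_sys = rng.standard_normal(N)                 # unknown system
x = rng.standard_normal(T + N)
u = lambda n: x[n:n + N][::-1]                 # excitation vector u(n)
d = lambda n: float(h_sys @ u(n))              # noiseless desired signal

delta = 1e-2                                   # small ridge for startup
J = np.diag([1.0, -1.0])                       # (7.164)
n0 = Lwin - 1                                  # first full window ends here
R = delta * np.eye(N) + sum(np.outer(u(n0 - i), u(n0 - i)) for i in range(Lwin))
P = np.linalg.inv(R)                           # P(n) = R(n)^{-1}, (7.167)
r = sum(d(n0 - i) * u(n0 - i) for i in range(Lwin))

for n in range(n0 + 1, T):
    B = np.column_stack((u(n), u(n - Lwin)))   # B(n), (7.162)
    dv = np.array([d(n), d(n - Lwin)])         # d(n), (7.163)
    K0 = P @ B                                 # a priori gain matrix, (7.172)
    Q = np.linalg.inv(np.eye(2) + J @ B.T @ K0)     # (7.178)
    P = P - K0 @ Q @ J @ K0.T                  # (7.184)
    r = r + B @ J @ dv                         # (7.166)

w = P @ r                                      # (7.262)
```

The fast algorithms replace the explicit P(n) propagation here with the forward/backward-predictor recursions of the tables, at O(N) cost per sample.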
REFERENCES
1. K. Ozeki, T. Umeda, An Adaptive Filtering Algorithm Using an Orthogonal Projection
to an Affine Subspace and Its Properties, Electronics and Communications in Japan,
Vol. 67-A, No. 5, 1984.
2. B. Widrow, S. D. Stearns, Adaptive Signal Processing, Prentice-Hall, Inc., Englewood
Cliffs, N.J., 1985.
3. J. M. Cioffi, T. Kailath, Fast, Recursive-Least-Squares Transversal Filters for Adaptive
Filtering, IEEE Trans. on Acoustics, Speech, and Signal Proc., Vol. ASSP-32, No. 2,
April 1984.
4. S. J. Orfanidis, Optimum Signal Processing: An Introduction, Macmillan, New York,
1985.
5. S. L. Gay, A Fast Converging, Low Complexity Adaptive Filtering Algorithm, Third
Intl. Workshop on Acoustic Echo Control, 7-8 Sept. 1993, Plestin les Grèves, France.
6. S. L. Gay, Fast Projection Algorithms with Application to Voice Excited Echo
Cancellers, Ph.D. Dissertation, Rutgers University, Piscataway, N.J., Oct. 1994.
7. M. Tanaka, Y. Kaneda, S. Makino, Reduction of Computation for High-Order
Projection Algorithm, 1993 Electronics Information Communication Society Autumn
Seminar, Tokyo, Japan (in Japanese).
8. J. M. Cioffi, T. Kailath, Windowed Fast Transversal Filters Adaptive Algorithms with
Normalization, IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. ASSP-33,
No. 3, June 1985.
9. S. G. Kratzer, D. R. Morgan, The Partial-Rank Algorithm for Adaptive
Beamforming, SPIE, Vol. 564, Real Time Signal Processing VIII, 1985.
10. Y. Maruyama, A Fast Method of Projection Algorithm, Proc. 1990 IEICE Spring
Conf., B-744, 1990.
11. D. T. M. Slock, T. Kailath, Numerically Stable Fast Transversal Filters for Recursive
Least Squares Adaptive Filtering, IEEE Trans. on Signal Processing, Vol. 39, No. 1, Jan.
1991.
12. R. D. DeGroat, D. Begusic, E. M. Dowling, D. A. Linebarger, Spherical Subspace and
Eigen Based Affine Projection Algorithms, Proc. of IEEE Intl. Conf. on Acoustics,
Speech and Signal Processing, Vol. 3, pp. 2345-2348, 1997.
13. K. Dogancay, O. Tanrikulu, Adaptive Filtering Algorithms with Selective Partial
Updates, IEEE Trans. on Circuits and Systems II: Analog and Digital Signal
Processing, Vol. 48, No. 8, Aug. 2001.
14. A. Ben Rabaa, R. Tourki, Acoustic Echo Cancellation Based on a Recurrent Neural
Network and a Fast Affine Projection Algorithm, Proc. of the 24th Annual Conf. of the
IEEE Industrial Electronics Society, Vol. 3, pp. 1754-1757, 1998.
15. M. Muneyasu, T. Hinamoto, A New 2-D Adaptive Filter Using Affine Projection
Algorithm, Proc. of the ISCAS 1998, Vol. 5, pp. 90-93, 1998.
46. H. Ding, A Stable Fast Affine Projection Adaptation Algorithm Suitable for Low-Cost
Processors, Proc. of IEEE Intl. Conf. on Acoustics, Speech and Signal Processing, Vol.
1, pp. 360-363, 2000.
47. T. Gansler, J. Benesty, S. L. Gay, M. M. Sondhi, A Robust Proportionate Affine
Projection Algorithm for Network Echo Cancellation, Proc. of IEEE Intl. Conf. on
Acoustics, Speech and Signal Processing, Vol. 2, pp. 793-796, 2000.
48. M. Rupp, A Family of Adaptive Filter Algorithms with Decorrelating Properties, IEEE
Trans. on Signal Proc., Vol. 46, No. 3, March 1998.
49. S. Werner, J. A. Apolinario, Jr., M. L. R. de Campos, The Data-Selective Constrained
Affine-Projection Algorithm, Proc. of the Intl. Conference on Acoustics, Speech, and
Signal Processing, Vol. 6, pp. 3745-3748, 2001.
50. J. Benesty, P. Duhamel, Y. Grenier, A Multichannel Affine Projection Algorithm with
Applications to Multichannel Acoustic Echo Cancellation, IEEE Signal Processing
Letters, Vol. 3, No. 2, Feb. 1996.
51. S. G. Sankaran, A. A. Beex, Convergence Analysis Results for the Class of Affine
Projection Algorithms, Proc. of the Intl. Symposium on Circuits and Systems, Vol. 3, pp.
251-254, 1999.
52. S. G. Sankaran, A. A. Beex, Convergence Behavior of Affine Projection Algorithms,
IEEE Trans. on Signal Proc., Vol. 48, No. 4, April 2000.
53. N. J. Bershad, D. Linebarger, S. McLaughlin, A Stochastic Analysis of the Affine
Projection Algorithm for Gaussian Autoregressive Inputs, Proc. of IEEE Intl. Conf. on
Acoustics, Speech and Signal Processing, Vol. 6, pp. 3837-3840, 2001.
54. R. A. Soni, K. A. Gallivan, W. K. Jenkins, Affine Projection Methods in Fault Tolerant
Adaptive Filtering, Proc. of IEEE Intl. Conf. on Acoustics, Speech and Signal
Processing, Vol. 3, pp. 1685-1688, 1999.
55. J. Benesty, P. Duhamel, A Fast Exact Least Mean Square Adaptive Algorithm, Proc. of
IEEE Intl. Conf. on Acoustics, Speech and Signal Processing, 1990.
PROPORTIONATE
ADAPTATION: NEW
PARADIGMS IN ADAPTIVE
FILTERS
ZHE CHEN
Communications Research Lab, McMaster University, Hamilton, Ontario, Canada
STEVEN L. GAY
Bell Laboratories, Lucent Technologies, Murray Hill, New Jersey
and
SIMON HAYKIN
Communications Research Lab, McMaster University, Hamilton, Ontario, Canada
8.1 INTRODUCTION

8.1.1 Motivation
In 1960, two classic papers were published on adaptive filter theory. One concerns
Bernard Widrow's least-mean-square (LMS) filter in the signal processing area [33];
the other deals with the Kalman filter, named after Rudolph E. Kalman [23], in the
control area. Although the two are rooted in different backgrounds, they
quickly attracted worldwide attention and have survived the test of time for over
forty years [34, 17].
The design of adaptive, intelligent, robust, and fast-converging algorithms is
central to adaptive filter theory. Intelligent means that the learning algorithm is able
to incorporate some prior knowledge of the specific problem at hand. This chapter is
an effort aimed at this goal.
8.1.2
A new kind of normalized LMS (NLMS) algorithm, called proportionate
normalized least mean square (PNLMS) [10], has been developed at Bell Laboratories
for the purpose of echo cancellation. The novelty of the PNLMS algorithm lies in the
fact that an adaptive individual learning rate is assigned to each tap weight of the
filter according to some criterion, thereby attaining faster convergence [12, 10, 3].
Based on the PNLMS algorithm and its variants PNLMS++, SR-PNLMS, and
PAPA (see [3] for a complete introduction and nomenclature), the idea can be
extended to derive some new learning paradigms for adaptive filters, which we call
proportionate adaptation [17]. Proportionate adaptation means that learning the
sparseness of the solution from the incoming data is a key feature of the algorithm.
The merits of proportionate adaptation are twofold: first, the weight coefficients are
assigned different learning rates, which are adjusted adaptively in the learning
process; second, the learning rates are proportional to the magnitudes of the
coefficients.
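The two merits just listed can be seen in a small simulation. The following sketch is an assumed PNLMS-style update written from the description above (constants rho and delta_p guard inactive taps and startup; all names are ours, not the original implementation): taps with larger magnitudes receive proportionally larger gains.

```python
import numpy as np

# Sketch of a PNLMS-style update (assumed form, following the description
# above, not the original Bell Labs code): each tap gets an individual gain
# proportional to |w_k|, so large (active) taps adapt faster on a sparse path.
def pnlms_update(w, u, d, mu=0.5, rho=0.01, delta_p=0.01, eps=1e-8):
    e = d - w @ u                              # a priori error
    gamma = np.maximum(rho * max(delta_p, np.abs(w).max()), np.abs(w))
    g = gamma / gamma.mean()                   # normalized per-tap gains
    w_new = w + mu * g * u * e / (u @ (g * u) + eps)
    return w_new, e

rng = np.random.default_rng(3)
N = 64
h = np.zeros(N); h[5] = 1.0; h[20] = -0.5      # sparse echo path
w = np.zeros(N)
for _ in range(2000):
    u = rng.standard_normal(N)
    w, e = pnlms_update(w, u, d=float(h @ u))
```

In the noiseless, white-input setting above, the filter converges to the sparse path; the per-tap gains g concentrate on the two active taps early in adaptation.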
8.1.3
The chapter is organized as follows: Section 8.2 briefly describes the PNLMS
algorithm, some established theoretical results, sparse regularization, and physical
interpretation, as well as some newly proposed proportionate adaptation paradigms.
Section 8.3 examines the relationship between proportionate adaptation and Kalman
filtering with a time-varying learning rate matrix. In Section 8.4, some recursive
proportionate adaptation paradigms are developed based on Kalman filter theory and
the quasi-Newton method. Some applications and discussions are presented in
Sections 8.5 and 8.6, respectively, followed by concluding remarks in Section 8.7.
Notations
Throughout this chapter, only real-valued data are considered. We denote u(n) as the
N-dimensional (N-by-1) input vector, w(n) = [w_1(n), ..., w_k(n), ..., w_N(n)]^T as the
N-by-1 tap-weight vector of the filter, w_o as the desired (optimal) weight vector, y(n)
as the estimated response, and d(n) as the desired response, which can be represented
by d(n) = u^T(n)w_o + eps(n) and d(n) = y(n) + e(n), where e(n) is the prediction
(innovation) error and eps(n) is the unknown noise disturbance.1 The superscript T
denotes the transpose of a matrix or vector, tr(.) denotes the trace of a matrix,
m (m(n)) is a constant (time-varying) learning rate scalar, m(n) is a time-varying
positive-definite learning rate matrix, and I is an identity matrix. The other notations
will be given wherever necessary. A full list of notations is given at the end of the
chapter (Appendix G).

1 Note that our assumption of real-valued data in this chapter can be readily extended to the complex
case with little effort.
8.2.2 PNLMS Algorithm

In PNLMS, each tap weight is assigned its own gain g_k(n), obtained by normalizing
a set of proportionate factors gamma_k(n):

g_k(n) = gamma_k(n) / ((1/N) SUM_{l=1}^{N} gamma_l(n)),    (8.1)

where

gamma_k(n) = max{rho max{delta, |w_1(n-1)|, ..., |w_N(n-1)|}, |w_k(n-1)|}.    (8.2)

Collecting the gains in the diagonal matrix G(n) = diag{g_1(n), ..., g_N(n)}, the
PNLMS weight update is

w(n) = w(n-1) + (m G(n) / (u^T(n)G(n)u(n) + a)) u(n)e(n)
     = w(n-1) + m(n)u(n)e(n),    (8.5)

where the time-varying learning rate matrix is

m(n) = m G(n) / (u^T(n)G(n)u(n) + a).    (8.4)
Proposition 1 The PNLMS algorithm is the a posteriori form of Kalman
smoothing.

Proof: The PNLMS update may be written as

w(n) = w(n-1) + (m G(n) / (u^T(n)G(n)u(n) + a)) u(n)[d(n) - u^T(n)w(n-1)]    (8.7)

     = w(n-1) + m G(n)u(n)[d(n) - u^T(n)w(n)] / (a + (1 - m)u^T(n)G(n)u(n)),    (8.8)

where the second step in Eq. (8.8) uses the matrix inverse lemma.2 We may thus
write

w(n) = w(n-1) + K(n)u(n)[d(n) - u^T(n)w(n-1)],

where

K(n) = m(n) - m(n)u(n)[m(n)u(n)]^T / (1 + u^T(n)m(n)u(n)).    (8.9)
Proposition 2 Suppose that the vectors m^{1/2}(n)u(n) are exciting and that
0 < u^T(n)m(n)u(n) < 1 for all n (a condition governed by the largest
eigenvalue of the positive-definite learning rate matrix m(n)). Then the PNLMS
algorithm is H-infinity optimal.

Proof: The proof is given in Appendix B. See also Appendix A for some
preliminary background on the H-infinity norm and H-infinity filtering.

Proposition 3 The H-infinity optimality of Proposition 2 extends to the PNLMS++
algorithm, which alternates PNLMS and NLMS updates.

Proof: The proof is similar to that of Proposition 2. The essence of the proof is to
distinguish two components of the weight update equation and treat them
differently, one for PNLMS and the other for NLMS.
Proposition 4 The a posteriori form of PNLMS is risk-sensitive optimal.

Proof: The risk-sensitive criterion is the exponential quadratic cost

z^(n) = arg min log E[exp(||eps||_2^2 / 2)].    (8.11)

2 Given the matrix A and vector B, (A + BB^T)^{-1} = A^{-1} - A^{-1}B(I + B^T A^{-1}B)^{-1}B^T A^{-1}.
In the following, we show that the motivation of PNLMS is closely related to prior
knowledge of the weight parameters in the context of Bayesian theory. Observing the
entries of the matrix G(n), which distinguishes PNLMS from NLMS, the elements
are proportional to the L1 norm of the weighted weight vector w (for the purpose of
regularization). To simplify the analysis, we assume that a = 0 and that the inputs
u(n) are not all zero simultaneously. The diagonal elements of the matrix G(n) are of
the form (neglecting the time index)

g_k = L1{rho |w_1|, ..., rho |w_N|, |w_k|}.    (8.12)

Note that the product term rho delta in Eq. (8.2) is purposely neglected in Eq. (8.12)
for ease of analysis. Since rho < 1, at the initialization stage it is expected that g_k is
proportional to the absolute value of w_k, namely, g_k ~ |w_k|.
From the Bayesian perspective, suppose that we know a priori the probability
density function of w as p(w) (e.g., w is sparsely distributed). Hence we may
attribute the value of g_k by the negative logarithmic probability:4

g_k = -ln p(w_k).    (8.13)

3 The conditional joint probability of the optimal weight w_o(n) and eps(n) given the observations is
p(w_o(n), eps(0), ..., eps(n) | d(0), ..., d(n)) ~ exp(-(1/2) SUM_n |d(n) - z^(n)|^2
- (w_o(n) - w_0)^T m^{-1}(n)(w_o(n) - w_0)).
4 The diagonal matrix implies that the individual components are stochastically independent.
Prior           p(w)           g_k
Uniform         constant       1
Gaussian        exp(-w^2)      w_k^2
Laplacian       exp(-|w|)      |w_k|
Cauchy          1/(1 + w^2)    ln(1 + w_k^2)
Supergaussian   1/cosh(w)      ln cosh(w_k)
Thus the PNLMS weight update can be viewed as a weighted instantaneous gradient
descent,

w(n) = w(n-1) - m G(n) grad^_w E(w(n-1)),    (8.14)

where grad^ denotes the instantaneous gradient, which approximates the true gradient
operator in a limiting case. Equation (8.14) states that different tap-weight elements
are assigned different scaling coefficients along their search directions in the
parameter space.
On the other hand, the weighted instantaneous gradient is related to regularization
theory. It is well established that Bayesian theory can handle prior knowledge of
unknown parameters to implement regularization [5]. The advantage of this
sparse regularization lies in its direct efficiency, since the constraint of the weight
distribution is imposed on the gradient descent instead of as an extra complexity term
in the loss function.
8.2.5 Physical Interpretation

8.2.5.1 Langevin Equation  Studying the LMS filter in the context of the
Langevin equation was first established in [17] (chap. 5). We can also use the
Langevin equation to analyze the PNLMS algorithm.
As developed in [17], the number of natural modes constituting the transient
response of the LMS filter is equal to the number of adjustable parameters in the
filter. Similarly, for the PNLMS algorithm with multiple step-size parameters, the kth
(k = 1, 2, ..., N) natural mode of the PNLMS filter is given as

y_k(n+1) = [1 - m_k(n) l_k] y_k(n) + f_k(n),    (8.15)

5 From a computation point of view, this is an efficient choice among all of the non-Gaussian priors.
where l_k is the eigenvalue of the diagonalized correlation matrix of the input and
f_k(n) is a driving force accounting for an unknown stochastic disturbance. From Eq.
(8.15), it follows that

Delta y_k(n) = y_k(n+1) - y_k(n) = -m_k(n) l_k y_k(n) + f_k(n),    (8.16)

which is a discrete-time counterpart of the first-order (massless) Langevin equation

xi dw/dt = -grad^_w E(w) + f(t).    (8.17)

Comparing this with the equation of motion of a particle of mass m in a medium
with friction coefficient xi,

m d^2 w/dt^2 + xi dw/dt = -grad^_w E(w),    (8.18)

it is obvious that Eq. (8.17) is a special case of Eq. (8.18) for a massless particle. By
discretizing Eq. (8.18) (i.e., dt -> Delta t, dw -> w(t + Delta t) - w(t)) and assuming
Delta t = 1 for simplicity, after some rearrangement we obtain the following
difference equation [29]:

w(n+1) = w(n) - (1/(m + xi)) grad^_w E(w(n)) + (m/(m + xi)) [w(n) - w(n-1)].

By comparison, we obtain the learning rate mu = 1/(m + xi) and momentum
eta = m/(m + xi); the momentum is zero when the mass of the particle is m = 0 [29].
In light of the above analogy, the PNLMS algorithm offers a nice physical
interpretation. Physically speaking, the convergence of the adaptive algorithm is
achieved when the potential energy E(w(n)) -> 0 and the velocity of the particle
approaches zero (i.e., dw/dt -> 0; the weights will not be adjusted). The diagonal-
element learning rate m_k is related to the friction coefficient xi_k in a medium along
a particular direction, which can be nonuniform but isotropic, or nonuniform and
anisotropic. xi_k can be proportional to its distance from the starting origin (hence
|w_k|) or to the velocity of the particle (hence |Delta w_k|). The PNLMS algorithm
belongs to the first case. Intuitively, one may put more energy (a bigger step size)
into the particle along the direction which has a bigger friction coefficient.
8.2.6

The proportionate idea can also be realized by adapting the individual learning rates
directly via a multiplicative (exponentiated) gradient rule:

m(n+1) = m(n) exp(-m- grad^_m E|_{m = m(t)}),    (8.19)

where m- is a meta-learning-rate parameter. In the spirit of PNLMS, the learning
rates are kept proportionate to the weight magnitudes:

m_k(n) = |w_k(n)| / phi_w ~ |w_k(n)|,    (8.21)

where phi_w is a proper normalizing factor. From Eq. (8.21), it follows by virtue of
the derivative chain rule that

grad E_m(m(t, n)) = phi_w sgn(w_k(n)) grad E_w(w_k(n)).    (8.22)

Using the instantaneous gradient of the normalized error cost,

grad^ E_w(w_k(n)) = -u_k(n)e(n) / (||u(n)||^2 + a),    (8.23)

we may rewrite the resulting per-coefficient update, Eq. (8.24), in the matrix form

m(n+1) = m(n) exp(m- u(n) Phi_w sgn(w(n)) e(n)),    (8.25)

where Phi_w collects the normalizing factors and the exponential acts elementwise;
we refer to this as the PANLMS algorithm.
There is no need to worry about the situation where all coefficients are zero;
hence the regularization parameters rho and delta in PNLMS are avoided.
The multiplicative update of the diagonal learning rate matrix m(n) in (8.25)
can also be substituted for by an additive form:

m(n+1) = m(n) + diag{m- u(n) Phi_w sgn(w(n)) e(n)},

which we refer to as the PANLMS-II algorithm, in contrast to the previous
form.
Since, in the limit of convergence, lim_{n->inf} E(n) = 0, the m_k(n) should also be
asymptotically zero in the limit (see the proof in Appendix C), whereas in PNLMS
the m_k(n) are not all guaranteed to decrease to zero when convergence is achieved
as n -> inf; hence the convergence process will be somewhat unstable. To alleviate
this problem, the adaptation of PNLMS and PANLMS may employ a learning rate
annealing schedule [18] by multiplying a term kappa(n), e.g., kappa(n) =
1/(n/tau + 1), which satisfies lim_{n->inf} kappa(n) -> 0.
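The annealing factor just suggested is simple to inspect. The sketch below uses the quoted schedule kappa(n) = 1/(n/tau + 1); the time constant tau is our own choice for illustration.

```python
# The learning-rate annealing factor suggested above, kappa(n) = 1/(n/tau + 1),
# which decays monotonically toward zero as n grows. The time constant tau
# here is our own illustrative choice.
def kappa(n, tau=100.0):
    return 1.0 / (n / tau + 1.0)

vals = [kappa(n) for n in (0, 100, 1000)]   # kappa(0) = 1, kappa(tau) = 1/2
```

Multiplying the learning rates by this factor forces them to zero in the limit, addressing the instability noted above.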
8.2.7

The same construction can be applied with other error nonlinearities, e.g.,

m(n+1) = m(n) exp(m- u(n) Phi_w sgn(w(n)) e(n)),    (8.26)

where the diagonal learning rate matrix is updated similarly to the PANLMS (or
PANLMS-II) algorithm.
8.3 PROPORTIONATE ADAPTATION AND KALMAN FILTERING
Consider the state-space (random-walk) model

w(n+1) = w(n) + q_1(n),    (8.28)

d(n) = u^T(n)w(n) + q_2(n),    (8.29)

where q_1(n) and q_2(n) are the process and measurement noises with covariances
Q_1(n) and Q_2(n), respectively. The Kalman filter recursions are7

P(n+1) = P(n) + Q_1(n) - K(n+1)u^T(n+1)[P(n) + Q_1(n)],    (8.31)

K(n+1) = [P(n) + Q_1(n)]u(n+1) / (Q_2(n+1) + u^T(n+1)[P(n) + Q_1(n)]u(n+1)),    (8.32)

w(n+1) = w(n) + K(n+1)[d(n+1) - u^T(n+1)w(n)].    (8.33)

Identifying the Kalman gain with a learning rate matrix applied to u(n+1) gives

m(n+1) = [P(n) + Q_1(n)] / (Q_2(n+1) + u^T(n+1)[P(n) + Q_1(n)]u(n+1)).    (8.34)

That is, updating the learning rate matrix by gradient descent is equivalent to
updating the Kalman gain in the Kalman filter, which is dependent on the
covariances of the state error and the process noise [8].
At this point in the discussion, several remarks are in order:
As indicated in Proposition 1, the PNLMS algorithm is actually the a posteriori
form of Kalman smoothing, which is consistent with the result presented here.
As observed in Eq. (8.33), when the covariances P(n) and Q_1(n) increase,
the update of w(n) also increases; in stochastic gradient descent
algorithms, an increase of the learning rate likewise increases the update.
7 If the process noise is assumed to be zero, the term P(n) + Q_1(n) in Eqs. (8.31) and (8.32) reduces to
P(n).
8.4
8.4.1
In contrast to off-line learning, on-line learning offers a way to optimize the expected
risk directly, whereas batch learning optimizes the empirical risk given a finite
sample drawn from a known or unknown probability distribution [4]. The
estimated parameter set {w(n)} is a Markovian process in the on-line learning
framework. Proving the convergence of an on-line learning algorithm toward a
minimum of the expected risk provides an alternative to the proofs of consistency of
the learning algorithm in off-line learning [4].
8.4.2
m(n) = m(n-1) - [m(n-1)u(n)][m(n-1)u(n)]^T / (1 + u^T(n)m(n-1)u(n)),    (8.35)

w(n) = w(n-1) + m(n)u(n)e(n),    (8.36)

where m(n)u(n) in Eq. (8.36) plays the role of the Kalman gain. The form of Eq.
(8.35) is similar to that of Eq. (8.9) and Eq. (8.34). The learning rate matrix m(0) can
be initialized to be an identity matrix or some other form according to prior
knowledge of the correlation matrix of the input.
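The recursion (8.35)-(8.36) is easy to exercise numerically. The sketch below is our own toy illustration (names and sizes are assumptions): with m(0) = I, (8.35) is exactly the Sherman-Morrison update of the inverse of I plus the accumulated input correlation, which the code checks directly.

```python
import numpy as np

# Illustrative sketch of the RPNLMS-style recursion (8.35)-(8.36): the
# learning rate matrix mu is updated by a rank-one correction and mu(n)u(n)
# plays the role of an a priori Kalman gain. With mu(0) = I this equals the
# inverse of R(n) = I + sum u u^T, which we verify against direct
# accumulation. The toy system and sizes are our own.
rng = np.random.default_rng(4)
N, T = 8, 400
h = rng.standard_normal(N)                # unknown weight vector
mu = np.eye(N)                            # mu(0) = I
w = np.zeros(N)
R = np.eye(N)                             # direct accumulation, for checking
for _ in range(T):
    u = rng.standard_normal(N)
    e = float((h - w) @ u)                # a priori error e(n)
    mu_u = mu @ u
    mu = mu - np.outer(mu_u, mu_u) / (1.0 + u @ mu_u)   # (8.35)
    w = w + (mu @ u) * e                  # (8.36)
    R = R + np.outer(u, u)
```

The rank-one form keeps the per-sample cost at O(N^2) instead of the O(N^3) of an explicit matrix inverse, which is the practical point of the recursion.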
8.4.2.1 MSE and H-infinity Optimality  An optimal filter is one that is best in a
certain sense [1]. For instance, (1) the LMS filter is asymptotically optimal in the
mean-squared-error (MSE) sense under the assumption that the input components are
statistically independent and an appropriate step-size parameter is chosen; (2) the
LMS filter is H-infinity optimal in the sense that it minimizes the maximum energy
gain from the disturbances to the predicted (a priori) error; and (3) the Kalman filter
is optimal in the sense that it provides a solution that minimizes the instantaneous
MSE for the linear filter under the Gaussian assumption; under the Gaussian
assumption, it is also a maximum-likelihood (ML) estimator. For the RPNLMS
filter,8,9 we have the following:

Proof: See Appendix D.

8 It should be noted that RPNLMS is actually a misnomer, since it doesn't really update proportionately;
it is so called because the form of the updating learning rate matrix is similar to Eq. (8.9). Actually, it can
be viewed as a filter with an a priori Kalman gain.
9 In [21], a computationally efficient calculation scheme for gain matrices was proposed which allows fast
and recursive estimation of m(n)u(n), namely, the a priori Kalman gain.
Proposition 7 Suppose that the vectors m^{1/2}(n)u(n) are exciting and
0 < u^T(n)m(n)u(n) < 1. Given the proper initialization m(0) = I, the RPNLMS
algorithm is H-infinity optimal in the sense that it is among the family of minimax
filters.

Proof: A sketch of the proof is given as follows. At time index n = 1,

m(1) = I - u(1)u^T(1) / (1 + ||u(1)||^2)

and

u^T(1)m(1)u(1) = ||u(1)||^2 - u^T(1)u(1)u^T(1)u(1) / (1 + ||u(1)||^2)
              = ||u(1)||^2 / (1 + ||u(1)||^2) < 1.

Generally, we have

u^T(n)m(n)u(n) <= ||u(n)||^2 / (1 + ||u(n)||^2) < 1.

Thus the condition in Proposition 7 is always satisfied. It is easy to check that the
exciting condition also holds. The rest of the procedure to prove the H-infinity
optimality of RPNLMS is similar to that of PNLMS and is omitted here.
8.4.2.2 Comparison of the RPNLMS and RLS Filters  It is interesting to
compare the RPNLMS and RLS filters, since they have many common features in
adaptation.10 In particular, in a state-space formulation, the RLS filter is described
by [22]

w(n) = w(n-1) + (P(n-1)u(n) / (1 + u^T(n)P(n-1)u(n))) [d(n) - u^T(n)w(n-1)],    (8.37)

P(n) = P(n-1) - P(n-1)u(n)u^T(n)P(n-1) / (1 + u^T(n)P(n-1)u(n)).    (8.38)

10 An equivalence discussion between the RLS filter and the Kalman filter is given in [17, 22].
8.4.3

We may also apply the proportionate adaptation principle to the affine projection
filter (Chapter 7, this volume), resulting in a new proportionate affine projection
adaptation (PAPA) paradigm:12

m(n) = m(n-1) - m(n-1)U(n)[m(n-1)U(n)]^T / (m + tr{U^T(n)m(n-1)U(n)}),    (8.39)

w(n) = w(n-1) + m(n)U(n)e(n),    (8.40)

where U(n) = [u(n), ..., u(n-m+1)] collects the current and m-1 past input vectors.
The learning rate update may also be written as

m(n+1) = m(n) - [(1/m) SUM_{t=0}^{m-1} m(n)u(n-t)[m(n)u(n-t)]^T]
               / [(1/m) SUM_{t=0}^{m-1} (1 + u^T(n-t)m(n)u(n-t))]    (8.41)

       = m(n) - (1/m) m(n)U(n)[m(n)U(n)]^T / (1 + (1/m) tr{U^T(n)m(n)U(n)}),

which is actually an averaged version of Eq. (8.35), given the current and m-1 past
observations.

11 This analogy can be understood by comparing the LMS and NLMS algorithms with the learning rate
scalar.
12 In the original PAPA algorithm [3], m(n) = m G(n) / (u^T(n)G(n)u(n) + a), where G(n) is defined in
the same way as in the PNLMS algorithm.
Replacing the input vector by a general gradient vector g(n) leads to a quasi-Newton
form:

m(n) = m(n-1) - [m(n-1)g(n)][m(n-1)g(n)]^T / (1 + g^T(n)m(n-1)g(n)),    (8.42)

w(n) = w(n-1) + m(n)g(n)e(n).    (8.43)

A sign-sign variant is

w(n) = w(n-1) + m(n) sgn(u(n)) sgn(e(n)).    (8.46)
8.5 APPLICATIONS

8.5.1 Adaptive Equalization
The rst computer experiment is taken from [17] on adaptive equalization. The
purpose of this toy problem is to verify the fast convergence of our proposed
proportionate adaptation paradigms compared to the other stochastic gradient
algorithms, including PNLMS and PNLMS. The equalizer has N 11 taps, and
the impulse response of channel is described by the raised cosine
8
< 1 1 cos 2p n 2 ; n 1; 2; 3
W
hn 2
:
0;
otherwise;
where W controls the amount of amplitude distortion produced by the channel (and also the eigenvalue spread of the correlation matrix of the tap inputs). In our experiments, W = 3.1 and a signal-to-noise ratio (SNR) of 30 dB are used. For comparison, various learning rate parameters were chosen for all of the algorithms, but only the best results are reported here. The experimental curve was obtained by ensemble-averaging the squared value of the prediction error over 100 independent trials, as shown in Figure 8.1.
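This experiment can be approximated with a short NumPy sketch. The BPSK source, the plain NLMS equalizer, the decision delay of 7, and the 20-trial ensemble average below are illustrative assumptions rather than the exact setup of [17]:

```python
import numpy as np

def raised_cosine_channel(W):
    """Raised-cosine channel taps h(1..3); W sets the amplitude distortion."""
    n = np.arange(1, 4)
    return 0.5 * (1.0 + np.cos(2.0 * np.pi * (n - 2) / W))

def equalizer_trial(W=3.1, N=11, mu=0.5, snr_db=30, T=600, delay=7, seed=0):
    """One NLMS equalization run; returns the squared-error curve."""
    rng = np.random.default_rng(seed)
    h = raised_cosine_channel(W)
    s = rng.choice([-1.0, 1.0], size=T)              # BPSK training symbols
    x = np.convolve(s, h)[:T]                        # channel output
    x += rng.standard_normal(T) * 10 ** (-snr_db / 20)
    w = np.zeros(N)
    e2 = np.zeros(T - N)
    for n in range(N, T):
        u = x[n - N:n][::-1]                         # equalizer tap inputs
        e = s[n - delay] - u @ w                     # error vs delayed symbol
        w += mu * e * u / (u @ u + 1e-6)             # NLMS update
        e2[n - N] = e * e
    return e2

# ensemble-average the squared error over independent trials
curve = np.mean([equalizer_trial(seed=k) for k in range(20)], axis=0)
print(curve[:20].mean() > curve[-100:].mean())       # error decays as w converges
```

Swapping the NLMS update for a proportionate one (as in the sketch shown earlier for PNLMS) reproduces the kind of convergence comparison reported in Figure 8.1.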
8.5.2 Decision-Feedback Equalization

Figure 8.2 A schematic diagram of DFE. (a) Channel model; (b) training phase: the dashed box of the soft decision is nonexistent for the linear equalizer; (c) testing phase: the hard decision is always used for DFEs trained with both linear and nonlinear algorithms.
TABLE 8.2 Properties of Channels B, C, D, G, and S.

q(x(n)) = {  1,     if x(n) > 1
          { -1,     if x(n) < -1
          {  x(n),  if |x(n)| <= 1.     (8.48)
β(n) = β(n-1) + μ ψ(n) e(n),     (8.49)

where

ψ(n) = (1/2) u^T(n)w(n) [1 - tanh^2(β(n-1) u^T(n)w(n))]
and μ is a predefined small real-valued step-size parameter (in our experiments μ = 0.05). The input u(n) here is an augmented vector u(n) = [u(n), ..., u(n - N1 + 1), y(n-1), ..., y(n - N2)]^T (N = N1 + N2), where N1 and N2 are the lengths of the tap weights of the feedforward and feedback filters, respectively. The numbers of tap weights of the feedforward and feedback filters, as well as the decision delays, are summarized in Table 8.3. The learning (convergence) curves are averaged over 100 independent trials using 1000 training symbols at an SNR of 14 dB. The convergence results for the time-invariant linear Channels B and C using different algorithms are shown in Figures 8.3a and 8.4a, respectively. In order to observe the evolution of the slope parameter β, the trajectories of β(n) are plotted in Figure 8.3b for all of the channels. As shown, without exception, they increase as convergence is approached.
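The monotone growth of the slope parameter can be reproduced with a toy sketch of the update (8.49). The three-tap setup and the assumption of an already-converged weight vector are illustrative; only the update rule follows the text:

```python
import numpy as np

def slope_update(beta, w, u, target, mu_beta=0.05):
    """Adapt the soft-decision slope beta by the stochastic gradient rule of
    Eq. (8.49); psi(n) includes the 1/2 factor given in the text."""
    x = u @ w
    y = np.tanh(beta * x)                               # soft decision output
    e = target - y
    psi = 0.5 * x * (1.0 - np.tanh(beta * x) ** 2)
    return beta + mu_beta * psi * e

# with correct weights, beta grows so tanh(beta * x) hardens toward +/-1 symbols
rng = np.random.default_rng(2)
w = np.array([1.0, 0.0, 0.0])        # assume an already-trained equalizer
beta = 1.0
for _ in range(3000):
    s = rng.choice([-1.0, 1.0])      # transmitted symbol
    u = np.array([s, rng.standard_normal(), rng.standard_normal()])
    beta = slope_update(beta, w, u, s)
print(beta > 1.0)   # -> True
```

Since the residual error e(n) has the same sign as ψ(n) once the weights are correct, β(n) increases monotonically, matching the trajectories in Figure 8.3b.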
The convergence results for the time-invariant nonlinear Channel S are shown in Figure 8.4b.

TABLE 8.3 Numbers of input (feedforward) taps, feedback taps, tap weights, and decision delays of the DFEs for each channel.

Figure 8.3 (a) Learning curves of DFE for Channel B; (b) time evolution of the slope parameter β for Channels B, C, D, G, and S.

Figure 8.5 gives the BER curves of Channels B, D, S, and G, with
different equalizers. The BER is calculated over 10,000 test data, averaged over 100 independent trials after training on 100-symbol data sequences, with the SNR varying from 4 to 14 dB. For the nonlinear channels, the NRPNLMS-DFE13 with minimal parameters outperforms all of the linear equalizers and even performs better than many neural equalizers, including the decision-feedback recurrent neural equalizer (DFRNE), the decision-feedback Elman network, and the decision-feedback recurrent multilayer perceptron (RMLP), with lower BER, much less algorithmic complexity and CPU time, and a much lower memory requirement.
The NRPNLMS algorithm can also be used for time-variant channels, which can be modeled by varying the coefficients of the impulse response h(n). In particular,

13 In the training phase, the NRPNLMS-DFE can be regarded as an RPNLMS-DFE passing a zero-mean soft nonlinearity of the hyperbolic tangent type.
Figure 8.4 (a) Learning curves of DFE for Channel C; (b) learning curves of DFE for
Channel S.
H(z) = Σ_{i=0}^{N-1} a_i(n) z^{-i}.     (8.50)
The coefficients are functions of time, and they are modeled as zero-mean Gaussian random processes with user-defined variance. The time-variant coefficients a_i(n) are generated by using a second-order Markov model in which white Gaussian noise (zero mean, variance σ^2) drives a second-order Butterworth lowpass filter (LPF). In MATLAB14 language, this can be written using the functions butter and filter as follows:

[B,A]=butter(2,fs/Fs);
Ai=ai+filter(B,A,sigma*randn(1,1000));

where B and A are the numerator and denominator of the LPF, respectively; fs/Fs is the normalized cutoff frequency, with fs being the fading rate (the smaller fs is, the slower the fading) and Fs being the sampling rate; ai is the fixed coefficient, and Ai is the corresponding time-varying 1000-length vector of values of a_i at different moments. The choice of fs in our experiments is 0.1 to 0.5 Hz (0.1 corresponds to slow fading, whereas 0.5 corresponds to fast fading); a typical choice of Fs in our experiments is 2400 bits/s.
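A Python counterpart of the MATLAB snippet, using SciPy's `butter` and `lfilter` (the specific ai, sigma, and fs values below are illustrative):

```python
import numpy as np
from scipy import signal

def time_varying_coeff(ai, fs, Fs, sigma, T=1000, seed=0):
    """Python counterpart of the MATLAB snippet: a_i(n) = ai + LPF(white noise),
    where the 2nd-order Butterworth LPF has normalized cutoff fs/Fs."""
    rng = np.random.default_rng(seed)
    B, A = signal.butter(2, fs / Fs)          # [B,A] = butter(2, fs/Fs)
    return ai + signal.lfilter(B, A, sigma * rng.standard_normal(T))

# slow fading: fs = 0.5, Fs = 2400, fixed coefficient ai = 0.5
Ai = time_varying_coeff(ai=0.5, fs=0.5, Fs=2400.0, sigma=0.01)
print(Ai.shape, abs(Ai.mean() - 0.5) < 0.1)
```

Both `butter` implementations normalize the cutoff the same way (relative to the Nyquist frequency), so the two snippets generate statistically equivalent coefficient trajectories.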
Only the NRPNLMS algorithm with an adaptive slope parameter is investigated here. A three-tap forward filter, a two-tap feedback filter (i.e., N = 5 in total), and
Figure 8.5 BER curves of Channels B, D, S, and G with different equalizers.
Figure 8.6 Left: convergence curves of time-variant slow-fading and fast-fading channels using NRPNLMS with an adaptive slope parameter. Right: BER of time-variant slow-fading and fast-fading channels.
a decision delay of two samples are used in the experiments. The results of convergence and BER are shown in Figure 8.6. More experiments on time-variant multipath channels, including wireless channels, will be reported elsewhere.
8.5.3 Echo Cancellation

In telecommunications, echoes are generated electrically due to impedance mismatches at points along the transmission medium and are thus called line or network echoes [3]. In particular, the echoes become objectionable because of the delay, especially in long-distance connections. To alleviate this problem and improve conversation quality, the first echo canceler using the LMS algorithm was developed at Bell Labs in the 1960s [30]. Nowadays in the echo cancellation industry, the NLMS filter is still popular due to its simplicity. Recently, there has been some progress in the echo cancellation area, where the idea of proportionate adaptation (originally the PNLMS algorithm) originated (see [3, 13] for an overview).
First, a simple network echo cancellation problem is studied. A schematic diagram of echo cancellation with a double-talk detector (DTD) is shown in Figure 8.7. In the experiments, the far-end speech (i.e., the input excitation signal) is 16-bit PCM coded and lies in the range [-32768, 32767]; the sampling rate is 8 kHz. The normalized measured echo path impulse response is shown in Figure 8.8b, which can be viewed as a noisy version of the real impulse response. White Gaussian noise with an SNR of 30 dB is added to the near-end speech. The length of the tap weight vector (i.e., the impulse response) is N = 200. A variety of recursive adaptive filter algorithms of interest are investigated, including NLMS, PNLMS, PNLMS++ (double update), PANLMS, PANLMS-II, and RPNLMS. The parameters of the PNLMS and PNLMS++ algorithms are chosen as δ = 0.01, ρ = 5/N = 0.025, α = 0.001. The learning rate scalar μ is 0.2 for NLMS and PNLMS and 0.8 for PNLMS++; for the PANLMS and PANLMS-II algorithms, μ = 0.1 and μ(0) = I. The initial tap weights are set
Figure 8.7 A schematic diagram of echo cancellation with a double-talk detector (DTD).

to zero. The misalignment is defined as

||w_o - w(n)||^2 / ||w_o||^2.
The misalignment curves are shown in Figure 8.8c. As observed, the performance of the proposed PANLMS and PANLMS-II algorithms is almost identical, and both are better than NLMS, PNLMS, and PNLMS++. Among the algorithms tested, RPNLMS achieves the best performance, though at the cost of increased computational complexity and memory requirements, especially when N is large.
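The misalignment measure used in these curves is straightforward to compute; a minimal sketch (the five-tap example vector is illustrative):

```python
import numpy as np

def misalignment_db(w_true, w):
    """Normalized misalignment ||w_o - w(n)||^2 / ||w_o||^2, in dB."""
    return 10.0 * np.log10(np.sum((w_true - w) ** 2) / np.sum(w_true ** 2))

w_o = np.array([0.0, 1.0, -0.5, 0.0, 0.1])
print(misalignment_db(w_o, np.zeros(5)))        # 0 dB before adaptation starts
print(misalignment_db(w_o, 0.99 * w_o) < -30)   # -> True once nearly converged
```

Because the measure is normalized by the echo-path energy, curves for different echo paths (as in the tracking experiment below) remain directly comparable.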
Figure 8.8 (a) Far-end speech; (b) normalized measured echo path impulse response; (c)
misalignment.
Figure 8.9 (a) Impulse responses of the two echo paths; (b) misalignment curves.
Second, we consider the echo-path change situation in order to study the tracking performance of the proposed algorithms. Figure 8.9a illustrates the different impulse responses of two echo paths. In the first 4 s, the first echo path is used; after 4 s, the echo path is changed abruptly. The misalignment curves of the proposed algorithms are shown in Figure 8.9b. As shown, the newly developed proportionate adaptation algorithms also exhibit very good tracking performance. It should be noted that, compared to the PANLMS and PANLMS-II algorithms, the tracking performance (for the second echo path) of the RPNLMS algorithm is worse due to the time-decreasing nature of μ(n). Hence, it is favorable to reinitialize the learning rate matrix once a change in the echo path is detected.
We also consider the double-talk situation.15 The design of an efficient DTD is essential in network echo cancellation. Although many advanced DTD algorithms (e.g., cross-correlation or coherence methods) exist, a simple DTD called the Geigel algorithm [3, 13], with threshold T = 2, is used in the experiment. In addition, in order to handle the divergence problem due to the presence of a near-end speech signal, some robust variants of the proportionate adaptation paradigms, based on robust statistics [20], were developed [3, 11]. For clarity of illustration, only the results of the robust PNLMS, robust PNLMS++, and robust PANLMS-II algorithms are shown here.16 In particular, the robust PANLMS-II algorithm is described as
w(n) = w(n-1) + [μ(n)u(n) / (||u(n)||^2 + α)] ψ(|e(n)|/s(n)) sgn[e(n)] s(n),     (8.51)

ψ(|e(n)|/s(n)) = min(|e(n)|/s(n), k_0),     (8.52)

s(n+1) = λ_s s(n) + [(1 - λ_s)/β] ψ(|e(n)|/s(n)) s(n).     (8.53)

15 Namely, it happens when the far-end and near-end speakers speak simultaneously.
16 A detailed study and investigation are given in [6].
Figure 8.10 (a) Far-end speech; (b) near-end speech; (c) the rectangles indicate where the
double-talk is detected; (d) misalignment curves.
8.6 DISCUSSION

8.6.1 Complexity

8.6.2 Robustness

17 There exists a trade-off between computational complexity and memory requirements. Here we sacrifice memory by storing intermediate results to reduce the computation cost.
18 As mentioned before, a fast calculation scheme [21] with linear complexity (see Appendix E) allows RPNLMS to be implemented more efficiently.
TABLE: Comparison of the LMS, NLMS, PNLMS, PNLMS++, PANLMS, PANLMS-II, RPNLMS, SR-RPNLMS, EG, PAEG, RLS, and Kalman filters in terms of H-infinity optimality, robustness, computational complexity, memory requirement, and convergence rate.

Note: Computational complexity is measured over one complete iteration. The order of computation is denoted in terms of the number of FLOPS: A + M + D + E + S, where A denotes addition, M denotes multiplication, D denotes division, E denotes exponentiation, and S denotes sorting. UB, upper bounded.
Convergence behavior and tracking are two important performance measures for adaptive filters. Convergence behavior is a transient phenomenon, whereas tracking is a steady-state phenomenon.
8.6.4 Loss Function

The squared loss function (L2 norm) is widely used in the adaptive filtering community due to its simplicity for optimization, though it is not necessarily the best or the only choice. A general error metric for adaptive filters was studied in [19]. Since the loss function is essentially related to the noise density in regularization theory [5], we may consider using different loss functions, especially in the context of stochastic robustness.
8.7 CONCLUDING REMARKS

APPENDIX A:
Definition 1 (The H-infinity Norm [15]) Let h2 denote the vector space of square-summable, real-valued causal sequences with inner product <{f(n)}, {g(n)}> = Σ_{n=0}^∞ f(n)g(n). Let T be a transfer operator that maps an input sequence {u(n)} to an output sequence {y(n)}. Then the H-infinity norm of T is defined as

||T||_∞ = sup_{u ≠ 0, u ∈ h2} ||y||_2 / ||u||_2,

where ||u||_2 denotes the h2 norm of the causal sequence {u(n)}. In other words, the H-infinity norm is the maximum energy gain from the input u to the output y.
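For an LTI operator, the maximum energy gain equals the peak magnitude of the frequency response, which gives a quick way to evaluate the H-infinity norm of an FIR transfer operator numerically (the two-tap example is illustrative):

```python
import numpy as np

def hinf_norm_fir(h, ngrid=4096):
    """For an LTI transfer operator, the H-infinity norm equals the peak of
    |H(e^{j*omega})|; evaluate it on a dense frequency grid for FIR taps h."""
    H = np.fft.rfft(h, n=ngrid)    # frequency response samples on [0, pi]
    return np.max(np.abs(H))

h = np.array([0.5, 0.5])           # two-tap averager
print(round(hinf_norm_fir(h), 4))  # peak gain is 1.0, attained at omega = 0
```

The adaptive filters considered in the appendix are time-varying, so the H-infinity norm there is defined directly through the energy-gain ratio rather than through a frequency response, but the LTI case is a useful sanity check on the definition.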
and

sup_{w_o, e ∈ h2} ||e_p||_2^2 / [μ^{-1} |w_o - w(0)|^2 + ||e*||_2^2],     (A.1)

inf_F sup_{w_o, e ∈ h2} ||e||_2^2 / [(w_o - w(0))^T μ^{-1} (w_o - w(0)) + ||e*||_2^2].     (A.2)
where g(n) = ∂f(w(n-1))/∂w_o. We then have the following suboptimal algorithm: if the g(n) are exciting in the sense that

lim_{N→∞} Σ_{n=0}^{N} g^T(n)μ(n)g(n) = ∞     (A.3)

and

0 < g^T(n)μ(n)g(n) < 1,  i.e.,  0 < μ(n) < [g(n)g^T(n)]^{-1},     (A.4)

then, expanding f about w(n-1) and neglecting the second-order term involving the Hessian ∂^2 f(w(n-1))/∂w_o^2, we obtain

f(w(n)) ≈ f(w(n-1)) + g^T(n)[w_o - w(n-1)].     (A.5)

Furthermore, we have the following proposition:

APPENDIX B:
First, note that μ(n) defined by Eq. (8.6) always satisfies condition (1) of Proposition 2, since μ(n) > 0, and we have

u^T(n)μ(n)u(n) = μ u^T(n)G(n)u(n) / [u^T(n)G(n)u(n) + α] < 1     (B.1)

by virtue of α > 0 and 0 < μ ≤ 1.
Second, we want to show that for any time step n, the H-infinity minimax problem formulated in Eq. (A.1) is always satisfied for the PNLMS algorithm. From Definition 3, for all w_o, w(0), and μ(n) and for all nonzero e ∈ h2, one should find an estimate w(n) such that

||e||_2^2 / [(w_o - w(0))^T μ^{-1}(n) (w_o - w(0)) + ||e*||_2^2] < γ^2.     (B.2)
Equivalently, one must show that for all w_o ≠ w(0), the estimate w(n) always guarantees J(n) > 0, where

J(n) = (w_o - w(0))^T μ^{-1}(n) (w_o - w(0)) + Σ_{n=0}^{∞} |d(n) - u^T(n)w_o|^2 - γ^{-2} Σ_{n=0}^{∞} |u^T(n)w_o - u^T(n)w(n-1)|^2.

Since J(n) is quadratic with respect to w_o,21 it has a minimum over w_o provided that the following Hessian matrix is positive definite, namely,

∂^2 J(n)/∂w_o^2 = μ^{-1}(n) + (1 - γ^{-2}) Σ_{n=0}^{∞} u(n)u^T(n) > 0.     (B.3)

Suppose γ < 1, so that 1 - γ^{-2} < 0. By the exciting condition,

Σ_{n=0}^{∞} |u_k(n)|^2 = ∞,     (B.4)

which implies that the kth diagonal entry of the Hessian matrix in Eq. (B.3) is negative:

μ_k^{-1}(n) + (1 - γ^{-2}) Σ_{n=0}^{∞} |u_k(n)|^2 < 0.     (B.5)

Hence, μ^{-1}(n) + (1 - γ^{-2}) Σ_{n=0}^{∞} u(n)u^T(n) cannot be positive definite, and Eq. (B.3) is violated. Therefore γ_opt ≥ 1. We now attempt to prove that γ_opt is indeed equal to 1. For this purpose, we consider the case of γ = 1. Equation (B.3) then reduces to μ(n) > 0, which is always true from the conditions of Proposition 2.
Now that we have guaranteed that for γ = 1 the quadratic form J(n) has a minimum over w_o, the next step is to show that the estimate given by the PNLMS algorithm at each time step n always guarantees J(n) > 0 for the same choice γ = 1.

21 Note that although μ(n) is time-varying and data-dependent, this does not invalidate the quadratic property of J(n) with respect to w_o.
For n = 1, J(1) > 0 follows directly from condition (2) of Proposition 2 (Eq. (B.6)); for n = 2, J(2) can be written as a quadratic form in [w_o - w(0); d(1) - u^T(1)w(0)] (Eq. (B.7)). Observing that the second matrix of the last equality in Eq. (B.7) is positive definite by virtue of condition (2) of Proposition 2, it follows that J(2) > 0. This argument can be continued to show that J(n) > 0 for all n ≥ 3, which then states that, if the conditions of Proposition 2 are satisfied, then γ_opt = 1 and the PNLMS algorithm achieves it. Hence, the H-infinity norm is

sup_{w_o, e ∈ h2} Σ_{n=0}^{∞} |e(n)|^2 / [(w_o - w(0))^T μ^{-1}(n) (w_o - w(0)) + Σ_{n=0}^{∞} |e*(n)|^2] = 1.     (B.8)
APPENDIX C: CONVERGENCE OF THE LEARNING RATE MATRIX

e_fi^2(n) = e_pi^2(n) (1 - Σ_{k=1}^{N} μ_k(n))^2.     (C.1)

Rearranging the terms and taking the limit of both sides of Eq. (C.1),

lim_{n→∞} e_fi^2(n)/e_pi^2(n) = lim_{n→∞} (1 - tr μ(n))^2.     (C.2)

In the limit of convergence, the left-hand side equals 1. Thus, the right-hand side should also be 1, and it follows that

lim_{n→∞} Σ_{k=1}^{N} μ_k(n) = 0.     (C.3)

Since the μ_k(n) are nonnegative, this implies

lim_{n→∞} μ_k(n) = 0,   k = 1, ..., N.     (C.4)

In the special case of a time-varying learning rate scalar, where μ(n) = μ(n)I, the above derivation still holds. For the PNLMS algorithm, however, we cannot generally ensure that lim_{n→∞} Σ_{k=1}^{N} μ g_k(n) = 0, by recalling Eq. (8.2) and Eq. (8.6), where lim_{n→∞} g_k(n) ≠ 0.
APPENDIX D:

The proof follows the idea presented in [32]. First, we want to prove that the RPNLMS algorithm is optimal in minimizing the cumulative quadratic instantaneous error Σ_n |e(n)|^2 / 2. Denote the optimal learning rate matrix by μ_o(n). In particular, we have the following form:

w(n) = w(n-1) + μ_o(n)u(n)[d(n) - u^T(n)w(n-1)],     (D.1)

and the optimal μ_o(n) is supposed to approximate the inverse of the Hessian [32]. In the linear case, the Hessian is approximately represented by H(n) = H(n-1) + u(n)u^T(n). According to the matrix inversion lemma, we have

H^{-1}(n) = H^{-1}(n-1) - H^{-1}(n-1)u(n)u^T(n)H^{-1}(n-1) / [1 + u^T(n)H^{-1}(n-1)u(n)].     (D.2)

Since E|w~(n)|^2 = tr E[w~(n)w~^T(n)], where w~(n) denotes the weight error vector, minimizing E|w~(n)|^2 is equivalent to minimizing the trace of the following matrix:

η(n) = σ^{-2} E[w~(n)w~^T(n)].     (D.3)
Substituting the weight update (D.1) into the definition of η(n) and writing c(n) = 1 + u^T(n)η(n-1)u(n) yields

η(n) = η(n-1) - η(n-1)u(n)[η(n-1)u(n)]^T / c(n)
       + c(n) [μ_o(n)u(n) - η(n-1)u(n)/c(n)] [μ_o(n)u(n) - η(n-1)u(n)/c(n)]^T,     (D.4)

whose trace,

tr η(n) = tr η(n-1) - ||η(n-1)u(n)||^2 / c(n) + c(n) ||μ_o(n)u(n) - η(n-1)u(n)/c(n)||^2,     (D.5)

is minimized by choosing μ_o(n)u(n) = η(n-1)u(n)/c(n); in particular,

μ_o(n) = η(n-1) / [1 + u^T(n)η(n-1)u(n)],     (D.6)

and hence

μ_o(n) = μ_o(n-1) - μ_o(n-1)u(n)[μ_o(n-1)u(n)]^T / [1 + u^T(n)μ_o(n-1)u(n)],     (D.7)

which is essentially the RPNLMS algorithm. Here μ_o(n-1)u(n) plays the role of the Kalman gain K(n). Thus far, the proof is completed.
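The equivalence between the rank-one recursion (D.7) and the inverse Hessian can be checked numerically via the matrix inversion lemma (the dimensions and random data below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
N, T = 5, 50
mu = np.eye(N)                         # mu(0) = I, i.e., H(0) = I
U = rng.standard_normal((T, N))        # regressor vectors u(1), ..., u(T)

# rank-one recursion of Eq. (D.7), one regressor at a time
for u in U:
    mu = mu - np.outer(mu @ u, mu @ u) / (1.0 + u @ mu @ u)

# direct inverse of the accumulated Hessian: (mu(0)^-1 + sum_n u(n)u^T(n))^-1
direct = np.linalg.inv(np.eye(N) + U.T @ U)
print(np.allclose(mu, direct))   # -> True: (D.7) tracks the inverse Hessian exactly
```

Each step of the loop is one application of the Sherman-Morrison identity, which is why the recursion reproduces the batch inverse exactly (up to rounding).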
APPENDIX E:

In [21], a fast and computationally efficient scheme was proposed to calculate the a priori Kalman gain of the form

K(n) = [Σ_{j=0}^{n} x(j)x^T(j)]^{-1} x(n),

where x(j) can be an m-by-1 vector or, more generally, an mp-by-1 vector such that x(j+1) is obtained from x(j) by introducing p new elements and deleting p old ones. In particular, the scheme can be used straightforwardly to implement the RPNLMS and NRPNLMS algorithms (where p = 1 and thus mp = N).
The fast algorithm, similar in spirit to Levinson's algorithm in the linear estimation (prediction) literature (see, e.g., [17]), is summarized in the following generic lemma [21]:
Lemma 1 Let

x(n) = [z^T(n-1), ..., z^T(n-m)]^T.

Then the quantity

K(n) = [Σ_{j=1}^{n} x(j)x^T(j) + δI]^{-1} x(n)

can be computed recursively by

e(n) = z(n) - A^T(n-1)x(n),     (E.1)
A(n) = A(n-1) + K(n)e^T(n),     (E.2)
e'(n) = z(n) - A^T(n)x(n),     (E.3)
S(n) = S(n-1) + e'(n)e^T(n),     (E.4)
K~(n) = [S^{-1}(n)e'(n); K(n) - A(n)S^{-1}(n)e'(n)],     (E.5)

and, repartitioning the extended gain vector K~(n) into its first mp and last p entries,

K~(n) = [m(n); ν(n)],     (E.6)
k(n) = z(n-m) - D^T(n-1)x(n+1),     (E.7)
D(n) = D(n-1) + K(n+1)k^T(n),     (E.8)
K(n+1) = m(n) + D(n)ν(n).     (E.9)
Here K, x, and m are mp-by-1; z, k, e, e', and ν are p-by-1; A and D are mp-by-p; S is p-by-p; and the extended gain K~ is p(m+1)-by-1.

APPENDIX F:
CONVERGENCE ANALYSIS
The convergence analysis of the LMS algorithm can be addressed in the framework of stochastic approximation [4]. For a learning rate scalar, we have

Lemma 2 In order to guarantee the convergence w(n) → w_o, it is necessary for the learning rate μ(n) to satisfy

Σ_{n=1}^{∞} μ^2(n) < ∞   and   Σ_{n=1}^{∞} μ(n) = ∞,     (F.1a)

E||∇^(w(n))||^2 ≤ a + b (w(n) - w_o)^T (w(n) - w_o),   a ≥ 0, b ≥ 0.     (F.1b)

The convergence analysis of the proportionate adaptation paradigms with time-varying learning rate matrices can be carried out similarly by using the quasi-martingale convergence theorem (see, e.g., [4]). Without presenting the proof here, we give the following theorem:

Theorem 1 In the case of on-line proportionate adaptation, almost sure (a.s.) convergence is guaranteed only when the following conditions hold:

Σ_{n=1}^{∞} λ_max^2(μ(n)) < ∞   and   Σ_{n=1}^{∞} λ_min(μ(n)) = ∞,     (F.2a)

E||∇^(w(n))||^2 ≤ a + b (w(n) - w_o)^T (w(n) - w_o),   a ≥ 0, b ≥ 0.     (F.2b)
APPENDIX G: NOTATIONS

Table of the symbols used in this chapter (among them d, E, e, e_pi, e_fi, e_p, e_f, f, G(n), g_k, γ, H(n), h(n), H(z), I, K, N, N(0, σ^2), sgn, tanh, tr, u, U, w, w(0), w_o, w~(n), y, μ, μ(n), μ_o, η, λ_max, λ_min, α, β, δ, ρ) and their descriptions.
Acknowledgments

S.H. and Z.C. are supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada. Z.C. is the recipient of the IEEE Neural Networks Society 2002 Summer Research Grant, and he would like to express his thanks for the summer internship and financial support provided by Bell Labs, Lucent Technologies. The authors also acknowledge Dr. Anders Eriksson (Ericsson Company, Sweden) for providing some data and help in the earlier investigation on echo cancellation. The results on decision-feedback equalization presented here are partially based on collaborative work with Dr. Antonio C. de C. Lima. The experiments on network echo cancellation were partially done at Bell Labs; the authors thank Drs. Thomas Gansler and Jacob Benesty for some helpful discussions.
REFERENCES

1. B. D. O. Anderson and J. B. Moore, Optimal Filtering. Englewood Cliffs, NJ: Prentice-Hall, 1979.
2. W.-P. Ang and B. Farhang-Boroujeny, A new class of gradient adaptive step-size LMS algorithms, IEEE Transactions on Signal Processing, 49, 805-809 (2001).
3. J. Benesty, T. Gansler, D. R. Morgan, M. M. Sondhi, and S. L. Gay, Advances in Network and Acoustic Echo Cancellation. New York: Springer-Verlag, 2001.
4. L. Bottou, On-line learning and stochastic approximation, in D. Saad, ed., On-line Learning in Neural Networks. Cambridge: Cambridge University Press, 1998, pp. 9-42.
STEADY-STATE DYNAMIC
WEIGHT BEHAVIOR IN
(N)LMS ADAPTIVE FILTERS
A. A. (LOUIS) BEEX
DSPRL, ECE, Virginia Tech, Blacksburg, Virginia
and
JAMES R. ZEIDLER
SPAWAR Systems Center, San Diego, California
University of California, San Diego, La Jolla, California
9.1
INTRODUCTION
Nonlinear effects were demonstrated to be a fundamental property of least-mean-squares (LMS) adaptive filters in the early work on adaptive noise cancellation applications with sinusoidal interference [38]. The fundamental adaptive filter configuration for noise canceling is shown in Figure 9.1. The adaptive filter adjusts the weights w_m, which are used to form the instantaneous linear combination of the signals that reside in the tapped delay line at its input.
It was established [38, 19] that when the primary input to an LMS adaptive noise canceler (ANC), d(n), contains a sinusoidal signal of frequency ω_d and the reference input, r(n), contains a sinusoidal signal of a slightly different frequency, ω_r, the weights of the LMS ANC will converge to a time-varying solution which modulates the reference signal at ω_r and heterodynes it to produce an output signal y(n) which consists of a sinusoidal signal at ω_d to match the frequency in the desired signal. This was shown to produce a notch filter with a bandwidth that is controlled by the product of the adaptive step size of the LMS algorithm and the filter order. It was shown that by selecting the appropriate step size, the resulting notch bandwidth can be significantly less than that of a conventional linear filter of the same order. Since these effects cannot be predicted from classical linear systems analysis, several authors [34, 8] have also described these nonlinear phenomena as non-Wiener effects.
Least-Mean-Square Adaptive Filters, Edited by Simon Haykin and Bernard Widrow.
ISBN 0-471-21570-8 © 2003 John Wiley & Sons, Inc.

Figure 9.1 The adaptive noise-canceling configuration.
inputs. This result contradicts the traditional assumption [18, 28, 38, 39] that the misadjustment noise of the LMS filter (i.e., the difference between the MSE of the WF and the LMS filter) represents the loss in performance associated with the adaptive estimation process.
It was recognized [31] that the improvements in MSE due to nonlinear effects are bounded by the MSE for an infinite-length WF that includes contributions of the past and present values of the reference signal r(n) and past values of the desired response d(n). The analysis is based on constraining the processes {d(n)} and {r(n)} to be jointly wide-sense-stationary (WSS) so that the WF is time invariant. The infinite past of the WSS process is not available to the finite-length Wiener filter but is available to the infinite-length WF and may be available to the LMS adaptive filter. It is shown that there are often substantial performance improvements for an LMS filter over a finite-length WF of the same order, but that the performance is always bounded by that of an optimal WF of infinite order, operating on the past desired and present and past reference inputs.
The performance of the LMS and exponentially weighted recursive-least-squares (RLS) estimators was compared [32] for both the noise-canceling and the interference-contaminated equalizer applications where nonlinear effects had been observed in the LMS algorithm. The exponentially weighted RLS estimator did not exhibit enhanced performance for these cases. It is important to note that the LMS algorithm does not make any a priori assumptions on the temporal correlation of the input model. The LMS filter selects from a manifold of potential weight vector solutions to minimize MSE based solely on its present state and the current desired and reference input data. The method by which these solutions are achieved will be described in detail in this chapter for several cases in which nonlinear effects are observed in LMS filters.
It was previously shown [23] that the improved tracking performance of the LMS algorithm relative to the exponentially weighted RLS algorithm [9, 26, 27] results from the fact that the correlation estimate used by the algorithm does not match the true temporal correlation of the data. An extended RLS algorithm [23], which incorporates estimates of the chirp rate into the state space model, can provide tracking performance superior to that of both the LMS and exponentially weighted RLS algorithms for the tracking of a chirped narrowband signal in noise. Likewise, for the noise-canceling applications considered in [19], it would be possible to introduce an extended RLS estimator that estimates the frequencies of the primary and reference inputs and incorporates those estimates into the filtering process. Such approaches could provide performance much closer to the optimal bounds that are given below, provided that the state space model used accurately describes the input data. There are many applications, however, where there are underlying uncertainties and nonstationarities in the input processes that do not allow an accurate state space model to be defined. The advantage of the LMS estimator for such cases is that it is not necessary to know the statistics of the input processes a priori.
In this chapter, we will begin by introducing three scenarios in Section 9.2 where nonlinear effects are observed in LMS filters and one in which they are not easily observed (wideband ANC). These four scenarios provide useful comparisons of the magnitude of the effects which can be expected under different conditions, and will be considered throughout the chapter as we develop the mechanisms that produce nonlinear effects. These scenarios are also used to illustrate what is required to realize performance that approaches the optimal bounds, as provided by an infinite-length WF which has access to all the present and past of the reference signal and all the past of the desired response.
Much of the previous work on nonlinear effects has focused on the behavior of the LMS filter for sinusoidal inputs. The performance here will be obtained for both deterministic sinusoids and stochastic first-order autoregressive AR(1) inputs, so that the effect of signal bandwidth on the adaptive filter performance can be described and so that the results are applicable to a larger set of adaptive filter applications.
We will focus on the use of the normalized LMS (NLMS) algorithm rather than LMS, so that we can utilize the noise normalization properties of NLMS to simplify the performance comparisons. In addition, the affine projection and minimum-norm least-squares interpretations of the NLMS algorithm [20] provide a useful model to define how the information from the past errors couples to the current error in the weight update. It is important to realize that there is generally a manifold of weight vector solutions that minimize MSE. This issue is also addressed in Section 9.3 in the context of the NLMS algorithm. A linear time-invariant (LTI) transfer function model for the NLMS algorithm is defined in Section 9.3.2.
The performance evaluations for finite- and infinite-horizon causal WFs are analyzed for reference-only, for desired-only, and for two-channel LTI Wiener filters in Section 9.4. The absolute bounds [31] are defined, and necessary conditions for achieving performance improvements are delineated. It is only when there is a significant difference in the performance bounds for the two-channel Wiener filter and the reference-only WF that nonlinear performance enhancements may be observable.
Section 9.4 establishes the conditions in which nonlinear performance enhancements are possible; in Section 9.5 we address the mechanisms by which they may be achieved. It is shown that it is possible to define a time-varying (TV) single-channel, reference-only Wiener filter which has exactly the same performance as the two-channel LTI WF defined in Section 9.4. This solution is based on a simple rotation or linking sequence that connects the samples of the desired process and the samples of the reference process. It is shown that the linking sequence is not in general unique, corresponding to the nonuniqueness of the weight vector solution represented by the manifold of possible solutions defined in Section 9.3.
Section 9.6 proves that there is an exact rotational linking sequence between the reference and desired inputs for the deterministic sinusoidal ANC applications defined in Section 9.2, and illustrates that this allows an accurate determination of the adaptive TV weight behavior of the NLMS filter. In addition, the minimum-norm interpretation of the NLMS algorithm forces the new weight vector to be the one that differs minimally from the current solution. This condition resolves the ambiguity in the solutions. The key issue in realizing the potential performance improvements delineated in Section 9.4 is shown to be whether the filter is able to track the temporal variations defined by the single-channel TV WF.
Section 9.7 extends these results to stochastic AR(1) inputs and shows that the properties of the linking sequences between desired and reference processes for the exponential case still hold approximately for the AR(1) case. The approximation inherent in this class of inputs is defined by the stochastic component of the AR(1) model. It is shown that the stochastic component becomes especially important at the zero crossings of the reference process. The result of the emergence of a driving term in the difference equations that represent these processes is that abrupt and significant changes in the individual weight values can be produced over time, as the NLMS filter selects an update that is the minimum-norm variation within the manifold of possible weight vector solutions. It is shown that the key issue in realizing potential improvements is the tracking of the temporal variations defined by the single-channel TV WF.
In Section 9.8 the linking sequence approach is applied to the adaptive linear prediction (ALP) application and the narrowband interference-contaminated equalization (AEQ) application. The auxiliary channel for the ALP case consists of the most recent past values of the desired process. In the equalization application, the auxiliary channel contains the interference signal itself or an estimate of the latter. Time-varying equivalent filters are derived for the corresponding two-channel scenarios. In ALP the equivalent filter can be interpreted as a combination of variable-step predictors of the desired signal. In AEQ the equivalent filter consists of a combination of variable-step predictors of the interference at the center tap.
Finally, in Section 9.9, we indicate the conditions that must be satisfied for nonlinear effects to be a significant factor in NLMS adaptive filter performance. The first necessary condition is that there be a significant difference in performance between the reference-only WF and the two-channel WF using all present and past reference inputs and all past desired inputs (ANC), or all recent past inputs (ALP), or the center-tap interference input (AEQ). The second requirement is that the adaptive filter be capable of tracking the temporal variations of the equivalent reference-only TV WF. In Section 9.9 we show that both of these necessary requirements are satisfied simultaneously for ANC scenarios using various signal-to-noise ratios, bandwidths, frequency differences, and model orders. We also show that a wide WF performance gap alone is not sufficient for the adaptive filter to realize a performance gain over the reference-only WF. We illustrate that in the ALP scenario, more of the Wiener filter performance gap is realized by the adaptive filter when the signal is more narrowband. In the AEQ case the TV nature is such that almost the entire Wiener filter performance gap is realized when the auxiliary choice approximates what is practically realizable.
9.2

In this section we summarize the conditions for which nonlinear effects have been observed previously. Four different scenarios have been selected for illustration: (1) wideband ANC applications, where nonlinear effects are not easily observed; (2) narrowband ANC applications, where nonlinear effects dominate performance; (3)
9.2.1

Figure 9.2 ANC configuration.

Figure 9.3 Signal generator for the desired and reference processes.
process, the two processes must have something in common. We will generate the desired and reference processes according to the signal generator illustrated in Figure 9.3.
For purposes of illustration, the system functions H_d(z) and H_r(z) each have a single pole and are described as follows:

H_d(z) = 1 / (1 - p_d z^{-1}),
H_r(z) = 1 / (1 - p_r z^{-1}).     (9.1)

These systems are driven by the same unit-variance, zero-mean, white noise process {v_0(n)}, thus generating the related AR(1) stochastic processes {d~(n)} and {r~(n)}, with powers (1 - |p_d|^2)^{-1} and (1 - |p_r|^2)^{-1}, respectively [24]. These AR(1) generating systems are therefore governed by the following difference equations:

d~(n) = p_d d~(n-1) + v_0(n),
r~(n) = p_r r~(n-1) + v_0(n).     (9.2)
The desired and reference stochastic processes {d(n)} and {r(n)}, respectively, that
form the inputs to the ANC are noisy versions of the AR(1) processes {d̃(n)} and
{r̃(n)} as a result of the addition of independent, zero-mean white noise to each:

d(n) = d̃(n) + v_d(n),
r(n) = r̃(n) + v_r(n).    (9.3)
The specific poles and measurement noise levels are now specified to complete the
parameterized scenario for Figures 9.2 and 9.3:

p_d = 0.4 e^{jπ/3},
p_r = 0.4 e^{jπ/5},
SNR_d = 60 dB,
SNR_r = 60 dB.    (9.4)
The final parameter to be chosen is M, the number of delays in the AF tapped delay
line in Figure 9.2. For clarity of illustration we select M = 3. The details of the
necessary evaluations for adaptive filtering will be provided in Section 9.3.1 and for
Wiener filtering in Section 9.4.2; for now, we are interested in illustrating the
behavior of the AF in comparison to that of the corresponding WF. The M-tap AF
and WF will be denoted AF(M) and WF(M). For the above scenario, the theoretical
minimum mean square error (MMSE) for the WF(3) and the actual errors for the
WF(3) and AF(3) implementations are shown in Figure 9.4. The difference between
WF(3) and MMSE WF(3) is that the former refers to a finite data realization of the
three-tap Wiener filter implementation, as illustrated in Figure 9.2, while the latter
refers to the theoretical expectation for the performance of such a three-tap WF,
based on perfect knowledge of the statistical descriptions of the processes involved
[the AF solutions are computed using Eqns. (9.6) and (9.7), and the WF solutions
are computed using Eqns. (9.26) through (9.31), to be developed later]. We see that the
WF(3) produces errors close to its theoretically expected performance and that
the AF(3) does almost as well. What looks like excess MSE in the latter case
is commonly attributed to the small variations of the steady-state AF weights.
The behavior of the AF(3) weights, relative to the constant WF(3) weights of
[1, 0.1236 − 0.1113j, 0.0719 − 0.0296j], is shown in Figure 9.5. We note that
the AF(3) weights vary in random fashion about their theoretically expected, and
constant, WF(3) weight values. In this scenario, the AF produces Wiener
Figure 9.4 Error behavior for NLMS AF(3) (μ = 1) and WF(3) [scenario in Eqn. (9.4)]:
p_d = 0.4e^{jπ/3}, p_r = 0.4e^{jπ/5}, SNR_d = SNR_r = 60 dB.
Figure 9.5 Real and imaginary part of weights for NLMS (μ = 1) and WF [scenario in Eqn.
(9.4)]: p_d = 0.4e^{jπ/3}, p_r = 0.4e^{jπ/5}, SNR_d = SNR_r = 60 dB.
behavior, that is, weight and MSE behavior one would reasonably expect from the
corresponding WF.
9.2.2
Relative to the above scenario, we change the parameters in two significant ways:
the desired and reference signals are made narrowband, and their center frequencies
are moved closer together. To this end, the signal generator parameters are modified as
follows:

p_d = 0.99 e^{jπ/3},
p_r = 0.99 e^{j(π/3 − 0.052π)},
SNR_d = 20 dB,
SNR_r = 20 dB,    (9.5)

and the corresponding experiment is executed. The number of delays in the AF
tapped delay line is kept at three, that is, M = 3. The resulting error behavior is
represented in Figure 9.6. Note that not only is the AF(3) error generally less than the
WF(3) error, it also falls well below what is theoretically expected for the
corresponding WF(3). This performance aspect, of the AF performing better than
the corresponding WF, is surprising. The explanation of this behavior will be given
in detail in the later sections of this chapter.
The AF(3) weight behavior for the narrowband ANC scenario, together with the
constant WF(3) weight vector solution of [0.6587 − 0.0447j, 0.1277 − 0.0482j,
0.5399 − 0.3701j], is shown in Figure 9.7. We note here that the AF(3) weights are
varying in a somewhat random yet decidedly semiperiodic fashion, and that this
variation is at most only vaguely centered on the constant weights of the
corresponding WF(3). Since the AF error is less than that for the corresponding WF,
and because of the time-varying weight behavior, the AF behavior for this scenario
is termed non-Wiener. Such non-Wiener behaviors had originally been observed
when closely spaced sinusoids were used as inputs to the ANC [19].
The non-Wiener effects were observed in the narrowband ANC scenario and the
effects investigated in terms of pole radii (bandwidth), pole angle difference
(spectral overlap), and signal-to-noise ratio (SNR) [31]. A prediction for the
performance in the narrowband ANC scenario was derived on the basis of a
Figure 9.6 Error behavior for NLMS (μ = 1) and WF [scenario in Eqn. (9.5)]:
p_d = 0.99e^{jπ/3}, p_r = 0.99e^{j(π/3−0.052π)}, SNR_d = SNR_r = 20 dB.
9.2.3
Nonlinear behavior was also observed in the AEQ scenario [33] depicted in Figure
9.8, where x(n) is a wideband quadrature phase shift keyed (QPSK) signal, i(n) is a
narrowband interference, and v_r(n) is an additive, zero-mean, white noise.
The delay D in the signal path ensures that the filter output is compared to the
signal value at the center tap [D = (M − 1)/2]. After training of the AEQ, the error
signal can be derived by comparison with the output from the decision device.
However, as our purpose here is to demonstrate the occurrence of nonlinear effects,
we will compare the estimated signal constellations when using the WF and when
using the AF in training mode in the presence of strong narrowband AR(1)
Figure 9.7 Real and imaginary part of weights for NLMS (μ = 1) and WF [scenario in Eqn.
(9.5)]: p_d = 0.99e^{jπ/3}, p_r = 0.99e^{j(π/3−0.052π)}, SNR_d = SNR_r = 20 dB.
interference. For strict comparison, the WF and AF are operating on the same
realization. The AR(1) pole is located at 0.9999 exp(jπ/3). Adaptive filter
performance is again computed using Eqns. (9.6) and (9.7), and WF performance is
computed using Eqns. (9.26) through (9.31). The respective results are shown in Figure 9.9,
for an SNR of 25 dB, a signal-to-interference ratio (SIR) of 20 dB, a filter length M
of 51 (D = 25), and NLMS step-sizes μ = 0.1, 0.8, and 1.2. Step-size is an
important factor in optimizing performance; a step-size of 0.8 is close to optimal for
this scenario [33], while a very small step-size elicits WF-like results. For signal
power at 0 dB, the NLMS (μ = 0.1, 0.8, 1.2) AF(51) produced MSE of −12.83,
−16.01, and −15.11 dB, respectively, while WF(51) produced MSE of −11.09 dB.
The WF(51) MMSE is −11.34 dB for this case. We see that the AF(51) equalized
symbols for the larger step-sizes are more tightly clustered around the true symbol
values (at the cross-hairs) than the WF(51) equalized symbols. Correspondingly, as
borne out by the MSE, the AF errors are more tightly clustered around zero,
thereby demonstrating the nonlinear effect in this AEQ scenario.
Nonlinear effects in LMS adaptive equalizers were investigated for the LMS as
well as NLMS algorithms for a variety of SIR and SNR values [33]. The latter
investigation included deriving the corresponding WF, expressions for the optimal
step-size parameter, and results for sinusoidal and AR(1) interference processes,
with and without the use of decision feedback.
9.2.4
Nonlinear behavior was observed as well in the ALP scenario depicted in Figure
9.10, where, in particular, the process to be predicted was a chirped signal [21].
To demonstrate the nonlinear effect in the ALP scenario, we will use an AR(1)
process in observation noise, as provided by {d(n)} in Figure 9.3. The AR(1) pole
p_d = 0.95 exp(jπ/3), SNR_d = 20 dB, and the prediction lag D = 10. The number of
delays in the AF tapped delay line is again three, that is, M = 3. In Figure 9.11
the NLMS AF(3) and WF(3) performance, operating on the same 10 process
Figure 9.9 Equalization performance for WF(51) (a) and NLMS AF(51) (μ = 0.1, 0.8, 1.2
in (b), (c), and (d), respectively): p_i = 0.9999e^{jπ/3}, SNR = 25 dB, SIR = 20 dB.
Figure 9.10 ALP scenario.
realizations, is compared to the theoretical MMSE for the corresponding WF. The
experiment is repeated at different step-sizes. We observe that in the given scenario
the NLMS AF(3) outperforms the corresponding WF(3) for a wide range of step-sizes,
thereby illustrating the existence of non-Wiener effects in the ALP scenario.
This limited set of experiments suggests that the nonlinear effects are more
pronounced for larger step-sizes.
Results for chirped signals, elaborating on the effects of chirp rate and bandwidth,
have been reported [21]. The latter also provides an estimate of the performance that
may be expected using the transfer function approach in the un-chirped signal
domain.
9.3
We have seen that NLMS performance can be better than the performance of the
corresponding finite-length WF due to the nonlinear or non-Wiener effect illustrated
in Section 9.2. The intriguing question we now address is how this performance
improvement is achieved. To begin, we establish notation and briefly review some
of the well-known interpretations of the NLMS algorithm that will be used here. An
indicator for NLMS performance can be found in the transfer function approach to
modeling the behavior of NLMS, which is rooted in adaptive noise cancellation of
sinusoidal interference [19] as well as of colored processes [12]. The LTI transfer
function model for NLMS is derived so that we can later compare the performance
estimate it provides to the performance of the NLMS algorithm for several of the
scenarios described in Section 9.2.
9.3.1 Projection and Minimum-Norm Least-Squares Interpretations of NLMS
Using the setups in Figures 9.2, 9.8, and 9.10 for environments that are not known a
priori, or that vary over time, the WF is replaced with an AF that uses the same
inputs. In the ANC and ALP applications, the noisy version of the desired signal is
used as the desired signal, and the error between the desired signal and its estimate
(the AF output y(n)) is used for AF weight adaptation. In the AEQ scenario the
signal of interest (the QPSK signal) serves as the desired signal during the training
period.
The nonlinear effects in the AF environment have been observed [33] when using
the NLMS algorithm, which is summarized here as follows:

e(n) = d(n) − w^H(n)u(n),
w(n + 1) = w(n) + μ (e*(n) / (u^H(n)u(n))) u(n),    (9.6)

u(n) = [r(n), r(n − 1), …, r(n − M + 1)]^T.    (9.7)
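For concreteness, the recursion of Eqns. (9.6) and (9.7) can be coded in a few lines. The sketch below is a generic complex-valued NLMS implementation under our own naming; the small regularization constant eps is an implementation safeguard against division by zero and is not part of Eqn. (9.6):

```python
import numpy as np

def nlms(d, r, m_taps, mu, eps=1e-12):
    """Reference-only NLMS, Eqns. (9.6)-(9.7).

    d: desired samples; r: reference samples; m_taps: M; mu: step-size.
    Returns the error sequence e(n) and the final weight vector w.
    """
    w = np.zeros(m_taps, dtype=complex)
    u = np.zeros(m_taps, dtype=complex)       # u(n) = [r(n) ... r(n-M+1)]^T
    e = np.empty(len(d), dtype=complex)
    for n in range(len(d)):
        u = np.roll(u, 1)                     # shift the tapped delay line
        u[0] = r[n]
        e[n] = d[n] - np.vdot(w, u)           # np.vdot conjugates w: w^H u
        w = w + mu * np.conj(e[n]) * u / (np.vdot(u, u).real + eps)
    return e, w
```

With a white reference and a noiseless linear desired signal, the weights converge to the generating taps, which makes a convenient sanity check of the implementation.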
(9.8)

(9.9)

(9.10)

Figure 9.12
In this situation, we can write the following for the observed desired signal:

d(n) = w_TI^H u(n) + v(n),    (9.11)

with the corresponding filter estimate ŵ^H(n)u(n).
9.3.2
In Section 9.2 we showed that NLMS AF performance could be better than the
performance of the corresponding WF. This phenomenon was often, though by no
means always, associated with large step-sizes. We now show the derivation of this
LTI transfer function model for NLMS, which has the attractive feature that it is
equally valid for large and small step-sizes. The LTI transfer function model has
been a reasonably good indicator for NLMS performance in ANC, AEQ, and ALP
scenarios [21, 32, 33].
Starting from an initial weight vector w(0), repeated application of Eqn. (9.6)
leads to the following expression for the weight vector w(n):

w(n) = w(0) + μ Σ_{i=0}^{n−1} (e*(i) / (u^H(i)u(i))) u(i).    (9.12)
the first equality in Eqn. (9.6), w(n) is then a function of all the previously
encountered values of the desired signal. The transfer function model for NLMS is based on
Eqn. (9.12), so that we can reasonably expect this model to account, more or less,
for the fact that NLMS uses all previously encountered values of the desired and
reference processes.
With y(n), the output of the adaptive filter, given by

y(n) = w^H(n)u(n),    (9.13)

we find

y(n) = w^H(0)u(n) + μ Σ_{i=0}^{n−1} (e(i) / (u^H(i)u(i))) u^H(i)u(n).    (9.14)
The approximation facilitating the derivation of the LTI transfer function model for
NLMS is

u^H(i)u(n) = Σ_{j=0}^{M−1} r(n − j) r*(i − j) ≈ M r_r(n − i),    (9.15)

where r_r(m) denotes the ensemble-average correlation of the reference process
{r(n)} at lag m. The latter results from the ergodicity assumption, so that time
averages can be replaced by ensemble averages. For large M this approximation
appears to be more valid than for small M. Nevertheless, as will be shown in Section
9.8, the approximation in Eqn. (9.15) is useful for reasonably small M also, in the
sense that the resulting model for NLMS behavior produces a good indicator of
NLMS performance.
Using the first equality in Eqn. (9.6) in the left-hand side of Eqn. (9.14), and
substituting Eqn. (9.15) in the right-hand side of Eqn. (9.14) twice, yields

d(n) − e(n) = w^H(0)u(n) + μ Σ_{i=0}^{n−1} e(i) M r_r(n − i) / (M r_r(0)).    (9.16)

Noting that the denominator under the summation is constant, and defining
t(n) = d(n) − w^H(0)u(n) as the excitation, produces the following difference equation as
governing the NLMS error process:

e(n) + (μ / r_r(0)) Σ_{i=0}^{n−1} e(i) r_r(n − i) = d(n) − w^H(0)u(n)
= t(n).    (9.17)
Replacing the summation index by m = n − i gives

e(n) + (μ / r_r(0)) Σ_{m=1}^{n} e(n − m) r_r(m) = t(n),    (9.18)

which in the steady-state limit becomes

e(n) + (μ / r_r(0)) Σ_{m=1}^{∞} e(n − m) r_r(m) = t(n).    (9.19)
Equation (9.19) is recognized as a difference equation describing a causal, all-pole,
LTI system of infinite order, with numerator polynomial equal to 1. The
denominator of the corresponding system function is given by the following
polynomial:

D_NLMS(z) = 1 + (μ / r_r(0)) Σ_{m=1}^{∞} r_r(m) z^{−m}.    (9.20)
The difference equation in Eqn. (9.19) therefore represents the NLMS error process
as the result of the LTI system H_NLMS(z) driven by the process t(n). The NLMS system
function is given by

H_NLMS(z) = [1 + (μ / r_r(0)) Σ_{m=1}^{∞} r_r(m) z^{−m}]^{−1}
= [1 + μ Σ_{m=1}^{∞} (r_r(m) / r_r(0)) z^{−m}]^{−1}
= [1 + μ Σ_{m=1}^{∞} r̄_r(m) z^{−m}]^{−1},    (9.21)

where r̄_r(m) denotes the normalized reference ACF.
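The model of Eqn. (9.21) is easy to evaluate numerically by truncating the infinite sum once the ACF has died out. The following sketch (our own naming, not from the chapter) does exactly that:

```python
import numpy as np

def h_nlms(omega, mu, acf, r0):
    """Evaluate the LTI model H_NLMS(e^{j*omega}) of Eqn. (9.21).

    acf[m-1] holds r_r(m) for m = 1..len(acf); the infinite sum is
    truncated at that lag, which is adequate once the ACF has decayed.
    omega may be a scalar or an array of radian frequencies.
    """
    omega = np.atleast_1d(np.asarray(omega, dtype=float))
    m = np.arange(1, len(acf) + 1)
    # denominator 1 + (mu/r_r(0)) * sum_m r_r(m) e^{-j*omega*m}
    denom = 1.0 + (mu / r0) * np.sum(
        np.asarray(acf)[None, :] * np.exp(-1j * np.outer(omega, m)), axis=1)
    return 1.0 / denom
```

A quick sanity check: for a white reference process, r_r(m) = 0 for m ≥ 1, so the model reduces to H_NLMS = 1 at all frequencies, that is, the model predicts no coloration of the excitation; for a lowpass AR(1) reference the model attenuates the excitation at low frequencies.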
The corresponding model-based estimate of the steady-state NLMS error power is

J_NLMS = (1 / 2π) ∫_{−π}^{π} |H_NLMS(e^{jω})|² S_t(e^{jω}) dω,    (9.22)
where S_t(e^{jω}) is the spectral density of the process {t(n)} defined in Eqn. (9.17).
Alternatively,

J_NLMS = (1 / 2πj) ∮_{|z|=1} H_NLMS(z) H*_NLMS(1/z*) S_t(z) z^{−1} dz,    (9.23)

which can be interpreted in terms of auto- and cross-correlations when the integrand
is rational [1, 14].
9.4
In Section 9.3, Eqn. (9.12), we saw that the NLMS weight vector w(n) was an
implicit function of past values of d(n), r(n), and w(n). Note that the NLMS AF does
not have direct access to all these causally past values; for example, the value of
d(n − 1) is embedded in w(n) and is no longer directly available at n and beyond.
Consequently, the NLMS AF is constrained in its use of the causally past values of
d(n) and r(n), and as a result, its performance is limited relative to that of a filter
having full access to the past. In this section, we look at what is possible in terms of
estimation performance when an LTI estimator has full access to the causal past of
the desired process and the reference process, as well as to the present of the
reference process, thereby reaching the bounds defined earlier [31]. The latter will
provide absolute bounds on the performance of AFs when used in a WSS
environment. We will show in Section 9.5 how the NLMS AF is able to access
information from the past values of d(n) and achieve performance which exceeds
that of the reference-only WF while always being bounded by the performance of
the WF that uses all past values of d(n) and r(n).
The goal of the estimation scenarios of interest is to provide the best estimate
based on given information or measurements. In each case, the filter output can be
seen as having been produced by the input process {d(n), u(n)}, that is, a joint or
multichannel process. In the WF case, statistical information about the joint process
is used for the design of the filter, which then operates on the samples of (or perhaps
only a subset of) the joint process. In the AF case, the filter operates on the samples
of the joint process (although differently on its different subsets) while
simultaneously redesigning itself on the basis of those same samples. This
multichannel view may not directly represent the most commonly encountered
implementation of AFs, but it will afford us insights into the limits of performance
encountered in the AF scenarios above in their usual single-channel implementation.
Also, using different multichannel views for the ANC, ALP, and AEQ scenarios,
we will be able to advance explanations for the observed nonlinear or non-Wiener
effects. In addition, we will show that, in some scenarios, multichannel
AF implementations may provide performance gain that cannot otherwise be
obtained.
9.4.1

(9.25)
where the last equality results from the measurement noise process {v_d(n)} being
white, zero-mean, and independent of the noise processes {v_0(n)} and {v_r(n)}.
In a Gaussian scenario, the driving and measurement noises are all Gaussian in
addition to having the above properties. The optimal filter for estimating d(n) is then
in fact an LTI filter [37]. The latter, being truly optimal in the MMSE sense, that is, the
best of all possible operations on the joint process samples, whether that operation
is linear or nonlinear, produces an absolute bound on the performance of the AF.
This performance bound was recognized by Quirk et al. [31], who also showed that it
may be approached by the performance of NLMS in specific ANC scenarios.
The optimal filter, and the corresponding absolute bound, can be derived using
spectral factorization [31]. In practice, we are often interested in designing the best
causal linear filter operating on a finite horizon, as expressed by L and M, the number
of tapped delay line stages for the desired and reference processes, respectively. It
can be shown [22, 36] that the performance of the finite horizon causal linear filter,
as its order increases, converges to the performance of the infinite horizon causal
linear filter, which in a Gaussian scenario is the best performance possible. We will
therefore concentrate next on the design and performance of the optimal
(multichannel) finite horizon causal linear filter, which provides the opportunity
to make a practical trade-off between performance and computational effort.
9.4.2

w_WF(L,M) = R^{−1} p.    (9.26)

In general, the appropriate partitions in the following definitions are used in order to
yield the single-channel reference-only (L = 0, M > 0), single-channel desired-only
(L > 0, M = 0), or multichannel desired + reference (L > 0, M > 0) WFs:

R = E{u(n)u^H(n)},  p = E{u(n)d*(n)},    (9.27)

u(n) = [d(n − 1)^T  r(n)^T]^T,    (9.28)

d(n − 1) = [d(n − 1), d(n − 2), …, d(n − L)]^T,
r(n) = [r(n), r(n − 1), …, r(n − M + 1)]^T.    (9.29)
The output of the corresponding (multichannel) WFs can then be written as follows:

d̂(n) = w_WF(L,M)^H u(n),    (9.30)

where the values of L and M indicate which of the partition(s) of u(n) in Eqn. (9.28)
is active. The performance of these finite horizon WFs, expressed in terms of MMSE
in estimating d(n), is evaluated from

MMSE_WF(L,M) = r_d(0) − w_WF(L,M)^H p,    (9.31)
9.4.3
Figure 9.15 ACF magnitude (a) and CCF magnitude (b) versus lag for the wideband ANC
scenario: p_d = 0.4e^{jπ/3}, p_r = 0.4e^{jπ/5}, SNR_d = SNR_r = 60 dB.
present and past values. Also note that the effect of the added measurement noise is
very small and is observed at lag 0 only. The reference signal ACF magnitude (not
its phase) is again the same as for the desired signal. For this narrowband process, it
takes many lags for the correlation between process samples to vanish. In fact, for
both the ACF and the CCF, the magnitude is exponential, with the factor equal to the
magnitude of the poles used in the generating processes. We see that the correlation
between adjacent desired signal values is approximately 0.99, that is, very high. The
magnitude of the CCF between the reference and desired processes is only
approximately 0.065 at zero lag, indicating that there is not much (statistical)
information in r(n) about d(n). Consequently, the WF(M, 0) performance seen in
Figure 9.16 is rapidly better than that of WF(0, M), that is, a reversal relative to the
wideband ANC scenario.
We note that MMSE WF(0, 3) is about 17 dB, as reflected in Figure 9.6. For this
scenario, the reference-only WF will need an extremely high order before its
performance will approach that of the desired + reference WF. In the present
scenario, a big improvement in performance results from incorporating the first
desired signal sample; this is the reverse of what we saw in the wideband ANC
scenario. The performance advantage of WF(M, M) over WF(0, M) is immediate
Figure 9.17 ACF magnitude (a) and CCF magnitude (b) for the narrowband ANC scenario:
p_d = 0.99e^{jπ/3}, p_r = 0.99e^{j(π/3−0.052π)}, SNR_d = SNR_r = 20 dB.
and not easily mitigated by an increase in the filter order. The latter performance
behavior provides good arguments for the use of filters that include the desired
channel. Recall that in Figure 9.6 we saw an AF(0, 3) performance improvement of
about 6 dB, performance still well above the bound of less than 1 dB indicated in
Figure 9.16 for M = 3. This indicates that the AF is accessing at least some, but not
all, of the information present in the desired signal channel. The mechanism by
which this is accomplished will be described in Section 9.5.
The performance perspective in Figure 9.16 shows that, for scenarios such as
this one, where there is strong temporal correlation in {d(n)}, it would be more
beneficial to use past desired process values than past (and present) reference values.
Furthermore, the results again indicate that the AF is accessing some of the
information available to the two-channel WF. However, this does not yet explain the
mechanism responsible for the nonlinear effects that have been observed when using
a reference-only adaptive filter.
9.4.3.3 Adaptive Linear Predictor The WF performance behavior for the ALP
scenario specified in Section 9.2.4 is shown in Figure 9.18, while Figure 9.19 shows
Figure 9.19 ACF and CCF magnitudes for the ALP scenario: p_d = 0.95e^{jπ/3},
SNR_d = 20 dB, D = 10.
the magnitude of the ACF for the desired process and the magnitude of the CCF
between the reference and desired processes for the ALP scenario specified.
Recall that, in the ALP scenario, the reference process is a delayed version of the
desired process, so that r(n) in Eqn. (9.29) is actually equal to d(n − D), as defined in
Eqn. (9.29). The ACFs of the desired and reference processes are therefore exactly
the same. The ACF and CCF die out exponentially according to the pole magnitude
of 0.95, and the peak in the CCF magnitude at lag m = 10 reflects the pure delay of
10 samples in the reference signal relative to the desired signal. There is a strong
correlation between d(n) and d(n − 1), so that adding a past desired signal sample
provides immediate information about d(n). This is reflected in the sharp increase in
WF(M, 0) performance at the addition of the very first desired channel tap, as seen in
Figure 9.18. Since {d(n)} is essentially an AR(1) process [24], adding past values
beyond d(n − 1) provides almost no additional information. This explains the
apparent performance saturation at 0.75 dB; that is, the performance bound is
reached almost immediately.
The MMSE WF(0, 3) performance is about 8.3 dB, corresponding to the level
indicated in Figure 9.11. Note that the performance of the reference-only WF has
saturated at 8.3 dB and will not approach the performance of the desired-only or
desired + reference WF at any filter order. The latter is explained by the fact that
increasing the order, or number of taps, in WF(0, M) only adds further delayed
samples to the reference vector in Eqn. (9.29). Figure 9.19b shows that, as far as the
reference channel is concerned, r(n), which equals d(n − D), contains most
information about d(n) but is correlated with it only at the 0.6 level (see lag m = 0),
hence the saturation at about 8.3 dB rather than at 0.75 dB.
With respect to Figure 9.19b, note that as we change the prediction lag D, the
CCF magnitude peaks at the corresponding lag location. Simultaneously, the CCF
between r(n) = d(n − D) and d(n), at lag 0, decreases (increases) as the
prediction lag D increases (decreases). A decrease in the latter CCF implies that
there is less information about d(n) in r(n) = d(n − D), and therefore the
performance of the reference-only WF decreases. Figure 9.20 shows the WF performance as
it changes with prediction lag D. As prediction lag D increases, the performance gap
between the reference-only WF and the desired + reference WF increases. The latter
means that, potentially, there is more to gain from the nonlinear effects in the AF as
the prediction lag increases. The NLMS ALP performance for the ALP scenario of
Section 9.2.4 is evaluated for M = 3, for five different realizations, and for
prediction lags of 10 and 30, and is shown in Figure 9.21. Figure 9.21a shows that
the NLMS reference-only AF gains a maximum of about 2 to 2.5 dB over the
corresponding WF for prediction lag D = 10. For prediction lag D = 30, the NLMS
AF gains a maximum of about 3 to 4 dB. As hypothesized earlier, we gain relatively
more in performance as the prediction lag increases.
In both of the latter cases, and in the results in Figure 9.11, we saw AF
performance down to about the 5.5 dB level, well above the absolute bound of about
0.75 dB indicated in Figures 9.18 and 9.20. Again, this performance behavior
indicates that some, but not all, of the information from the desired channel is being
accessed in corresponding AF applications.
Figure 9.22 Two-channel AEQ configuration for the Section 9.2.3 scenario: p_i = 0.9999e^{jπ/3},
SNR = 25 dB, SIR = 20 dB.
Figure 9.23 shows the WF performance behavior for the AEQ scenarios, using
the interference process as the second channel. The WF(0, 51) MMSE of −11.34 dB
corresponds to the WF constellation in Figure 9.9a. The AF(0, 51) MSE that was
realized for μ = 0.8 was −16 dB, corresponding to the AF constellation in Figure
9.9c. Note in Figure 9.23 that the latter is in the direction of the two-channel WF
performance of −25 dB. The latter performance bound is associated with using the
interference signal itself in the second channel, since the AF task in the AEQ
scenario is to reject the interference in the process of producing an equalized
estimate of the desired QPSK signal. We see in Figure 9.23 that WF(M, 0) does not
produce any improvement in performance at any order. This situation corresponds to
using the interference signal, by itself, to estimate the desired QPSK signal.
Considering the independence of the interference and the QPSK signal, no improvement in
performance can be expected.
Each of the above examples illustrates that an appropriate two-channel WF
always performs at least as well as, and often better than, the reference-only WF. The
absolute two-channel WF bound, MMSE WF(∞, ∞), indicates the best possible
performance that can be achieved by a two-channel NLMS AF implementation in
WSS scenarios. For analysis of the different contributions to NLMS error, the LTI
transfer function model for a two-channel NLMS filter is developed next.
9.4.4
With the AF input vector defined as in Eqn. (9.28), the derivation of the two-channel
LTI transfer function proceeds exactly as in Eqns. (9.12) through (9.14). Equation
(9.15) is replaced by

u^H(i)u(n) = Σ_{j=0}^{M−1} r(n − j) r*(i − j) + Σ_{j=0}^{L−1} d(n − 1 − j) d*(i − 1 − j)
≈ M r_r(n − i) + L r_d(n − i),    (9.32)
where r_r(m) and r_d(m) denote the correlations, at lag m, of the reference and desired
(or auxiliary) processes, respectively (the relationship is valid, in general, for any
two channels that make up the AF input vector). Using Eqn. (9.32) in Eqns. (9.16)
through (9.20) results in the LTI transfer function model for two-channel NLMS:

H_NLMS(z) = [1 + (μ / (M r_r(0) + L r_d(0))) Σ_{m=1}^{∞} (M r_r(m) + L r_d(m)) z^{−m}]^{−1}
= [1 + (μ / (M r_r(0) + L r_d(0))) (M Σ_{m=1}^{∞} r_r(m) z^{−m} + L Σ_{m=1}^{∞} r_d(m) z^{−m})]^{−1}.    (9.33)

The choice L = 0 in Eqn. (9.33) directly yields the single-channel NLMS model
in Eqn. (9.21). To evaluate the NLMS transfer function model requires the
evaluation of the ACF of the reference process {r(n)} and the auxiliary (desired or
interference in our earlier scenarios) process, and their respective strictly causal
z-transforms.
9.4.5
As our earlier scenarios have been based on AR(1) processes, and as the ACFs for
the desired process and the delayed desired process are equal, we will explicitly
derive the NLMS transfer function model for the ANC and ALP scenarios.
The AR(1) reference process ACF, for the process defined in the signal generator
portion of Section 9.2.1, is given by [24]

r_r(m) = (1 / (1 − |p_r|²)) p_r^{|m|} + c_r² δ(m),    (9.34)

where the scaling constant c_r is determined by the SNR (in dB) as follows:

c_r² = r̃_r(0) 10^{−SNR_r/10} = 10^{−SNR_r/10} / (1 − |p_r|²).    (9.35)
From Eqn. (9.34), and the analogy in generating {d(n)} and {r(n)}, it now follows that

r_r(m) = (1 / (1 − |p_r|²)) p_r^{|m|} [1 − δ(m)] + ((1 + 10^{−SNR_r/10}) / (1 − |p_r|²)) δ(m),

r_d(m) = (1 / (1 − |p_d|²)) p_d^{|m|} [1 − δ(m)] + ((1 + 10^{−SNR_d/10}) / (1 − |p_d|²)) δ(m),    (9.36)

where the first term in brackets equals 1 for all values of m except for m = 0, where
the term is zero. For the summation terms in the NLMS transfer function we only
need the correlation function values for strictly positive lag m, resulting in
Σ_{m=1}^{∞} r_r(m) z^{−m} = (1 / (1 − |p_r|²)) p_r z^{−1} / (1 − p_r z^{−1}),  |z| > |p_r|,

Σ_{m=1}^{∞} r_d(m) z^{−m} = (1 / (1 − |p_d|²)) p_d z^{−1} / (1 − p_d z^{−1}),  |z| > |p_d|.    (9.37)
With r_r(0) and r_d(0) given by the constants in the second right-hand side terms in
Eqn. (9.36), also substituting Eqn. (9.37) in Eqn. (9.33), and some careful algebra,
we find the following explicit expression for the NLMS transfer function applicable
to the ANC and ALP scenarios:

H_NLMS(z) = (1 − p_r z^{−1})(1 − p_d z^{−1}) /
[1 + (μγ_rd + μγ_dr − p_r − p_d) z^{−1} + (p_r p_d − μγ_rd p_d − μγ_dr p_r) z^{−2}],

γ_rd = M p_r (1 − |p_d|²) β^{−1},
γ_dr = L p_d (1 − |p_r|²) β^{−1},
β = M(1 + 10^{−SNR_r/10})(1 − |p_d|²) + L(1 + 10^{−SNR_d/10})(1 − |p_r|²).    (9.38)
Associated with the NLMS transfer function is the choice of the driving term in
Eqn. (9.17). A common choice for the starting weight vector is the all-zero vector.
We see from the right-hand side of Eqn. (9.17) that this corresponds to driving the
NLMS difference equation with the desired signal, d(n). Alternatively, we could
argue that our interest is in the steady-state performance of NLMS, and that we
should therefore start NLMS in the steady state. In this second case, a reasonable
weight vector to start from is the optimal Wiener weight. Substituting the latter in
the NLMS driving term, defined in Eqn. (9.17), yields the Wiener error as the driving
term for the NLMS difference equation. In Section 9.8, we will refer to the latter
choice as yielding the MSE performance estimate from the transfer function model
for NLMS.
Note from Eqn. (9.33) that when μ → 0, the NLMS transfer function
H_NLMS(z) → 1. If the driving term is the Wiener error, then the NLMS error will still
be the Wiener error; that is, we get the expected behavior for small step-size.
Having derived an explicit expression for modeling NLMS performance in the
ANC and ALP scenarios, we will now outline the procedure by which the
corresponding MSE estimate is evaluated. Working backward from Eqn. (9.23), the
Wiener error is the driving process for the LTI NLMS filter that then generates the
modeled steady-state NLMS error. This process is illustrated in Figure 9.24.

Figure 9.24
Figure 9.25
Note that the only dependence on NLMS step-size resides in the LTI transfer
function. The NLMS error process consists of the additive contributions due to the
independent processes {v_0(n)}, {v_d(n)}, and {v_r(n)}, corresponding to the input
process and the measurement noise processes on the desired and reference
processes, respectively. In order to calculate more readily the individual
contributions to the modeled NLMS error, we note that all systems in Figure 9.24 are
LTI, so that the equivalent diagram, Figure 9.25, applies.
The individual contributions to the modeled NLMS MSE can therefore be
evaluated from the ACF and CCF of the corresponding contributions to the
processes {e_1(n)} and {e_2(n)} as follows:

r_e(0) = r_{e1}(0) + r_{e2}(0) + r_{e1e2}(0) + r*_{e2e1}(0).    (9.39)

Equation (9.39) is needed only in the evaluation of the contribution due to the input
process {v_0(n)}. The contribution due to the measurement noise on the desired
process involves only the corresponding component of {e_1(n)}, and the contribution
due to the measurement noise on the reference process involves only the
corresponding component of {e_2(n)}. Note that all systems in Figure 9.25 are
autoregressive moving-average (ARMA) systems, and that each of the driving noises is
zero-mean and white. Consequently, the individual contributions to the modeled
NLMS MSE can be evaluated using the Sylvester matrix based approach [1, 14].
For this MSE estimate from the transfer function model for NLMS to apply to
reference-only NLMS, the WF partition corresponding to the desired channel is set
to zero and the WF partition corresponding to the reference channel is replaced by
the reference-only WF. The resulting input to the LTI NLMS filter, which is
obtained by setting L = 0 in Eqn. (9.33), is now the reference-only WF error.
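Operationally, the pipeline of Figure 9.24 amounts to filtering the excitation through the truncated all-pole model of Eqn. (9.19) and measuring the output power. The following is a minimal sketch (names are our own, and the truncation lag k is an implementation choice, not part of the model):

```python
import numpy as np

def modeled_nlms_mse(t, mu, acf, k=200):
    """Drive the LTI NLMS model (Figure 9.24) with the excitation t(n).

    t:   the driving sequence (e.g., the WF error);
    acf: acf[m] holds r_r(m) for m = 0, 1, ...;
    k:   truncation lag for the all-pole model of Eqn. (9.19).
    Returns the modeled steady-state MSE as the average output power.
    """
    a = mu * np.asarray(acf)[1:k + 1] / acf[0]     # normalized ACF taps
    e = np.zeros(len(t), dtype=complex)
    for n in range(len(t)):
        # e(n) = t(n) - (mu/r_r(0)) * sum_m r_r(m) e(n-m), per Eqn. (9.19)
        past = sum(a[m - 1] * e[n - m] for m in range(1, min(n, len(a)) + 1))
        e[n] = t[n] - past
    return float(np.mean(np.abs(e) ** 2))
```

For a white reference process the feedback taps vanish and the routine simply returns the power of the excitation, consistent with H_NLMS = 1 in that case.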
9.5
In Section 9.2.2, Figure 9.7, we noted that time-varying behavior of the NLMS
weights occurred when demonstrating nonlinear effects in the ANC scenario. In this
9.5.1
Assuming that we have solved for the optimal TI two-channel WF, using reference
and desired inputs, the estimate that such a filter produces is given by

d̂(n) = w_WF(L,M)^H u(n) = w_WF(L,M)^H [d(n − 1)^T  r(n)^T]^T.    (9.40)
For the sake of illustration, let's assume that L = M − 1, so that the number of taps in the desired signal channel is one less than the number of taps in the reference channel. This is not an actual restriction, and we show in Section 9.5.3 how to remove it.
Next, we define the rotation, or linking, sequence {ρ(n)}, which expresses the connection between the samples of the desired (or auxiliary) process and the samples of the reference process:

    d(n) = ρ(n) r(n).        (9.41)

Applying Eqn. (9.41) to each element of the desired channel vector gives

    d(n−1) = [0  D_ρ(n−1)] r(n),   D_ρ(n−1) = diag(ρ(n−1), ..., ρ(n−L)).        (9.42)
    d̂(n) = [w_d,WF(L,M)^H  w_r,WF(L,M)^H] [d(n−1); r(n)]
         = w_d,WF(L,M)^H [0  D_ρ(n−1)] r(n) + w_r,WF(L,M)^H r(n)
         = (w_d,WF(L,M)^H [0  D_ρ(n−1)] + w_r,WF(L,M)^H) r(n)        (9.43)
         = w_TVWF(0,M)^H(n) r(n).

In the final step shown in Eqn. (9.43), we have thus defined w_TVWF(0,M)(n), a time-varying (reference-only) WF that is equivalent to the optimal TI two-channel WF w_WF(L,M). The latter is TI, but uses both the desired and the reference input channels. The newly defined equivalent reference-only WF is TV, because of the term involving D_ρ(n−1). Note that both filters, in the first and last lines of Eqn. (9.43), produce exactly the same estimate.
Note from Eqn. (9.43) that D_ρ(n−1) represents the only time-varying aspect of the equivalent filter. Now, if the reference-only AF manages to effectively track this TV equivalent to the optimal desired + reference TI WF, then the AF may indeed capture some of the performance advantage of the two-channel TI WF over the corresponding reference-only WF.
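The substitution that produces Eqn. (9.43) can be checked numerically. For noise-free complex exponentials, replacing d(n−1) by (d(n−1)/r(n)) r(n) converts an arbitrary two-channel filter into a reference-only TV filter with the identical estimate; all amplitudes, phases, frequencies, and weights below are hypothetical:

```python
import numpy as np

# hypothetical noise-free narrowband pair: complex exponentials
wd, wr = 2*np.pi*0.20, 2*np.pi*0.15
d = lambda k: 2.0*np.exp(1j*(0.3 + wd*k))   # desired process
r = lambda k: 3.0*np.exp(1j*(0.9 + wr*k))   # reference process

# arbitrary two-channel weights: L = 1 desired tap, M = 2 reference taps
w_d = np.array([0.4 - 0.2j])                # acts on d(n-1)
w_r = np.array([1.0 + 0.5j, -0.3 + 0.1j])   # acts on r(n), r(n-1)

k = 17
est_2ch = np.conj(np.concatenate([w_d, w_r])) @ np.array([d(k-1), r(k), r(k-1)])

# substitute d(n-1) = rho(n) r(n) with rho(n) = d(n-1)/r(n): reference-only TV filter
rho = d(k-1)/r(k)
w_tv = np.concatenate([w_d*np.conj(rho), np.zeros(1)]) + w_r
est_tv = np.conj(w_tv) @ np.array([r(k), r(k-1)])
```

Both estimates coincide at every time index, illustrating that the TV reference-only filter lies on the manifold of filters producing the two-channel WF estimate.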
9.5.2 Alternative Single-Channel TV Equivalent to the Two-Channel LTI WF
The above TV WF equivalent is not unique unless L = M = 1. In this section, we show that under the same assumption as in Section 9.5.1, that is, L = M − 1, a different choice for linking the elements of the desired (auxiliary) channel vector and the elements of the reference channel vector leads to an alternative TV WF equivalent. The original and alternative produce the same estimates but exhibit distinctly different weight vector behavior. In Section 9.6 we will show how NLMS resolves the ambiguity of picking a weight vector from a manifold of equivalent solutions.
The alternative rotation sequence {κ(n)} is defined as follows:

    d(n−1) = κ(n) r(n).        (9.44)

Applying Eqn. (9.44) to each element of the desired channel vector gives

    d(n−1) = [D_κ(n)  0] r(n),   D_κ(n) = diag(κ(n), ..., κ(n−L+1)).        (9.45)
    d̂(n) = [w_d,WF(L,M)^H  w_r,WF(L,M)^H] [d(n−1); r(n)]
         = (w_d,WF(L,M)^H [D_κ(n)  0] + w_r,WF(L,M)^H) r(n)        (9.46)
         = w̃_TVWF(0,M)^H(n) r(n).
Consequently, the last equality in Eqn. (9.46) defines w̃_TVWF(0,M)(n), an alternative TV equivalent filter that also produces the optimal TI two-channel WF estimate in terms of only r(n), the reference-input partition of u(n).
The optimal two-channel WF is LTI and uses both the desired and reference channels as input, while the equivalent (and equally optimal) filter is TV and uses only the reference channel as input. Again, D_κ(n) in Eqn. (9.46) represents the only TV aspect of this TV WF equivalent. Note that the first weight vector element of w_TVWF(0,M)(n) in Eqn. (9.43) is always constant, while for w̃_TVWF(0,M)(n) in Eqn. (9.46) it is the last weight vector element that is always constant. However, while exhibiting different weight vector behavior, these reference-only TV WF alternatives are equivalent in that each of these weight vectors lies in the manifold of filters that produce the same WF estimate.
9.5.3 Nonuniqueness of TV WF Equivalents
We have shown two equivalent TV WFs in Sections 9.5.1 and 9.5.2. These alternatives were associated with different ways to link a particular element of the desired channel vector d(n−1) with a particular element of the reference channel vector r(n). After the initial choice linking, for example, d(n−1) with r(n−1) by using ρ(n−1), we used delayed versions of the rotation sequence to link the correspondingly delayed elements of the desired and reference channel vectors.
For the purpose of finding TV equivalent filters, we can generally define rotation sequences that link a particular element of the desired (auxiliary) channel vector with any particular element of the reference channel vector, and then use that same linking sequence to substitute for the other elements in the desired (auxiliary) channel vector with the corresponding element of the reference channel vector. Such
a linkage gives, for the element d(n−L), the following possibilities:

    d(n−L) = ρ^(L)(n) r(n)
    d(n−L) = ρ^(L−1)(n−1) r(n−1)
      ⋮
    d(n−L) = ρ^(L−M+1)(n−M+1) r(n−M+1).        (9.47)
For each of the L taps in the desired channel, M rotation sequences were defined, one for each tap in the reference channel, thereby removing the earlier restriction of assuming that L = M − 1. These linking sequences are uniquely defined by their superscript, which reflects the shift between the reference sequence and the desired sequence that define it. The linking sequences defined in Eqns. (9.41) and (9.44) are now recognized as ρ^(0)(n) and ρ^(1)(n), respectively.
Let's first take a closer look at how these linking sequences may be useful in narrowband scenarios, with processes governed by Eqn. (9.2). The linking sequence ρ^(0)(n) indicates how to operate on r(n) to get d(n), while ρ^(1)(n+1) indicates how to operate on r(n+1) to get d(n). In Figure 9.26 some of the linking sequences are illustrated.
Let's assume that, at time n, we have the following linking relationship between the desired signal d(n) and the reference signal r(n):

    d(n) = ρ^(0)(n) r(n).        (9.48)
Based on the propagation dictated by the AR(1) processes as in Eqn. (9.2), we have at time n+1 the following relationships for the desired signal d(n+1) and the reference signal r(n+1):

    d(n+1) = d̃(n+1) + v_d(n+1)
           = p_d d̃(n) + v_0(n+1) + v_d(n+1)
           ≈ p_d d(n)
    r(n+1) = r̃(n+1) + v_r(n+1)        (9.49)
           = p_r r̃(n) + v_0(n+1) + v_r(n+1)
           ≈ p_r r(n),

where in the next-to-last step of each, we have used the fact that the driving noise is small relative to the AR(1) process itself when its pole radius is close to 1 and the fact that the measurement noise is small. For the narrowband ANC scenario in Section 9.2, these assumptions are reasonable, other than close to zero-crossings, where they are no longer valid. For purely exponential processes the relationships in Eqn. (9.49) are in fact exact.

Figure 9.26
Consequently, the following approximate relationship between ρ^(0)(n+1) and ρ^(0)(n) is valid most of the time in the narrowband ANC case:

    ρ^(0)(n+1) = d(n+1)/r(n+1)
               ≈ (p_d d(n))/(p_r r(n))
               = p_d p_r^{−1} ρ^(0)(n)        (9.50)
               = (|p_d|/|p_r|) e^{j(ω_d−ω_r)} ρ^(0)(n).
Similarly, for ρ^(1)(n+1) we have

    ρ^(1)(n+1) = d(n)/r(n+1)
               ≈ (p_d d(n−1))/(p_r r(n))
               = p_d p_r^{−1} ρ^(1)(n)        (9.51)
               = (|p_d|/|p_r|) e^{j(ω_d−ω_r)} ρ^(1)(n).
Note that, under the narrowband assumptions above, all the linking sequences for
different shifts behave the same way. Under these circumstances, the behavior of the
TV aspects in Eqns. (9.43) and (9.46) is actually the same (though operating on a
different dimension of the weight vector).
As illustrated in Figure 9.26, there is also an approximate relationship between the different linking sequences:

    ρ^(1)(n+1) = d(n)/r(n+1)
               ≈ d(n)/(p_r r(n))        (9.52)
               = p_r^{−1} ρ^(0)(n).
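For purely exponential (noise-free) processes the relationships in Eqns. (9.50)-(9.52) are exact, which is easy to confirm numerically; pole angles and initial conditions below are hypothetical:

```python
import numpy as np

# unit-radius poles: pure exponential (noise-free) processes, values hypothetical
pd_, pr_ = np.exp(1j*2*np.pi*0.20), np.exp(1j*2*np.pi*0.15)
d = lambda k: 2.0*np.exp(1j*0.3)*pd_**k     # desired exponential
r = lambda k: 3.0*np.exp(1j*0.9)*pr_**k     # reference exponential

rho0 = lambda k: d(k)/r(k)      # rho^(0)(n): links d(n) to r(n)
rho1 = lambda k: d(k-1)/r(k)    # rho^(1)(n): links d(n-1) to r(n)

k = 7
lhs50, rhs50 = rho0(k+1), (pd_/pr_)*rho0(k)   # Eqn (9.50), exact here
lhs51, rhs51 = rho1(k+1), (pd_/pr_)*rho1(k)   # Eqn (9.51), exact here
lhs52, rhs52 = rho1(k+1), rho0(k)/pr_         # Eqn (9.52), exact here
```

With driving and measurement noise present, the same identities hold only approximately, and fail near zero-crossings of r(n), as noted above.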
Assuming that L and M are chosen large enough, enumerating all possible linking sequences involved in substituting for the desired channel vector elements with reference channel vector elements, we find the set from ρ^(L)(n) to ρ^(−(M−2))(n). For each of these linkages, we can determine the corresponding TV WF equivalent. Each of these TV WF equivalents operates on the same reference vector, is a reference-only filter, and produces the same estimate as the unique and optimal LTI two-channel WF. Consequently, any linear combination of any of these TV WF equivalent filters will produce that same optimal estimate, as long as the sum of linear combination weights equals 1. The TV WF equivalents are nonunique and make up the manifold of solutions that produces the optimal WF estimate.
We will next use the above linkage relationships to establish the exact and approximate, respectively, time-varying WF targets for NLMS in the exponential and narrowband AR(1) ANC scenarios. The question we then address is whether there is a specific target solution determined by the multitude of possibilities indicated above. In Section 9.6 we will provide the answer to that question for the class of WSS sinusoidal processes. The narrowband AR(1) process case will be addressed in Section 9.7.
9.6
Having seen the multitude of alternative TV WF equivalents to the optimal LTI two-channel WF, we now illustrate the above findings by applying them in the context of WSS exponential processes. The specific context of the ANC scenario in Figure 9.1 will be used.

9.6.1

Referring back to the signal generator in Figure 9.3, the noise-free desired and reference processes, {d̃(n)} and {r̃(n)}, respectively, are now governed by the homogeneous difference equations corresponding to Eqn. (9.2):

    d̃(n) = p_d d̃(n−1)
    r̃(n) = p_r r̃(n−1).        (9.53)
The WSS exponential processes are the zero-input responses of these systems, starting from the appropriate random initial conditions. The frequencies and amplitudes of the complex sinusoids are assumed fixed. The following parameterization then applies:

    p_d = e^{jω_d}
    p_r = e^{jω_r}
    d̃(0) = A_d e^{jφ_d},  φ_d ~ U[0, 2π)        (9.54)
    r̃(0) = A_r e^{jφ_r},  φ_r ~ U[0, 2π)
    φ_d, φ_r statistically independent.

For the noiseless desired and reference processes this leads to the following explicit expressions:

    d(n) = A_d e^{jφ_d} e^{jω_d n}
    r(n) = A_r e^{jφ_r} e^{jω_r n}.        (9.55)
In the two-channel ANC scenario, our goal is to estimate the desired process from its past and from causally past values of the reference signal. For the sake of simplicity, first select L = 1 and M = 1. The estimate for d(n) is then written as follows:

    d̂(n) = w_WF(1,1)^H u(n) = w_WF(1,1)^H [d(n−1); r(n)].        (9.56)
We are seeking the LTI two-channel WF solution that produces the desired signal d(n). To that end, we note from Eqn. (9.55) the following:

    d(n) = A_d e^{jφ_d} e^{jω_d n}
         = A_d e^{jφ_d} e^{jω_d(n−1)} e^{jω_d}        (9.57)
         = e^{jω_d} d(n−1).

Using the latter to substitute for d(n−1) in Eqn. (9.56) produces the LTI WF solution:

    d̂(n) = w_WF(1,1)^H [e^{−jω_d} d(n); r(n)]
          = [e^{jω_d}  0] [e^{−jω_d} d(n); r(n)]        (9.58)
          = d(n).
We are interested in the reference-only equivalent to the LTI WF solution. We use the linking sequence ρ^(1)(n), as defined in Eqn. (9.47), to substitute for d(n−1) with r(n). This leads to the following, which is a special case of Eqn. (9.46):

    d̂(n) = w_WF(1,1)^H [d(n−1); r(n)]
          = [e^{jω_d}  0] [ρ^(1)(n) r(n); r(n)]
          = e^{jω_d} ρ^(1)(n) r(n)        (9.59)
          = w_TVWF(0,1)^H(n) r(n).
The linking sequence ρ^(1)(n) evaluates explicitly as

    ρ^(1)(n) = d(n−1)/r(n)
             = (A_d e^{jφ_d} e^{jω_d(n−1)})/(A_r e^{jφ_r} e^{jω_r n})        (9.60)
             = (A_d/A_r) e^{j(φ_d−φ_r)} e^{j(ω_d−ω_r)n} e^{−jω_d}.
Substituting the latter in w_TVWF(0,1)(n), as defined in the last equality of Eqn. (9.59), yields the following explicit expression for the equivalent TV WF:

    w_TVWF(0,1)(n) = (e^{jω_d} ρ^(1)(n))^H
                   = (A_d/A_r) e^{−j(φ_d−φ_r)} e^{−j(ω_d−ω_r)n}.        (9.61)
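Equation (9.61) can be verified directly: applying the Hermitian of this scalar TV weight to the reference process reproduces the desired process sample for sample. Amplitudes, phases, and frequencies below are hypothetical:

```python
import numpy as np

Ad, Ar = 2.0, 3.0                    # hypothetical amplitudes
phid, phir = 0.3, 0.9                # hypothetical initial phases
wd, wr = 2*np.pi*0.20, 2*np.pi*0.15  # hypothetical frequencies
n = np.arange(25)
d = Ad*np.exp(1j*(phid + wd*n))      # desired sinusoid, Eqn (9.55)
r = Ar*np.exp(1j*(phir + wr*n))      # reference sinusoid, Eqn (9.55)

# Eqn (9.61): scalar TV reference-only weight
w_tv = (Ad/Ar)*np.exp(-1j*(phid - phir))*np.exp(-1j*(wd - wr)*n)
est = np.conj(w_tv)*r                # w^H(n) r(n)
```

The estimate equals d(n) exactly at every n, confirming that the TV weight corrects both the amplitude/phase mismatch and the frequency offset between the two channels.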
How does this relate to AF behavior and performance? Recall that NLMS, for step-size equal to 1, adjusts the NLMS weight vector so that the a posteriori error equals 0. This means that the a posteriori weight vector w_AF(0,1)(n+1) produces the desired signal:

    d(n) = w_AF(0,1)^H(n+1) r(n)
    w_AF(0,1)^H(n+1) r(n) = d̂(n)        (9.62)
                          = (A_d/A_r) e^{j(φ_d−φ_r)} e^{j(ω_d−ω_r)n} r(n).
The final equality comes from comparing with Eqn. (9.59) and substituting Eqn. (9.61), thereby producing the unique correspondence between the optimal TV WF weight vector at time n and the AF weight vector at time n+1. In this example, we can then write an explicit expression for the weight behavior of NLMS, with step-size equal to 1, because the a posteriori weight vector in one iteration equals the a priori weight vector in the next iteration:

    w_AF(0,1)(n) = (A_d/A_r) e^{−j(φ_d−φ_r)} e^{−j(ω_d−ω_r)(n−1)}.        (9.63)
The one-step lag of the AF weight vector behind its TV target gives rise to the a priori estimation error associated with the AF, which tends to vanish as the frequency difference vanishes.
Recall that at iteration n we have the a priori weight vector w_AF(0,1)(n), and that the a posteriori weight vector w_AF(0,1)(n+1) follows from the weight update equation:

    w_AF(0,1)(n+1) = w_AF(0,1)(n) + (μ/(u^H(n) u(n))) e*(n) u(n)        (9.65)
                   = w_AF(0,1)(n) + (μ/(r^H(n) r(n))) [d(n) − w_AF(0,1)^H(n) r(n)]* r(n).
Using Eqn. (9.65) to substitute for the steady-state NLMS weight vector in the weight update equation,

    w_AF(0,1)(n+1) = w_AF(0,1)(n) + (μ/(r^H(n) r(n))) e*(n) r(n),        (9.66)

leads to the steady-state gain factor

    γ e^{jψ} = μ/(1 − (1 − μ) e^{−j(ω_d−ω_r)})        (9.67)

and the corresponding steady-state a priori error

    e(n) = [(1 − e^{−j(ω_d−ω_r)})/(1 − (1 − μ) e^{−j(ω_d−ω_r)})] d(n).        (9.68)
This expression for the steady-state error is valid for any realization; the factor that converts the desired signal into the error signal depends only on the frequency difference of the sinusoidal signals, and not in any way on their phases. Consequently, the mean-square value of the error in Eqn. (9.68) is also the MSE for the WSS sinusoidal process case. Note that for any nonzero step-size, the error goes to zero as the frequency difference goes to zero. The latter corresponds to the AF weight vector becoming the nonzero constant that corrects for the amplitude and phase difference between any set of sinusoidal process realizations. Also, keeping ω_r fixed and sweeping ω_d, notch filter behavior is observed [19], but without any restrictions on the desired and/or reference channel frequencies. An interesting observation is that for μ → 1 and ω_d − ω_r → π, we find e(n) → 2d(n); that is, the worst-case result of using NLMS is a 6 dB increase in error power over not having filtered at all.
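The steady-state factor in Eqn. (9.68), with the sign convention used above, can be checked against a direct simulation of a single-tap reference-only NLMS filter on complex sinusoids; step-size, amplitudes, phases, and frequencies below are hypothetical:

```python
import numpy as np

mu = 0.5                                   # hypothetical step-size
wd, wr = 2*np.pi*0.21, 2*np.pi*0.16        # hypothetical frequencies
n = np.arange(400)
d = 2.0*np.exp(1j*(0.7 + wd*n))            # desired complex sinusoid
r = 3.0*np.exp(1j*(1.1 + wr*n))            # reference complex sinusoid

w = 0.0 + 0.0j                             # single-tap reference-only NLMS
errs = np.zeros(len(n), dtype=complex)
for k in range(len(n)):
    errs[k] = d[k] - np.conj(w)*r[k]                     # a priori error
    w = w + mu*r[k]*np.conj(errs[k])/np.abs(r[k])**2     # NLMS update

delta = wd - wr
beta = (1 - np.exp(-1j*delta))/(1 - (1 - mu)*np.exp(-1j*delta))  # Eqn (9.68)
```

The ratio e(n)/d(n) converges geometrically (with factor |1 − μ|) to the constant beta, so the steady-state error factor is indeed phase-independent, as claimed above.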
9.6.2 Alternative Equivalent TV WF
In the previous section M = L = 1, and there was only one way to link the past desired value with the present reference value, resulting in a unique TV WF equivalent. Correspondingly, the NLMS weight vector solution was derived in straightforward fashion. For ML > 1 multiple TV WF equivalents exist.
For the same scenario as defined in Section 9.6.1, but with M = 2, there are now two elements in the reference vector partition (two taps in the reference channel delay line). In complete accordance with the linking sequence in Eqn. (9.60) we have already derived the following TV WF equivalent, as in Eqn. (9.59). It has merely been rewritten with the additional (here inactive) dimension corresponding to the second reference vector dimension:
    d̂(n) = w_WF(1,2)^H [d(n−1); r(n); r(n−1)]
          = [e^{jω_d}  0  0] [ρ^(1)(n) r(n); r(n); r(n−1)]
          = [e^{jω_d} ρ^(1)(n)  0] [r(n); r(n−1)]        (9.69)
          = w̃_TVWF(0,2)^H(n) r(n).
There is now a second way to link the desired channel with the reference channel, namely, by using the linking sequence ρ^(0)(n). The alternative to the development resulting in Eqn. (9.69) is the following, a special case of Eqn. (9.43):

    d̂(n) = w_WF(1,2)^H [d(n−1); r(n); r(n−1)]
          = [e^{jω_d}  0  0] [ρ^(0)(n−1) r(n−1); r(n); r(n−1)]
          = [0  e^{jω_d} ρ^(0)(n−1)] [r(n); r(n−1)]        (9.70)
          = ẁ_TVWF(0,2)^H(n) r(n).
Any linear combination of the two equivalents, with combination weights (1−α) and α, is again a TV WF equivalent:

    w_TVWF(0,2)(n) = (1−α) [(e^{jω_d} ρ^(1)(n))*; 0] + α [0; (e^{jω_d} ρ^(0)(n−1))*]
                   = a*(n) [(1−α); α e^{−jω_r}],        (9.71)
    a(n) = (A_d/A_r) e^{j(φ_d−φ_r)} e^{j(ω_d−ω_r)n}.
Uniqueness Resolved

Equation (9.71) provides the set of target solutions for the NLMS algorithm. For the present scenario this set is complete, since knowledge of d(n−1) is sufficient to completely determine the desired d(n). Actually, knowing d(n−l) for any positive l is sufficient, as following the above procedure, for any of these choices, leads to the same solution set given in Eqn. (9.71).
Recall that NLMS can be interpreted as finding the new weight vector that minimally differs from the current one. From Eqn. (9.71) we can write for the weight vector increment

    w_TVWF(0,2)(n+1) − w_TVWF(0,2)(n) = (a*(n+1) − a*(n)) [(1−α); α e^{−jω_r}].        (9.72)

The only part that depends on α is the vector on the right. The norm of this vector is minimized by the choice α = 0.5. Substituting α = 0.5 in Eqn. (9.71), incorporating the effect of μ, as given in Eqn. (9.67), and accounting for the AF always lagging one step behind its target gives the following expression for the a priori steady-state weight vector behavior of the reference-only NLMS AF with two input taps:
    w_AF(0,2)(n) = γ e^{−jψ} (A_d/A_r) e^{−j(φ_d−φ_r)} e^{−j(ω_d−ω_r)(n−1)} [0.5; 0.5 e^{−jω_r}],        (9.73)

where γ e^{jψ} is as given in Eqn. (9.67). Figure 9.27 shows the actual a posteriori weight vector behavior and the behavior in steady state, as governed by Eqn. (9.73) one step advanced, for NLMS step-size μ of 1 and 0.1, with A_d = 2, ω_d = π/3 + 0.05·2π, A_r = 3, ω_r = π/3, and random initial phases. We see that
Figure 9.27 Actual (x) and theoretical (o) NLMS weight behavior for μ = 1 [(a) and (c)] and μ = 0.1 [(b) and (d)]; A_d = 2, ω_d = π/3 + 0.05·2π, A_r = 3, and ω_r = π/3.
Figure 9.27 (continued)
steady state is reached more quickly and that the weight vector elements have larger amplitudes when μ = 1 (a, c) than when μ = 0.1 (b, d). Also, the steady-state weight behavior given in Eqn. (9.73) is verified by the actual NLMS result.
In Figure 9.28, using Eqn. (9.68), the corresponding actual and theoretical MSE behaviors are shown for μ = 1 (a) and μ = 0.1 (b). For small step-size, NLMS cannot follow the changes in the TV WF target weight vector and the steady-state error is therefore larger.
The steady-state output of the AF follows from Eqn. (9.73), using Eqn. (9.55):

    y(n) = w_AF(0,2)^H(n) u(n)
         = γ e^{jψ} (A_d/A_r) e^{j(φ_d−φ_r)} e^{j(ω_d−ω_r)(n−1)} [0.5  0.5 e^{jω_r}] [r(n); r(n−1)]
         = γ e^{jψ} A_d e^{jφ_d} e^{jω_d(n−1)} e^{jω_r}        (9.74)
         = γ e^{jψ} e^{jω_r} d(n−1).
In the next-to-last equality, we see that the steady-state output of the AF consists of a single frequency component, at frequency ω_d, confirming Glover's original heterodyning interpretation [19]. Any other frequency components in the AF output vanish, as they result from AF transient behavior. Recall from Eqn. (9.67) that μ → 1 produces γ e^{jψ} → 1, resulting in steady-state AF output y(n) = e^{jω_r} d(n−1), showing that the AF adjusts the desired signal from the previous step.
The above readily generalizes to the use of more reference vector elements (or delay line taps). For every increment in M, an element is added to the vector in the right-hand side of Eqn. (9.71). As a result of the next higher indexed and one sample further delayed rotation sequence, using Eqn. (9.52), each addition of an element contains an extra factor of e^{−jω_r}. The latter expresses a rotation of the added weight relative to the earlier weights. The multiple solutions, expressed by the corresponding elements, are all weighted equally to produce the minimum norm weight vector increment. Figure 9.29 shows the weight vector behavior for a 10-tap filter and μ = 1, with otherwise the same parameters as above. In Figure 9.29 we see that the weight with index 1 and the weight with index 7 have the same behavior. Weights with indices 2 and 8 also behave the same way, and so on. There is periodicity over the weight index, with period equal to 6. This corresponds to the factor e^{−jω_r} in the weight vector, as in this example ω_r = π/3. We also observe in each weight a period of 20 over the time index, due to ω_d − ω_r = 0.05·2π, corresponding to the e^{−j(ω_d−ω_r)n} term that all weight vector elements have in common. These periodic weight behaviors were originally observed by Glover [19].
Figure 9.30 shows a close-up of the weight-vector behavior in the 10-tap case. NLMS starts to adapt at time index 10. Starting from the zero vector, it only takes one iteration to get into steady-state behavior, because μ = 1. As explained, we see the behavior of only six different weight vector elements because of the periodicity over the weight index.
Figure 9.28
Figure 9.29 Weight behavior for 10-tap NLMS (μ = 1) in the sinusoidal ANC scenario: A_d = 2, ω_d = π/3 + 0.05·2π, A_r = 3, and ω_r = π/3.
Figure 9.30 Close-up of actual (x) and theoretical (o) weight behavior for 10-tap NLMS (μ = 1) in the sinusoidal ANC scenario: A_d = 2, ω_d = π/3 + 0.05·2π, A_r = 3, and ω_r = π/3.
Figure 9.31 shows the corresponding result for μ = 0.1. Only the real part of the weights is presented, as the imaginary part behaves similarly. From the small step-size result, we noted earlier that the transient behavior takes longer. However, after 100 iterations, the actual NLMS behavior and the theoretical steady-state behavior have become indistinguishable.
The error behavior for the two-tap filters above was shown in Figure 9.28. For the 10-tap filters this behavior is simply delayed by eight samples, corresponding to the delayed start of weight vector adaptation. That this is so can be seen from generalizing the following transition from the one-tap to the two-tap filter, based on Eqns. (9.65), (9.55), and (9.73):
    d̂_AF(0,1)(n) = w_AF(0,1)^H(n) r(n)
                 = γ e^{jψ} (A_d/A_r) e^{j(φ_d−φ_r)} e^{j(ω_d−ω_r)(n−1)} r(n)
                 = γ e^{jψ} (A_d/A_r) e^{j(φ_d−φ_r)} e^{j(ω_d−ω_r)(n−1)} [0.5 r(n) + 0.5 e^{jω_r} r(n−1)]
                 = γ e^{jψ} (A_d/A_r) e^{j(φ_d−φ_r)} e^{j(ω_d−ω_r)(n−1)} [0.5  0.5 e^{jω_r}] [r(n); r(n−1)]        (9.75)
                 = w_AF(0,2)^H(n) r(n)
                 = d̂_AF(0,2)(n).
The a priori steady-state estimates, and therefore the corresponding errors, are the same for the one-tap and two-tap AF. Consequently, the system function from the error signal to the AF output remains the same and is independent of M, the number of taps in the AF delay line.
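The equality of the one-tap and two-tap estimates in Eqn. (9.75) rests only on r(n) = e^{jω_r} r(n−1) for a complex sinusoid, so splitting any single weight into the pair [0.5, 0.5 e^{−jω_r}] leaves the estimate unchanged; the common weight factor and frequency below are hypothetical stand-ins for the scale terms in Eqn. (9.73):

```python
import numpy as np

wr = 2*np.pi*0.15                            # hypothetical reference frequency
r = lambda k: 3.0*np.exp(1j*(0.9 + wr*k))    # reference complex sinusoid
c = 0.8 - 0.3j                               # arbitrary common weight factor
k = 11

one_tap = np.conj(c)*r(k)                    # single-tap estimate
w2 = c*np.array([0.5, 0.5*np.exp(-1j*wr)])   # weight split across two taps
two_tap = np.conj(w2) @ np.array([r(k), r(k-1)])
```

The same splitting works for any number of taps, which is why the error behavior is independent of M while the individual weight amplitudes shrink as M grows.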
In Figure 9.27 we observe that the amplitudes of the real and imaginary components of the steady-state weights for the two-tap filter are 0.34 and 0.11 for μ = 1 and μ = 0.1, respectively. In Figures 9.30 and 9.31, for the 10-tap filter, these amplitudes have dropped to 0.062 and 0.021. Recall that NLMS minimizes the norm of the weight vector increment from one iteration to the next. While the a priori estimation error remains the same, as the number of taps used is increased the norm of the weight vector increment decreases.
9.7
We now test the TV optimal equivalent hypothesis for AR(1) processes. This represents a widening of the bandwidth relative to the WSS exponential processes, as well as the emergence of a driving term in the difference equations representing these processes. While for the exponential processes the linear prediction error is zero, this is no longer the case for AR(1) processes. While the stochastic nature of the input processes makes it difficult to describe the weight dynamics exactly, it will be seen that the TV WF equivalents still describe the observed NLMS weight behavior well.
Figure 9.31 Close-ups of the real part of actual (x) and theoretical (o) weight behavior for 10-tap NLMS (μ = 0.1) in the sinusoidal ANC scenario: A_d = 2, ω_d = π/3 + 0.05·2π, A_r = 3, and ω_r = π/3.
We now return to the ANC scenario in Section 9.2.2, with pole radii of 0.99 and frequencies 18° apart. The optimal performance of the reference, desired, and two-channel finite horizon optimal WFs was shown in Figure 9.16 for various horizons. We saw that when both reference and desired inputs are used, the performance rapidly approaches a limit, which is in fact the limit achieved by the corresponding infinite horizon WF. The best LTI WF in the Gaussian scenario is a filter that operates on a somewhat limited past of both the desired and reference inputs.
In order to demonstrate more easily the TV equivalent models, we use the scenario from Section 9.2.2, but with the SNR increased to 80 dB. For later reference, we thus have the following parameterization:
    p_d = 0.99 e^{jπ/3}
    p_r = 0.99 e^{j(π/3 + 0.05·2π)}
    SNR_d = 80 dB        (9.76)
    SNR_r = 80 dB.
Figure 9.32 shows a close-up of the WF performance for this scenario. We see that when there is (nearly) no observation noise, the optimal filter only requires two past values of d(n) and r(n) to reach a performance equal to that for the infinite horizon filter. In fact, in the truly noiseless scenario, optimal performance is obtained with one past value of d(n) and two past values of r(n). The latter is a direct result of the generating processes being AR(1) (only one past value is needed for its optimal prediction). The addition of observation noise causes the (noisy) desired and reference processes to become more and more ARMA [24]. Consequently, the equivalent AR processes approach being of infinite order. Depending on the SNRs, we can approximate these processes reasonably well with AR(p) processes of high enough order. Using the analytical technique described in Section 9.4.2 (resulting in Figure 9.16, for example), we can readily determine how much of a WF horizon is needed to get within an acceptable margin of the optimal performance.
For the scenarios reflected in Figures 9.16, 9.18, and 9.23, we observed that there is often a substantial performance gap between the reference-only WF and the desired-only or desired + reference WF. In this section we will outline the conditions under which the AF performance can approach that of the optimal desired + reference WF. We will first derive approximate reference-only TV WF equivalents to the two-channel WF, as discussed in Section 9.5. Due to the misadjustment and lag variance associated with the use of AF filtering techniques, only part of the performance gap can be bridged. Furthermore, there may not be any performance advantage when the time variations are too fast to be tracked by an AF.
While the overall optimal performance in Figure 9.32 is reached for L = 2, M = 2, the optimal MMSE is actually reached for L = 1, M = 2 (as indicated by the
Figure 9.32 Optimal WF performance for the (nearly) noiseless ANC scenario: p_d = 0.99 e^{jπ/3}, p_r = 0.99 e^{j(π/3+0.05·2π)}, SNR_d = SNR_r = 80 dB.
MMSE WF(L, 0) behavior). By making the SNR very high, we thus have a truly optimal WF at very low orders with which to demonstrate the optimal TV WF equivalents.
Now that we have established a scenario for which the optimal WF is w_WF(1,2) and this filter is LTI, we expect nice behavior (convergence to a tight neighborhood of the optimal LTI WF) from the corresponding AF, w_AF(1,2). Figure 9.33 shows the learning curve for the latter, together with the error behavior for the optimal filter, w_WF(1,2). Note that the AF does almost as well as the optimal WF. The discrepancy between the two is known as the misadjustment error, which for μ = 1 is generally close to 3 dB.
The weight behavior of the AF(1, 2) and WF(1, 2) filters is shown in Figure 9.34. The weight vector for WF(1, 2) is [0.4950 0.8574j  1  0.2058 0.9684j]. We see that the adaptive filter weights are almost indistinguishable from those of the WF. Only if we zoom in, as in Figure 9.35, do we see that the AF weights are actually varying somewhat. The random fluctuation behavior of the weights is responsible for the excess MSE seen in Figure 9.33. One might say that we get nice, desirable behavior. The NLMS AF converges to (a neighborhood of) the optimal solution in
its quest for minimizing the error under the constraint of minimal weight vector increments. The latter is eminently compatible with the existence of an LTI solution in this case.

Figure 9.33 Learning curves for the WF and AF: p_d = 0.99 e^{jπ/3}, p_r = 0.99 e^{j(π/3+0.05·2π)}, SNR_d = SNR_r = 80 dB.
9.7.2
Figure 9.34 Real (a) and imaginary (b) components of the AF(1, 2) and WF(1, 2) weights: p_d = 0.99 e^{jπ/3}, p_r = 0.99 e^{j(π/3+0.05·2π)}, SNR_d = SNR_r = 80 dB.
Figure 9.35
The first TV equivalent filter for this situation follows directly from Eqn. (9.43) by using L = 1 and M = 2. This gives us the following result:

    d̂(n) = w_WF(L,M)^H u(n)
          = [w_d,WF(L,M)^H  w_r,WF(L,M)^H] [d(n−1); r(n)]
          = (w_d,WF(L,M)^H [0  D_ρ(n−1)] + w_r,WF(L,M)^H) r(n)
          = (w_d,WF(1,2)^H [0  ρ^(0)(n−1)] + w_r,WF(1,2)^H) r(n)        (9.77)
          = ẁ_TVWF(0,2)^H(n) r(n).

Note that ẁ_TVWF(0,2)(n) has a first component that is constant, as it comes from the LTI WF(1, 2) exclusively. The second component, the one depending on ρ^(0)(n−1), is the only (potentially) TV component.
Figure 9.37 Real (a) and imaginary (b) components of the AF(0, 2) weights: p_d = 0.99 e^{jπ/3}, p_r = 0.99 e^{j(π/3+0.05·2π)}, SNR_d = SNR_r = 80 dB.
The second TV equivalent filter follows directly from Eqn. (9.46) and gives the following result:

    d̂(n) = w_WF(L,M)^H u(n)
          = [w_d,WF(L,M)^H  w_r,WF(L,M)^H] [d(n−1); r(n)]
          = (w_d,WF(L,M)^H [D_κ(n)  0] + w_r,WF(L,M)^H) r(n)
          = (w_d,WF(1,2)^H [ρ^(1)(n)  0] + w_r,WF(1,2)^H) r(n)        (9.78)
          = w̃_TVWF(0,2)^H(n) r(n).

Note that w̃_TVWF(0,2)(n) has a second component that is constant, as it comes from the LTI WF(1, 2) exclusively. Now the first component, the one depending on ρ^(1)(n), is the only (potentially) TV component.
Combining the results from Eqns. (9.77) and (9.78), and using Eqn. (9.52) to substitute for ρ^(0)(n−1) in Eqn. (9.77), we can now state the set of (approximate) TV WF equivalents that describes the manifold from which NLMS determines the a posteriori weight vector:

    w_TVWF(0,2)(n) = α (w_d,WF(1,2)* [0; ρ^(0)*(n−1)] + w_r,WF(1,2))
                     + (1−α) (w_d,WF(1,2)* [ρ^(1)*(n); 0] + w_r,WF(1,2))
                   = w_d,WF(1,2)* [(1−α) ρ^(1)*(n); α ρ^(0)*(n−1)] + w_r,WF(1,2)        (9.79)
                   = w_d,WF(1,2)* ρ^(1)*(n) [(1−α); α |p_r| e^{−jω_r}] + w_r,WF(1,2).
The first term on the right-hand side in Eqn. (9.79) is TV, following the behavior of the rotation sequence ρ^(1)(n). Both vector elements vary with the difference frequency when the approximation in Eqn. (9.51) is valid, and the second weight is offset by an angle corresponding to the reference frequency when the approximation in Eqn. (9.52) is valid.
Referring back to Figure 9.37, we see both of these weight vector behaviors. Note that our derivation was subject to approximations holding most of the time, a condition based on measurement noise being locally negligible with respect to the signal values; this pertains in particular to the reference signal values, as those show up in the denominator of our linking sequences. Note how the regularity of the TV weight behavior in Figure 9.37 is temporarily lost near sample 4950, where WF(0, 2) does temporarily better than AF(0, 2), as seen in Figure 9.36. When the signal is small relative to the noise, Eqn. (9.51) loses its validity and the semiperiodic weight behavior is disturbed, as reflected in the interval around sample 4925. Furthermore, in this example, a very short reference vector is being used in the reference-only AF (containing only two reference channel samples), which can easily cause a rather small reference vector norm for some instants. As a consequence, the NLMS weight update produces temporarily large disturbances of the weight vector.
In order to find the a posteriori target weight vector for NLMS from the manifold of solutions described in Eqn. (9.79), we next evaluate the weight vector increment:

    w_TVWF(0,2)(n+1) − w_TVWF(0,2)(n) = w_d,WF(1,2)* (ρ^(1)*(n+1) − ρ^(1)*(n)) [(1−α); α |p_r| e^{−jω_r}].        (9.80)

Assuming the rotation sequence difference to be constant, the norm squared of the weight vector increment has the following proportionality:

    ‖w_TVWF(0,2)(n+1) − w_TVWF(0,2)(n)‖² ∝ |1−α|² + |α|² |p_r|².        (9.81)

Writing the right-hand side in terms of the real and imaginary parts of α, and minimizing with respect to both, yields the optimal linear combination coefficient:

    α_opt = 1/(1 + |p_r|²).        (9.82)
Substituting in Eqn. (9.79) produces the a posteriori weight vector target for NLMS:

    w_TVWF(0,2)(n) = w_d,WF(1,2)* ρ^(1)*(n) (|p_r|/(1 + |p_r|²)) [|p_r|; e^{−jω_r}] + w_r,WF(1,2).        (9.83)
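The minimization behind Eqn. (9.82) can be confirmed with a direct scan of the cost in Eqn. (9.81); the minimizer is real, since the imaginary part of α only adds to the cost, so a scan over real α in [0, 1] suffices:

```python
import numpy as np

# cost from Eqn (9.81): |1 - a|^2 + |a|^2 * |pr|^2, scanned over real a
pr_mag = 0.99                                # pole radius from the scenario
alphas = np.linspace(0.0, 1.0, 100001)
cost = (1.0 - alphas)**2 + (alphas**2)*(pr_mag**2)
a_opt_scan = alphas[np.argmin(cost)]
a_opt_closed = 1.0/(1.0 + pr_mag**2)         # Eqn (9.82)
```

For |p_r| = 1, as in the exponential case, this reduces to α_opt = 0.5, consistent with the choice made below Eqn. (9.72).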
Comparing the weight vector in Eqn. (9.71) to that in Eqn. (9.83), we note that the former is explicit in terms of the parameters reflecting the exponential scenario, while the latter is implicit, as it contains the rotation sequence ρ^(1)(n). The latter determines the behavior of the TV aspect of the a posteriori weight vector target for NLMS. The stochastic nature of the temporal behavior of the linking sequence exemplifies the main difference between the deterministic and the stochastic narrowband WSS cases.
Figure 9.38 shows the behavior of the actual NLMS update (solid, varying) together with that of the hypothesized target model (dotted, varying) in Eqn. (9.83) and the reference portion of the LTI w_WF(1,2) = [0.4950 0.8574j  1  0.2058 0.9684j]^T (gray, constant). The latter indicates the values around which the TV weights vary, according to Eqn. (9.83), which, unlike in the exponential case, are now generally nonzero. We note that the hypothesized weight vector behavior, as
Figure 9.38 Real (a) and imaginary (b) components of the NLMS and hypothesized NLMS weights: p_d = 0.99 e^{jπ/3}, p_r = 0.99 e^{j(π/3+0.05·2π)}, SNR_d = SNR_r = 80 dB.
predicted from the manifold of TV equivalent WFs, follows the actual NLMS behavior quite well. While there appear to be discrepancies between the two from time to time, this is attributed to the relationships used in the derivation being approximate and valid most of the time. Note that NLMS for step-size μ = 1 produces an a posteriori estimate equal to the desired value (which is slightly noisy), while the hypothesized model aims to produce the Wiener solution provided by the two-channel LTI WF. Figure 9.39 shows these respective estimates. The estimates from the TV AF and WF equivalent filter are indistinguishable (* and · coincide) and nearly equal to the desired value (o), while the NLMS estimate is strictly equal to the desired value (because μ = 1). The a posteriori weights track the TV WF equivalent. More importantly, most of the time, these a posteriori weights are still relatively close to the TV WF equivalent weights at the next iteration (as seen in Fig. 9.38), resulting in an a priori error that is small relative to that produced by the LTI WF weights (as seen in Fig. 9.36).
Figure 9.40 shows the norm of the weight change vector during steady state for
the various solutions that were considered. The optimal TV WF, as expressed in
Eqn. (9.83), is observed to have a weight vector increment norm smaller than either
one of its two constituents, as given in Eqns. (9.77) and (9.78). Moreover, linearly
combining the latter, as in Eqn. (9.79), and numerically nding a to yield the
minimum of either the max, min, mean, or median of the norm of the weight vector
increments over the steady-state interval all yielded a very close to 0.5 and nearly
indistinguishable weight vector solutions.
As in the exponential case, the addition of more taps in the reference channel creates additional weight solutions, with the TV aspect modified by |p_r|e^{jω_r}, that is, shifted and with slightly smaller amplitudes. We can observe the shifting in Figure 9.7, where M = 3. In the latter case the SNRs were 20 dB, illustrating that it is the validity of Eqns. (9.51) and (9.52) in the vicinity of zero crossings, more than the SNR, that determines the weight behavior.
9.7.3
The narrowband scenario in Sections 9.7.1 and 9.7.2 supports the notion that it is the slowly varying TV equivalent optimal solution that is being tracked. It is relatively simple,
then, to hypothesize a very similar scenario in which the TV equivalent solution
varies rapidly. If we choose the following scenario for Figures 9.2 and 9.3,
p_d = 0.99 e^{jπ/3}
p_r = 0.99 e^{j(π/3 + 0.50·2π)}
SNR_d = 80 dB
SNR_r = 80 dB,          (9.84)
then the optimal WF performance graph looks as it does in Figure 9.41. Note that, although we have changed the pole angle difference dramatically (from 18° to 180°), there is still a large performance gap between the reference-only and two-channel WFs, so that one might benefit from the possible nonlinear effect of using an AF in
Figure 9.39 NLMS and hypothesized NLMS a posteriori estimates (a) and close-up (b): p_d = 0.99e^{jπ/3}, p_r = 0.99e^{j(π/3 + 0.05·2π)}, SNR_d = SNR_r = 80 dB; desired (o), AF(0, M) (*), WF(0, M) (solid), WFTV(0, M)opt (.).
Figure 9.40 Weight vector increment norms for various TV equivalents: p_d = 0.99e^{jπ/3}, p_r = 0.99e^{j(π/3 + 0.05·2π)}, SNR_d = SNR_r = 80 dB; the two constituent TV equivalents (solid gray, dotted gray) and their affine combination w_WFTV(0,2) (black).
this scenario. The manifold of TV equivalent filters is still described by Eqn. (9.83). The linking sequence is still defined as before and, for this scenario, specifically evaluates as follows from Eqn. (9.51):
r^{(1)}(n) = (|p_d| / |p_r|) e^{j(ω_d − ω_r)} r^{(1)}(n−1)
           = e^{−j0.50·2π} r^{(1)}(n−1)
           = −r^{(1)}(n−1).          (9.85)
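The sign flip in Eqn. (9.85) is easy to verify numerically. The sketch below is a noise-free stand-in for the 80 dB SNR case: it treats the two processes as pure damped exponentials p_d^n and p_r^n (an assumption of this sketch, not the chapter's AR(1) realizations) and takes the linking sequence as their sample-by-sample ratio:

```python
import numpy as np

pd = 0.99 * np.exp(1j * np.pi / 3)
pr = 0.99 * np.exp(1j * (np.pi / 3 + 0.50 * 2 * np.pi))

n = np.arange(20)
d = pd ** n          # noise-free narrowband limit: pure damped exponentials
r = pr ** n
k = d / r            # sample-by-sample linking sequence

ratios = k[1:] / k[:-1]
print(np.allclose(ratios, -1.0))   # True: the linking sequence flips sign each step
```

Because |p_d| = |p_r|, the ratio has constant magnitude, so the entire time variation is the per-sample rotation e^{j(ω_d − ω_r)} = −1.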
Substituting the latter in Eqn. (9.83) yields the following TV WF equivalent relative to some arbitrary steady-state time index n₀:

w_WFTV(0,2)(n) = w_d^{WF(1,2)} (−1)^{n−n₀} / |p_r| + [ |p_r| / (1 − |p_r|² e^{jω_r}) ] w_r^{WF(1,2)}.          (9.86)
The first term on the right-hand side is seen to change maximally from iteration to iteration.

Figure 9.41 Optimal WF performance for the fast TV scenario: p_d = 0.99e^{jπ/3}, p_r = 0.99e^{j(π/3 + 0.50·2π)}, SNR_d = SNR_r = 80 dB.

An example of the performance of the reference-only WF and reference-only NLMS AF is shown in Figure 9.42. In the latter, we now see that the TI
WF(0, 2) still performs close to its theoretical bound but that the AF(0, 2), while exhibiting the same overall error behavior, now has an error that is approximately 6 dB larger than that for the TI WF. Recall that this is the worst-case expectation for the exponential case with a frequency difference of π; that is, the behavior of the a priori error for the (nearly) noiseless narrowband AR(1) case is, for each iteration, close to that for the corresponding exponential scenario. Comparing with Figure 9.36, we note that the performance advantage of AF over WF has flipped into a comparable disadvantage.
Figure 9.43 shows that the two-channel AF performance is still Wiener-like and
similar to that in Figure 9.33. We observe that only the convergence rate seems to
have been affected, not the steady-state performance.
In Figure 9.44 the real part of the AF(1, 2) weights is shown, together with a zoomed version, as are the WF(1, 2) weights. The imaginary part of the weight vector behaves the same way. The WF(1, 2) weight vector for this scenario is [0.4950+0.8574j, 1, 0.4950+0.8574j]; that is, its first and last components are the same.
Figure 9.42 Reference-only WF and NLMS AF performance for the fast TV scenario: p_d = 0.99e^{jπ/3}, p_r = 0.99e^{j(π/3 + 0.50·2π)}, SNR_d = SNR_r = 80 dB.

An indication of weight vector tracking, corresponding to Figure 9.38, is now reflected in Figure 9.45. The weight vector for WF(0, 2) is [1, 0.5+0.8660j] and
therefore is actually close to w_r^{WF(1,2)}, the reference portion of WF(1, 2). The NLMS does not appear to track the optimal solution well in an absolute sense, since the a posteriori weight vector is not close to the hypothesized TV WF equivalent. However, the a posteriori NLMS weight vector still falls in the required manifold, as inferred from Figure 9.46, where it produces the desired a posteriori estimate. The difference between the actual and hypothesized weight vector behavior is transient in nature. Simultaneously, the a priori error has become large, as seen in Figure 9.42, because the a posteriori AF weight vector at time n is no longer close to the optimal TV target at time n + 1, as it lags behind by one time interval. The reference-only WF now performs better than its AF counterpart because the latter is subject to a large lag error, while the former is not. The key difference between NLMS for the scenario in Eqn. (9.76) versus the scenario in Eqn. (9.84) lies in the a priori weight vectors and the corresponding errors. While the a posteriori behaviors, in Figures 9.39 and 9.46, respectively, are very similar, the a priori errors are very different, as
shown in Figures 9.36 and 9.42, respectively. Figure 9.38 shows that the weights at time n are generally close to the weights at time n + 1 and vary about the reference portion of the Wiener solution, while in Figure 9.45 the weights at time n are not near the reference portion of the two-channel Wiener solution (and, in this case, also not near the WF(0, 2) solution).

Figure 9.43 AF(1, 2) and WF(1, 2) performance for the fast TV scenario: p_d = 0.99e^{jπ/3}, p_r = 0.99e^{j(π/3 + 0.50·2π)}, SNR_d = SNR_r = 80 dB.

Furthermore, the TV portion of w_TVWF(0,2) changes
its direction by 180° from one sample to the next. While the NLMS weight behavior
has the same features as its target solution, it is not tracking that target very well. The
fact that NLMS inherently lags one sample behind, since its tracking takes place a
posteriori, limits the parameterizations of the ANC scenario over which MSE
performance improvement can be observed.
9.8
After our detailed treatment of the non-Wiener behavior in the ANC cases of
Sections 9.5, 9.6, and 9.7, we can now more readily address the nonlinear effects
question for the ALP and AEQ cases. The major distinction from the ANC case lies
in the use of different auxiliary and/or reference processes. In the ALP case the
auxiliary vector contains the immediate past of the desired signal (as in the ANC
case), while the reference vector contains the far past of the desired signal. We have
seen in Section 9.4.5 that this had no impact on the form of the transfer function
model for NLMS. In the AEQ case the auxiliary vector contains the interference
signal, which is totally uncorrelated with the desired signal, and the reference vector
Figure 9.44 Real (a) and zoomed real (b) components of the AF(1, 2) and WF(1, 2) weights for the fast TV scenario of Figure 9.43.
Figure 9.45 Real (a) and imaginary (b) components of the AF(0, 2) weights for the fast TV scenario: p_d = 0.99e^{jπ/3}, p_r = 0.99e^{j(π/3 + 0.50·2π)}, SNR_d = SNR_r = 80 dB; w_AF(0,2)(n) (solid, varying), w_WFTV(0,2)(n) (dotted, varying), w_r^{WF(1,2)} (gray, constant).
Figure 9.46 A posteriori NLMS and hypothesized NLMS estimates for the fast TV scenario: p_d = 0.99e^{jπ/3}, p_r = 0.99e^{j(π/3 + 0.50·2π)}, SNR_d = SNR_r = 80 dB; desired (o), AF(0, 2) (*), WF(0, 2) (solid), w_WFTV(0,2) (.).
9.8.1
In the ALP scenario the input vector to the two-channel WF is as follows [2]:

u(n) = [ a(n) ; r(n) ].          (9.87)
The auxiliary vector a(n) is defined on the basis of the immediate past of the desired signal, while the reference vector contains the far past of the desired signal. This choice for the auxiliary vector is based on the knowledge that the best causal predictor for d(n) uses its most recent past:

a(n) = [ d(n−1) ; d(n−2) ; … ; d(n−L) ] ≡ d(n−1),

r(n) = [ d(n−D) ; d(n−D−1) ; … ; d(n−D−M+1) ] ≡ d(n−D).          (9.88)
At very high SNR, from Eqns. (9.2) and (9.3), the following relationship holds for an AR(1) desired process:

d(n) = p_d d(n−1) + v₀(n).          (9.89)
We recognize the first term on the right-hand side of Eqn. (9.89) to be the best one-step predictor for d(n) on the basis of its immediate past. That estimate engenders an MSE equal to the variance of v₀(n). If we use Eqn. (9.89) to replace d(n−1) on its right-hand side, we find the best two-step predictor, which engenders a larger MSE than the one-step predictor. Assuming that L = 1 and M = 2 in Eqn. (9.88), the
desired data can be written as having the following structure:

d(n) = p_d d(n−1) + v₀(n)
     = d̂(n) + v₀(n)
     = [ p_d  0  0 ] [ d(n−1) ; d(n−D) ; d(n−D−1) ] + v₀(n)
     = w_ar^H u(n) + v₀(n).          (9.90)
Since the variance of v₀(n) is the lowest possible MSE, a two-channel WF, of the form implied by the first right-hand term in Eqn. (9.90), would converge to the solution w_ar or its equivalent (if multiple solutions exist that produce the same MSE performance).
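The claim that the one-step predictor attains the innovation variance, and that the two-step predictor does worse, can be checked with a short Monte Carlo sketch (the pole value and unit innovation power are illustrative choices, not the chapter's scenario parameters):

```python
import numpy as np

rng = np.random.default_rng(1)
pd = 0.95 * np.exp(1j * np.pi / 3)       # AR(1) pole (illustrative value)
N = 200_000
# unit-variance complex innovation v0(n)
v = (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)

d = np.empty(N, dtype=complex)
d[0] = v[0]
for n in range(1, N):
    d[n] = pd * d[n - 1] + v[n]          # d(n) = p_d d(n-1) + v0(n)

e1 = d[1:] - pd * d[:-1]                 # one-step prediction error = v0(n)
e2 = d[2:] - pd**2 * d[:-2]              # two-step prediction error
print(np.var(e1))                        # ~1.0 (the innovation variance)
print(np.var(e2) > np.var(e1))           # True: two-step MSE is larger
```

The two-step error is v₀(n) + p_d v₀(n−1), with variance (1 + |p_d|²) times the innovation variance, which is the extra MSE the reference-only predictor must overcome.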
The earlier linking sequence concept will be used in order to see how a reference-only ALP can approach the performance associated with the optimal predictor. Based on Eqns. (9.87) and (9.88), the following linking sequences between desired and reference signals are of interest in the present case:
k^{(D−1)}(n−D) = d(n−1) / d(n−D),
k^{(D)}(n−D−1) = d(n−1) / d(n−D−1).          (9.91)
These linking sequences can be used to rewrite the optimal predictor from Eqn. (9.90):

d̂(n) = [ p_d  0  0 ] [ k^{(D−1)}(n−D) d(n−D) ; d(n−D) ; d(n−D−1) ]
      = [ p_d k^{(D−1)}(n−D)  0 ] [ d(n−D) ; d(n−D−1) ]
      = w̲_TVWF(0,2)^H(n) r(n).          (9.92)

Note that the end result represents a TV filter due to the linking sequence.
Alternatively, the optimal predictor can be rewritten as follows:

d̂(n) = [ p_d  0  0 ] [ k^{(D)}(n−D−1) d(n−D−1) ; d(n−D) ; d(n−D−1) ]
      = [ 0  p_d k^{(D)}(n−D−1) ] [ d(n−D) ; d(n−D−1) ]
      = w̄_TVWF(0,2)^H(n) r(n).          (9.93)
Consequently, the optimal predictor for the chosen scenario can be written as an affine linear combination of the above two TV equivalents to the optimal Wiener predictor:

d̂(n) = [ α w̲_TVWF(0,2)(n) + (1−α) w̄_TVWF(0,2)(n) ]^H r(n)
      = p_d [ α k^{(D−1)}(n−D)  (1−α) k^{(D)}(n−D−1) ] [ d(n−D) ; d(n−D−1) ]
      = w_TVWF(0,2)^H(n) r(n).          (9.94)
The particular behavior of this optimal predictor for the desired data, which can be interpreted as the closest thing to the structure of the desired data (meaning the lowest-MSE-producing model of any kind), depends on the behavior of the linking sequences.
Let h^{(D)}(n−D−1) denote the prediction error associated with predicting d(n−1) based on d(n−D−1), that is, a D-step predictor. The linking sequence behavior can then be written as follows:

k^{(D−1)}(n−D) = d(n−1) / d(n−D) = p_d^{D−1} + h^{(D−1)}(n−D) / d(n−D),
k^{(D)}(n−D−1) = d(n−1) / d(n−D−1) = p_d^{D} + h^{(D)}(n−D−1) / d(n−D−1).          (9.95)
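Eqn. (9.95) says each linking sequence hovers around a constant power of the pole, with excursions driven by the prediction error divided by the reference sample. A rough simulation is consistent with that; the pole, delay D, and use of the median (the ratio blows up near the zero crossings of d) are illustrative choices of this sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
pd = 0.99 * np.exp(1j * np.pi / 3)        # narrowband AR(1) pole (illustrative)
D = 3
N = 50_000
v = (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)

d = np.empty(N, dtype=complex)
d[0] = v[0] / np.sqrt(1 - abs(pd) ** 2)   # start near stationarity
for n in range(1, N):
    d[n] = pd * d[n - 1] + v[n]

# k^(D-1)(n-D) = d(n-1)/d(n-D): hovers around p_d^(D-1) except near
# small |d(n-D)| (the "zero crossings" of the narrowband process)
k = d[D - 1:-1] / d[: -D]                 # d(n-1)/d(n-D) for successive n
print(np.median(np.abs(k - pd ** (D - 1))))   # small for a narrowband process
```

The more narrowband the process (pole radius closer to one), the larger the typical |d(n−D)| relative to the prediction-error power, and the closer the linking sequence stays to its constant part.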
Substitution into the TV weight vector manifold, implied by the final equality in Eqn. (9.94), yields

w_TVWF(0,2)(n) = p_d* [ α k^{(D−1)}(n−D) ; (1−α) k^{(D)}(n−D−1) ]*
              = [ α ( p_d^D + p_d h^{(D−1)}(n−D)/d(n−D) ) ; (1−α) ( p_d^{D+1} + p_d h^{(D)}(n−D−1)/d(n−D−1) ) ]*.          (9.96)
With the reference-only input vector implied by Eqn. (9.94), that is,

r(n) = [ d(n−D) ; d(n−D−1) ],          (9.97)

the TI component of the weight vector manifold in Eqn. (9.96), obtained for α = 1 in the noise-free limit, is the D-step predictor

[ p_d^D ; 0 ]*.          (9.98)
The minimum-norm interpretation of NLMS, together with the time-varying nature of the structure that underlies the desired data, as given in Eqn. (9.94), gives NLMS the possibility of achieving a better predictor by combining D-step and (D+1)-step linear predictors, along the lines presented in Section 9.7.2 for the ANC case, in addition to the attempted tracking of the TV component of the data structure. Recall that, due to the equivalences above, the w_TVWF(0,2)(n) filter achieves the same minimal MSE as the TI WF(1, 2). However, the AF(0, 2) that aims to track w_TVWF(0,2)(n) will always be one step behind due to its a posteriori correction, and therefore will incur a tracking error in addition to misadjustment.
For the ALP scenario in Section 9.2.4, we showed in Section 9.4.3.3 that a substantial gap exists between the reference-only WF and the two-channel WF performance. The results in Figure 9.21 demonstrated the existence of nonlinear effects. For step-size μ = 0.7, which seems to be near-optimal for this scenario, the absolute errors of the WF(0, 2) and AF(0, 2) filters are compared in Figure 9.47 over the steady-state interval from iteration 4700 to 5000. We see that while the WF(0, 2) error fluctuates about its theoretically expected value, the AF(0, 2) error is generally less. The performance improvement realized by AF(0, 2) over this interval is 3.99 dB. For comparison, the performance improvement over MMSE WF(0, 2) realized by AF(1, 2) is 5.01 dB, while MMSE WF(1, 2) for this case is 7.53 dB better than MMSE WF(0, 2). Note that the AF(1, 2) performance suffers in this comparison because, for the step-size of 0.7, it incurs misadjustment error.
The behavior of the real and imaginary parts of the AF(0, 2) weights is shown in Figures 9.48a and 9.48b, respectively, for the same interval and realization reflected in Figure 9.47.
It is evident from Figures 9.47 and 9.48 that the performance improvement of AF(0, 2) over WF(0, 2) is paired with dynamic weight behavior. The TV WF equivalent to WF(1, 2), as given in Eqn. (9.96), suggests that further performance improvement would be obtained with AF(0, 2) if the TV aspect of TVWF(0, 2) were reduced. The latter resides in the prediction error variance, which can be reduced by making the process more narrowband. Repeating the above experiment with a pole radius of 0.99 rather than 0.95 produces an AF(0, 2) performance improvement of 3.93 dB over WF(0, 2). While this is slightly less than in the previous case, MMSE WF(1, 2) is now only 7.04 dB better than MMSE WF(0, 2), so that a larger fraction of the possible performance improvement has actually been realized. Figure 9.49 shows the error comparison. The AF(0, 2) error is observed to generally be less than the WF(0, 2) error, which conforms nicely to its expected value. The weight behaviors for this more narrowband ALP case are shown in Figures 9.50a and 9.50b. We observe that the time variation of the AF(0, 2) weights is less than it was for the earlier, wider-band ALP example. This behavior confirms that relatively
Figure 9.47 Absolute errors of the WF(0, 2) and AF(0, 2) filters over the steady-state interval for the ALP scenario.
9.8.2
As argued in Section 9.4.3.4, in the AEQ scenario the input vector to the two-channel WF is as follows:

u(n) = [ a(n) ; r(n) ].          (9.99)

The auxiliary vector a(n) is defined on the basis of the interference signal, while the reference vector contains the desired signal (QPSK in our example) additively contaminated by narrowband AR(1) interference and white Gaussian measurement noise. Recall that the interference is strong relative to the desired signal and that the measurement noise is weak relative to the desired signal. Our interest is in the center tap:
Figure 9.48 Real (a) and imaginary (b) parts of the AF(0, 2) weights for the ALP scenario.

a(n) = [ i(n−D+L̃) ; … ; i(n−D) ; … ; i(n−D−L̃) ],
r(n) = [ r(n−D+M̃) ; … ; r(n−D) ; … ; r(n−D−M̃) ].          (9.100)
The number of taps in the auxiliary and reference channels now satisfies the relations L = 2L̃ + 1 and M = 2M̃ + 1, respectively.
The choice of the interference as the auxiliary vector is based on the knowledge that the best estimate for the desired signal, x(n−D), results from removing the interference signal i(n−D) from the observed reference signal r(n−D). The latter reveals that the best structure to represent the underlying desired data is a two-channel structure [3, 4]:

x̂(n−D) = [ 0  −1  0   0  1  0 ] [ i(n) ; r(n) ]
        = −i(n−D) + r(n−D)
        = x(n−D) + v(n−D).          (9.101)
Note that in this ideal case, only the center elements of the auxiliary and reference
vectors are used. While this model is useful for guiding our direction, it is not
directly usable in practice, as the interference channel is not measurable in the AEQ
application. Nevertheless, the corresponding two-channel WF will provide an upper
bound on attainable performance, as it did in the ANC and ALP cases.
Figure 9.49 Error comparison of WF(0, 2) and AF(0, 2) for the narrowband ALP example.
Based on Eqn. (9.101), we can write the following structure for the desired signal:

x(n−D) = [ −0.9968  0  0.9968  0 ] [ i(n−D) ; r(n−D+1) ; r(n−D) ; r(n−D−1) ] + ε₁(n−D)
       = w_ir^H u(n) + ε₁(n−D).          (9.102)
The structure given in Eqn. (9.102) is the two-channel WF for the AEQ scenario of the previous sections, with p_i = 0.9999e^{jπ/3}, SNR = 25 dB, and SIR = −20 dB. For simplicity of representation, we have chosen L = 1 and M = 3. Note that the desired signal structure in Eqn. (9.102) is of the form of that in Eqn. (9.10), and it has been verified that the corresponding AF(1, 3) yields the corresponding target for small NLMS step-size, that is, weights converging to the TI WF(1, 3) weights and performance approaching the optimal MSE performance.
As was done for the ANC and ALP cases, we define a set of linking sequences in order to derive a WF equivalent to the above that uses reference inputs only. As
Figure 9.50 (a) Real part of AF(0, 2) weights for the narrowband ALP scenario.
dictated by the structure in Eqn. (9.102), the following linking sequences are defined:

ℓ^{(1)}(n−D+1) = i(n−D) / r(n−D+1),
ℓ^{(0)}(n−D) = i(n−D) / r(n−D),
ℓ^{(−1)}(n−D−1) = i(n−D) / r(n−D−1).          (9.103)
For any set of affine combination coefficients α = [α₁ α₀ α₋₁]^T, the interference sample can then be written as

i(n−D) = α₁ ℓ^{(1)}(n−D+1) r(n−D+1) + α₀ ℓ^{(0)}(n−D) r(n−D) + α₋₁ ℓ^{(−1)}(n−D−1) r(n−D−1),          (9.104)

Figure 9.50 (b) Imaginary part of AF(0, 2) weights for the narrowband ALP scenario.

provided the coefficients form an affine combination:

α^T [ 1 ; 1 ; 1 ] = α₁ + α₀ + α₋₁ = 1.          (9.105)
Substituting for i(n−D) in Eqn. (9.102), using Eqn. (9.104), yields the equivalent reference-only WF structure for the desired signal:

x(n−D) = w_TVWF(0,3)^H(n) r(n) + ε₁(n−D),

w_TVWF(0,3)(n) = 0.9968 [ −α₁ ℓ^{(1)}(n−D+1) ; 1 − α₀ ℓ^{(0)}(n−D) ; −α₋₁ ℓ^{(−1)}(n−D−1) ]*,          (9.106)

r(n) = [ r(n−D+1) ; r(n−D) ; r(n−D−1) ].          (9.107)
Using the AR(1) model for the interference, i(n−D) = p_i i(n−D−1) + v_i(n−D), the linking sequence toward the past evaluates as

ℓ^{(−1)}(n−D−1) = i(n−D) / r(n−D−1)
               = [ p_i i(n−D−1) + v_i(n−D) ] / [ i(n−D−1) + x(n−D−1) + v_r(n−D−1) ]
               = p_i + h^{(−1)}(n−D−1).          (9.108)

The same AR(1) relation can be used to write the past in terms of the future and an innovation term:

ℓ^{(1)}(n−D+1) = i(n−D) / r(n−D+1)
              = p_i^{−1} [ i(n−D+1) − v_i(n−D+1) ] / [ i(n−D+1) + x(n−D+1) + v_r(n−D+1) ]
              = p_i^{−1} + h^{(1)}(n−D+1).          (9.109)

Similarly, for the center tap,

ℓ^{(0)}(n−D) = i(n−D) / r(n−D) = 1 + h^{(0)}(n−D).          (9.110)
All three linking sequences have thus been written as a constant contaminated by a noise process, so that the TVWF(0, 3) in Eqn. (9.106) can be written in corresponding terms:

w_TVWF(0,3)(n) = 0.9968 [ −α₁ p_i^{−1} − α₁ h^{(1)}(n−D+1) ; 1 − α₀ − α₀ h^{(0)}(n−D) ; −α₋₁ p_i − α₋₁ h^{(−1)}(n−D−1) ]*.          (9.111)
Note that the constant terms in the above weight vector undergo a rotation that depends on the pole of the interference process. Generalizing the above, allowing M to increase, results in a TI weight component proportional to the pole of the interference process raised to a power equal to the distance of the element from the center tap. The effect of such a component, operating on the corresponding element of the reference vector, constitutes an estimate of the interference signal at the center tap. In fact, for a particular set of affine combination coefficients, the TI component coincides with the WF(0, M) solution.
In Figure 9.51 the WF(0, 51) weights are shown, together with the AF(0, 51) weights during steady-state iterations 5000 through 10,000. As in Section 9.2.3, an NLMS step-size of 0.8 is used.
We observe in Figure 9.51 that the weights do not change much over 5000 steady-state iterations (a uniformly spaced subset from 5000 successive iterations is overlaid). However, the AF(0, 51) weights do not coincide with the WF(0, 51) weights. As reported in Section 9.2.3, the performance of the AF is almost 5 dB better than the performance of the TI WF. If the experiment is repeated, the same behavior is observed, albeit centered about a different weight vector solution [4]. The latter shows that different solutions are appropriate, depending on the particular realization. An AF can converge to these appropriate solutions and track them. Recall that the step-size is a large 0.8, appropriate for tracking but not so appropriate for converging to a TI solution.
In Figures 9.52a and 9.52b the dynamic behavior of the real part of the AF(0, 51) weights is shown. The weights are seen to be changing in a slow, random-walk-like fashion, as predicted by the reference-only WF equivalent in Eqn. (9.111).
Figure 9.51 (a), (b) Real and imaginary parts of the WF(0, 51) and AF(0, 51) weights for the AEQ scenario.
While the very slowly varying weight behavior, for any given realization, almost suggests that a TI solution could be appropriate, using a time-averaged weight vector associated with one realization on a different realization generally results in MSE higher than that for the WF(0, 51) weights. Furthermore, as the best performance is realized at a large step-size, we must again reach the conclusion that it is the TV nature of NLMS that facilitates the tracking of the TV nature of the structure that underlies the desired data.
9.9 CONDITIONS FOR NONLINEAR EFFECTS IN
ANC APPLICATIONS
We now address the fundamental question as to the requisite conditions that lead to a significant nonlinear response when using an NLMS AF. In some applications the nonlinear effects are beneficial; in others they are not. As we have shown, however, the nonlinear effects can totally dominate performance in realistic conditions, and it is important to be able to predict when such nonlinear behavior is likely to occur.
9.9.1
Figure 9.51 (c) Zoomed-in view of real part of weights for the AEQ scenario.

In the sinusoidal scenarios treated in Section 9.6, the reference-only Wiener solution is the all-zero weight vector. In that case the MSE equals σ_d², the power in the desired signal. Consequently, any deviation in MSE from the desired signal power constitutes a nonlinear effect. The MSE in the exponential scenarios is completely governed by Eqn. (9.68). We define the normalized MSE σ̃_e² as follows:

σ̃_e² = σ_e² / σ_d² = |1 − e^{j(ω_d − ω_r)}|² / |1 − (1 − μ) e^{j(ω_d − ω_r)}|².          (9.112)
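Eqn. (9.112) is easy to evaluate directly. The sketch below checks two properties quoted in the surrounding discussion: for the 18° pole-angle difference, the normalized MSE decreases monotonically as μ grows toward 1, and at μ = 1 the reduction below σ_d² is about 10 dB (the step-size grid is an illustrative choice):

```python
import numpy as np

def normalized_mse(mu, dw):
    """sigma_e^2 / sigma_d^2 from Eqn. (9.112); dw = omega_d - omega_r."""
    num = abs(1 - np.exp(1j * dw)) ** 2
    den = abs(1 - (1 - mu) * np.exp(1j * dw)) ** 2
    return num / den

mu = np.linspace(0.05, 1.0, 20)
dw = 0.05 * 2 * np.pi                          # 18-degree frequency difference
nm = normalized_mse(mu, dw)
print(np.all(np.diff(nm) < 0))                 # True: MSE falls as mu grows toward 1
print(10 * np.log10(normalized_mse(1.0, dw)))  # about -10 dB at mu = 1
```

At μ = 1 the denominator is 1, so the normalized MSE reduces to |1 − e^{j(ω_d−ω_r)}|² = 2(1 − cos(ω_d − ω_r)), which shrinks as the pole angles approach each other.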
Figure 9.52 (a) Real part of center tap of AF(0, 51) for the AEQ scenario. (b) Real part of off-center taps of AF(0, 51) for the AEQ scenario.
For the ANC application, we have shown that the reference-only AF may
outperform the reference-only WF when there is a substantial gap in performance
between the reference-only WF and the two-channel WF (as analyzed in Section
9.4) and the TV equivalent to the two-channel LTI WF is substantially similar from
one time index to the next. The latter is a tracking condition, meaning that the a
priori AF weight vector is substantially in the direction of the a posteriori AF weight
vector (which is targeting the TV equivalent WF). The question now is whether we can predict when both of the former conditions will be satisfied. The analysis in Section 9.4 indicated that, generally, the processes need to be narrowband. The more narrowband the processes are, the better the MSE estimate defined by the transfer function model of Section 9.3.2 predicts the performance of NLMS. Consequently, in the narrowband ANC scenario, we may be able to use the MSE estimate from the transfer function model to determine when the reference-only AF is likely to outperform its LTI WF counterpart.
Each of the subsequent figures shows the same type of information for a variety of ANC scenarios. First, the solid black line at the bottom of each plot indicates the theoretical limit of performance, min MSE, which equals lim_{L,M→∞} MSE_WF(L, M). Above that are two sets of four graphs. The bottom set of four pertains to two-channel filters and the top set of four pertains to reference-only filters. The constant gray dot-dash line in the top set and the constant solid gray line in the bottom set show, respectively, the theoretical MSE expected for the M-tap reference-only WF, WF(0, M), and for the two-channel WF, WF(L, M). The gray symbols with bars indicate the mean and the 80 percent occurrence interval of the estimated MSE achieved by the designed WF(0, M) and WF(L, M) for 10 different realizations. Similarly, the black symbols and bars indicate the mean and the 80 percent occurrence interval of the estimated MSE achieved by AF(0, M) and AF(L, M) for the same realizations. The black nonconstant dotted and solid curves correspond to the MSE estimate evaluated according to the LTI model for reference-only NLMS (Section 9.3.2) and two-channel NLMS (Sections 9.4.4 and 9.4.5).
Figure 9.54a shows the results from 10 experiments for the scenario in Eqn. (9.76). Figure 9.54b shows the results for a comparable scenario after changing the SNRs to 20 dB. The MSE estimate from the transfer function model is shown to provide a good indication of the performance of the reference-only NLMS AF(0, 2), and an even better one for the two-channel NLMS AF(1, 2), for this case. The number of data points used in all of these simulation runs was 5000, explaining the relatively high MSE results for AF(1, 2) for small step-sizes, since the filter has not had sufficient time to converge at an SNR of 80 dB for μ < 0.5. The final 300 iterations were used to obtain the results for estimated MSE. Note that 5000 iterations provides for convergence at an SNR of 20 dB even at the smaller step-sizes. The performance of WF(0, 2) and WF(1, 2) is very close to their respective theoretically expected values. The nonlinear effects in AF(0, 2) are accurately predicted by the MSE estimate derived from the LTI transfer function model for NLMS AF(0, 2). From the difference between MMSE WF(0, 2) (top dot-dash line) and the MSE estimate for AF(0, 2) from the transfer function model (top dotted curve), we observe a maximum reduction in MSE (occurring for μ = 1), due to nonlinear effects, of about 9 dB. This figure is only slightly less than the 10 dB MSE reduction for the corresponding sinusoidal case (fifth curve from the bottom in Figure 9.53).
Note in Figure 9.54a, where SNR = 80 dB, that the MSE estimate variations are much larger for WF(0, 2) and AF(0, 2) than for WF(1, 2) and AF(1, 2). In the latter cases, the data pretty much satisfy the (1, 2) structure, thereby yielding MSE close to the minimum possible (with the higher result for AF(1, 2) due to misadjustment). For the (0, 2) cases the data no longer fit the model, which forces the error, engendered by the wrong model, to be higher. In Figure 9.54b, with SNR = 20 dB, both the
Figure 9.54 Simulation results and transfer function MSE estimates for the ANC scenario: p_d = 0.99e^{jπ/3}, p_r = 0.99e^{j(π/3 + 0.05·2π)}. (a) SNR_d = SNR_r = 80 dB. (b) SNR_d = SNR_r = 20 dB.
(1, 2) and (0, 2) variations are larger than in Figure 9.54a, due to the increased measurement noise; however, the increase is a relatively larger fraction of the randomness in the (1, 2) case.
Figure 9.55 provides a comparison of results that illustrate the effects of signal bandwidth, pole angle separation, and filter order. The signal bandwidth is reduced by approximately a factor of 10, the pole angle difference is decreased from 18° to 3°, and results are obtained for both (1, 2)- and (10, 25)-tap filters. Figure 9.55a reflects a narrower bandwidth than Figure 9.54b. The MSE estimate from the transfer function model shows similar reductions for both, about 9 dB, suggesting that MSE may not be very sensitive to bandwidth directly. In Figure 9.55b the frequency difference is smaller than in Figure 9.55a. We observe that the MSE estimate from the transfer function model is a good indicator of NLMS AF(0, 2) behavior and that the actual nonlinear effect comes within a few decibels of the lower bound MMSE. Another observation is that the nonlinear effect seems to saturate at approximately 10 dB and does so over a wide range of step-sizes. The maximum MSE reduction for the reference-only AF (over the reference-only WF) is approximately 17 dB, which is far short of the maximum 25 dB MSE reduction in the comparable exponential case. However, the latter would violate the absolute lower bound on MSE in the AR(1) situation. Another interesting observation linking the performance in the exponential scenario to the performance in the AR(1) scenario is that the shape of the MSE performance curves in Figures 9.54 and 9.55 is similar to the comparable ones for the exponential case in Figure 9.53.
The effect of increased orders, comparing Figures 9.55a and 9.55b with Figures 9.55c and 9.55d, seems to be mostly confined to the improved theoretical and actual performance of the (10, 25)-tap AF and WF. In each case, the absolute lower-bound performance is approximated more closely. The MSE estimate from the transfer function model again provides a good indicator of NLMS performance at both sets of filter orders and for both the single- and two-channel NLMS filters. An interesting observation, in the higher-order cases in Figures 9.55c and 9.55d, is that the transfer-function-model-based MSE estimate tends to overestimate the AF(10, 25) performance.
Figure 9.56 shows simulation results for the maximally TV scenario (pole angle difference of 180°) of Eqn. (9.84) and for SNRs of 80 and 20 dB. Note here how the reference-only MSE estimate from the transfer function model successfully indicates that NLMS performance will be worse than the corresponding WF performance. Recall that the transfer function model for MSE, as shown in Section 9.4.5, is entirely based on LTI system blocks. In the transfer function development there is no obvious connection to any TV behaviors. Again we observe that the nonlinear effect on MSE, an increase in this case, follows the shape of the curve for the exponential case, shown in Figure 9.53, for the corresponding parameterization. In fact, in this case, its magnitude is the same as well.
Figure 9.57 shows the performance results for the scenario in Eqn. (9.76) for (1, 2)-tap filters, but with a frequency difference of only 1.8°. At this small frequency difference, the MSE estimate from the transfer function model is saturated in Figure 9.57a. Figure 9.57b shows that when the bandwidth of the desired and reference processes is decreased, the saturation level of the MSE estimate from the transfer function model drops to about 25 dB below σ_d².
Figure 9.55 Simulation results for (1, 2)-tap [(a) and (b)] and (10, 25)-tap [(c) and (d)] filters: SNR_d = SNR_r = 20 dB. (a) p_d = 0.999e^{jπ/3}, p_r = 0.999e^{j(π/3 + 0.05·2π)}; (b) p_d = 0.999e^{jπ/3}, p_r = 0.999e^{j(π/3 + 2π/120)}; (c) p_d = 0.999e^{jπ/3}, p_r = 0.999e^{j(π/3 + 0.05·2π)}; (d) p_d = 0.999e^{jπ/3}, p_r = 0.999e^{j(π/3 + 2π/120)}.
Figure 9.55 (continued)
Figure 9.56 Order (1, 2) simulation results and TF MSE for the ANC scenario: p_d = 0.99e^{jπ/3}, p_r = 0.99e^{j(π/3 + 0.50·2π)}. (a) SNR_d = SNR_r = 80 dB; (b) SNR_d = SNR_r = 20 dB.
Figure 9.57 Simulation results and TF MSE for the modified Eqn. (9.76) scenario: SNR_d = SNR_r = 80 dB. (a) p_d = 0.99e^{jπ/3}, p_r = 0.99e^{j(π/3 + 0.005·2π)}; (b) p_d = 0.999e^{jπ/3}, p_r = 0.999e^{j(π/3 + 0.005·2π)}.
Note in Figure 9.57b that no AF convergence transients are observed, unlike with the earlier results at 80 dB SNR. The difference between the two behaviors lies in the starting weight vector. All earlier adaptive filters were started with the all-zero vector, while for illustrative purposes AF(1, 2) was started at WF(1, 2) to generate Figure 9.57b.
From the above simulation results, we observed that the MSE estimate from the transfer function model in the reference-only case tends to have the same behavior with step-size as the MSE for the exponential case, shown in Section 9.8.1. While in the noise-free exponential case the absolute lower bound on MSE equals zero, in the AR(1) case it is always strictly positive. In the AR(1) case the MSE estimates from the transfer function model, and actual performance, saturate at some level above the absolute lower bound for these WSS scenarios. The saturation phenomenon becomes more prominent as the (pole) frequency difference gets smaller. The level at which saturation occurs drops with reduction of the bandwidth of the reference and desired processes. The MSE performance results for the exponential case, together with the absolute lower bound on MSE, constitute a good indicator of performance for the reference-only AF.
9.9.3
As in the ANC case, a substantial gap between the reference-only and two-channel
WF performances is necessary for the reference-only AF to realize some of that
performance advantage. This was found to be the case in the examples provided in
Section 9.4.3.3.
For the examples used in Section 9.8.1, the performance can be summarized along the lines of Section 9.9.2. Figure 9.58 shows the various minimum, realized, and estimated MSEs for the ALP scenario of Section 9.8.1. We observe that the AF(0, 2) MSE performance is very much in line with the AF(0, 3) performance seen in Section 9.4.3.3 (Figure 9.21). A large gap is seen here between the WF(0, 2) and WF(1, 2) MMSEs. This condition is suggestive of AF(0, 2) performance improvement as long as the TV aspects of the equivalent TV WF(0, 2) can be tracked successfully. Clearly, some of the performance potential is being realized; in fact, about 2 dB out of the possible 7 dB.
In Section 9.8.1 we argued that the TV nature of the equivalent TV WF(0, 2) could be slowed by making the process more narrowband. Figure 9.59 shows the MSE performance for the narrowband ALP example of Section 9.8.1. We note in this case that, while the same absolute level of performance is reached, a larger fraction of the potential performance is now realized in going from WF(0, 2) to AF(0, 2). Approximately 4 dB of the maximum possible improvement of 7 dB is realized. As explained earlier, this is commensurate with a reduction in time variation for the reference-only equivalent WF.
(Figure 9.58: the minimum, realized, and estimated MSEs for the ALP scenario of Section 9.8.1.)

9.9.4

In Section 9.2.3 we showed that an AF(0, 51) could realize performance improvement over a WF(0, 51) in a narrowband interference-contaminated AEQ application. In Section 9.4.3.3 it was shown that an idealized two-channel WF could perform better than the WF(0, 51) for that scenario. The numerical results obtained
in Section 9.2.3 indeed reflected a performance improvement in AF(0, 51) over
WF(0, 51). The AF(0, 51) performance did not approach the performance of the
idealized two-channel WF.
In the idealized two-channel WF, the auxiliary channel contained the interference
signal itself. Consequently, the interference was provided to the WF(51, 51) without
error. In a somewhat more realistic scenario, the interference must be estimated,
thereby incurring estimation error. Since the interference model is known, its best
estimate is derived from its value at a tap next to the center tap by means of a one-step predictor. The latter would theoretically incur the innovation variance as prediction error variance. Therefore, in addition to simply subtracting the interference at the center tap, the observation noise variance is increased by the interference's innovation variance (both are white processes). This leads to a more
realistic performance bound, referred to as the ideal interference predictor. While,
again, the interference itself is not available for such a one-step predictor, the SIR
and SNR combine to make the interference the component that dominates the
reference signal. It seems not unreasonable, then, to substitute the reference signal
for use in interference prediction.
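This prediction-error floor is easy to see numerically. The sketch below uses hypothetical AR(1) parameters (not taken from the chapter) and checks that the optimal one-step predictor of an AR(1) process incurs exactly the innovation variance as its prediction error variance:

```python
import numpy as np

rng = np.random.default_rng(7)
a, innovation_std = 0.95, 0.3      # hypothetical AR(1) pole and innovation level
n = 500_000

w = innovation_std * rng.standard_normal(n)    # white innovation sequence
s = np.empty(n)
s[0] = w[0]
for i in range(1, n):
    s[i] = a * s[i - 1] + w[i]     # narrowband AR(1) "interference" process

pred = a * s[:-1]                  # optimal one-step predictor of s(n)
err_var = np.mean((s[1:] - pred) ** 2)
print(round(err_var, 3))           # close to the innovation variance 0.3**2 = 0.09
```

Because the prediction residual is exactly the innovation sequence, no predictor operating on past values alone can do better than this variance floor.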
The corresponding amendment of Figure 9.23 is shown in Figure 9.60.
(Figure 9.59: MSE performance for the narrowband ALP example of Section 9.8.1.)

9.10 SUMMARY
We have shown that nonlinear TV effects in AFs originate from the error feedback used in the weight update, as the error reflects the discrepancy between the desired data and the current model for that desired data. These TV effects have been shown to become prominent when applied in narrowband WSS ANC, ALP, and AEQ scenarios, particularly in cases where the spectral content of the reference and desired inputs to the filter are dissimilar. For such scenarios, it was shown that it is
For the ANC and ALP scenarios, we have shown that a manifold of optimal TV
reference-only WFs exists that forms the target for the a posteriori NLMS weight
vector. When the corresponding TV reference-only WF target weight vector can be
tracked reasonably well by NLMS, that is, when it is slowly TV, the AF may realize
a priori performance gain over the reference-only WF, which is TI. The conditions
under which nonlinear effects exist, as well as their magnitude, are given for
exponential ANC scenarios. For narrowband AR(1) ANC scenarios, we indicate
when prominent nonlinear effects can be expected. In the exponential ANC scenario
the linking sequence has constant amplitude and linear phase, while in the AR(1)
ANC scenario the linking sequence is at times nearly constant with linear phase.
Under this condition, the weight behavior is nearly periodic. It is also shown that the
linking sequence for the AR(1) input is subject to random fluctuations, which
become especially pronounced near zero crossings of the reference signal. The MSE
estimate provided by the linear TI transfer function model for NLMS provides a
good indication of performance.
The TV nonlinear effects observed in the narrowband interference-contaminated
AEQ scenario can be explained by the existence of a two-channel WF where the
auxiliary channel contains values of the narrowband interference. This forms an
upper bound on performance since the AF must generate interference channel
estimates solely from present and past values of the reference signal. In this case, the
nonlinear response is again shown to be associated with TV weight behavior.
However, there is now a TI component to the weights that dominates their
magnitudes.
ACKNOWLEDGMENT
The authors wish to express their sincere thanks to Ms. Rachel Goshorn of SSC for her gracious help, expertise, and effort in producing many of the figures in this chapter.

The first author acknowledges the support provided by the National Research Council, in awarding him a Senior Research Associateship at SPAWAR Systems Center, San Diego, during his Fall 2001 sabbatical there.
REFERENCES
1. A. A. (Louis) Beex, "Efficient generation of ARMA cross-covariance sequences," IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '85), pp. 327–330, March 26–29, 1985, Tampa, FL.
2. A. A. (Louis) Beex and James R. Zeidler, "Non-linear effects in adaptive linear prediction," Fourth IASTED International Conference on Signal and Image Processing (SIP 2002), pp. 21–26, August 12–14, 2002, Kauai, Hawaii.
3. A. A. (Louis) Beex and James R. Zeidler, "Data structure and non-linear effects in adaptive filters," 14th International Conference on Digital Signal Processing (DSP 2002), pp. 659–662, July 1–3, 2002, Santorini, Greece.
21. J. Han, J. R. Zeidler, and W. H. Ku, "Nonlinear effects of the LMS predictor for chirped input signals," EURASIP Appl. Signal Processing, Special Issue on Nonlinear Signal and Image Processing, Part II, pp. 21–29, January 2002.
22. M. Hayes, Statistical Digital Signal Processing and Modeling. Wiley, 1996.
23. S. Haykin, A. Sayed, J. R. Zeidler, P. Wei, and P. Yee, "Tracking of linear time-variant systems by extended RLS algorithms," IEEE Trans. Signal Processing, 45, 1118–1128, May 1997.
24. S. M. Kay, Modern Spectral Estimation: Theory and Applications. Prentice-Hall, 1988.
25. S. M. Kuo and D. R. Morgan, Active Noise Control Systems: Algorithms and DSP Implementations. New York: Wiley, 1996.
26. O. Macchi and N. J. Bershad, "Adaptive recovery of a chirped sinusoidal signal in noise: I. Performance of the RLS algorithm," IEEE Trans. Acoust., Speech, Signal Processing, ASSP-39, 583–594, March 1991.
27. O. Macchi, N. J. Bershad, and M. Mboup, "Steady state superiority of LMS over RLS for time-varying line enhancer in noisy environment," IEE Proc. F, 138, 354–360, August 1991.
28. J. E. Mazo, "On the independence theory of equalizer convergence," Bell Syst. Tech. J., 58, 963–993, May/June 1979.
29. D. R. Morgan and J. Thi, "A multi-tone pseudo-cascade filtered-X LMS adaptive notch filter," IEEE Trans. Signal Processing, 41, 946–956, February 1993.
30. S. Olmos and P. Laguna, "Steady-state MSE convergence of LMS adaptive filters with deterministic reference inputs with applications to biomedical signals," IEEE Trans. Signal Processing, 48, 2229–2241, August 2000.
31. K. J. Quirk, L. B. Milstein, and J. R. Zeidler, "A performance bound of the LMS estimator," IEEE Trans. Information Theory, 46, 1150–1158, May 2000.
32. M. Reuter, K. Quirk, J. Zeidler, and L. Milstein, "Nonlinear effects in LMS adaptive filters," Proceedings of Symposium 2000 on Adaptive Systems for Signal Processing, Communications and Control, pp. 141–146, 1–4 October 2000, Lake Louise, Alberta, Canada.
33. M. Reuter and J. R. Zeidler, "Nonlinear effects in LMS adaptive equalizers," IEEE Trans. Signal Processing, 47, 1570–1579, June 1999.
34. M. J. Shensa, "Non-Wiener solutions of the adaptive noise canceler with a noisy reference," IEEE Trans. Acoust., Speech, Signal Processing, ASSP-28, 468–473, August 1980.
35. D. T. M. Slock, "On the convergence behavior of the LMS and the normalized LMS algorithms," IEEE Trans. Signal Processing, ASSP-41, 2811–2825, September 1993.
36. S. A. Tretter, Introduction to Discrete-Time Signal Processing. Wiley, 1976.
37. H. L. Van Trees, Detection, Estimation, and Modulation Theory. Wiley, 1967.
38. B. Widrow, J. Glover, J. McCool, J. Kaunitz, C. Williams, R. Hearn, J. Zeidler, E. Dong, Jr., and R. Goodin, "Adaptive noise canceling: principles and applications," Proc. IEEE, 63, 1692–1716, December 1975.
39. B. Widrow, J. M. McCool, M. G. Larimore, and C. R. Johnson, Jr., "Stationary and nonstationary learning characteristics of the LMS adaptive filter," Proc. IEEE, 64, 1151–1162, August 1976.
10 ERROR WHITENING WIENER FILTERS: THEORY AND ALGORITHMS
10.1 INTRODUCTION
The mean-squared error (MSE) criterion has been the workhorse of linear optimization theory due to the simple and analytically tractable structure of linear least squares [16, 23]. In adaptive filter theory, the Wiener-Hopf equations are more commonly used owing to the extension of least squares to functional spaces proposed by Wiener [16, 23]. However, for finite impulse response filters (vector spaces) the two solutions coincide. There are a number of reasons behind the widespread use of the Wiener filter: Firstly, the Wiener solution provides the best possible filter weights in the least squares sense; secondly, there exist simple and elegant optimization algorithms like least mean squares (LMS), normalized least mean squares (NLMS), and recursive least squares (RLS) to find or closely track the Wiener solution in a sample-by-sample fashion, suitable for on-line adaptive signal processing applications [16]. There are also a number of important properties that help us understand the statistical behavior of the Wiener solution, namely, the orthogonality of the error signal to the input vector space and the whiteness of the prediction error signal for stationary inputs, provided that the filter is long enough [16, 23].
Least-Mean-Square Adaptive Filters, Edited by Simon Haykin and Bernard Widrow.
ISBN 0-471-21570-8 © 2003 John Wiley & Sons, Inc.

However, in a number of applications of practical importance, the error sequence produced by the Wiener filter is not white. One of the most important is the case of noisy inputs. In fact, it has long been recognized that these MSE-based filter optimization approaches are unable to produce the optimal weights associated with the noise-free input due to the biasing of the input covariance matrix [autocorrelation in the case of finite impulse response (FIR) filters] by the additive noise [33, 11]. Since noise is always present in real-world signals, the optimal filter weights offered by the MSE criterion
and associated algorithms are inevitably inaccurate; this might hinder the performance of engineering systems that require robust parameter estimates.

There are several techniques to suppress the bias in the MSE-based solutions in the presence of noisy training data [7, 42, 38, 18]. Total least squares (TLS) is one of the popular methods due to its principled way of eliminating the effect of noise on the optimal weight vector solution [17, 19, 20]. Major drawbacks of TLS are the requirements for accurate model order estimation, an identical noise variance in the input and desired signals, and the singular value decomposition (SVD) computations that severely limit its practical applicability [20, 33, 9, 11]. Total least squares is known to perform poorly when these assumptions are not satisfied [42, 33]. Another important class of algorithms that can effectively eliminate noise in the input data is subspace Wiener filtering [16, 23, 31]. Subspace approaches try to minimize the effect of noise on the solution by projecting the input data vector onto a lower-dimensional space that spans the input signal space. Traditional Wiener filtering algorithms are then applied to the projected inputs, which exhibit an improved signal-to-noise ratio (SNR). Many subspace algorithms are present in the literature; to mention all of them is beyond the scope of this chapter. The drawbacks of these methods include the need for proper model order estimation, increased computational requirements, and a sufficiently small noise power so that signal and noise can be discriminated during subspace dimensionality selection [31].
In this chapter, we will present a completely different approach to produce a (partially) white noise sequence at the output of Wiener filters in the presence of noisy inputs. We will approach the problem by introducing a new adaptation criterion that enforces zero autocorrelation of the error signal beyond a certain lag, hence the name error whitening Wiener filters (EWWF). Since we want to preserve the on-line properties of the adaptation algorithms, we propose to expand the error autocorrelation around a lag larger than the filter length using a Taylor series. Hence, instead of an error signal we end up with an error vector, with as many components as the terms kept in the Taylor series expansion. A schematic diagram of the proposed adaptation structure is depicted in Figure 10.1. The properties of this solution are very interesting, since it contains the Wiener solution as a special case, and for the case of two error terms, the same analytical tools developed for the Wiener filter can be applied with minor modifications. Moreover, when the input signal is contaminated with additive white noise, the EWWF produces the optimal
solution for the noise-free input signal, with the same computational complexity as the Wiener solution.

(Figure 10.1: Schematic diagram of the proposed adaptation structure.)
The organization of this chapter is as follows: First, we will present the motivation behind using the autocorrelation of the residual error signal in supervised training of Wiener filters. This will clearly demonstrate the reasoning behind the selected performance function, which will be called the error whitening criterion (EWC). Second, an analytical investigation of the mathematical properties of the EWWF and the optimal filter weight estimates will be presented. The optimal selection of parameters will be followed by demonstrations of the theoretical expectations on the noise-rejecting properties of the proposed solution through Monte Carlo simulations performed using analytical calculations of the necessary autocorrelation functions. Next, we will derive the recursive error whitening (REW) algorithm that finds the proposed error whitening Wiener filter solution using sample-by-sample updates in a fashion similar to the well-known RLS algorithm. This type of recursive algorithm requires $O(n^2)$ complexity in the number of weights. Finally, we address the issues with the development of a gradient-based algorithm for the EWWF. We will derive a gradient-based LMS-type update algorithm for the weights that will converge to the vicinity of the desired solution using stochastic updates. Theoretical bounds on the step size to guarantee convergence and comparisons with MSE counterparts will be provided.
10.2
The classical Wiener solution yields a biased estimate of the reference filter weight vector in the presence of input noise. This problem arises due to the contamination of the input signal autocorrelation matrix with that of the additive noise. If a signal is contaminated with additive white noise, only the zero-lag autocorrelation is biased by the amount of the noise power. Autocorrelations at all other lags still remain at their original values. This observation rules out MSE as a good optimization criterion for this case. In fact, since the error power is the value of the error autocorrelation function at zero lag, the optimal weights will be biased because they depend on the input autocorrelation values at zero lag. The fact that the autocorrelation at nonzero lags is unaffected by the presence of noise will prove useful in determining an unbiased estimate of the filter weights.
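This observation is easy to verify numerically. The following sketch (hypothetical signals, not from the chapter) colors white noise with a short FIR filter, adds white observation noise, and compares sample autocorrelations of the clean and noisy signals at a few lags:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Colored "clean" signal: white noise through a short FIR filter.
h = np.array([1.0, 0.5, 0.25])
e = rng.standard_normal(n)
clean = np.convolve(e, h, mode="full")[:n]
noise = 0.7 * rng.standard_normal(n)         # white, power 0.49
noisy = clean + noise

def autocorr(x, lag):
    """Sample autocorrelation E[x(n) x(n-lag)]."""
    if lag == 0:
        return np.mean(x * x)
    return np.mean(x[lag:] * x[:-lag])

# Lag 0 is shifted by (approximately) the noise power ...
bias = autocorr(noisy, 0) - autocorr(clean, 0)
print(round(bias, 2))                        # close to 0.49

# ... while higher lags are statistically unaffected.
for lag in (1, 2, 3):
    print(round(autocorr(noisy, lag) - autocorr(clean, lag), 2))
```

Only the zero-lag value moves by the noise power; the remaining lags differ from the clean values only by sampling fluctuation.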
10.2.1
The question that arises is, what lag should be used to obtain the true weight vector in the presence of white input noise? Let us consider the autocorrelation of the training error at nonzero lags. Suppose noisy training data of the form $(x(t), d(t))$ are provided, where $x(t) = \tilde{x}(t) + v(t)$ and $d(t) = \tilde{d}(t) + u(t)$, with $\tilde{x}(t)$ being the sample of the noise-free input vector at time $t$ (time is assumed to be continuous), $v(t)$ being the additive white noise vector on the input vector, $\tilde{d}(t)$ being the noise-free desired output, and $u(t)$ being the additive white noise on the desired output. Suppose that the true weight vector of the reference filter that generated the data is $w_T$ (moving
The observation that constraining the higher lags of the error autocorrelation function to zero yields unbiased weight solutions is quite significant. Moreover, the algorithmic structure of this new solution and the lag-zero MSE solution are still very similar. The noise-free case helps us understand why this similarity occurs. Suppose that the desired signal is generated by the following equation: $\tilde{d}(t) = \tilde{x}^T(t)\,w_T$, where $w_T$ is the true weight vector. Now multiply both sides by $\tilde{x}(t-\Delta)$ from the left and then take the expected value of both sides to yield $E[\tilde{x}(t-\Delta)\tilde{d}(t)] = E[\tilde{x}(t-\Delta)\tilde{x}^T(t)]\,w_T$. Similarly, we can obtain $E[\tilde{x}(t)\tilde{d}(t-\Delta)] = E[\tilde{x}(t)\tilde{x}^T(t-\Delta)]\,w_T$. Adding the corresponding sides of these two equations yields

$$E[\tilde{x}(t)\tilde{d}(t-\Delta) + \tilde{x}(t-\Delta)\tilde{d}(t)] = E[\tilde{x}(t)\tilde{x}^T(t-\Delta) + \tilde{x}(t-\Delta)\tilde{x}^T(t)]\,w_T. \qquad (10.2)$$
Now that we have described the structure of the solution, let us address the issue of training this new class of optimum filters that we call error whitening Wiener filters (EWWF). Adaptation exploits the sensitivity of the error autocorrelation with respect to the weight vector of the adaptive filter. We will formulate the solution in continuous time first for the sake of simplicity. If the support of the impulse response of the adaptive filter is of length $m$, we evaluate the derivative of the error autocorrelation function at lag $\Delta$ with respect to the weights, where $\Delta \ge m$. Assuming that the noises in the input and desired output are uncorrelated with each other and with the input signal, we get

$$\frac{\partial \rho_e(\Delta)}{\partial w} = \frac{\partial E[e(t)e(t-\Delta)]}{\partial w} = -2E[\tilde{x}(t)\tilde{x}^T(t-\Delta)]\,(w_T - w). \qquad (10.3)$$

The identity in (10.3) immediately tells us that the sensitivity of the error autocorrelation with respect to the weight vector becomes zero, that is, $\partial \rho_e(\Delta)/\partial w = 0$, if
(10.5)
Analyzing (10.5), we note another advantage of the Taylor series expansion, because the familiar MSE is part of the expansion. Note also that as one forces $\Delta \to L$, the MSE term will disappear and only the lag-$L$ error autocorrelation will remain. On the other hand, as $\Delta \to 0$, only the MSE term will prevail in the autocorrelation function approximation. Introducing more terms in the Taylor expansion will bring in error autocorrelation constraints from lags $iL$.
10.2.4 The EWC
$$J(w) = E[e^2(n)] + \beta E[\dot e^2(n)]. \qquad (10.6)$$

With the difference approximation $\dot e(n) \approx e(n) - e(n-L)$, this becomes

$$J(w) = (1 + 2\beta)E[e^2(n)] - 2\beta E[e(n)e(n-L)], \qquad (10.7)$$
which has the same form as (10.5). Note that when $\beta = 0$, we recover the MSE in (10.6) and (10.7). Similarly, we would have to select $\Delta = L$ in order to make the first-order expansion identical to the exact value of the error autocorrelation function. Substituting the identity $1 + 2\beta = (L - \Delta)/L$ and using $\Delta = L$, we observe that $\beta = -1/2$ eliminates the MSE term from the criterion. Interestingly, this value will appear in the following discussion, when we optimize $\beta$ in order to reduce the bias in the solution introduced by input noise.
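The equivalence of the two forms of the criterion can be checked directly. The sketch below (an arbitrary white error sequence, with the derivative ė(n) taken as the difference e(n) − e(n−L)) evaluates (10.6) and (10.7) numerically and also confirms that β = −1/2 leaves only the lag-L error autocorrelation:

```python
import numpy as np

rng = np.random.default_rng(1)
e = rng.standard_normal(100_000)       # an arbitrary error sequence
beta, L = -0.5, 4                      # beta = -1/2 is the MSE-cancelling choice

e0, eL = e[L:], e[:-L]                 # e(n) and e(n - L)
e_dot = e0 - eL                        # difference approximation of the derivative

J_6 = np.mean(e0**2) + beta * np.mean(e_dot**2)                    # form (10.6)
J_7 = (1 + 2*beta) * np.mean(e0**2) - 2*beta * np.mean(e0 * eL)    # form (10.7)

print(abs(J_6 - J_7) < 1e-3)           # the two forms agree (up to edge effects)

# With beta = -1/2 the E[e^2] term cancels: only the lag-L error
# autocorrelation survives in (10.7).
print(abs(J_6 - np.mean(e0 * eL)) < 1e-3)
```

The tiny discrepancy between the two forms comes only from the L boundary samples of the finite record.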
The same criterion can also be obtained by considering performance functions of the form

$$J(w) = E\left\|\,\big[\,e(n),\ \sqrt{\beta}\,\dot e(n),\ \sqrt{\gamma}\,\ddot e(n),\ \ldots\,\big]^T\right\|_2^2, \qquad (10.8)$$

where the coefficients $\beta$, $\gamma$, and so on are assumed to be positive. Note that (10.8) is the $L_2$ norm of a vector of criteria. The components of this vector consist of $e(n)$, $\dot e(n)$, $\ddot e(n)$, and so on. Due to the equivalence provided by the difference approximations for the derivative, these terms constrain the error autocorrelation at lags $iL$, as well as the error power, as seen in (10.8). The number of terms included in the Taylor series approximation for the error autocorrelation determines how many constraints are present in the vector of criteria. Therefore, the EWWF utilizes an error vector (see Fig. 10.1) instead of the scalar error signal utilized in the conventional Wiener filter. Our aim is to force the error signal as close as possible to becoming white (at lags exceeding the filter length), but these multiple lag options have not yet been investigated.
10.3
10.3.1
Suppose that noise-free training data of the form $(\tilde x(n), \tilde d(n))$, generated by a linear system with weight vector $w_T$ through $\tilde d(n) = \tilde x^T(n)\,w_T$, are provided. Assume without loss of generality that the adaptive filter and the reference filter are of the same length. This is possible since it is possible to pad $w_T$ with zeros if it is shorter than the adaptive filter. Therefore, the input vector $\tilde x(n) \in \mathbb{R}^m$, the weight vector $w_T \in \mathbb{R}^m$, and the desired output $\tilde d(n) \in \mathbb{R}$. The quadratic form in (10.6) defines the specific EWC we are interested in, and its unique stationary point gives the optimal solution for the EWWF. If $\beta \ge 0$, then this stationary point is a minimum. Otherwise, the Hessian of (10.6) might have mixed-sign eigenvalues or even all-negative eigenvalues. We demonstrate this fact with sample performance surfaces obtained for two-tap FIR filters using $\beta = -1/2$. For three differently colored training data, we obtain the EWC performance surfaces shown in Figure 10.2. In each row, the MSE performance surface, the EWC cost contour plot, and the EWC performance surface are shown for the corresponding training data. The eigenvalue pairs of the Hessian matrix of (10.6) are $(2.35, -0.30)$, $(6.13, 5.21)$, and $(-4.08, -4.14)$ for these representative cases in Figure 10.2. Clearly, it is possible for (10.6) to have a stationary point that is a minimum, a saddle point, or a maximum, and we start to see the differences brought about by the EWC. The performance surface is a weighted sum of paraboloids, which will complicate gradient-based adaptation but will not affect search algorithms utilizing curvature information.
10.3.2
Theorem 10.1 The stationary point of the EWC criterion (10.6) is given by

$$w_* = (\tilde R + \beta \tilde S)^{-1}(\tilde P + \beta \tilde Q), \qquad (10.9)$$

where we defined $\tilde R = E[\tilde x(n)\tilde x^T(n)]$, $\tilde S = E[\dot{\tilde x}(n)\dot{\tilde x}^T(n)]$, $\tilde P = E[\tilde x(n)\tilde d(n)]$, and $\tilde Q = E[\dot{\tilde x}(n)\dot{\tilde d}(n)]$.
Figure 10.2 The MSE performance surfaces, the EWC contour plot, and the EWC performance surface for three different training data sets and two-tap adaptive FIR filters.
Proof Substituting the proper variables in (10.6), we obtain the following explicit expression for $J(w)$:

$$J(w) = E[\tilde d^2(n)] + \beta E[\dot{\tilde d}^2(n)] + w^T(\tilde R + \beta \tilde S)w - 2(\tilde P + \beta \tilde Q)^T w. \qquad (10.10)$$

Equating the gradient of (10.10) with respect to $w$ to zero yields the stationary point

$$\Rightarrow\ w_* = (\tilde R + \beta \tilde S)^{-1}(\tilde P + \beta \tilde Q). \qquad (10.11)$$

Note that selecting $\beta = 0$ in (10.6) reduces the criterion to MSE and the optimal solution, given in (10.9), reduces to the Wiener solution. Thus, the Wiener filter is a special case of the EWWF solution (though not optimal for noisy inputs, as we will show later).
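As a quick numerical sanity check of Theorem 10.1 (a sketch with a made-up two-tap system; ẋ(n) and ḋ(n) are approximated by the differences x(n) − x(n−L) and d(n) − d(n−L)), the stationary point estimated from noise-free data recovers the true weight vector for any β, with β = 0 giving the Wiener solution:

```python
import numpy as np

rng = np.random.default_rng(2)
n, L = 50_000, 2
w_T = np.array([1.0, -0.6])            # hypothetical true weights

# AR(1) input so that all the involved matrices are well conditioned.
e = rng.standard_normal(n)
x = np.empty(n)
x[0] = e[0]
for i in range(1, n):
    x[i] = 0.5 * x[i - 1] + e[i]

X = np.column_stack([x[1:], x[:-1]])   # input vectors [x(n), x(n-1)]
d = X @ w_T                            # noise-free desired signal

Xdot, ddot = X[L:] - X[:-L], d[L:] - d[:-L]   # difference approximations
R, P = X.T @ X / len(d), X.T @ d / len(d)
S, Q = Xdot.T @ Xdot / len(ddot), Xdot.T @ ddot / len(ddot)

for beta in (0.0, -0.5, 0.3):
    w_star = np.linalg.solve(R + beta * S, P + beta * Q)
    print(beta, np.allclose(w_star, w_T, atol=1e-6))   # recovers w_T for any beta
```

In the noise-free case the recovery is exact (up to floating-point effects) because $P = Rw_T$ and $Q = Sw_T$ hold sample-by-sample.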
Corollary 1
(10.12)
10.3.3
Now, suppose that we are given noisy training data $(x(n), d(n))$, where $x(n) = \tilde x(n) + v(n)$ and $d(n) = \tilde d(n) + u(n)$. The additive noises on both signals are zero-mean, and uncorrelated with each other and with the input and desired signals. Assume that the additive noise $u(n)$ on the desired output is white (in time), and let the autocorrelation matrices of $v(n)$ be $V = E[v(n)v^T(n)]$ and $V_L = E[v(n-L)v^T(n) + v(n)v^T(n-L)]$. Under these circumstances, we have to estimate the necessary matrices to evaluate (10.9) using noisy data. These matrices evaluated using noisy data, $R$, $S$, $P$, and $Q$, will become (see Appendix B for details)

$$R = E[x(n)x^T(n)] = \tilde R + V,$$
$$S = E[(x(n)-x(n-L))(x(n)-x(n-L))^T] = 2(\tilde R + V) - \tilde R_L - V_L, \qquad (10.14)$$
$$P = E[x(n)d(n)] = \tilde P, \qquad Q = E[(x(n)-x(n-L))(d(n)-d(n-L))] = \tilde Q,$$

where $\tilde R_L = E[\tilde x(n)\tilde x^T(n-L) + \tilde x(n-L)\tilde x^T(n)]$ denotes the symmetrized lag-$L$ autocorrelation matrix of the noise-free input. The corresponding optimal weight vector obtained with the noisy data is then

$$w_* = (R + \beta S)^{-1}(P + \beta Q). \qquad (10.15)$$
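The practical consequence, that β = −1/2 removes the noise-induced bias, can be illustrated with a sketch (hypothetical two-tap system and noise powers; per-entry white input noise so that V_L = 0, and derivatives replaced by lag-L differences). The Wiener solution (β = 0) estimated from the noisy data is visibly biased, while the EWC solution at β = −1/2 lands near the true weights:

```python
import numpy as np

rng = np.random.default_rng(3)
n, L = 1_000_000, 2
w_T = np.array([1.0, -0.6])                 # hypothetical true weights

e = rng.standard_normal(n)
xc = np.empty(n)
xc[0] = e[0]
for i in range(1, n):
    xc[i] = 0.5 * xc[i - 1] + e[i]          # clean AR(1) input
Xc = np.column_stack([xc[1:], xc[:-1]])

X = Xc + rng.standard_normal(Xc.shape)      # white noise added to the input ...
d = Xc @ w_T + 0.5 * rng.standard_normal(len(Xc))  # ... and to the desired signal

m = len(d)
Xdot, ddot = X[L:] - X[:-L], d[L:] - d[:-L]
R, P = X.T @ X / m, X.T @ d / m
S, Q = Xdot.T @ Xdot / (m - L), Xdot.T @ ddot / (m - L)

w_wiener = np.linalg.solve(R, P)                     # beta = 0
w_ewc = np.linalg.solve(R - 0.5 * S, P - 0.5 * Q)    # beta = -1/2

print(np.linalg.norm(w_wiener - w_T))   # large: the Wiener weights are biased
print(np.linalg.norm(w_ewc - w_T))      # much smaller: bias largely removed
```

The residual EWC error here is purely finite-sample fluctuation; it shrinks with the record length, while the Wiener bias does not.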
10.4
10.4.1
An important question regarding the behavior of the optimal solution obtained using the EWC criterion is the relationship between the residual error signal and the input vector. In the case of MSE, we know that the Wiener solution results in an error orthogonal to the input signal, that is, $E[e(n)x(n)] = 0$ [16, 23]. Similarly, we can determine what the EWC criterion will achieve.

Lemma 3 At the optimal solution of EWC, the error and the input random processes satisfy $\beta E[e(n)x(n-L) + e(n-L)x(n)] = (1+2\beta)E[e(n)x(n)]$ for all $L \ge m$.

Proof We know that the optimal solution of EWC for any $L \ge m$ is obtained when the gradient of the cost function with respect to the weights is zero. Therefore,

$$\frac{\partial J}{\partial w} = -2E[e(n)x(n)] - 2\beta E[(e(n)-e(n-L))(x(n)-x(n-L))] = -2\big\{(1+2\beta)E[e(n)x(n)] - \beta E[e(n)x(n-L) + e(n-L)x(n)]\big\} = 0. \qquad (10.16)$$
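Lemma 3 can be checked numerically on synthetic data (a hypothetical regression setup; derivatives again replaced by lag-L differences): solving for the EWC stationary point and evaluating both sides of the stated identity gives matching vectors:

```python
import numpy as np

rng = np.random.default_rng(4)
n, L, beta = 200_000, 3, -0.4     # any beta, with L >= the filter length of 3

# Arbitrary (hypothetical) linear-regression data with 3-tap input vectors.
u = rng.standard_normal(n + 2)
X = np.column_stack([u[2:], u[1:-1], u[:-2]])
d = X @ np.array([0.7, 0.2, -0.1]) + 0.3 * rng.standard_normal(n)

Xdot, ddot = X[L:] - X[:-L], d[L:] - d[:-L]
R, P = X.T @ X / n, X.T @ d / n
S, Q = Xdot.T @ Xdot / (n - L), Xdot.T @ ddot / (n - L)
w = np.linalg.solve(R + beta * S, P + beta * Q)    # EWC optimal weights

e = d - X @ w
lhs = beta * np.mean(e[L:, None] * X[:-L] + e[:-L, None] * X[L:], axis=0)
rhs = (1 + 2 * beta) * np.mean(e[:, None] * X, axis=0)
print(np.allclose(lhs, rhs, atol=1e-3))
```

Note that for β ≠ 0 the plain orthogonality E[e(n)x(n)] = 0 no longer holds; it is replaced by this lagged balance condition.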
Another interesting property that the EWWF solution exhibits is its relationship with entropy. Notice that when $\beta < 0$, the optimization rule tries to minimize MSE, yet it simultaneously tries to maximize the separation between samples of errors. We could regard the sample separation as an estimate of the error entropy. In fact, the entropy estimation literature is full of methods based on sample separations [39, 5, 21, 3, 24, 2, 40]. Specifically, the case $\beta = -1/2$ finds the perfect balance between entropy and MSE that allows us to eliminate the effect of noise on the solution. Recall that the Gaussian density displays maximum entropy among distributions of fixed variance. In light of this fact, the aim of the EWWF could be understood as finding the minimum error variance solution while keeping the error close to Gaussian. Note that, due to the central limit theorem, the error signal will be closely approximated by a Gaussian density when there is a large number of taps.
10.4.3
Model order selection is another important issue in adaptive filter theory. The purpose of an adaptive filter is to find the right balance between approximating the training data as accurately as possible and generalizing to unseen data with precision [6]. One major cause of poor generalization is known to be excessive model complexity [6]. Under these circumstances, the designer's aim is to determine the least complex adaptive system (which translates into a smaller number of weights in the case of linear systems) that minimizes the approximation error. Akaike's information criterion [1] and Rissanen's minimum description length [36] are two important theoretical results regarding model order selection. Such methods require the designer to evaluate an objective function, which is a combination of MSE and the filter length or the filter weights, using different lengths of adaptive filters. The EWC criterion successfully determines the length of the true filter (assumed FIR), even in the presence of additive noise, provided that the trained adaptive filter is sufficiently long. In the case of an adaptive filter longer than the reference filter, the additional taps will decay to zero, indicating that a smaller filter is sufficient to model the data. This is exactly what we would like an automated regularization algorithm to achieve: determining the proper length of the filter without requiring external discrete modifications of this parameter. Therefore, EWC extends the
The effect of the cost function's free parameter $\beta$ on the accuracy of the solution (compared to the true weight vector that generated the training data) is another crucial issue. In fact, it is possible to determine the dynamics of the weight error as a function of $\beta$. This result is provided in the following lemma.
Lemma 4 (The Effect of $\beta$ on the EWWF) In the noisy training data case, the derivative of the error vector between the optimal EWC solution and the true weight vector, that is, $\hat\varepsilon_* = \hat w_* - w_T$, with respect to $\beta$ is given by

$$\frac{\partial \hat\varepsilon_*}{\partial \beta} = -\big[(1+2\beta)(R+V) - \beta R_L\big]^{-1}\big[(2R - R_L)\hat\varepsilon_* - R_L w_T\big]. \qquad (10.17)$$
the noisy training data case. This section demonstrates these theoretical results in
numerical case studies with Monte Carlo simulations.
Given the scheme depicted in Figure 10.3, it is possible to determine the true analytic auto- and cross-correlations of all signals of interest in terms of the filter coefficients and the noise powers. Suppose that $\xi$, $v$, and $u$ are zero-mean white noise signals with powers $\sigma_x^2$, $\sigma_v^2$, and $\sigma_u^2$, respectively. Suppose that the coloring filter $h$ and the mapping filter $w$ are unit norm. Under these conditions, we obtain

$$E[\tilde x(n)\tilde x(n+\Delta)] = \sigma_x^2 \sum_{j=0}^{M} h_j h_{j+\Delta}, \qquad (10.18)$$

$$E[(\tilde x(n)+\tilde v(n))(\tilde x(n+\Delta)+\tilde v(n+\Delta))] = \begin{cases} \sigma_x^2 + \sigma_v^2, & \Delta = 0, \\ E[\tilde x(n)\tilde x(n+\Delta)], & \Delta \ne 0, \end{cases} \qquad (10.19)$$

$$E[(\tilde x(n)+\tilde v(n))\,\hat d(n+\Delta)] = \sigma_v^2 w_\Delta + \sum_{l=0}^{N} w_l\, E[\tilde x(n)\tilde x(n+\Delta-l)]. \qquad (10.20)$$
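Equation (10.18) is straightforward to verify against a simulated realization; the sketch below (a made-up unit-norm coloring filter, not the one used in the chapter's experiments) compares the analytic autocorrelation with sample estimates:

```python
import numpy as np

rng = np.random.default_rng(5)
sigma_x2 = 1.0
h = np.array([0.6, 0.5, 0.4, 0.35])
h = h / np.linalg.norm(h)              # unit-norm coloring filter, as assumed

# Analytic autocorrelation of the colored signal, eq. (10.18):
def rho_analytic(delta):
    return sigma_x2 * sum(h[j] * h[j + delta] for j in range(len(h) - delta))

# Empirical check on a long realization.
xi = rng.standard_normal(2_000_000)
x = np.convolve(xi, h, mode="full")[: len(xi)]

for delta in range(len(h)):
    emp = np.mean(x[delta:] * x[: len(x) - delta])
    print(delta, round(rho_analytic(delta), 4), round(emp, 4))
```

The same convolution-of-taps argument underlies (10.20), where the mapping filter w plays the role of the second filter.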
For each combination of SNR from $\{-10\ \mathrm{dB}, 0\ \mathrm{dB}, 10\ \mathrm{dB}\}$, $\beta$ from $\{-0.5, -0.3, 0, 0.1\}$, $m$ from $\{2, \ldots, 10\}$, and $L$ from $\{m, \ldots, 20\}$, we have performed 100 Monte Carlo simulations using randomly selected 30-tap FIR coloring and $m$-tap mapping filters. The length of the mapping filters and that of the adaptive filters were selected to be equal in every case. In all simulations, we used an input signal power of $\sigma_x^2 = 1$, and the noise powers $\sigma_v^2 = \sigma_u^2$ are determined from the given SNR using $\mathrm{SNR} = 10\log_{10}(\sigma_x^2/\sigma_v^2)$. The matrices $R$, $S$, $P$, and $Q$, which are necessary to evaluate the optimal solution given by (10.15), are then evaluated analytically using (10.18), (10.19), and (10.20). The results obtained are summarized in Figure 10.4 and Figure 10.5, where for the three SNR levels selected, the average squared error norm for the optimal solutions (in reference to the true weights) is given as a function of $L$ and $m$ for different $\beta$ values. In Figure 10.4, we present the average normalized weight vector error norm obtained using EWC at different SNR levels and using different $\beta$ values as a function of the correlation lag $L$ that is used in the

Figure 10.3 Demonstration scheme with coloring filter $h$, true mapping filter $w$, and the uncorrelated white signals $\xi$, $\tilde v$, and $\hat u$.
criterion. The filter length is 10 in these results. From the theoretical analysis, we know that if the input autocorrelation matrix is invertible, then the solution accuracy should be independent of the autocorrelation lag $L$. The results of the Monte Carlo simulations presented in Figure 10.4 conform to this fact. As expected, the optimal choice of $\beta = -1/2$ determined the correct filter weights exactly.

Another set of results, presented in Figure 10.5, shows the effect of filter length on the accuracy of the solutions provided by the EWC criterion. The optimal value of $\beta = -1/2$ always yields the perfect solution, whereas the accuracy of the optimal weights degrades as this parameter is increased towards zero (i.e., as the weights approach the Wiener solution). An interesting observation from Figure 10.5 is that for SNR levels below zero, the accuracy of the solutions using suboptimal $\beta$ values increases, whereas for SNR levels above zero, the accuracy decreases when the filter length is increased. For zero SNR, on the other hand, the accuracy seems to be roughly unaffected by the filter length.
The Monte Carlo simulations performed in the preceding examples utilized the exact coloring filter and the true filter coefficients to obtain the analytical solutions. In our final case study, we demonstrate the performance of the batch solution of the EWC criterion obtained from sample estimates of all the relevant auto- and cross-correlation matrices. In these Monte Carlo simulations, we utilize 10,000 samples corrupted with white noise at various SNR levels. The results of these Monte Carlo simulations are summarized in the histograms shown in Figure 10.6. Each subplot of Figure 10.6 corresponds to experiments performed using SNR levels of $-10$ dB, 0 dB, and 10 dB for each column and adaptive filter lengths of 4 taps, 8 taps, and 12

Figure 10.4 The average squared error norm of the optimal weight vector as a function of autocorrelation lag $L$ for various $\beta$ values and SNR levels.
Figure 10.5 The average squared error norm of the optimal weight vector as a function of filter length $m$ for various $\beta$ values and SNR levels.

taps for each row, respectively. For each combination of SNR and filter length, we performed 50 Monte Carlo simulations using the MSE ($\beta = 0$) and EWC ($\beta = -1/2$) criteria. The correlation lag is selected to be equal to the filter length in all simulations due to Theorem 10.2. Clearly, Figure 10.6 demonstrates the superiority of the EWC in rejecting noise that is present in the training data. Note that in all subplots (i.e., for all combinations of filter length and SNR), EWC achieves a smaller average error norm than MSE. The discrepancy between the performances of the two solutions intensifies with increasing filter length. Next, we demonstrate the error-whitening property of the proposed EWC solutions.
From (10.1) we can expect that the error autocorrelation function will vanish at lags greater than or equal to the length of the reference filter if the weight vector is identical to the true weight vector. For any other value of the weight vector, the error autocorrelation fluctuates at nonzero values. A four-tap reference filter is identified with a four-tap adaptive filter using noisy training data (hypothetical) at an SNR level of 0 dB. The autocorrelation functions of the error signals corresponding to the MSE solution and the EWC solution are shown in Figure 10.7. Clearly, the EWC criterion determines a solution that forces the error autocorrelation function to zero at lags greater than or equal to the filter length (partial whitening of the error).
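The partial-whitening effect can be reproduced in a small sketch (a hypothetical two-tap system rather than the chapter's four-tap example): the error autocorrelation of the EWC solution at lags ≥ L is driven close to zero, while the MSE solution leaves clearly nonzero error correlation:

```python
import numpy as np

rng = np.random.default_rng(6)
n, L = 400_000, 2
w_T = np.array([1.0, -0.6])                  # hypothetical true weights

e0 = rng.standard_normal(n)
xc = np.empty(n)
xc[0] = e0[0]
for i in range(1, n):
    xc[i] = 0.5 * xc[i - 1] + e0[i]          # clean AR(1) input
Xc = np.column_stack([xc[1:], xc[:-1]])

X = Xc + rng.standard_normal(Xc.shape)       # noisy input
d = Xc @ w_T + 0.5 * rng.standard_normal(len(Xc))   # noisy desired signal

m = len(d)
Xdot, ddot = X[L:] - X[:-L], d[L:] - d[:-L]
R, P = X.T @ X / m, X.T @ d / m
S, Q = Xdot.T @ Xdot / (m - L), Xdot.T @ ddot / (m - L)

w_mse = np.linalg.solve(R, P)                        # Wiener (beta = 0)
w_ewc = np.linalg.solve(R - 0.5 * S, P - 0.5 * Q)    # EWC (beta = -1/2)

def err_autocorr(w, lag):
    e = d - X @ w
    return np.mean(e[lag:] * e[:-lag])

for lag in (L, L + 1, L + 2):
    print(lag, round(err_autocorr(w_mse, lag), 3), round(err_autocorr(w_ewc, lag), 3))
```

The MSE residual stays correlated because its biased weights leave a colored signal component in the error; the EWC residual at lags ≥ L is whitened down to sampling noise.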
Finally, we address the order selection capability and demonstrate how the EWC
criterion can be used to determine the correct lter order, even with noisy data,
provided that the given input desired output pair is a moving average process. For
this purpose, we determine the theoretical Wiener and EWC (with b 1=2 and
Figure 10.6 Histograms of the weight error norms (dB) obtained in 50 Monte Carlo simulations using 10,000 samples of noisy data using MSE (empty bars) and EWC with β = -1/2 (full bars). The subfigures in each row use filters with 4, 8, and 12 taps, respectively. The subfigures in each column use noisy samples at -10, 0, and 10 dB SNR, respectively.
Figure 10.7 Error autocorrelation function for MSE (dotted) and EWC (solid) solutions.
L = m, where m is the length of the adaptive filter) solutions for a randomly selected pair of coloring filter h and mapping filter w at different adaptive filter lengths. The noise level is selected to be -20 dB, and the length of the true mapping filter is 5. We
know from our theoretical analysis that if the adaptive filter is longer than the reference filter, the EWC will yield the true weight vector padded with zeros. This will not change the MSE of the solution. Thus, if we plot the MSE of the EWC versus the length of the adaptive filter, starting from the length of the actual filter, the MSE of the EWC solution will remain flat, whereas the Wiener solution will keep decreasing the MSE, contaminating the solution by learning the noise in the data. Figure 10.8a shows the MSE of the Wiener solution as well as the EWC obtained for different lengths of the adaptive filter using the same training data described above. Note (in the zoomed-in portion) that the MSE of the EWC remains constant starting from 5, which is the filter order that generated the data. On the other hand, if we were to decide on the filter order by looking at the MSE of the Wiener solution, we would select a model order of 4, since the gain in MSE is insignificantly small compared to the previous steps from this point on.
Figure 10.8b shows the norm of the weight vector error for the solutions obtained using the EWC and MSE criteria, which confirms that the true weight vector is indeed attained with the EWC criterion once the proper model order is reached.
This section aimed at experimentally demonstrating the theoretical concepts set forth in the preceding sections of the chapter. We have demonstrated with numerous Monte Carlo simulations that the analytical solution of the EWC criterion eliminates the effect of noise completely if the proper value is used for β. We have also demonstrated that the batch solution of EWC (estimated from a finite number of samples) outperforms MSE in the presence of noise, provided that a sufficient
Figure 10.8 Model order selection using the EWC criterion: (a) MSE, $E[e^2(n)]$, of the EWWF (solid) and the Wiener solutions (dotted) versus filter length. (b) Norm of the weight vector error as a function of filter length for the EWWF (solid) and Wiener solutions (dotted).
number of samples is given so that the noise autocorrelation matrices diminish as required by the theory.
Although we have presented a complete theoretical investigation of the proposed criterion and its analytical solution, in practice, on-line algorithms that operate on a sample-by-sample basis to determine the desired solution are equally valuable. Therefore, in the sequel, we will focus on designing computationally efficient on-line algorithms to solve for EWC in a fashion similar to the well-known LMS and RLS algorithms. In fact, we aim to come up with algorithms that have the same computational complexity as these two widely used algorithms. The advantage of the new algorithms will be their ability to provide better estimates of the model weights when the training data are contaminated with white noise.
10.6 The Recursive Error Whitening (REW) Algorithm
In this section, we will present an on-line recursive algorithm to estimate the optimal solution for the EWC. Given the estimate of the filter tap weights at time instant (n - 1), the goal is to determine the best set of tap weights at the next iteration n that would track the optimal solution. This algorithm, which we call recursive error whitening (REW), is similar to recursive least squares (RLS). The strongest motivation behind proposing the REW algorithm is that it is truly a fixed-point-type algorithm that tracks, at each iteration, the optimal solution.
This tracking nature results in the faster convergence of the REW algorithm [34]. This, however, comes at an increase in the computational cost. The REW algorithm is O(m²) in complexity (the same as the RLS algorithm), and this is a substantial increase in complexity when compared with the simple gradient methods that will be discussed in a later section. We know that the optimal solution for the EWC is given by
$\mathbf{w}_* = (\mathbf{R} + \beta\mathbf{S})^{-1}(\mathbf{P} + \beta\mathbf{Q}).$   (10.21)
Defining $\mathbf{T}(n)$ as the sample estimate of $\mathbf{R} + \beta\mathbf{S}$ and $\mathbf{V}(n)$ as the sample estimate of $\mathbf{P} + \beta\mathbf{Q}$, the corresponding rank-two recursions can be written as

$\mathbf{T}(n) = \mathbf{T}(n-1) + \mathbf{B}(n)\mathbf{D}^T(n), \qquad \mathbf{V}(n) = \mathbf{V}(n-1) + \mathbf{B}(n)[d(n) \quad d(n) - \beta d(n-L)]^T,$   (10.23)

where $\mathbf{D}(n) = [\mathbf{x}(n) \quad \mathbf{x}(n) - \beta\mathbf{x}(n-L)]$ and $\mathbf{B}(n) = [2\beta\mathbf{x}(n) - \beta\mathbf{x}(n-L) \quad \mathbf{x}(n)]$, so that the weight estimate at time n is

$\mathbf{w}(n) = \mathbf{T}^{-1}(n)\mathbf{V}(n).$   (10.26)
We will define a gain matrix analogous to the gain vector in the RLS case [23] as

$\boldsymbol{\kappa}(n) = \mathbf{T}^{-1}(n-1)\mathbf{B}(n)\left[\mathbf{I}_{2\times 2} + \mathbf{D}^T(n)\mathbf{T}^{-1}(n-1)\mathbf{B}(n)\right]^{-1}.$   (10.27)

Using the above definition, the recursive estimate for the inverse of $\mathbf{T}(n)$ becomes

$\mathbf{T}^{-1}(n) = \mathbf{T}^{-1}(n-1) - \boldsymbol{\kappa}(n)\mathbf{D}^T(n)\mathbf{T}^{-1}(n-1).$   (10.28)
Once again, the above equation is analogous to the Riccati equation for the RLS algorithm. Multiplying (10.27) from the right by $[\mathbf{I}_{2\times 2} + \mathbf{D}^T(n)\mathbf{T}^{-1}(n-1)\mathbf{B}(n)]$, we obtain

$\boldsymbol{\kappa}(n)\left[\mathbf{I}_{2\times 2} + \mathbf{D}^T(n)\mathbf{T}^{-1}(n-1)\mathbf{B}(n)\right] = \mathbf{T}^{-1}(n-1)\mathbf{B}(n)$
$\boldsymbol{\kappa}(n) = \mathbf{T}^{-1}(n-1)\mathbf{B}(n) - \boldsymbol{\kappa}(n)\mathbf{D}^T(n)\mathbf{T}^{-1}(n-1)\mathbf{B}(n) = \mathbf{T}^{-1}(n)\mathbf{B}(n).$   (10.29)
In order to derive an update equation for the filter weights, we substitute the recursive estimate for $\mathbf{V}(n)$ in (10.26):

$\mathbf{w}(n) = \mathbf{T}^{-1}(n)\mathbf{V}(n-1) + \mathbf{T}^{-1}(n)\left[(1+2\beta)d(n)\mathbf{x}(n) - \beta d(n)\mathbf{x}(n-L) - \beta d(n-L)\mathbf{x}(n)\right].$   (10.30)

Substituting (10.28) for the first $\mathbf{T}^{-1}(n)$ and using $\mathbf{T}^{-1}(n)\mathbf{B}(n) = \boldsymbol{\kappa}(n)$ from (10.29), the update reduces after some algebra to

$\mathbf{w}(n) = \mathbf{w}(n-1) - \boldsymbol{\kappa}(n)\mathbf{D}^T(n)\mathbf{w}(n-1) + \boldsymbol{\kappa}(n)\begin{bmatrix} d(n) \\ d(n) - \beta d(n-L)\end{bmatrix}.$   (10.33)
Note that the product $\mathbf{D}^T(n)\mathbf{w}(n-1)$ is nothing but the vector of filter outputs $[y(n) \quad y(n) - \beta y(n-L)]^T$, where $y(n) = \mathbf{x}^T(n)\mathbf{w}(n-1)$ and $y(n-L) = \mathbf{x}^T(n-L)\mathbf{w}(n-1)$. The a priori error vector is defined as

$\boldsymbol{\varepsilon}(n) = \begin{bmatrix} d(n) - y(n) \\ d(n) - y(n) - \beta\left(d(n-L) - y(n-L)\right)\end{bmatrix} = \begin{bmatrix} e(n) \\ e(n) - \beta e(n-L)\end{bmatrix}.$   (10.34)
Using all the above definitions, we can formally state the weight update equation for the REW algorithm as

$\mathbf{w}(n) = \mathbf{w}(n-1) + \boldsymbol{\kappa}(n)\boldsymbol{\varepsilon}(n).$   (10.35)
The overall complexity of (10.35) is O(m²), which is comparable to the complexity of the RLS algorithm. Unlike the stochastic gradient algorithms, which are easily affected by the eigenspread of the input data and the type of the stationary point solution (minimum, maximum, or saddle), the REW algorithm is immune to these problems. This is because it inherently makes use of more information about the performance surface by computing the inverse of the Hessian matrix $\mathbf{R} + \beta\mathbf{S}$. A summary of the REW algorithm is given in Table 10.1.
The convergence analysis of the REW algorithm is similar to that of the RLS algorithm, which is dealt with in detail in [23]. In this chapter, we will not dwell further on the convergence issues of the REW algorithm. The REW algorithm as given by (10.35) works for stationary data only. For nonstationary data, tracking becomes an important issue. This can be handled by including a forgetting factor in the estimation of $\mathbf{T}(n)$ and $\mathbf{V}(n)$. This generalization of the REW algorithm with a forgetting factor is trivial and very similar to the exponentially weighted RLS (EWRLS) algorithm [23].
The instrumental variables (IV) method, proposed as an extension to least squares (LS), has a similar recursive algorithm for solving the problem of parameter estimation in white noise [43]. This method requires choosing a set of instruments that are uncorrelated with the noise in the input. Specifically, the IV method computes the solution $\mathbf{w} = \left(E[\mathbf{x}_k\mathbf{x}^T_{k-D}]\right)^{-1}E[\mathbf{x}_{k-D}d_k]$, where D is the chosen lag for the instrument vector. Notice that there is a similarity between the IV solution and the recursive EWC solution $\mathbf{w} = \mathbf{R}_L^{-1}\mathbf{P}_L$. However, the EWC formulation is based on
TABLE 10.1 Summary of the REW Algorithm

$\mathbf{D}(n) = [\mathbf{x}(n) \quad \mathbf{x}(n) - \beta\mathbf{x}(n-L)]$ and $\mathbf{B}(n) = [2\beta\mathbf{x}(n) - \beta\mathbf{x}(n-L) \quad \mathbf{x}(n)]$
$\boldsymbol{\kappa}(n) = \mathbf{T}^{-1}(n-1)\mathbf{B}(n)\left[\mathbf{I}_{2\times 2} + \mathbf{D}^T(n)\mathbf{T}^{-1}(n-1)\mathbf{B}(n)\right]^{-1}$
$y(n) = \mathbf{x}^T(n)\mathbf{w}(n-1)$ and $y(n-L) = \mathbf{x}^T(n-L)\mathbf{w}(n-1)$
$\boldsymbol{\varepsilon}(n) = \begin{bmatrix} d(n) - y(n) \\ d(n) - y(n) - \beta(d(n-L) - y(n-L))\end{bmatrix} = \begin{bmatrix} e(n) \\ e(n) - \beta e(n-L)\end{bmatrix}$
$\mathbf{w}(n) = \mathbf{w}(n-1) + \boldsymbol{\kappa}(n)\boldsymbol{\varepsilon}(n)$
$\mathbf{T}^{-1}(n) = \mathbf{T}^{-1}(n-1) - \boldsymbol{\kappa}(n)\mathbf{D}^T(n)\mathbf{T}^{-1}(n-1)$
the error, whereas the IV method does not have an associated error cost function. Also, the Toeplitz structure of $\mathbf{R}_L$ can be exploited to derive fast-converging (and robust) minor-components-based recursive EWC algorithms [44].
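As a concrete sketch, the REW recursion of Table 10.1 takes only a few lines. The rank-two factorization B(n), D(n) used below is our reading of the (partially garbled) recursions and may differ from the authors' exact bookkeeping; the two-tap system, coloring filter, and initialization constant delta are arbitrary choices:

```python
import numpy as np

def rew(x, d, m, L, beta=-0.5, delta=1.0):
    """Recursive error whitening (REW) sketch, following (10.27)-(10.35)."""
    Tinv = np.eye(m) / delta              # T^{-1}(0) = (1/delta) I
    w = np.zeros(m)
    for n in range(L + m - 1, len(x)):
        xn = x[n - np.arange(m)]          # tap vector x(n)
        xL = x[n - L - np.arange(m)]      # tap vector x(n-L)
        D = np.column_stack([xn, xn - beta * xL])
        B = np.column_stack([2 * beta * xn - beta * xL, xn])
        A = Tinv @ B
        K = A @ np.linalg.inv(np.eye(2) + D.T @ A)   # gain matrix kappa(n)
        Tinv = Tinv - K @ (D.T @ Tinv)               # inverse recursion (10.28)
        y, yL = xn @ w, xL @ w
        err = np.array([d[n] - y, (d[n] - y) - beta * (d[n - L] - yL)])
        w = w + K @ err                              # weight update (10.35)
    return w

rng = np.random.default_rng(0)
N, m, L = 5000, 2, 2
w_true = np.array([1.0, -0.7])
x = np.convolve(rng.standard_normal(N), [1.0, 0.4, 0.8])[:N]  # colored input
d = np.convolve(x, w_true)[:N]                                # clean desired signal
w_hat = rew(x, d, m, L)
```

On clean data the recursion reproduces w(n) = T^{-1}(n)V(n) exactly, so the estimate lands on the true weights up to the small initialization bias introduced by delta.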
10.6.1
The REW algorithm can be used effectively to solve the system identification problem in noisy environments. As we have seen before, by setting the value of β = -0.5, noise immunity can be gained for parameter estimation. We generated purely white Gaussian random noise of length 50,000 samples and added this to a colored input signal. The white noise signal is uncorrelated with the input signal. The noise-free, colored input signal was filtered by the unknown reference filter, and this formed the desired signal for the adaptive filter. Since the noise in the desired signal would be averaged out for both the RLS and REW algorithms, we decided to use the clean desired signal itself. This will bring out only the effects of input noise on the filter estimates. Also, the noise added to the clean input is uncorrelated with the desired signal. In the experiment, we varied the SNR in the range -10 dB to 10 dB. The number of desired filter coefficients was also varied from 4 to 12. We then performed 100 Monte Carlo runs and computed the normalized error vector norm given by
$\text{error} = 20\log_{10}\frac{\|\mathbf{w}_T - \mathbf{w}_*\|}{\|\mathbf{w}_T\|},$   (10.36)
where $\mathbf{w}_*$ is the weight vector estimated by the REW algorithm with β = -0.5 after 50,000 iterations, or one complete presentation of the input data, and $\mathbf{w}_T$ is the true weight vector. In order to show the effectiveness of the REW algorithm, we performed Monte Carlo runs using the RLS algorithm on the same data to estimate the filter coefficients. Figure 10.9 shows a histogram plot of the normalized error vector norm given in (10.36). The solid bars show the REW results, and the unfilled bars denote the results of RLS. It is clear that the REW algorithm is able to perform better than the RLS at various SNR and tap length settings. In the high-SNR cases, there is not much of a difference between the RLS and REW results. However, under noisy circumstances, the reduction in the parameter estimation error with REW is orders of magnitude higher when compared with RLS. Also, the RLS algorithm results in a rather useless zero weight vector, that is, $\mathbf{w} = \mathbf{0}$, when the SNR is lower than -10 dB.
10.6.2
Figure 10.9 Histogram plots showing the normalized error vector norm for the REW and RLS algorithms.
Figure 10.10 Performance of the REW algorithm with (a) SNR = 0 dB and (b) SNR = -10 dB over various β values.
figure), and this clearly gives us the minimum estimation error. For β = 0 (indicated by a "o" in the figure), the REW algorithm reduces to the regular RLS, giving a fairly significant estimation error. Next, the parameter β is set to -0.5 and the SNR to 0 dB, and the weight tracks are estimated for the two algorithms. Figure 10.11 shows the averaged weight tracks for both the REW and RLS algorithms over 50 Monte Carlo trials. Asterisks on the plots indicate the true parameters. The tracks for the RLS algorithm are smoother, but they converge to wrong values, which we have observed quite consistently. The weight tracks for the REW algorithm are noisier than those of the RLS, but they eventually converge to values very close to the true weights.
We have observed that the weight tracks for the REW algorithm can be quite
noisy in the initial stages of adaptation. This may be attributed to the poor initial estimate of the inverse matrix $\mathbf{T}^{-1}(n)$.
Figure 10.11 Averaged weight tracks for the REW and RLS algorithms.

10.7

Consider the MSE and the lag-L error-difference power as functions of the weights:

$J_{\mathrm{MSE}}(\mathbf{w}) = E[e^2(n)],$   (10.37)
$J(\mathbf{w}) = E[e^2(n)] + \beta E[\dot{e}^2(n)],$   (10.38)

where $\dot{e}(n) = e(n) - e(n-L)$ and $\dot{\mathbf{x}}(n) = \mathbf{x}(n) - \mathbf{x}(n-L)$.
It is easy to see that both $E[e^2(n)]$ and $E[\dot{e}^2(n)]$ have parabolic performance surfaces, as their Hessians have positive eigenvalues. However, the value of β can invert the performance surface of $E[\dot{e}^2(n)]$. For β > 0, the stationary point is always a global minimum, and the gradient of (10.38) can be written as the sum of the individual gradients as follows:

$\frac{\partial J(\mathbf{w})}{\partial \mathbf{w}} = 2(\mathbf{R} + \beta\mathbf{S})\mathbf{w} - 2(\mathbf{P} + \beta\mathbf{Q}) = 2(\mathbf{R}\mathbf{w} - \mathbf{P}) + 2\beta(\mathbf{S}\mathbf{w} - \mathbf{Q}).$   (10.39)

The corresponding steepest-descent update is

$\mathbf{w}(n+1) = \mathbf{w}(n) - \eta\,\frac{\partial J(\mathbf{w})}{\partial \mathbf{w}}\bigg|_{\mathbf{w} = \mathbf{w}(n)}.$   (10.40)
Thus we can write the weight update for the stochastic EWC-LMS algorithm for β > 0 as

$\mathbf{w}(n+1) = \mathbf{w}(n) + \eta(n)\left[e(n)\mathbf{x}(n) + \beta\dot{e}(n)\dot{\mathbf{x}}(n)\right],$   (10.41)

and, writing $\beta = -|\beta|$ for negative β,

$\mathbf{w}(n+1) = \mathbf{w}(n) + \eta(n)\left[e(n)\mathbf{x}(n) - |\beta|\dot{e}(n)\dot{\mathbf{x}}(n)\right],$   (10.42)

where $\eta(n)$ is again a small step-size. However, there is no guarantee that the above update rules will be stable for all choices of step-sizes. Although (10.41) and (10.42) are identical, we will use |β| in the update (10.42) to analyze the convergence of the algorithm specifically for β < 0. The reason for the separate analysis is that the convergence characteristics of (10.41) and (10.42) are very different.
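For β > 0, the update (10.41), combined with a step-size below the bound derived in (10.49), is straightforward to sketch (clean data, arbitrary two-tap system and coloring filter; the step is taken as half the bound):

```python
import numpy as np

rng = np.random.default_rng(0)
N, m, L, beta = 30_000, 2, 2, 0.5        # beta > 0 case
w_true = np.array([1.0, -0.7])
x = np.convolve(rng.standard_normal(N), [1.0, 0.4, 0.8])[:N]
d = np.convolve(x, w_true)[:N]

w = np.zeros(m)
for n in range(L + m - 1, N):
    xn = x[n - np.arange(m)]
    xnL = x[n - L - np.arange(m)]
    e = d[n] - xn @ w                    # a priori error e(n)
    eL = d[n - L] - xnL @ w              # a priori error e(n-L)
    xdot, edot = xn - xnL, e - eL
    u = e * xn + beta * edot * xdot      # instantaneous update direction
    denom = u @ u
    if denom > 1e-12:
        eta = (e * e + beta * edot * edot) / denom   # half the bound in (10.49)
        w = w + eta * u
```

Choosing the step this way keeps the weight error norm nonincreasing at every iteration, which is exactly what the bound is designed to guarantee.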
Theorem 10.3 The stochastic EWC algorithms asymptotically converge in the mean to the optimal solution given by

$\mathbf{w}_* = (\mathbf{R} + \beta\mathbf{S})^{-1}(\mathbf{P} + \beta\mathbf{Q}), \qquad \beta > 0,$
$\mathbf{w}_* = (\mathbf{R} - |\beta|\mathbf{S})^{-1}(\mathbf{P} - |\beta|\mathbf{Q}), \qquad \beta < 0.$   (10.43)

We will first consider the update equation in (10.41), which is the stochastic EWC-LMS algorithm for β > 0. Without loss of generality, we will assume that the input
vectors $\mathbf{x}(n)$ and their corresponding desired responses $d(n)$ are noise-free. The mean update vector $\bar{h}(\mathbf{w}(n))$ is given by

$\frac{d\mathbf{w}(t)}{dt} = \bar{h}(\mathbf{w}(n)) = E\left[e(n)\mathbf{x}(n) + \beta\dot{e}(n)\dot{\mathbf{x}}(n)\right] = -\left[(\mathbf{R}\mathbf{w}(n) - \mathbf{P}) + \beta(\mathbf{S}\mathbf{w}(n) - \mathbf{Q})\right].$   (10.44)

The stationary point of the ordinary differential equation (ODE) in (10.44) is given by

$\mathbf{w}_* = (\mathbf{R} + \beta\mathbf{S})^{-1}(\mathbf{P} + \beta\mathbf{Q}).$   (10.45)
Define the weight error vector

$\boldsymbol{\xi}(n) = \mathbf{w}_* - \mathbf{w}(n).$   (10.46)

Its squared norm evolves as

$\|\boldsymbol{\xi}(n+1)\|^2 = \|\boldsymbol{\xi}(n)\|^2 - 2\eta(n)\boldsymbol{\xi}^T(n)\left[e(n)\mathbf{x}(n) + \beta\dot{e}(n)\dot{\mathbf{x}}(n)\right] + \eta^2(n)\left\|e(n)\mathbf{x}(n) + \beta\dot{e}(n)\dot{\mathbf{x}}(n)\right\|^2.$   (10.47)

Imposing the condition that $\|\boldsymbol{\xi}(n+1)\|^2 < \|\boldsymbol{\xi}(n)\|^2$ for all n, we get an upper bound on the time-varying step-size parameter $\eta(n)$, which is given by

$\eta(n) < \frac{2\,\boldsymbol{\xi}^T(n)\left[e(n)\mathbf{x}(n) + \beta\dot{e}(n)\dot{\mathbf{x}}(n)\right]}{\left\|e(n)\mathbf{x}(n) + \beta\dot{e}(n)\dot{\mathbf{x}}(n)\right\|^2}.$   (10.48)
Simplifying the above equation using the facts that $\boldsymbol{\xi}^T(n)\mathbf{x}(n) = e(n)$ and $\boldsymbol{\xi}^T(n)\dot{\mathbf{x}}(n) = \dot{e}(n)$, we get

$\eta(n) < \frac{2\left[e^2(n) + \beta\dot{e}^2(n)\right]}{\left\|e(n)\mathbf{x}(n) + \beta\dot{e}(n)\dot{\mathbf{x}}(n)\right\|^2},$   (10.49)

which is a more practical upper bound on the step-size, as it can be directly estimated from the input and outputs. As an observation, we note that if β = 0, then the bound in (10.49) reduces to

$\eta(n) < \frac{2}{\|\mathbf{x}(n)\|^2},$   (10.50)

which, when included in the update equation, reduces to a variant of the normalized LMS (NLMS) algorithm. In general, if the step-size parameter is chosen according to the bound given by (10.49), then the norm of the error vector $\boldsymbol{\xi}(n)$ is a monotonically decreasing sequence.
10.7.2
For β < 0, the mean update ODE becomes

$\frac{d\mathbf{w}(t)}{dt} = -\left[(\mathbf{R}\mathbf{w}(n) - \mathbf{P}) - |\beta|(\mathbf{S}\mathbf{w}(n) - \mathbf{Q})\right].$   (10.51)

As before, the stationary point of this ODE is

$\mathbf{w}_* = (\mathbf{R} - |\beta|\mathbf{S})^{-1}(\mathbf{P} - |\beta|\mathbf{Q}).$   (10.52)
The eigenvalues of $\mathbf{R} - |\beta|\mathbf{S}$ decide the nature of the stationary point. If they are all positive, then we have a global minimum; if they are all negative, we have a global maximum. In these two cases, the stochastic gradient algorithm in (10.42) with a proper fixed-sign step-size would converge to the stationary point, which would be stable. However, we know that the eigenvalues of $\mathbf{R} - |\beta|\mathbf{S}$ can also take both positive and negative values, resulting in a saddle stationary point. Thus, the underlying dynamical system would have both stable and unstable modes, making it impossible for the algorithm in (10.42) with a fixed-sign step-size to converge. This is well known in the literature [22]. However, as will be shown next, this difficulty can be removed in our case by appropriately utilizing the sign of the update equation (remember that this is the only stationary point of the quadratic performance surface). The general idea is to use a vector step-size (one step-size per weight) having both positive and negative values. One unrealistic way (for an on-line algorithm) to achieve this goal is to estimate the eigenvalues of $\mathbf{R} - |\beta|\mathbf{S}$. Alternatively, we can derive the conditions on the step-size for guaranteed
convergence. As before, we will define the error vector at time instant n as $\boldsymbol{\xi}(n) = \mathbf{w}_* - \mathbf{w}(n)$. The norm of the error vector at time instant n + 1 is given by

$\|\boldsymbol{\xi}(n+1)\|^2 = \|\boldsymbol{\xi}(n)\|^2 - 2\eta(n)\left[\boldsymbol{\xi}^T(n)e(n)\mathbf{x}(n) - |\beta|\boldsymbol{\xi}^T(n)\dot{e}(n)\dot{\mathbf{x}}(n)\right] + \eta^2(n)\left\|e(n)\mathbf{x}(n) - |\beta|\dot{e}(n)\dot{\mathbf{x}}(n)\right\|^2.$   (10.53)

Taking expectations on both sides yields

$E\|\boldsymbol{\xi}(n+1)\|^2 = E\|\boldsymbol{\xi}(n)\|^2 - 2\eta(n)E\left[\boldsymbol{\xi}^T(n)\left(e(n)\mathbf{x}(n) - |\beta|\dot{e}(n)\dot{\mathbf{x}}(n)\right)\right] + \eta^2(n)E\left\|e(n)\mathbf{x}(n) - |\beta|\dot{e}(n)\dot{\mathbf{x}}(n)\right\|^2.$   (10.54)
The mean of the error vector norm will monotonically decay to zero over time; that is, $E\|\boldsymbol{\xi}(n+1)\|^2 < E\|\boldsymbol{\xi}(n)\|^2$ if and only if the step-size satisfies the following inequality:

$|\eta(n)| < \frac{2\left|E[\boldsymbol{\xi}^T(n)e(n)\mathbf{x}(n)] - |\beta|\,E[\boldsymbol{\xi}^T(n)\dot{e}(n)\dot{\mathbf{x}}(n)]\right|}{E\left\|e(n)\mathbf{x}(n) - |\beta|\dot{e}(n)\dot{\mathbf{x}}(n)\right\|^2}.$   (10.55)

Writing the noisy data as $\mathbf{x}(n) = \tilde{\mathbf{x}}(n) + \mathbf{v}(n)$ and $d(n) = \tilde{d}(n) + u(n)$, the first numerator term expands as

$\boldsymbol{\xi}^T(n)e(n)\mathbf{x}(n) = (\mathbf{w}_* - \mathbf{w}(n))^T\left[\tilde{d}(n) + u(n) - \mathbf{w}^T(n)(\tilde{\mathbf{x}}(n) + \mathbf{v}(n))\right]\left[\tilde{\mathbf{x}}(n) + \mathbf{v}(n)\right],$   (10.56)–(10.57)

and taking the expectation and collecting the surviving terms defines the positive quantity $J_{\mathrm{MSE}} = E[e^2(n)]$, evaluated at $\mathbf{w}(n)$.   (10.58)

Similarly, we have

$\boldsymbol{\xi}^T(n)\dot{e}(n)\dot{\mathbf{x}}(n) = (\mathbf{w}_* - \mathbf{w}(n))^T\big[\tilde{d}(n) + u(n) - \mathbf{w}^T(n)(\tilde{\mathbf{x}}(n) + \mathbf{v}(n)) - \tilde{d}(n-L) - u(n-L) + \mathbf{w}^T(n)(\tilde{\mathbf{x}}(n-L) + \mathbf{v}(n-L))\big]\big[\tilde{\mathbf{x}}(n) + \mathbf{v}(n) - \tilde{\mathbf{x}}(n-L) - \mathbf{v}(n-L)\big],$   (10.59)–(10.60)

whose expectation likewise defines the positive quantity $J_{\mathrm{ENT}} = E[\dot{e}^2(n)]$.   (10.61)
Using (10.57) and (10.60) in (10.55), we get an expression (10.62) for the upper bound on the step-size. This expression is not usable in practice as an upper bound because it depends on the optimal weight vector. However, for β = -0.5, the upper bound on the step-size reduces to

$|\eta(n)| < \frac{2\left|J_{\mathrm{MSE}} - 0.5\,J_{\mathrm{ENT}}\right|}{E\left\|e(n)\mathbf{x}(n) - 0.5\,\dot{e}(n)\dot{\mathbf{x}}(n)\right\|^2}.$   (10.63)
From (10.58) and (10.61), we know that $J_{\mathrm{MSE}}$ and $J_{\mathrm{ENT}}$ are positive quantities. However, $J_{\mathrm{MSE}} - 0.5J_{\mathrm{ENT}}$ can be negative. Also, note that this upper bound is computed by evaluating the right-hand side of (10.63) with the current weight vector $\mathbf{w}(n)$. Thus, as expected, it is very clear that the step-size at the nth iteration can take either positive or negative values based on $J_{\mathrm{MSE}} - 0.5J_{\mathrm{ENT}}$; therefore, $\mathrm{sgn}(\eta(n))$ must be the same as $\mathrm{sgn}(J_{\mathrm{MSE}} - 0.5J_{\mathrm{ENT}})$ evaluated at $\mathbf{w}(n)$. Intuitively speaking, the term $J_{\mathrm{MSE}} - 0.5J_{\mathrm{ENT}}$ is the EWC cost computed with the current weights $\mathbf{w}(n)$ and β = -0.5, which tells us where we are on the performance surface, and the sign tells us which way to go to reach the stationary point. It also means that the lower bound on the step-size is not positive, as in traditional gradient algorithms. In general, if the step-size we choose satisfies (10.62), then the mean error vector norm decreases asymptotically, that is, $E\|\boldsymbol{\xi}(n+1)\|^2 < E\|\boldsymbol{\xi}(n)\|^2$, and eventually becomes zero, which implies that $\lim_{n\to\infty} E[\mathbf{w}(n)] = \mathbf{w}_*$. Thus, the weight vector $E[\mathbf{w}(n)]$ converges asymptotically to $\mathbf{w}_*$, which is the only stationary point of the ODE in (10.51). We conclude that knowledge of the eigenvalues is not needed to implement gradient descent on the EWC performance surface, but (10.63) is still not appropriate for a simple LMS-type algorithm.
10.7.3
As mentioned before, computing $J_{\mathrm{MSE}} - 0.5J_{\mathrm{ENT}}$ at the current weight vector would require reusing the entire past data at every iteration. As an alternative, we can extract the curvature at the operating point and include that information in the gradient algorithm. By doing so, we obtain the following stochastic algorithm:

$\mathbf{w}(n+1) = \mathbf{w}(n) + \eta\,\mathrm{sgn}\!\left(\mathbf{w}^T(n)\left[\mathbf{R}(n) - |\beta|\mathbf{S}(n)\right]\mathbf{w}(n)\right)\left[e(n)\mathbf{x}(n) - |\beta|\dot{e}(n)\dot{\mathbf{x}}(n)\right],$   (10.64)

where $\mathbf{R}(n)$ and $\mathbf{S}(n)$ are the estimates of $\mathbf{R}$ and $\mathbf{S}$, respectively, at the nth time instant.
Corollary Given any quadratic surface $J(\mathbf{w})$, the following gradient algorithm converges to its stationary point:

$\mathbf{w}(n+1) = \mathbf{w}(n) - \eta\,\mathrm{sgn}\!\left(\mathbf{w}^T(n)\mathbf{H}\mathbf{w}(n)\right)\frac{\partial J}{\partial \mathbf{w}(n)}.$   (10.65)
Proof Without loss of generality, suppose that we are given a quadratic surface of the form $J(\mathbf{w}) = \mathbf{w}^T\mathbf{H}\mathbf{w}$, where $\mathbf{H} \in \mathbb{R}^{m\times m}$ and $\mathbf{w} \in \mathbb{R}^{m\times 1}$. $\mathbf{H}$ is restricted to be symmetric; therefore, it is the Hessian matrix of this quadratic surface. The gradient of the performance surface with respect to the weights, evaluated at the point $\mathbf{w}_0$, is $\partial J/\partial \mathbf{w}_0 = 2\mathbf{H}\mathbf{w}_0$, and the stationary point of $J(\mathbf{w})$ is the origin. Since the performance surface is quadratic, any cross section passing through the stationary point is a parabola. Consider the cross section of $J(\mathbf{w})$ along the line defined by the local gradient that passes through the point $\mathbf{w}_0$. In general, the Hessian matrix of this surface can be positive or negative definite; it might as well have mixed eigenvalues. The unique stationary point of $J(\mathbf{w})$, which makes its gradient zero, can be reached by moving along the direction of the local gradient. The important issue is the selection of the sign, that is, whether to move along or against the gradient direction to reach the stationary point. The decision can be made by observing the local curvature of the cross section of $J(\mathbf{w})$ along the gradient direction. The performance surface cross section along the gradient direction at $\mathbf{w}_0$ is

$J(\mathbf{w}_0 - 2\eta\mathbf{H}\mathbf{w}_0) = \mathbf{w}_0^T(\mathbf{I} - 2\eta\mathbf{H})^T\mathbf{H}(\mathbf{I} - 2\eta\mathbf{H})\mathbf{w}_0 = \mathbf{w}_0^T\left(\mathbf{H} - 4\eta\mathbf{H}^2 + 4\eta^2\mathbf{H}^3\right)\mathbf{w}_0.$   (10.66)
From this, we deduce that the local curvature of the parabolic cross section at $\mathbf{w}_0$ is $4\mathbf{w}_0^T\mathbf{H}^3\mathbf{w}_0$. If the performance surface is locally convex, this curvature is positive. If the performance surface is locally concave, this curvature is negative. Also, note that $\mathrm{sgn}(4\mathbf{w}_0^T\mathbf{H}^3\mathbf{w}_0) = \mathrm{sgn}(\mathbf{w}_0^T\mathbf{H}\mathbf{w}_0)$. Thus, the update equation with the curvature information included converges to the stationary point regardless of the nature of the eigenvalues. Replacing the expectations in the sign term with their instantaneous values gives the EWC-LMS algorithm for β < 0:

$\mathbf{w}(n+1) = \mathbf{w}(n) + \eta(n)\,\mathrm{sgn}\!\left(e^2(n) - |\beta|\dot{e}^2(n)\right)\left[e(n)\mathbf{x}(n) - |\beta|\dot{e}(n)\dot{\mathbf{x}}(n)\right].$   (10.67)
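The sign-correction idea in the corollary is easy to verify on a toy indefinite quadratic (a self-contained sketch; the matrix H, step-size, and iteration counts are arbitrary choices, not taken from the chapter):

```python
import numpy as np

# Indefinite quadratic J(w) = w^T H w: the only stationary point is a saddle at the origin
H = np.array([[1.0, 0.0],
              [0.0, -0.5]])

def descend(steps, use_sign):
    w = np.array([2.0, 3.0])
    eta = 0.05
    for _ in range(steps):
        grad = 2.0 * H @ w
        s = np.sign(w @ H @ w) if use_sign else 1.0  # curvature sign along the gradient
        w = w - eta * s * grad
    return w

w_signed = descend(50_000, use_sign=True)
w_plain = descend(50, use_sign=False)
```

With the curvature sign included, the iterate contracts toward the saddle at the origin; with a fixed positive sign, the negative-curvature mode grows without bound, mirroring the argument in the proof.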
The experimental setup is the same as the one we used to test the REW algorithm. We varied the SNR between -10 dB and 10 dB and changed the number of filter parameters from 4 to 12. We set β = -0.5 and used the update equation in (10.67) for the EWC-LMS algorithm. A time-varying step-size magnitude was chosen in accordance with the upper bound given by (10.63) without the expectation operators. This greatly reduces the computational burden but makes the algorithm noisier. However, since we are using 50,000 samples for estimating the parameters, we can expect the errors to average out over the iterations. For the LMS algorithm, we chose the step-size that gave the least error in each trial. A total of 100 Monte Carlo trials were performed, and histograms of the normalized error vector norms were plotted. Figure 10.12 shows the error histograms for both the LMS and EWC-LMS algorithms. The EWC-LMS algorithm performs significantly better than the LMS algorithm at low SNR values. Their performances are on par for SNRs greater than 20 dB. Figure 10.13 shows a sample comparison between the stochastic and recursive algorithms for 0 dB SNR and four filter taps. Interestingly, the performance of the EWC-LMS algorithm is better than that of the REW algorithm in the presence of noise. Similarly, the LMS algorithm is much better than the RLS algorithm. This tells us that the stochastic algorithms reject more noise than the fixed-point algorithms. Researchers have made this observation before, although no concrete arguments exist to account for the smartness of the adaptive algorithms [35]. Similar conclusions can be drawn in our case for EWC-LMS and REW.
Figure 10.12 Histogram plots showing the normalized error vector norm for EWC-LMS and LMS algorithms.
Figure 10.13 Sample comparison between the stochastic and recursive algorithms for 0 dB SNR and four filter taps.
10.7.5
Figure 10.14 Contour plot of the EWC cost function with noisy input data.

Figure 10.14 shows the contour plot of the EWC cost function with noisy input data. Clearly, the Hessian of this
performance surface has both positive and negative eigenvalues, thus making the stationary point an undesirable saddle point. On the same plot, we have shown the weight tracks of the EWC-LMS algorithm in (10.67) with β = -0.5. Also, we have used a fixed value of 0.001 for the step-size. From the figure, it is clear that the EWC-LMS algorithm converges stably to the saddle point solution, which is theoretically unstable when a single-sign step-size is used. Note that due to the constant step-size, there is misadjustment in the final solution. Although no analytical expressions for the misadjustment are derived in this chapter, we have done some preliminary work on estimating the misadjustment and excess error for EWC-LMS [32, 33].
In Figure 10.15, we show the individual weight tracks for the EWC-LMS algorithm. The weights converge to the vicinity of the true filter parameters, which are 0.2 and 0.5, respectively, within 1000 samples. In order to see if the algorithm in (10.67) converges to the saddle point solution in a robust manner, we ran the same experiment using different initial conditions on the contours. Figure 10.16 shows a few plots of the weight tracks originating from different initial values over the contours of the performance surface. In every case, the algorithm converged to the saddle point in a stable manner. Note that the misadjustment in each case is almost the same. Finally, in order to see the effect of reducing the SNR, we repeated the experiment with 0 dB SNR. Figure 10.17 (left) shows the weight tracks over the contour, and we can see that there is more misadjustment now. However, we have observed that by using smaller step-sizes, the misadjustment can be controlled to be within acceptable limits. Figure 10.17 (right) shows the weight tracks when the algorithm is used without the sign information for the step-size. Note that convergence is not achieved in this case, which substantiates our previous argument that a fixed-sign step-size will never converge to a saddle point.
Figure 10.15 Weight tracks.
Figure 10.16 Contour plot with weight tracks for different initial values for the weights.

10.8 Conclusions
Mean square error has been the criterion of choice in many function approximation tasks, including adaptive filter optimization. There are alternatives and enhancements to MSE that have been proposed in order to improve the robustness of learning algorithms in the presence of noisy training data. In FIR filter adaptation, noise present in the input signal is especially problematic, since MSE cannot eliminate this factor. A powerful enhancement technique called total least squares, on the one hand, fails to work if the noise levels in the input and output signals are not equal. The alternative method of subspace Wiener filtering, on the other hand, requires the noise power to be strictly smaller than the signal power to improve SNR.
Figure 10.17 Contour plot with weight tracks for the EWC-LMS algorithm with sign information (left) and without sign information (right) (0 dB SNR and two filter taps case).
APPENDIX A
This appendix aims to achieve an understanding of the relationship between entropy and sample differences. In general, the parametric family describing the error probability density function (pdf) in supervised learning is not analytically available. In such circumstances, nonparametric approaches such as Parzen windowing [29] can be employed. Given the independent and identically distributed (i.i.d.) samples $\{e(1), \ldots, e(N)\}$ of a random variable e, the Parzen window estimate for the underlying pdf $f_e(\cdot)$ is obtained by

$\hat{f}_e(x) = \frac{1}{N}\sum_{i=1}^{N} \kappa_\sigma(x - e(i)),$   (A.1)

where $\kappa_\sigma(\cdot)$ is the kernel function, which itself is a pdf, and σ is the kernel size that controls the width of each window. Typically, Gaussian kernels are preferred, but other kernel functions such as the Cauchy density or the members of the generalized Gaussian family can be employed.
Shannon's entropy for a random variable e with pdf $f_e(\cdot)$ is defined as [37]

$H(e) = -\int_{-\infty}^{\infty} f_e(x)\log f_e(x)\,dx,$   (A.2)

which can be estimated from the samples by

$\hat{H}(e) = -\frac{1}{N}\sum_{j=1}^{N}\log\left(\frac{1}{N}\sum_{i=1}^{N}\kappa_\sigma(e(j) - e(i))\right).$   (A.3)

This estimator uses the sample mean approximation for the expected value and the Parzen window estimator for the pdf. Viola proposed a similar entropy estimator, in which he suggested dividing the samples into two subsets: one for estimating the pdf, the other for evaluating the sample mean [41]. In order to approximate a stochastic entropy estimator, we approximate the expectation by evaluating the argument at the most recent sample, e(k). In order to estimate the pdf, we use the L previous samples. The stochastic entropy estimator then becomes

$\hat{H}(e) = -\log\left(\frac{1}{L}\sum_{i=1}^{L}\kappa_\sigma(e(k) - e(k-i))\right).$   (A.4)
For supervised training of an ADALINE (or an FIR filter) with weight vector $\mathbf{w} \in \mathbb{R}^m$, given the input (vector)-desired training sequence $(\mathbf{x}(n), d(n))$, where $\mathbf{x}(n) \in \mathbb{R}^m$ and $d(n) \in \mathbb{R}$, the instantaneous error is given by $e(n) = d(n) - \mathbf{w}^T(n)\mathbf{x}(n)$. The stochastic gradient of the error entropy with respect to the weights
becomes

$\frac{\partial \hat{H}}{\partial \mathbf{w}} = \frac{\displaystyle\sum_{i=1}^{L}\kappa'_\sigma(e(n) - e(n-i))\,(\mathbf{x}(n) - \mathbf{x}(n-i))}{\displaystyle\sum_{i=1}^{L}\kappa_\sigma(e(n) - e(n-i))}.$   (A.6)

We easily note that the expression in (A.6) is also a stochastic gradient for the cost function $J = E[(e(n) - e(n-L))^2]/(2\sigma^2)$.
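The stochastic estimator (A.4) is simple to implement with a Gaussian kernel (a small sketch; the kernel size and window length are arbitrary choices):

```python
import numpy as np

def gaussian_kernel(u, sigma):
    # Gaussian pdf with standard deviation sigma, evaluated at u
    return np.exp(-0.5 * (u / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def stochastic_entropy(e, k, L, sigma=0.5):
    # (A.4): minus log of the Parzen density at e(k),
    # estimated from the L previous samples e(k-1), ..., e(k-L)
    window = e[k - L:k]
    return -np.log(np.mean(gaussian_kernel(e[k] - window, sigma)))

# An error sample close to the recent past gets a small instantaneous
# entropy value; one far from the recent past gets a large one.
e = np.zeros(11)
h_near = stochastic_entropy(e, 10, 10)
e[10] = 5.0
h_far = stochastic_entropy(e, 10, 10)
```

An error sample that falls far from the recent error samples is "surprising" and receives a large instantaneous entropy value, which is what entropy-based adaptation penalizes.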
APPENDIX B
Consider the correlation matrices R, S, P, and Q estimated from noisy data. For R, we write

$\mathbf{R} = E[\mathbf{x}(n)\mathbf{x}^T(n)] = E\left[(\tilde{\mathbf{x}}(n) + \mathbf{v}(n))(\tilde{\mathbf{x}}(n) + \mathbf{v}(n))^T\right]$
$= E\left[\tilde{\mathbf{x}}(n)\tilde{\mathbf{x}}^T(n) + \tilde{\mathbf{x}}(n)\mathbf{v}^T(n) + \mathbf{v}(n)\tilde{\mathbf{x}}^T(n) + \mathbf{v}(n)\mathbf{v}^T(n)\right]$
$= E[\tilde{\mathbf{x}}(n)\tilde{\mathbf{x}}^T(n)] + E[\mathbf{v}(n)\mathbf{v}^T(n)] = \tilde{\mathbf{R}} + \mathbf{V}.$   (B.1)
For S, we obtain

$\mathbf{S} = E\left[\mathbf{x}(n)\mathbf{x}^T(n) + \mathbf{x}(n-L)\mathbf{x}^T(n-L) - \mathbf{x}(n)\mathbf{x}^T(n-L) - \mathbf{x}(n-L)\mathbf{x}^T(n)\right]$
$= 2\mathbf{R} - E\left[(\tilde{\mathbf{x}}(n) + \mathbf{v}(n))(\tilde{\mathbf{x}}(n-L) + \mathbf{v}(n-L))^T + (\tilde{\mathbf{x}}(n-L) + \mathbf{v}(n-L))(\tilde{\mathbf{x}}(n) + \mathbf{v}(n))^T\right]$
$= 2(\tilde{\mathbf{R}} + \mathbf{V}) - E\left[\tilde{\mathbf{x}}(n)\tilde{\mathbf{x}}^T(n-L) + \tilde{\mathbf{x}}(n-L)\tilde{\mathbf{x}}^T(n)\right] - E\left[\mathbf{v}(n)\mathbf{v}^T(n-L) + \mathbf{v}(n-L)\mathbf{v}^T(n)\right]$
$= 2(\tilde{\mathbf{R}} + \mathbf{V}) - (\tilde{\mathbf{R}}_L + \mathbf{V}_L).$   (B.2)
For P, the cross terms vanish similarly:

$\mathbf{P} = E[\mathbf{x}(n)d(n)] = E\left[(\tilde{\mathbf{x}}(n) + \mathbf{v}(n))(\tilde{d}(n) + u(n))\right] = E[\tilde{\mathbf{x}}(n)\tilde{d}(n)] = \tilde{\mathbf{P}}.$   (B.3)
Finally, for Q,

$\mathbf{Q} = E\left[(\mathbf{x}(n) - \mathbf{x}(n-L))(d(n) - d(n-L))\right]$
$= E\left[\mathbf{x}(n)d(n) - \mathbf{x}(n)d(n-L) - \mathbf{x}(n-L)d(n) + \mathbf{x}(n-L)d(n-L)\right]$
$= 2\tilde{\mathbf{P}} - E\left[(\tilde{\mathbf{x}}(n) + \mathbf{v}(n))(\tilde{d}(n-L) + u(n-L)) + (\tilde{\mathbf{x}}(n-L) + \mathbf{v}(n-L))(\tilde{d}(n) + u(n))\right]$
$= 2\tilde{\mathbf{P}} - E\left[\tilde{\mathbf{x}}(n)\tilde{d}(n-L) + \tilde{\mathbf{x}}(n-L)\tilde{d}(n)\right] = 2\tilde{\mathbf{P}} - \tilde{\mathbf{P}}_L.$   (B.4)
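The two facts that drive this appendix, namely that the zero-lag correlation is inflated by the input-noise variance (B.1) while correlations at lag L are untouched by white noise (B.2), can be checked with a scalar simulation (arbitrary coloring filter and noise level):

```python
import numpy as np

rng = np.random.default_rng(3)
N, L = 200_000, 2
x_clean = np.convolve(rng.standard_normal(N), [1.0, 0.4, 0.8])[:N]  # colored clean signal
v = 0.7 * rng.standard_normal(N)   # white input noise, variance 0.49
x = x_clean + v

def lag_corr(a, b, lag):
    # Sample estimate of E[a(n) b(n - lag)]
    return np.mean(a[lag:] * b[:-lag])

r0_noisy = np.mean(x * x)                  # inflated by the noise variance (B.1)
r0_clean = np.mean(x_clean * x_clean)
rL_noisy = lag_corr(x, x, L)               # unaffected by white noise (B.2)
rL_clean = lag_corr(x_clean, x_clean, L)
```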
APPENDIX C
Recall that the optimal solution of EWC satisfies (10.9), which is equivalently

$E\left[(1 + 2\beta)e(n)\mathbf{x}(n) - \beta\left(e(n)\mathbf{x}(n-L) + e(n-L)\mathbf{x}(n)\right)\right] = \mathbf{0},$   (C.1)

or, rearranging,

$E\left[e(n)\mathbf{x}(n) + \beta\left(2e(n)\mathbf{x}(n) - e(n)\mathbf{x}(n-L) - e(n-L)\mathbf{x}(n)\right)\right] = \mathbf{0}.$   (C.2)

Note that the combination of x-values that multiply β forms an estimate of the acceleration of the input vector $\mathbf{x}(n)$. Specifically, for β = -1/2, the term that multiplies e(n) becomes a single-step prediction for the input vector $\mathbf{x}(n)$ (assuming zero velocity and constant acceleration), according to Newtonian mechanics. Thus, the optimal solution of the EWC criterion tries to decorrelate the error signal from the predicted next input vector.
Acknowledgments
This work is partially supported by NSF Grant ECS-9900394.
REFERENCES
1. H. Akaike, A New Look at the Statistical Model Identication, IEEE Trans. Automatic
Control, vol. 19, pp. 716 723, 1974.
2. C. Beck and F. Schlogl, Thermodynamics of Chaotic Systems. Cambridge University
Press, Cambridge, 1993.
3. J. Beirlant and M. C. A. Zuijlen, The Empirical Distribution Function and Strong Laws
for Functions of Order Statistics of Uniform Spacings, Journal of Multivariate Analysis,
vol. 16, pp. 300 317, 1985.
4. A. Benveniste, M. Metivier, and P. Priouret, Adaptive Algorithms and Stochastic
Approximations. Springer-Verlag, Berlin, 1990.
5. P. J. Bickel and L. Breiman, Sums of Functions of Nearest Neighbor Distances, Moment
Bounds, Limit Theorems and a Goodness-of-Fit Test, Annals of Statistics, vol. 11,
pp. 185 214, 1983.
6. C. Bishop, Neural Networks for Pattern Recognition. Clarendon Press, Oxford,
1995.
7. J. A. Cadzow, Total Least Squares, Matrix Enhancement, and Signal Processing,
Digital Signal Processing, vol. 4, pp. 21 39, 1994.
8. M. Chansarkar and U. B. Desai, A Robust Recursive Least Squares Algorithm, IEEE
Trans. Signal Processing, vol. 45, pp. 1726 1735, 1997.
9. B. de Moor, Total Least Squares for Afnely Structured Matrices and the Noisy
Realization Problem, IEEE Trans. Signal Processing, vol. 42 , pp. 3104 3113, 1994.
10. S. C. Douglas and W. Pan, Exact Expectation Analysis of the LMS Adaptive Filter,
IEEE. Trans. Signal Processing, vol. 43, pp. 2863 2871, 1995.
11. S. C. Douglas, Analysis of an Anti-Hebbian Adaptive FIR Filtering Algorithm, IEEE
Trans. Circuits and SystemsII: Analog and Digital Signal Processing, vol. 43, pp. 777
780, 1996.
12. D. Erdogmus, Information Theoretic Learning: Renyis Entropy and its Applications to
Adaptive System Training, Ph.D. dissertation, University of Florida, Gainesville, FL,
2002.
13. D. Erdogmus and J. C. Principe, An On-Line Adaptation Algorithm for Adaptive System
Training with Minimum Error Entropy: Stochastic Information Gradient, Proceedings
of ICA01, pp. 7 12, San Diego, CA, 2001.
14. D. Erdogmus and J. C. Principe, Generalized Information Potential Criterion for
Adaptive System Training, to appear in IEEE Trans. Neural Networks, vol. 13, no. 5, pp.
1035 1044, Sept. 2002.
15. D. Erdogmus, J. C. Principe, and K. E. Hild II, Do Hebbian Synapses Estimate
Entropy?, accepted by NNSP02, pp. 199 208, Martigny, Switzerland, Sept. 2002.
16. B. Farhang-Boroujeny, Adaptive Filters: Theory and Applications, Wiley, New York,
1998.
17. D. Z. Feng, Z. Bao, and L. C. Jiao, Total Least Mean Squares Algorithm, IEEE Trans.
Signal Processing, vol. 46, pp. 2122 2130, 1998.
18. K. Gao, M. O. Ahmad, and M. N. S. Swamy, A Constrained Anti-Hebbian Learning
Algorithm for Total Least Squares Estimation with Applications to Adaptive FIR and IIR
Filtering, IEEE Trans. Circuits and Systems Part 2, vol. 41, pp. 718 729, 1994.
488
19. G. H. Golub and C. F. van Loan, An Analysis of the Total Least Squares Problem,
SIAM J. Numerical Analysis, vol. 17, pp. 883893, 1979.
20. G. H. Golub and C. F. van Loan, Matrix Computations, Johns Hopkins University Press,
Baltimore, 1989.
21. P. Hall, Limit Theorems for Sums of General Functions of m-Spacings, Mathematical
Proceedings of the Cambridge Philosophical Society, vol. 96 pp. 517 532, 1984.
22. S. Haykin, Neural Networks: A Comprehensive Foundation, Macmillan, New York,
1994.
23. S. Haykin, Adaptive Filter Theory, Prentice-Hall, Upper Saddle River, NJ, 1996.
24. L. F. Kozachenko and N. N. Leonenko, Sample Estimate of Entropy of a Random
Vector, Problems of Information Transmission, vol. 23, pp. 95 101, 1987.
25. H. J. Kushner and D. S. Clark, Stochastic Approximation Methods for Constrained and
Unconstrained Systems, Springer-Verlag, New York, 1978.
26. P. Lemmerling, Structured Total Least Squares: Analysis, Algorithms, and Applications,
Ph.D. dissertation, Katholeike University, Leuven, Belgium, 1999.
27. L. Ljung, Analysis of Recursive Stochastic Algorithms, IEEE Trans. Automatic
Control, vol. AC-22, pp. 551 575, 1977.
28. M. Mueller, Least-Squares Algorithms for Adaptive Equalizers, Bell Systems Technical
Journal, vol. 60, pp. 1905 1925, 1981.
29. E. Parzen, On Estimation of a Probability Density Function and Mode, in Time Series
Analysis Papers, Holden-Day, San Diego, CA, 1967.
30. J. C. Principe, N. Euliano, and C. Lefebvre, Neural and Adaptive Systems: Fundamentals
Through Simulations, Wiley, New York, 1999.
31. Y. N. Rao, Algorithms for Eigendecomposition and Time Series Segmentation, M.S.
thesis, University of Florida, Gainesville, FL, 2000.
32. Y. N. Rao, D. Erdogmus, and J. C. Principe, Error Whitening Criterion for Adaptive
Filtering, under review, IEEE Trans. Signal Processing, Oct. 2002.
33. Y. N. Rao and J. C. Principe, Efficient Total Least Squares Method for System Modeling
Using Minor Component Analysis, Proc. IEEE Workshop on Neural Networks for
Signal Processing XII, pp. 259–258, Sep. 2002.
34. P. A. Regalia, Adaptive IIR Filtering in Signal Processing and Control. Marcel Dekker,
New York, 1995.
35. M. Reuter, K. Quirk, J. Zeidler, and L. Milstein, Non-Linear Effects in LMS Adaptive
Filters, Proceedings of IEEE 2000 AS-SPCC, pp. 141–146, October 2000.
36. J. Rissanen, Stochastic Complexity in Statistical Inquiry, World Scientific, London, 1989.
37. C. E. Shannon and W. Weaver, The Mathematical Theory of Communication, University
of Illinois Press, Urbana, 1964.
38. H. C. So, Modified LMS Algorithm for Unbiased Impulse Response Estimation in
Nonstationary Noise, IEE Electronics Letters, vol. 35, pp. 791–792, 1999.
39. F. P. Tarasenko, On the Evaluation of an Unknown Probability Density Function, the
Direct Estimation of the Entropy from Independent Observations of a Continuous
Random Variable, and the Distribution-Free Entropy Test of Goodness-of-Fit,
Proceedings of the IEEE, vol. 56, pp. 2052–2053, 1968.
40. A. B. Tsybakov and E. C. van der Meulen, Root-n Consistent Estimators of Entropy for
Densities with Unbounded Support, Scandinavian Journal of Statistics, vol. 23,
pp. 75–83, 1994.
41. P. Viola, N. Schraudolph, and T. Sejnowski, Empirical Entropy Manipulation for
Real-World Problems, Proceedings of NIPS 95, pp. 851–857, 1995.
42. A. Yeredor, The Extended Least Squares Criterion: Minimization Algorithms and
Applications, IEEE Trans. Signal Processing, vol. 49, pp. 74–86, 2000.
43. T. Söderström and P. Stoica, System Identification, Prentice-Hall, London, UK, 1989.
44. Y. N. Rao, D. Erdogmus, G. Y. Rao, and J. C. Principe, Fast Error Whitening Algorithms
for System Identification and Control, submitted to IEEE Workshop on Neural Networks
for Signal Processing, April 2003.
INDEX
Acoustic echo cancellation FIR filter,
151
Acoustic echo control, 209
Active tap detection: heuristics, 162
Adaptive equalization, 309, 367
Adaptive linear combiners, 1
Adaptive linear predictor, 364
Adaptive plant identification, 3
Adaptive process, 49, 61
Adaptive process (small step-size), 38
Affine projection algorithms (APA),
241, 242
APA as a contraction mapping, 252
block exact APA, 269
block fast affine projection: summary,
272
Almost-sure convergence, 95
Alternate single-channel time-varying
equivalent to the two-channel LTI
Wiener filter, 375
Analysis of autocorrelation of the error
signal, 447
Asymptotic behavior of learning rate
matrix, 326
Asymmetry of the probability
distribution, 100
Maximum likelihood estimators, 110
Mean-square convergence, 93
Misadjustment, 8
Misadjustment due to gradient noise,
17
Misadjustment due to lag, 17
Mixed H2/H∞ problems, 141
Model-order selection, 457
Motivation for error-whitening Wiener
filters, 447
MSE learning curve time constants, 10
MSE optimality of the RPNLMS
algorithm, 327
Multichannel adaptive or Wiener
filtering scenario, 357
Weight-error correlations, 46
Weight-error correlations with delay, 74
Weight tracks and convergence, 480
Wideband adaptive noise canceller
(ANC) scenario, 340
Wiener solution, 4