LEAST-MEAN-SQUARE ADAPTIVE FILTERS

Edited by Simon Haykin

This book is dedicated to Bernard Widrow for inventing the LMS filter and investigating its theory and applications.
PREFACE
The earliest work on adaptive filters may be traced back to the late 1950s, during which time a number of researchers were working independently on theories and applications of such filters. From this early work, the least-mean-square (LMS) algorithm emerged as a simple, yet effective, algorithm for the design of adaptive transversal (tapped-delay-line) filters.
The LMS algorithm was devised by Widrow and Hoff in 1959 in their study of a pattern-recognition machine known as the adaptive linear element, commonly referred to as the Adaline [1, 2]. The LMS algorithm is a stochastic gradient algorithm in that it iterates each tap weight of the transversal filter in the direction of the instantaneous gradient of the squared error signal with respect to the tap weight in question.
Let $\hat{\mathbf{w}}(n)$ denote the tap-weight vector of the LMS filter, computed at iteration (time step) $n$. The adaptive operation of the filter is completely described by the recursive equation (assuming complex data)

$$\hat{\mathbf{w}}(n+1) = \hat{\mathbf{w}}(n) + \mu\,\mathbf{u}(n)\left[d(n) - \hat{\mathbf{w}}^{H}(n)\mathbf{u}(n)\right]^{*}, \qquad (1)$$

where $\mathbf{u}(n)$ is the tap-input vector, $d(n)$ is the desired response, and $\mu$ is the step-size parameter. The quantity enclosed in square brackets is the error signal. The asterisk denotes complex conjugation, and the superscript $H$ denotes Hermitian transposition (i.e., ordinary transposition combined with complex conjugation).
Equation (1) is testimony to the simplicity of the LMS filter. This simplicity, coupled with the desirable properties of the LMS filter (discussed in the chapters of this book) and its practical applications [3, 4], has made the LMS filter and its variants an important part of the adaptive signal processing toolkit, not just for the past 40 years but for many years to come. Simply put, the LMS filter has withstood the test of time.
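The recursion of Eq. (1) translates directly into a few lines of code. The following is a minimal sketch in Python/NumPy; the zero initial condition and the system-identification usage below are illustrative choices, not prescriptions from the text:

```python
import numpy as np

def lms_filter(u, d, M, mu):
    """Complex LMS recursion of Eq. (1).

    u  : input samples (1-D array, real or complex)
    d  : desired-response samples
    M  : number of taps of the transversal filter
    mu : step-size parameter
    Returns the final tap-weight vector and the error sequence.
    """
    w = np.zeros(M, dtype=complex)            # zero initial tap weights (assumed)
    e = np.zeros(len(u), dtype=complex)
    for n in range(M - 1, len(u)):
        un = u[n - M + 1:n + 1][::-1]         # tap-input vector [u(n), ..., u(n-M+1)]
        e[n] = d[n] - np.conj(w) @ un         # error signal d(n) - w^H(n) u(n)
        w = w + mu * un * np.conj(e[n])       # Eq. (1)
    return w, e
```

Feeding the filter the input and output of a known 3-tap system, for instance, drives the adaptive weights toward the true taps.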
Although the LMS filter is very simple in computational terms, its mathematical analysis is profoundly complicated because of its stochastic and nonlinear nature. Indeed, despite the extensive effort that has been expended in the literature to analyze the LMS filter, we still do not have a direct mathematical theory for its stability and steady-state performance, and probably we never will. Nevertheless, we do have a good understanding of its behavior in a stationary as well as a nonstationary environment, as demonstrated in the chapters of this book.
The stochastic nature of the LMS filter manifests itself in the fact that in a stationary environment, and under the assumption of a small step-size parameter, the filter executes a form of Brownian motion. Specifically, the small step-size theory of the LMS filter is almost exactly described by the discrete-time version of the Langevin equation¹ [3]:

$$\Delta\nu_k(n) = \nu_k(n+1) - \nu_k(n) = -\mu\lambda_k\nu_k(n) + \phi_k(n), \qquad k = 1, 2, \ldots, M, \qquad (2)$$

which is naturally split into two parts: a damping force $-\mu\lambda_k\nu_k(n)$ and a stochastic force $\phi_k(n)$. The terms used herein are defined as follows:

$M$ — order (i.e., number of taps) of the transversal filter around which the LMS filter is built
$\lambda_k$ — $k$th eigenvalue of the correlation matrix of the input vector $\mathbf{u}(n)$, which is denoted by $\mathbf{R}$
$\phi_k(n)$ — $k$th component of the vector $\mu\mathbf{Q}^{H}\mathbf{u}(n)e_o^{*}(n)$
$\mathbf{Q}$ — unitary matrix whose $M$ columns constitute an orthogonal set of eigenvectors associated with the eigenvalues of the correlation matrix $\mathbf{R}$
$e_o(n)$ — optimum error signal produced by the corresponding Wiener filter driven by the input vector $\mathbf{u}(n)$ and the desired response $d(n)$
To illustrate the validity of Eq. (2) as the description of small step-size theory of the LMS filter, we present the results of a computer experiment on a classic example of adaptive equalization. The example involves an unknown linear channel whose impulse response is described by the raised cosine [3]

$$h(n) = \begin{cases} \dfrac{1}{2}\left[1 + \cos\left(\dfrac{2\pi}{W}(n-2)\right)\right], & n = 1, 2, 3, \\ 0, & \text{otherwise,} \end{cases} \qquad (3)$$

where the parameter $W$ controls the amount of amplitude distortion produced by the channel, with the distortion increasing with $W$. Equivalently, the parameter $W$ controls the eigenvalue spread (i.e., the ratio of the largest eigenvalue to the smallest eigenvalue) of the correlation matrix of the tap inputs of the equalizer, with the eigenvalue spread increasing with $W$. The equalizer has $M = 11$ taps. Figure 1 presents the learning curves of the equalizer trained using the LMS algorithm with the step-size parameter $\mu = 0.0075$ and varying $W$. Each learning curve was obtained by averaging the squared value of the error signal $e(n)$ versus the number of iterations $n$ over an ensemble of 100 independent trials of the experiment.
¹ The Langevin equation is the engineer's version of stochastic differential (difference) equations.
Figure 1 Learning curves of the LMS algorithm applied to the adaptive equalization of a communication channel whose impulse response is described by Eq. (3), for varying eigenvalue spreads. Theory is represented by continuous, well-defined curves; experimental results are represented by fluctuating curves.
The continuous curves shown in Figure 1 are theoretical, obtained by applying Eq. (2). The curves with relatively small fluctuations are the results of experimental work. Figure 1 demonstrates close agreement between theory and experiment.
It should, however, be reemphasized that application of Eq. (2) is limited to small values of the step-size parameter $\mu$. Chapters in this book deal with cases in which $\mu$ is large.
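The equalization experiment can be reproduced in outline. The sketch below assumes Bernoulli ($\pm 1$) training symbols, an overall decision delay of 7 samples, and a small additive channel noise; those details are illustrative assumptions, not values stated in the text:

```python
import numpy as np

def raised_cosine_channel(W):
    # Impulse response of Eq. (3): h(n) = 0.5*(1 + cos(2*pi*(n-2)/W)), n = 1, 2, 3
    n = np.arange(1, 4)
    return 0.5 * (1.0 + np.cos(2.0 * np.pi * (n - 2) / W))

def lms_equalizer_learning_curve(W, mu=0.0075, M=11, n_iter=500, trials=100, seed=0):
    """Ensemble-averaged squared-error learning curve of an LMS equalizer."""
    rng = np.random.default_rng(seed)
    h = raised_cosine_channel(W)
    delay = 7                                   # assumed channel + equalizer decision delay
    mse = np.zeros(n_iter)
    for _ in range(trials):
        a = rng.choice([-1.0, 1.0], size=n_iter + M + delay)   # Bernoulli data (assumed)
        x = np.convolve(a, h)[:len(a)]
        x = x + 0.001 * rng.standard_normal(len(x))            # small channel noise (assumed)
        w = np.zeros(M)
        for n in range(n_iter):
            k = n + M - 1
            u = x[k - M + 1:k + 1][::-1]        # equalizer tap-input vector
            e = a[k - delay] - w @ u            # error against the delayed symbol
            w = w + mu * u * e                  # LMS update (real data)
            mse[n] += e * e
    return mse / trials
```

For a moderate eigenvalue spread the resulting curve decays from roughly unit MSE toward a small residual value, in the manner of Figure 1.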
REFERENCES

1. B. Widrow and M. E. Hoff, Jr. (1960). "Adaptive Switching Circuits," IRE WESCON Conv. Rec., Part 4, pp. 96-104.
2. B. Widrow (1966). "Adaptive Filters I: Fundamentals," Rep. SEL-66-126 (TR-6764-6), Stanford Electronic Laboratories, Stanford, CA.
3. S. Haykin (2002). Adaptive Filter Theory, 4th ed., Prentice-Hall.
4. B. Widrow and S. D. Stearns (1985). Adaptive Signal Processing, Prentice-Hall.
ON THE EFFICIENCY OF ADAPTIVE ALGORITHMS

1.1 INTRODUCTION
The basic component of most adaptive filtering and signal processing systems is the adaptive linear combiner [1-5] shown in Figure 1.1. The formed output signal is a weighted sum of a set of input signals. The output would be a simple linear combination of the inputs only if the weights were fixed. In actual practice, the weights are adjusted or adapted purposefully; the resulting weight values are signal dependent. This process causes the system behavior during adaptation to differ significantly from that of a linear system. However, after the adaptive process has converged and the weights have settled to essentially fixed values with only minor random fluctuations about the equilibrium solution, the converged system exhibits essentially linear behavior.

Adaptive linear combiners have been successfully used in the modeling of unknown systems [2, 6-8], linear prediction [2, 9-11], adaptive noise cancelling [4, 12], adaptive antenna systems [3, 13-15], channel equalization systems for high-speed digital communications [16-19], echo cancellation [20-23], systems for instantaneous frequency estimation [24], receivers of narrowband signals buried in noise (the adaptive line enhancer) [4, 25-30], adaptive control systems [31], and many other applications.
In Figure 1.1a, the interpretation of the input signal vector $\mathbf{u}(n) = [u_1(n), \ldots, u_K(n)]^T$ and the desired response $d(n)$ might vary, depending on how the adaptive linear combiner is used. In Figure 1.1b, an application to adaptive finite impulse response (FIR) filtering is shown. In turn, an application of adaptive FIR filtering to plant modeling or system identification is shown in Figure 1.2. Here, we can view the desired response $d(n)$ as a linear combination of the last $K$ samples of the input signal, corrupted by independent zero-mean plant noise $v(n)$. Our aim in this application is to estimate an unknown plant (represented by its transfer function $P(z)$).

Least-Mean-Square Adaptive Filters, Edited by Simon Haykin and Bernard Widrow. ISBN 0-471-21570-8 © 2003 John Wiley & Sons, Inc.
Figure 1.1 Adaptive linear combiner and its application in an adaptive filter: (a) linear combiner; (b) adaptive FIR filter.
Figure 1.2 Adaptive FIR filter applied to plant modeling (system identification).

The output of the adaptive FIR filter is

$$y(n) = \sum_{i=1}^{K} w_i u(n-i+1) = \mathbf{w}^T\mathbf{u}(n) = \mathbf{u}^T(n)\mathbf{w}. \qquad (1.3)$$
The input signal vector and the desired response are assumed to be wide-sense stationary. Denoting the desired response as $d(n)$, the error at the $n$th time is

$$e(n) = d(n) - y(n) = d(n) - \mathbf{w}^T\mathbf{u}(n) = d(n) - \mathbf{u}^T(n)\mathbf{w}. \qquad (1.4)$$

The mean-square error (MSE) is defined as

$$\xi = E[e^2(n)], \qquad (1.5)$$

which, for fixed weights, expands to

$$\xi = E[d^2(n)] - 2\mathbf{p}^T\mathbf{w} + \mathbf{w}^T\mathbf{R}\mathbf{w}, \qquad (1.6)$$

where the cross-correlation vector between the input signal and the desired response is defined as

$$E[d(n)\mathbf{u}(n)] = E\begin{bmatrix} d(n)u(n) \\ \vdots \\ d(n)u(n-K+1) \end{bmatrix} \triangleq \mathbf{p}, \qquad (1.7)$$

and the input autocorrelation matrix $\mathbf{R}$ is defined as
$$E[\mathbf{u}(n)\mathbf{u}^T(n)] = E\begin{bmatrix} u(n)u(n) & \cdots & u(n)u(n-K+1) \\ u(n-1)u(n) & \cdots & u(n-1)u(n-K+1) \\ \vdots & \ddots & \vdots \\ u(n-K+1)u(n) & \cdots & u(n-K+1)u(n-K+1) \end{bmatrix} \triangleq \mathbf{R}. \qquad (1.8)$$
It can be observed from Eq. (1.6) that with wide-sense stationary inputs, the MSE performance function is a quadratic function of the weights, a paraboloidal bowl. This function can be minimized by differentiating $\xi$ with respect to $\mathbf{w}$ and setting the derivative to zero. The minimal point is

$$\mathbf{w} = \mathbf{w}_o = \mathbf{R}^{-1}\mathbf{p}. \qquad (1.9)$$

The optimal weight vector $\mathbf{w}_o$ is known as the Wiener weight vector or the Wiener solution.
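Equations (1.7)-(1.9) suggest a direct numerical check: estimate $\mathbf{R}$ and $\mathbf{p}$ from data and solve for the Wiener weights. A minimal sketch, in which the plant weights and noise level are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
K, N = 4, 50_000
w_plant = np.array([1.0, -0.5, 0.25, 0.1])       # illustrative plant weights

# Training data: d(n) = w_plant^T u(n) + v(n), white input, zero-mean plant noise
U = rng.standard_normal((N, K))
d = U @ w_plant + 0.1 * rng.standard_normal(N)

R_hat = U.T @ U / N               # sample estimate of R, Eq. (1.8)
p_hat = U.T @ d / N               # sample estimate of p, Eq. (1.7)
w_o = np.linalg.solve(R_hat, p_hat)   # Wiener solution, Eq. (1.9): w_o = R^{-1} p
```

With enough data, the estimated Wiener weights recover the plant weights up to estimation noise.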
In practice, we would not know the exact statistics of $\mathbf{R}$ and $\mathbf{p}$. One way of finding an estimate of the optimal weight vector $\mathbf{w}_o$ would be to estimate $\mathbf{R}$ and $\mathbf{p}$ for the given input and desired response. This approach would lead to what is called an exact least-mean-square solution. This approach is optimal in the sense that the sum of square errors will be minimal for the given data samples. However, such solutions are generally somewhat complex from the computational point of view [32-37].

On the other hand, one can use one of the simpler gradient search algorithms, such as the least-mean-square (LMS) steepest descent algorithm of Widrow and Hoff [1]. However, this algorithm is sometimes associated with a certain deterioration in performance in problems for which there exists great spread among the eigenvalues of the autocorrelation matrix $\mathbf{R}$ (see, for instance, [32, 34, 37]).
In order to establish a bridge between the LMS and the exact least squares approaches mentioned above, we will introduce an idealized algorithm called LMS/Newton [5]. For the implementation of this algorithm, we will have to assume perfect knowledge of the autocorrelation matrix $\mathbf{R}$. Naturally, that means that this idealized algorithm cannot be used in practice. However, its performance provides a convenient theoretical benchmark for the sake of comparison.¹

¹ It should be noted that there are numerous algorithms in the literature that recursively estimate the autocorrelation matrix $\mathbf{R}$ and use this estimate for orthogonalizing the input data (see, for instance, [38-42]). These algorithms converge asymptotically to the idealized algorithm discussed here.
In the next section, we will briefly analyze the performance of the exact least squares solution when the weights are obtained with a finite data sample. Then, in Sections 1.3 and 1.4, we will analyze the idealized LMS/Newton algorithm and, in Section 1.5, show, at least heuristically, that its performance is equivalent to that of an exact least squares algorithm. Based on this heuristic argument, we will view the LMS/Newton process as an optimal gradient search algorithm.

In Section 1.6 we will define a class of nonstationary problems: problems in which an unknown plant $P(z)$ varies in a certain random way. Once again, the adaptive filter will perform a modeling task. For this class of frequently encountered problems, we will analyze and compare the performance of LMS/Newton with that of the conventional steepest descent LMS algorithm. We will show that both perform equivalently (in the mean-square sense) for this class of nonstationary problems. In Section 1.7, we will examine the MSE learning curves and the transient behavior of adaptive algorithms. The excess error energy will be defined to be the area under the excess MSE curve. The LMS and LMS/Newton algorithms will be shown to perform, on average, equivalently with respect to this important criterion if they both start learning from random initial conditions that have the same variance. In Sections 1.8 and 1.9, we will conclude this chapter by summarizing the various comparisons made between the LMS algorithm and the ideal LMS/Newton algorithm.
1.2

Suppose that the adaptive linear combiner in Figure 1.2 is fed $N$ independent zero-mean $K \times 1$ training vectors $\mathbf{u}(1), \mathbf{u}(2), \ldots, \mathbf{u}(N)$ and their respective scalar desired responses $d(1), d(2), \ldots, d(N)$, all drawn from a wide-sense stationary process. Keeping the weights fixed, a set of $N$ error equations can be written as

$$e(n) = d(n) - \mathbf{u}^T(n)\mathbf{w}, \qquad n = 1, 2, \ldots, N. \qquad (1.10)$$

The objective is to find a weight vector that minimizes the sum of the squares of the error values based on the finite sample of $N$ items of data.
Equation (1.10) can be written in matrix form as

$$\mathbf{e} = \mathbf{d} - \mathbf{U}\mathbf{w}, \qquad (1.11)$$

where

$$\mathbf{e} \triangleq [e(1), e(2), \ldots, e(N)]^T, \qquad (1.12)$$
$$\mathbf{d} \triangleq [d(1), d(2), \ldots, d(N)]^T, \qquad (1.13)$$
$$\mathbf{U} \triangleq [\mathbf{u}(1), \mathbf{u}(2), \ldots, \mathbf{u}(N)]^T. \qquad (1.14)$$

A unique solution of Eq. (1.11), a weight vector $\mathbf{w}$ that brings $\mathbf{e}$ to zero, exists only if $\mathbf{U}$ is square and nonsingular. However, the case of greatest interest is that of $N > K$. As such, Eq. (1.11) would typically be overconstrained, and one would generally seek a best least squares solution. The sum of the squares of the errors is

$$\mathbf{e}^T\mathbf{e} = \mathbf{d}^T\mathbf{d} + \mathbf{w}^T\mathbf{U}^T\mathbf{U}\mathbf{w} - 2\mathbf{d}^T\mathbf{U}\mathbf{w}. \qquad (1.15)$$
The small-sample-size MSE is defined as

$$\hat{\xi} \triangleq \frac{1}{N}\mathbf{e}^T\mathbf{e}, \qquad (1.16)$$

and

$$\lim_{N\to\infty} \hat{\xi} = \xi. \qquad (1.17)$$

Note that $\hat{\xi}$ is a quadratic function of the weights. The parameters of the quadratic form are related to properties of the $N$ data samples. $\mathbf{U}^T\mathbf{U}$ is square and is assumed to be positive definite. $\hat{\xi}$ is a small-sample-size MSE function; $\xi$ is the large-sample-size true MSE function, and it is also a quadratic function of the weights. Figure 1.3 shows a comparative sketch of these functions. Many small-sample-size data-dependent curves are possible, but there is only one large-sample-size curve. The unique large-sample-size curve is the average of the many small-sample-size curves.
Minimizing $\hat{\xi}$ with respect to the weights yields

$$\mathbf{w}_{LS} = (\mathbf{U}^T\mathbf{U})^{-1}\mathbf{U}^T\mathbf{d}. \qquad (1.18)$$

This is the exact least squares solution for the given data sample. The Wiener solution $\mathbf{w}_o$ is the expected value of $\mathbf{w}_{LS}$.
Each small-sample-size curve is an ensemble member. Let the ensemble be constructed in the following manner. Assume that the vectors $\mathbf{u}(1), \mathbf{u}(2), \ldots, \mathbf{u}(N)$ are the same for all ensemble members, but that the associated desired responses $d(1), d(2), \ldots, d(N)$ differ from one ensemble member to another because of the stochastic character of plant noise (refer to Fig. 1.2). Over this ensemble, therefore, the $\mathbf{U}$ matrix is constant, while the desired response vector $\mathbf{d}$ is stochastic. In order to evaluate the excess MSE due to adaptation with the finite amount of data available, we have to find

$$\xi_{\text{excess}} = \frac{1}{N}E[\boldsymbol{\varepsilon}^T\mathbf{U}^T\mathbf{U}\boldsymbol{\varepsilon}], \qquad (1.19)$$

where $\boldsymbol{\varepsilon} \triangleq \mathbf{w}_{LS} - \mathbf{w}_o$ is the weight-error vector. Using the properties of the trace operator,

$$\xi_{\text{excess}} = \frac{1}{N}\mathrm{Tr}\{E[\boldsymbol{\varepsilon}^T\mathbf{U}^T\mathbf{U}\boldsymbol{\varepsilon}]\} = \frac{1}{N}E\{\mathrm{Tr}[\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^T\mathbf{U}^T\mathbf{U}]\} = \frac{1}{N}\mathrm{Tr}\{E[\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^T]\mathbf{U}^T\mathbf{U}\}. \qquad (1.21)$$

Over this ensemble, the covariance of the weight-error vector is

$$E[\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^T] = \xi_{\min}(\mathbf{U}^T\mathbf{U})^{-1}, \qquad (1.23)$$
where $\xi_{\min}$ is the minimum MSE, the minimum of the true MSE function (see Fig. 1.3). Substitution of Eq. (1.23) into Eq. (1.21) yields

$$\xi_{\text{excess}} = \frac{K}{N}\xi_{\min}. \qquad (1.24)$$
It is important to note that this formula does not depend on U. The above-described
ensemble can be generalized to an ensemble of ensembles, each having its own U,
without changing Eq. (1.24). Hence, this formula is valid for a very wide class of
inputs.
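The prediction $\xi_{\text{excess}} = (K/N)\xi_{\min}$ of Eq. (1.24) can be checked by Monte Carlo: solve the exact least squares problem on many independent finite data samples and measure the average excess MSE. A sketch with white unit-variance inputs (so $\mathbf{R} = \mathbf{I}$); all numerical values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
K, N, trials = 5, 100, 2000
xi_min = 0.25                     # plant-noise variance = minimum MSE (illustrative)
w_o = np.zeros(K)                 # target (Wiener) weights, zero for simplicity

excess = 0.0
for _ in range(trials):
    U = rng.standard_normal((N, K))                  # white input: R = I
    d = U @ w_o + np.sqrt(xi_min) * rng.standard_normal(N)
    w_ls = np.linalg.lstsq(U, d, rcond=None)[0]      # exact least squares, Eq. (1.18)
    eps = w_ls - w_o
    excess += eps @ eps                              # eps^T R eps with R = I

M_hat = excess / trials / xi_min                     # measured misadjustment
M_theory = K / N                                     # prediction from Eq. (1.24)
```

The measured ratio lands close to $K/N = 0.05$, independently of the particular realizations of $\mathbf{U}$.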
It is useful to consider a dimensionless ratio between the excess MSE and the minimum MSE. This ratio is commonly called (see, e.g., [1, 2, 4]) the misadjustment, $M$. For the exact least squares solution based on learning with a finite data sample, we find the misadjustment from Eq. (1.24) as

$$M = \frac{K}{N} = \frac{\text{number of weights}}{\text{number of independent training samples}}. \qquad (1.25)$$

The method of steepest descent adjusts the weights iteratively along the negative gradient of the MSE surface,

$$\mathbf{w}(n+1) = \mathbf{w}(n) + \mu[-\nabla(n)], \qquad (1.26)$$

while the LMS algorithm replaces the true gradient by the instantaneous estimate

$$\hat{\nabla}(n) = -2e(n)\mathbf{u}(n). \qquad (1.27)$$
This is a noisy but unbiased estimate of the gradient [5, p. 101]. Using this instantaneous gradient in place of the true gradient in Eq. (1.26) yields the LMS algorithm of Widrow and Hoff:

$$\mathbf{w}(n+1) = \mathbf{w}(n) + 2\mu e(n)\mathbf{u}(n). \qquad (1.28)$$
The behavior of this algorithm has been analyzed extensively in the literature (see, e.g., [2-4, 44-51]). It was proved in [2] and [4] that if the adaptation constant $\mu$ were chosen such that

$$0 < \mu < \frac{1}{\mathrm{Tr}[\mathbf{R}]}, \qquad (1.29)$$
then the adaptive weights would relax from their initial condition to hover randomly about the Wiener solution $\mathbf{w}_o$. The weight-error vector

$$\boldsymbol{\varepsilon}(n) \triangleq \mathbf{w}(n) - \mathbf{w}_o \qquad (1.30)$$

will then converge to zero in the mean, and its variance will be stable [2-4, 47, 52, 53]. The relaxation process will be governed by the relation

$$E[\boldsymbol{\varepsilon}(n+1)] = (\mathbf{I} - 2\mu\mathbf{R})E[\boldsymbol{\varepsilon}(n)]. \qquad (1.31)$$
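The mean relaxation of Eq. (1.31) can be observed directly by averaging LMS weight errors over an ensemble of independent runs. A minimal sketch; the correlation matrix, step size, and initial error below are illustrative choices:

```python
import numpy as np

# Ensemble check of Eq. (1.31): E[eps(n+1)] = (I - 2*mu*R) E[eps(n)]
rng = np.random.default_rng(3)
K, mu, n_steps, members = 2, 0.02, 30, 4000
R = np.array([[1.0, 0.4], [0.4, 1.0]])
L = np.linalg.cholesky(R)                   # inputs u = L z satisfy E[u u^T] = R
eps = np.tile([0.5, 0.5], (members, 1))     # common initial weight error eps(0)

for _ in range(n_steps):
    u = rng.standard_normal((members, K)) @ L.T
    e = -(eps * u).sum(axis=1, keepdims=True)    # error e(n) = -eps^T u (no plant noise)
    eps = eps + 2 * mu * e * u                   # LMS update of Eq. (1.28), in error form

mean_err = eps.mean(axis=0)                 # ensemble-mean weight error after n_steps
theory = np.linalg.matrix_power(np.eye(K) - 2 * mu * R, n_steps) @ np.array([0.5, 0.5])
```

The ensemble mean tracks the deterministic recursion $( \mathbf{I} - 2\mu\mathbf{R})^n \boldsymbol{\varepsilon}(0)$ closely.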
The correlation matrix $\mathbf{R}$ can be diagonalized as

$$\mathbf{Q}\mathbf{Q}^T = \mathbf{I}, \qquad \mathbf{R} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^T, \qquad (1.32)$$

where

$$\boldsymbol{\Lambda} = \begin{bmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_K \end{bmatrix}. \qquad (1.33)$$

Assuming slow adaptation, $\mu\lambda_i \ll 1$, the time constants of the weight relaxation are

$$\tau_i \approx \frac{1}{2\mu\lambda_i}, \qquad 1 \le i \le K. \qquad (1.34a)$$
As the weights relax toward the Wiener solution, the MSE, a quadratic function of the weights, undergoes a geometric progression toward $\xi_{\min}$. The learning curve is a plot of MSE versus the number of adaptation cycles. The natural modes of the learning curve have time constants half as large as the corresponding time constants of the weights [2-4]. Accordingly, the MSE learning curve time constants are

$$\tau_i^{\text{MSE}} \approx \frac{1}{4\mu\lambda_i}, \qquad 1 \le i \le K. \qquad (1.34b)$$
After convergence has taken place, there remains noise in the weights due to the noise in the estimation of the gradient in Eq. (1.27). An approximate value of the covariance of the weight noise, valid for small $\mu$, was derived in [4, App. D]:

$$E[\boldsymbol{\varepsilon}(n)\boldsymbol{\varepsilon}^T(n)] \approx \mu\xi_{\min}\mathbf{I}. \qquad (1.35)$$

The noise in the weights will cause an excess error in the system output (in addition to $\xi_{\min}$, the Wiener error):

$$\xi_{\text{excess}} = E[(\boldsymbol{\varepsilon}^T(n)\mathbf{u}(n))^2] \qquad (1.36)$$
$$\approx \mu\xi_{\min}\mathrm{Tr}[\mathbf{R}]. \qquad (1.37)$$
Therefore, we can compute the misadjustment, defined as the ratio between the excess and the minimum MSE:

$$M \triangleq \frac{\xi_{\text{excess}}}{\xi_{\min}} = \mu\,\mathrm{Tr}[\mathbf{R}]. \qquad (1.38)$$
The adaptation constant $\mu$ should be kept low in order to keep the misadjustment low. However, low $\mu$ is associated with slow adaptation, in accordance with Eq. (1.34). Equations (1.29) to (1.38) illustrate the potential vulnerability of the steepest descent algorithm. The speed of convergence will depend on the choice of initial conditions. In the worst case, the convergence will be dominated by the lowest eigenvalue

$$\lambda_{\min} = \min(\lambda_1, \ldots, \lambda_K). \qquad (1.39)$$
This implies that even if we choose the maximal value allowable for the adaptation constant $\mu$ (due to the stability constraint in Eq. (1.29)), the slowest MSE time constant would be

$$\tau_{\max}^{\text{MSE}} = \frac{1}{4\mu\lambda_{\min}}. \qquad (1.40)$$
For the class of problems for which there exists a great spread of eigenvalues of the
autocorrelation matrix R, this number will be high, resulting in long convergence
times (at least in the worst case).
The LMS/Newton algorithm is

$$\mathbf{w}(n+1) = \mathbf{w}(n) + 2\mu\lambda_{\text{ave}}\mathbf{R}^{-1}e(n)\mathbf{u}(n). \qquad (1.41)$$

The gradient estimate is premultiplied by $\mathbf{R}^{-1}$ and in addition scaled by $\lambda_{\text{ave}}$, the average of the eigenvalues of $\mathbf{R}$. With this scaling, the LMS/Newton algorithm of Eq. (1.41) becomes identical to the steepest descent LMS algorithm of Eq. (1.28) when all of the eigenvalues are equal.
The LMS/Newton algorithm will be shown to be the most efficient of all adaptive algorithms. For a given number of weights and convergence speed, it has the lowest possible misadjustment. The LMS/Newton algorithm cannot be implemented physically, because perfect knowledge of the autocorrelation matrix $\mathbf{R}$ and its inverse usually does not exist. On the other hand, the LMS/Newton algorithm is very important from a theoretical point of view because of its optimality.
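Although LMS/Newton is idealized, it is easy to simulate when $\mathbf{R}$ is known by construction. A sketch of the recursion of Eq. (1.41); the correlation matrix, plant weights, and step size in the usage below are illustrative:

```python
import numpy as np

def lms_newton(u_vecs, d, R, mu):
    """Idealized LMS/Newton recursion of Eq. (1.41):
        w(n+1) = w(n) + 2*mu*lambda_ave * R^{-1} * e(n) * u(n),
    assuming the true autocorrelation matrix R is known."""
    K = R.shape[0]
    lam_ave = np.trace(R) / K           # average eigenvalue of R
    R_inv = np.linalg.inv(R)
    w = np.zeros(K)
    for un, dn in zip(u_vecs, d):
        e = dn - w @ un                 # instantaneous error
        w = w + 2.0 * mu * lam_ave * e * (R_inv @ un)
    return w
```

Driving it with colored inputs of known covariance (generated through a Cholesky factor of $\mathbf{R}$) and a noiseless plant, the weights converge to the plant weights; the step size must respect $0 < \mu < 1/\mathrm{Tr}[\mathbf{R}]$.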
The conditions for stability, as well as the learning time constant and misadjustment formulas, for the LMS/Newton algorithm can be readily obtained. The condition for convergence in the mean and in the variance for LMS/Newton is

$$0 < \mu < \frac{1}{\mathrm{Tr}[\mathbf{R}]}. \qquad (1.42)$$
This is identical to Eq. (1.29). The time constant of the MSE learning curve for LMS/Newton is

$$\tau_{\text{MSE}} = \frac{1}{4\mu\lambda_{\text{ave}}}. \qquad (1.43)$$
Comparing this to Eq. (1.34b), one can see that LMS has many time constants, while LMS/Newton has only one. When the eigenvalues are equal, both algorithms have only one time constant and these formulas become identical. The misadjustment of LMS/Newton is

$$M = \mu\,\mathrm{Tr}[\mathbf{R}]. \qquad (1.44)$$

Combining Eqs. (1.43) and (1.44),

$$M = \frac{\mathrm{Tr}[\mathbf{R}]}{4\tau_{\text{MSE}}\lambda_{\text{ave}}} = \frac{K}{4\tau_{\text{MSE}}}, \qquad (1.45)$$

that is,

$$M = \frac{\text{number of weights}}{4 \times (\text{MSE learning time constant})}. \qquad (1.46)$$
When learning with a finite data sample, the optimal weight vector is the best least squares solution for that data sample, and it is often called the exact least squares solution. This solution, given by Eq. (1.18), makes the best use of the finite number of data samples in the least squares sense. All of the data are weighted equally in affecting the solution. This solution will vary from one finite data sample to another. From Eq. (1.25), the misadjustment of the exact least squares solution is given by

$$M = \frac{\text{number of weights}}{\text{number of independent training samples}}. \qquad (1.47)$$

For the same consumption of data, it is apparent that LMS/Newton and exact least squares yield the same misadjustment. Although we are comparing apples with oranges by comparing a steady-flow algorithm with an algorithm that learns with a finite data sample, we nevertheless find that LMS/Newton is as efficient as exact least squares when we relate the quality of the weight-vector solution to the amount of data used in obtaining it. Since the exact least squares solution makes optimal use of the data, so does LMS/Newton.
1.6

Figure 1.4
corresponding weights of the adaptive filter and are designated as $\mathbf{w}_o(n)$, the time index indicating that the unknown target to be tracked is time-varying. The components of $\mathbf{w}_o(n)$ are generated by passing independent white noises of variance $\sigma^2$ through identical one-pole low-pass filters. The components of $\mathbf{w}_o(n)$ therefore vary as independent first-order Markov processes. The formation of $\mathbf{w}_o(n)$ is illustrated in Figures 1.4 and 1.5.
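Generating such a first-order Markov target is a one-line recursion per time step. A sketch, where the pole location `a` and the noise level `sigma` are illustrative choices not taken from the text:

```python
import numpy as np

def markov_target(K, n_steps, a, sigma, seed=0):
    """Time-varying target w_o(n): each of the K components is an independent
    first-order Markov process, i.e., white noise of variance sigma^2 passed
    through an identical one-pole low-pass filter with pole at z = a."""
    rng = np.random.default_rng(seed)
    w = np.zeros((n_steps, K))
    for n in range(1, n_steps):
        w[n] = a * w[n - 1] + sigma * rng.standard_normal(K)
    return w
```

The stationary variance of each component is $\sigma^2/(1 - a^2)$, which sets the scale of the tracking problem.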
Figure 1.5
According to the scheme of Figure 1.4, minimizing the MSE causes the adaptive weight vector $\mathbf{w}(n)$ to attempt to best match the unknown $\mathbf{w}_o(n)$ on a continual basis. The $\mathbf{R}$ matrix, dependent only on the statistics of $\mathbf{u}(n)$, is constant even as $\mathbf{w}_o(n)$ varies. The desired response of the adaptive filter, $d(n)$, is nonstationary, being the output of a time-varying system. The minimum MSE, $\xi_{\min}$, is constant. Thus the MSE function, a quadratic bowl, varies in position, while its eigenvalues, eigenvectors, and $\xi_{\min}$ remain constant.
In order to study this form of nonstationary adaptation both analytically and by computer simulation, a model comprising an ensemble of nonstationary adaptive processes has been defined and constructed, as illustrated in Figure 1.5. Throughout the ensemble, the unknown filters to be modeled are all identical and have the same time-varying weight vector $\mathbf{w}_o(n)$. Each ensemble member has its own independent input signal going to both the unknown system and the corresponding adaptive system. The effect of output noise in the unknown systems is obtained by the addition of independent noises of variance $\xi_{\min}$. All of the adaptive filters are assumed to start with the same initial weight vector $\mathbf{w}(0)$; each develops its own weight vector over time in attempting to pursue the moving Markovian target $\mathbf{w}_o(n)$.
For a given adaptive filter, the weight-vector tracking error at the $n$th instant is $\boldsymbol{\varepsilon}(n) \triangleq \mathbf{w}(n) - \mathbf{w}_o(n)$. This error is due to both the effects of gradient noise and weight-vector lag and may be expressed as

$$\boldsymbol{\varepsilon}(n) = \mathbf{w}(n) - \mathbf{w}_o(n) = \underbrace{\left[\mathbf{w}(n) - E[\mathbf{w}(n)]\right]}_{\text{weight-vector noise}} + \underbrace{\left[E[\mathbf{w}(n)] - \mathbf{w}_o(n)\right]}_{\text{weight-vector lag}}. \qquad (1.48)$$
The expectations are averages over the ensemble. Equation (1.48) identifies the two components of the error. Any difference between the ensemble mean of the adaptive weight vectors and the target value $\mathbf{w}_o(n)$ is due to lag in the adaptive process, while the deviation of the individual adaptive weight vectors about the ensemble mean is due to gradient noise.

Weight-vector error causes an excess MSE. The ensemble-average excess MSE at the $n$th instant is

$$\text{(average excess MSE)}(n) = E\left[(\mathbf{w}(n) - \mathbf{w}_o(n))^T\mathbf{R}(\mathbf{w}(n) - \mathbf{w}_o(n))\right]. \qquad (1.49)$$
Expanding about the ensemble mean,

$$\text{(average excess MSE)}(n) = E\left[(\mathbf{w}(n) - E[\mathbf{w}(n)])^T\mathbf{R}(\mathbf{w}(n) - E[\mathbf{w}(n)])\right] + E\left[(E[\mathbf{w}(n)] - \mathbf{w}_o(n))^T\mathbf{R}(E[\mathbf{w}(n)] - \mathbf{w}_o(n))\right] + 2E\left[(\mathbf{w}(n) - E[\mathbf{w}(n)])^T\mathbf{R}(E[\mathbf{w}(n)] - \mathbf{w}_o(n))\right]. \qquad (1.50)$$
Expanding the last term of Eq. (1.50) and simplifying, since $\mathbf{w}_o(n)$ is constant over the ensemble and $E[\mathbf{w}(n) - E[\mathbf{w}(n)]] = \mathbf{0}$, the cross term vanishes:

$$2E\left[(\mathbf{w}(n) - E[\mathbf{w}(n)])^T\mathbf{R}(E[\mathbf{w}(n)] - \mathbf{w}_o(n))\right] = 0, \qquad (1.51)$$

so that

$$\text{(average excess MSE)}(n) = E\left[(\mathbf{w}(n) - E[\mathbf{w}(n)])^T\mathbf{R}(\mathbf{w}(n) - E[\mathbf{w}(n)])\right] + E\left[(E[\mathbf{w}(n)] - \mathbf{w}_o(n))^T\mathbf{R}(E[\mathbf{w}(n)] - \mathbf{w}_o(n))\right]. \qquad (1.52)$$
The average excess MSE is thus a sum of components due to both gradient noise and lag:

$$\text{(average excess MSE due to lag)}(n) = E\left[(E[\mathbf{w}(n)] - \mathbf{w}_o(n))^T\mathbf{R}(E[\mathbf{w}(n)] - \mathbf{w}_o(n))\right] = E\left[(E[\mathbf{w}'(n)] - \mathbf{w}'_o(n))^T\boldsymbol{\Lambda}(E[\mathbf{w}'(n)] - \mathbf{w}'_o(n))\right], \qquad (1.53)$$

$$\text{(average excess MSE due to gradient noise)}(n) = E\left[(\mathbf{w}(n) - E[\mathbf{w}(n)])^T\mathbf{R}(\mathbf{w}(n) - E[\mathbf{w}(n)])\right] = E\left[(\mathbf{w}'(n) - E[\mathbf{w}'(n)])^T\boldsymbol{\Lambda}(\mathbf{w}'(n) - E[\mathbf{w}'(n)])\right], \qquad (1.54)$$

where primes denote quantities in the rotated coordinates of Eq. (1.32), e.g., $\mathbf{w}'(n) = \mathbf{Q}^T\mathbf{w}(n)$.
Dividing by $\xi_{\min}$, the total misadjustment is the sum of the two components:

$$M_{\text{sum}} = \left(\begin{array}{c}\text{misadjustment due to}\\ \text{gradient noise}\end{array}\right) + \left(\begin{array}{c}\text{misadjustment}\\ \text{due to lag}\end{array}\right) = \mu\,\mathrm{Tr}[\mathbf{R}] + \frac{K\sigma^2}{4\mu\xi_{\min}}. \qquad (1.55)$$
It is interesting to note that $M_{\text{sum}}$ in Eq. (1.55) depends on the choice of the parameter $\mu$ and on the statistical properties of the nonstationary environment, but does not depend on the spread of the eigenvalues of the $\mathbf{R}$ matrix. It is no surprise, therefore, that when the components of misadjustment are evaluated for the LMS/Newton algorithm operating in the very same environment, the expression for $M_{\text{sum}}$ for the LMS/Newton algorithm turns out to be

$$M_{\text{sum}} = \left(\begin{array}{c}\text{misadjustment due to}\\ \text{gradient noise}\end{array}\right) + \left(\begin{array}{c}\text{misadjustment}\\ \text{due to lag}\end{array}\right) = \mu\,\mathrm{Tr}[\mathbf{R}] + \frac{K\sigma^2}{4\mu\xi_{\min}}, \qquad (1.56)$$
which is the same as Eq. (1.55). From this we may conclude that the performance of the LMS algorithm is equivalent to that of the LMS/Newton algorithm when both are operating with the same choice of $\mu$ in the same nonstationary environment, wherein they are tracking a first-order Markov target. Since LMS/Newton is optimal, we may conclude that the conventional, physically realizable LMS algorithm is also optimal when operating in a first-order Markov nonstationary environment. And it is likely optimal, or close to it, when operating in many other types of nonstationary environments, although this has not yet been proven.
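The two components of $M_{\text{sum}}$ in Eq. (1.55) pull in opposite directions: the gradient-noise term grows with $\mu$ while the lag term shrinks with $\mu$. The text does not derive the minimizing step size, but setting the derivative of Eq. (1.55) to zero gives $\mu_{\text{opt}} = \sqrt{K\sigma^2/(4\xi_{\min}\mathrm{Tr}[\mathbf{R}])}$; a quick numerical check, with all parameter values illustrative:

```python
import numpy as np

# Illustrative values for the quantities appearing in Eq. (1.55)
K, tr_R, xi_min, sigma2 = 10, 10.0, 0.01, 1e-6

def m_sum(mu):
    # Eq. (1.55): gradient-noise misadjustment + lag misadjustment
    return mu * tr_R + K * sigma2 / (4.0 * mu * xi_min)

# Minimizer of Eq. (1.55) (a direct consequence, not stated in the text)
mu_opt = np.sqrt(K * sigma2 / (4.0 * xi_min * tr_R))

mus = np.linspace(0.1 * mu_opt, 10.0 * mu_opt, 10_000)
mu_grid = mus[np.argmin(m_sum(mus))]     # grid-search minimizer
```

For a sensible operating point, $\mu_{\text{opt}}$ must also respect the stability bound of Eq. (1.29).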
1.7

There are two properties of the learning curve decay that are more important than how long it takes to die out (in principle, forever): the amount of transient excess MSE and the length of time it has existed. In other words, we need to determine how much excess error energy there has been. Refer to Figure 1.6 and consider the area under the learning curve above the $\xi_{\min}$ line. Starting from the same initial condition, the convergence times of two different learning curves are hereby defined as being identical if their respective areas are equal.
1.7.1

Assuming that we have knowledge of the true MSE gradient, adaptation will take place without gradient noise. The weight vector $\mathbf{w}$ is then only a function of the second-order statistics of the input $\mathbf{u}$ and the desired signal $d$, and does not depend on the actual values that a particular realization of these random processes may take. That is, the weight vector is not a random variable and can be pulled out of the expectations in the MSE expression. We thus obtain that at any iteration $n$, the MSE can be expressed as

$$\xi(n) = E[d^2] - 2\mathbf{p}^T\mathbf{w}(n) + \mathbf{w}^T(n)\mathbf{R}\mathbf{w}(n), \qquad (1.57)$$
Figure 1.6 Idealized learning curve (no gradient noise). The shaded area represents the
excess error energy.
where $E[d^2] \triangleq E[d^2(n)]$ for all $n$, since the desired output $d$ is wide-sense stationary. When $\mathbf{w}(n) = \mathbf{w}_o = \mathbf{R}^{-1}\mathbf{p}$, we can obtain $\xi_{\min}$ as

$$\xi_{\min} = E[d^2] - \mathbf{p}^T\mathbf{R}^{-1}\mathbf{p}. \qquad (1.58)$$

In terms of the weight-error vector $\boldsymbol{\varepsilon}(n) = \mathbf{w}(n) - \mathbf{w}_o$, the MSE of Eq. (1.57) becomes

$$\xi(n) = \xi_{\min} + \boldsymbol{\varepsilon}^T(n)\mathbf{R}\boldsymbol{\varepsilon}(n) \triangleq \xi_{\min} + \beta(n), \qquad (1.59)$$

where $\beta(n)$ is the transient excess MSE.
For the idealized LMS/Newton algorithm adapting with the true gradient, the weight-error update equation is

$$\boldsymbol{\varepsilon}(n+1) = (1 - 2\mu\lambda_{\text{ave}})\boldsymbol{\varepsilon}(n), \qquad (1.60)$$

so that

$$\beta(n) = (1 - 2\mu\lambda_{\text{ave}})^{2n}\boldsymbol{\varepsilon}^T(0)\mathbf{R}\boldsymbol{\varepsilon}(0). \qquad (1.61)$$
Excess error energy is the area under the transient excess MSE curve. Following this definition,

$$\alpha \triangleq \sum_{n=0}^{\infty}\beta(n) \qquad (1.62)$$
$$= \frac{1}{1 - (1 - 2\mu\lambda_{\text{ave}})^2}\,\boldsymbol{\varepsilon}^T(0)\mathbf{R}\boldsymbol{\varepsilon}(0) \qquad (1.63)$$
$$\approx \frac{1}{4\mu\lambda_{\text{ave}}}\,\boldsymbol{\varepsilon}^T(0)\mathbf{R}\boldsymbol{\varepsilon}(0) = \frac{1}{4\mu\lambda_{\text{ave}}}\,\boldsymbol{\varepsilon}'^T(0)\boldsymbol{\Lambda}\boldsymbol{\varepsilon}'(0), \qquad (1.64)$$

where the approximation holds for slow adaptation ($\mu\lambda_{\text{ave}} \ll 1$) and $\boldsymbol{\varepsilon}'(n) = \mathbf{Q}^T\boldsymbol{\varepsilon}(n)$. Assuming that $\boldsymbol{\varepsilon}'(0)$ is a random vector with components each having a variance of $g^2$, the average excess error energy is

$$E[\alpha] = \frac{1}{4\mu\lambda_{\text{ave}}}\sum_{i=1}^{K}g^2\lambda_i = \frac{Kg^2}{4\mu}.$$
1.7.1.2 Exact Steepest Descent

Under the same conditions, analogous calculations can be made for the exact steepest descent algorithm. There is no gradient noise. The weight-error update equation is now

$$\boldsymbol{\varepsilon}(n+1) = \boldsymbol{\varepsilon}(n) - 2\mu\mathbf{R}\boldsymbol{\varepsilon}(n) = (\mathbf{I} - 2\mu\mathbf{R})^{n+1}\boldsymbol{\varepsilon}(0). \qquad (1.65)$$

The transient excess MSE is

$$\beta(n) = \boldsymbol{\varepsilon}^T(0)(\mathbf{I} - 2\mu\mathbf{R})^n\mathbf{R}(\mathbf{I} - 2\mu\mathbf{R})^n\boldsymbol{\varepsilon}(0) = \boldsymbol{\varepsilon}^T(0)\left[\mathbf{Q}(\mathbf{I} - 2\mu\boldsymbol{\Lambda})\mathbf{Q}^T\right]^n\mathbf{R}\left[\mathbf{Q}(\mathbf{I} - 2\mu\boldsymbol{\Lambda})\mathbf{Q}^T\right]^n\boldsymbol{\varepsilon}(0) = \boldsymbol{\varepsilon}'^T(0)(\mathbf{I} - 2\mu\boldsymbol{\Lambda})^{2n}\boldsymbol{\Lambda}\boldsymbol{\varepsilon}'(0). \qquad (1.66)$$
Then, once again exploiting the properties of the trace operator and assuming slow adaptation, the excess error energy is

$$\alpha = \sum_{n=0}^{\infty}\mathrm{Tr}\left[\boldsymbol{\varepsilon}'^T(0)(\mathbf{I} - 2\mu\boldsymbol{\Lambda})^{2n}\boldsymbol{\Lambda}\boldsymbol{\varepsilon}'(0)\right] = \mathrm{Tr}\left[\sum_{n=0}^{\infty}(\mathbf{I} - 2\mu\boldsymbol{\Lambda})^{2n}\boldsymbol{\Lambda}\boldsymbol{\varepsilon}'(0)\boldsymbol{\varepsilon}'^T(0)\right] = \mathrm{Tr}\left\{\left[\mathbf{I} - (\mathbf{I} - 2\mu\boldsymbol{\Lambda})^2\right]^{-1}\boldsymbol{\Lambda}\boldsymbol{\varepsilon}'(0)\boldsymbol{\varepsilon}'^T(0)\right\} \approx \mathrm{Tr}\left[(4\mu\boldsymbol{\Lambda})^{-1}\boldsymbol{\Lambda}\boldsymbol{\varepsilon}'(0)\boldsymbol{\varepsilon}'^T(0)\right] = \frac{1}{4\mu}\boldsymbol{\varepsilon}'^T(0)\boldsymbol{\varepsilon}'(0). \qquad (1.67)$$
Finally, again assuming that $\boldsymbol{\varepsilon}'(0)$ is a random vector with components each having a variance of $g^2$, we obtain the average excess error energy as

$$E[\alpha] = \frac{1}{4\mu}E\left[\boldsymbol{\varepsilon}'^T(0)\boldsymbol{\varepsilon}'(0)\right] = \frac{Kg^2}{4\mu}. \qquad (1.68)$$
Notice that the average excess error energy is once again independent of the eigenvalues of $\mathbf{R}$ and is identical to Eq. (1.64) for Newton's method. The average convergence time for steepest descent is therefore identical to the average convergence time for Newton's method, given that both algorithms adapt with the same value of $\mu$.
1.7.2
In practice, the true MSE gradient is generally unknown, and the LMS algorithm is
used to provide an estimate of the gradient based on the input u and the desired
output d. The weight vector is now stochastic and cannot be pulled out of the
expectations in the MSE expression.
Furthermore, gradient estimation results in gradient noise that prevents the MSE from converging to $\xi_{\min}$, as it does in the exact gradient case. Instead, the MSE, averaged over an ensemble of learning curves, now converges to $\xi_{\text{fin}} = \xi_{\min} + \xi_{\text{excess}}$, where $\xi_{\text{excess}}$ is the excess MSE due to gradient noise. This is illustrated in Figure 1.7, where the excess error energy is now the area below the transient MSE curve and above $\xi_{\text{fin}}$. It is useful to note that the misadjustment, in steady flow, after adaptive transients have died out, is given by

$$M = \frac{\xi_{\text{excess}}}{\xi_{\min}}. \qquad (1.69)$$
In order to derive expressions for the average excess error energy, we will use an approach similar to [52]. Let $e_o(n)$ be the error when the optimal weight vector $\mathbf{w}_o = \mathbf{R}^{-1}\mathbf{p}$ is used. Then the MSE at a particular iteration $n$ can be expressed as

$$\xi(n) = E[e^2(n)] = E\left[(e_o(n) + (e(n) - e_o(n)))^2\right] = E\left[e_o^2(n) + 2e_o(n)(e(n) - e_o(n)) + (e(n) - e_o(n))^2\right]. \qquad (1.70)$$

Figure 1.7 Sample learning curve with gradient noise. The shaded area represents the excess error energy.
We can now examine the three terms in Eq. (1.70) separately. By definition, $E[e_o^2(n)] \triangleq \xi_{\min}$. Also,

$$E[e_o(n)(e(n) - e_o(n))] = E\left[e_o(n)\left(d(n) - \mathbf{w}^T(n)\mathbf{u}(n) - d(n) + \mathbf{w}_o^T\mathbf{u}(n)\right)\right] = -E\left[e_o(n)\mathbf{u}^T(n)\boldsymbol{\varepsilon}(n)\right], \qquad (1.71)$$

and, using the independence of $\boldsymbol{\varepsilon}(n)$ from the current data together with the orthogonality $E[e_o(n)\mathbf{u}(n)] = \mathbf{0}$,

$$E[e_o(n)(e(n) - e_o(n))] = -\mathbf{p}^TE[\boldsymbol{\varepsilon}(n)] + \mathbf{w}_o^TE[\mathbf{u}(n)\mathbf{u}^T(n)]E[\boldsymbol{\varepsilon}(n)] = 0. \qquad (1.72)$$

For the third term,

$$E\left[(e(n) - e_o(n))^2\right] = E\left[(\boldsymbol{\varepsilon}^T(n)\mathbf{u}(n))^2\right] = E\left\{\mathrm{Tr}\left[\boldsymbol{\varepsilon}^T(n)\mathbf{u}(n)\boldsymbol{\varepsilon}^T(n)\mathbf{u}(n)\right]\right\} = \mathrm{Tr}\left\{\mathbf{R}E[\boldsymbol{\varepsilon}(n)\boldsymbol{\varepsilon}^T(n)]\right\} = \mathrm{Tr}[\boldsymbol{\Lambda}\mathbf{F}(n)], \qquad (1.73)$$

where

$$\mathbf{F}(n) \triangleq E\left[\boldsymbol{\varepsilon}'(n)\boldsymbol{\varepsilon}'^T(n)\right]. \qquad (1.74)$$
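The orthogonality $E[e_o(n)\mathbf{u}(n)] = \mathbf{0}$ invoked in Eq. (1.72) is easy to verify numerically: the Wiener error is uncorrelated with every tap input by construction. A sketch with illustrative plant weights, input correlation, and noise level:

```python
import numpy as np

rng = np.random.default_rng(4)
K, N = 3, 200_000
A = np.linalg.cholesky(np.array([[1.0, 0.3, 0.1],
                                 [0.3, 1.0, 0.3],
                                 [0.1, 0.3, 1.0]]))
U = rng.standard_normal((N, K)) @ A.T          # correlated inputs
d = U @ np.array([0.7, -0.4, 0.2]) + 0.3 * rng.standard_normal(N)

R_hat = U.T @ U / N
p_hat = U.T @ d / N
w_o = np.linalg.solve(R_hat, p_hat)            # (sample) Wiener solution
e_o = d - U @ w_o                              # optimum error sequence e_o(n)
corr = U.T @ e_o / N                           # sample estimate of E[e_o(n) u(n)]
```

The cross-correlation vector `corr` vanishes to numerical precision, and the residual power approaches the plant-noise variance $\xi_{\min}$.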
Substituting Eqs. (1.72) and (1.73) back into Eq. (1.70), we obtain

$$\xi(n) = \xi_{\min} + \mathrm{Tr}[\boldsymbol{\Lambda}\mathbf{F}(n)], \qquad (1.75)$$

and consequently,

$$E[\alpha] = \sum_{n=0}^{\infty}\mathrm{Tr}[\boldsymbol{\Lambda}\mathbf{F}(n)]. \qquad (1.76)$$

Thus, we need to examine the evolution of $\mathrm{Tr}[\boldsymbol{\Lambda}\mathbf{F}(n)]$ with $n$ for LMS/Newton and LMS.
In rotated coordinates, the LMS/Newton weight-error update driven by the instantaneous gradient is

$$\boldsymbol{\varepsilon}'(n+1) = \boldsymbol{\varepsilon}'(n) + 2\mu\lambda_{\text{ave}}e(n)\boldsymbol{\Lambda}^{-1}\mathbf{u}'(n), \qquad (1.77)$$

where $\mathbf{u}'(n) = \mathbf{Q}^T\mathbf{u}(n)$.
Forming the outer product of Eq. (1.77) with itself and taking expectations gives an expansion (Eq. (1.78)) whose cross terms, after adding and subtracting terms involving $\mathbf{w}_o$, cancel by virtue of the orthogonality $E[e_o(n)\mathbf{u}(n)] = \mathbf{0}$ (with $\mathbf{p}' = \mathbf{Q}^T\mathbf{p}$) and the small step-size independence assumptions, leaving

$$E\left[\boldsymbol{\varepsilon}'(n+1)\boldsymbol{\varepsilon}'^T(n+1)\right] \approx (1 - 4\mu\lambda_{\text{ave}})E\left[\boldsymbol{\varepsilon}'(n)\boldsymbol{\varepsilon}'^T(n)\right]. \qquad (1.79)$$

That is,

$$\mathbf{F}(n+1) = (1 - 4\mu\lambda_{\text{ave}})\mathbf{F}(n), \qquad (1.80)$$

and, iterating from the initial condition,

$$\mathbf{F}(n) = (1 - 4\mu\lambda_{\text{ave}})^n\mathbf{F}(0), \qquad (1.81)$$

so that
$$E[\alpha] = \sum_{n=0}^{\infty}\mathrm{Tr}[\boldsymbol{\Lambda}\mathbf{F}(n)] = \frac{1}{4\mu\lambda_{\text{ave}}}\mathrm{Tr}[\boldsymbol{\Lambda}\mathbf{F}(0)]. \qquad (1.82)$$
At this stage, define the $\mathrm{diag}(\mathbf{A})$ operator to return a column vector containing the main diagonal of a square matrix $\mathbf{A}$. Using this operator, we note that

$$\mathrm{Tr}[\mathbf{A}] = \mathbf{1}^T\mathrm{diag}(\mathbf{A}). \qquad (1.83)$$

Assuming, as before, that $\boldsymbol{\varepsilon}'(0)$ has components of variance $g^2$, so that $\mathbf{F}(0) = g^2\mathbf{I}$,

$$\mathrm{Tr}[\boldsymbol{\Lambda}\mathbf{F}(0)] = g^2\mathrm{Tr}[\boldsymbol{\Lambda}]. \qquad (1.84)$$

Substituting Eq. (1.84) back into Eq. (1.82), we finally obtain

$$E[\alpha] = \frac{g^2}{4\mu\lambda_{\text{ave}}}\mathrm{Tr}[\boldsymbol{\Lambda}] = \frac{Kg^2}{4\mu}. \qquad (1.85)$$
LMS

For LMS, the rotated weight-error update is

$$\boldsymbol{\varepsilon}'(n+1) = \boldsymbol{\varepsilon}'(n) + 2\mu e(n)\mathbf{u}'(n). \qquad (1.86)$$

Proceeding exactly as before (Eqs. (1.87)-(1.88)), the outer-product expectation now yields

$$\mathbf{F}(n+1) = \mathbf{F}(n) - 2\mu\boldsymbol{\Lambda}\mathbf{F}(n) - 2\mu\mathbf{F}(n)\boldsymbol{\Lambda}. \qquad (1.89)$$

Taking the main diagonal of both sides and iterating,

$$\mathrm{diag}[\boldsymbol{\Lambda}\mathbf{F}(n)] = (\mathbf{I} - 4\mu\boldsymbol{\Lambda})^n\mathrm{diag}[\boldsymbol{\Lambda}\mathbf{F}(0)]. \qquad (1.90)$$
Thus,

$$E[\alpha] = \sum_{n=0}^{\infty}\mathbf{1}^T\mathrm{diag}[\boldsymbol{\Lambda}\mathbf{F}(n)] = \mathbf{1}^T\sum_{n=0}^{\infty}(\mathbf{I} - 4\mu\boldsymbol{\Lambda})^n\mathrm{diag}[\boldsymbol{\Lambda}\mathbf{F}(0)] = \mathbf{1}^T(4\mu\boldsymbol{\Lambda})^{-1}\mathrm{diag}[\boldsymbol{\Lambda}\mathbf{F}(0)]. \qquad (1.91)$$
Once again assuming that $\boldsymbol{\varepsilon}'(0)$ is a random vector with components each having a variance of $g^2$, we finally obtain

$$E[\alpha] = \mathbf{1}^T(4\mu\boldsymbol{\Lambda})^{-1}\mathrm{diag}\left\{\boldsymbol{\Lambda}E\left[\boldsymbol{\varepsilon}'(0)\boldsymbol{\varepsilon}'^T(0)\right]\right\} = \mathbf{1}^T(4\mu\boldsymbol{\Lambda})^{-1}\mathrm{diag}[\boldsymbol{\Lambda}g^2\mathbf{I}] = \frac{Kg^2}{4\mu}. \qquad (1.92)$$

Note that this result is identical to Eq. (1.68) and to the result for LMS/Newton. This means that the average excess error energy is the same for LMS/Newton and LMS with random initial conditions.
1.7.3
It is important to point out that the preceding results do not imply that the excess MSE curves for LMS/Newton and LMS are identical when averaged over starting conditions. In fact, this is completely false, and it is important to illustrate why. Starting with Eq. (1.81) and making the same assumptions about $\boldsymbol{\varepsilon}'(0)$ as before, the excess MSE for LMS/Newton is

$$\beta_{\text{Newton}}(n) = \mathrm{Tr}[\boldsymbol{\Lambda}\mathbf{F}(n)] = (1 - 4\mu\lambda_{\text{ave}})^n\mathrm{Tr}[\boldsymbol{\Lambda}\mathbf{F}(0)] = \mathbf{1}^T(1 - 4\mu\lambda_{\text{ave}})^n\mathrm{diag}[\boldsymbol{\Lambda}]g^2 = \sum_{i=1}^{K}(1 - 4\mu\lambda_{\text{ave}})^ng^2\lambda_i. \qquad (1.93)$$
Similarly, we can use Eq. (1.90) to derive the equation for the excess MSE of LMS:

β_SD(n) = 1ᵀ diag(ΛF(n))
 = 1ᵀ(I − 4μΛ)ⁿ diag(ΛF(0))
 = 1ᵀ(I − 4μΛ)ⁿ diag(Λ)γ²  (1.94)
 = Σ_{i=1}^{K} (1 − 4μλ_i)ⁿ γ²λ_i.
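Equations (1.93) and (1.94) are easy to evaluate numerically. The following hedged sketch, with hypothetical eigenvalues and step size, shows that the two excess-MSE curves differ point by point while their areas (the excess error energies) coincide:

```python
import numpy as np

# Hedged sketch of Eqs. (1.93)-(1.94): the excess-MSE curves of LMS/Newton and
# LMS differ point by point, yet their areas (excess error energies) coincide.
# Eigenvalues, step size, and gamma^2 below are hypothetical.
lam = np.array([0.1, 0.4, 1.0, 2.5])
mu, gamma2 = 0.01, 1.0
lam_ave = lam.mean()
n = np.arange(50000)

# Eq. (1.93): single geometric decay governed by lam_ave
beta_newton = gamma2 * lam.sum() * (1 - 4 * mu * lam_ave) ** n
# Eq. (1.94): a mixture of modes, one per eigenvalue
beta_sd = gamma2 * (lam[:, None] * (1 - 4 * mu * lam[:, None]) ** n).sum(axis=0)

print(beta_newton.sum(), beta_sd.sum())   # both equal K*gamma^2/(4*mu) = 100
```

The equality of the two sums is exactly the statement that the average excess error energies agree, even though the curves themselves do not.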
Discussion
On the surface, it would seem that Section 1.7.3 contradicts the results that immediately precede it. On the one hand, the average excess error energies for LMS/Newton and LMS are the same. On the other hand, the excess MSE curves for the two algorithms are not the same. How can both of these assertions be true? More critically, if β_SD(n) can be smaller than β_Newton(n) for the same n, does it imply that LMS is actually a superior algorithm to LMS/Newton?

The answer lies in ascertaining the exact method of comparison between algorithms. A common mistake in comparing speed of convergence between two algorithms is to plot two sample excess MSE curves and claim that one algorithm is
superior to the other because its initial rate of convergence is faster for a specific starting weight vector. The inherent fallacy in such an approach is that the results may not hold for other starting weight vectors. In fact, the results will often be different, depending upon whether we compare worst-case, best-case, or average convergence. But even when we average over some reasonable set of starting weight vectors, it is not enough to look only at the initial rate of convergence. Even if the initial rate of convergence of LMS is faster than that of LMS/Newton (meaning that β_SD(n) is smaller than β_Newton(n) for small n), the fact that the average excess error energy of the two algorithms is the same implies that the final rate of convergence of LMS is slower than that of LMS/Newton (meaning that β_SD(n) is larger than β_Newton(n) for large n). Therefore, we cannot compare rates of convergence via a direct comparison of excess MSE curves unless we also specify that we are only interested in convergence to within a certain percentage of the final MSE. For example, direct comparison of two average excess MSE curves might reveal that, for a particular eigenvalue spread, LMS converges to within 50 percent of the final MSE faster than LMS/Newton, but the result may be reversed if we compare convergence to within 5 percent of the final MSE. Unfortunately, the exact excess MSE at which we can state that the algorithm has converged is usually problem-dependent. On the other hand, the elegance of the excess error energy metric is that it removes this constraint and thus makes the analysis problem-independent.
1.8 OVERVIEW
Using the same value of μ for both LMS/Newton and LMS ensures that the steady-state performance of both algorithms, after transients die out, will be statistically equivalent in terms of misadjustment. Further, with nonstationary inputs that cause the Wiener solution to be first-order Markov, the steady-state performance of LMS is equivalent (in terms of the misadjustment) to that derived for LMS/Newton, despite the spread in eigenvalues. Further yet, the average transient performance of LMS is equivalent (in terms of the average excess error energy) to that derived for LMS/Newton, despite the spread in eigenvalues. It is intuitively reasonable that since the average transient performances of LMS and LMS/Newton are the same, their average steady-state performances are also the same with certain nonstationary inputs. Transient decay with Newton's method is purely geometric (discrete exponential) with the single time constant τ_MSE = 1/(4μλ_ave). With Newton's method, the rate of convergence is not dependent on initial conditions, as it is with the method of steepest descent. Under worst-case conditions, adapting from a least-favorable set of initial conditions, the time constant of the steepest descent algorithm is τ_MSE = 1/(4μλ_min). With most-favorable initial conditions, this time constant is τ_MSE = 1/(4μλ_max). Therefore, with a large eigenvalue spread, it is possible that the steepest descent method could cause faster convergence in some cases, and slower convergence in others, than Newton's method. On average, starting with random initial conditions, they converge at effectively the same rate in the sense that transient convergence time is proportional to excess error energy.
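The time constants quoted above can be tabulated for an illustrative eigenvalue spread (all numbers below are hypothetical):

```python
# Hedged sketch of the time constants quoted above, for a hypothetical
# eigenvalue spread; tau_MSE = 1/(4*mu*lambda) in each case.
mu = 0.01
lam = [0.1, 0.4, 1.0, 2.5]
lam_ave = sum(lam) / len(lam)

tau_newton = 1 / (4 * mu * lam_ave)   # Newton's method: single time constant
tau_worst = 1 / (4 * mu * min(lam))   # steepest descent, least-favorable start
tau_best = 1 / (4 * mu * max(lam))    # steepest descent, most-favorable start
print(tau_newton, tau_best, tau_worst)
```

For this spread, steepest descent ranges from 2.5 times faster to 10 times slower than Newton's method, depending on the starting point.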
1.9 CONCLUSION
An adaptive algorithm is like an engine whose fuel is input data. Two algorithms adapting the same number of weights and operating with the same misadjustment can be compared in terms of their consumption of data. The more efficient algorithm consumes less data, that is, converges faster. On this basis, the LMS/Newton algorithm has the highest statistical efficiency that can be obtained. The LMS/Newton algorithm therefore can serve as a benchmark for statistical efficiency against which all other algorithms can be compared.
The role played by LMS/Newton in adaptive systems is analogous to that played
by the Carnot engine in thermodynamics. Neither one exists physically. But their
performances limit the performances of all practical systems, adaptive and
thermodynamic, respectively.
The LMS/Newton algorithm uses learning data most efficiently. No other learning algorithm can be more efficient. The LMS algorithm performs equivalently, on average, to LMS/Newton in nonstationary environments and under transient conditions.
Figure 1.8 Illustration of Newton's method versus steepest descent: (a) Newton's method, (b) steepest descent. The Wiener solution is indicated by . The three initial conditions are indicated by W.
Figure 1.9
REFERENCES
1. B. Widrow and M. E. Hoff, "Adaptive switching circuits," IRE WESCON Conv. Rec., vol. 4, pp. 96-104, Aug. 1960.
2. B. Widrow, "Adaptive filters," in Aspects of Network and System Theory, R. Kalman and N. DeClaris, eds., pp. 563-587, Holt, Rinehart and Winston, New York, 1971.
3. B. Widrow, P. Mantey, L. Griffiths, and B. Goode, "Adaptive antenna systems," Proc. IEEE, vol. 55, no. 12, pp. 2143-2159, Dec. 1967.
TRAVELING-WAVE MODEL
OF LONG LMS FILTERS
HANS J. BUTTERWECK
Eindhoven University of Technology
2.1 INTRODUCTION
In this section some characteristic properties of the long LMS filter are surveyed, particularly those that distinguish it from its short and medium-length counterparts.
f(n) = v(n)u(n),  (2.1)
2.2.1
During the adaptive process the difference ν(n) between the weight vector w(n) and the weight vector h of the reference filter is so large that the additive noise can be neglected, f(n) ≈ 0. Then (2.1) passes into the homogeneous update equation

ν(n + 1) = [I − 2μ u(n)uᵗ(n)] ν(n)  (2.2)
for the weight error ν(n). For sufficiently small step-sizes μ the variations of the weight error are much slower than those of the input signal, so that u(n)uᵗ(n) can be replaced with its time average. For an ergodic input signal this equals the ensemble average, so that (2.2) passes into (direct averaging [21])

ν(n + 1) = (I − 2μR)ν(n), μ → 0,  (2.3)
where R = E{u(n)uᵗ(n)} denotes the input correlation matrix. Using its eigenvalues λ_i and eigenvectors q_i such that

R = Σ_{i=1}^{M} λ_i q_i q_iᵗ,  (2.4)

we can rewrite (2.3) in normal coordinates and arrive at the difference equation

q_iᵗ ν(n + 1) = (1 − 2μλ_i) q_iᵗ ν(n), i = 1, …, M, μ → 0,  (2.5)

with the exponentially decaying modal solution

q_iᵗ ν(n) = (1 − 2μλ_i)ⁿ q_iᵗ ν(0).  (2.6)

Written out componentwise, (2.3) takes the form

ν_i(n + 1) = ν_i(n) − 2μ Σ_j U_{i−j} ν_j(n).  (2.7)
Thus, the new weight error at position i equals the previous weight error at i minus a weighted sum of neighboring previous weight errors, with the input correlation U_i = E{u_l(n)u_{l+i}(n)} as the weighting function. In the limiting case M → ∞ we read (2.7) as a partial difference equation with the two independent variables n (time) and i (position). Applying a spatial Fourier transform to (2.7) yields

F_space{ν_i(n + 1)} = [1 − F_space{2μU_i}] F_space{ν_i(n)},  (2.8)

with the solution

F_space{ν_i(n)} = [1 − F_space{2μU_i}]ⁿ F_space{ν_i(0)}.  (2.9)
Thus, the spatial Fourier transform of the weight error dies out exponentially, with a decay factor depending on the spatial frequency under consideration. The time constant n₀ is determined by [1 − F_space{2μU_i}]^{n₀} = e⁻¹, which in the limit μ → 0 passes into

n₀ = [F_space{2μU_i}]⁻¹.  (2.10)
Observing that F_space{U_i} equals the input power spectral density (where the temporal frequency is replaced with the spatial frequency), we arrive at the following conclusions:

- Spatial frequencies with high (low) spectral density are associated with fast (slow) decay.
- If certain frequencies are not excited (nonpersistent excitation), there is no decay at all.
- Small (large) step-sizes imply slow (fast) decay.
- No eigenanalysis of the input correlation matrix is involved.
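These conclusions can be illustrated numerically. The hedged sketch below evaluates the per-spatial-frequency time constant of Eq. (2.10) for a hypothetical exponentially decaying correlation sequence U_i:

```python
import numpy as np

# Hedged sketch of Eq. (2.10): the decay time constant n0 of each spatial
# frequency equals the inverse of F_space{2*mu*U_i}. The correlation sequence
# U_i and the step size are hypothetical.
mu = 0.001
i = np.arange(-20, 21)
U = 0.5 ** np.abs(i)                      # low-pass input: U_i decays with |i|
k = np.linspace(-np.pi, np.pi, 512)       # spatial frequency grid
# F_space{2*mu*U_i}: 2*mu times the input PSD with temporal frequency
# replaced by spatial frequency
R = 2 * mu * (U[None, :] * np.exp(-1j * np.outer(k, i))).sum(axis=1).real
n0 = 1.0 / R                              # Eq. (2.10)

# High spectral density (k near 0 here) -> fast decay; low density -> slow:
print(n0[len(k) // 2], n0[0])
```

For this low-pass input, the weight-error components near spatial frequency zero decay an order of magnitude faster than those near k = ±π, as the first conclusion predicts.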
2.2.2
In the steady state, after completion of the adaptive process, the long adaptive filter exhibits a still more noticeable behavior. This concerns particularly the weight-error correlations, which obey simple rules.

In contrast to the adaptive process, the additive noise cannot be neglected in the steady state. Thus we now have to solve (2.1), which, again under the assumption of a small step-size, can be approximated as

ν(n + 1) = (I − 2μR)ν(n) + 2μf(n), μ → 0.  (2.11)
Its solution can be determined with the aid of the representation (2.4) of the correlation matrix, again requiring the evaluation of the eigenvalues and eigenvectors
of the correlation matrix.
(2.13)

E{ν_i(n)ν_{i+ε}(n)} = μ E{v(n)v(n + ε)}.  (2.14)
Thus the correlation between two weight errors, ε taps apart, equals μ times the correlation between the noise at two instants, a distance ε apart. This result is remarkable in various respects. First, the correlations of the slow weight fluctuations are directly related to those of the fast noise fluctuations. Second, the input signal u_i(n) has no influence on the weight-error correlations. Neither its amplitude nor its spectral distribution enters (2.14). One should not wonder, then, that even the assumption of a stationary, stochastic character for the input signal can be abandoned, so that (2.14) holds true also for a limited class of deterministic inputs.
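The rule is easy to confirm by simulation. The following hedged Monte Carlo sketch checks the ε = 0 case, E{ν_i(n)²} ≈ μE{v²(n)}, for a white input and white noise; the filter length, step size, and run length are all hypothetical:

```python
import numpy as np

# Hedged Monte Carlo sketch of the epsilon = 0 case of the rule above:
# in the steady state, E{nu_i(n)^2} should approach mu * E{v(n)^2}.
# White input and white noise with unit power; parameters hypothetical.
rng = np.random.default_rng(0)
M, mu, n_iter = 32, 0.002, 120000
w = np.zeros(M)                     # weight error nu (reference filter h = 0)
u = np.zeros(M)                     # tapped-delay-line contents
acc = 0.0
count = 0
for n in range(n_iter):
    u = np.roll(u, 1)               # shift the delay line
    u[0] = rng.standard_normal()    # white input u(n)
    v = rng.standard_normal()       # white additive noise v(n)
    e = v - w @ u                   # error signal
    w += 2 * mu * e * u             # LMS update
    if n > n_iter // 2:             # collect statistics after convergence
        acc += (w ** 2).mean()
        count += 1

msq = acc / count
print(msq, mu)                      # mean-square weight error vs. mu * E{v^2}
```

For these settings the time-averaged E{ν_i²} comes out close to μ = 0.002, independently of any detail of the input statistics beyond its power.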
2.2.3 Stability
The last issue for which the long filter provides meaningful statements is stability. For a given general filter, short or long, let the statistics of the input signal be known (of course, the noise signal has no influence on stability). Then for a sufficiently large step-size μ > μ₁ instabilities occur, whereas for a sufficiently small step-size μ < μ₂ the filter remains stable. But there is a rather broad gray zone, μ₂ < μ < μ₁, where no stability statements are available. There the filter can be stable or unstable, and if it has been stable during a long period of observation, there is no guarantee that it will remain stable in the future.

Apparently, μ < μ₁ is a necessary stability condition, while μ < μ₂ is sufficient for stability. An example of the first type is provided by studying the approximate updating rule (2.3) and its modal solution (2.6). Clearly, all the exponential solutions (2.6) decay if [20]

μ < λ_max⁻¹,  (2.15)
so that already the simplified updating mechanism (2.3) (which, in due course, will serve as a starting point for an iterative solution) will be unstable if (2.15) is not satisfied. But that bound is far too optimistic. As can be concluded from our stability condition (2.16) for the long filter (containing the factor 1/M), μ must be substantially smaller than λ_max⁻¹.

For the long filter we derive in Section 2.7, Eq. (2.59), the necessary stability condition

μ < 1 / [M P_u(e^{jΩ})_max],  (2.16)

where M again denotes the filter length and P_u(e^{jΩ}) stands for the input power spectral density. Clearly, for a given P_u(e^{jΩ}) and thus a given P_u(e^{jΩ})_max, the maximum μ decreases with increasing filter length.
The right-hand side of (2.16) plays the role of μ₁. For μ > μ₁ the filter can be shown to become unstable, because then our iteration procedure, to be discussed later, diverges. But there are good reasons to suppose that (2.16) is not only necessary for stability but also sufficient. Numerous experiments support this conjecture. Then the long filter would be the only one without a gray μ-zone in which no statement about stability can be made.
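A short hedged sketch of the bound (2.16), using a hypothetical AR(1) input spectrum:

```python
import numpy as np

# Hedged sketch of the stability bound (2.16) for a hypothetical AR(1) input
# with P_u(e^{jOmega}) = sigma2 / (1 - 2*a*cos(Omega) + a^2).
def mu_max(M, a=0.7, sigma2=1.0, n_grid=4096):
    Omega = np.linspace(-np.pi, np.pi, n_grid)
    P_u = sigma2 / (1 - 2 * a * np.cos(Omega) + a * a)
    return 1.0 / (M * P_u.max())    # Eq. (2.16)

# The admissible step size shrinks as the filter length grows:
print(mu_max(64), mu_max(1024))
```

For this spectrum, P_u peaks at Ω = 0 with value 1/(1 − a)², so the bound scales as (1 − a)²/M, quantifying how strong input coloring and a long delay line each tighten the step-size budget.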
2.3
In this section, basic elements for a theory of the long LMS adaptive filter are developed. Emphasis is put on the weight fluctuations, particularly (1) their natural behavior during the adaptive process, (2) their forced steady-state behavior after the adaptive process, and (3) their possibly unlimited growth due to instability. The output signal and the error signal, including the concept of misadjustment, are viewed here as secondary quantities, closely related to and derivable from the weight fluctuations.
Under study is the question of whether the adaptive filter behaves in a characteristic and possibly simple manner in the limit M → ∞, that is, for an infinite length of the tapped-delay line. Such questions play an important role in numerous other structures exhibiting translational symmetry, such as cascades of equal two-ports, transmission lines, and antenna arrays. From a practical point of view, one need not necessarily study infinitely long structures. One can also formulate statements about long but finite arrangements; for these, local modifications have to be developed in the vicinity of the line endings.

The question formulated above has an affirmative answer: All infinitely long symmetrical structures are distinguished by characteristic, simple, and occasionally surprising behavior, and this is particularly true for the LMS adaptive filter. Common to such infinite structures is the occurrence of traveling waves, absorbed in sinks at infinity. On the long but finite-length line, the necessary modifications at the terminations then are referred to as reflections.
Our wave approach is characterized by a number of peculiarities. First, the vectors ν(n), u(n), f(n) are written in component form ν_i(n), u_i(n), f_i(n), where the space coordinate i denotes the tap number on the delay line. The common notation 1 ≤ i ≤ M for the finite-length filter is now replaced by −∞ < i < ∞ for the infinite line, and the updating rule

ν_i(n + 1) = ν_i(n) − 2μ u_i(n) Σ_j u_j(n)ν_j(n) + 2μ f_i(n)
           = ν_i(n) − 2μ Σ_j u_i(n)u_{i−j}(n)ν_{i−j}(n) + 2μ f_i(n),  (2.17)

into which the vector difference equation (2.1) passes, is now read as a partial difference equation with the two independent variables n (time) and i (position). The tapped-delay mechanism finds expression in the basic input relation

u_i(n) = u(n − i + 1),  (2.18)
where u(n) denotes the input signal of the delay line. Unfortunately, in our notation −∞ < i < ∞ nonpositive i values (−∞ < i ≤ 0) imply unrealizable negative delays. Since, however, our wave theory deals only with delay differences (occurring in correlation expressions), a huge imaginary dummy delay can be added in (2.18) without affecting any further results but now satisfying physical causality requirements.
A further peculiarity of our wave approach is that special weight distributions propagate as wave modes in either direction to imaginary sinks at i = −∞ and i = +∞. For the limiting case of a vanishing step-size μ → 0 these have the form of complex exponentials; using spatial Fourier transformations, more general weight distributions can be decomposed into such wave modes.

What distinguishes wave theory from the classical approach is the shift invariance or stationarity in time and space. Stationarity in time, already an ingredient of classical adaptive theory, states equivalence of all time instants in the sense that any probability and any correlation depend only on distances in time. What is new is spatial stationarity, stating that, moreover, any probability and any correlation depend only on distances in space.
Requiring temporal and spatial stationarity for the external signals u_i(n), f_i(n), we have to assume that

U^ε(δ) = E{u_i(n)u_{i+ε}(n + δ)},  F^ε(δ) = E{f_i(n)f_{i+ε}(n + δ)},  (2.19)

that is, that the correlations are independent of time n and position i. For the tapped-delay line satisfying (2.18), spatial shift invariance follows from the temporal shift invariance of the input signal (Eqs. (2.20), (2.21)); in particular, the weight-error correlation

A^ε(δ) = E{ν_i(n)ν_{i+ε}(n + δ)}  (2.22)

is independent of i and n.
Thus the weight-error correlations depend only on the time shift δ (which in due course will be set to zero) and the space shift ε. For a finite-length line the latter is not true in the vicinity of the terminations. The well-known weight-error correlation matrix K = E{ν(n)νᵗ(n)} then assumes an almost Toeplitz form with local aberrations in the vicinity of the matrix borders.
Finally, we use an iterational technique to solve the updating equation (2.17). This technique has been developed for the classical vectorial treatment of adaptive filtering [7] but is also applicable to our scalar wave approach. It reads as

ν_i(n) = α_i(n) + β_i(n) + γ_i(n) + ⋯,  (2.23)
where α_i(n) represents the zeroth-order solution of (2.17) for the limiting case μ → 0, and β_i(n), γ_i(n), … are higher-order corrections for μ > 0. At first glance, (2.23) suggests a Taylor expansion of the weight-error distribution in terms of μ. However, the situation turns out to be slightly more complicated: Ultimately we find α_i(n) = Σ_{l=0}^{∞} a_l μ^l = O(1), β_i(n) = Σ_{l=1}^{∞} b_l μ^l = O(μ), and so on, so that α_i(n) has a Taylor expansion beginning with μ⁰, that of β_i(n) begins with μ¹, and so on.
For μ → 0 the time variations of ν_i(n) are slow compared with those of the factor u_i(n)u_{i−j}(n) in (2.17), so that the latter can be replaced with its (time or ensemble) average (direct averaging [21]):

α_i(n + 1) = α_i(n) − 2μ Σ_j U_{i−j} α_j(n) + 2μ f_i(n),  (2.24)

that is, written as a spatial convolution,

α_i(n + 1) = α_i(n) − 2μ U_i ∗ α_i(n) + 2μ f_i(n).  (2.25)
γ_i(n + 1) = γ_i(n) − 2μ U_i ∗ γ_i(n) − 2μ Σ_j P_{i,i−j}(n) β_{i−j}(n),  (2.27)

and so on, the sum (2.23) satisfies the p.d.e. (2.17), provided that the iteration converges. Here

P_{i,i−j}(n) = u_i(n) u_{i−j}(n) − U_j.  (2.28)
2.4
Based on an iteration procedure, we learned in the previous section that the update equation (2.17) of the LMS algorithm is equivalent to the set of equations (2.25), and so on. The zeroth-order solution α_i(n) is determined by f_i(n) (cf. (2.25)), whereupon the first-order correction β_i(n) follows from α_i(n) (cf. (2.26)), the second-order correction γ_i(n) follows from β_i(n) (cf. (2.27)), and so on. Thus we proceed according to the scheme f_i(n) → α_i(n) → β_i(n) → γ_i(n) → ⋯, where for sufficiently small μ the terms in the chain decrease to any wanted degree.

This procedure is attractive in that it transforms the difference equation (2.17) with stochastically time-varying parameters into a set of constant-coefficient linear difference equations (2.25), and so on, now with the stochastic excitations f_i(n), Σ_j P_{i,i−j}(n)α_{i−j}(n), Σ_j P_{i,i−j}(n)β_{i−j}(n), and so on. Thus the original problem is reduced to a study of the passage of stationary stochastic signals through a linear time-space-invariant system. Observe that the same operator L{·} applies in all steps of the above scheme:
α_i(n) = L{2μ f_i(n)},
β_i(n) = L{−2μ Σ_j P_{i,i−j}(n) α_{i−j}(n)},
γ_i(n) = L{−2μ Σ_j P_{i,i−j}(n) β_{i−j}(n)},

and so on. Viewed in the time domain, this operator has the character of a low-pass filter with a vanishing cutoff frequency for μ → 0 (cf. (2.38)).
In this section we study the partial difference equation (2.25) for the zeroth-order solution α_i(n), in which the stochastic character of u_i(n) has been removed through ensemble averaging of u_i(n)u_{i−j}(n). The result is a constant-coefficient linear partial difference equation, whose solution can be written as the double (time-space) convolution

α_i(n) = L{2μ f_i(n)} = Σ_{j=−∞}^{∞} Σ_{l=−∞}^{∞} h_{i−j}(n − l) 2μ f_j(l),  (2.29)

or, in shorthand,

α_i(n) = h_i(n) ∗∗ 2μ f_i(n),  (2.30)

where the impulse response h_i(n) satisfies

h_i(n + 1) = h_i(n) − 2μ U_i ∗ h_i(n) + δ_i δ(n),  (2.31)

h_i(n) = 0 for n < 0,  (2.32)

h_{−i}(n) = h_i(n).  (2.33)
The first condition reflects causality; the second follows from symmetry with respect to the origin i = 0 (left and right are equivalent). With (2.32) and (2.33) we can solve (2.31) stepwise: h_i(0) = 0; h_i(1) = δ_i; h_i(2) = δ_i − 2μU_i; h_i(3) = (δ_i − 2μU_i) ∗ (δ_i − 2μU_i); and, generally,

h_i(n) = (δ_i − 2μU_i) ∗ ⋯ (n − 1 terms) ⋯ ∗ (δ_i − 2μU_i).  (2.34)

Thus, with increasing time n, the impulse response gradually spreads over the whole line and is ultimately absorbed at i = ±∞. In this sense we can talk of a wave propagating to infinity.
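The spreading of the impulse response can be visualized by iterating Eq. (2.34) directly; the correlation sequence U_i and step size below are hypothetical:

```python
import numpy as np

# Hedged sketch of Eq. (2.34): h_i(n) is the (n-1)-fold spatial convolution of
# (delta_i - 2*mu*U_i) with itself, so the response spreads over ever more taps.
# The correlation sequence U_i and the step size are hypothetical.
mu = 0.05
U = np.array([0.25, 0.5, 1.0, 0.5, 0.25])   # U_i for i = -2..2 (symmetric)
kernel = -2 * mu * U
kernel[2] += 1.0                            # delta_i - 2*mu*U_i

h = np.array([1.0])                         # h_i(1) = delta_i
widths = []
for n in range(2, 40):
    h = np.convolve(h, kernel)              # h_i(n) from h_i(n-1), Eq. (2.34)
    i = np.arange(len(h)) - (len(h) - 1) / 2
    widths.append(np.sqrt((i**2 * h**2).sum() / (h**2).sum()))

# The rms width of the impulse response grows with n:
print(widths[0], widths[-1])
```

The rms width used here anticipates the "radius of inertia" of Section 2.8; it grows steadily with n, which is the discrete picture of a wave propagating outward.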
While the impulse response represents the operator L{·} in the time/space domain, the system function, as the double Fourier transform of the impulse response, provides a useful frequency-domain equivalent:

H(z, ξ) = Σ_i Σ_n h_i(n) z^{−n} ξ^{−i}.  (2.35)

With the abbreviation

R(ξ) = Σ_i 2μ U_i ξ^{−i} = F_space{2μU_i},  (2.36)

which, by the symmetry of U_i, satisfies

R(ξ) = R(ξ⁻¹),  (2.37)

the transform of the update equation (2.31) yields

H(z, ξ) = 1 / [z − 1 + R(ξ)].  (2.38)
2.5 WEIGHT-ERROR CORRELATIONS
The steady-state weight-error correlation A^ε(δ) = E{α_i(n)α_{i+ε}(n + δ)} now follows from (2.30) as (2.39)

A^ε(δ) = h̃^ε(δ) ∗∗ 4μ² F^ε(δ), with h̃^ε(n) = h_i(n) ∗ h_{i+ε}(n) = F⁻¹_time F⁻¹_space{|H(e^{jΩ}, e^{jk})|²}.  (2.40)

The temporal part of this transform follows from (2.38):

(1/2π) ∫_{−π}^{π} |H(e^{jΩ}, e^{jk})|² dΩ
 = (1/2π) ∫_{−π}^{π} dΩ / [(cos Ω − 1 + R(ξ))² + sin²Ω]
 = (1/2π) ∫_{−π}^{π} dΩ / [2(1 − cos Ω)(1 − R(ξ)) + R²(ξ)]
 = 1 / [R(ξ)(2 − R(ξ))]
 ≈ 1 / (2R(ξ)), μ → 0,  (2.41)

with ξ = e^{jk}.
Combining these results, the steady-state weight-error correlation for zero time shift becomes

A^ε(0) = μ V^ε, V^ε = E{v(n)v(n + ε)}.  (2.42)
This main result, valid for the combination μ → 0, M → ∞, directly relates the spatial weight-error correlation to the temporal noise correlation (although the two signals fluctuate on completely different time scales). Surprisingly enough, the input signal has no influence on the weight correlations; neither its amplitude nor its spectral distribution enters (2.42).

With ε = 0 the mean squared weight error equals the step-size times the noise power: E{α_i(n)²} = μ E{v²(n)}. Further, for white noise the weight fluctuations are uncorrelated. Notice that both results also are valid for a finite-length delay line [4] under white noise; for that case they are also found with the aid of the independence assumption [20, 7]. Why this illegitimate assumption succeeded in the special situation under consideration has been elucidated in [5].
With the aid of (2.42) we can determine the misadjustment, defined [6] as the ratio E{(νᵗ(n)u(n))²}/E{v²(n)} of the powers of the output signal due to the weight fluctuations and of the additive output noise. In our notation and for μ → 0 the first signal reads as Σ_i α_i(n)u(n − i + 1) so that, using (2.42), the numerator in the misadjustment becomes

E{(νᵗ(n)u(n))²} = E{Σ_i Σ_j α_i(n) u(n − i + 1) u(n − j + 1) α_j(n)}
 ≈ Σ_i Σ_j E{α_i(n)α_j(n)} U_{j−i}
 = μ Σ_i Σ_j V^{j−i} U^{j−i}
 = μM Σ_ε V^ε U^ε.

The approximation is justified due to the extremely different time scales on which u(n) and α_i(n) fluctuate. So we arrive at

misadjustment = (μM / E{v²(n)}) Σ_ε E{v(n)v(n + ε)} E{u(n)u(n + ε)},  (2.43)
valid for small step-sizes μ. Due to Parseval's theorem, the sum can be rewritten as the average over the product of the spectra of the input and the noise signal [6]. In [4] it has been shown that (2.43) holds true for lines of any length, but the pertinent proof is far more complicated than that for the long line. Observe that the misadjustment vanishes if the input and the noise spectrum do not overlap.
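Equation (2.43) can be evaluated for illustrative correlation sequences; the hedged sketch below also shows the reduction in misadjustment when the input and noise spectra overlap only weakly (all sequences hypothetical):

```python
import numpy as np

# Hedged sketch of Eq. (2.43): misadjustment = (mu*M/E{v^2}) * sum_e V^e U^e,
# with hypothetical correlation sequences over lags e = -L..L.
def misadjustment(mu, M, U, V):
    return mu * M * np.sum(U * V) / V[len(V) // 2]   # center entry V[L] = E{v^2}

L = 10
e = np.arange(-L, L + 1)
U = 0.8 ** np.abs(e)                 # low-pass input correlation
V_white = (e == 0).astype(float)     # white noise
V_hp = (-0.8) ** np.abs(e)           # high-pass noise: weak spectral overlap with input

mu, M = 1e-3, 100
m_w = misadjustment(mu, M, U, V_white)
m_hp = misadjustment(mu, M, U, V_hp)
print(m_w, m_hp)                     # less spectral overlap -> smaller misadjustment
```

For white noise the sum collapses to U⁰ = E{u²}, recovering the classical μM·E{u²} result, while the high-pass noise yields a markedly smaller value, consistent with the remark that non-overlapping spectra drive the misadjustment toward zero.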
Above we determined the weight-error correlation A^ε(0) for a zero time shift. Often the generalized weight-error correlation A^ε(δ) will be desirable; due to the small step-size, it will slowly decrease as a function of the time shift δ. The expression for A^ε(δ) is rather complicated (see below), but we can derive a simple expression for its sum Σ_δ A^ε(δ) over all time shifts, which can be interpreted as the low-frequency spectral density of the weight-error fluctuations. First, we determine its spatial transform:

F_space{Σ_δ A^ε(δ)} = F_time F_space{A^ε(δ)}|_{z=1} = |H(1, ξ)|² F_space{4μ² V^ε U^ε} = R⁻²(ξ) 2μ R(ξ) F_space{V^ε};

thus (1/2μ) R(ξ) Σ_δ F_space{A^ε(δ)} = F_space{V^ε}, which, after application of an inverse spatial Fourier transform, yields the interesting result

Σ_δ U^ε ∗ A^ε(δ) = V^ε.  (2.44)

Notice that in this relation the step-size μ does not occur. In this respect it is the counterpart of (2.42), where U^ε does not occur. Combination of (2.42) and (2.44) eliminates V^ε, yielding

μ Σ_δ U^ε ∗ A^ε(δ) = A^ε(0).  (2.45)
(2.46)

G^ε(δ) = (δ^ε − 2μU^ε) ∗ ⋯ (|δ| terms) ⋯ ∗ (δ^ε − 2μU^ε).  (2.47)
2.6
In this section we concentrate on the adaptive process, that is, the transient phase, in which the additive noise plays a negligible role, f_i(n) ≈ 0. The adaptive process ultimately passes into the steady state, in which the weight fluctuations assume a stationary stochastic character and where the noise becomes essential, f_i(n) ≠ 0. In Section 2.2 we reviewed the two phenomena in exactly this order, but here we choose the inverse treatment, guided by didactic considerations: While the weight-error correlations can be sufficiently modeled as a zeroth-order effect (the higher-order corrections do not create basically new aspects), the simple zeroth-order theory of the adaptive process merely predicts a deterministic exponential decay of the weight errors, as represented by α_i(n). In a following step, the superimposed stochastic fluctuations are described by the first-order corrections β_i(n). Thus the present section represents a first exercise in the iterative solution of the filter's update equation, as proposed in Section 2.3. In Section 2.7, treating stability, we will profit by the full iterative solution using all higher-order corrections.

In the adaptive process with f_i(n) = 0, the zeroth-order solution α_i(n) satisfies the homogeneous partial difference equation (cf. (2.7))

α_i(n + 1) = α_i(n) − 2μ U_i ∗ α_i(n),  (2.48)

whose spatial Fourier transform obeys

F_space{α_i(n + 1)} = [1 − R(ξ)] F_space{α_i(n)},  (2.49)

with the solution

F_space{α_i(n)} = [1 − R(ξ)]ⁿ F_space{α_i(0)}.  (2.50)
Thus the spatial transform of the weight-error distribution decays exponentially with a decay factor dependent on the spatial frequency ξ = e^{jk}. Compare (2.50) with the classical theory (cf. Section 2.2), where the eigenvalues of the input correlation matrix E{u(n)uᵗ(n)} determine the decay factors, while its eigenvectors determine the pertinent spatial modes of the adaptive process. For the infinitely long LMS filter, as discussed above, we have a continuum of spatially sinusoidal modes, which can also be found from the asymptotic behavior of large Toeplitz matrices [18].
1. For a white input signal, R(ξ) reduces to the constant R(1), and (2.50) yields

α_i(n) = [1 − R(1)]ⁿ α_i(0),  (2.51)

with R(1) = 2μ E{u²(n)}, so that the spatial structure of the weight errors is preserved during the adaptive process.

2. The same result (2.51) is found for a colored input and a smooth initial distribution α_i(0), containing only small spatial frequencies, so that R(ξ) ≈ R(1) = 2μ Σ_δ E{u(n)u(n + δ)}.
For an exact treatment of the adaptive process we have to solve the complete set of equations (2.48) and (2.26), (2.27), and so on. As we have shown, the solution α_i(n) of (2.48) (zeroth-order solution) has a deterministic character, which for a white input signal is given by the exponential decrease (2.51). Again for a white input, we now consider the higher-order corrections, particularly the first-order term β_i(n). Since the excitation term of (2.26) is a mixture of deterministic and stochastic signals, the same is true for the solution β_i(n), so that stochastic fluctuations are superimposed on the exponential α_i(n), whose amplitudes we now determine. With (2.51) and the whiteness assumption U_i = U(0)δ_i, R(ξ) = R(1) = 2μU(0), the partial difference equation (2.26) reads as

β_i(n + 1) = [1 − R(1)] β_i(n) − 2μ Σ_l P_{i,l}(n) α_l(n)
           = [1 − R(1)] β_i(n) − 2μ [1 − R(1)]ⁿ ε(n) g_i(n),  (2.52)

where ε(n) = 0 for n < 0, ε(n) = 1 for n ≥ 0, and g_i(n) = Σ_l P_{i,l}(n) α_l(0). The right-hand term of (2.52) is a product of two factors: 2μ[1 − R(1)]ⁿ ε(n) is a deterministic signal starting at n = 0, while g_i(n) is a stationary, zero-mean stochastic signal. The solution of (2.52) has the form
β_i(n) = −2μ Σ_j h(j) [1 − R(1)]^{n−j} ε(n − j) g_i(n − j),

with h(j) the temporal impulse response of the recursion (2.52), so that

E{β_i(n)²} = 4μ² Σ_j Σ_δ h(j) h(j + δ) [1 − R(1)]^{n−j} [1 − R(1)]^{n−j−δ} ε(n − j) ε(n − j − δ) G_i(δ),

where

G_i(δ) = E{g_i(n) g_i(n + δ)} = Σ_l Σ_k α_l(0) α_k(0) E{P_{i,l}(n) P_{i,k}(n + δ)} = Σ_k Σ_p α_{k+p}(0) α_k(0) T(i, k, p, δ).

For a white Gaussian input the fourth-order moments T(i, k, p, δ) can be evaluated explicitly, and the double sum reduces to

E{β_i(n)²} ≈ R²(1) [1 − R(1)]^{2n−2} [ n Σ_k α_k(0)² + Σ_{δ=−n}^{n} (n − |δ|) α_{i+δ}(0) α_{i−δ}(0) ].  (2.54)
The maximum over n of this expression is attained near half the time constant of the adaptive process and equals

max_n E{β_i(n)²} ≈ (R(1)/2e) [ Σ_k α_k(0)² + Σ_δ α_{i+δ}(0) α_{i−δ}(0) ],  (2.55)
valid for sufficiently small step-sizes. For the special case of a uniform initial weight-error distribution, α_i(0) = α(0), and an observation at the line center, it assumes the value (2e)⁻¹ [U(0)/U(0)_max] α²(0), where U(0)_max = 1/(μM) denotes the stability bound (cf. Section 2.7).

Summarizing, we conclude that the amplitude of the fluctuations superimposed on the exponential weight-error decay equals zero at the beginning and the end of the adaptive process and reaches its maximum at half the time constant. That maximum amplitude depends on the step-size μ: It vanishes for μ → 0 but assumes considerable values near the stability bound. Although we have explicitly studied only white Gaussian input signals, similar statements also apply in more general situations.
2.7 STABILITY
The iteration for the weight errors provides a useful tool to address the stability issue. If the iteration diverges, the system is certainly unstable. Conversely, we only conjecture stability if the iteration converges. In this case stability is not guaranteed, because we refer to the class of stationary, stochastic processes and thus exclude instabilities involving other signal classes. However, there is strong evidence, theoretical and experimental, that our stability condition (2.59) is necessary and sufficient.
In Section 2.3 we iteratively determined the weight errors in an adaptive filter excited by a stationary stochastic signal f_i(n). In particular, we derived the steady-state weight-error correlation A^ε(0) = E{α_i(n)α_{i+ε}(n)} in the limit μ → 0 (cf. (2.42)). Here we return to that steady-state problem, but now we reckon with the higher-order corrections. We derive an upper bound for the step-size beyond which the adaptive filter becomes unstable. This maximum step-size turns out to be rather small for long filters, so that throughout low-μ approximations are justified.

First, we determine the autocorrelation of the first-order correction β_i(n) in terms of the autocorrelation of the zeroth-order solution α_i(n). Replacing β_i(n) with its convolution representation (cf. (2.30)), we obtain
B^ε(δ) = E{β_i(n)β_{i+ε}(n + δ)} = h̃^ε(δ) ∗∗ 4μ² G^ε(δ),  (2.56)

B^ε(0) = h̃^ε(0) ∗ 4μ² Σ_δ G^ε(δ).  (2.57)
Here G^ε(δ) involves a double sum over l whose terms are proportional to A^{ε+m}(0). Now the right-hand sum over l is unbounded for an infinitely long filter (M → ∞) if the expression between brackets does not vanish for l → ±∞. If it approaches a nonzero constant there, the sum over l approximately becomes M times this constant. Using the tapped-delay line constraint (2.18), the bracketed expression is found for l → ±∞ to approach products of the form U^γ U^{m−γ}, so that

Σ_δ G^ε(δ) ≈ M Σ_m A^{ε+m}(0) Σ_γ U^γ U^{m+γ} = M A^ε(0) ∗ U^ε ∗ U^{−ε},

and hence

B^ε(0) = h̃^ε(0) ∗ 4μ² M A^ε(0) ∗ U^ε ∗ U^{−ε}.  (2.58)
For the iteration to converge, we must thus require

μ < 1 / [M P_u(e^{jΩ})] for all Ω, i.e., μ < 1 / (M P_u,max),  (2.59)
where P_u(e^{jΩ}) = F{U^ε} denotes the input power spectrum. Then we have for all spatial frequencies

F_space{B^ε(0)} < F_space{A^ε(0)},  (2.60)
so that the iteration converges. A weaker requirement is that the pertinent difference equation (2.48) yield a decaying solution for all spatial frequencies. In the z-domain this reads such that for any ξ = e^{jk} the poles of the system function H(z, ξ) must remain within the unit circle |z| = 1. Following (2.38), this amounts to R(ξ) = 2μ F_space{U_i} < 2 or, transformed into the temporal frequency domain,

μ < 1 / P_u(e^{jΩ}) for all Ω.  (2.61)

Fulfillment of this condition guarantees stability of the zeroth-order solution. Obviously, our condition (2.59) guaranteeing convergence of the iterational procedure is much stronger and implies (2.61).
Another stability condition (2.114) has been established by Clarkson and White [9], which is based upon a transfer-function approach to LMS adaptive filtering. In Appendix A it is shown that condition (2.114) can be derived from but is weaker than (2.59). But it is stronger than (2.61), which does not contain the crucial factor M⁻¹.
2.8
The simple wave theory applies where spatial stationarity is guaranteed. This is the case on a hypothetical tapped-delay line of infinite length, but on an actual, albeit long, line stationarity is violated in the vicinity of the terminations. The boundary conditions (vanishing weight errors beyond the terminations) require local perturbations of the wave modes called reflections. Here we investigate the size of the regions in which they occur and their influence upon the filter's steady-state and transient behavior. Only in exceptional situations (short tapped-delay line, strong coloring of the input signal) do the wave reflections appear to deserve explicit consideration; in most cases, the simple wave theory applies.

We assume the tapped-delay line to be so long that the reflected waves set up at the two terminations do not interact (no multiple reflections); so we can concentrate on one of the terminations and apply the final results mutatis mutandis to the other termination. We arbitrarily choose the beginning (feeding point) of the line, where the line taps are conveniently renumbered as i = 0, …, M − 1, so that i = 0 denotes the beginning of the line. Further, on the long line the reflected waves do not see the line end, so that the sequence i = 0, 1, 2, 3, … can be viewed as unterminated.
We now imagine a continuation of the line toward i < 0 while assuming the validity of the original zeroth-order update equation (2.25) for all i. Then the response to a delta excitation at i ≥ 0 penetrates into the virtual region i < 0, whereas the response to an imaginary excitation at i < 0 (to be required below) penetrates into the region i ≥ 0. Ultimately the total response α_i(n) has to vanish for i < 0. For a given excitation in the region i ≥ 0 this will be accomplished by applying imaginary point excitations at i < 0. Just as the boundary condition for the electric field of a
2:62
2:63
To satisfy (2.62), for any given z the poles of the system function $H(z,\xi)$ outside the unit circle $|\xi| = 1$ have to be counterbalanced by zeros of $F(z,\xi)$. Let the input signal of the adaptive filter have a finite correlation length L, such that U(i) vanishes for |i| > L; then, with $R(\xi) = R(\xi^{-1})$ in (2.37) assuming the form of $\xi^{-L}$ times a 2L-th-order polynomial in $\xi$, the denominator of $H(z,\xi)$ in (2.38) can be cast in the form

$$z - 1 + R(\xi) = G(z,\xi)\,G(z,\xi^{-1}), \tag{2.64}$$

where $G(z,\xi)$ collects the L zero factors lying inside the unit circle,

$$G(z,\xi) \propto \prod_{l=1}^{L}(\xi - q_l), \qquad |q_l| < 1. \tag{2.65}$$

The second causality condition (2.63) is concerned with the behavior of $A(z,\xi)$ for $\xi \to \infty$, where $F(z,\xi) = O(1)$, $R(\xi) = O(\xi^{L})$, and $H(z,\xi) = O(\xi^{-L})$. To obtain $A(z,\xi) = O(1)$, it is required that $F(z,\xi) = O(\xi^{L})$.
This is accomplished by the image source term, which, with excitations at the virtual taps i = -1, ..., -L, takes the form

$$F(z,\xi) \;\hat=\; \sum_{j=1}^{L} B_{-j}(z)\,\xi^{\,j}, \qquad f_i(n) \;\hat=\; \sum_{j=1}^{L} b_{-j}(n)\,\delta_{i,-j}. \tag{2.67}$$
To estimate the size of the reflection region, the spatial width or radius of inertia $\Delta i$ of $h_i(n)$ can be used, whose square we (rather arbitrarily) define as

$$(\Delta i)^2 = \frac{\displaystyle\sum_i \sum_n i^2\,[h_i(n)]^2}{\displaystyle\sum_i \sum_n [h_i(n)]^2}. \tag{2.68}$$
Application of Parseval's theorem to (2.68), with $H(e^{j\Omega}, e^{jk})$ denoting the double transform of $h_i(n)$, yields

$$(\Delta i)^2 = \frac{\displaystyle\frac{1}{4\pi^2}\int_{-\pi}^{\pi}\!\!\int_{-\pi}^{\pi}\big|H'(e^{j\Omega},e^{jk})\big|^2\,d\Omega\,dk}{\displaystyle\frac{1}{4\pi^2}\int_{-\pi}^{\pi}\!\!\int_{-\pi}^{\pi}\big|H(e^{j\Omega},e^{jk})\big|^2\,d\Omega\,dk}
\approx \frac{\displaystyle\frac{1}{2\pi}\int_{-\pi}^{\pi}\frac{[R'(e^{jk})]^2}{4R^3(e^{jk})}\,dk}{\displaystyle\frac{1}{2\pi}\int_{-\pi}^{\pi}\frac{dk}{R(e^{jk})}}, \tag{2.69}$$

where $H'(e^{j\Omega}, e^{jk}) = dH(e^{j\Omega}, e^{jk})/dk$ and $R'(e^{jk}) = dR(e^{jk})/dk$. From (2.69) it can easily be concluded that the width $\Delta i$ of the impulse response significantly exceeds unity only if the input power spectrum $R(e^{j\Omega})$ varies strongly as a function of $\Omega$ (which can occur only for a large input correlation length). In most practical situations this width, and hence the size of the reflection region, is confined to only a few taps.
Summarizing, it can safely be stated that for LMS adaptive filters of moderate or great length (such as those used for acoustic echo cancellation) the simple wave theory applies with sufficient accuracy.
2.9
In previous sections we developed a wave theory for long LMS adaptive filters containing tapped-delay lines. Here we generalize the theory for a structure with cascaded identical all-pass sections, as considered, for example, in [1] in the context of Laguerre filters.
2.9.1 Steady State
First, we consider the weight-error correlations in the steady state, that is, after completion of the adaptive process. To begin with, we modify (2.18) for a cascade of identical all-pass sections:

$$u_i(n) = g_i(n) * u(n),\qquad g_i(n) = \underbrace{g(n) * \cdots * g(n)}_{i\ \text{terms}}, \tag{2.70}$$

where g(n) denotes the impulse response of the elementary all-pass section. Then the input correlation (2.19) becomes

$$U_\varepsilon(\delta) = E\{u_i(n)\,u_{i+\varepsilon}(n+\delta)\} = E\{[g_i(n)*u(n)]\,[g_{i+\varepsilon}(n)*u(n+\delta)]\}. \tag{2.71}$$
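As an illustration of the structure (2.70), the tap signals of such a filter can be generated by repeatedly passing the input through one first-order all-pass section. This is only a sketch under assumed conditions: the section coefficient and the signal parameters below are invented, not taken from the text.

```python
import numpy as np

def allpass(x, a=0.5):
    # First-order all-pass section H(z) = (-a + z^{-1}) / (1 - a z^{-1}):
    # y[n] = -a*x[n] + x[n-1] + a*y[n-1]
    y = np.zeros_like(x)
    for n in range(len(x)):
        y[n] = -a * x[n]
        if n > 0:
            y[n] += x[n - 1] + a * y[n - 1]
    return y

rng = np.random.default_rng(0)
u = rng.standard_normal(20000)

# Tap signals per (2.70): u_0(n) = u(n), u_i(n) = g(n) * u_{i-1}(n)
taps = [u]
for _ in range(4):
    taps.append(allpass(taps[-1]))

# The all-pass property |G(e^{j Omega})| = 1 preserves the signal power at every tap
print([round(float(np.var(t)), 2) for t in taps])
```

Replacing the unit delays of a TDL by such sections changes only the phase function $b(\Omega)$ of (2.72)-(2.73), which is the sole quantity through which the structure enters the wave theory below.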
The elementary section is characterized by its all-pass property and by the associated phase function $b(\Omega)$:

$$|G(e^{j\Omega})| = 1 \quad\text{for all }\Omega, \tag{2.72}$$

$$G(e^{j\Omega}) = e^{-jb(\Omega)}. \tag{2.73}$$
Observing that g(-n) represents the inverse impulse response of the elementary all-pass section and that going back on the delay line corresponds to system inversion, we have

$$g_{-i}(n) = g_i(-n). \tag{2.74}$$
With (2.70), the expectation in (2.71) can be elaborated as

$$U_\varepsilon(\delta) = \sum_{n'}\sum_{n''} g_i(n')\,g_{i+\varepsilon}(n'')\,U(\delta+n'-n'') = \sum_{n}\Big[\sum_{n'} g_i(n')\,g_{i+\varepsilon}(n'+n)\Big]\,U(\delta-n), \tag{2.75}$$

where, using (2.74),

$$\sum_{n'} g_i(n')\,g_{i+\varepsilon}(n'+n) = g_i(-n)*g_{i+\varepsilon}(n) = g_\varepsilon(n), \tag{2.76}$$

so that

$$U_\varepsilon(\delta) = \sum_n g_\varepsilon(n)\,U(\delta-n) = g_\varepsilon(\delta)*U(\delta). \tag{2.77}$$

An analogous relation holds for the noise correlation:

$$V_\varepsilon(\delta) = g_\varepsilon(\delta)*V(\delta). \tag{2.78}$$
In Section 2.5, before (2.42), an expression for the weight-error correlation was derived:

$$A_\varepsilon(0) = E\{a_i(n)\,a_{i+\varepsilon}(n)\} = \tilde h_\varepsilon(0) = 4\mu^2\sum_\delta F_\varepsilon(\delta). \tag{2.79}$$

Here

$$F_\varepsilon(\delta) = V(\delta)\,U_\varepsilon(\delta),\qquad \sum_\delta F_\varepsilon(\delta) = \frac{1}{2\pi}\int_{-\pi}^{\pi} F_{\rm time}\{U_\varepsilon(\delta)\}\;F^*_{\rm time}\{V(\delta)\}\,d\Omega. \tag{2.80}$$

While $F_{\rm time}\{V(\delta)\} = \tilde V(\Omega)$ can readily be interpreted as the noise power spectral density (notice the different meaning of the tilde in $\tilde h_\varepsilon(0)$ and in $\tilde V(\Omega)$!), the term $F_{\rm time}\{U_\varepsilon(\delta)\}$ follows from the convolution (2.77) as the product $G_\varepsilon(e^{j\Omega})\,\tilde U(\Omega)$.
Thus (2.79) passes into

$$\tilde h_\varepsilon(0) = \frac{4\mu^2}{2\pi}\int_{-\pi}^{\pi} G_\varepsilon(e^{j\Omega})\,\tilde U(\Omega)\,\tilde V(\Omega)\,d\Omega. \tag{2.81}$$

As in Section 2.5, the weight-error correlation follows after normalization by the spatially transformed input correlation:

$$F_{\rm space}\{A_\varepsilon(0)\} = \frac{F_{\rm space}\{\tilde h_\varepsilon(0)\}}{2\,F_{\rm space}\{2\mu\,U_i(0)\}}, \tag{2.82}$$

$$2\,F_{\rm space}\{2\mu\,U_i(0)\} = 4\mu\,F_{\rm space}\Big\{\sum_n g_i(n)\,U(n)\Big\} = \frac{4\mu}{2\pi}\int_{-\pi}^{\pi}\hat G(k;\Omega)\,\tilde U(\Omega)\,d\Omega, \tag{2.83}$$

where $\hat G(k;\Omega)$ denotes the spatial Fourier transform of $G_\varepsilon(e^{j\Omega}) = e^{-j\varepsilon b(\Omega)}$. Hence

$$F_{\rm space}\{A_\varepsilon(0)\} = \mu\,\frac{\displaystyle\int_{-\pi}^{\pi}\hat G(k;\Omega)\,\tilde U(\Omega)\,\tilde V(\Omega)\,d\Omega}{\displaystyle\int_{-\pi}^{\pi}\hat G(k;\Omega)\,\tilde U(\Omega)\,d\Omega}. \tag{2.84}$$

The kernel $\hat G(k;\Omega)$ concentrates at $\Omega = b^{-1}(k)$; each of the two integrals in (2.84) picks up its integrand at this frequency with the common weight

$$\frac{1}{[db/d\Omega]_{\Omega=b^{-1}(k)}}. \tag{2.85}$$

Inserting this result into (2.84) shows that the Fourier transform of the weight-error correlation is independent of the input signal (of both its amplitude and its spectral distribution):

$$F_{\rm space}\{A_\varepsilon(0)\} = \mu\,\tilde V(b^{-1}(k)), \tag{2.86}$$
thus solely determined by the noise power spectrum. We are acquainted with such a result from the TDL structure, where $b(\Omega) = \Omega$, $b^{-1}(k) = k$, and $A_\varepsilon(0) = \mu V(\varepsilon)$ (cf. (2.42)). In our generalized situation, we have a simple relation only in the spatial frequency domain, which, however, contains a nonlinear frequency transformation $\Omega = b^{-1}(k)$. In the spatial domain the weight-error correlation $A_\varepsilon(0)$ is determined by the noise correlation $V(\delta)$ such that, for a certain $\varepsilon$, $A_\varepsilon(0)$ depends on $V(\delta)$ for all values of $\delta$. The dependence is linear and can formally be described by an (infinite) matrix (this item is not elaborated here).
2.9.2 Transients
Now we discuss the adaptive process, in which the additive noise can be neglected. It is governed by the homogeneous difference equation

$$a_i(n+1) = a_i(n) - 2\mu\,U_i * a_i(n), \tag{2.87}$$

with

$$U_i = \sum_n g_i(n)\,U(n). \tag{2.88}$$

Let $a(n; e^{jk})$ denote the spatial Fourier transform of $a_i(n)$; then spatial transformation of (2.87) yields

$$a(n+1;\,e^{jk}) = \big[1 - R(e^{jk})\big]\,a(n;\,e^{jk}). \tag{2.89}$$
Our main task now is to determine $R(e^{jk})$ for a cascade of identical all-pass sections:

$$R(e^{jk}) = 2\mu\,F_{\rm space}\{U_i\} = 2\mu\,F_{\rm space}\Big\{\sum_n g_i(n)\,U(n)\Big\}
= \frac{2\mu}{2\pi}\int_{-\pi}^{\pi}\hat G(k;\Omega)\,\tilde U(\Omega)\,d\Omega
= 2\mu\left[\frac{\tilde U(\Omega)}{\tau(\Omega)}\right]_{\Omega=b^{-1}(k)}, \tag{2.90}$$

where $\tau(\Omega) = db/d\Omega$.
2.9.3 Stability

Finally, we derive a necessary stability condition for the filter under consideration, following the reasoning of Section 2.7. First, $B_\varepsilon(0) = \tilde h_\varepsilon(0) + 4\mu^2\sum_\delta G_\varepsilon(\delta)$, in which fourth-order moments of the tap signals occur. For $l \to \pm\infty$ these moments factorize,

$$E\{u_i(n)\,u_{i+l+m}(n)\,u_{i+\varepsilon}(n+\varepsilon')\,u_{i+\varepsilon+l}(n+\varepsilon')\}
\to E\{u_i(n)\,u_{i+\varepsilon}(n+\varepsilon')\}\,E\{u_{i+l+m}(n)\,u_{i+\varepsilon+l}(n+\varepsilon')\},$$

since $U(l+m) \to 0$ and $U(l) \to 0$. Using (2.77) we find

$$\sum_\delta G_\varepsilon(\delta) = M\sum_m A_{\varepsilon+m}(0)\,q(m) = M\,\big[A_\varepsilon(0)*q(\varepsilon)\big],\qquad
q(m) \;\hat=\; \frac{1}{2\pi}\int_{-\pi}^{\pi} e^{jmb(\Omega)}\,\tilde U^2(\Omega)\,d\Omega,$$

so that

$$B_\varepsilon(0) = h_\varepsilon(0) + 4\mu^2 M\,\big[A_\varepsilon(0)*q(\varepsilon)\big],$$

which, after a spatial Fourier transformation, passes into

$$F_{\rm space}\{B_\varepsilon(0)\} = \mu\,M\,\tilde U(b^{-1}(k))\,F_{\rm space}\{A_\varepsilon(0)\}, \tag{2.91}$$

because $F_{\rm space}\{q_\varepsilon\} = [\tilde U^2(\Omega)/\tau(\Omega)]_{\Omega=b^{-1}(k)}$ and $F_{\rm space}\{h_\varepsilon(0)\}$ combine in such a way that a factor $\tau(b^{-1}(k))\cdot 4\mu\,\tilde U(b^{-1}(k))$ cancels; as before, $\tau(\Omega) = db/d\Omega$. In order that the iteration procedure converges, we have to satisfy

$$\mu\,M\,\tilde U_{\max} < 1. \tag{2.92}$$

Comparing this result with (2.59) for the TDL structure, we do not observe any difference: also for the general all-pass structure, the upper bound on $\mu$ is determined only by the maximum input spectral density.
2.10 EXPERIMENTS

2.10.1 Steady State
For a sufficiently small step-size and for an infinitely long delay line, the weight-error correlations have been shown to satisfy (2.42). For a line of moderate or small length, deviations from (2.42) have to be expected, particularly in the vicinity of the terminations. However, this occurs only if the input signal and the additive noise are nonwhite: the weight-error correlation matrix then satisfies the Lyapunov equation (2.12), whose solution exactly agrees with (2.42) if at least one of the two signals is white. In that case, no reflections occur at the terminations.

Therefore, let u(n), v(n) both be colored, for example, U(0) = V(0) = 2, U(1) = U(-1) = 0.8, V(1) = V(-1) = 1, U(i) = V(i) = 0 for |i| > 1. For the weight-error correlation between two taps i, j on the infinitely long delay line, (2.42) yields $E\{a_i(n)a_j(n)\} = \mu V(i-j)$. However, in the vicinity of the terminations, the (i, j) element $E\{a_i(n)a_j(n)\}$ of the weight-error correlation matrix no longer depends only on the tap distance (i - j). In other words, in the vicinity of the borders, the weight-error correlation matrix deviates from the Toeplitz form. We illustrate that for a delay line of length 6, for which, apart from a multiplicative factor $\mu$, the exact
weight-error correlation matrix equals

$$\begin{pmatrix}
2.44 & 0.90 & 0.03 & 0.01 & 0.00 & 0.00\\
0.90 & 2.05 & 0.98 & 0.01 & 0.00 & 0.00\\
0.03 & 0.98 & 2.01 & 0.99 & 0.01 & 0.01\\
0.01 & 0.01 & 0.99 & 2.01 & 0.98 & 0.03\\
0.00 & 0.00 & 0.01 & 0.98 & 2.05 & 0.90\\
0.00 & 0.00 & 0.01 & 0.03 & 0.90 & 2.44
\end{pmatrix}.$$
Particularly in the corners (left above, right below), deviations are observed from what (2.42) predicts, viz., a Toeplitz matrix T with $T_{ii} = 2$, $T_{i,i+1} = T_{i,i-1} = 1$, $T_{ij} = 0$ elsewhere. The above result has been supported experimentally in a run of $5\cdot 10^7$ cycles with $\mu = 0.782\cdot 10^{-3}$. None of the measured correlations deviates more than $\pm 0.02$ from the theoretical results.
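The run described above can be reproduced in miniature. The sketch below is an illustrative assumption, not the original experiment: it uses the chapter's $2\mu$ update convention, shapes white noise into the stated correlations with hypothetical MA(1) coefficients, and scales the run length down, so only the rough Toeplitz structure of the measured matrix, not the exact corner values, should be expected.

```python
import numpy as np

rng = np.random.default_rng(1)
M, T, mu = 6, 150_000, 1e-3

# Colored input with U(0) ~ 2, U(1) ~ 0.8 via an MA(1) model (approximate coefficients)
b0, b1 = 1.265, 0.632                 # b0^2 + b1^2 ~ 2, b0*b1 ~ 0.8
wn = rng.standard_normal(T + M)
u_sig = b0 * wn[1:] + b1 * wn[:-1]
# Colored noise with V(0) = 2, V(1) = 1
zn = rng.standard_normal(T + M)
v = zn[1:] + zn[:-1]

w_opt = rng.standard_normal(M)
w = np.zeros(M)
acc, count = np.zeros((M, M)), 0
for n in range(M, T):
    u = u_sig[n - M + 1:n + 1][::-1]          # tap-input vector
    e = (u @ w_opt + v[n]) - (u @ w)          # desired response minus filter output
    w = w + 2 * mu * u * e                    # LMS with the 2*mu convention
    if n > T // 4:                            # discard the adaptation transient
        a = w - w_opt                         # weight-error vector
        acc += np.outer(a, a)
        count += 1

A = acc / count
print(np.round(A / mu, 2))   # roughly Toeplitz: diagonal near V(0) = 2, first off-diagonal near V(1) = 1
```

The printed matrix, divided by $\mu$, approximates the weight-error correlation matrix discussed above; with this short run the entries fluctuate by roughly ten percent.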
2.10.2 Transients

2.10.3 Stability
2.11 CONCLUSIONS

2.11.1 The Long LMS Filter
In previous sections we studied the transient and steady-state behavior of the long LMS adaptive filter. Further, we discussed the stability problem and derived an upper bound for the step-size. Now we combine these studies and are led to a number of interesting conclusions concerning the global properties of the long LMS filter.
First, consider the stability bound (2.59) for the step-size $\mu$, which we rewrite in the form

$$\mu = \eta\,\mu_{\max} = \eta\,\frac{1}{M\,\big[F_{\rm space}\{U_i\}\big]_{\max}},\qquad 0<\eta<1. \tag{2.93}$$
Inserted into (2.37), this yields $R(\xi)_{\max} = 2\mu\,[F_{\rm space}\{U_i\}]_{\max} = 2\eta/M$. Further, writing the system function in (2.38) in the form $H(z,\xi) = (z - z_0)^{-1}$ with the pole $z_0 = 1 - R(\xi)$, we have

$$z_{0,\min} = 1 - R(\xi)_{\max} = 1 - \frac{2\eta}{M}. \tag{2.94}$$

Thus, the pole remains in the vicinity of +1, just inside the unit circle |z| = 1. Associated with the pole, a time constant $n_0$ can be defined satisfying $z_0^{n_0} = e^{-1}$, which for $z_0$ in the vicinity of 1 can be approximated by $n_0 \approx (1-z_0)^{-1}$, yielding

$$n_{0,\min} = \frac{M}{2\eta}. \tag{2.95}$$
Figure 2.2 Natural behavior of a noise-free LMS adaptive filter for two different step-sizes. The filter length equals M = 50, the weight error is observed at the center of the delay line (tap 25), and the input signal is white.
Figure 2.3 Natural behavior of a noise-free LMS adaptive filter for two different step-sizes. The filter length equals M = 50, and the weight error is observed at the center of the delay line (tap 25). The input signal is colored according to $P_u(e^{j\Omega}) = \mathrm{const}\cdot(1 + 0.8\cos\Omega)$.
With respect to the last item we conclude that, in accordance with (2.43) and the relations ahead of it, the misadjustment can be determined as

$$\text{misadjustment} = \frac{E\{[\mathbf n^t(n)\,\mathbf u(n)]^2\}}{E\{v^2(n)\}}
= \frac{\displaystyle\sum_i\sum_j K_{ij}\,U(i-j)}{E\{v^2(n)\}}
= \frac{\displaystyle M\sum_\varepsilon K_\varepsilon\,U(\varepsilon)}{E\{v^2(n)\}}. \tag{2.96}$$

2.11.2
2.11.2
Now we investigate which modifications of the wave theory are required to adapt it to the normalized least-mean-square (NLMS) algorithm, governed by the updating relation

$$\mathbf n(n+1) = \left[I - 2\tilde\mu\,\frac{\mathbf u(n)\,\mathbf u^t(n)}{\mathbf u^t(n)\,\mathbf u(n)}\right]\mathbf n(n) + 2\tilde\mu\,\frac{\mathbf u(n)}{\mathbf u^t(n)\,\mathbf u(n)}\,f(n). \tag{2.97}$$
For a long tapped-delay line we make the basic observation that, due to ergodicity, the normalizing quantity $\mathbf u^t(n)\mathbf u(n)$ becomes (almost) independent of time,

$$\mathbf u^t(n)\,\mathbf u(n) = u^2(n)+u^2(n-1)+u^2(n-2)+\cdots+u^2(n-M+1)\approx M\,E\{u^2(n)\}, \tag{2.98}$$

so that the NLMS filter is equivalent to an LMS filter with a step-size $\mu$ equal to $\tilde\mu/(M\,E\{u^2(n)\})$. In particular, the weight-error correlation (2.42) passes into

$$A_\varepsilon(0) = E\{a_i(n)\,a_{i+\varepsilon}(n)\} = \frac{\tilde\mu}{M\,E\{u^2(n)\}}\,V(\varepsilon) = \frac{\tilde\mu}{M\,E\{u^2(n)\}}\,E\{v(n)\,v(n+\varepsilon)\}, \tag{2.99}$$

and the misadjustment becomes

$$\frac{\tilde\mu}{E\{v^2(n)\}\,E\{u^2(n)\}}\sum_\varepsilon E\{v(n)\,v(n+\varepsilon)\}\,E\{u(n)\,u(n+\varepsilon)\}, \tag{2.100}$$

which, in contrast to (2.43), is symmetric with respect to the input and noise signals and independent of M. Similarly, expressions can be derived for the adaptive process, again with $\mu$ replaced by $\tilde\mu/(M\,E\{u^2(n)\})$.
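The asserted equivalence between NLMS with step-size $\tilde\mu$ and LMS with $\mu = \tilde\mu/(M\,E\{u^2(n)\})$ is easy to check numerically. The sketch below is an illustration under assumed conditions: it uses the standard single-step-size update convention (without the factor 2 of (2.97)), white input with $E\{u^2\} = 1$, and invented parameter values.

```python
import numpy as np

rng = np.random.default_rng(2)
M, N, mu_tilde, sv = 50, 20_000, 0.5, 0.01
w_opt = rng.standard_normal(M) / np.sqrt(M)
x = rng.standard_normal(N + M)          # white input, E{u^2} = 1
mu = mu_tilde / (M * 1.0)               # equivalent LMS step-size mu~/(M E{u^2})

def run(normalized):
    w = np.zeros(M)
    msd = np.empty(N)
    for n in range(N):
        u = x[n:n + M][::-1]            # tap-input vector
        e = (u @ w_opt + sv * rng.standard_normal()) - u @ w
        step = mu_tilde / (u @ u) if normalized else mu
        w = w + step * u * e
        msd[n] = np.sum((w - w_opt) ** 2)   # mean-square weight deviation
    return msd

msd_nlms, msd_lms = run(True), run(False)
print(msd_nlms[-1000:].mean(), msd_lms[-1000:].mean())   # comparable steady-state levels
```

For a long line the normalization $u^t(n)u(n)$ hardly fluctuates around $M\,E\{u^2\}$, so the two learning curves nearly coincide; for short filters or strongly colored inputs the equivalence degrades, as the text goes on to discuss.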
Using the same reasoning as above and using (2.59), we would arrive at the stability bound

$$\tilde\mu < \frac{E\{u^2(n)\}}{P_u(e^{j\Omega})} = \frac{\operatorname{average}\{P_u(e^{j\Omega})\}}{P_u(e^{j\Omega})}\qquad\text{for all }\Omega,$$

which is more restrictive than the well-known NLMS stability bound [20]

$$\tilde\mu < 1. \tag{2.101}$$
Only for the special case of a white input are both bounds identical. Which bound is correct in the case of a colored input? Following the reasoning cited in [20], the NLMS filter is stable under the condition (2.101), because then the homogeneous updating equation (without an excitation term) is associated with a nonincreasing energy function. This simple reasoning is convincing. Moreover, the bound (2.101) is confirmed by simulations.

What, then, is wrong with our own reasoning? Apparently, the approximation (2.98) can fail from time to time, in that the length of the input vector can deviate considerably from the value predicted by (2.98), and local instabilities can occur. Thus, the bound (2.101) cannot be derived from the stability bound (2.59) for the LMS filter. In passing, we note that from a stability point of view, NLMS obviously deserves preference over LMS.
2.11.3

2.12 APPENDIXES
In terms of the factor $\Gamma(n,j) \hat= \mathbf u^t(n)\,\mathbf u(j)$, the error dynamics take the form

$$y(n) = -2\mu\sum_{j=0}^{n-1}\Gamma(n,j)\,y(j) + 2\mu\sum_{j=0}^{n-1}\Gamma(n,j)\,v(j). \tag{2.103}$$

The factor $\Gamma(n,j)$ deserves particular consideration. For an extremely long delay line, this quantity loses its stochastic character and can be approximated by a constant. To show that, elaborate the inner product

$$\Gamma(n,j) = \mathbf u^t(n)\,\mathbf u(j) = \sum_{i=0}^{M-1} u(n-i)\,u(j-i) \tag{2.104}$$
and exploit ergodicity of u(n) (time averaging = ensemble averaging). Then the sum becomes approximately M times the autocorrelation of the input signal:

$$\Gamma(n,\,n-l) \approx M\,U(l),\qquad U(l) = E\{u(n)\,u(n-l)\},\qquad l = n-j. \tag{2.105}$$

Notice that even for large M, this relation has an approximate character. On the constant determined in (2.105), an (albeit small) oscillatory stochastic term is superimposed (cf. (2.110)), which has to be taken into account throughout when interpreting (2.103). Below we demonstrate that for increasing values of l the approximate value (2.105) becomes smaller and smaller, whereas the oscillatory contribution does not decrease. Thus the relative error of (2.105) is large for large l.

From (2.104) we conclude that

$$E\{\Gamma(n,j)\} = M\,U(n-j), \tag{2.106}$$
$$E\{\Gamma^2(n,j)\} = E\Big\{\Big[\sum_{i=0}^{M-1}u(n-i)\,u(j-i)\Big]^2\Big\}
= \sum_{i_1=0}^{M-1}\sum_{i_2=0}^{M-1} E\{u(n-i_1)\,u(n-i_2)\,u(j-i_1)\,u(j-i_2)\}. \tag{2.107}$$

For a Gaussian input signal the right-hand expectation can be expanded as follows:

$$\begin{aligned}
E\{u(n-i_1)\,&u(n-i_2)\,u(j-i_1)\,u(j-i_2)\}\\
&= E\{u(n-i_1)u(j-i_1)\}\,E\{u(n-i_2)u(j-i_2)\}\\
&\quad + E\{u(n-i_1)u(n-i_2)\}\,E\{u(j-i_1)u(j-i_2)\}\\
&\quad + E\{u(n-i_1)u(j-i_2)\}\,E\{u(n-i_2)u(j-i_1)\}\\
&= U^2(n-j) + U^2(i_2-i_1) + U(n-j+i_1-i_2)\,U(n-j-i_1+i_2).
\end{aligned}$$

Then we have (after minor elementary manipulations)

$$E\{\Gamma^2(n,j)\} = M^2U^2(n-j) + M\sum_{l=-M}^{M}\Big(1-\frac{|l|}{M}\Big)\big[U^2(l)+U(n-j+l)\,U(n-j-l)\big].$$

The first term equals $[E\{\Gamma(n,j)\}]^2$ (cf. (2.106)), so the sum $\sum_l$ can be interpreted as

$$\sigma^2 = \text{variance of }\Gamma(n,j) = M\sum_{l=-M}^{M}\Big(1-\frac{|l|}{M}\Big)\big[U^2(l)+U(n-j+l)\,U(n-j-l)\big]. \tag{2.108}$$

Furthermore, if $M \gg 1$ and the input correlation length is finite, we can use the approximation

$$\sigma^2 \approx M\sum_{l=-\infty}^{\infty}\big[U^2(l)+U(n-j+l)\,U(n-j-l)\big]. \tag{2.109}$$

So, the RMS value of $\Gamma(n,j)$ increases with $\sqrt M$, while its mean increases with M (in accordance with a basic statistical law regarding the uncertainty in averaged independent observations).
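The $\sqrt M$-versus-$M$ scaling can be illustrated with a quick Monte Carlo check for a white, unit-variance input, using the two extreme cases $|n-j| \ge M$ (disjoint windows) and $n = j$; the sample sizes below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)
M, K = 100, 20_000

# |n - j| >= M: the two length-M windows contain independent samples
gamma_off = np.array([rng.standard_normal(M) @ rng.standard_normal(M) for _ in range(K)])
# n = j: Gamma(n, n) = ||u(n)||^2
gamma_diag = np.array([np.sum(rng.standard_normal(M) ** 2) for _ in range(K)])

print(np.mean(gamma_off))     # near 0: the mean M*U(n-j) vanishes for distant taps
print(np.std(gamma_off))      # near sqrt(M): the fluctuation does not vanish
print(np.mean(gamma_diag))    # near M
```

This is exactly the point made in the text: for distant pairs (n, j) the mean of $\Gamma(n,j)$ is zero while its RMS value stays at $\sqrt M$, so replacing $\Gamma(n,j)$ by its mean incurs a large relative error.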
Now consider (2.106) and (2.109) for a white input signal. For the mean of $\Gamma(n,j)$ we find that $E\{\Gamma(n,j)\} = M\,\delta(n-j)$, while the variance becomes

$$\sigma^2 = M\sum_l\big[\delta(l)+\delta(n-j+l)\,\delta(n-j-l)\big] = M + M\,\delta(n-j).$$

Thus the variance assumes a nonzero value for any pair (n, j), while the mean vanishes for all $n \ne j$. Here we have the key to the illegitimacy of the replacement of $\Gamma(n,j)$ by its mean value: even for taps n, j with a large mutual distance |n - j| we have a nonvanishing $\Gamma(n,j)$, while the simple averaging yields a zero value. Hence, for most pairs (n, j) the relative error in the approximation is large. Similar reasoning applies to a colored input signal.

We now decompose $\Gamma(n,j)$ into its mean value and a time-varying part $\gamma(n,j)$ with zero mean:

$$\Gamma(n,j) = E\{\Gamma(n,j)\} + \gamma(n,j) = M\,U(n-j) + \gamma(n,j). \tag{2.110}$$
Neglecting $\gamma(n,j)$, one derives from (2.103)

$$y(n) = -2\mu M\sum_{j=0}^{n-1}U(n-j)\,y(j) + 2\mu M\sum_{j=0}^{n-1}U(n-j)\,v(j).$$

If, instead of $n_0 = 0$, we choose $n_1 = -\infty$ as the initial condition, we deal with the steady state and find

$$y(n) = -2\mu M\sum_{j=-\infty}^{n-1}U(n-j)\,y(j) + 2\mu M\sum_{j=-\infty}^{n-1}U(n-j)\,v(j). \tag{2.111}$$

In the low-$\mu$ approximation (zeroth-order solution) the first right-hand term can be neglected, so that we arrive at

$$y(n) = 2\mu M\,\big[U(1)\,v(n-1)+U(2)\,v(n-2)+U(3)\,v(n-3)+\cdots\big]. \tag{2.112}$$
2.12.1.1 Stability. The concomitant homogeneous equation reads

$$y(n) = -2\mu M\sum_{j=-\infty}^{n-1}U(n-j)\,y(j). \tag{2.113}$$
Although we doubt the correctness of this linear equation, we can wonder whether a stable filter at least satisfies the concomitant stability condition

$$1 + 2\mu M\,G(z) \ne 0 \quad\text{for}\ |z| > 1; \tag{2.114}$$

in other words, all zeros of $1 + 2\mu M\,G(z)$ lie inside the unit circle |z| = 1. Here G(z) is defined as

$$G(z) = U(1)\,z^{-1} + U(2)\,z^{-2} + U(3)\,z^{-3} + \cdots, \tag{2.115}$$

$$\sum_{i=-\infty}^{\infty} U(i)\,z^{-i} = U(0) + G(z) + G(z^{-1}). \tag{2.116}$$
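Condition (2.114) is easy to test numerically for a finite correlation length: multiplying $1 + 2\mu M\,G(z)$ by $z^L$ turns it into a polynomial whose roots must all lie inside the unit circle. The correlation values below are hypothetical, chosen only to illustrate the check.

```python
import numpy as np

# Hypothetical input autocorrelation with correlation length L = 2
U = {1: 0.8, 2: 0.3}
M, mu = 50, 1e-3
L = max(U)

# 1 + 2*mu*M*(U(1)/z + U(2)/z^2) = 0  <=>  z^2 + 2*mu*M*U(1)*z + 2*mu*M*U(2) = 0
coeffs = [1.0] + [2 * mu * M * U[l] for l in range(1, L + 1)]
roots = np.roots(coeffs)
print(np.abs(roots))   # all moduli below 1: (2.114) is satisfied for this step-size
```

Sweeping $\mu$ upward in such a script shows the roots migrating toward the unit circle, which locates the step-size at which the necessary condition (2.114) fails.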
Since G(z) is regular for |z| > 1 (cf. (2.115)), it assumes its maximum and minimum real parts on the boundary |z| = 1; then

$$-1 < 2\mu M\,\Re\{G(z)\} < 1 \tag{2.121}$$

holds also for |z| > 1. Thus $1 + 2\mu M\,G(z)$ cannot vanish for |z| > 1, because then its real part would also vanish there. We therefore conclude that (2.59) implies (2.114). On the other hand, with the aid of simple examples, one can show that the converse is not true. Thus (2.114) is necessary for stability but not sufficient. This conjecture can already be found in [9].
2.12.2

Here the double transform of the weight-error correlation involves the kernel

$$\frac{1}{\big(1-R(\xi)\big)\big(2-z^{-1}-z\big)+R^2(\xi)} \approx \frac{1}{2-z^{-1}-z+R^2(\xi)}, \tag{2.123}$$
where, with $\xi$ fixed and $z = e^{j\Omega}$,

$$F_{\rm time}^{-1}\Big\{\frac{1}{2-z^{-1}-z+R^2(\xi)}\Big\}
= \frac{1}{2\pi}\int_{-\pi}^{\pi}\frac{z^{|\delta|}}{2-z^{-1}-z+R^2(\xi)}\,d\Omega
= \frac{1}{2\pi j}\oint\frac{z^{|\delta|}\,dz}{-(z-z_1)(z-z_2)}
= \frac{z_1^{|\delta|}}{z_2-z_1} \approx \frac{z_1^{|\delta|}}{2R(\xi)}, \tag{2.124}$$

with the poles

$$z_{1,2} = 1+\tfrac{1}{2}R^2(\xi)\mp R(\xi)\sqrt{1+\tfrac{1}{4}R^2(\xi)} \approx 1\mp R(\xi). \tag{2.125}$$
On the long line, the spatial transform of the input correlation reads

$$F_{\rm space}\{U_\varepsilon(\delta)\} = \frac{1}{2\mu}\,R(\xi)\,\xi^{\delta}, \tag{2.126}$$

so that

$$F_{\rm space}\{A_\varepsilon(\delta)\} = \frac{z_1^{|\delta|}}{2R(\xi)}\cdot 4\mu^2\cdot\frac{1}{2\mu}\,R(\xi)\,\xi^{\delta}\,V(\delta)
= \mu\,z_1^{|\delta|}\,V(\delta)\,\xi^{\delta} = \mu\,G_s(\delta;\xi)\,V(\delta)\,\xi^{\delta}, \tag{2.127}$$

where

$$G_s(\delta;\xi) \;\hat=\; z_1^{|\delta|} \approx \big(1-R(\xi)\big)^{|\delta|} = F_{\rm space}\{G_\varepsilon(\delta)\}, \tag{2.128}$$

the latter equality defining the weighting function

$$G_\varepsilon(\delta) \;\hat=\; F_{\rm space}^{-1}\big\{z_1^{|\delta|}\big\}. \tag{2.129}$$
Its time transform and the double transform are also of interest:

$$G_t(z;\varepsilon) = F_{\rm time}\{G_\varepsilon(\delta)\}, \tag{2.130}$$

$$G_{ts}(z;\xi) = F_{\rm time}F_{\rm space}\{G_\varepsilon(\delta)\}. \tag{2.131}$$
Further temporal transformation of (2.127) yields ($\tilde V$ denotes the Fourier transform of $V(\delta)$)

$$F_{\rm time}F_{\rm space}\{A_\varepsilon(\delta)\} = \mu\,G_{ts}(z;\xi)\,F_{\rm time}\{V(\delta)\,\xi^{\delta}\},$$

which, after an inverse spatial transformation, yields

$$F_{\rm time}\{A_\varepsilon(\delta)\} = \mu\sum_{\varepsilon'} G_t(z;\,\varepsilon-\varepsilon')\,V(\varepsilon')\,z^{-\varepsilon'}. \tag{2.132}$$

Inverse time transformation then leads to the desired weight-error correlation:

$$A_\varepsilon(\delta) = \frac{\mu}{2\pi}\int_{-\pi}^{\pi}\sum_{\varepsilon'}G_t(z;\,\varepsilon-\varepsilon')\,V(\varepsilon')\,z^{\delta-\varepsilon'}\,d\Omega
= \mu\sum_{x=-\infty}^{\infty}V(x)\,G_{\varepsilon-x}(\delta-x). \tag{2.133}$$
In passing we note that, due to (2.128), the weighting function $G_\varepsilon(\delta)$ can also be written in the form (the approximation uses $R(\xi) \ll 1$)

$$G_\varepsilon(\delta) = \frac{1}{2\pi}\int_{-\pi}^{\pi}\big[1-R(e^{jk})\big]^{|\delta|}\,e^{j\varepsilon k}\,dk
\approx \frac{1}{2\pi}\int_{-\pi}^{\pi}e^{-|\delta|\,R(e^{jk})}\,e^{j\varepsilon k}\,dk. \tag{2.134}$$
REFERENCES
1. H. J. W. Belt and H. J. Butterweck, Cascaded all-pass sections for LMS adaptive filtering. Proc. European Conference on Signal Processing, Trieste (1996) (ed. G. Ramponi), pp. 1219-1222.
2. N. J. Bershad, Analysis of the normalized LMS algorithm with Gaussian inputs. IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP-34 (1986), pp. 793-806.
3. H. J. Butterweck, An approach to LMS adaptive filtering without use of the independence assumption. Proc. European Conference on Signal Processing, Trieste (1996) (ed. G. Ramponi), pp. 1223-1226.
4. H. J. Butterweck, Iterative analysis of the steady-state weight fluctuations in LMS-type adaptive filters. Eindhoven University of Technology, Report 96-E-299, ISBN 90-6144-299-0, June 1996.
5. H. J. Butterweck, The independence assumption: a dispensable tool in adaptive filter theory. Signal Processing, vol. 57 (1997), pp. 305-310.
6. H. J. Butterweck, A new interpretation of the misadjustment in adaptive filtering. Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Atlanta (1996) (ed. M. H. Hayes), pp. 1641-1643.
7. H. J. Butterweck, Iterative analysis of the steady-state weight fluctuations in LMS-type adaptive filters. IEEE Trans. Signal Processing, vol. 47 (1999), pp. 2558-2561.
8. H. J. Butterweck, A wave theory of long adaptive filters. IEEE Trans. Circuits and Systems I, vol. 48 (2001), pp. 739-747.
9. P. M. Clarkson and P. R. White, Simplified analysis of the LMS adaptive filter using a transfer function approximation. IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP-35 (1987), pp. 987-993.
10. P. M. Clarkson, Optimal and Adaptive Signal Processing. Boca Raton: CRC Press, 1993.
11. S. C. Douglas and W. Pan, Exact expectation analysis of the LMS adaptive filter. IEEE Trans. Signal Processing, vol. 43 (1995), pp. 2863-2871.
12. S. C. Douglas and T. H.-Y. Meng, Exact expectation analysis of the LMS adaptive filter without the independence assumption. Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, San Francisco, vol. IV (Mar. 1992), pp. 61-64.
13. R. Feldtkeller, Vierpoltheorie. Stuttgart: S. Hirzel, 1959.
14. A. Fettweis, Digital filters related to classical filter networks. Arch. Elektr. Uebertr., vol. 25 (1971), pp. 79-89.
15. A. Feuer and E. Weinstein, Convergence analysis of LMS filters with uncorrelated Gaussian data. IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP-33 (1985), pp. 222-230.
16. S. Florian and A. Feuer, Performance analysis of the LMS algorithm with a tapped-delay line (two-dimensional case). IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP-34 (1986), pp. 1542-1549.
17. W. A. Gardner, Learning characteristics of stochastic-gradient-descent algorithms: a general study, analysis, and critique. Signal Processing, vol. 6 (1984), pp. 113-133.
18. R. M. Gray, On the asymptotic eigenvalue distribution of Toeplitz matrices. IEEE Trans. Information Theory, vol. IT-18 (1972), pp. 725-730.
19. L. Guo, L. Ljung, and G. Wang, Necessary and sufficient conditions for stability of LMS. IEEE Trans. Automatic Control, vol. 42 (1997), pp. 761-770.
20. S. Haykin, Adaptive Filter Theory (fourth edition). London: Prentice-Hall, 2001.
21. H. J. Kushner, Approximation and Weak Convergence Methods for Random Processes with Applications to Stochastic System Theory. Cambridge, Mass.: MIT Press, 1984.
22. O. Macchi, Adaptive Processing: The Least-Mean-Square Approach with Applications in Transmission. Chichester, UK: Wiley, 1995.
23. J. E. Mazo, On the independence theory of equalizer convergence. Bell System Tech. J., vol. 58 (1979), pp. 963-993.
24. M. Reuter and J. Zeidler, Non-Wiener effects in LMS-implemented adaptive equalizers. Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Munich, vol. 3 (Apr. 1997), pp. 2509-2512.
25. D. T. M. Slock, On the convergence behavior of the LMS and the normalized LMS algorithms. IEEE Trans. Signal Processing, vol. 41 (1993), pp. 2811-2825.
26. V. Solo, The stability of LMS. IEEE Trans. Signal Processing, vol. 45 (1997), pp. 3017-3026.
27. V. Solo and X. Kong, Adaptive Signal Processing Algorithms. Englewood Cliffs, NJ: Prentice-Hall, 1995.
28. V. Solo, The error variance of LMS with time-varying weights. IEEE Trans. Signal Processing, vol. 40 (1992), pp. 803-813.
29. V. Solo, The limiting behavior of LMS. IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP-37 (1989), pp. 1909-1922.
30. M. Tarrab and A. Feuer, Convergence and performance analysis of the normalized LMS algorithm with uncorrelated Gaussian data. IEEE Trans. Information Theory, vol. 34 (1988), pp. 680-691.
31. B. Widrow and S. D. Stearns, Adaptive Signal Processing. Englewood Cliffs, NJ: Prentice-Hall, 1985.
32. B. Widrow, J. M. McCool, M. G. Larimore, and C. R. Johnson, Jr., Stationary and nonstationary learning characteristics of the LMS adaptive filter. Proc. IEEE, vol. 64 (1976), pp. 1151-1162.
ENERGY CONSERVATION
AND THE LEARNING
ABILITY OF LMS ADAPTIVE
FILTERS
ALI H. SAYED
Electrical Engineering Department, University of California, Los Angeles
VITOR H. NASCIMENTO
Department of Electronic Systems Engineering, University of Sao Paulo, Brazil
3.1 INTRODUCTION
Adaptive filters are prominent examples of systems that are designed to adjust to variations in their environments in order to meet certain performance criteria. The learning curve of an adaptive filter is a widely used tool to evaluate how fast and how well an adaptive filter meets (or learns to meet) its objectives. This learning process has been extensively studied in the literature for slowly adapting systems, that is, for systems that employ infinitesimally small step-sizes. This chapter highlights several
1. This material was based on work supported in part by the National Science Foundation under awards CCR-9732376 and ECS-9820765. The work of V. H. Nascimento was also supported by Grant 2000/09569-6 from FAPESP, Brazil.
Least-Mean-Square Adaptive Filters, Edited by Simon Haykin and Bernard Widrow.
ISBN 0-471-21570-8 (c) 2003 John Wiley & Sons, Inc.
phenomena that characterize the learning capabilities of adaptive filters when larger step-sizes are used. The phenomena actually occur even for slowly adapting systems but are less pronounced, which explains why they may go unnoticed.

The purpose of the chapter is to provide a straightforward exposition of the topic so that it can provide motivation for further study and analysis. For this reason, the discussion focuses in some detail on a special case that helps illustrate and explain the desired phenomena in their simplest forms. Readers interested in more advanced cases, and in additional details, are referred to the article [1] and the textbook [20].

Among other results, it is argued here that after an initial learning phase, an adaptive filter generally learns at a rate that is higher than that predicted by mean-square theory. It is also argued that even single-tap adaptive filters can exhibit two very distinct rates of convergence; they learn at a slower rate initially and at a faster rate later. Several examples are provided to illustrate these and other effects.
3.2

The adaptive filters considered in this chapter update their weight estimates according to a recursion of the general form

$$w(n+1) = w(n) + \mu\,\frac{u(n)}{g(u(n))}\,e(n),\qquad n\ge 0,\tag{3.1}$$

$$e(n) = d(n) - u(n)^T w(n),\tag{3.2}$$

where $g(\cdot)$ denotes a data nonlinearity. The choice $g(u(n))\equiv 1$ results in the least-mean-squares (LMS) algorithm

$$w(n+1) = w(n) + \mu\,u(n)\,e(n),\qquad n\ge 0,\tag{3.3}$$

$$e(n) = d(n) - u(n)^T w(n),$$

while the choice $g(u(n))\equiv \delta + \|u(n)\|^2$ results in the normalized least-mean-squares (NLMS) algorithm

$$w(n+1) = w(n) + \mu\,\frac{u(n)}{\delta + \|u(n)\|^2}\,e(n),\qquad n\ge 0,\tag{3.4}$$

$$e(n) = d(n) - u(n)^T w(n),$$

where $\delta$ is a small positive number and $\|\cdot\|$ denotes the Euclidean norm of its vector argument.
3.3

B. While such assumptions are not valid in most practical cases (e.g., when the regressors u(n) arise from a tapped-delay line implementation, in which case assumption (a) is violated), there is ample evidence in the literature (e.g., [2-7]) to support the premise that conclusions obtained under these conditions are sufficiently realistic for slow adaptation scenarios (i.e., for infinitesimally small step-sizes).
C. Ensemble Averaging. The third method of evaluation is the most practical and also the most widely used. It relies on controlled simulation or experimentation. In this technique, an adaptive filter is trained repeatedly, and the resulting squared-error curves are averaged to approximate the variance curve. More specifically, one performs several independent experiments or simulations, say L of them. In each experiment, the adaptive filter is applied for a duration of N iterations, always starting from the same initial condition and under the same statistical conditions for the sequences {d(n), u(n), v(n)}. From each experiment i, a sample error curve is obtained:

$$\text{sample error curve} = \{e_i(n),\ 0\le n\le N\}.$$

After all L experiments are completed, an approximation for the learning curve is computed by averaging as follows:

$$\text{Ensemble-average curve} \;\triangleq\; \frac{1}{L}\sum_{i=1}^{L}[e_i(n)]^2,\qquad 0\le n\le N. \tag{3.5}$$

This method of evaluation is useful for complex filter updates for which closed-form expressions for learning curves are difficult to obtain even under the independence conditions. The method is also useful even for simple filter structures, e.g., when an analysis by the independence theory is not possible or even reliable due, for example, to faster adaptation (a situation that corresponds to non-infinitesimal step-sizes).
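The averaging procedure (3.5) can be implemented in a few lines. The sketch below is a minimal illustration; the filter length, signal statistics, and the counts M, N, L, and the step-size are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(4)
M, N, L, mu = 5, 200, 200, 0.05
w_opt = rng.standard_normal(M)
curves = np.zeros((L, N))

for i in range(L):                       # L independent experiments
    w = np.zeros(M)                      # same initial condition each time
    for n in range(N):
        u = rng.standard_normal(M)
        e = (u @ w_opt + 0.1 * rng.standard_normal()) - u @ w
        curves[i, n] = e ** 2            # sample error curve [e_i(n)]^2
        w = w + mu * u * e               # LMS update (3.3)

ensemble_avg = curves.mean(axis=0)       # the ensemble-average curve (3.5)
print(ensemble_avg[0], ensemble_avg[-1]) # decays toward the noise floor
```

Increasing L smooths the curve toward the underlying learning curve, which is the behavior discussed in the comparisons that follow.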
3.4
To begin with, it is helpful to introduce a few error measures and to derive a useful energy relation that will be called upon later in the arguments.

A. Error Measures. It is common to associate with every adaptive scheme of the form (3.1)-(3.2) two estimation errors: the so-called a priori and a posteriori errors,

$$e_a(n) \;\triangleq\; u(n)^T\tilde w(n),\qquad e_p(n)\;\triangleq\; u(n)^T\tilde w(n+1),\tag{3.6}$$

where $\tilde w(n) \triangleq w^o - w(n)$ denotes the weight-error vector. With the noise assumption A.1, the error variances are related by

$$E\,e^2(n) = E\,e_a^2(n) + \sigma_v^2,\tag{3.7}$$

so that either of the two curves
can be used to describe the learning behavior of an adaptive filter, since they differ only by a constant factor (equal to $\sigma_v^2$). The noise assumption A.1 stated above will be enforced throughout this chapter, and the discussions will therefore focus on studying the behavior of the curve $E\,e_a^2(n)$.
B. Energy Relation. Subtracting $w^o$ from both sides of (3.1) leads to the weight-error recursion

$$\tilde w(n+1) = \tilde w(n) - \mu\,\frac{u(n)}{g(u(n))}\,e(n). \tag{3.8}$$

Multiplying by $u(n)^T$ from the left, one finds that the errors $\{e_p(n), e_a(n), e(n)\}$ are related via

$$e_p(n) = e_a(n) - \mu\,\frac{\|u(n)\|^2}{g(u(n))}\,e(n). \tag{3.9}$$

Substituting this relation back into (3.8), one obtains, for nonzero u(n), a recursion that relates all four error quantities $\{\tilde w(n+1), \tilde w(n), e_a(n), e_p(n)\}$:

$$\tilde w(n+1) = \tilde w(n) - \frac{u(n)}{\|u(n)\|^2}\,\big[e_a(n) - e_p(n)\big]. \tag{3.10}$$

Observe that the data nonlinearity function g does not appear explicitly in this relation. Evaluating the energies of both sides of this equation leads to the following energy conservation relation:

$$\|\tilde w(n+1)\|^2 + \frac{1}{\|u(n)\|^2}\,e_a^2(n) = \|\tilde w(n)\|^2 + \frac{1}{\|u(n)\|^2}\,e_p^2(n). \tag{3.11}$$
When u(n) = 0, the update (3.1) leaves the weight estimate unchanged, so that

$$\|\tilde w(n+1)\|^2 = \|\tilde w(n)\|^2. \tag{3.12}$$

Both results (3.11) and (3.12) can be grouped together into a single equation by defining

$$\bar\mu(n) = \begin{cases} 0 & \text{if } u(n) = 0,\\[4pt] \dfrac{1}{\|u(n)\|^2} & \text{otherwise,}\end{cases}$$

namely,

$$\|\tilde w(n+1)\|^2 + \bar\mu(n)\,e_a^2(n) = \|\tilde w(n)\|^2 + \bar\mu(n)\,e_p^2(n). \tag{3.13}$$
This energy conservation relation holds for all adaptive filters whose recursions are of the form (3.1)-(3.2), and it was originally developed in [8] in the context of robustness analysis of adaptive filters. No approximations or assumptions are needed to establish (3.13); it is an exact relation that shows how the energies of the weight-error vectors at two successive time instants are related to the energies of the a priori and a posteriori estimation errors. Thus the energy relation provides a convenient and powerful framework for carrying out different kinds of performance analysis, both stochastic and deterministic, for a wide range of adaptive filters (see, e.g., [8-14] and the textbook [20]). It will be used in the sequel to shed some light on the learning behavior of adaptive filters.
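Because the energy relation is an algebraic identity, it can be checked to machine precision along any single realization. The sketch below does this for LMS (g(u) = 1), with $\tilde w(n) = w^o - w(n)$; all signal parameters are invented for the check.

```python
import numpy as np

rng = np.random.default_rng(5)
M, N, mu = 4, 500, 0.05
w_opt = rng.standard_normal(M)
w = np.zeros(M)

for n in range(N):
    u = rng.standard_normal(M)
    d = u @ w_opt + 0.1 * rng.standard_normal()
    e = d - u @ w                          # output error (3.2)
    ea = u @ (w_opt - w)                   # a priori error e_a(n)
    w_next = w + mu * u * e                # LMS update, g(u(n)) = 1
    ep = u @ (w_opt - w_next)              # a posteriori error e_p(n)
    mu_bar = 1.0 / (u @ u)                 # u(n) != 0 almost surely here
    lhs = np.sum((w_opt - w_next) ** 2) + mu_bar * ea ** 2
    rhs = np.sum((w_opt - w) ** 2) + mu_bar * ep ** 2
    assert abs(lhs - rhs) < 1e-10          # the energy relation, exact per iteration
    w = w_next

print("energy relation verified on", N, "iterations")
```

No statistical averaging is involved: the two sides agree at every single step, for any step-size and any data, which is exactly what makes the relation a useful analysis tool.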
3.5 TRANSIENT ANALYSIS

A recursion for the transient behavior of an adaptive filter can be deduced from the energy conservation relation (3.13). For this purpose, one reworks (3.13) so as to express it in an equivalent form that eliminates $e_p(n)$ and keeps only the a priori error $e_a(n)$, whose variance, as indicated earlier, is of interest to the learning behavior of an adaptive filter. Using (3.6) in (3.9) leads to

$$e_p(n) = \left(1-\mu\,\frac{\|u(n)\|^2}{g(u(n))}\right)e_a(n) - \mu\,\frac{\|u(n)\|^2}{g(u(n))}\,v(n). \tag{3.14}$$

Substituting this equality into the energy relation (3.13) and expanding terms, one finds, after some straightforward algebra, the equivalent representation

$$\|\tilde w(n+1)\|^2 = \|\tilde w(n)\|^2
- \left(\frac{2\mu}{g(u(n))} - \frac{\mu^2\|u(n)\|^2}{g^2(u(n))}\right)e_a^2(n)
+ \frac{\mu^2\|u(n)\|^2}{g^2(u(n))}\,v^2(n)
- \frac{2\mu}{g(u(n))}\left(1-\frac{\mu\|u(n)\|^2}{g(u(n))}\right)e_a(n)\,v(n), \tag{3.15}$$

where, for a weighting matrix P, the weighted squared-norm notation

$$\|x\|_P^2 \;\triangleq\; x^T P\,x \tag{3.16}$$

will be used below.
3.6

Consider now the case in which the regressors u(n) are zero-mean, independent and identically distributed Gaussian vectors with covariance matrix

$$R \;\triangleq\; E\,u(n)\,u(n)^T.$$

Assume further that the reference sequence d(n) is independent of {u(m), m != n}. These conditions correspond to a situation in which the independence assumptions are satisfied and, in addition, the learning curve of the adaptive filter can be evaluated in closed form.

Indeed, it follows from the independence assumptions that u(n) is independent of $\tilde w(n)$ and that v(n) and $e_a(n)$ are also independent. Taking expectations of both sides of (3.15) with $g(u(n)) \equiv 1$, and using the independence of v(n) and $e_a(n)$, leads to the recursion
$$E\|\tilde w(n+1)\|^2 = E\|\tilde w(n)\|_A^2 + \mu^2\sigma_v^2\,{\rm Tr}(R), \tag{3.17}$$

with

$$A = I - \mu\big(2-\mu\|u(n)\|^2\big)\,u(n)\,u(n)^T.$$

Observe that the weighting matrix A is a random variable since it depends on u(n). However, the independence of u(n) and $\tilde w(n)$ permits the replacement of A by a constant matrix (namely, by its mean value). To see this, note that

$$E\|\tilde w(n)\|_A^2 = E\,\tilde w(n)^T A\,\tilde w(n)
= E\big[E\{\tilde w(n)^T A\,\tilde w(n)\,|\,\tilde w(n)\}\big]
= E\,\tilde w(n)^T\,E\{A\,|\,\tilde w(n)\}\,\tilde w(n)
= E\|\tilde w(n)\|_F^2,$$

where

$$F \;\triangleq\; E A = I - 2\mu R + \mu^2\big(2R^2 + {\rm Tr}(R)\,R\big),$$

and the above value for F follows from the fact that for real-valued Gaussian regressors it holds that

$$E\,\|u(n)\|^2\,u(n)\,u(n)^T = 2R^2 + {\rm Tr}(R)\,R.$$

In this case, recursion (3.17) is seen to be equivalent to

$$E\|\tilde w(n+1)\|^2 = E\|\tilde w(n)\|_F^2 + \mu^2\sigma_v^2\,{\rm Tr}(R), \tag{3.18}$$

with A replaced by F.
Now consider the choice $R = \sigma_u^2 I$. In this case, ${\rm Tr}(R) = M\sigma_u^2$ and F becomes a constant multiple of the identity,

$$F = \big[1 - 2\mu\sigma_u^2 + \mu^2(M+2)\sigma_u^4\big]\,I,$$

so that the weight-error variance relation (3.18) can be rewritten more directly as

$$E\|\tilde w(n+1)\|^2 = \big[1-2\mu\sigma_u^2+\mu^2(M+2)\sigma_u^4\big]\,E\|\tilde w(n)\|^2 + \mu^2 M\sigma_u^2\sigma_v^2.$$

Now using (3.7) and the fact that

$$E\,e_a^2(n) = E|u(n)^T\tilde w(n)|^2 = E\,\tilde w(n)^T R\,\tilde w(n) = \sigma_u^2\,E\|\tilde w(n)\|^2,$$

one finds that the learning curve for this example is described in closed form by the recursion

$$E\,e^2(n+1) = \big[1-2\mu\sigma_u^2+\mu^2(M+2)\sigma_u^4\big]\,E\,e^2(n) + 2\mu\sigma_u^2(1-\mu\sigma_u^2)\,\sigma_v^2,$$

with initial condition $E\,e^2(0) = E\,d^2(0) \triangleq \sigma_d^2$.

It is clear that in this example the learning curve has a single mode and that it will be decaying (i.e., convergent) if, and only if, the step-size $\mu$ is chosen to satisfy

$$\big|1-2\mu\sigma_u^2+\mu^2(M+2)\sigma_u^4\big| < 1. \tag{3.19}$$

Observe that the value of the mode is positive since

$$1-2\mu\sigma_u^2+\mu^2(M+2)\sigma_u^4 = (1-\mu\sigma_u^2)^2 + \mu^2(M+1)\sigma_u^4 > 0.$$
Condition (3.19) is satisfied for step-sizes in the range

$$0 < \mu < \frac{2}{\sigma_u^2(M+2)}.$$

When this is the case, the filter is said to be mean-square stable. In addition, the fastest convergence rate occurs at the value of $\mu$ that minimizes the magnitude of the corresponding mode, which happens to be

$$\mu^o = \frac{1}{\sigma_u^2(M+2)}.$$

Figure 3.1 shows a plot of the learning curve for the numerical values $\mu = 0.1429$, M = 5, $\sigma_u^2 = 1$, and $\sigma_v^2 = 0.01$. For these numerical values, the filter is mean-square stable for step-sizes satisfying

$$0 < \mu < \frac{2}{7} \approx 0.2857.$$
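The closed-form recursion can be tabulated directly. The sketch below iterates the single mode with the numerical values used here (M = 5, $\sigma_u^2 = 1$, $\sigma_v^2 = 0.01$, $\mu = 0.1429$); the initial value assumes $\|w^o\|^2 = 1$, which is an invented choice.

```python
M, su2, sv2, mu = 5, 1.0, 0.01, 0.1429
lam = 1 - 2 * mu * su2 + mu ** 2 * (M + 2) * su2 ** 2   # the single mode of the learning curve
assert abs(lam) < 1                                     # mean-square stable: mu < 2/7

drive = 2 * mu * su2 * (1 - mu * su2) * sv2             # constant driving term
Ee2 = 1.0 + sv2                                         # E e^2(0) = sigma_d^2 (with ||w_o||^2 = 1)
curve = [Ee2]
for n in range(500):
    Ee2 = lam * Ee2 + drive
    curve.append(Ee2)

steady = drive / (1 - lam)   # fixed point: the theoretical steady-state MSE
print(lam, steady)
```

Plotting `curve` on a logarithmic scale reproduces the straight-line decay of the theoretical learning curve in Figure 3.1, against which the ensemble-average curves of the following figures are compared.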
Figure 3.1 Theoretical learning curve for the LMS algorithm with Gaussian iid regressors, M = 5, $\sigma_u^2 = 1$, $\sigma_v^2 = 0.01$, and $\mu = 0.1429$.

Figure 3.2 Four sample squared-error curves for LMS with Gaussian iid regressors, M = 5, $\sigma_u^2 = 1$, $\sigma_v^2 = 0.01$, and $\mu = 0.1429$.
Figure 3.3 Theoretical and ensemble-average learning curves for LMS with Gaussian iid regressors, M = 5, $\sigma_u^2 = 1$, $\sigma_v^2 = 0.01$, and $\mu = 0.1429$.
Figure 3.4 Theoretical and ensemble-average learning curves for LMS with Gaussian iid regressors, M = 5, $\sigma_u^2 = 1$, $\sigma_v^2 = 0.01$, and $\mu = 0.275$.
5. There is even a difference in behavior between the two ensemble-average curves themselves: the higher the number of averaging experiments, the closer the resulting ensemble-average curve is to the theoretical curve. The analysis in a later section will reveal that even if the number of experiments is increased significantly, there will continue to exist a discrepancy between the theoretical curve and the experimental curve.
It should be mentioned that although the earlier discussion was restricted to
an example with independence assumptions on the data, these assumptions
have actually been enforced in the analysis and in the simulations and are
therefore valid. Thus the differences in behavior that one sees between the
theoretical learning curve and the experimental ones are not due to
assumptions that are made on the theoretical level and that are not valid on
the practical level. In this way, one can conclude that even under these
controlled conditions, the differences still exist. Actually, the differences
occur even for situations where the independence assumptions are not
satised (see [1]).
D. Example 4. There is one more phenomenon to highlight before moving on to a justification of the results observed so far. Thus consider again the numerical values used in Example 1, viz., $M = 5$, $\sigma_u^2 = 1$, and $\sigma_v^2 = 0.01$. For these values, the filter was seen to be mean-square stable for step-sizes satisfying $\mu < 0.2857$. The diverging graph in Figure 3.5 confirms this fact. However, the figure also shows a plot of the ensemble-average curve that is obtained for a larger step-size, $\mu = 0.29$, by averaging over 500 experiments. Mean-square theory predicts instability for this value of $\mu$, while the ensemble-average curve does not seem to diverge. Averaging over a larger number of experiments reveals a similar behavior. An explanation for this behavior is provided later by showing that, for larger step-sizes, there is a noticeable distinction between the mean-square and the almost-sure convergence behaviors of an adaptive filter.
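This phenomenon is easy to reproduce in simulation. The sketch below (illustrative code, not from the chapter; the run length, number of experiments, and random seed are arbitrary choices) runs LMS with Gaussian iid regressors at the mean-square-unstable step-size $\mu = 0.29$: the theoretical mode exceeds one, so $E e^2(n)$ diverges, yet the ensemble-average curve typically remains bounded.

```python
import numpy as np

rng = np.random.default_rng(0)
M, su2, sv2, mu = 5, 1.0, 0.01, 0.29
N, L = 800, 100   # time steps and number of experiments (arbitrary)

# Theoretical mode of the mean-square recursion: exceeds 1 for mu = 0.29 > 2/7
mode = 1 - 2 * mu * su2 + mu**2 * (M + 2) * su2**2

avg = np.zeros(N)   # ensemble-average squared-error curve
for _ in range(L):
    wo = rng.standard_normal(M)     # unknown weight vector (assumed model)
    w = np.zeros(M)
    for n in range(N):
        u = rng.standard_normal(M)                        # regressor, variance 1
        d = u @ wo + np.sqrt(sv2) * rng.standard_normal()  # desired response
        e = d - u @ w
        w += mu * u * e                                    # LMS update
        avg[n] += e**2 / L

print(mode > 1)                 # mean-square theory predicts divergence
print(np.isfinite(avg).all())   # yet the simulated curves stay bounded
```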
E. Example 5. The earlier examples were concerned with data that satisfy the independence assumptions. Now consider a tapped-delay-line implementation with two taps, so that the regression vector at time $n$ has the form
$$ u^T(n) = [\,u(n)\ \ u(n-1)\,]. $$
Observe that, due to the shift structure, two successive regressors cannot be independent and that, therefore, this is a situation where the independence assumptions are not valid. Assume further that the entries $\{u(n)\}$ are iid and uniform in the interval $[-0.5, 0.5]$, so that
$$ R = E\,u(n)u^T(n) = \sigma_u^2 I, \qquad \sigma_u^2 = \tfrac{1}{12}. $$
Figure 3.6 shows the ensemble-average curves that are obtained by averaging over $L = 100$ and $L = 1000$ experiments for $\mu = 7.9$. It is seen that in both cases the averaged curves converge, as opposed to the theoretical curve, which is divergent for this value of the step-size (see [1]). Observe in addition that the larger the value of $L$, the longer the averaged curve stays close to the theoretical curve before ultimately departing from it.

Figure 3.5 A comparison of the theoretical learning curve and the ensemble-average learning curve for a case that is mean-square unstable. The theoretical curve is seen to diverge in both cases, while the experimental curves converge. The plot on the left assumes zero noise, while the plot on the right uses $\sigma_v^2 = 10^{-4}$. The ensemble-average curves were obtained by averaging over 500 experiments, with step-size $\mu = 0.29$.

Figure 3.6 A comparison of the theoretical learning curve and the ensemble-average learning curves for an unstable tapped-delay-line implementation with uniform input. The ensemble-average curves were obtained by averaging over 100 and 10000 experiments with step-size $\mu = 7.9$.
3.7 MEAN-SQUARE CONVERGENCE

The examples in the previous section indicate that the behavior of the ensemble-average curves may show significant differences in relation to the behavior of the theoretical learning curve. An explanation for the origin of these differences is pursued in the following sections, which focus in some detail on the case of a single-tap adaptive filter.
Thus assume that $M = 1$, in which case $w(n)$ and $u(n)$ become scalars. Assume further that the noise signal $v(n)$ is negligible, so that its effect can be ignored. In this case, the energy recursion (3.15) collapses to
$$ \tilde e^2(n+1) = A(n)\,\tilde e^2(n), \qquad (3.20) $$
where $A(n)$ is the scalar random variable
$$ A(n) \triangleq 1 - \frac{2\mu u^2(n)}{g[u(n)]} + \frac{\mu^2 u^4(n)}{g^2[u(n)]} = \left(1 - \frac{\mu u^2(n)}{g[u(n)]}\right)^2. \qquad (3.21) $$
In other words, the dynamics of the mean-square behavior of the filter are determined by $E A(n)$, which is the mode of the above first-order recursion. Moreover, since the output error is given by
$$ e(n) = d(n) - u(n)w(n) = u(n)\tilde e(n), $$
we find that
$$ E e^2(n) = \sigma_u^2\, E \tilde e^2(n), $$
so that studying the evolution of $E\tilde e^2(n)$ is equivalent to studying the learning curve of the filter. Hence, the analysis in the sequel focuses on the behavior of $\tilde e^2(n)$.

Now it is clear from (3.21) that the filter will be mean-square stable if, and only if, the step-size $\mu$ is chosen such that
$$ E A < 1 \iff E\left(1 - \frac{\mu u^2}{g[u]}\right)^2 < 1. $$
Observe that, since all variables are stationary by assumption, the time index $n$ is dropped for compactness of notation, with $\{A, u\}$ written instead of $\{A(n), u(n)\}$. The expectation of $A$ is fully characterized in terms of the second and fourth moments of the normalized random variable
$$ \bar u \triangleq \frac{u}{\sqrt{g[u]}}. $$
Indeed, let
$$ \sigma_{\bar u}^2 \triangleq E \bar u^2, \qquad \rho_{\bar u}^4 \triangleq E \bar u^4. $$
Then $E A = 1 - 2\mu\sigma_{\bar u}^2 + \mu^2\rho_{\bar u}^4$, and the condition $E A < 1$ is equivalent to
$$ 0 < \mu < \frac{2\sigma_{\bar u}^2}{\rho_{\bar u}^4}. $$
(When $g[u] = 1$, these moments reduce to $\sigma_{\bar u}^2 = \sigma_u^2$ and $\rho_{\bar u}^4 = \rho_u^4$.) For ease of comparison with a later condition (see (3.27)), it is convenient to rewrite the requirement $E A < 1$ in the equivalent form (in terms of the natural logarithm) $\ln E A < 0$, where
$$ A \triangleq \left(1 - \frac{\mu u^2}{g[u]}\right)^2. \qquad (3.22) $$
3.8 ALMOST-SURE CONVERGENCE

In order to account for the differences between the theoretical and the experimental learning curves, this section now examines the behavior of a single (or typical) squared-error curve.

Starting from (3.20) and iterating it from time 0 up to time $n$, one arrives at the expression
$$ \tilde e^2(n) = A(n-1)A(n-2)\cdots A(0)\,\tilde e^2(0), \qquad (3.23) $$
or, equivalently, upon taking logarithms,
$$ \ln \tilde e^2(n) = \ln \tilde e^2(0) + \sum_{m=0}^{n-1} \ln A(m). \qquad (3.24) $$
Assuming that the variance of the random variable $\ln A$ is bounded, one can invoke the strong law of large numbers to conclude that, as $n \to \infty$,
$$ \frac{\ln \tilde e^2(n)}{n} \xrightarrow{\text{a.s.}} E \ln A, \qquad (3.25) $$
where a.s. denotes almost-sure convergence. In other words, for large enough $n$, the curve $\ln \tilde e^2(n)/n$ converges almost surely to the constant value $E \ln A$. But what about the sample curve $\tilde e^2(n)$ itself? The answer also follows from the strong law of large numbers, which guarantees that, with probability 1, for each experiment $\Omega$ there exists a finite integer $K(\Omega)$ (dependent on the experiment) such that, for all $n \geq K(\Omega)$, the sample curve $\tilde e^2(n)$ will be upper bounded by the curve
$$ \bar e^2(n) \triangleq \tilde e^2(0)\, \exp\big(n\, E \ln A\big)\, \exp\Big(\sqrt{2n \ln\ln n}\; \sigma_{\ln A}\Big), \qquad (3.26) $$
where $\sigma_{\ln A}$ denotes the standard deviation of $\ln A$ and, as before,
$$ A \triangleq \left(1 - \frac{\mu u^2}{g[u]}\right)^2, \qquad (3.27) $$
where $u$ again is an iid random variable. The bounding curve (3.26) decays to zero when $E \ln A < 0$, so the sample curves converge almost surely under this requirement. This leads to a different condition on $\mu$ than the one derived in (3.22) for mean-square stability.
3.9

Comparing the conditions (3.22) and (3.27) for mean-square and almost-sure convergence, one sees that there is a clear distinction between them. The two conditions are not equivalent; in fact, one always implies the other, since for any nonnegative random variable $A$ for which $E A$ and $E \ln A$ both exist, it holds (by Jensen's inequality) that
$$ E \ln A \leq \ln E A. $$
Therefore, values of the step-size $\mu$ for which mean-square convergence occurs always guarantee almost-sure convergence, while the converse is not true: a value for which $\ln E A > 0$ (and thus for which mean-square divergence occurs) can still satisfy $E \ln A < 0$ and hence guarantee almost-sure convergence, which explains the phenomenon in Figure 3.5.
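This gap between the two conditions can be checked numerically. The sketch below (illustrative code; it assumes the single-tap setting with $g[u] = 1$ and $u(n)$ uniform on $[-0.5, 0.5]$, as in the examples of this chapter) evaluates $\ln E A$ and $E \ln A$ by quadrature. Mean-square stability requires $\mu < 2\sigma_u^2/\rho_u^4 = 40/3 \approx 13.3$, yet at $\mu = 14$ one finds $\ln E A > 0$ (mean-square divergence) while $E \ln A < 0$ (almost-sure convergence).

```python
import numpy as np

# Midpoint quadrature over the density of u ~ Uniform[-0.5, 0.5]; A = (1 - mu*u^2)^2
N = 2_000_000
u = (np.arange(N) + 0.5) / N - 0.5

def ln_EA(mu):
    return np.log(np.mean((1.0 - mu * u**2) ** 2))

def E_lnA(mu):
    return np.mean(np.log((1.0 - mu * u**2) ** 2))

mu = 14.0   # just beyond the mean-square stability bound 40/3
print(ln_EA(mu) > 0 and E_lnA(mu) < 0)                       # m.s. unstable, a.s. convergent
print(all(E_lnA(m) <= ln_EA(m) for m in (0.5, 2.0, 14.0)))   # Jensen's inequality
```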
However, these distinctions disappear for infinitesimally small step-sizes, which explains why the phenomena described before can pass unnoticed at this level of adaptation. This is a consequence of the fact that, under some reasonable assumptions about the probability density function of the random variable $u$, it holds that (see, e.g., [1])
$$ E \ln A = \ln E A + o(\mu). \qquad (3.28) $$
Figure 3.7 plots $E \ln A$ and $\ln E A$ as functions of $\mu$ for the Gaussian case, for which $\sigma_{\bar u}^2 = 1$ and $\rho_{\bar u}^4 = 3$. Note that both plots are close together for small $\mu$, but that they become significantly different as $\mu$ increases. Observe also that $E \ln A$ is negative well beyond the point where $\ln E A$ becomes positive. This implies that there is a range of step-sizes for which a typical curve $\tilde e^2(n)$ converges to zero with probability 1, but $E \tilde e^2(n)$ diverges. This explains the simulations in Figures 3.5 and 3.6. This is not a paradox: since the convergence is not uniform, there is a small (but nonzero) probability that a sample curve $\tilde e^2(n)$ will assume very large values for a long interval of time before converging to zero. Finally, note that the value of $\mu$ that achieves the fastest mean-square convergence is noticeably smaller than the step-size that achieves the fastest almost-sure convergence.
The above results can thus be used to understand the differences between theoretical and simulated learning curves for large $n$ and for larger step-sizes. In other words, the almost-sure analysis clarifies what happens when $L$ (the number of experiments) is fixed and $n$ (the time dimension) is increased: the ensemble-average curve tends to separate from the true average curve for increasing $n$, due to the difference in the convergence rates.
3.10 VARIANCE ANALYSIS

While the almost-sure analysis provides an explanation for the behavior of the ensemble-average curves for large $n$, one observes from the curves of Figure 3.4 that for small $n$, i.e., close to the beginning of the curves, there is usually good agreement between the learning curve and the ensemble-average curves. This initial behavior can be explained by resorting to a variance analysis, which focuses on evaluating the variance of the sample curves.

Figure 3.7 $E \ln A$ and $\ln E A$ as functions of the step-size $\mu$.

Recall Chebyshev's inequality: for any random variable $z$ with variance $\sigma_z^2$ and any $k > 0$,
$$ \mathrm{Prob}\{|z - Ez| \geq k\} \leq \frac{\sigma_z^2}{k^2}. $$
Define now the relative standard deviation
$$ \gamma(n) \triangleq \frac{\sqrt{\mathrm{var}\,\tilde e^2(n)}}{E \tilde e^2(n)}, \qquad (3.29) $$
which is time dependent in general. It then follows from the above Chebyshev inequality (with $k = \tfrac{1}{2}E\tilde e^2(n)$) that
$$ \mathrm{Prob}\Big\{\big|\tilde e^2(n) - E\tilde e^2(n)\big| \geq \tfrac{1}{2}E\tilde e^2(n)\Big\} \leq 4\gamma^2(n). $$
For example, the bound evaluates to 0.01 for $\gamma(n) = 0.05$. This means that there is a 99 percent probability that $\tilde e^2(n)$ will be close to its mean (and, more specifically, lie within the interval $[0.5\,E\tilde e^2(n),\ 1.5\,E\tilde e^2(n)]$). Therefore, the smaller the value of $\gamma(n)$, the closer one expects the sample curve $\tilde e^2(n)$ to be to the theoretical learning curve at that time instant.
Now recall that the ensemble-average learning curve is constructed by averaging together several sample curves $\tilde e^2(n)$ to obtain, say,
$$ \hat D(n) \triangleq \frac{1}{L} \sum_{i=1}^{L} \big[\tilde e^2(n)\big]_i. $$
Assuming that the $L$ experiments are independent, the expected value of the averaged curve $\hat D(n)$ is still equal to $E \tilde e^2(n)$. However, the ratio $\gamma(n)$ that is associated with $\hat D(n)$ will be smaller and given by
$$ \gamma'(n) \triangleq \frac{\sqrt{\mathrm{var}\,\hat D(n)}}{E \tilde e^2(n)} = \frac{\gamma(n)}{\sqrt{L}}. $$
That is, the process of constructing ensemble-average curves reduces the value of $\gamma(n)$ by a factor of $\sqrt{L}$.
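This $\sqrt{L}$ reduction is the usual variance-of-the-mean effect and is easily checked. The sketch below (illustrative code; a skewed chi-squared variable stands in for a sample squared-error value, and the sample sizes are arbitrary) compares the spread of individual samples with the spread of $L$-point averages.

```python
import numpy as np

rng = np.random.default_rng(1)
L, trials = 100, 20_000

# A skewed positive variable standing in for a sample squared-error value
x = rng.standard_normal((trials, L)) ** 2    # chi-squared(1) samples

means = x.mean(axis=1)                       # one L-experiment average per trial
ratio = means.std() / x.std()                # expected to be about 1/sqrt(L)

print(abs(ratio - 1 / np.sqrt(L)) < 0.01)    # close to 1/10 for L = 100
```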
Although a small $\gamma(n)$ is desirable in order to conclude that $\tilde e^2(n)$ or $\hat D(n)$ is close to $E \tilde e^2(n)$, it turns out that $\gamma(n)$ increases with $n$ (and thus $\hat D(n)$ approximates $E \tilde e^2(n)$ less effectively for larger $n$, which is consistent with the results of the almost-sure analysis). To see this, define again the moments (assumed finite)
$$ \sigma_{\bar u}^2 \triangleq E \bar u^2, \qquad \rho_{\bar u}^4 \triangleq E \bar u^4, \qquad \xi_{\bar u}^6 \triangleq E \bar u^6, \qquad \eta_{\bar u}^8 \triangleq E \bar u^8, $$
where $\bar u$ denotes the normalized variable $u/\sqrt{g[u]}$. Then
$$ E \tilde e^4(n) = \big(E A^2\big)^n \tilde e^4(0) = \big(1 - 4\mu\sigma_{\bar u}^2 + 6\mu^2\rho_{\bar u}^4 - 4\mu^3\xi_{\bar u}^6 + \mu^4\eta_{\bar u}^8\big)^n\, \tilde e^4(0). \qquad (3.30) $$
Define further the coefficients
$$ r_4 \triangleq E A^2 = 1 - 4\mu\sigma_{\bar u}^2 + 6\mu^2\rho_{\bar u}^4 - 4\mu^3\xi_{\bar u}^6 + \mu^4\eta_{\bar u}^8, \qquad r_2 \triangleq E A = 1 - 2\mu\sigma_{\bar u}^2 + \mu^2\rho_{\bar u}^4. $$
It holds that $r_4 \geq r_2^2$ (with equality only if $u^2(n)$ is a constant with probability 1). With these definitions, $\gamma(n)$ is given by
$$ \gamma(n) = \frac{\sqrt{r_4^n - r_2^{2n}}}{r_2^n} = \sqrt{\left(\frac{r_4}{r_2^2}\right)^n - 1}. \qquad (3.31) $$
Therefore, except for the trivial case of a constant $u^2(n)$, $\gamma(n)$ is strictly increasing, and thus
$$ \lim_{n\to\infty} \gamma(n) = \infty. $$
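For the uniform-input example the coefficients $r_2$ and $r_4$ are simple polynomials in $\mu$, and (3.31) then exhibits the growth of $\gamma(n)$ explicitly. The sketch below (illustrative code; $g[u] = 1$ and $u$ uniform on $[-0.5, 0.5]$ are assumptions carried over from the earlier examples) evaluates $\gamma(n)$ for $\mu = 0.1$.

```python
import numpy as np

# Even moments of u ~ Uniform[-0.5, 0.5] (g[u] = 1): E u^(2k) = (0.5)^(2k)/(2k+1)
s2, r4m, x6, e8 = 1/12, 1/80, 1/448, 1/2304
mu = 0.1

r2 = 1 - 2*mu*s2 + mu**2 * r4m                            # r2 = E A
r4 = 1 - 4*mu*s2 + 6*mu**2*r4m - 4*mu**3*x6 + mu**4*e8    # r4 = E A^2

n = np.arange(1, 5001)
gamma = np.sqrt((r4 / r2**2) ** n - 1)    # relative standard deviation (3.31)

print(r4 > r2**2)                          # strict, since u^2 is not constant
print(bool(np.all(np.diff(gamma) > 0)))    # gamma(n) strictly increasing
```

Even though $\gamma(1)$ is tiny here, the exponential factor $(r_4/r_2^2)^n$ eventually dominates, so $\gamma(n)$ grows without bound.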
3.11

The variance analysis in the previous section shows that, in the initial adaptation steps, $\tilde e^2(n)$ tends to stay close to its mean, $E \tilde e^2(n)$, since the variance of $\tilde e^2(n)$ is small. As time progresses, the variance grows, and one expects $\tilde e^2(n)$ to wander farther and farther away from its mean. In principle, this could mean that $\tilde e^2(n)$ will assume large and small values equally often. However, this is usually not the case. As time increases, $\tilde e^2(n)$ assumes small values more often than large values, and its probability density function becomes more and more asymmetric.

To explain this behavior, return to (3.23) and rewrite it as
$$ \ln \frac{\tilde e^2(n)}{\tilde e^2(0)} = \sum_{m=0}^{n-1} \ln A(m). \qquad (3.32) $$
Define also
$$ v \triangleq E \ln A(m), \qquad s^2 \triangleq E\big(\ln A(m) - v\big)^2, $$
which are constants since $\{u(n)\}$ is assumed stationary. Assuming that both $v$ and $s^2$ are finite, one can use the Central Limit Theorem [19] to conclude that, as $n \to \infty$,
$$ \frac{1}{s\sqrt{n}} \left(\ln \frac{\tilde e^2(n)}{\tilde e^2(0)} - nv\right) \to \mathcal{N}(0, 1), $$
that is, the quantity on the left-hand side tends to a normal distribution with zero mean and unit variance. It then follows that, as $n$ increases, the distribution of $\tilde e^2(n)$ can be well approximated by the following probability density function:
$$ p_{\tilde e^2}(x) = \frac{1}{x s \sqrt{2\pi n}} \exp\left(-\frac{\big(\ln(x/\tilde e^2(0)) - nv\big)^2}{2ns^2}\right), \qquad x > 0. $$
Figure 3.8 shows $v$ and $s^2$ for $u(n)$ uniformly distributed between $-0.5$ and 0.5. Note the behavior similar to that seen in Figure 3.7, where $u(n)$ is Gaussian. The next figures show $p_{\tilde e^2}(x)$ for several situations (in all cases, the vertical bar indicates the position of $E \ln(\tilde e^2(n)/\tilde e^2(0))$).
The plots in Figure 3.9 show the probability density function (pdf) for $\mu = 0.1$, $n = 10$ (left plot) and $n = 500$ (right plot). In this case, one has from Figure 3.8 that $v = -1.679 \times 10^{-2}$, $\ln E A = -1.668 \times 10^{-2}$, and $s^2 = 2.271 \times 10^{-4}$. Since $v \approx \ln E A$ and $s^2$ is small, one expects the learning curve to approximate well the

Figure 3.9 Left: graph of $p_{\tilde e^2}(x)$ for $u(n)$ uniformly distributed between $-0.5$ and 0.5, $\mu = 0.1$, $n = 10$. Right: graph of $p_{\tilde e^2}(x)$ for $u(n)$ uniformly distributed between $-0.5$ and 0.5, $\mu = 0.1$, $n = 500$.

behavior of a single run of the filter. This expectation is confirmed by the pdfs of $\tilde e^2(10)$ and $\tilde e^2(500)$, which show that $\tilde e^2(n)$ tends to stay close to its mean.

On the other hand, one can see from Figure 3.10 that for $\mu = 2.0$ the behavior is quite different (now one has $v = -0.4005$, $\ln E A = -0.3331$, and $s^2 = 0.1521$; that is, $v$ differs significantly from $\ln E A$, and the variance is large). Even for $n = 10$ the pdf of $\tilde e^2(n)$ is already quite asymmetric, a characteristic that becomes more pronounced as $n$ increases. In this situation, $\tilde e^2(n)$ is much more likely to be smaller, rather than larger, than its average.
Figure 3.10 Left: graph of $p_{\tilde e^2}(x)$ for $u(n)$ uniformly distributed between $-0.5$ and 0.5, $\mu = 2.0$, $n = 10$. Right: graph of $p_{\tilde e^2}(x)$ for $u(n)$ uniformly distributed between $-0.5$ and 0.5, $\mu = 2.0$, $n = 20$.
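The numerical values quoted above for $v$, $s^2$, and $\ln E A$ can be reproduced by direct quadrature, and the asymptotic density can then be evaluated. The sketch below is illustrative code (the uniform single-tap setting with $g[u] = 1$ is assumed); the helper `pdf` is a hypothetical name for the log-normal-type approximation given earlier.

```python
import numpy as np

# u ~ Uniform[-0.5, 0.5], g[u] = 1, A = (1 - mu*u^2)^2; midpoint quadrature grid
Nq = 2_000_000
u = (np.arange(Nq) + 0.5) / Nq - 0.5

def moments(mu):
    lnA = np.log((1.0 - mu * u**2) ** 2)
    EA = np.mean((1.0 - mu * u**2) ** 2)
    return lnA.mean(), lnA.var(), np.log(EA)   # v, s^2, ln(EA)

v1, s21, lnEA1 = moments(0.1)   # ~ (-1.679e-2, 2.271e-4, -1.668e-2)
v2, s22, lnEA2 = moments(2.0)   # ~ (-0.4005, 0.152, -0.3331)

def pdf(x, n, v, s2, e0=1.0):
    """Asymptotic log-normal-type density of the sample curve at time n."""
    return np.exp(-(np.log(x / e0) - n * v) ** 2 / (2 * n * s2)) / (x * np.sqrt(2 * np.pi * n * s2))

print(abs(v2 + 0.4005) < 1e-3, abs(lnEA2 + 0.3331) < 1e-3)
```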
3.12

The above discussion can be used to compare the values of $E \tilde e^2(n)$ and $\hat D(n)$ when $n$ is fixed but $L$ is allowed to vary. Indeed, it follows from the expression for $\gamma'(n)$ that the larger the value of $L$, the smaller the value of $\gamma'(n)$ will be. Hence, the more experiments one averages, the closer the value of $\hat D(n)$ will be to $E \tilde e^2(n)$.

Another conclusion that follows from the almost-sure and variance analyses is that an adaptive filter recursion exhibits two different rates of convergence (even for single-tap adaptive filters). At first, for small $n$, a sample curve $\tilde e^2(n)$ is close to $E \tilde e^2(n)$ and therefore converges at a rate that is determined by $\ln E A$. For larger $n$, the sample curve $\tilde e^2(n)$ will converge at a rate that is determined by $E \ln A$.

A final remark: the knowledge that an adaptive filter is almost-surely convergent does not necessarily guarantee satisfactory performance! Assume that a filter is almost-surely stable but mean-square unstable. It follows from the earlier analysis that a sample error curve will tend to diverge in the first iterations (by following the divergent mean-square learning curve), and only after an unknown interval of time will the learning curve start to converge.
3.13
CONCLUDING REMARKS
REFERENCES
1. V. H. Nascimento and A. H. Sayed, "On the learning mechanism of adaptive filters," IEEE Trans. Signal Process., Vol. 48, No. 6, p. 1609, 2000.
2. J. E. Mazo, "On the independence theory of equalizer convergence," The Bell System Technical Journal, Vol. 58, p. 963, 1979.
3. O. Macchi and E. Eweda, "Second-order convergence analysis of stochastic adaptive linear filtering," IEEE Trans. Automatic Control, Vol. 28, No. 1, p. 76, 1983.
4. A. Feuer and E. Weinstein, "Convergence analysis of LMS filters with uncorrelated Gaussian data," IEEE Trans. Acoust. Speech Signal Process., Vol. 33, No. 1, p. 222, 1985.
5. V. Solo and X. Kong, Adaptive Signal Processing Algorithms, Prentice Hall, NJ, 1995.
6. H. J. Kushner and G. G. Yin, Stochastic Approximation Algorithms and Applications, Springer, 1997.
7. H. J. Butterweck, "A wave theory of long adaptive filters," IEEE Trans. Circuits and Systems I, Vol. 48, p. 739, 2001.
8. A. H. Sayed and M. Rupp, "A time-domain feedback analysis of adaptive algorithms via the small gain theorem," Proc. SPIE, Vol. 2563, p. 458, San Diego, CA, 1995.
9. A. H. Sayed and M. Rupp, "Robustness issues in adaptive filtering," in DSP Handbook, Chapter 20, CRC Press, 1998.
10. J. Mai and A. H. Sayed, "A feedback approach to the steady-state performance of fractionally-spaced blind adaptive equalizers," IEEE Trans. Signal Process., Vol. 48, No. 1, p. 80, 2000.
11. N. R. Yousef and A. H. Sayed, "A unified approach to the steady-state and tracking analyses of adaptive filters," IEEE Trans. Signal Process., Vol. 49, No. 2, p. 314, 2001.
12. M. Rupp and A. H. Sayed, "A time-domain feedback analysis of filtered-error adaptive gradient algorithms," IEEE Trans. Signal Process., Vol. 44, No. 6, p. 1428, 1996.
13. A. H. Sayed and T. Y. Al-Naffouri, "Mean-square analysis of normalized leaky adaptive filters," Proc. ICASSP, Vol. 6, p. 3873, Salt Lake City, Utah, 2001.
14. T. Y. Al-Naffouri and A. H. Sayed, "Transient analysis of data-normalized adaptive filters," IEEE Trans. Signal Process., Vol. 51, No. 3, pp. 639-652, March 2003.
15. R. R. Bitmead and B. D. O. Anderson, "Adaptive frequency sampling filters," IEEE Trans. Circuits and Systems, Vol. 28, No. 6, p. 524, 1981.
16. R. R. Bitmead, B. D. O. Anderson, and T. S. Ng, "Convergence rate determination for gradient-based adaptive estimators," Automatica, Vol. 22, p. 185, 1986.
17. H. J. Kushner and F. J. Vazquez-Abad, "Stochastic approximation methods for systems over an infinite horizon," SIAM Journal on Control and Optimization, Vol. 34, No. 2, p. 712, 1996.
18. R. Durrett, Probability: Theory and Examples, 2nd edition, Duxbury Press, 1996.
19. D. Williams, Probability with Martingales, Cambridge University Press, 2000.
20. A. H. Sayed, Fundamentals of Adaptive Filtering, Wiley, New York, 2003.
ON THE ROBUSTNESS OF
LMS FILTERS
BABAK HASSIBI
California Institute of Technology
4.1
INTRODUCTION
$$ y(n) \approx h^T(n)\, w, \qquad (4.1) $$
for some fixed weight vector $w$ (Fig. 4.1). Despite its apparent simplicity, the linear model has broad implications and applies to many different problems and applications. Most often, the crucial step in writing the model (4.1) is to determine the input-output pairs $(h(n), y(n))$ and the nature of the approximation $\approx$. For example, if we are presented with scalar input-output sequences, or time series, $u(n)$ and $y(n)$, then a possible model could be
$$ y(n) \approx w_1 u(n) + w_2 u(n-1) + \cdots + w_m u(n-m+1). \qquad (4.2) $$

In this chapter we will assume, for simplicity, that the output is a scalar. The more general problem of a vector output can be handled without much further difficulty.
Figure 4.1

filters; however, we shall not go into details here. We only mention in passing that the model (4.1) is also of central importance because it can be regarded as the linearization of more general nonlinear models (such as neural networks) around a suitable operating point.

In any event, once the model (4.1) has been constructed, the problem is to determine the best weight vector that describes the relationship between the inputs $\{h(n)\}$ and the outputs $\{y(n)\}$. The question, of course, is, in what sense do we mean best? A reasonable choice is to have $h^T(n)w$ match $y(n)$ in a least-mean-squares sense, that is, to choose $w$ according to the criterion
$$ \min_w\ E\big(y(n) - h^T(n)w\big)^2, \qquad (4.3) $$
where, assuming stationarity, we define the second-order statistics
$$ E\,h(n)h^T(n) \triangleq R, \qquad E\,h(n)y(n) \triangleq p, \qquad E\,y^2(n) \triangleq r_y. \qquad (4.4) $$

Figure 4.2

Setting the gradient of the cost in (4.3) to zero yields the celebrated Wiener solution
$$ w^o = R^{-1} p. \qquad (4.5) $$
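To make the connection between the statistics (4.4) and the Wiener solution (4.5) concrete, the sketch below (illustrative code, not from the chapter; the data model, dimensions, and noise level are assumptions) estimates $R$ and $p$ from simulated stationary data and verifies that $\hat R^{-1}\hat p$ recovers the underlying weight vector.

```python
import numpy as np

rng = np.random.default_rng(2)
m, N = 4, 50_000
wo = rng.standard_normal(m)                  # underlying weight vector (assumption)

H = rng.standard_normal((N, m))              # regressors h(n); iid Gaussian => R = I
y = H @ wo + 0.1 * rng.standard_normal(N)    # outputs with a small disturbance

R_hat = H.T @ H / N                          # sample estimate of R = E h h^T
p_hat = H.T @ y / N                          # sample estimate of p = E h y
w_hat = np.linalg.solve(R_hat, p_hat)        # empirical Wiener solution R^{-1} p

print(np.linalg.norm(w_hat - wo) < 0.05)
```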
4.1.1

Therefore the pioneering work of Widrow and Hoff came as a breakthrough, since it provided a recursive way of approximately solving (4.3) without knowledge of the statistics of the signals involved [3, 1, 2]. Since the statistics of the signals are not known, the expectation in (4.3) cannot be explicitly evaluated, nor can the gradient of the cost function be computed. However, the key observation of Widrow and Hoff was that, by using the instantaneous value of the squared error, $(y(n) - h^T(n)w)^2$, rather than its unknown mean, one can come up with an estimate of the gradient via differentiation with respect to $w$. This so-called instantaneous gradient is given by $-2h(n)(y(n) - h^T(n)w)$, and so the algorithm updates the estimate of the weight vector along the negative of the instantaneous gradient (absorbing the factor of 2 into the step-size $\mu$):
$$ \hat w(n+1) = \hat w(n) + \mu\, h(n)\big(y(n) - h^T(n)\hat w(n)\big). \qquad (4.6) $$
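A minimal implementation of the update (4.6) can be sketched as follows (illustrative code; the simulated data model, step-size, and dimensions are arbitrary choices, not from the chapter). With a small step-size the estimate settles into a neighborhood of the true weight vector, hovering there because of gradient noise.

```python
import numpy as np

rng = np.random.default_rng(3)
m, N, mu = 4, 20_000, 0.01
wo = rng.standard_normal(m)                      # unknown weight vector (assumption)

w = np.zeros(m)                                  # LMS estimate of w
for n in range(N):
    h = rng.standard_normal(m)                   # input vector h(n)
    y = h @ wo + 0.05 * rng.standard_normal()    # y(n) = h^T(n) w + v(n)
    w += mu * h * (y - h @ w)                    # update (4.6)

print(np.linalg.norm(w - wo) < 0.1)   # close to wo, but never exactly convergent
```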
4.1.2

When the statistics of the signals are not known, a natural alternative is to replace $R$ and $p$ in the Wiener solution by their data-based sample averages, which leads to the estimate
$$ \hat w = \left(\sum_{n=1}^{N} h(n)h^T(n)\right)^{-1} \sum_{n=1}^{N} h(n)y(n). \qquad (4.7) $$
Alternatively, one may replace the mean-squared error in (4.3) by its data-based average and consider the deterministic criterion
$$ \min_w\ \frac{1}{N}\sum_{n=1}^{N} \big(y(n) - h^T(n)w\big)^2. \qquad (4.8) $$
Note that, compared to (4.3), which was a stochastic least-squares problem, (4.8) is a
deterministic least-squares problem. Thus, there is no need to assume random
processes or to take expectations. Problem (4.8) can be readily solved via a
straightforward differentiation, or completion of squares. The reassuring, and not
altogether unexpected, result is that the solution to (4.8) is also given by (4.7). Thus,
replacing the unknown statistics R and p by their data-based averages and replacing
the mean-squared error by its data-based average are essentially the same thing and
lead to the same solution (4.7).
By all accounts, the solution (4.7) appears to be a better approximation to the solution of (4.3) than that provided by the LMS filter. However, compared to the LMS algorithm, which yields a recursive solution, it has the drawback that one needs access to the entire data set in order to compute the solution. Although this is not an issue in some applications (such as system identification), it is crucial in many others, such as control and communications, where certain decisions must be made in real time and so depend on the current estimate of the weight vector. In such applications a recursive solution is a must.
Fortunately, the situation is easily remedied. Note that the solution to (4.8) at time $m \leq i \leq N$ (obtained by setting the upper limits in the sums to $i$) is given by
$$ \hat w(i) = \left(\sum_{n=1}^{i} h(n)h^T(n)\right)^{-1} \sum_{n=1}^{i} h(n)y(n), \qquad (4.9) $$
where we have assumed that the matrix appearing in parentheses is invertible (which, incidentally, is also why we need $i \geq m$). It is convenient to define the matrix $P(i) \triangleq \big(\sum_{n=1}^{i} h(n)h^T(n)\big)^{-1}$, so that
$$ \hat w(i) = P(i) \sum_{n=1}^{i} h(n)y(n) $$
$$ = \Big[P^{-1}(i-1) + h(i)h^T(i)\Big]^{-1} \left(\sum_{n=1}^{i-1} h(n)y(n) + h(i)y(i)\right) $$
$$ = \left(P(i-1) - \frac{P(i-1)h(i)h^T(i)P(i-1)}{1 + h^T(i)P(i-1)h(i)}\right) \left(\sum_{n=1}^{i-1} h(n)y(n) + h(i)y(i)\right) $$
$$ = \hat w(i-1) - \frac{P(i-1)h(i)h^T(i)}{1 + h^T(i)P(i-1)h(i)}\,\hat w(i-1) + P(i-1)h(i)y(i) - \frac{P(i-1)h(i)\,h^T(i)P(i-1)h(i)}{1 + h^T(i)P(i-1)h(i)}\,y(i), $$
where in the third step we used the matrix inversion lemma
$$ (A + BCD)^{-1} = A^{-1} - A^{-1}B\big(C^{-1} + DA^{-1}B\big)^{-1}DA^{-1}. $$
Now the last expression readily yields
$$ \hat w(i) = \hat w(i-1) + \frac{P(i-1)h(i)}{1 + h^T(i)P(i-1)h(i)}\,\big(y(i) - h^T(i)\hat w(i-1)\big), \qquad (4.10) $$
which is the recursion we were pursuing. All that is needed is a recursion for $P(i)$. But the matrix inversion lemma can again be used to obtain
$$ P(i) = P(i-1) - \frac{P(i-1)h(i)h^T(i)P(i-1)}{1 + h^T(i)P(i-1)h(i)}. \qquad (4.11) $$
Together, the recursions (4.10)-(4.11) constitute what is known as the recursive least-squares (RLS) algorithm. It has a long history, and its inception goes back to Gauss and Legendre. Due to its similarity to a differential equation first studied by Riccati, the recursion (4.11) is referred to as a Riccati recursion. (For an interesting review of these, see [7].) Like all recursions, (4.10)-(4.11) must be initialized.
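The recursions (4.10)-(4.11) can be checked directly against the batch solution (4.9). The sketch below (illustrative code; the data and dimensions are assumptions) initializes at time $i = m$ from the first $m$ samples and verifies that the RLS iterates reproduce the batch least-squares estimate.

```python
import numpy as np

rng = np.random.default_rng(4)
m, N = 3, 60
H = rng.standard_normal((N, m))      # rows are h^T(n); 0-based indexing here
y = H @ rng.standard_normal(m) + 0.1 * rng.standard_normal(N)

# Initialize at time i = m with the batch quantities w^(m), P(m)
P = np.linalg.inv(H[:m].T @ H[:m])
w = P @ (H[:m].T @ y[:m])

for i in range(m, N):                # process the remaining samples recursively
    h, yi = H[i], y[i]
    g = P @ h / (1 + h @ P @ h)      # gain vector appearing in (4.10)
    w = w + g * (yi - h @ w)         # weight update (4.10)
    P = P - np.outer(g, h @ P)       # Riccati recursion (4.11)

# The final estimate must equal the batch solution (4.9) at i = N
w_batch = np.linalg.solve(H.T @ H, H.T @ y)
print(np.allclose(w, w_batch, atol=1e-6))
```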
The RLS algorithm just described gives, at each time instant $i$, the exact solution to the deterministic least-squares problem $\min_w \sum_{n=1}^{i} (y(n) - h^T(n)w)^2$. We also saw that, under some mild mixing conditions, it converges to the Wiener solution, which is the optimal solution to the stochastic least-squares problem (4.3). However, one may wonder whether the estimate provided by the RLS algorithm at any time instant $i$, and not just its limiting value, has a stochastic interpretation in its own right. It turns out that this is indeed the case.

To this end, recall from (4.1) that the linear model we have assumed is approximate. However, we can always make it exact by adding an appropriate disturbance signal $v(n)$, so that
$$ y(n) = h^T(n)w + v(n). \qquad (4.12) $$
Suppose now that the disturbance sequence $v(n)$ in (4.12) is iid Gaussian with zero mean and variance $\sigma^2$. Then the conditional density of the observations is
$$ p\big(y(1), \ldots, y(N) \mid w, h(1), \ldots, h(N)\big) = \frac{1}{\sqrt{(2\pi\sigma^2)^N}}\, \exp\left(-\frac{1}{2\sigma^2} \sum_{n=1}^{N} \big(y(n) - h^T(n)w\big)^2\right), \qquad (4.13) $$
since, conditioned on $w$ and the $h(n)$, the $y(n)$ are independent Gaussian with mean $h^T(n)w$ and variance $\sigma^2$. The above conditional density is often referred to as the likelihood function. Any estimator that maximizes it according to the criterion
$$ \max_w\ p\big(y(1), \ldots, y(N) \mid w, h(1), \ldots, h(N)\big) \qquad (4.14) $$
is referred to as a maximum-likelihood (ML) estimator. Since maximizing (4.13) over $w$ amounts to minimizing $\sum_{n=1}^{N} (y(n) - h^T(n)w)^2$, the ML estimate coincides with the least-squares solution (4.7).
Some remarks on the stochastic assumptions that lead to the above observation are in order. Most importantly, they differ from the stochastic assumptions we made when obtaining the Wiener solution in two ways. First, we require the disturbance sequence to be iid Gaussian (which we did not need for the Wiener solution). And second, we do not need any stochastic assumption on the input vectors $h(n)$ (as we did in the Wiener case), since we treat them as known and condition on their values.
If the weight vector $w$ is itself modeled as random, one may instead consider the a posteriori density
$$ p\big(w \mid y(1), \ldots, y(N), h(1), \ldots, h(N)\big) = \frac{p\big(y(1), \ldots, y(N) \mid w, h(1), \ldots, h(N)\big)\, p\big(w \mid h(1), \ldots, h(N)\big)}{p\big(y(1), \ldots, y(N) \mid h(1), \ldots, h(N)\big)}, \qquad (4.15) $$
where we have used Bayes' rule. Any estimator that maximizes the above a posteriori probability is referred to as a maximum a posteriori (MAP) estimator. Since the denominator is independent of $w$, and since we assume that $w$ is independent of the regressor vectors $h(n)$ (so that $p(w \mid h(1), \ldots, h(N)) = p(w)$), it follows that MAP estimators satisfy the criterion
$$ \max_w\ p\big(y(1), \ldots, y(N) \mid w, h(1), \ldots, h(N)\big)\, p(w). \qquad (4.16) $$
To obtain an explicit solution, we need to assume a certain model for $w$ and, again, the standard one is that it is a zero-mean Gaussian random vector with covariance matrix $E\,ww^T = \Pi_0$, independent of all the other signals involved. In this case, we have
$$ p\big(y(1), \ldots, y(N) \mid w, h(1), \ldots, h(N)\big)\, p(w) = \frac{1}{K} \exp\left(-\frac{1}{2\sigma^2}\left[\sum_{n=1}^{N}\big(y(n) - h^T(n)w\big)^2 + w^T \sigma^2 \Pi_0^{-1} w\right]\right), \qquad (4.17) $$
where $K = \sqrt{(2\pi)^{N+m}\, \sigma^{2N}\, \det \Pi_0}$. Therefore the MAP estimator is one that solves the following regularized least-squares problem:
$$ \min_w\ \left[w^T \sigma^2 \Pi_0^{-1} w + \sum_{n=1}^{N} \big(y(n) - h^T(n)w\big)^2\right]. \qquad (4.18) $$
This is identical to the least-squares problem (4.8), except for the regularization term (often called the prior) $w^T \sigma^2 \Pi_0^{-1} w$. The solution is identical to that of RLS (4.10)-(4.11), except that now we should initialize the recursions with
$$ \hat w(0) = 0 \qquad \text{and} \qquad P(0) = \sigma^{-2}\Pi_0. $$
In fact, this makes the regularized solution more convenient to use than the nonregularized one, for which we could only start the recursions from time $m$ and needed to explicitly compute $\hat w(m)$ and $P(m)$.
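The MAP/regularized variant can be verified in the same way as before. The sketch below (illustrative code; the choice $\Pi_0 = \pi_0 I$ and the noise level are assumptions) runs (4.10)-(4.11) from $\hat w(0) = 0$, $P(0) = \sigma^{-2}\Pi_0$ and compares the result with the closed-form minimizer of (4.18).

```python
import numpy as np

rng = np.random.default_rng(5)
m, N, sigma2, pi0 = 3, 40, 0.01, 2.0
H = rng.standard_normal((N, m))
y = H @ rng.standard_normal(m) + np.sqrt(sigma2) * rng.standard_normal(N)

# Regularized RLS: w^(0) = 0, P(0) = sigma^{-2} Pi_0, with Pi_0 = pi0 * I
w = np.zeros(m)
P = (pi0 / sigma2) * np.eye(m)
for i in range(N):
    h, yi = H[i], y[i]
    g = P @ h / (1 + h @ P @ h)      # gain vector (4.10)
    w = w + g * (yi - h @ w)         # weight update (4.10)
    P = P - np.outer(g, h @ P)       # Riccati recursion (4.11)

# Closed-form minimizer of (4.18): (sigma^2 Pi_0^{-1} + H^T H)^{-1} H^T y
w_map = np.linalg.solve((sigma2 / pi0) * np.eye(m) + H.T @ H, H.T @ y)
print(np.allclose(w, w_map))
```

Because the regularization keeps $P^{-1}(0)$ nonzero, no matrix inversion of data is needed at start-up, which is precisely the convenience noted above.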
4.1.3.3 Least-Mean-Squares Estimation. Let us return to the original least-mean-squares criterion (4.3), but now apply the above stochastic assumptions. Thus, using $\hat w$ to denote our estimate of the weight vector and $w$ to denote its true unknown value, the mean-square error (4.3) becomes
$$ E\big(y(n) - h^T(n)\hat w\big)^2 = E\big(v(n) + h^T(n)(w - \hat w)\big)^2 = \sigma^2 + h^T(n)\, E(w - \hat w)(w - \hat w)^T\, h(n), $$
where in the second step we used the fact that $v(n)$ is independent of $w$, and where we used our assumption that $h(n)$ is deterministic to pull it outside the expectation. This implies that the criterion becomes that of minimizing $E\big|h^T(n)w - h^T(n)\hat w\big|^2$. It is often customary to define the uncorrupted output $h^T(n)w$ as the desired signal, $d(n) \triangleq h^T(n)w$. In this case, the criterion is to choose $\hat w$ so as to best match the desired signal $d(n)$ in a least-mean-squares sense.

In any event, at time $n$, the optimal estimate of the desired signal is given by the conditional mean [8]
$$ \hat d(n) = E\big[h^T(n)w \mid y(1), \ldots, y(n)\big] = h^T(n)\, E\big[w \mid y(1), \ldots, y(n)\big]. $$
This implies that, irrespective of the input vector $h(n)$, we may define the optimal estimate of the weight vector at time $n$ as
$$ \hat w(n) = E\big[w \mid y(1), \ldots, y(n)\big]. \qquad (4.19) $$
The fact that the optimal estimate of $w$ here does not depend on $h(n)$ is significant and follows from the linearity of the conditional mean. We remark that this is not necessarily true of other estimation criteria; more on this later.

When $v(n)$ is a stationary white Gaussian process, it is well known that the conditional mean is given by the MAP estimator [9]. Thus the solution to the regularized RLS problem (4.18) yields the least-mean-squares solution. When $v(n)$ is not Gaussian, however, the conditional mean does not coincide with the MAP estimator, and the solution is not given by (4.18).

If we insist that our estimator be linear in the observations, that is, that $\hat d(n)$ be a linear function of $\{y(1), \ldots, y(n)\}$, then it turns out that the optimal estimator depends only on the first- and second-order statistics (the mean and covariance functions) of the signals involved. In this case the optimal estimator is known as the linear least-mean-squares estimator.
Some Questions

In the past few sections we have provided the RLS algorithm with a plethora of properties and optimality criteria. However, at this stage, all we have provided for the LMS algorithm is a heuristic argument for its being an approximation to the Wiener solution.

We have argued that, under some mild mixing conditions, the RLS solution converges to the optimal Wiener solution. The convergence of the RLS algorithm can be seen from the fact that, as $n \to \infty$, the matrix $P(i) = \big(\sum_{n=1}^{i} h(n)h^T(n)\big)^{-1}$ approaches zero² and therefore so does the gain vector $P(i-1)h(i)/\big(1 + h^T(i)P(i-1)h(i)\big)$ in (4.10), implying that $\hat w(i) - \hat w(i-1) \to 0$ and $\hat w(i) \to w^o$. The LMS algorithm (4.6), on the other hand, has no chance of converging, since its gain vector $\mu h(n)$ is nonzero for all time, meaning that, as long as there is a nonzero error signal $y(n) - h^T(n)\hat w(n)$ (which is always the case when we have a disturbance signal $v(n)$), the value of $\hat w(i+1)$ can never approach $\hat w(i)$.³

We have also shown that, under a Gaussian disturbance model and assuming that the input vectors are deterministic, depending on whether or not we include a regularizing term, RLS recursively yields the MAP and ML estimates of the weight vector. It also yields the least-mean-squares solution under the Gaussian assumption, and the linear least-mean-squares solution under the assumption of a zero-mean white disturbance signal. For LMS, on the other hand, we have no such stochastic interpretations.
With all that has been said, it appears that RLS should be the method of choice for adaptive filtering. Nonetheless, a survey of the applications of adaptive filtering over the past few decades reveals that the LMS algorithm and its variants are more widely used than RLS and its variants. It is therefore natural to ask why this has been the case. Apart from the performance and optimality issues just discussed, there are other criteria that determine the applicability of a certain algorithm or methodology to different practical problems. These may be listed as follows.

²This requires that for any scalar $\Delta > 0$ there exist a time instant $i$ such that $\sum_{n=1}^{i} h(n)h^T(n) > \Delta I_m$, where the latter inequality is in the sense of positive definite matrices. This condition is referred to as persistence of excitation and is a very reasonable assumption.

³We should mention that the above argument holds when the LMS algorithm has a constant step-size $\mu > 0$. There also exist variants of LMS with a time-varying step-size $\mu(i) > 0$. If $\mu(i) \to 0$, then the LMS algorithm will converge. However, we will not be considering vanishing step-sizes in this chapter.
1. Simplicity. Simpler solutions are often preferred in practice, and the LMS
algorithm is certainly simple. However, the RLS algorithm is not really so
complex. It has a structure quite similar to that of LMS; the weight vector is
updated according to the error signal along the direction of a certain gain
vector. The only difference is that computing the gain vector in RLS requires
the propagation of a Riccati recursion.
2. Computational complexity. Algorithms that require fewer computations are
preferred in practice. The LMS algorithm clearly requires Om computations
per iteration. The RLS algorithm, as depicted in (4.10 4.11), requires Om2
computations per iteration, essentially because the Riccati recursion (4.11)
requires a matrix-vector product as well as a vector-vector outerproduct.
However, in many applications the input vectors possess certain structure. The
most common of these is the time series structure hnT un un 1
un m 1 of (4.2), where there is a great deal of redundancy between
hn and hn 1. There exist several clever techniques to exploit this
redundancy and thereby reduce the computations to Om per iteration [10].
These are generally referred to as fast RLS algorithms.
3. Numerical stability. Imprecision and round-off errors are unavoidable in the numerical implementation of any algorithm. Algorithms that suffer from numerical instability in the face of such errors are not suitable in practice. If the learning rate is not too large, so that the weight vector estimates do not diverge, then it can be shown that the LMS algorithm is numerically stable. In other words, round-off errors cannot lead to divergence and other problems. With RLS, if implemented according to (4.10)–(4.11), numerical instability can be an issue, since round-off errors and finite precision can lead to the matrix P(i) of the Riccati recursion (4.11) losing its positive definiteness. (When this happens, RLS may not yield meaningful estimates.) The problem of losing positive definiteness of the Riccati variable has been known for a long time, especially in the context of Kalman filtering, and has been resolved by employing certain square-root algorithms [11, 8]. Rather than propagate the Riccati variable P(i), these algorithms propagate its square-root factor, that is, an m × m matrix P(i)^{1/2}, such that P(i)^{1/2} P(i)^{T/2} = P(i). With this trick, loss of positivity is no longer an issue. Moreover, square-root algorithms make extensive use of unitary transformations that, since they do not change the norm of vectors, are the most numerically stable matrix operations that can be performed.
In their most general form, square-root algorithms for RLS require O(m²) operations. Under the time-series structure h(n)^T = [u(n) u(n−1) ... u(n−m+1)], there exist fast square-root algorithms that require O(m) operations per iteration.
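To make the square-root idea concrete, here is a minimal sketch (not the chapter's algorithm; the dimensions and signals are arbitrary) that propagates a triangular factor of P through the growing-window RLS measurement update by QR-triangularizing a pre-array. The product of the factor with its own transpose then tracks the conventional Riccati recursion while staying positive semidefinite by construction.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 4
L = np.eye(m)                       # square-root factor of P(0) = I

def sqrt_update(L, h):
    # Pre-array for the RLS/Kalman measurement update; a unitary rotation
    # that lower-triangularizes it yields the updated factor of P.
    pre = np.block([[np.ones((1, 1)), (h @ L)[None, :]],
                    [np.zeros((m, 1)), L]])
    # QR of the transpose realizes the required unitary transformation.
    q, r = np.linalg.qr(pre.T)
    post = r.T                      # lower-triangular post-array
    return post[1:, 1:]             # updated square-root factor of P

P = np.eye(m)
for _ in range(50):
    h = rng.standard_normal(m)
    L = sqrt_update(L, h)
    P = P - np.outer(P @ h, P @ h) / (1.0 + h @ P @ h)  # conventional Riccati

# The factored form reproduces P, and L @ L.T is PSD by construction.
print(np.max(np.abs(L @ L.T - P)))
```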
4.2
The answer to the question of why LMS is more often used in adaptive filtering can be found in the assumptions required for the various optimality properties of RLS. When the inputs and outputs are stationary random processes, the Wiener solution (4.5) is meaningful and the RLS algorithm converges to it. However, what if the signals involved are not stationary? Although the Wiener solution is no longer meaningful, under a persistence-of-excitation assumption the RLS algorithm still converges, but what does it converge to? Moreover, what if the unknown weight vector w, which we have so far assumed to be constant, itself varies with time? The RLS algorithm (4.10)–(4.11) has a vanishing gain vector and so cannot track a time-varying w.⁴ The LMS algorithm, with its nonvanishing gain vector, on the other hand, may be able to perform such tracking.
The stochastic optimality properties of being the ML and/or MAP estimator require that the additive noise term v(n) be zero-mean, white, and Gaussian. The optimality property of being the linear least-mean-squares estimator requires that v(n) be zero-mean and white. But what if the v(n) are not Gaussian? What if they are not white? In many adaptive filtering applications the additive noise term includes modeling errors: The true model may be an IIR filter, and so by assuming the FIR model (4.2), we are neglecting the tail of the filter and relegating it to v(n). Or the model may have some nonlinearities that we have ignored and included in v(n). In any event, with the inclusion of modeling errors, the v(n) are no longer Gaussian, or white for that matter, and the stochastic optimality properties do not hold. So what happens to these algorithms and what happens to their performance?
⁴ There exist certain variations of the RLS algorithm, such as the exponentially windowed RLS algorithm, that have a nonvanishing gain vector and so may be suitable for tracking. More on these later.
Of course, no matter what the statistics and distributions of the noise term may be, the RLS algorithm is always optimal in the sense of minimizing the least-squares cost (4.8), since this is a deterministic cost. Now this may very well be a reasonable optimality property to have. However, insofar as deterministic least-squares costs go, it may also be reasonable to consider the weighted least-squares criterion

    min_w (1/N) Σ_{n=1}^N q(n) (y(n) − h(n)^T w)²,    (4.20)

where {q(n) > 0} is a set of weights. More generally, we can consider the weighted least-squares problem

    min_w (1/N) Σ_{n=1}^N Σ_{k=1}^N (y(n) − h(n)^T w) q(n,k) (y(k) − h(k)^T w),    (4.21)

where Q = [q(n,k)] is an N × N positive definite matrix. This latter cost, especially, can be obtained by first applying the error sequence {y(n) − h(n)^T w} to a linear filter and then applying the output of the filter to a standard least-squares problem of the form (4.8).
We thus conclude that there is nothing special about the least-squares criterion (4.8). Which of the three alternatives (4.8), (4.20), and (4.21) one should use depends on the application at hand.
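Both weighted criteria reduce to linear normal equations. A small NumPy sketch (the sizes, weights, and noise level are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(2)
N, m = 100, 3
H = rng.standard_normal((N, m))          # row n is h(n)^T
w_true = rng.standard_normal(m)
y = H @ w_true + 0.1 * rng.standard_normal(N)

# (4.20): diagonal weighting q(n) > 0
q = rng.uniform(0.5, 2.0, size=N)
w_diag = np.linalg.solve(H.T @ (q[:, None] * H), H.T @ (q * y))

# (4.21): full positive definite weighting matrix Q = [q(n,k)]
A = rng.standard_normal((N, N))
Q = A @ A.T + N * np.eye(N)              # symmetric positive definite
w_full = np.linalg.solve(H.T @ Q @ H, H.T @ Q @ y)

print(w_diag, w_full)   # both are consistent estimates of w_true
```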
4.2.1
The above discussion should bother us on two counts. First, we are left with the question, what will the performance of RLS be when the stochastic assumptions are not met? In other words, how robust is RLS with respect to modeling errors and lack of statistical knowledge of the exogenous signals? Second, if we indeed make certain stochastic assumptions (such as Gaussianity, whiteness, etc.), then the problem of adaptive filtering reduces to a statistical estimation problem. This really takes the wind out of the sails of adaptive filtering. Instead of being an independent field in itself, it is relegated to being a subset of statistical estimation theory. More importantly, the word adaptive becomes vacuous. What are we adapting to? Nothing, really, since we have assumed perfect stochastic models for all the signals involved.
In reality, adaptive algorithms should be able to operate under different stochastic assumptions and tolerate different types of modeling errors. In other words, they should be able to adapt to the stochastic (or otherwise) environment that they are in. This clearly implies that adaptive algorithms must be robust to variations of the system model and underlying statistical assumptions.
Therefore adaptation is much more related to robustness with respect to statistical variation than it is to optimality with respect to a specific statistical model. This is a very important point that is not nearly as well recognized as it should be. Of course, due to their robustness properties, adaptive algorithms will be conservative and will not perform as well as the optimal algorithm for any particular statistical model. However, we do expect them to perform reasonably well over a wide range of statistical models and environments; indeed, very much like the way the LMS algorithm performs in different environments.
In order to begin to quantify what we mean by robustness, it is helpful to pose the basic robustness question for any estimation algorithm:
Is it possible that small disturbances and modeling errors may lead to large estimation errors?
Note that in the above question we have made no reference to the statistics of the disturbances. All we are asking is that, if the disturbances and modeling errors are small, then the estimation errors be small, no matter what the statistics of the disturbances may be. In other words, as long as we set up a model that reasonably describes our data, that is, one in which the disturbances and modeling errors are small, then a robust estimator will guarantee that the estimation errors will be small.
The above comments imply that any approach to robust estimation requires a notion of largeness and smallness for the signals involved. For this there exist many possibilities. For example, one can consider the peak of the absolute value of the signals as one such measure. In control-theoretic jargon this is referred to as l₁ theory [13]. A perhaps more physical measure, widely used in practice and one that allows for more analytic tractability, is the energy of the signal. This is what leads to H∞ theory.
4.2.2 The H∞ Approach
The first systematic study of robustness, within the framework described above, was done in the context of control theory and was introduced by Zames in 1981 [14]. Zames' H∞ theory was concerned not with estimation problems but with the design of controllers that were robust with respect to model uncertainty and lack of statistical knowledge of the exogenous signals. H∞ control theory can be regarded as the outgrowth and extension of classical linear-quadratic-Gaussian (LQG) control theory, developed in the 1950s and 1960s, which assumed perfect models and complete statistical knowledge [15]. Indeed, the development of H∞ theory in the 1980s and 1990s is considered one of the significant achievements in control theory.
Now in the H∞ context, a robust estimator is one for which disturbances with small energy lead to estimation errors with small energy. Therefore the natural object to study is the energy gain from the disturbances to the estimation errors. In particular, since we are interested in having small estimation error energies for all small disturbance energies, we need to focus on the worst-case energy gain. This is what is referred to as the H∞ norm.⁵
⁵ The norm defined here is really what is known as a 2-induced norm (since it is the maximum of the ratio of energies, or 2-norms). For historical reasons, in the control theory literature the 2-induced norm is referred to as the H∞ norm. Here H stands for Hardy space, the space of all causal and stable functions of a complex variable, that is, the space of all functions analytic outside the unit circle. The superscript ∞ refers to the fact that H∞ is the space of functions analytic outside the unit circle and with finite magnitude on the unit circle [16]. In any event, the term H∞ norm is a misnomer since, strictly speaking, the 2-induced norm becomes an ∞-norm (in the frequency domain) only when we consider infinite-horizon linear time-invariant (LTI) systems, which is the context in which H∞ control was originally introduced. (As is well known, the maximum energy gain of any stable LTI system with transfer function K(z) is given by max_ω |K(e^{jω})|².) Nonetheless, we too shall be guilty of using this loose terminology.
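For instance, the maximum energy gain of an FIR filter can be approximated by evaluating |K(e^{jω})|² on a dense frequency grid. A small sketch with arbitrary illustrative coefficients:

```python
import numpy as np

# FIR transfer function K(z) = 0.5 + 0.3 z^{-1} - 0.2 z^{-2} (illustrative)
k = np.array([0.5, 0.3, -0.2])

# Maximum energy gain of a stable LTI system = sup over frequency of
# |K(e^{jw})|^2, approximated on a dense grid via the zero-padded FFT.
K = np.fft.fft(k, 4096)
gain = np.max(np.abs(K)) ** 2
print(gain)
```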
Figure 4.3 The H∞ norm is the maximum energy gain from the disturbances to the estimation errors.

    γ² = sup_{v ∈ l₂, v ≠ 0} ‖e‖² / ‖v‖²,    (4.22)

where ‖v‖² = Σ_n v(n)², ‖e‖² = Σ_n e(n)², l₂ denotes the space of square-summable sequences, and sup refers to supremum.
In H∞ estimation we seek the estimator that minimizes the H∞ norm (see Fig. 4.3).

4.2.2.2 Problem 1 (Optimal H∞ Estimation Problem) Find an estimator that minimizes the H∞ (or 2-induced) norm from the disturbances {v(n)} to the estimation errors, that is,

    inf_{all estimators} sup_{{v(n)} ∈ l₂} ‖e‖² / ‖v‖²,    (4.23)

where inf refers to infimum. Moreover, find the resulting optimal value γ_opt = inf γ.
The minimax nature of H∞-optimal estimation is evident from the above problem formulation. H∞ estimation is essentially a game problem: Nature (the opponent) has access to the unknown disturbance {v(n)} and chooses it to maximize the energy gain in (4.23), whereas we choose the estimator to minimize it.
H∞-optimal estimators safeguard against the worst-case disturbance that maximizes the energy gain to the estimation errors. Since this worst-case disturbance is a deterministic sequence, such estimators do not require any statistical assumptions. Moreover, since the infimization in (4.23) is taken over all possible disturbances, the resulting estimator will be robust with respect to disturbance variation. It can, on the other hand, be quite conservative.
We should mention that H∞ theory is a very rich area and that there exist many different approaches to solving it. The original methods were operator- and function-theoretic and made use of interpolation theory [17, 18], but there also exist state-space [19], circuit-theoretic [20], and game-theoretic [21] approaches. We shall not go into any of these here, though the interested reader may consult any of the aforementioned references.

4.2.3 A First Attack
Let us begin to consider how we can apply the H∞ approach to the adaptive filtering problem at hand. Suppose we are given the input-output pairs {h(n), y(n)}_{n=1}^N and we want to estimate the unknown weight vector w. How should we set up the problem?
First, let us look at the disturbances. Clearly, there are two unknowns: the weight vector itself and the unknown additive noise

    v(n) = y(n) − h(n)^T w.

We can therefore define the energy of the disturbances as⁶

    μ⁻¹ w^T w + Σ_{n=1}^N v(n)² = μ⁻¹ w^T w + Σ_{n=1}^N (y(n) − h(n)^T w)².    (4.24)

The energy gain from the disturbances to the prediction error d̂(N+1) − h(N+1)^T w is then⁷

    (d̂(N+1) − h(N+1)^T w)² / ( μ⁻¹ w^T w + Σ_{n=1}^N (y(n) − h(n)^T w)² ).    (4.25)
⁶ If we have an initial estimate of the weight vector, ŵ(0) say, then it is more appropriate to define the disturbance energy as

    μ⁻¹ (w − ŵ(0))^T (w − ŵ(0)) + Σ_{n=1}^N v(n)².

However, without loss of generality we shall assume that ŵ(0) = 0.
⁷ We remark that in the prediction problem under consideration d̂(N+1) is allowed to be a function of h(N+1) but not of y(N+1).
To facilitate solving this problem, let us first look at the suboptimal problem of guaranteeing that the maximum energy gain is bounded by the value γ². In other words, for all possible w, we would like to have

    (d̂(N+1) − h(N+1)^T w)² / ( μ⁻¹ w^T w + Σ_{n=1}^N (y(n) − h(n)^T w)² ) < γ²,

or, equivalently,

    μ⁻¹ w^T w + Σ_{n=1}^N (y(n) − h(n)^T w)² − γ⁻² (d̂(N+1) − h(N+1)^T w)² > 0.    (4.26)

Note that due to the minus sign on the last term, this is an indefinite quadratic form in the unknown w. Defining

    y = [y(1), ..., y(N)]^T,   H = [h(1) ... h(N)],   h = h(N+1),   d̂ = d̂(N+1)

allows us to rewrite (4.26) as

    [w^T  y^T  d̂] M [w ; y ; d̂] > 0,    (4.27)

where

    M = [ μ⁻¹I + HH^T − γ⁻²hh^T    −H      γ⁻²h
          −H^T                      I       0
          γ⁻²h^T                    0      −γ⁻²  ].

Now the above indefinite quadratic form is positive for all w if, and only if:
1. It has a minimum in the unknown variable w (otherwise its value could approach −∞).
2. d̂ can be chosen such that the value at the minimum is positive.
The condition for having a minimum can be readily found by computing the Jacobian (or second derivative) with respect to w and insisting that it be positive definite, that is,

    μ⁻¹I + HH^T − γ⁻² hh^T > 0.

Some simple algebra shows that this is equivalent to

    γ² > h^T (μ⁻¹I + HH^T)⁻¹ h.    (4.28)
The smallest achievable value of γ² is therefore

    γ²_opt = h(N+1)^T [ μ⁻¹I + Σ_{n=1}^N h(n)h(n)^T ]⁻¹ h(N+1).    (4.29)

Some further algebra shows that once the minimization over w has been done, the value of the quadratic form at its minimum is

    y^T (I + μ H^T H)⁻¹ y − ( d̂ − h^T H (μ⁻¹I + H^T H)⁻¹ y )² / ( γ² − h^T (μ⁻¹I + HH^T)⁻¹ h ).    (4.30)
Note that due to (4.28), the second term in the above equation is always nonpositive. Therefore it is clear that one choice that guarantees the value at the minimum to be positive is

    d̂ = h^T H (μ⁻¹I + H^T H)⁻¹ y = h^T (μ⁻¹I + HH^T)⁻¹ H y.

But this is nothing but the estimate obtained from the solution to the regularized least-squares problem (4.18) with P₀ = μI:

    d̂(N+1) = h(N+1)^T [ μ⁻¹I + Σ_{n=1}^N h(n)h(n)^T ]⁻¹ ( Σ_{n=1}^N h(n)y(n) ) = h(N+1)^T ŵ_ls.    (4.31)

In other words, after all the trouble of defining robustness with respect to disturbance variation and introducing the H∞ estimation problem, it turns out that the optimal solution is still given by the regularized least-squares solution (4.18), a strange predicament indeed. So was all this worthwhile? Let us probe further.
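The claim is easy to check numerically. The sketch below (arbitrary illustrative sizes and data) computes γ²_opt from (4.29) and the regularized least-squares prediction from (4.31), then samples many candidate weight vectors w to confirm that the energy gain (4.25) never exceeds γ²_opt:

```python
import numpy as np

rng = np.random.default_rng(3)
N, m, mu = 20, 4, 0.1
H = rng.standard_normal((N, m))            # row n is h(n)^T
h_new = rng.standard_normal(m)             # h(N+1)
y = rng.standard_normal(N)                 # arbitrary observed outputs

A = np.eye(m) / mu + H.T @ H               # mu^{-1} I + sum_n h(n) h(n)^T
g2_opt = h_new @ np.linalg.solve(A, h_new)        # (4.29)
d_hat = h_new @ np.linalg.solve(A, H.T @ y)       # (4.31): h(N+1)^T w_ls

worst = 0.0
for _ in range(5000):
    w = rng.standard_normal(m) * rng.uniform(0.1, 10.0)
    num = (d_hat - h_new @ w) ** 2
    den = w @ w / mu + np.sum((y - H @ w) ** 2)
    worst = max(worst, num / den)

print(g2_opt, worst)    # the sampled energy gain never exceeds g2_opt
```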
4.2.4 A Prediction Problem
What we have shown above is that if we are given the data {h(n), y(n)}_{n=1}^N, then, for any new input vector h(N+1), the best predictor in the H∞ sense is given by the regularized least-squares estimate. Of course, the regularized solution we have presented is off-line: we need all the data before we can predict the output for h(N+1). As mentioned earlier, in many applications real-time constraints are crucial, and at any time n we will need to predict the output at time n+1 using our past observations. In other words, we need an optimal solution that is recursive.
For the least-squares problem we saw that a recursive solution was readily available via the RLS algorithm. In other words, at each time i the solution to the (regularized) RLS algorithm solves the following problem:

    min_w [ μ⁻¹ w^T w + Σ_{n=1}^i (y(n) − h(n)^T w)² ].

Moreover, by the argument of the previous section, the RLS prediction d̂(i+1) = h(i+1)^T ŵ(i) solves

    min_{d̂(i+1)} max_w (d̂(i+1) − h(i+1)^T w)² / ( μ⁻¹ w^T w + Σ_{n=1}^i (y(n) − h(n)^T w)² ),

and also, since future values of the disturbance have no effect on the prediction error at time i,

    min_{d̂(i+1)} max_w (d̂(i+1) − h(i+1)^T w)² / ( μ⁻¹ w^T w + Σ_{n=1}^N (y(n) − h(n)^T w)² ).
From these arguments, we conclude that the RLS algorithm recursively solves the following minimax estimation problem:

    min_{d̂(1),...,d̂(N+1)} [ max_w (d̂(1) − h(1)^T w)² / ( μ⁻¹ w^T w + Σ_{n=1}^N (y(n) − h(n)^T w)² ) + ...
                            + max_w (d̂(N+1) − h(N+1)^T w)² / ( μ⁻¹ w^T w + Σ_{n=1}^N (y(n) − h(n)^T w)² ) ].    (4.32)

But this is not quite the problem we would like to solve. What we would like to solve is the following:

    min_{d̂(1),...,d̂(N+1)} max_w Σ_{n=1}^{N+1} (d̂(n) − h(n)^T w)² / ( μ⁻¹ w^T w + Σ_{n=1}^N (y(n) − h(n)^T w)² );    (4.33)

that is, we would like to minimize the maximum energy gain from the disturbances to the prediction errors {d̂(n) − h(n)^T w}. The problem with (4.32) is that at each time instant the maximization over w is performed separately, whereas in (4.33) a single maximization over w is performed on the total prediction error energy.
4.2.4.1 We shall first show that

    γ²_opt = min_{d̂(1),...,d̂(N+1)} max_w [ Σ_{n=1}^{N+1} (d̂(n) − h(n)^T w)² / ( μ⁻¹ w^T w + Σ_{n=1}^N (y(n) − h(n)^T w)² ) ]

cannot be less than 1. To this end, suppose, without loss of generality, that the initial guess of the estimator is ŵ(0) = 0. Then one may conceive of a disturbance sequence v(n) that yields an output signal that is zero for all times and therefore coincides with the output expected from ŵ(0) = 0. Thus,

    v(n) = −h(n)^T w

and

    y(n) = h(n)^T w + v(n) = 0.

In this case, any permissible estimator will not change its estimate of w, so that ŵ(n) = ŵ(0) = 0 for all n.⁸ Moreover, the prediction error becomes

    d̂(n+1) − h(n+1)^T w = 0 − h(n+1)^T w = v(n+1),
and so the energy gain is

    Σ_{n=1}^{N+1} (h(n)^T w)² / ( μ⁻¹ w^T w + ‖v‖² ) = Σ_{n=1}^{N+1} (h(n)^T w)² / ( μ⁻¹ w^T w + Σ_{n=1}^N (h(n)^T w)² ).
⁸ Note that if an estimator changes its estimate of w when confronted with an all-zero output (y(n) = 0 for all n), then in the disturbance-free case (w = 0 and v(n) = 0 for all n) the denominator of (4.33) will be zero but the numerator nonzero, which makes the energy gain infinite. This is clearly not permissible.
Let us now assume that the {h(n)} have the property that

    lim_{N→∞} Σ_{n=1}^N h(n)^T h(n) = ∞.    (4.34)

Then, given any ε > 0, for N large enough,

    Σ_{n=1}^{N+1} (h(n)^T w)² / ( μ⁻¹ w^T w + Σ_{n=1}^N (h(n)^T w)² ) ≥ Σ_{n=1}^N (h(n)^T w)² / ( μ⁻¹ w^T w + Σ_{n=1}^N (h(n)^T w)² ) ≥ 1 − ε,

which can be made arbitrarily close to one. We thus conclude that γ_opt ≥ 1.
This implies that, for all estimators, in the worst case the prediction error energy can be no less than the disturbance energy. In other words, in the worst case it is not possible to obtain disturbance attenuation.
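The adversarial construction above can be simulated directly. In the sketch below (arbitrary illustrative sizes), the disturbance v(n) = −h(n)^T w zeroes every output, so the estimator never moves, and the resulting energy gain approaches 1 as the excitation energy grows:

```python
import numpy as np

rng = np.random.default_rng(8)
m, mu = 4, 0.1
w = rng.standard_normal(m)

ratios = []
for N in (10, 100, 1000, 10000):
    h = rng.standard_normal((N + 1, m))
    # Adversarial choice v(n) = -h(n)^T w makes every output zero, so any
    # permissible estimator keeps w_hat = 0 and predicts d_hat(n) = 0.
    num = np.sum((h @ w) ** 2)                    # prediction error energy
    den = w @ w / mu + np.sum((h[:N] @ w) ** 2)   # disturbance energy
    ratios.append(num / den)

print(ratios)   # the energy gain creeps up toward 1 as N grows
```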
The question now is whether a worst-case energy gain of unity is achievable. In other words, is γ_opt = 1? And if so, what is the optimal estimator?
4.2.5
Although this volume is devoted to the LMS algorithm, we have spent most of our time studying the RLS algorithm and considering robustness issues. It is now time to return to the LMS algorithm (4.6). If we define the estimation error of the weight vector as w̃(n) = w − ŵ(n), then it is straightforward to see that

    μ^{−1/2} w̃(n) = μ^{−1/2} w̃(n−1) − μ^{1/2} h(n) (y(n) − h(n)^T ŵ(n−1)).    (4.35)

(The reason for premultiplying by μ^{−1/2} will become clear in a moment.) Moreover, we have

    v(n) = y(n) − h(n)^T w = (y(n) − h(n)^T ŵ(n−1)) − h(n)^T w̃(n−1).    (4.36)

Squaring both sides of (4.35) and (4.36) and subtracting the results yields

    μ⁻¹ |w̃(n−1)|² + v(n)² = μ⁻¹ |w̃(n)|² + (h(n)^T w̃(n−1))² + (1 − μ|h(n)|²) (y(n) − h(n)^T ŵ(n−1))²,    (4.37)

where for any column vector a we have defined |a|² = a^T a.
If we now add up all the equations (4.37) from time n = 1 to time n = N+1, all but the first and last terms of the form μ⁻¹|w̃(n)|² cancel out, and we are left with

    [ μ⁻¹|w̃(N+1)|² + Σ_{n=1}^{N+1} (h(n)^T w̃(n−1))² + Σ_{n=1}^{N+1} (1 − μ|h(n)|²)(y(n) − h(n)^T ŵ(n−1))² ]
    / [ μ⁻¹|w̃(0)|² + Σ_{n=1}^{N+1} v(n)² ] = 1.    (4.38)

Note that the second term in the numerator is just the energy of the prediction errors, since h(n)^T w̃(n−1) = h(n)^T w − d̂(n). Moreover, if we assume that

    μ ≤ 1 / (h(n)^T h(n))   for all n,    (4.39)

then the third term in the numerator of (4.38) is nonnegative, and so we have

    Σ_{n=1}^{N+1} (d̂(n) − h(n)^T w)² / ( μ⁻¹ w^T w + Σ_{n=1}^{N+1} v(n)² ) ≤ 1,    (4.40)

where we have used the fact that |w̃(0)|² = w^T w.
The result in (4.40) is significant, since it shows that if the learning rate satisfies the bound (4.39), then LMS guarantees an energy gain no greater than 1. Since we previously argued that γ_opt ≥ 1, this implies that LMS is H∞-optimal!
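The bound (4.40) can be verified numerically. The following sketch (arbitrary illustrative dimensions and signals) runs LMS with a step size chosen to satisfy (4.39) and computes the realized energy gain from the disturbances to the a priori prediction errors:

```python
import numpy as np

rng = np.random.default_rng(4)
m, T = 5, 200
w = rng.standard_normal(m)                   # unknown "true" weight vector
h = rng.standard_normal((T, m))
v = rng.standard_normal(T)                   # arbitrary disturbance sequence
y = h @ w + v
mu = 1.0 / np.max(np.sum(h * h, axis=1))     # enforces (4.39) over the horizon

w_hat = np.zeros(m)
pred_err_energy = 0.0
for n in range(T):
    d_hat = h[n] @ w_hat                     # a priori prediction h(n)^T w_hat(n-1)
    pred_err_energy += (d_hat - h[n] @ w) ** 2
    w_hat = w_hat + mu * h[n] * (y[n] - d_hat)   # LMS update

gain = pred_err_energy / (w @ w / mu + v @ v)
print(gain)    # bounded by 1 whenever (4.39) holds
```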
4.2.5.1 The Condition on the Learning Rate We have shown that if the learning rate satisfies (4.39), then for prediction errors LMS is H∞-optimal and achieves γ_opt = 1. But what if (4.39) is not satisfied? Is it still true that γ_opt = 1?
To answer this question, suppose that γ_opt = 1, so that for any time i there exists an estimator such that for all disturbances

    Σ_{n=1}^{i+1} (d̂(n) − h(n)^T w)² / ( μ⁻¹ w^T w + Σ_{n=1}^i (y(n) − h(n)^T w)² ) ≤ 1.
Equivalently, for all w,

    μ⁻¹ w^T w + Σ_{n=1}^i (y(n) − h(n)^T w)² − Σ_{n=1}^{i+1} (d̂(n) − h(n)^T w)² ≥ 0.    (4.41)

For this indefinite quadratic form to have a minimum over w, its second derivative with respect to w must be positive semidefinite:

    μ⁻¹ I + Σ_{n=1}^i h(n)h(n)^T − Σ_{n=1}^{i+1} h(n)h(n)^T = μ⁻¹ I − h(i+1)h(i+1)^T ≥ 0.

Since the above matrix has only one eigenvalue that differs from μ⁻¹, this latter condition is simply

    μ⁻¹ − h(i+1)^T h(i+1) ≥ 0.

But since i was an arbitrary time instant, this is precisely the condition (4.39).
We have thus shown that if γ_opt = 1, then (4.39) must hold. Therefore it follows that if (4.39) does not hold, then γ_opt ≠ 1, which implies γ_opt > 1.
4.2.6

We can summarize the results obtained so far in the following theorem.

Theorem 1 Consider the model y(n) = h(n)^T w + v(n), n ≥ 0, and the problem of minimizing the worst-case energy gain from the disturbances to the prediction errors d̃(n) = d̂(n) − h(n)^T w,

    γ² = sup_{w, v ∈ l₂} ‖d̃‖² / ( μ⁻¹ w^T w + ‖v‖² ).

1. If μ ≤ inf_n 1/(h(n)^T h(n)), then the minimum value of γ² is given by γ²_opt = 1, and an H∞-optimal predictor is given by d̂(n) = h(n)^T ŵ(n−1), where ŵ(n) is found from the LMS algorithm (4.6).
2. If μ > inf_n 1/(h(n)^T h(n)), then γ²_opt > 1; an explicit expression for γ²_opt, as the supremum over n of the largest eigenvalue of a matrix built from μ⁻¹I and the quantities Σ_{i=0}^n h(i)h(i)^T, is given in [22]. In this case an H∞-optimal predictor is d̂(n) = h(n)^T ŵ(n−1), where

    ŵ(n) = ŵ(n−1) + P(n)h(n)(y(n) − h(n)^T ŵ(n−1)),    (4.42)

and P(n) satisfies the Riccati recursion

    P(n+1) = P(n) − P(n)h(n)h(n)^T P(n) / ( γ²_opt/(γ²_opt − 1) + h(n)^T P(n)h(n) ),   P(0) = μI.    (4.43)
We should remark that the proof of part 1 of the above theorem has already been given. Proving part 2 requires knowledge of H∞ theory, and so we omit it and refer the interested reader to [22]. Nonetheless, we have included the statement of part 2 for completeness and only mention that (4.42) is identical to the LMS algorithm (4.6) except that the learning rate μI has been replaced by the Riccati variable P(n).
Theorem 1 solves the long-standing problem of finding a rigorous basis for the LMS algorithm. Moreover, it confirms the robustness of the algorithm and gives theoretical justification for its widespread use in adaptive filtering. More to the point, the LMS algorithm is widely used not because it is an approximate least-squares solution (the exact solution, RLS, is readily available), or because it is simple, computationally efficient, or numerically stable (RLS can be made competitive with LMS on all these counts), but rather because it is an algorithm that is robust with respect to disturbance variation, a property of which RLS, for example, cannot boast. In fact, for prediction errors and in the H∞ setting, it is the optimal (hence most robust) algorithm in existence.
Since it is a robust algorithm, LMS exhibits reasonable to good performance over a wide range of environments and operating conditions. However, it cannot hope to compete with algorithms that know the exact statistics of the signals involved and are optimized for them. The point is that LMS will invariably perform within reason no matter what the statistics and modeling errors are.
Finally, the LMS algorithm can be viewed as providing a contractive mapping from the disturbances to the prediction errors. (This is true since the prediction error energy is always less than the disturbance energy.) This property turns out to have significant implications for studying the stability of a wide class of adaptive algorithms. The idea is to represent any adaptive algorithm as the feedback connection of the LMS algorithm and a secondary system and to apply the small-gain theorem from control theory [23, 24]. Since the LMS algorithm is a contraction, the loop gain is equal to the gain of the secondary system, so stability is guaranteed if this gain is less than unity. This approach to stability analysis is expounded in [25].
4.2.7
In the statement of Theorem 1 we were careful to mention that LMS is an H∞-optimal estimator, since we have not yet determined whether or not the H∞ problem has a unique solution. Let us now explore this issue.
We are interested in determining all predictors that yield

    Σ_{n=1}^i (d̂(n) − h(n)^T w)² / ( μ⁻¹ w^T w + Σ_{n=1}^i (y(n) − h(n)^T w)² ) ≤ 1,

that is, all predictors for which

    μ⁻¹ w^T w + Σ_{n=1}^i [d̂(n) − h(n)^T w,  y(n) − h(n)^T w] [ −1  0 ; 0  1 ] [ d̂(n) − h(n)^T w ; y(n) − h(n)^T w ] ≥ 0.
As mentioned several times earlier, the above indefinite quadratic form is nonnegative for all w if, and only if, it has a minimum over w and the d̂(n) can be chosen such that the value at the minimum is nonnegative. Due to our condition on the learning rate (4.39), we always have a minimum over w. Minimizing over w, the value at the minimum is

    Σ_{n=1}^i [d̂(n) − h(n)^T ŵ(n−1),  y(n) − h(n)^T ŵ(n−1)]
        [ −(1 − μ|h(n)|²)   μ|h(n)|² ; μ|h(n)|²   1 + μ|h(n)|² ]⁻¹
        [ d̂(n) − h(n)^T ŵ(n−1) ; y(n) − h(n)^T ŵ(n−1) ],    (4.44)

where ŵ(n) satisfies the recursion

    ŵ(n) = ŵ(n−1) + μ h(n) (y(n) − d̂(n)),   ŵ(0) = 0.    (4.45)

(For a proof of this result and a more general discussion of the minimization of such indefinite quadratic forms, see [22], Theorem 3.4.2 and Lemmas 3.4.1 to 3.4.3.)
Completing the square, the requirement that (4.44) be nonnegative becomes

    −Σ_{n=1}^i (d̂(n) − h(n)^T ŵ(n−1))² / (1 − μ|h(n)|²)
    + Σ_{n=1}^i ( y(n) − h(n)^T ŵ(n−1) − μ|h(n)|² (y(n) − d̂(n)) )² / (1 − μ|h(n)|²) ≥ 0.    (4.46)

Note that due to the learning rate constraint (4.39), the first summation in the above expression is nonpositive. Clearly, one choice that renders (4.46) nonnegative is d̂(n) = h(n)^T ŵ(n−1). Plugging this back into (4.45) readily gives the LMS algorithm. However, is this the only choice that renders (4.46) nonnegative? Obviously not. As long as the sequences

    { (d̂(n) − h(n)^T ŵ(n−1)) / √(1 − μ|h(n)|²) }_{n=1}^i

and

    { (y(n) − h(n)^T ŵ(n−1) − μ|h(n)|² (y(n) − d̂(n))) / √(1 − μ|h(n)|²) }_{n=1}^i

are related by a strictly causal contractive mapping, (4.46) holds.⁹
We thus have shown the following result.
Theorem 2 Consider the setting of Theorem 1 and assume that μ ≤ inf_n 1/|h(n)|². Then all H∞-optimal predictors that achieve γ_opt = 1 are given by

    d̂(n) = h(n)^T ŵ(n−1) + √(1 − μ|h(n)|²) f_n( s(n−1), s(n−2), ... ),    (4.47)

where

    s(k) = ( y(k) − h(k)^T ŵ(k−1) − μ|h(k)|² (y(k) − d̂(k)) ) / √(1 − μ|h(k)|²),

f_n(·, ·, ...) is a strictly causal contractive mapping, and ŵ(n) satisfies the recursion (4.45).
An illustration of the parametrization of Theorem 2 is given in Figure 4.4.
The simplest strictly causal contraction is f_n(·, ·, ...) = 0 for all n, which gives the LMS algorithm. Another simple strictly causal contraction is the identity map
⁹ Two sequences {a(n)} and {b(n)} are said to be related via a strictly causal contraction if, and only if, a(i) = f_i(b(i−1), b(i−2), ...) and Σ_{n=1}^i a(n)² ≤ Σ_{n=1}^i b(n)² for all i.
Figure 4.4
f_n(a(n−1), a(n−2), ...) = a(n−1). This gives

    d̂(n) = h(n)^T ŵ(n−1) + √(1 − μ|h(n)|²) · ( y(n−1) − h(n−1)^T ŵ(n−2) − μ|h(n−1)|² (y(n−1) − d̂(n−1)) ) / √(1 − μ|h(n−1)|²),    (4.48)
which is a perfectly valid H∞-optimal predictor, but one that can behave quite differently from LMS with respect to criteria other than robustness. It can be shown, for example, that the filter (4.48) has particularly poor average performance [26].
In fact, the nonuniqueness of the H∞-optimal filters has led many researchers to attempt to optimize other criteria over the family of these filters. We may refer to the superoptimal criterion, as well as to the mixed H²/H∞ criterion that attempts to find the filter with the best average performance among all those that guarantee a prescribed worst-case bound [26].
However, we shall not go into any of these here. Instead, we will focus on the question of whether there is anything special about the LMS algorithm or whether it is just an arbitrary member of the family of H∞-optimal filters, not worthy of any further distinction. To answer this question, we will now turn our attention to finding a stochastic interpretation of the LMS algorithm.
4.3 A STOCHASTIC INTERPRETATION
Recall that even though the RLS algorithm can be considered as an algorithm that minimizes the deterministic quadratic forms (4.8) or (4.18), under suitable Gaussian assumptions on the signals involved it also yields the ML or MAP estimates. The reason is that the deterministic quadratic form that RLS minimizes can be considered to be the (negative of the) exponent of a suitably chosen Gaussian probability density function (cf. (4.13)).
The LMS algorithm is related to the deterministic quadratic form

    μ⁻¹ w^T w + Σ_{n=1}^i [d̂(n) − h(n)^T w,  y(n) − h(n)^T w] [ −1  0 ; 0  1 ] [ d̂(n) − h(n)^T w ; y(n) − h(n)^T w ].    (4.49)

Indeed, referring to our derivation of all H∞-optimal filters in Section 4.2.7, we first minimized the above quadratic form to obtain the quadratic form (4.44), or equivalently (4.46). Inspection of (4.46) shows that the choice d̂(n) = h(n)^T ŵ(n−1), which leads to the LMS algorithm, recursively maximizes this quadratic form.
In other words, LMS performs the following optimization:

    max_{d̂(1),...,d̂(i)} min_w ( μ⁻¹ w^T w + Σ_{n=1}^i [d̂(n) − h(n)^T w,  y(n) − h(n)^T w] [ −1  0 ; 0  1 ] [ d̂(n) − h(n)^T w ; y(n) − h(n)^T w ] ),    (4.50)

where the maximization is done recursively, that is, d̂(1) = 0, d̂(2) depends only on y(1), d̂(3) depends only on y(1), y(2), and so on.
Now at first sight it appears that (4.50) cannot be related to a stochastic problem, since the quadratic form is indefinite and so cannot be the exponent of a Gaussian probability density function. Moreover, we have a minimization over w but a maximization over the d̂(n). However, if we define
    J ≜ μ⁻¹ w^T w + Σ_{n=1}^i (y(n) − h(n)^T w)² − Σ_{n=1}^i (d̂(n) − h(n)^T w)²,

then we can identify the first two terms as the (negative of the) exponent of a Gaussian density. More specifically, using (4.17), we can write

    e^{−J/2} = (2π)^{(i+m)/2} μ^{m/2} p( w, y(1), ..., y(i) | h(1), ..., h(i) ) exp( ½ Σ_{n=1}^i (d̂(n) − h(n)^T w)² ),

where we have assumed that w and the v(n) are zero-mean independent Gaussian random variables with variance μI and unity, respectively. Therefore we may also write

    ∫ e^{−J/2} dw = (2π)^{(i+m)/2} μ^{m/2} p( y(1), ..., y(i) | h(1), ..., h(i) ) · E[ exp( ½ Σ_{n=1}^i (d̂(n) − h(n)^T w)² ) | y(1), ..., y(i), h(1), ..., h(i) ].    (4.51)
Now it is not hard to show that, for any p × p matrix A > 0,

    ∫ exp( −½ [a^T  b^T] [ A  B ; B^T  C ] [a ; b] ) da = (2π)^{p/2} (det A)^{−1/2} exp( −½ min_a [a^T  b^T] [ A  B ; B^T  C ] [a ; b] ).

Applying this identity to the integral in (4.51), whose relevant Hessian with respect to w is positive definite by virtue of (4.39), yields

    E[ exp( ½ Σ_{n=1}^i (d̂(n) − h(n)^T w)² ) | y(1), ..., y(i), h(1), ..., h(i) ] = c · exp( −½ min_w J ),    (4.52)

where the constant c depends on the data and on A = μ⁻¹ I + Σ_{n=1}^i h(n)h(n)^T, but not on the d̂(n). Since the left-hand side of (4.52) depends on the d̂(n) only through J, we conclude from (4.50) that the LMS algorithm recursively solves
    min_{d̂(1),...,d̂(i)} E[ exp( ½ Σ_{n=1}^i (d̂(n) − h(n)^T w)² ) | y(1), ..., y(i), h(1), ..., h(i) ].    (4.53)

4.3.1 Risk-Sensitive Optimality
We can now formalize the result we have obtained in the following theorem.

Theorem 3 Consider the model y(n) = h(n)^T w + v(n), n ≥ 0, where w and the v(n) are zero-mean independent Gaussian random variables with variance μI and unity, respectively. Assume further that (4.39) holds. Then the LMS algorithm (4.6) recursively solves the problem

    min_{d̂(1),...,d̂(i)} E[ exp( ½ Σ_{n=1}^i (d̂(n) − h(n)^T w)² ) | y(1), ..., y(i), h(1), ..., h(i) ].

In other words, rather than minimizing the mean-square prediction error

    E[ Σ_{n=1}^i (d̂(n) − h(n)^T w)² ],    (4.54)

which is what the RLS algorithm does, the LMS algorithm minimizes the mean-exponential-square prediction error (4.53).
The exponential quadratic cost (4.53) was first introduced in the control theory context by Jacobson [27]. It has also been championed in statistics by Whittle, who calls it the risk-sensitive criterion [28]. In fact, Whittle introduces a whole family of such criteria, parametrized by γ > 0:

    min_{d̂(1),...,d̂(i)} E[ exp( (1/(2γ²)) Σ_{n=1}^i (d̂(n) − h(n)^T w)² ) ].
Whittle refers to estimators that minimize this criterion as risk-averse. The reason is that, compared to the mean-square criterion (4.54), the risk-sensitive criterion puts a much larger penalty (in fact, an exponentially larger penalty) on large values of the prediction error. In other words, the criterion is more concerned with the occasional occurrence of large values of prediction error than with the frequent occurrence of moderate values of error. Some further intuition regarding the criterion (4.53) can be obtained by expanding the exponential function and noting that the criterion penalizes all the moments of the prediction error, not just the second moment.
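The difference between the two criteria is easy to illustrate: two error sequences with identical variance, one Gaussian and one dominated by rare large spikes, receive essentially the same mean-square cost but very different exponential costs. A small sketch (the distributions are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200000

# Two zero-mean prediction-error sequences with identical variance 0.25:
e_gauss = 0.5 * rng.standard_normal(n)
spikes = rng.uniform(size=n) < 0.04              # rare large errors
e_heavy = np.where(spikes, 2.5 * np.sign(rng.standard_normal(n)), 0.0)

mse = (np.mean(e_gauss**2), np.mean(e_heavy**2))
risk = (np.mean(np.exp(e_gauss**2 / 2)), np.mean(np.exp(e_heavy**2 / 2)))
print(mse)     # essentially equal
print(risk)    # the rare-spike sequence pays a much larger exponential cost
```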
The smaller the parameter γ is, the more risk-averse the estimator, since the exponential in the criterion grows more steeply. However, it turns out that γ cannot be made arbitrarily small, and there exists a critical value of γ (namely γ_opt) below which no estimator renders the risk-sensitive cost finite. In our problem, the critical value is γ_opt = 1, since if γ < 1 were possible, it would mean that an H∞ estimator with γ < 1 is possible, which as we know is not the case. This is essentially the second statement of Theorem 3.
We should remark that the risk-sensitive optimality of the LMS algorithm fits very nicely with the robustness properties we described earlier. Essentially, not tolerating large values of error is a way of saying that the algorithm is robust. In any event, the risk-sensitive optimality of LMS is a very interesting property that is not shared by any of the other H∞-optimal filters of Theorem 2. It is important for two reasons: first, it gives a nonobvious stochastic interpretation to the LMS algorithm; second, it further emphasizes its special nature.
4.4
We have seen that the LMS algorithm outperforms the RLS algorithm when we have nonstationary signals and need to track time variations of the weight vector w. At first sight, one may argue that comparison of the tracking abilities of these two algorithms is not fair, since the LMS algorithm (4.6) has a constant gain vector, whereas the RLS algorithm (4.10)–(4.11) has a vanishing-to-zero gain vector and so can have no hope of tracking a time-varying w. However, the comparison is fair if we consider that both algorithms deal with the same time-invariant model

    y(n) = h(n)^T w + v(n),

with the only difference being that the RLS algorithm assumes that the disturbance sequence v(n) is stationary and white (also Gaussian) and finds the linear least-mean-squares (also least-mean-squares) estimate, whereas the LMS algorithm assumes only that the disturbance sequence is unknown and finds the H∞-optimal estimate.
The point is that the RLS algorithm explicitly uses the fact that w is constant and so
forces the gain vector to go to zero. The LMS algorithm, on the other hand, by virtue
of its robustness to modeling errors and disturbance variation, safeguards us against
a time-varying w by enforcing a nonzero gain vector.
4.4.1
Exponential Windowing
The fact that the vanishing-to-zero gain vector of RLS leads to poor tracking performance is well recognized in the literature, and so various modifications to RLS that circumvent this shortcoming have been proposed. The most common one is to use an exponential window and to replace the deterministic cost function (4.18) with

    \min_w \Big[ w^T (\sigma^2 \Pi_0)^{-1} w + \sum_{n=1}^{N} \lambda^{N-n} \big( y(n) - h(n)^T w \big)^2 \Big],        (4.55)

where 0 < λ ≤ 1 is the exponential forgetting factor. Minimizing (4.55) over w yields

    \hat{w}(i) = \hat{w}(i-1) + \frac{P(i-1) h(i)}{1 + h(i)^T P(i-1) h(i)} \big( y(i) - h(i)^T \hat{w}(i-1) \big), \quad \hat{w}(0) = 0,        (4.56)

which is the recursion we were pursuing. P(i) itself satisfies a Riccati recursion that can be computed to be

    P(i) = \lambda^{-1} \Big[ P(i-1) - \frac{P(i-1) h(i) h(i)^T P(i-1)}{1 + h(i)^T P(i-1) h(i)} \Big], \quad P(0) = \sigma^2 \Pi_0.        (4.57)
The reason the gain vector in (4.56) does not go to zero, unlike that of (4.10), is that, due to the factor λ^{-1} > 1 in (4.57), the matrix P(i) does not go to zero.
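As a concrete illustration of this point, the recursions (4.56)–(4.57) can be sketched in a few lines of code. This is a minimal numerical sketch, not part of the text's derivation; NumPy, the Gaussian regressors, the sizes, and the function name `ewrls` are all illustrative assumptions.

```python
import numpy as np

def ewrls(H, y, lam=0.9, P0=None):
    """Exponentially windowed RLS, cf. (4.56)-(4.57).

    H : (n_samples, dim) array of regressors h(i)^T
    y : (n_samples,) observations y(i) = h(i)^T w + v(i)
    """
    n_samples, dim = H.shape
    P = np.eye(dim) if P0 is None else P0.copy()  # P(0) = sigma^2 * Pi_0
    w = np.zeros(dim)                             # w_hat(0) = 0
    for i in range(n_samples):
        h = H[i]
        g = P @ h / (1.0 + h @ P @ h)             # gain vector in (4.56)
        w = w + g * (y[i] - h @ w)
        # Riccati recursion (4.57): the factor 1/lam > 1 keeps P away from zero
        P = (P - np.outer(g, h @ P)) / lam
    return w, P

rng = np.random.default_rng(0)
w_true = np.array([1.0, -0.5, 0.25])
H = rng.standard_normal((500, 3))
y = H @ w_true + 0.01 * rng.standard_normal(500)

w_win, P_win = ewrls(H, y, lam=0.9)   # exponential window
w_rls, P_rls = ewrls(H, y, lam=1.0)   # plain RLS for comparison
# For lam < 1, trace(P) settles at a nonzero level, so the gain stays alive;
# for lam = 1, P (and hence the gain) decays roughly like 1/i.
print(np.trace(P_win) > 5 * np.trace(P_rls))
print(np.allclose(w_win, w_true, atol=0.05))
```

Both runs recover w on this stationary example; the difference shows up in P, and hence in the ability to keep adapting should w later drift.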
Therefore using an exponential window in (4.55) alleviates the problem of a vanishing-to-zero gain vector and improves the tracking performance. However, it is also possible to apply the exponential window to the H∞ setting and to obtain a robust version of (4.56)–(4.57). All one needs to do is replace problem (4.33) with
    \gamma_{opt}^2 = \min_{\hat{d}(1), \ldots, \hat{d}(N+1)} \; \max_{w} \; \frac{\sum_{n=1}^{N+1} \lambda^{N+1-n} \big( \hat{d}(n) - h(n)^T w \big)^2}{\mu^{-1} w^T w + \sum_{n=1}^{N} \lambda^{N-n} \big( y(n) - h(n)^T w \big)^2}.        (4.58)
It turns out that finding an explicit formula for γ_opt in (4.58) is not possible. However, in [29] (Section 11.3.1), the following bound for γ_opt is obtained. Let

    h \stackrel{\Delta}{=} \sup_i h(i)^T h(i), \qquad R_\lambda(i) = \sum_{j=1}^{i} \lambda^{i-j} h(j) h(j)^T.        (4.59)
Then

    \gamma_{opt}^2 \le \sup_i \frac{h + \bar{\sigma}\big(R_\lambda(i)\big)}{\lambda^i / \mu + \underline{\sigma}\big(R_\lambda(i)\big)},        (4.60)

where \bar{\sigma}(\cdot) and \underline{\sigma}(\cdot) denote the largest and smallest singular values, respectively, and the corresponding estimator is

    \hat{w}(i) = \hat{w}(i-1) + \frac{P(i-1) h(i)}{1 + h(i)^T P(i-1) h(i)} \big( y(i) - h(i)^T \hat{w}(i-1) \big), \quad \hat{w}(0) = 0,        (4.61)

where P(i) satisfies a Riccati recursion analogous to (4.57).        (4.62)
The above estimator is one of many possible level-γ estimators. But, as with LMS, it has the distinction of being the risk-sensitive optimal solution.
4.4.2
The exponential windowing scheme just described is really an ad hoc way of dealing with a time-varying w. In effect, what we are doing is assuming a constant w but assigning (exponentially) higher weight to more recent observations. A more fundamental approach would be to introduce a time-varying weight vector into the model directly. In other words, we should assume that the observed sequence is given by

    y(n) = h(n)^T w(n) + v(n), \qquad n > 0.        (4.63)

The question that then arises is how to describe the time variation of w(n). A reasonable assumption is that w(n) itself satisfies the recursion

    w(n+1) = \alpha \, w(n) + \sqrt{\mu (1 - \alpha^2)} \, u(n), \qquad n > 0,        (4.64)

where 0 < α < 1 and u(n) is a disturbance vector often referred to as process noise. The reason for the coefficient \sqrt{\mu(1-\alpha^2)} is that, if we assume that

    E\{w(0) w(0)^T\} = \mu I, \qquad E\{u(n) u(m)^T\} = I \, \delta_{mn},        (4.65)

where δ_{mn} is the Kronecker delta, then the covariance matrix of the weight vector w(n) is constant for all time:

    E\{w(n) w(n)^T\} = \mu I, \qquad n > 0.

In other words, even though the weight vector is time-varying, its covariance matrix is constant for all time. The parameter α clearly determines the rate of the time variation of w(n): the smaller it is, the faster the time variation.
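The constant-covariance claim is easy to check by Monte Carlo simulation. The sketch below (NumPy; the dimension, number of trials, and horizon are illustrative assumptions) propagates an ensemble through (4.64) and verifies that the sample covariance of w(n) stays near μI:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, alpha, dim, trials, steps = 0.5, 0.9, 4, 100_000, 30

# Initial ensemble with E{w(0) w(0)^T} = mu * I, as in (4.65)
w = np.sqrt(mu) * rng.standard_normal((trials, dim))
for _ in range(steps):
    u = rng.standard_normal((trials, dim))             # E{u u^T} = I, white
    w = alpha * w + np.sqrt(mu * (1 - alpha**2)) * u   # recursion (4.64)

cov = w.T @ w / trials        # sample covariance after `steps` iterations
print(np.allclose(cov, mu * np.eye(dim), atol=0.02))
```

Any other scaling of the process noise would make the weight power grow or shrink over time; the coefficient \sqrt{\mu(1-\alpha^2)} is exactly the choice that balances the leak α.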
For the time-varying model (4.63)–(4.64), the estimator of the weight vector takes the form

    \hat{w}(i) = \alpha \Big[ \hat{w}(i-1) + \frac{P(i-1) h(i)}{1 + h(i)^T P(i-1) h(i)} \big( y(i) - h(i)^T \hat{w}(i-1) \big) \Big], \quad \hat{w}(0) = 0,        (4.66)

    P(i) = \alpha^2 \Big[ P(i-1) - \frac{P(i-1) h(i) h(i)^T P(i-1)}{1 + h(i)^T P(i-1) h(i)} \Big] + \mu (1 - \alpha^2) I, \quad P(0) = \mu I.        (4.67)

(For a detailed discussion of the derivation of the above equations see, for example, [8, 2].) Note that, due to the term μ(1 − α²)I in the Riccati recursion, P(i) is positive definite for all times, so the gain vector in (4.66) cannot go to zero. Therefore our introduction of the time-varying model (4.64) automatically leads to an algorithm capable of tracking w(n).
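The role of the injection term μ(1 − α²)I is easy to see numerically. In the sketch below (NumPy; the constant true weight vector and all sizes are illustrative assumptions), P(i) from (4.67) keeps all its eigenvalues bounded away from zero, so the gain in (4.66) never collapses:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, alpha, dim = 0.1, 0.95, 3
P = mu * np.eye(dim)                  # P(0) = mu * I
w_hat = np.zeros(dim)                 # w_hat(0) = 0
for i in range(1000):
    h = rng.standard_normal(dim)
    y = h @ np.ones(dim) + 0.01 * rng.standard_normal()   # illustrative data
    g = P @ h / (1.0 + h @ P @ h)
    w_hat = alpha * (w_hat + g * (y - h @ w_hat))         # recursion (4.66)
    # Riccati recursion (4.67): the mu*(1 - alpha^2)*I term re-inflates P
    P = alpha**2 * (P - np.outer(g, h @ P)) + mu * (1 - alpha**2) * np.eye(dim)

eigs = np.linalg.eigvalsh(P)
print(eigs.min() >= mu * (1 - alpha**2))   # P(i) stays positive definite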
4.4.2.1 The Leaky LMS Algorithm  In the H∞ setting, the disturbance signal u(n) in (4.64) is assumed to be unknown, and the predicted values of the uncorrupted output h(n)^T w(n) are determined via the criterion

    \gamma_{opt}^2 = \min_{\hat{d}(1), \ldots, \hat{d}(N+1)} \; \max_{w(0), \, u \in l_2} \; \frac{\sum_{n=1}^{N+1} \big( \hat{d}(n) - h(n)^T w(n) \big)^2}{\mu^{-1} w(0)^T w(0) + \sum_{n=1}^{N} u(n)^T u(n) + \sum_{n=1}^{N} \big( y(n) - h(n)^T w(n) \big)^2}.        (4.68)

Note that we now have three disturbances (w(0), the u(n), and the v(n)), which is why we have three terms in the denominator of the above energy gain.
Using H∞ theory, one can show the following result.
Theorem 4  Consider the model (4.63)–(4.64) and assume that μ ≤ 1/(h(i)^T h(i)) for all i. Then the optimal prediction error energy gain γ_opt², found by solving (4.68), satisfies

    \gamma_1^2 \le \gamma_{opt}^2 \le 1,        (4.69)

where γ₁² is the infimum, over all i, of the largest positive solution of a certain quadratic equation in γ².

Note that the above theorem implies that the optimal energy gain can be less than 1.^{10} Although it is possible to give the expression for an arbitrary level-γ predictor, we shall not do so here. Rather, we will give an interesting, though slightly suboptimal, predictor that achieves γ = 1.
To this end, a simple variation of the LMS algorithm that has been proposed for tracking applications is the leaky LMS algorithm,

    \hat{w}(i) = \alpha \big[ \hat{w}(i-1) + \mu \, h(i) \big( y(i) - h(i)^T \hat{w}(i-1) \big) \big], \qquad \hat{w}(0) = 0.        (4.70)

Note that, compared to the LMS algorithm (4.6), the leaky LMS algorithm attenuates the weight vector estimate by the factor 0 < α < 1. This allows the algorithm to forget earlier data, and thereby to track better.
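A small simulation shows the leaky LMS algorithm tracking the drifting weights of (4.63)–(4.64). Everything numerical here is an illustrative assumption: ±1 regressors (so that h(n)^T h(n) equals the dimension and the step-size condition is easy to enforce), α close to 1, and low-level measurement noise.

```python
import numpy as np

rng = np.random.default_rng(3)
dim, steps, alpha = 4, 5000, 0.999
mu = 0.1                               # mu < 1/(h^T h) = 1/dim holds below

w = np.sqrt(mu) * rng.standard_normal(dim)   # drifting true weights, cf. (4.65)
w_hat = np.zeros(dim)
err = []
for n in range(steps):
    h = rng.choice([-1.0, 1.0], size=dim)    # h^T h = dim = 4, so mu < 1/4
    v = 0.01 * rng.standard_normal()
    y = h @ w + v                            # observation model (4.63)
    w_hat = alpha * (w_hat + mu * h * (y - h @ w_hat))   # leaky LMS (4.70)
    err.append(np.sum((w - w_hat) ** 2))
    u = rng.standard_normal(dim)
    w = alpha * w + np.sqrt(mu * (1 - alpha**2)) * u     # weight drift (4.64)

# After the transient, the squared tracking error is small compared with
# the weight power E|w|^2 = mu * dim
print(np.mean(err[1000:]) < mu * dim)
```

The estimate never settles (the target keeps moving), but it stays locked onto the drifting weight vector, which is exactly the behavior the leak is designed to preserve.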
Let us now study the consequences of algorithm (4.70) for the time-varying model (4.63)–(4.64). If we define the prediction error of the weight vector at time i+1, using the observations up to time i, as \tilde{w}(i) = w(i+1) - \hat{w}(i), then subtracting (4.70) from (4.64) allows us to write

    \tilde{w}(i) = \alpha \big( I - \mu \, h(i) h(i)^T \big) \tilde{w}(i-1) - \alpha \mu \, h(i) v(i) + \sqrt{\mu (1 - \alpha^2)} \, u(i),        (4.71)

where we have made use of the fact that y(i) - h(i)^T \hat{w}(i-1) = h(i)^T \tilde{w}(i-1) + v(i). This now allows us to write the mapping from the variables \{\mu^{-1/2} \tilde{w}(i-1), v(i), u(i)\} to the variables \{\mu^{-1/2} \tilde{w}(i), h(i)^T \tilde{w}(i-1)\} as

    \begin{bmatrix} \mu^{-1/2} \tilde{w}(i) \\ h(i)^T \tilde{w}(i-1) \end{bmatrix}
    = \underbrace{\begin{bmatrix} \alpha \big( I - \mu \, h(i) h(i)^T \big) & -\alpha \mu^{1/2} h(i) & \sqrt{1 - \alpha^2} \, I \\ \mu^{1/2} h(i)^T & 0 & 0 \end{bmatrix}}_{A(i)}
    \begin{bmatrix} \mu^{-1/2} \tilde{w}(i-1) \\ v(i) \\ u(i) \end{bmatrix}.        (4.72)
We will now show that the mapping A(i) is a contraction. Indeed, from the above equation it follows that

    I - A(i) A(i)^* = \begin{bmatrix} \alpha^2 \mu \big( 1 - \mu |h(i)|^2 \big) h(i) h(i)^T & -\alpha \mu^{1/2} \big( 1 - \mu |h(i)|^2 \big) h(i) \\ -\alpha \mu^{1/2} \big( 1 - \mu |h(i)|^2 \big) h(i)^T & 1 - \mu |h(i)|^2 \end{bmatrix} \ge 0,

where the last inequality follows from the fact that the (2, 2) block entry satisfies 1 - \mu |h(i)|^2 \ge 0 and its Schur complement is zero.
^{10} The reason for this is the existence of the exponential decay factor 0 < α < 1.
The fact that A(i) is a contraction implies that the norm of the output variables is less than the norm of the input variables, that is,

    \mu^{-1} |\tilde{w}(i)|^2 + |h(i)^T \tilde{w}(i-1)|^2 \le \mu^{-1} |\tilde{w}(i-1)|^2 + |v(i)|^2 + |u(i)|^2.        (4.73)
Adding all of the above inequalities from time i = 1 to time i = N implies that

    \mu^{-1} |\tilde{w}(N)|^2 + \sum_{n=1}^{N} |h(n)^T \tilde{w}(n-1)|^2 \le \mu^{-1} |\tilde{w}(0)|^2 + \sum_{n=1}^{N} |u(n)|^2 + \sum_{n=1}^{N} |v(n)|^2,        (4.74)

so that

    \frac{\sum_{n=1}^{N} |h(n)^T \tilde{w}(n-1)|^2}{\mu^{-1} |\tilde{w}(0)|^2 + \sum_{n=1}^{N} |u(n)|^2 + \sum_{n=1}^{N} |v(n)|^2} \le 1        (4.75)

for all possible disturbances \tilde{w}(0), u(n), and v(n). In other words, we have shown that the leaky LMS algorithm guarantees a worst-case prediction error energy gain of unity.
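The bound (4.75) is deterministic and can be checked directly in code: whatever disturbances are drawn, the accumulated prediction-error energy never exceeds the accumulated disturbance energy as long as μ < 1/(h(n)^T h(n)). The random draws and sizes below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
dim, steps, alpha, mu = 3, 400, 0.95, 0.2      # mu < 1/dim = 1/3

ratios = []
for trial in range(20):
    w = rng.standard_normal(dim)               # w(1); w~(0) = w(1) since w_hat(0) = 0
    w_hat = np.zeros(dim)
    num = 0.0                                  # sum of |h(n)^T w~(n-1)|^2
    den = np.sum(w ** 2) / mu                  # mu^{-1} |w~(0)|^2
    for n in range(steps):
        h = rng.choice([-1.0, 1.0], size=dim)  # h^T h = dim
        v = rng.standard_normal()
        y = h @ w + v                          # observation (4.63)
        num += (h @ (w - w_hat)) ** 2          # prediction error energy
        w_hat = alpha * (w_hat + mu * h * (y - h @ w_hat))  # leaky LMS (4.70)
        u = rng.standard_normal(dim)
        den += np.sum(u ** 2) + v ** 2         # disturbance energy
        w = alpha * w + np.sqrt(mu * (1 - alpha**2)) * u    # drift (4.64)
    ratios.append(num / den)

print(max(ratios) <= 1.0)                      # energy gain of at most unity
```

No averaging is involved: every single trial obeys the bound, which is the hallmark of a worst-case (H∞-type) guarantee.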
Theorem 5  Consider the model (4.63)–(4.64) and assume that μ < 1/(h(i)^T h(i)) for all i. Then the leaky LMS algorithm (4.70) guarantees

    \frac{\sum_{n=1}^{N} |h(n)^T \tilde{w}(n-1)|^2}{\mu^{-1} |\tilde{w}(0)|^2 + \sum_{n=1}^{N} |u(n)|^2 + \sum_{n=1}^{N} |v(n)|^2} \le 1.

Moreover,

    \min_{\hat{d}(1), \ldots, \hat{d}(N+1)} \; \max_{w(0), \, u \in l_2} \; \frac{\sum_{n=1}^{N+1} \big( \hat{d}(n) - h(n)^T w(n) \big)^2}{\mu^{-1} w(0)^T w(0) + \sum_{n=1}^{N} u(n)^T u(n) + \sum_{n=1}^{N} \big( y(n) - h(n)^T w(n) \big)^2} \le 1,

and the leaky LMS predictions solve the risk-sensitive problem

    \min_{\hat{d}(1), \ldots, \hat{d}(N+1)} E_{| y(1), \ldots, y(N); \, h(1), \ldots, h(N)} \exp\Big( \sum_{n=1}^{N+1} \big( \hat{d}(n) - h(n)^T w(n) \big)^2 \Big).

The first statement of the above theorem follows from the arguments preceding the theorem. The second statement follows from (4.69). The third statement can be proven using an argument similar to the one we presented for the risk-sensitive optimality of the LMS algorithm.
In any event, Theorem 5 demonstrates the robustness of the leaky LMS algorithm (4.70) for applications where the unknown weight vector varies with time. It also demonstrates the robustness of the LMS algorithm itself for such applications, provided that the time variation of the weight vector is slow, that is, α ≈ 1, since in this case there is little difference between the LMS algorithm (4.6) and its leaky version (4.70).
4.5
FURTHER REMARKS
In addition to yielding a new interpretation of the LMS algorithm and providing it with a rigorous basis, the results described so far have lent themselves to various generalizations and to several new results. We close this chapter by listing some of these.
4.5.1
In this chapter we have focused on prediction errors, that is, on predicting the uncorrupted output h(n)^T w. It is also possible to look at the filtered errors

    h(n)^T w - h(n)^T \hat{w}(n),        (4.76)

that is, at the error in estimating the uncorrupted output h(n)^T w using observations up to and including the current time instant n. In this case, the H∞-optimal algorithm turns out to be the normalized LMS algorithm

    \hat{w}(n) = \hat{w}(n-1) + \frac{\mu}{1 + \mu |h(n)|^2} \, h(n) \big( y(n) - h(n)^T \hat{w}(n-1) \big),        (4.77)

which is a commonly used variant of the LMS algorithm. It turns out that the optimal energy gain is γ_opt² = 1 and that this is true irrespective of the learning rate μ. Results such as the nonuniqueness of the H∞-optimal estimators, the risk-sensitive optimality, tracking properties, and so on, all extend to the normalized LMS algorithm. For a proof of these results the reader may refer to [30].
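For completeness, here is the normalized LMS recursion (4.77) in code on a simple system-identification task; the setup (NumPy, Gaussian regressors, the particular μ) is an illustrative assumption, not part of the text.

```python
import numpy as np

rng = np.random.default_rng(5)
dim, steps, mu = 4, 2000, 0.5
w_true = rng.standard_normal(dim)
w_hat = np.zeros(dim)
for n in range(steps):
    h = rng.standard_normal(dim)
    y = h @ w_true + 0.01 * rng.standard_normal()
    # normalized LMS (4.77): the effective step is mu / (1 + mu * |h(n)|^2)
    w_hat = w_hat + (mu / (1.0 + mu * (h @ h))) * h * (y - h @ w_hat)

print(np.allclose(w_hat, w_true, atol=0.05))
```

The normalization makes the effective step size shrink automatically on large regressors, which is why the γ = 1 guarantee here holds for any μ, unlike the plain LMS case.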
4.5.2
In many applications, the LMS algorithm is used with a time-varying step size (or learning rate), that is,

    \hat{w}(n) = \hat{w}(n-1) + \mu(n) \, h(n) \big( y(n) - h(n)^T \hat{w}(n-1) \big).        (4.78)

In this case, it is straightforward to show that if the vectors μ(n)^{1/2} h(n) are exciting and μ(n) h(n)^T h(n) ≤ 1 for all n, then the above LMS algorithm with a time-varying
learning rate guarantees

    \min_{\hat{d}(1), \ldots, \hat{d}(N+1)} \; \max_{w} \; \frac{\sum_{n=1}^{N+1} \mu(n) \big( \hat{d}(n) - h(n)^T w \big)^2}{w^T w + \sum_{n=1}^{N} \mu(n) \big( y(n) - h(n)^T w \big)^2} \le 1.        (4.79)

4.5.3
The robustness of the RLS algorithm can also be examined in this framework. The optimal prediction error energy gain for RLS,

    \gamma_{opt}^2 = \min_{\hat{d}(1), \ldots, \hat{d}(N+1)} \; \max_{w} \; \frac{\sum_{n=1}^{N+1} \big( \hat{d}(n) - h(n)^T w \big)^2}{\mu^{-1} w^T w + \sum_{n=1}^{N} \big( y(n) - h(n)^T w \big)^2},

satisfies the bounds

    \Big( \sqrt{1 + \mu h} - 1 \Big)^2 \le \gamma_{opt}^2 \le \Big( \sqrt{1 + \mu h} + 1 \Big)^2,        (4.80)

where h = sup_n h(n)^T h(n). Thus, unlike the LMS algorithm, where the optimal energy gain was independent of μ and the h(n), for RLS it depends strongly on them. Moreover, for large values of μ, the upper and lower bounds in (4.80) grow as √μ. This is reminiscent of the robustness properties of LMS, where the learning rate had to be small enough to guarantee H∞ optimality. More importantly, it shows that the unregularized least-squares problem (4.8) (corresponding to μ = ∞) can be highly nonrobust with respect to prediction errors.
4.5.4
Mixed H²/H∞ Problems
One can also consider mixed H²/H∞ problems, in which one solves

    \min_{\hat{d}(1), \ldots, \hat{d}(N)} \sum_{n=1}^{N} \big( \hat{d}(n) - h(n)^T \hat{w}_{rls}(n-1) \big)^2,
subject to

    \sum_{n=1}^{N} \begin{bmatrix} \hat{d}(n) - h(n)^T \hat{w}(n-1) \\ y(n) - h(n)^T \hat{w}(n-1) \end{bmatrix}^T \begin{bmatrix} 1 - \mu |h(n)|^2 & \mu |h(n)|^2 \\ \mu |h(n)|^2 & 1 - \mu |h(n)|^2 \end{bmatrix}^{-1} \begin{bmatrix} \hat{d}(n) - h(n)^T \hat{w}(n-1) \\ y(n) - h(n)^T \hat{w}(n-1) \end{bmatrix} \ge 0,        (4.81)
where \hat{w}_{rls}(n) denotes the RLS estimate of the weight vector. The above problem is a quadratic program and can be readily solved.
4.5.5
Nonlinear Problems
The results presented in this chapter are for linear adaptive filters. They can be generalized, to some extent, to nonlinear adaptive filters (such as neural networks) if one linearizes these models around some suitable operating point. Using this approach, it can be shown (see [33]) that, for nonlinear problems, instantaneous-gradient-based methods (such as backpropagation [4]) are locally H∞-optimal. This means that if the initial estimate of the weight vector is close enough to its true value, and if the disturbances are small enough, then the maximum energy gain from the disturbances to the output prediction errors is arbitrarily close to 1. Global H∞-optimal filters can also be found in the nonlinear case, but they have the drawback of being infinite-dimensional (see [34]).
4.6
CONCLUSION
In this chapter we showed that the LMS algorithm is H∞-optimal. This result solves the long-standing problem of finding a rigorous basis for the LMS algorithm and also confirms its robustness. We have argued that, compared to exact least-squares solutions, the wide use of the LMS algorithm over a broad range of applications is best explained by its robustness to modeling errors and disturbance variation, rather than by its simplicity, computational efficiency, or numerical stability (for all of which
REFERENCES
13. M. Dahleh and J. Pearson, l¹-optimal compensators for continuous-time systems, IEEE Transactions on Automatic Control, vol. 32, pp. 889–895, 1987.
14. G. Zames, Feedback and optimal sensitivity: Model reference transformations, multiplicative semi-norms and approximate inverses, IEEE Transactions on Automatic Control, vol. 26, pp. 301–320, 1981.
15. A. Saberi, P. Sannuti, and B. Chen, H² Optimal Control. Prentice-Hall, Englewood Cliffs, NJ, 1995.
16. P. Duren, Theory of H^p Spaces. Dover, New York, 2000.
17. B. Francis, A Course in H∞ Control Theory. Springer-Verlag, New York, 1987.
18. A. Feintuch, Robust Control Theory in Hilbert Space. Springer-Verlag, New York, 1998.
19. J. Doyle, K. Glover, P. Khargonekar, and B. Francis, State-space solutions to standard H² and H∞ control problems, IEEE Transactions on Automatic Control, vol. 34, pp. 831–847, 1989.
20. H. Kimura, Chain-Scattering Approach to H∞ Control. Birkhäuser, Boston, 1997.
21. T. Basar and P. Bernhard, H∞-Optimal Control and Related Minimax Design Problems: A Dynamic Game Approach. Birkhäuser, Boston, 1991.
22. B. Hassibi, A. Sayed, and T. Kailath, Indefinite-Quadratic Estimation and Control: A Unified Approach to H² and H∞ Theories. SIAM, Philadelphia, 1999.
23. H. Khalil, Nonlinear Systems. Prentice-Hall, Englewood Cliffs, NJ, 2001.
24. M. Vidyasagar, Nonlinear System Analysis. SIAM, Philadelphia, 2002.
25. A. Sayed and M. Rupp, Error-energy bounds for adaptive gradient algorithms, IEEE Transactions on Signal Processing, vol. 44, pp. 1982–1989, 1996.
26. B. Halder, B. Hassibi, and T. Kailath, Mixed H²/H∞ estimation: Preliminary analytic characterization and a numerical solution, Proceedings of the 13th World Congress, International Federation of Automatic Control, Vol. J: Identification II, Discrete Event Systems, pp. 37–42, 1997.
27. D. Jacobson, Optimal stochastic linear systems with exponential performance criteria and their relation to deterministic games, IEEE Transactions on Automatic Control, vol. 18, pp. 124–131, 1973.
28. P. Whittle, Risk-Sensitive Optimal Control. Wiley, New York, 1990.
29. B. Hassibi, Indefinite Metric Spaces in Estimation, Control and Adaptive Filtering. Ph.D. thesis, Stanford University, 1996.
30. B. Hassibi, A. Sayed, and T. Kailath, H∞-optimality of the LMS algorithm, IEEE Transactions on Signal Processing, vol. 44, pp. 267–280, 1996.
31. B. Hassibi and T. Kailath, H∞ bounds for least-squares estimators, IEEE Transactions on Automatic Control, vol. 46, pp. 309–314, 2001.
32. B. Hassibi and T. Kailath, On adaptive filtering with combined least-mean-squares and H∞ criteria, Conference Record of the Thirty-First Asilomar Conference on Signals, Systems and Computers, vol. 2, pp. 1570–1574, 1998.
33. B. Hassibi, A. Sayed, and T. Kailath, H∞-optimality criteria for LMS and backpropagation, Advances in Neural Information Processing Systems, vol. 6, pp. 351–359, 1994.
34. B. Hassibi and T. Kailath, H∞-optimal training algorithms and their relation to backpropagation, Advances in Neural Information Processing Systems, vol. 7, pp. 191–199, 1995.
JOHN HOMER
Department of Computer Science and Electrical Engineering, The University of Queensland, Brisbane, Australia

IVEN MAREELS
Department of Electrical and Electronic Engineering, The University of Melbourne, Melbourne, Australia

and

ROBERT R. BITMEAD
Department of Mechanical and Aerospace Engineering, University of California, San Diego
5.1
PREAMBLE
For ease of reference, some of the notations and definitions that are used in this chapter are listed here.
5.1.1
Notation
I_m       m × m identity matrix
E{·}      expectation operator
Ē{·}      time-average operator, Ē{x} = lim_{N→∞} (1/N) Σ_{n=1}^{N} E{x(n)}
Most of this chapter was written when Iven Mareels was visiting the Department of Electrical and Computer Engineering at the National University of Singapore. The support and hospitality of the Department are hereby gratefully acknowledged. John Homer is with the Department of Computer Science and Electrical Engineering, The University of Queensland, Brisbane, Qld 4072, Australia, homerj@csee.uq.edu.au; Iven Mareels is with the Department of Electrical and Electronic Engineering, The University of Melbourne, Vic 3010, Australia, i.mareels@unimelb.edu.au; and Robert Bitmead is with the Department of Mechanical and Aerospace Engineering, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0411, USA, rbitmead@ucsd.edu.
Least-Mean-Square Adaptive Filters, Edited by Simon Haykin and Bernard Widrow. ISBN 0-471-21570-8 © 2003 John Wiley & Sons, Inc.
n = 1, 2, …    discrete time index
u(n)      input signal
y(n)      adaptive filter output
d(n)      desired signal
δ(n)      disturbance signal
e(n)      error signal
w(n)      adaptive weight vector
ū(n)      regressor vector, ū(n) = [u(n)  u(n−1)  ⋯]^T
R         input autocorrelation matrix
w^o       optimal (Wiener) weight vector
μ > 0     adaptation step size
5.1.2
Definitions
In the discussions frequent use is made of o(·) and O(·) estimates; refer to [31] for detailed definitions. Concisely, for two sequences u₁(n) and u₂(n) defined on n = 0, 1, …, it is said that u₁ is of the order of u₂, denoted u₁(n) = O(u₂(n)), provided that there exist a constant C > 0 and a time instant n₀ > 0 such that ‖u₁(n)‖ ≤ C‖u₂(n)‖ for all n ≥ n₀. The notation u₁(n) = o(u₂(n)) indicates that u₁(n) = O(u₂(n)) and lim_{n→∞} ‖u₁(n)‖ / ‖u₂(n)‖ = 0.
5.2
INTRODUCTION
the input signal u. This dependence is investigated in this chapter. The main tools
used in the analysis are first- and second-order averaging techniques [2, 31–33].
A standing assumption is
Assumption 1  The input u and disturbance δ signals are wide-sense stationary and possess well-defined mean, autocorrelation, and cross-correlation functions.
This assumption is elaborated upon in the sequel.
5.2.1
The typical situation is depicted in Figure 5.1. An FIR filter with ℓ taps and adaptively adjusted weight vector w is used to approximate the response of an unknown but stable filter. The stationary input signal is u. The adaptive filter's output is denoted y, with y(n) = w(n)^T ū(n), where ū(n) = [u(n)  u(n−1)  ⋯  u(n−ℓ+1)]^T is the corresponding regressor vector. The desired signal d is the output of the unknown filter. The latter may be disturbed by a (stationary) disturbance signal δ. The weight vector w of the adaptive FIR filter is adjusted so as to minimize the error

    e(n) = d(n) + \delta(n) - y(n)        (5.1)

in the mean-square sense. The weights are updated using the least-mean-square (LMS) update rule (equivalent to a stochastic gradient approximation with constant step size):

    w(n+1) = w(n) + \mu \, \bar{u}(n) \, e(n).        (5.2)
Figure 5.1
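The setup of Figure 5.1 translates into a few lines of code. The sketch below (NumPy; the filter length ℓ = 8, the step size, and the white signals are illustrative assumptions) runs the LMS update (5.2) against an unknown FIR system and recovers its weights:

```python
import numpy as np

rng = np.random.default_rng(6)
ell, steps, mu = 8, 20_000, 0.01
w_o = rng.standard_normal(ell)          # unknown system (the Wiener solution here)
w = np.zeros(ell)                       # adaptive weight vector, w(0) = 0
u = rng.standard_normal(steps + ell)    # white, stationary input

for n in range(ell, steps + ell):
    u_bar = u[n:n - ell:-1]             # regressor [u(n) u(n-1) ... u(n-ell+1)]^T
    d = w_o @ u_bar                     # desired signal from the unknown filter
    e = d + 0.01 * rng.standard_normal() - w @ u_bar   # error (5.1), with disturbance
    w = w + mu * u_bar * e              # LMS update (5.2)

print(np.allclose(w, w_o, atol=0.05))   # weights approach the Wiener solution
```

With a white input the autocorrelation matrix is the identity, so this is the most benign case; the dimension and correlation effects analyzed below show up when the input is colored or ℓ grows.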
The task of the LMS algorithm is to find the best (in a least-squares sense) linear approximation of the desired signal d using the regressor vector ū. Because of its computational simplicity and excellent robustness characteristics, it is an extremely widely used algorithm [1, 2].
The signal environment is considered to be an open loop signal environment when the input signal is independent of the error signal, so that E{u(n)δ(n−k)} = 0 for all k. A feedback signal environment is one where the error signal may leak back into the input signal, a situation that occurs, for example, in acoustic or telephony echo cancellation applications. The main difficulty encountered in a feedback signal environment is (closed loop) stability. In the open loop signal environment, there is no stability issue for the FIR filter as long as the FIR filter coefficients themselves are finite. This stability property is one of the main attractions of FIR filters in an adaptive context.
An in-depth analysis of stability properties of LMS algorithms under open loop signal conditions can be found in [4–6]. There the LMS algorithm is analyzed under weak conditions restricting the interdependence of the regressor vectors and the tail of their distribution. Most importantly, stationarity is not assumed, an assumption that is made in this chapter. A form of exponential stability, which entails good robustness properties, is established using ideas akin to averaging. As is typical in averaging results, the results are established under the condition that the step-size parameter is sufficiently small. In [4–6] the tracking performance of LMS algorithms is also analyzed under various scenarios. Because the assumptions are very weak, the analysis presented in [4–6] is unable to reveal dimension dependencies in performance and/or tracking characteristics. The latter is precisely the topic of this chapter. The study of LMS algorithms under feedback conditions is not as well developed; for some first results, refer to [30, 25].
5.2.2
The dimension parameter ℓ and its influence on the behavior of LMS algorithms have played an important role in the literature dealing with LMS filters from the very first references on the topic. Understanding this dependence becomes even more important as demanding applications such as adaptive acoustic echo cancellation and acoustic equalization require FIR filters of very high dimension in order to achieve good (filter) performance.
Most of the early literature (see, e.g., [35–39, 41, 42]) deals with the convergence rate of LMS filters in terms of the second-order characteristics of the input signal, namely, the eigenvalues of the correlation matrix R. This observation itself reveals a link between convergence speed and the dimension via the second-order moment of the input signal. Such dimension dependence is further supported by the very general (and hence conservative) theory of Vapnik [34]. The results in [34] predict, under very mild signal conditions, a penalty on the convergence rate with an increase in filter parameter dimension.
Most of the literature dealing with the convergence aspects of LMS filters attempts to find a best step size μ (under a variety of input conditions) so as to
result is a factor of 2 tighter than the estimate of [36] in the white Gaussian signal case. This estimate clearly shows a dimension effect as well as the influence of the input signal's autocorrelation function. The wider the spread of the eigenvalues of the autocorrelation matrix (which is the more likely the larger the dimension), the slower the expected initial convergence.
In [3] the behavior of the expected squared error E{e(n)²} is analyzed for small n as a measure of transient performance. The assumptions that the input is i.i.d. and that the regressor vector ū is independent of w (not unreasonable for small μ) are imposed. It is shown that the initial convergence rate is optimized by choosing μ = 1/σ²_{e0}, where σ²_{e0} is the variance of the expected initial parameter error. With this choice, it is further shown that the expected squared error E{e(n)²} converges like (1 − 1/ℓ)ⁿ for small n. This shows that the length of the FIR filter penalizes the LMS algorithm's convergence. The actual dependence on the signal's autocorrelation function is, of course, absent, because the signal was assumed to be i.i.d. over time. Similarly, [4] considers an adaptive normalized LMS algorithm under the condition that both input and disturbance signals are white and uncorrelated. The authors consider the expected squared error and establish that the initial convergence rate decreases as 1/ℓ. A closer analysis of the result in [4] reveals that it is slightly different from [3]: the estimates of the convergence speed differ by an O(μ²) term. Given that both [3] and [4] use first-order averaging techniques [31] to establish their results, this is completely acceptable, as all estimates are at best o(μ) correct.
More recently in [5], for normalized LMS, the inverse of the condition number of
the normalized (unit variance) input autocorrelation matrix R has been studied. It is
assumed that the input is Gaussian. A heuristic argument is mounted indicating that
the convergence speed is inversely proportional to this condition number. The main
result establishes that this condition number grows with the length of the FIR lter.
A cost function that captures the transient performance of the LMS algorithm more adequately is inspired by [38]:

    C_e(w^o) = \lim_{N \to \infty} \frac{1}{N} \sum_{n=1}^{N} \frac{\| E\{e(n)\} \|^2}{\| w^o \|^2}.        (5.4)
autocorrelation function. See, for example, [12], where the step-size parameter is different for each of the tap estimates and, moreover, adaptively adjusted based on the size of the total update for the particular weight. These ideas are not pursued here. In the same vein, [13, 16] may be mentioned, in which variants of an algorithm originally proposed by [14, 15], and traceable back to [17], are analyzed. At each sample interval, the algorithm updates only those m < ℓ tap weights for which the corresponding regressor entries are largest. The algorithm requires as overhead a sorting of the regressor vector in descending order of magnitude, which can be efficiently implemented. The computational cost is O(m log m), which has to be compared with O(ℓ) for the classical algorithm. Through a simulation analysis, it is shown that the penalty on convergence and performance is minimal as long as m is selected appropriately. Reference [16] provides further theoretical justification for the algorithm's performance. An analogous analysis is performed in [18], where the update is combined with an affine projection to provide improved performance. These algorithms are somewhat akin to the active tap algorithms proposed in the context of acoustic echo canceling [23].
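The selective-update idea just described can be sketched as follows: at each step only the m taps with the largest regressor magnitudes are adapted. The use of `argpartition` for the selection and all sizes are illustrative assumptions; a real implementation would maintain the ordering incrementally at the O(m log m) cost mentioned above.

```python
import numpy as np

rng = np.random.default_rng(7)
ell, m, steps, mu = 16, 4, 30_000, 0.02
w_o = rng.standard_normal(ell)          # unknown system
w = np.zeros(ell)
u = rng.standard_normal(steps + ell)    # white input

for n in range(ell, steps + ell):
    u_bar = u[n:n - ell:-1]
    e = w_o @ u_bar + 0.01 * rng.standard_normal() - w @ u_bar
    idx = np.argpartition(np.abs(u_bar), -m)[-m:]   # the m largest |entries|
    w[idx] += mu * u_bar[idx] * e                   # update only those m taps

print(np.allclose(w, w_o, atol=0.1))    # still converges, just more slowly
```

Each tap is updated only a fraction of the time, so the transient is longer than for full LMS, but the per-sample update cost drops from ℓ to m multiply-accumulates.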
In acoustic echo canceling applications the effective length of the FIR filter can be large compared to the actual number of required (nonzero) taps. This can be intuitively attributed to the way the signal is constructed: travel delay and reflections. The situation is as depicted in Figure 5.2; the shaded regions indicate where tap weights are important. In such circumstances it pays not only to identify the tap weights, but also to determine which taps should be identified. In view of the fact that dimension adversely affects the LMS learning performance, this strategy promises a significant improvement in transient performance as compared to the brute-force estimation of all taps over the entire effective length of the FIR filter. It is therefore no surprise that the literature dealing with acoustic echo canceling is preoccupied with reducing the number of adaptively adjusted FIR filter
Figure 5.2
weights. The key issue is how to determine which of the possible taps should be updated.
A few authors consider block processing of data, either in the time domain or in a linear transform domain. In [6] large FIR filters are adaptively updated not in the time domain, but after a linear transformation such as a discrete cosine transformation. Data are block processed, where the block length is larger than the maximum FIR delay. In the transformed domain the FIR coefficients that are considered most active are updated using a normalized LMS-like algorithm. The authors consider various options for reducing the computational cost of the updates in the transform domain. The computational cost is linear in the number of taps to be updated (which is much less than the FIR's total delay) and the data block length. It is shown through a simulation study that the convergence rate compares favorably with time-domain-based normalized LMS algorithms. In [7] a block data method is considered in the time domain. In every block of N data points the algorithm determines the P most significant taps from a possible maximum of M FIR taps. The integers satisfy P < M < N. The first most significant tap is the tap with the largest weight, as determined through a projection (the regressor vector most aligned with the output vector). This process is then repeated on the residual (what remains of the output vector after removal of the most aligned regressor vectors) until either P taps are determined or the residual is deemed sufficiently small. The computational cost is O(MP) per sample interval, which should be compared with the O(M) computational cost of a normal LMS algorithm.
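The block method of [7] is essentially a greedy, matching-pursuit-style selection, which can be sketched as below. The sparse system, block length, and selection loop are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)
M, P, N = 32, 3, 400                    # P < M < N, as in the text
w_o = np.zeros(M)                       # sparse unknown system: 3 active taps
w_o[[2, 11, 25]] = [1.0, -0.7, 0.4]

u = rng.standard_normal(N + M)
U = np.stack([u[n:n - M:-1] for n in range(M, N + M)])   # N x M regressor matrix
d = U @ w_o + 0.01 * rng.standard_normal(N)              # block of outputs

residual, selected = d.copy(), []
for _ in range(P):
    # pick the tap (column) most aligned with the current residual
    scores = np.abs(U.T @ residual) / np.linalg.norm(U, axis=0)
    scores[selected] = -np.inf          # never pick the same tap twice
    selected.append(int(np.argmax(scores)))
    # refit on all selected taps, then recompute the residual
    coef, *_ = np.linalg.lstsq(U[:, selected], d, rcond=None)
    residual = d - U[:, selected] @ coef

print(sorted(selected))                 # the three active taps chosen above
```

The refit-and-residual step is what makes the selection robust: a tap correlated with an already-chosen one will not be picked again, since its contribution has been removed from the residual.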
In the context of decision feedback equalizers, the tap selection issue is also considered; see [9–11]. In [9] a simple feedforward decision feedback equalizer is considered. In [10] the feedback decision equalizer is sparse. The method appears to rely on some rather strong prior information about the signal environment in order to determine which taps to update.
In [19–21] the more conventional LMS or normalized LMS is considered, with a tap-activity measure based on input-output cross-correlation estimates. A heuristic argument indicates that this correlation analysis allows one to rank the most important taps, which are then updated using the normal LMS algorithm.
The disadvantage of all these two-stage approaches is that the tap selection mechanism is essentially divorced from the optimization task to be performed by the LMS algorithm. In contrast, the approach expounded in the sequel directly selects as active those taps that contribute most to the minimization of the least-squares cost function. This approach is advocated in [23, 26, 27, 29].
5.2.3
Chapter Organization
The remainder of the chapter is organized as follows. First, the effect of the dimension and the correlation properties of the input signal on the convergence properties of standard LMS adaptive filters is studied. The basic assumptions are formulated, the averaging analysis is performed, and a particular measure of the quality of the adaptive filter's behavior is proposed. The main theorem, which characterizes how dimension and correlation properties affect the performance measure, follows. The result is illustrated with a number of representative simulations.
The next section deals with LMS adaptive lters with a variable number of
nonzero or active taps. This situation is analyzed under the condition that the input is
required to be white. A measure for detecting active taps is introduced. Based on this
measure, an algorithm that combines detection of active taps with standard LMS
adaptation is then proposed. The results are illustrated with some simulation studies.
A modification valid for mildly correlated signals is argued heuristically and presented.
Pointers to open questions and further reading conclude the chapter.
5.3
The open loop signal environment is studied. The main result is obtained through first- and second-order averaging techniques, without necessarily imposing a stochastic framework on the signals. Basic Cesàro-mean assumptions for first- and second-order moments suffice to derive the results. First, the assumptions are introduced. The basic averaged equations are then derived. Next, the performance measure that captures both transient and asymptotic behavior of the LMS algorithm is introduced. In order to make the results independent of any particular filtering situation, which is necessary in order to discuss the inherent algorithmic properties, it is assumed that the orientation of the desired Wiener solution is drawn from a uniform distribution.
5.3.1
To quantify how the dimension affects the convergence rate of the LMS algorithm in the open loop signal case, the following assumptions are imposed.

Assumption 2 (i) The input, u(n), and disturbance, δ(n), signals are zero mean, bounded, and stationary, so that the following limits exist (uniformly in the starting index k):

    R = \lim_{N \to \infty} \frac{1}{N} \sum_{n=k}^{N-1+k} \bar{u}(n) \bar{u}(n)^T,

    \sigma_u^2 = \lim_{N \to \infty} \frac{1}{N} \sum_{n=k}^{N-1+k} u(n)^2,

    \sigma_\delta^2 = \lim_{N \to \infty} \frac{1}{N} \sum_{n=k}^{N-1+k} \delta(n)^2.

(ii) The input and disturbance signals are uncorrelated with each other over time:

    \lim_{N \to \infty} \frac{1}{N} \sum_{n=k}^{N-1+k} u(n) \delta(n-m) = 0, \qquad \forall m.
(iii) The autocorrelation function of the input,

    r(n) = \lim_{N \to \infty} \frac{1}{N} \sum_{k=1}^{N} u(k) u(k-n), \qquad n = \ldots, -2, -1, 0, 1, 2, \ldots,

is absolutely summable: \sum_n |r(n)| < \infty. This guarantees the existence of the power spectrum of the input signal.
(iv) The power spectrum Φ_uu(ω) of the input signal is positive definite: Φ_uu(ω) > 0, 0 ≤ ω ≤ 2π. This implies that the input signal covariance matrix R is positive definite for all ℓ.
(v) The LMS step size satisfies μ ≤ 1/(σ_u² ℓ) = 1/trace(R).
(vi) The normalized unknown Wiener solution w^o/‖w^o‖ is independent of the input signal and has a probability distribution that is uniform in direction in ℓ-dimensional space (or, equivalently, the unknown channel vector has equal probability of pointing in any direction in the ℓ-dimensional space).
(vii) The LMS initial estimate w(0) is zero.
Remark 1  Assumption 2(ii) implies, among other things, that the Wiener solution is the stationary point of the LMS algorithm. Condition (iv) ensures that the Wiener solution is an attractive point for the LMS algorithm, regardless of the dimension ℓ. One says that the input signal is persistently exciting of any order.
Finally, condition (vi) allows one to average out the effects of any particular filter situation and concentrate solely on the LMS dynamics itself. It could be argued that divorcing the Wiener solution from the input signal is a strong assumption. Indeed, in general, the Wiener solution may depend on the input signal, although this is, of course, not a very desirable situation. It is a most convenient assumption, as without it the calculations for the performance indicator become rather tedious and uninformative.
Condition (vii) is a natural consequence of (vi); there is simply no prior knowledge to justify any other choice.
5.3.2
Averaging
Rather than discussing the convergence properties of the original LMS algorithm
(5.2), which requires one to study a time-varying linear equation, an intermediate,
averaged time-invariant equation, which closely captures the behavior of the LMS
algorithm, is obtained first. Assumption 2, in particular conditions (i), (ii), and (iii),
enables the following approximation.
Consider the averaged equation

w_{av}(n+1) = (I - \mu R)\, w_{av}(n) + \mu p, \qquad w_{av}(0) = w(0) = 0. \qquad (5.6)
Then, under Assumption 2, conditions (i)-(iv), standard averaging results guarantee
that the solution w_{av}(n) of (5.6) is a \nu(\mu) approximation for w(n), the solution of
(5.2), uniformly over time, because R > 0. More precisely, for all \mu sufficiently
small (at least satisfying condition (v) from Assumption 2), the following bound
holds:

\|w_{av}(n) - w(n)\| \le \nu(\mu). \qquad (5.7)

Under Assumption 2 it can be deduced that \nu(\mu) = o(1) as \mu \to 0, for any choice of
compact domain D and any choice of horizon parameter L. \qquad (5.8)
Remark 2 Under the particular condition (iii) imposed by Assumption 2, one can
actually estimate that \nu(\mu) = O(\sqrt{\mu}).
Remark 3 In essence, the above conclusion allows one to study equation (5.6)
rather than the LMS equation (5.2) in order to describe both transient and
asymptotic properties. This is the power of time-based averaging analysis. It is an
approximation result, which here, thanks to the asymptotic stability of the averaged
equation, is valid over the entire time axis [31, 2].
Remark 4 Note that the stationary point of (5.6) is the Wiener solution. It follows
from equations (5.6) and (5.7) that the LMS algorithm's solution converges
geometrically to an O(\nu(\mu)) neighborhood of the Wiener solution.
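The averaging approximation is easy to observe numerically. The sketch below (a minimal illustration, not code from the chapter; the two-tap system, step-size, and noise level are arbitrary choices) runs a scalar-step LMS filter on white Gaussian data alongside the averaged recursion (5.6), which for white input reduces to w_av(n+1) = (1 - mu) w_av(n) + mu w_o per tap, and records the worst-case gap between the two trajectories:

```python
import random, math

random.seed(1)

w_o = [1.0, -0.5]        # unknown Wiener solution (illustrative choice)
mu, steps = 0.005, 6000
sigma_d = 0.1            # disturbance standard deviation (assumed)

# White input => R = I and p = R w_o = w_o in the averaged recursion (5.6).
w = [0.0, 0.0]           # LMS iterate
w_av = [0.0, 0.0]        # averaged iterate, eq. (5.6)
u_buf = [0.0, 0.0]       # regressor u(n) = [u(n), u(n-1)]
max_gap = 0.0

for n in range(steps):
    u_buf = [random.gauss(0, 1), u_buf[0]]
    d = sum(wi * ui for wi, ui in zip(w_o, u_buf)) + random.gauss(0, sigma_d)
    e = d - sum(wi * ui for wi, ui in zip(w, u_buf))
    # LMS: w(n+1) = w(n) + mu * e(n) * u(n)
    w = [wi + mu * e * ui for wi, ui in zip(w, u_buf)]
    # Averaged: w_av(n+1) = (I - mu R) w_av(n) + mu p
    w_av = [(1 - mu) * wi + mu * pi for wi, pi in zip(w_av, w_o)]
    gap = math.sqrt(sum((a - b) ** 2 for a, b in zip(w, w_av)))
    max_gap = max(max_gap, gap)

final_err = math.sqrt(sum((a - b) ** 2 for a, b in zip(w, w_o)))
```

With a small step-size the recorded gap stays small over the whole trajectory, in line with the uniform bound (5.7), and the LMS iterate ends in a small neighborhood of the Wiener solution.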
5.3.3
In order to study the transient performance of the LMS algorithm, consider the
following cost functional:
\hat{C}_e = E_{w_o} \lim_{N\to\infty} \sum_{n=0}^{N} \frac{\|w_{av}(n) - w_o\|^2}{\|w_o\|^2}. \qquad (5.9)
In view of the stability of the Wiener solution for equation (5.6), the sum above can
be seen to be bounded, and hence \hat{C}_e is well defined. It clearly captures the
transient performance, not the asymptotic performance, which is characterized by
(5.8).
Since w_{av}(n) - w_o = -(I - \mu R)^n w_o, the cost (5.9) evaluates to

\hat{C}_e = E_{w_o} \sum_{n=0}^{\infty} \frac{\|(I - \mu R)^n w_o\|^2}{\|w_o\|^2}
    = \sum_{n=0}^{\infty} \sum_{j=1}^{N} (1 - \mu\lambda_j)^{2n}\, E_{w_o}\!\left[\frac{g_j(w_o)^2}{\|w_o\|^2}\right], \qquad (5.10)

where \lambda_1, \ldots, \lambda_N are the eigenvalues of R and g_j(w_o) is the component of
w_o along the jth eigenvector of R. Under condition (vi) of Assumption 2,
E_{w_o}[g_j(w_o)^2/\|w_o\|^2] = 1/N, so that, summing the geometric series,

\hat{C}_e = \frac{1}{N} \sum_{n=0}^{\infty} \sum_{j=1}^{N} (1 - \mu\lambda_j)^{2n} \qquad (5.11)

    = \frac{1}{N} \sum_{j=1}^{N} \frac{1}{\mu\lambda_j (2 - \mu\lambda_j)}. \qquad (5.12)
Clearly, the constant terms are irrelevant when compared to the O(1/\mu) terms. This
suggests that an appropriate measure for the transient learning cost of the LMS
algorithm is given by the expression

C_e(N) = \frac{1}{2\mu N}\, \mathrm{trace}(R^{-1}). \qquad (5.13)

5.3.4
In the previous section it was argued that the convergence cost, or transient learning
cost, for the LMS algorithm in a typical situation (typical because the Wiener-solution
dependence is averaged out) is determined by (5.13). In this section, some
analytic results on how this convergence cost C_e(N) depends on signal properties and
on the dimension N are provided. Of course, in any particular signal environment, it is
actually feasible to compute the cost functional C_e(N) for different parameters and
simply observe the dimensional dependence.
The following result holds for any signal environment conforming to
Assumption 2.
Theorem 1 Let the signal environment conform to Assumption 2. Then:
1. If the input signal is discrete white, then C_e(N) = C_e(1) for all N.
2. C_e(N) is nondecreasing in the dimension N:

C_e(N+1) - C_e(N) \ge 0, \qquad (5.14)

where 1/r(N) is the (1,1) element of R_N^{-1} and b(k,N)/r(N) is the (k,1) element of
R_N^{-1}.
3. In the limit of large dimension,

\lim_{N\to\infty} C_e(N) = \frac{1}{2\mu}\, \frac{1}{2\pi} \int_{-\pi}^{\pi} \Phi_{uu}^{-1}(\omega)\, d\omega =: C_e(\infty). \qquad (5.15)

A detailed proof of this result can be found in [22, 29]. Parts 1 and 2
effectively follow from the Levinson algorithm applied to the inverse of the input's
correlation matrix, exploiting its Hermitian and Toeplitz structure. Part 3 is a
standard result from [44].
The results encapsulated in Theorem 1 may be paraphrased as follows:

If the input signal u is discrete white, the dimension N does not affect the
convergence speed of the LMS algorithm. This is a clear pointer for the
prewhitening filters advocated in conjunction with LMS algorithms in, for
example, echo-cancellation applications.

When the input signal u is not discrete white, the effect of dimension is more
pronounced the more u deviates from discrete white. The expression C_e(\infty) can
be effectively interpreted as measuring the filter power required to whiten the
signal u. It can also be observed that, under the constraint of unity signal
power, C_e(\infty) attains its minimum (and thus becomes a tight bound) when the
input signal is discrete white, that is, for signals with a constant power
spectrum [29].
To appreciate the effect of the input signal not being discrete white, Figure 5.3
represents the convergence cost function C_e(\infty) against the filter pole a \in [0,1) for
an input signal which is first-order filtered white noise, u(n+1) = a\,u(n) + \varepsilon(n+1).
Here the variance \sigma_\varepsilon^2 of the white noise \varepsilon is scaled such that u has unity total power:

\frac{1}{2\pi} \int_{-\pi}^{\pi} \Phi_{uu}(\omega)\, d\omega = \frac{\sigma_\varepsilon^2}{2\pi} \int_{-\pi}^{\pi} \frac{d\omega}{|e^{i\omega} - a|^2} = 1.

As Figure 5.3 clearly illustrates, the more the signal u is correlated (a closer to 1), the worse the transient
performance becomes (compared to the white noise case).
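For the first-order case the dimensional dependence of (5.13) can be evaluated in closed form, because for a unit-power AR(1) input with pole a the covariance matrix is R_ij = a^|i-j|, whose inverse is the well-known tridiagonal matrix with trace(R_N^{-1}) = [2 + (N-2)(1+a^2)]/(1-a^2). The sketch below (illustrative only; the pole values and step-size are arbitrary) computes C_e(N) = trace(R_N^{-1})/(2 mu N) and reproduces both observations above: for a = 0 the cost is independent of N, while for a near 1 it grows with N toward the limit C_e(inf) = (1+a^2)/(2 mu (1-a^2)) of (5.15):

```python
def conv_cost(N, a, mu):
    """Transient learning cost C_e(N) of eq. (5.13) for a unit-power
    AR(1) input with pole a: R_ij = a^|i-j|, whose inverse has
    trace [2 + (N-2)(1 + a*a)] / (1 - a*a)."""
    if N == 1:
        trace_rinv = 1.0
    else:
        trace_rinv = (2.0 + (N - 2) * (1.0 + a * a)) / (1.0 - a * a)
    return trace_rinv / (2.0 * mu * N)

mu = 0.01
# White input (a = 0): cost does not depend on the dimension N.
white = [conv_cost(N, 0.0, mu) for N in (1, 10, 100, 1000)]
# Strongly correlated input (a = 0.9): cost grows with N.
colored = [conv_cost(N, 0.9, mu) for N in (1, 10, 100, 1000)]
ce_inf = (1 + 0.9 ** 2) / (2 * mu * (1 - 0.9 ** 2))  # limit (5.15) for AR(1)
```

The white-input costs all equal 1/(2 mu), while the correlated costs increase monotonically in N and approach ce_inf, matching Theorem 1.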
More generally, for unit-power input signals u described by autoregressively
filtered white noise, the convergence cost function can be expressed as
follows.

Theorem 2 Suppose that the unit power signal, u(n), is described by the mth-order
autoregressive (AR) model

a_0 u(n) + a_1 u(n-1) + a_2 u(n-2) + \cdots + a_m u(n-m) = \varepsilon(n), \qquad (5.16)

with m \le N. Then

C_e(N) = \frac{a_0^2\, N + a_1^2 (N-1) + a_2^2 (N-2) + \cdots + a_m^2 (N-m)}{2\mu N}. \qquad (5.17)
Remark 5 Models of the form (5.16) are typically used for voiced speech. In such
circumstances the typical maximal delay m is of the order of 10. In acoustic echo
cancellation, typical FIR filter orders are of the order of 1000, so the stated
restriction on m is not a limiting factor in this context.
Remark 6 The unit power constraint for u in (5.16) imposes the following
constraint on the autoregressive filter A(z^{-1}) = a_0 + a_1 z^{-1} + a_2 z^{-2} + \cdots + a_m z^{-m}:

\frac{1}{2\pi} \int_{-\pi}^{\pi} \frac{1}{|A(e^{j\omega})|^2}\, d\omega = 1.
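The constraint in Remark 6 is easy to check numerically for any candidate coefficient set. The sketch below (a standalone illustration; the first-order filter and its normalization are chosen for this example, not taken from Table 5.1) evaluates the integral by a Riemann sum and verifies that A(z^{-1}) = (1 - a z^{-1}) / sqrt(1 - a^2) yields a unit-power AR(1) process:

```python
import math

def unit_power_integral(coeffs, K=4096):
    """Approximate (1/2pi) * integral of 1/|A(e^{jw})|^2 dw over [-pi, pi)
    by a Riemann sum on K equispaced frequencies."""
    total = 0.0
    for k in range(K):
        w = -math.pi + 2 * math.pi * k / K
        re = sum(c * math.cos(-j * w) for j, c in enumerate(coeffs))
        im = sum(c * math.sin(-j * w) for j, c in enumerate(coeffs))
        total += 1.0 / (re * re + im * im)
    return total / K

a = 0.8
# AR(1) whitening filter scaled so that the output process has unit power:
coeffs = [1.0 / math.sqrt(1 - a * a), -a / math.sqrt(1 - a * a)]
power = unit_power_integral(coeffs)
```

Because the integrand is smooth and periodic, the equispaced Riemann sum converges very quickly, and the computed power is 1 to within numerical precision.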
Remark 7 As already observed in the general case, for signals described by (5.16)
the convergence cost function also increases with the filter dimension N. In
particular, it follows from (5.17) that in this special case

C_e(1) \le C_e(N) \le C_e(\infty), \qquad (5.18)

and the same ordering holds for the normalized cost functions C_e^{norm}(N).
By way of illustration, Figure 5.4 represents C_e(N) (on the vertical axis) for various
AR models (with m = 10) against the dimension N (on the horizontal axis). Table 5.1 includes
the three AR coefficient sets A1, A2, and A3 used to construct Figure 5.4. The AR
models, corresponding to equation (5.16), are obtained through application of the
Yule-Walker method [44] to segments of unit variance voiced speech. Also included
in Table 5.1 is the equivalent AR coefficient set A0 for a unit variance white signal.
The limit in the convergence cost function, or the correlation level, for each of these
signals is, respectively, \mu C_e(\infty) = 1.0 (A0), 122.9 (A1), 306.4 (A2), 195.1 (A3). This
implies that the (normalized) LMS convergence cost function for voiced speech
inputs typically is more than 100 times greater than that for white inputs of the same
variance. For filter lengths greater than 40, the same is also true for the
unnormalized LMS cost function, as indicated in Figure 5.4. Note that the graph for
A0 is not discernible from the horizontal axis.

Figure 5.4 The effect of autocorrelation and filter dimension on transient performance. The
figure represents C_e(N) (vertical axis) for various AR models against the dimension N (horizontal axis).

Figure 5.4 clearly suggests that in applications such as acoustic echo cancellation, which involve speech input signals
and filter lengths of 100 up to the order of 1000, input signal whitening techniques
should improve the convergence speed by more than 100-fold.
5.3.5
Step-size Selection
The performance of a typical LMS algorithm consists not only of the transient
performance but also of the asymptotic performance. As indicated in the averaging
TABLE 5.1 Voiced Speech AR Coefficient Sets (Used for Figure 5.4). The table lists the
AR coefficients (m = 10) of the unit-variance white reference set A0 (1.0, 0.0, ..., 0.0) and
of the three voiced-speech sets A1, A2, and A3 obtained via the Yule-Walker method.
The AR filters are designed to satisfy the unit power constraint; see Remark 6.
result, asymptotically the adaptive FIR filter approximates the ideal Wiener filter,
with an error of the order \nu(\mu) given in equation (5.8). Under the signal
conditions imposed in Assumption 2, it follows that the least-squares performance
error of the adaptive filter in steady state is O(\mu) in excess of the Wiener filter
performance. (Here use is made of the estimate \nu(\mu) = O(\sqrt{\mu}), as indicated in
Remark 2.)
It is natural to propose a step-size selection that tries to achieve good transient
performance as well as good asymptotic performance. This would lead to a criterion
for step-size selection of the form

J(\mu) = \frac{1}{2\mu}\, \frac{1}{2\pi} \int_{-\pi}^{\pi} \Phi_{uu}^{-1}(\omega)\, d\omega + O(\mu), \qquad (5.19)

which is minimized by a step-size of the order

\mu = O\!\left( \left( \frac{1}{4\pi} \int_{-\pi}^{\pi} \Phi_{uu}^{-1}(\omega)\, d\omega \right)^{1/2} \right). \qquad (5.20)
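The square-root scaling in (5.20) follows from balancing the two terms of (5.19): a cost of the form a/mu + b*mu is minimized at mu = sqrt(a/b). The toy sweep below (with invented constants a and b, not derived from any particular signal) checks this on a grid of candidate step-sizes:

```python
import math

a = 2.0    # transient-cost coefficient (assumed value)
b = 50.0   # steady-state excess-error coefficient (assumed value)

def J(mu):
    """Combined cost of eq. (5.19): transient term a/mu plus O(mu) term."""
    return a / mu + b * mu

# Grid search over candidate step-sizes 1e-4 ... 1 (log-spaced).
grid = [10 ** (k / 100.0) for k in range(-400, 1)]
mu_best = min(grid, key=J)
mu_theory = math.sqrt(a / b)   # balancing the two terms
```

The grid minimizer lands on the theoretical balance point to within the grid spacing, illustrating why the optimal step-size scales with the square root of the whitening-power integral.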
5.4
VARIABLE-DIMENSION FILTERS
In applications like acoustic echo cancellation, FIR filters with large total delay are
required to obtain adequate echo suppression. In the presence of colored input
signals, this is particularly bad news for LMS algorithms, but even in the white input
case this causes performance difficulties.
In this section, a particular method of detecting the active taps (see Fig. 5.2) in
conjunction with a typical LMS algorithm is discussed. The selection of the active
taps is geared toward achieving good asymptotic performance. The detection
mechanism in conjunction with a normal LMS algorithm provides an estimation
approach with greatly enhanced asymptotic performance compared to the direct
estimation of an FIR filter with as many taps as the total delay requires. The price to
be paid for this enhanced performance is a marginal increase in computational cost.
The proposed detection method is shown to be structurally consistent; that is, it
identifies correctly which taps are active when the input signal is white. No
structural consistency results are available for colored input signals, but the method
is shown to be robust with respect to deviations from white input signals.
5.4.1
In the case of sparse FIR filter estimation (a filter like the one in Fig. 5.2), detection
of the active taps is important even in the case of white input signals. Although in
this case the convergence speed of the LMS algorithm is not affected by dimension,
the final asymptotic performance of the adaptive filter is greatly affected. Indeed, if
all N taps were adaptively estimated, then because an LMS estimate is never exact
but only O(\mu) accurate, each tap estimate contributes an O(\mu) error to the final
adaptive FIR filter performance. With N taps estimated, this leads to an excess error
of the order of O(\mu N) for the adaptive FIR filter compared to the ideal Wiener
filter. If, on the other hand, only m \ll N taps actually contributed to the Wiener filter
solution, and only those m taps were LMS adaptively estimated, then the final
adaptive FIR filter would have an excess error of only O(\mu m) \ll O(\mu N) over the ideal
Wiener filter. Clearly, it pays to detect those taps that actively contribute to the FIR
filter's performance.
In the presence of colored input signals, there is a second, equally compelling
reason to consider detection of active taps based on the LMS convergence
performance. As is clear from the previous section, the dimension of the regressor
vector and the autocorrelation properties of the input signal influence the convergence
properties of LMS algorithms in a nontrivial and detrimental manner.
5.4.2
Signal Environment
In order to focus the ideas, consider the following signal environment assumptions
in addition to the standing Assumptions 1 and 2, which are consistent with the
intuition behind Figures 5.1 and 5.2.

Assumption 3
(i) The measured signal is generated by a sparse Wiener filter:

y(n) = \sum_{j=1}^{m} w_o(t_j)\, u(n - t_j) + d(n), \qquad (5.21)

where it is expected that m \ll N and where the indices of the nonzero Wiener
filter weights, t_j \le N for j = 1, \ldots, m, are unknown. Denote the collection
\{t_j,\ j = 1, \ldots, m\} as J_o.
(ii) The input signal is discrete white, with variance \sigma_u^2 and thus R = \sigma_u^2 I.
(iii) The disturbance is discrete white and uncorrelated with the input.
5.4.3
Under Assumptions 1, 2, and 3, the performance of an LMS filter without active tap
detection is given by

\lim_{n\to\infty} E\{e(n)^2\} = \sigma_d^2 + \sigma_u^2 \sum_{j=1}^{N} E\{(w_o(j) - w(j))^2\}. \qquad (5.22)
For the standard LMS algorithm, in which all N taps are adapted, this evaluates to

\lim_{n\to\infty} E\{e(n)^2\} = \sigma_d^2 \left( 1 + \frac{\mu \sigma_u^2 N}{2} \right). \qquad (5.23)
The excess in asymptotic performance (as compared to the Wiener filter) is entirely
due to the variance error in the estimated tap weights. (There is no bias error.)
Now consider the case where only a portion of the tap weights are being
estimated and the others are simply set at zero. Let the estimated weights have
indices \hat{t}_j, j = 1, \ldots, k, with k < N. Denote this collection of indices as J and its
complement with respect to the full set of indices as J^c. Let J_1 = J_o \cap J be the set of
the indices of Wiener coefficients that are being estimated. Denote by J_2 = J_o \cap J^c the
collection of the indices of those Wiener coefficients that are not estimated. Let
J_3 = J \cap J_o^c be the set of indices of those coefficients that are estimated but have
no corresponding nonzero Wiener coefficient. The asymptotic filter performance is
then

\lim_{n\to\infty} E\{e(n)^2\} = \sigma_d^2 \left( 1 + \frac{\mu \sigma_u^2}{2} |J_1 \cup J_3| \right) + \sigma_u^2 \sum_{j \in J_2} w_o(j)^2. \qquad (5.24)
Clearly, the asymptotic performance can be further reduced by making J_3 the empty
set. If the Wiener coefficient is zero, it should not be estimated. The contribution of
the summation over J_2 to the asymptotic performance is the bias error. The adapted
model set does not include the actual Wiener filter, hence the bias terms in the
asymptotic performance. The contribution of the bias error to the overall performance
can be minimized by removing from J_2 every index j \in J_2 for which 2 w_o(j)^2 > \mu \sigma_d^2.
From the above, it follows that if a tap weight contributes less than the expected
parameter variance, it should not be estimated. This observation will guide the tap
selection procedure. It is clear that, to implement the procedure, it will be necessary
to estimate the variance of the disturbance \sigma_d^2. Moreover, in general, it transpires that
the best adaptive filter performance may be achieved by a filter that has
fewer taps than the Wiener filter. Structural consistency is therefore not an essential
property to aim for, although structural consistency is definitely better than having
too many parameters estimated.
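The trade-off just described can be made concrete with the steady-state expressions above. The sketch below is a numerical illustration with arbitrarily chosen parameters; it assumes the steady-state forms read here, a variance cost of mu*sigma_d^2*sigma_u^2/2 per adapted tap plus a bias cost of sigma_u^2*w_o(j)^2 per omitted active tap, compares a full N-tap LMS filter with a filter adapting only the active taps, and applies the 2*w_o(j)^2 > mu*sigma_d^2 rule to decide whether a small tap is worth estimating:

```python
# Steady-state MSE bookkeeping for LMS with and without tap selection.
sigma_d2 = 0.01      # disturbance variance (assumed)
sigma_u2 = 1.0       # input variance (assumed)
mu = 0.005           # step-size (assumed)
N = 300              # total number of taps
w_o = {3: 0.9, 40: -0.4, 120: 0.004}   # sparse Wiener solution (illustrative)

def mse(estimated):
    """Asymptotic MSE: variance cost for every adapted tap plus bias
    cost for every non-adapted nonzero Wiener tap."""
    variance = sigma_d2 * mu * sigma_u2 / 2 * len(estimated)
    bias = sigma_u2 * sum(w * w for j, w in w_o.items() if j not in estimated)
    return sigma_d2 + variance + bias

mse_full = mse(set(range(N)))      # all N taps adapted
mse_oracle = mse(set(w_o))         # only the active taps adapted
# Selection rule: adapt tap j only if 2 w_o(j)^2 > mu * sigma_d^2.
selected = {j for j, w in w_o.items() if 2 * w * w > mu * sigma_d2}
mse_rule = mse(selected)
```

The tiny third tap fails the rule: its bias cost is smaller than the variance cost of estimating it, so dropping it improves even on the oracle sparse filter.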
same: Either the asymptotic performance and/or the convergence cost benefits from
eliminating all estimation of tap weights that contribute less than the expected noise
floor. Structural consistency does not lead to optimal LMS filter performance.
5.4.4
The previous discussion suggests detecting the active tap locations by considering the
following indicator:

X_n(j) = \frac{\left( \sum_{k=j+1}^{n} y(k)\, u(k-j) \right)^2}{(n-j) \sum_{k=j+1}^{n} u^2(k-j)}, \qquad (5.25)

where y(k) denotes the measured signal (the Wiener filter output plus disturbance).
Indeed, because of the discrete white noise character of u and the fact that the
disturbance and the input u are uncorrelated, it follows that X_n(j) converges in
probability as n \to \infty to w_o(j)^2 \sigma_u^2. It follows that, for sufficiently large n, the most
active taps can be simply ordered according to the size of X_n(j). This is formally
established in [26].
Clearly, this activity measure does not provide a means of detecting how many
taps are to be used in an LMS adapted filter. It does, though, provide a means of
detecting the m most active taps over a total FIR horizon of N, given both integers m
and N.
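The convergence of X_n(j) toward w_o(j)^2 sigma_u^2 is easy to observe in simulation. The sketch below is a self-contained toy example (the sparse filter, noise level, and sample size are arbitrary choices); it computes the indicator for every tap of a short white-input system and checks that the active taps dominate the ordering:

```python
import random

random.seed(7)

N = 12                       # total FIR horizon (illustrative)
w_o = {3: 0.8, 7: -0.5}      # sparse Wiener taps (illustrative)
n = 20000                    # number of samples
u = [random.gauss(0, 1) for _ in range(n)]            # white input, var 1
y = [sum(w * (u[k - j] if k - j >= 0 else 0.0) for j, w in w_o.items())
     + random.gauss(0, 0.5) for k in range(n)]        # measured signal

def activity(j):
    """Activity indicator X_n(j) in the spirit of eq. (5.25)."""
    num = sum(y[k] * u[k - j] for k in range(j, n)) ** 2
    den = (n - j) * sum(u[k - j] ** 2 for k in range(j, n))
    return num / den

X = [activity(j) for j in range(N)]
active = sorted(range(N), key=lambda j: -X[j])[:len(w_o)]
```

For the active taps the indicator settles near w_o(j)^2 (here sigma_u^2 = 1), while for inactive taps it decays like 1/n, so the ordering recovers the true support.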
Using a consistency argument, [26] argues and proves that declaring a tap j active
whenever X_n(j) exceeds a threshold of the order of \sigma_d^2 (\log n)/n provides a
consistent activity test. \qquad (5.26)
Combining the active tap detection result with an LMS update algorithm may be
achieved as follows:
Step 1. Detection at time n:
(a) Construct, for j \in [0, N-1],

X_n(j) = \frac{\left( \sum_{k=j+1}^{n} y(k)\, u(k-j) \right)^2}{(n-j) \sum_{k=j+1}^{n} u^2(k-j)}.

(b) Construct

K_n = \frac{1}{n} \sum_{j=1}^{n} y(j)^2.

(c) Construct

E_n = \frac{1}{n} \sum_{j=1}^{n} e(j)^2,

where y denotes the measured signal and e the LMS error signal.
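A minimal end-to-end version of this scheme can be sketched as follows. This is an illustration of the idea only, not the exact procedure of [26]: in particular, the detection threshold below is an ad hoc fraction of the largest indicator value rather than the consistent test (5.26). A first batch of data drives the activity indicator, the taps that clear the threshold are retained, and LMS adaptation then runs on the retained taps alone:

```python
import random

random.seed(11)

N, n_detect, n_adapt, mu = 16, 8000, 4000, 0.02
w_o = {2: 1.0, 9: -0.6}                    # sparse unknown system (illustrative)
u = [random.gauss(0, 1) for _ in range(n_detect + n_adapt)]
def measured(k):
    return sum(w * (u[k - j] if k - j >= 0 else 0.0)
               for j, w in w_o.items()) + random.gauss(0, 0.1)
y = [measured(k) for k in range(n_detect + n_adapt)]

# --- Step 1: activity detection on the first n_detect samples ----------
def indicator(j):
    num = sum(y[k] * u[k - j] for k in range(j, n_detect)) ** 2
    den = (n_detect - j) * sum(u[k - j] ** 2 for k in range(j, n_detect))
    return num / den

scores = [indicator(j) for j in range(N)]
threshold = 0.05 * max(scores)             # ad hoc threshold (assumption)
active = [j for j in range(N) if scores[j] > threshold]

# --- Step 2: LMS adaptation restricted to the detected taps ------------
w = {j: 0.0 for j in active}
for k in range(n_detect, n_detect + n_adapt):
    e = y[k] - sum(wj * u[k - j] for j, wj in w.items())
    for j in w:
        w[j] += mu * e * u[k - j]

err = sum((w_o.get(j, 0.0) - w.get(j, 0.0)) ** 2 for j in range(N)) ** 0.5
```

Only the truly active taps survive detection, so the adaptive filter carries two parameters instead of sixteen, with the correspondingly smaller asymptotic excess error discussed in Section 5.4.3.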
5.4.6
The above analysis critically depends on the whiteness of the input signal. In case
the input signal is not white, one could consider introducing an input prewhitening
filter before detection and LMS updates take place. This is common in the acoustic
echo-cancellation situation [28, 29]. As the introduction of prewhitening filters
significantly increases the computational complexity, a more direct approach with a
modified active tap test may be advantageous in other applications.
The main difficulty with the activity measure X_n(j) in the colored input case is
that the detection threshold (as defined in Theorem 3) is too low. Indeed, the
threshold has to be raised to

\frac{2\sigma_d^2 \log n}{n} \cdot \frac{\sum_{j=1}^{L} R(1,j)^2}{\sigma_u^4}.

Here L is the effective length of the autocorrelation function of the input signal. It
amounts to assuming that E\{u(n)u(n+L+j)\} = 0, or is negligible, for all integers
j \ge 0. Note that in case the signal is white, the above threshold is identical to the
threshold discussed in Section 5.4.5.
Unfortunately, with the new threshold, some inactive taps will necessarily be
labeled as active (for a signal with an autocorrelation length of the order of N, all taps
would be labeled active). Structural consistency is lost. In order to combat this, the
LMS estimated FIR filter weights can be used to obtain a better activity measure by
essentially eliminating the cross-correlation in the detection phase (this assumes that
the LMS estimation works, despite the extra estimated weights). Such a bootstrap
process appears to work in practice, but no formal result indicating structural
consistency is available. Extensive simulations are reported in [28].
5.4.7
Simulation Examples
The following examples are based on the algorithm suggested in Section 5.4.6,
compared with a standard LMS algorithm without active tap detection.
The design parameters in the active tap detection LMS algorithm are the
forgetting factor \alpha = 0.9 and the step-size \mu = 0.001.
The unknown FIR filter is represented in Figure 5.5.
The performance of the active tap detection LMS algorithm is represented in
Figure 5.6. Figure 5.6a corresponds to a signal environment in which both the input
Figure 5.5 Unknown FIR filter's parameters: N = 300; number of active taps m = 11.
Figure 5.6 \|\varepsilon(n)\|^2 for the active tap detection LMS algorithm applied to the sparse FIR
filter of Figure 5.5.
Figure 5.7 \|\varepsilon(n)\|^2 for the standard LMS algorithm applied to the sparse FIR filter of
Figure 5.5.
u and the disturbance signal are discrete white, uncorrelated, zero mean, unit
variance Gaussian processes. In Figure 5.6b the disturbance signal is a Gaussian
first-order AR process with AR coefficient 0.8, driven by a unit variance white noise.
Figure 5.6 displays the evolution of the parameter estimation error \|\varepsilon(n)\|^2. This
figure should be compared with Figure 5.7, which displays the same information for
the standard LMS algorithm. Clearly, the asymptotic performance is significantly
worse in the standard LMS case. Although not directly measurable, the asymptotic
performance is about m/N \approx 1/30 times better for the algorithm with detection. This
is in line with the theory. Observe also that the convergence time is about the same
for both algorithms. This clearly illustrates the observation that the dimension does
not affect the convergence cost in the white input case.
The active tap detection part of the algorithm is illustrated in Figure 5.8. For the
same filter circumstances as before, Figure 5.8a corresponds to a signal-to-noise ratio
of 1, while Figure 5.8b corresponds to a signal-to-noise ratio of 10. In both cases the
input and the disturbance are zero mean white Gaussian signals: \sigma_u^2 = 1, with
\sigma_d^2 = 1 in Figure 5.8a and \sigma_d^2 = 0.1 in Figure 5.8b. Reasonably quickly, the correct
number of taps is estimated. As illustrated in Figure 5.6, the parameters converge
quickly to the correct values as well.
Figure 5.8 Estimated number of active taps for the detection-enhanced LMS algorithm
applied to the sparse FIR filter of Figure 5.5.
5.5
DISCUSSION
REFERENCES
1. B. Widrow, S. Stearns, Adaptive Signal Processing, Englewood Cliffs, NJ, Prentice-Hall,
1985.
2. V. Solo, X. Kong, Adaptive Signal Processing Algorithms: Stability and Performance,
Englewood Cliffs, NJ, Prentice-Hall, 1995.
3. K. Wesolowski, C. M. Zhao, W. Rupprecht, Adaptive LMS transversal filters with
controlled length, IEE Proceedings-F, Vol. 139, pp. 233-239, 1992.
4. K. Fujii, J. Ohga, Equation for brief evaluation of the convergence rate of the normalised
LMS algorithm, IEICE Trans. Fundamentals, Vol. E76-A, pp. 2048-2051, 1993.
5. P. E. An, M. Brown, C. J. Harris, On the convergence rate performance of the
normalised least-mean-square adaptation, IEEE Trans. on Neural Networks, Vol. 8, pp.
1211-1214, 1997.
6. T. E. Hunter, D. A. Linebarger, An alternative formulation for low rank transform
domain adaptive filtering, Proceedings of ICASSP 2000, Vol. 1, pp. 29-32, Piscataway,
NJ, 2000.
7. S. F. Cotter, B. D. Rao, Matching pursuit based decision-feedback equalisers,
Proceedings of ICASSP 2000, Vol. 5, pp. 2713-2716, Piscataway, NJ, 2000.
8. S. Gollamudi, S. Nagaraj, S. Kapoor, Y. F. Huang, Set-membership filtering with a
set-membership normalised LMS algorithm with an adaptive step-size, IEEE Signal
Processing Letters, Vol. 5, pp. 111-114, 1998.
9. S. Ariyavisitakul, N. R. Sollenberger, L. J. Greenstein, Tap-selectable decision feedback
equaliser, Proceedings of ICC '97, Vol. 3, pp. 1521-1526, New York, 1997.
47. D. C. Farden, Stochastic approximation with correlated data, IEEE Trans. Information
Theory, Vol. 27, pp. 105-113, 1981.
48. S. K. Jones, R. K. Cavin, W. M. Reed, Analysis of error gradient adaptive linear
estimators for a class of stationary dependent processes, IEEE Trans. on Information
Theory, Vol. 28, pp. 318-329, 1982.
49. M. R. Leadbetter, G. Lindgren, H. Rootzen, Extremes and Related Properties of Random
Sequences and Processes, New York, Springer-Verlag, 1982.
CONTROL OF LMS-TYPE
ADAPTIVE FILTERS
EBERHARD HÄNSLER
Signal Theory Group, Darmstadt University of Technology, Darmstadt, Germany
6.1
INTRODUCTION
Adaptive filtering is a very powerful tool in modern signal processing, and its
importance is still increasing. In the past few decades, several algorithms like fast
recursive least squares, fast Newton, and affine projection algorithms have been
developed in order to achieve fast convergence at low or moderate computational
complexity. Nevertheless, due to its simplicity and its numerical robustness, the
least-mean-square (LMS) algorithm, especially its normalized version, the NLMS
algorithm, is still one of the most important adaptive algorithms.
In this chapter we will focus on control aspects of LMS-type adaptive filters. In
most real implementations the desired signal is distorted by measurement noise.
Depending on the application, the signal-to-noise ratio can even drop below 0 dB.
In order to achieve a high speed of convergence and a small steady-state
error in the presence of measurement noise, control is absolutely necessary. The
chapter is organized as follows:
In Section 1 we will briefly mention the relation of system design and its
impact on control. In particular, the choice of the processing and control
structure enables or disables several degrees of freedom, which can be
exploited for control purposes.
6.1.1
Notation
Among the enormous number of applications where the LMS or NLMS algorithm
can be utilized, we will address only system identification problems according to
[12] in this chapter. In Figure 6.1 the general setup as well as some notation issues
are depicted.
Several assumptions will be made in this chapter. Firstly, we will assume that the
system to be identified can be modeled with sufficient accuracy as a
linear finite impulse response (FIR) filter. Its impulse response will be denoted by
h_i(n). The subscript i addresses the ith coefficient of the impulse response at time
index n. We will not assume that we have time-invariant systems; therefore, we need
a time index as well as a coefficient index. The impulse response of the FIR filter can
be written as a vector:

h(n) = [h_0(n), h_1(n), \ldots, h_{N-1}(n)]^T. \qquad (6.1)
The output of the unknown system y(n) consists of the desired signal d(n) and
additional measurement noise n(n). We will distinguish here between stationary
measurement noise n_s(n) and nonstationary noise n_n(n):

y(n) = d(n) + n(n) = d(n) + n_s(n) + n_n(n). \qquad (6.2)

The desired signal is the convolution of the excitation signal with the impulse
response of the unknown system:

d(n) = \sum_{i=0}^{N-1} h_i(n)\, u(n-i) = h^T(n)\, u(n) = u^T(n)\, h(n). \qquad (6.3)

In the last line of Eq. (6.3), vector notation was also used for the excitation signal:

u(n) = [u(n), u(n-1), \ldots, u(n-N+1)]^T. \qquad (6.4)
In Table 6.1 the most important symbols as well as their meanings are listed.
TABLE 6.1 Notation

Symbol        Meaning
d(n)          Desired signal
d̂(n)          Output signal of the adaptive filter
e(n)          Error signal
h(n)          Impulse response vector of the unknown system
n(n)          Measurement noise
n_n(n)        Nonstationary part of the measurement noise
n_s(n)        Stationary part of the measurement noise
u(n)          Excitation signal
w(n)          Impulse response vector of the adaptive filter
w_o           Wiener solution
y(n)          Distorted output signal of the unknown system
Δ, Δ(n)       Regularization parameter
ε(n)          System mismatch vector
μ, μ(n)       Step-size
6.1.2
Control Structures
In this chapter, we will mention two possibilities for control purposes: weighting the
filter update by multiplication with a step-size and increasing the denominator of the
update term by regularization. Due to the scalar normalization within the NLMS
update, both forms of control can easily be exchanged (see Section 6.3.3).
Nevertheless, their practical implementations often differ. For this reason, we will
deal with both possibilities here. Furthermore, we will distinguish between a scalar
step-size and a step-size matrix.
6.1.2.1 Scalar Control Parameters For computing the filter update according to
the NLMS algorithm, the error signal e(n) is required:

e(n) = y(n) - \hat{d}(n) = h^T(n)\, u(n) + n(n) - w^T(n)\, u(n). \qquad (6.5)

For the impulse response of the adaptive filter w_i(n) the same (vector) notation as for
the impulse response of the unknown system is used:

w(n) = [w_0(n), w_1(n), \ldots, w_{N-1}(n)]^T. \qquad (6.6)
The NLMS update with scalar step-size \mu and regularization parameter \Delta then reads

w(n+1) = w(n) + \mu\, \frac{e(n)\, u(n)}{\|u(n)\|^2 + \Delta}. \qquad (6.7)
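A direct implementation of the scalar-controlled update (6.7) is short. The sketch below is a minimal system-identification demo (the unknown system, step-size, and regularization are arbitrarily chosen values, not parameters from the chapter); it adapts an N-tap filter toward an unknown FIR system and tracks the system mismatch:

```python
import random, math

random.seed(3)

N = 8
h = [0.5, -0.3, 0.2, 0.1, -0.1, 0.05, 0.02, -0.02]  # unknown system (assumed)
mu, delta = 0.5, 1e-3       # step-size and regularization (assumed values)
w = [0.0] * N               # adaptive filter
u_buf = [0.0] * N           # excitation vector u(n)

mismatch = []
for n in range(4000):
    u_buf = [random.gauss(0, 1)] + u_buf[:-1]
    y = sum(hi * ui for hi, ui in zip(h, u_buf)) + random.gauss(0, 0.01)
    e = y - sum(wi * ui for wi, ui in zip(w, u_buf))       # eq. (6.5)
    norm2 = sum(ui * ui for ui in u_buf)
    # NLMS update, eq. (6.7)
    w = [wi + mu * e * ui / (norm2 + delta) for wi, ui in zip(w, u_buf)]
    mismatch.append(math.sqrt(sum((a - b) ** 2 for a, b in zip(h, w))))
```

The normalization by the regressor energy makes the effective step independent of the excitation power, which is the practical appeal of NLMS over plain LMS.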
If each coefficient is to be controlled individually, a step-size matrix can be used
instead of the scalar step-size:

w(n+1) = w(n) + \mathrm{diag}\{\mu_0, \mu_1, \ldots, \mu_{N-1}\}\, \frac{e(n)\, u(n)}{\|u(n)\|^2 + \Delta}. \qquad (6.8)
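The matrix-controlled update (6.8) differs from (6.7) only in the per-coefficient weighting. The sketch below (again an illustrative toy with an ad hoc step-size profile, not a tuning from the chapter) gives larger step-sizes to the early coefficients, where the assumed impulse response concentrates its energy, and smaller ones to the tail:

```python
import random, math

random.seed(5)

N = 8
h = [0.8, -0.5, 0.3, 0.0, 0.0, 0.0, 0.0, 0.0]  # energy in the first taps (assumed)
mu_vec = [0.5] * 3 + [0.05] * (N - 3)          # per-coefficient step-sizes (ad hoc)
delta = 1e-3
w = [0.0] * N
u_buf = [0.0] * N

for n in range(4000):
    u_buf = [random.gauss(0, 1)] + u_buf[:-1]
    y = sum(hi * ui for hi, ui in zip(h, u_buf)) + random.gauss(0, 0.01)
    e = y - sum(wi * ui for wi, ui in zip(w, u_buf))
    norm2 = sum(ui * ui for ui in u_buf)
    # Matrix step-size update, eq. (6.8): diag{mu_i} replaces the scalar mu.
    w = [wi + mi * e * ui / (norm2 + delta)
         for wi, mi, ui in zip(w, mu_vec, u_buf)]

final_mismatch = math.sqrt(sum((a - b) ** 2 for a, b in zip(h, w)))
```

The energetic taps converge quickly under the large step-sizes, while the tail taps, updated cautiously, accumulate very little gradient noise; this is the "delay resolution" advantage discussed below.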
6.1.3 Processing Structures
Besides selecting different control structures, the system designer also has the
possibility of choosing between different processing structures: fullband processing,
block processing, and subband processing.
TABLE 6.2 Overview of the control possibilities of the different processing structures.
For fullband (Sec. 6.1.3.1), block (Sec. 6.1.3.2), and subband (Sec. 6.1.3.3) processing,
the table rates the achievable time, frequency, and delay resolution, both for a scalar
step-size and for a vector (matrix) step-size.
Before these three processing structures and the related control possibilities are
described in more detail in the next three subsections, Table 6.2 gives an overview of
the advantages and disadvantages of the different structures. The possibility of using
different control parameters for each coefficient w_i(n) of the filter vector w(n) is
referred to by the term delay resolution.
6.1.3.1 Fullband Processing Fullband processing structures, according to
Figure 6.1, offer the possibility of adjusting the control parameters \mu(n) and \Delta(n)
differently in each iteration. For this reason, fullband processing has the best time
resolution of all processing structures. If a matrix step-size is utilized (Eq. 6.8), the
additional degree of freedom can be exploited to adapt each coefficient w_i(n) of the
filter vector w(n) individually. Especially for impulse responses which concentrate
their energy on only a few coefficients (see Fig. 6.2), this is an important advantage
for control purposes.
In the left part of Figure 6.3 an impulse response of a loudspeaker-enclosure-microphone
system is depicted. Details of this kind of system are explained in
Section 6.4. We will use the impulse response in this and the next two subsections to
demonstrate the advantages and disadvantages of the different processing structures.
The two diagrams in the right part of Figure 6.3 show the control freedoms in the
delay-frequency domain. The term delay in this context represents the coefficient
index i of the impulse response h_i(n). If only scalar control parameters are used,
neither frequency-selective nor delay-selective control is possible. For this reason,
the delay-frequency domain is not segmented in the upper diagram. If a matrix
step-size is applied, selectivity in the delay direction is possible. The lower diagram
is therefore segmented vertically. Even if fullband processing structures do not have
the possibility of frequency-selective control, they have the very basic advantage of
not introducing any artificial delay into the signal paths. For some applications this is
a necessary feature.
Figure 6.3
6.1.3.2 Block Processing Long time-domain adaptive filters require huge
processing power due to their large number of coefficients. For many applications,
such as acoustic echo or noise control, algorithms with low numerical complexity
are necessary. To solve the complexity problem, adaptive filters based on block
processing [34, 35] can be used.
In general, most block processing algorithms collect B input signal samples
before they calculate a block of B output signal samples. Consequently, the filter is
adapted only once every B sampling instants. To reduce the computational complexity,
the convolution and the adaptation are performed in the frequency domain (see
Fig. 6.4).
Besides the advantage of reduced computational complexity, block processing
also has disadvantages. Because only one adaptation is computed every B
samples, the time resolution for control purposes is reduced. If the signal-to-noise
ratio changes in the middle of a signal block, for example, the control parameters
can only be adjusted to the mean signal-to-noise ratio (averaged over the block
length). Especially for a large block length B, and therefore a large reduction of
computational complexity, the impact of the reduced time resolution becomes clearly
apparent.
If a vector step-size is chosen and the filter update is performed in the frequency
domain, a new degree of freedom arises. Each frequency bin of the update of the
transformed filter vector W_b(e^{j2\pi m/B}, n) can be weighted individually. Especially if
the system has low-pass, bandpass, or high-pass character and the involved signals
are stationary, the convergence speed can be increased. In the left part of Figure 6.5,
the magnitude of the Fourier transform of the impulse response of Figure 6.3 is
depicted. The dark area represents the basic control area if a matrix step-size is used.
Figure 6.4 Block processing structure. To reduce computational complexity, the convolution and the adaptation are performed in the frequency domain.
Figure 6.5
As in Figure 6.3, two delay-frequency areas are depicted in the right part
of Figure 6.5. If the matrix step-size is applied in the frequency domain, the
delay-frequency area is split horizontally, showing the control freedom for individual
control possibilities over frequency.
Besides all the advantages of block processing, another inherent disadvantage of this
processing structure should also be mentioned. Due to the collection of B samples, a
significant delay is introduced into the signal paths.
6.1.3.3 Subband Processing In subband processing structures (see Fig. 6.6),
analysis filter banks split the involved signals into M subband channels,
with m \in \{0 \ldots M-1\}.
Figure 6.6
Subband structure.
184
In Figure 6.6 the subband signals are grouped in vectors. For example, the vector

u_{sb}(n) = [u_0(n), u_1(n), \ldots, u_{M-1}(n)]^T

collects all subband excitation signals (channel 0 to M-1) at the subsampled
time index n.
In contrast to block (frequency domain) processing, the subband structure offers
the system designer an additional degree of freedom. Detectors and control
mechanisms can be implemented separately for each channel. If matrix step-sizes
are applied in each channel, delay-selective control is also possible. In Figure 6.7 a
delay-frequency analysis of the impulse response of Figure 6.3 is depicted.
Even without using matrix step-sizes, it is possible to control each subband
individually. For this reason, the delay-frequency area is even for the case of scalar
control parameters segmented horizontally. Also, the orders of the adaptive lters
can be adjusted individually in each channel according to the statistical properties of
the excitation signal, the measurement noise, and the impulse response of the system
to be identied. If matrix step-size control is applied, the delay-frequency area can
also be segmented vertically.
Using subsampled signals leads to a reduction of computational complexity. All
necessary forms of control and detection can operate independently in each channel.
The price to be paid for these advantages is a significant delay introduced into the signal path by the analysis and synthesis filter banks.
Figure 6.7 Delay-frequency analysis of the impulse response of Figure 6.3.
6.1.4 Control Principles
Besides the different processing and control structures, the system designer can also choose between different control principles. In contrast to the processing and control structure (which should be matched to the application), the authors strongly recommend the use of a state-dependent control strategy (as described in Subsection 6.1.4.2 as well as in the rest of this chapter). Nevertheless, in a very few applications a binary (on/off) control can be applied.
6.1.4.1 Binary Control Strategy  One basic control principle that is often used in real implementations is simply to switch the step-size or the regularization parameter between two values: 0 and μ_fix, or ∞ and Δ_fix, respectively. Even though this method does not require explicit calculation of optimal control parameters, one should have a reliable indicator for the choice between the two values. An adaptation with the nonzero step-size μ_fix or the non-infinite regularization parameter Δ_fix should be performed in those iterations where the distance between the unknown and the adaptive system can be decreased on average; in all other cases, the filter should not be adapted (see Sec. 6.2). We will see in Section 6.3 that an adaptation step is successful according to the criterion mentioned above if the fixed control parameters are smaller than twice the optimal values for state-dependent control:
\[ 0 < \mu_{\mathrm{fix}} < 2\mu_{\mathrm{opt}}(n), \]
\[ 0 < \Delta_{\mathrm{fix}} < 2\Delta_{\mathrm{opt}}(n). \tag{6.9} \]
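As a minimal sketch of this principle (the function name and the boolean detector input are our illustrative choices; only the safety check of Eq. (6.9) comes from the text):

```python
def binary_step_size(adapt_now: bool, mu_fix: float, mu_opt: float) -> float:
    """Binary (on/off) control: return mu_fix when a detector signals that an
    adaptation step would decrease the system distance on average, and 0
    otherwise. The inequality 0 < mu_fix < 2 * mu_opt implements the
    success condition of Eq. (6.9)."""
    if adapt_now and 0.0 < mu_fix < 2.0 * mu_opt:
        return mu_fix
    return 0.0
```

Note that the detector itself (how `adapt_now` is derived) is application-specific; robust examples are discussed in Section 6.5.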
6.1.5 Concluding Remarks
The aim of this section was to provide a basic introduction to control and processing
structures for the NLMS algorithm. The system designer should be aware of the
alternatives that he or she can choose from, even at an early stage of the design
process. The optimal choice depends crucially on the application. Therefore, we will
give an application example and discuss the impacts of the choice of the processing
and control structure in Section 6.4. In the next sections, we will provide a general
understanding of the adaptation process in dependence on the control parameters
and the measurement noise.
6.2
In this section, convergence and stability issues for LMS-type adaptive filters are discussed. A variety of excellent convergence analyses and LMS derivations can be found in the literature (e.g., [9, 13, 23, 43]). Most of them are based on eigenvalue analyses of the autocorrelation matrix of the excitation signal and provide insight into the LMS (and NLMS) algorithm.
Our aim is to provide a basic understanding and a general overview of the convergence properties in the presence of measurement noise. In particular, the dependence of the convergence speed and the final misadjustment on the control parameters μ and Δ is investigated. For a better understanding, we will consider only scalar, time-invariant control parameters. Furthermore, only system identification problems according to [12] will be addressed here.
6.2.1
In the left part of Figure 6.8 the general structure of a classical system identification is depicted. Minimizing the expectation of the squared error signal,
\[ E\{e^2(n)\} \rightarrow \min, \tag{6.10} \]
leads to the Wiener solution
\[ W_{\mathrm{o}}(e^{j\Omega}) = \frac{S_{uy}(\Omega)}{S_{uu}(\Omega)}. \tag{6.11} \]
Figure 6.8 General structure of a classical system identification (left) and the definition of the system mismatch vector (right).
The quantities S_uu(Ω) and S_uy(Ω) denote the auto power spectral density and the cross power spectral density, respectively. If the measurement noise n(n) and the excitation signal u(n) are orthogonal, the frequency response of the optimal solution will be
\[ W_{\mathrm{o}}(e^{j\Omega}) = \frac{S_{ud}(\Omega)}{S_{uu}(\Omega)} = H(e^{j\Omega}). \tag{6.12} \]
In this case, the optimal solution for the adaptive filter will be an ideal copy of the unknown system. Furthermore, it should be mentioned that stationarity of the signals and time-invariant systems were assumed in the Wiener approach.
Even if the average power of the error signal e(n) is a valuable criterion for minimization purposes, it is not very useful if information about the convergence state (which is very important for control purposes) is wanted. A large power of the error signal may be due to poor system identification or may stem from a large measurement noise power. Therefore, a better procedure is to estimate the power of the undistorted error. This signal is defined as (see Fig. 6.8)
\[ e_{\mathrm{u}}(n) = e(n) - n(n) = d(n) - \hat{d}(n). \tag{6.13} \]
If the power of the undistorted error is zero, the output of the adaptive filter d̂(n) will be equal to the desired signal d(n). Besides the fact that the undistorted error power cannot be measured directly, a zero undistorted error does not necessarily mean that both filters are identical. They may differ from each other at frequencies that are not excited by u(n).
Another possible way of judging the convergence state is to monitor the system mismatch vector, which is defined as the difference of the impulse responses of the unknown and the adaptive filter:
\[ \boldsymbol{\varepsilon}(n) = \mathbf{h}(n) - \mathbf{w}(n). \tag{6.14} \]
In order to derive a scalar cost function, only the squared norm of this vector (called the system distance) is utilized. This quantity is independent of the properties of the excitation signal. Therefore, in the following derivations, we will try to minimize the expected system distance:
\[ E\{\|\boldsymbol{\varepsilon}(n)\|^2\} \rightarrow \min. \tag{6.15} \]

6.2.2
Using the definition of the system mismatch vector and assuming that the system is time-invariant, h(n + 1) = h(n), the equation for the NLMS filter update can be used to derive an iteration of the system mismatch vector:
\[ \boldsymbol{\varepsilon}(n+1) = \boldsymbol{\varepsilon}(n) - \mu\, \frac{e(n)\,\mathbf{u}(n)}{\|\mathbf{u}(n)\|^2 + \Delta}. \tag{6.16} \]
The aim of an adaptive algorithm should be to minimize the expected squared norm of the system mismatch vector. Using Eq. (6.16), the squared norm can be computed recursively:
\[ \|\boldsymbol{\varepsilon}(n+1)\|^2 = \boldsymbol{\varepsilon}^{\mathrm{T}}(n+1)\,\boldsymbol{\varepsilon}(n+1) \]
\[ = \boldsymbol{\varepsilon}^{\mathrm{T}}(n)\boldsymbol{\varepsilon}(n) - 2\mu\, \frac{e(n)\,\boldsymbol{\varepsilon}^{\mathrm{T}}(n)\,\mathbf{u}(n)}{\|\mathbf{u}(n)\|^2 + \Delta} + \mu^2\, \frac{e^2(n)\,\mathbf{u}^{\mathrm{T}}(n)\,\mathbf{u}(n)}{(\|\mathbf{u}(n)\|^2 + \Delta)^2} \]
\[ = \|\boldsymbol{\varepsilon}(n)\|^2 - 2\mu\, \frac{e(n)\,e_{\mathrm{u}}(n)}{\|\mathbf{u}(n)\|^2 + \Delta} + \mu^2\, \frac{e^2(n)\,\|\mathbf{u}(n)\|^2}{(\|\mathbf{u}(n)\|^2 + \Delta)^2}. \tag{6.17} \]
The error signal e(n) consists of its undistorted part e_u(n) and the measurement noise n(n):
\[ e(n) = e_{\mathrm{u}}(n) + n(n) = \boldsymbol{\varepsilon}^{\mathrm{T}}(n)\,\mathbf{u}(n) + n(n). \tag{6.18} \]
Using this definition and assuming further that u(n) and n(n) are statistically independent and zero-mean (m_u = 0, m_n = 0), the expected squared norm of the system mismatch vector can be written as
\[ E\{\|\boldsymbol{\varepsilon}(n+1)\|^2\} = E\{\|\boldsymbol{\varepsilon}(n)\|^2\} - 2\mu\, E\!\left\{\frac{\boldsymbol{\varepsilon}^{\mathrm{T}}(n)\,\mathbf{u}(n)\,\mathbf{u}^{\mathrm{T}}(n)\,\boldsymbol{\varepsilon}(n)}{\|\mathbf{u}(n)\|^2 + \Delta}\right\} + \mu^2\, E\!\left\{\frac{\left(\boldsymbol{\varepsilon}^{\mathrm{T}}(n)\,\mathbf{u}(n)\,\mathbf{u}^{\mathrm{T}}(n)\,\boldsymbol{\varepsilon}(n) + n^2(n)\right)\|\mathbf{u}(n)\|^2}{(\|\mathbf{u}(n)\|^2 + \Delta)^2}\right\}. \tag{6.19} \]
For large filter orders, N − 1 ≫ 1, the squared norm of the excitation vector can be approximated by a constant which is equal to N times the variance of the signal:
\[ \|\mathbf{u}(n)\|^2 \approx N\sigma_u^2. \tag{6.20} \]
With this approximation, recursion (6.19) becomes
\[ E\{\|\boldsymbol{\varepsilon}(n+1)\|^2\} \approx E\{\|\boldsymbol{\varepsilon}(n)\|^2\} - \left(\frac{2\mu}{N\sigma_u^2 + \Delta} - \frac{\mu^2 N\sigma_u^2}{(N\sigma_u^2 + \Delta)^2}\right) E\{e_{\mathrm{u}}^2(n)\} + \frac{\mu^2 N\sigma_u^2}{(N\sigma_u^2 + \Delta)^2}\, \sigma_n^2. \tag{6.21} \]
For white-noise excitation, the power of the undistorted error can be approximated as
\[ E\{e_{\mathrm{u}}^2(n)\} \approx \sigma_u^2\, E\{\|\boldsymbol{\varepsilon}(n)\|^2\}, \tag{6.22} \]
which leads to
\[ E\{\|\boldsymbol{\varepsilon}(n+1)\|^2\} \approx \left(1 - \frac{2\mu\sigma_u^2}{N\sigma_u^2 + \Delta} + \frac{\mu^2 N\sigma_u^4}{(N\sigma_u^2 + \Delta)^2}\right) E\{\|\boldsymbol{\varepsilon}(n)\|^2\} + \frac{\mu^2 N\sigma_u^2}{(N\sigma_u^2 + \Delta)^2}\, \sigma_n^2. \tag{6.23} \]
The first row in Eq. (6.23) shows the contraction due to the undistorted adaptation process. The factor
\[ A(\mu, \Delta, \sigma_u^2, N) = 1 - \frac{2\mu\sigma_u^2}{N\sigma_u^2 + \Delta} + \frac{\mu^2 N\sigma_u^4}{(N\sigma_u^2 + \Delta)^2} \tag{6.24} \]
will be called the contraction parameter and should always be smaller than 1. The second row in Eq. (6.23) describes the influence of the measurement noise. This signal disturbs the adaptation process. After introducing the abbreviation
\[ B(\mu, \Delta, \sigma_u^2, N) = \frac{\mu^2 N\sigma_u^4}{(N\sigma_u^2 + \Delta)^2}, \tag{6.25} \]
which is called the expansion parameter, Eq. (6.23) can be written in a shorter form:
\[ E\{\|\boldsymbol{\varepsilon}(n+1)\|^2\} \approx \underbrace{A(\mu,\Delta,\sigma_u^2,N)}_{\text{contraction parameter}}\, E\{\|\boldsymbol{\varepsilon}(n)\|^2\} + \underbrace{B(\mu,\Delta,\sigma_u^2,N)}_{\text{expansion parameter}}\, \frac{\sigma_n^2}{\sigma_u^2}. \tag{6.26} \]
The contraction and expansion parameters are dimensionless quantities, and both depend on the control variables μ and Δ as well as on the filter order N − 1 and the excitation power σ_u². If the influence of the measurement noise is to be eliminated completely, the expansion parameter B(μ, Δ, σ_u², N) has to be zero. This can be achieved by setting the step-size to zero or the regularization parameter to infinity. Unfortunately, with these choices the filter will no longer be adapted. In Figure 6.9 the values of the contraction and expansion parameters for a filter length of N = 100 and an input power of σ_u² = 1 are depicted.
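The two parameter functions are easy to evaluate directly; the following sketch (function names are ours) reproduces Eqs. (6.24) and (6.25):

```python
def contraction(mu, delta, sigma_u2, N):
    """Contraction parameter A(mu, Delta, sigma_u^2, N), Eq. (6.24)."""
    d = N * sigma_u2 + delta
    return 1.0 - 2.0 * mu * sigma_u2 / d + mu**2 * N * sigma_u2**2 / d**2

def expansion(mu, delta, sigma_u2, N):
    """Expansion parameter B(mu, Delta, sigma_u^2, N), Eq. (6.25)."""
    d = N * sigma_u2 + delta
    return mu**2 * N * sigma_u2**2 / d**2

# Uncontrolled adaptation (mu = 1, Delta = 0) with N = 100, sigma_u^2 = 1:
A = contraction(1.0, 0.0, 1.0, 100)   # 1 - 2/N + 1/N = 0.99
B = expansion(1.0, 0.0, 1.0, 100)     # 1/N = 0.01
```

Setting μ = 0 makes the expansion parameter vanish, but it also makes A = 1, i.e., the filter is no longer adapted, exactly as stated above.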
Figure 6.9 Contraction and expansion parameters. The upper two diagrams show surface plots of both parameter functions for a fixed excitation power σ_u² = 1 and a fixed filter length of N = 100. In the lower two diagrams the corresponding plots are depicted. For fast convergence, a compromise between fastest contraction (A(μ, Δ, σ_u², N) → min) and no influence of the measurement noise (B(μ, Δ, σ_u², N) → 0) has to be found. The optimal compromise will depend on the convergence state of the filter (see Sect. 6.3).
For fast convergence, a compromise between the fastest contraction,
\[ A(\mu, \Delta, \sigma_u^2, N) \rightarrow \min, \tag{6.27} \]
and no influence of the measurement noise,
\[ B(\mu, \Delta, \sigma_u^2, N) \rightarrow 0, \tag{6.28} \]
has to be found. The optimal compromise will depend on the convergence state of the filter. In Section 6.3 this question will be answered. Here we will first investigate the convergence of the filter for fixed control parameters. We can therefore solve the
recursion (6.26) explicitly:
\[ E\{\|\boldsymbol{\varepsilon}(n)\|^2\} \approx A^n(\mu,\Delta,\sigma_u^2,N)\, E\{\|\boldsymbol{\varepsilon}(0)\|^2\} + B(\mu,\Delta,\sigma_u^2,N)\, \frac{\sigma_n^2}{\sigma_u^2} \sum_{m=0}^{n-1} A^m(\mu,\Delta,\sigma_u^2,N). \tag{6.29} \]
For stable adaptation (A(μ, Δ, σ_u², N) < 1), the system distance converges toward
\[ \lim_{n\to\infty} E\{\|\boldsymbol{\varepsilon}(n)\|^2\} = \frac{B(\mu,\Delta,\sigma_u^2,N)}{1 - A(\mu,\Delta,\sigma_u^2,N)}\, \frac{\sigma_n^2}{\sigma_u^2} = \frac{\mu\, N\, \sigma_n^2}{(2-\mu)\, N\sigma_u^2 + 2\Delta}. \tag{6.30} \]
For the uncontrolled adaptation (μ = 1, Δ = 0), the final system distance will be equal to the inverse of the signal-to-noise ratio:
\[ \lim_{n\to\infty} E\{\|\boldsymbol{\varepsilon}(n)\|^2\}\big|_{\mu=1,\,\Delta=0} = \frac{\sigma_n^2}{\sigma_u^2}. \tag{6.31} \]
Finally, two simulation examples are presented in Figure 6.10 to indicate the validity of approximation (6.29). In both cases, white noise was chosen for the excitation and the measurement noise, with a signal-to-noise ratio of 30 dB. In the first simulation, the parameter set μ1 = 0.7 and Δ1 = 400 has been used. According to approximation (6.30), a final misadjustment of about −35 dB should be achieved. For the second parameter set the values are μ2 = 0.4 and Δ2 = 800. With this set a final misadjustment of about −39 dB is achievable.
In the lowest diagram of Figure 6.10, the measured system distance as well as its theoretical progression are depicted. The theoretical and measured curves mostly overlay; the maximal (logarithmic) difference is about 3 dB.
6.2.3
The convergence speed is of special importance at the start of an adaptation and after changes of the system h(n). In these situations, we will assume that the influence of the measurement noise can be neglected. Therefore, the recursive computation of the expected system distance can be simplified to
\[ E\{\|\boldsymbol{\varepsilon}(n+1)\|^2\}\big|_{\sigma_n^2=0} = A(\mu,\Delta,\sigma_u^2,N)\, E\{\|\boldsymbol{\varepsilon}(n)\|^2\}, \tag{6.32} \]
which is solved by
\[ E\{\|\boldsymbol{\varepsilon}(n)\|^2\}\big|_{\sigma_n^2=0} = A^n(\mu,\Delta,\sigma_u^2,N)\, E\{\|\boldsymbol{\varepsilon}(0)\|^2\}. \tag{6.33} \]
Figure 6.10 Simulation examples and theoretical convergence. In order to validate approximation (6.29), two simulations are presented. White noise was used for the excitation (depicted in the upper diagram) and the measurement noise (presented in the middle diagram), with a signal-to-noise ratio of 30 dB. In the lower diagram the measured system distance as well as its theoretical progression are depicted. The theoretical and measured curves mostly overlay. Also, the predicted final misadjustments according to approximation (6.30) coincide with the measured ones very well.
The contraction parameter is smaller than 1, i.e., the adaptation process is stable, only if the step-size satisfies
\[ \mu < 2 + \frac{2\Delta}{N\sigma_u^2}. \tag{6.34} \]
For the fastest convergence, the contraction parameter has to be minimized with respect to both control parameters:
\[ \frac{\partial A(\mu,\Delta,\sigma_u^2,N)}{\partial \mu}\bigg|_{\mu=\mu_{\mathrm{opt}}} = 0, \qquad \frac{\partial A(\mu,\Delta,\sigma_u^2,N)}{\partial \Delta}\bigg|_{\Delta=\Delta_{\mathrm{opt}}} = 0. \tag{6.35} \]
After inserting the definition of the contraction parameter (Eq. 6.24) and equating both derivatives to zero, we finally get maximal convergence speed if the step-size and the regularization parameter are related as
\[ \mu = 1 + \frac{\Delta}{N\sigma_u^2}. \tag{6.36} \]
With this relation, the contraction parameter takes its minimal value A(μ_opt, Δ_opt, σ_u², N) = 1 − 1/N, and the fastest possible decrease of the system distance can be stated as
\[ E\{\|\boldsymbol{\varepsilon}(n+2N)\|^2\} \approx \left(1 - \frac{1}{N}\right)^{2N} E\{\|\boldsymbol{\varepsilon}(n)\|^2\} \approx e^{-2}\, E\{\|\boldsymbol{\varepsilon}(n)\|^2\}. \tag{6.41} \]
This means that after a number of iterations equal to two times the filter order, a decrease of the system distance of about 10 dB can be achieved. We can also learn from approximation (6.39) that short filters converge much faster than long ones. In order to elucidate this relationship, two convergences with different filter orders (N = 500 and N = 1000) are presented in Figure 6.12.
To show the validity of approximation (6.41), triangles with edge lengths of 10 dB and 2N are added to the convergence plots. Especially at the start of the convergence, the theoretical decreases of the system distance fit the measured ones very well.
6.2.4
Figure 6.12 Two convergences with different filter orders are depicted (upper diagram: N = 500; lower diagram: N = 1000). White noise was used for the excitation signal as well as for the measurement noise, with a signal-to-noise ratio of about 30 dB. To show the validity of approximation (6.41), triangles with edge lengths of 10 dB and 2N are added to the convergence plots. Especially at the start of the convergence, the theoretical decreases of the system distance fit the measured ones very well.
If only step-size control is applied (Δ = 0), recursion (6.23) simplifies to
\[ E\{\|\boldsymbol{\varepsilon}(n+1)\|^2\}\big|_{\Delta=0} \approx \left(1 - \frac{\mu(2-\mu)}{N}\right) E\{\|\boldsymbol{\varepsilon}(n)\|^2\}\big|_{\Delta=0} + \frac{\mu^2\,\sigma_n^2}{N\sigma_u^2}. \tag{6.42} \]
For 0 < μ < 2 the system distance converges for n → ∞ toward
\[ \lim_{n\to\infty} E\{\|\boldsymbol{\varepsilon}(n)\|^2\}\big|_{\Delta=0} = \frac{\mu\,\sigma_n^2}{(2-\mu)\,\sigma_u^2}. \tag{6.43} \]
In Figure 6.13 three convergences with different choices for the step-size (μ = 1, μ = 0.5, and μ = 0.25) are presented. The same boundary conditions (white noise for the excitation and the measurement noise, 30 dB signal-to-noise ratio, N = 1000) were used as in the first simulation series of Subsection 6.2.2.
According to approximation (6.43), a final misadjustment of the filter of
−30 dB for the choice μ = 1,
−34.8 dB for the choice μ = 1/2, and
−38.5 dB for the choice μ = 1/4
should be achieved. The speed of the initial convergence can be computed as
\[ 10 \log_{10}\!\left(1 - \frac{\mu(2-\mu)}{N}\right) \frac{\mathrm{dB}}{\mathrm{iteration}}. \tag{6.44} \]
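The three misadjustment values quoted above follow directly from approximation (6.43); a minimal check (the function name is ours):

```python
import math

def final_misadjustment_db(mu, snr_db):
    """Final system distance in dB for step-size-only control, Eq. (6.43):
    mu / (2 - mu) times the inverse signal-to-noise ratio."""
    inv_snr = 10.0 ** (-snr_db / 10.0)      # sigma_n^2 / sigma_u^2
    return 10.0 * math.log10(mu / (2.0 - mu) * inv_snr)

for mu in (1.0, 0.5, 0.25):
    print(mu, round(final_misadjustment_db(mu, 30.0), 1))
    # yields -30.0, -34.8, and -38.5 dB for the three step-sizes
```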
Figure 6.13 Three convergences without regularization but with different choices for the step-size (μ = 1, μ = 0.5, and μ = 0.25) are depicted. The same boundary conditions (white noise for the excitation and the measurement noise, 30 dB signal-to-noise ratio, N = 1000) were used as in the first simulation series of Subsection 6.2.2. A final misadjustment of the adaptive filter smaller than the signal-to-noise ratio can be achieved only if the step-size is smaller than 1. This leads to a decreased speed of convergence.
6.2.5
By analogy with the previous subsection, the recursive approximation of the mean squared norm of the system mismatch vector can be simplified if only regularization control (μ = 1) is applied:
\[ E\{\|\boldsymbol{\varepsilon}(n+1)\|^2\}\big|_{\mu=1} \approx \left(1 - \frac{N\sigma_u^4 + 2\sigma_u^2\Delta}{(N\sigma_u^2 + \Delta)^2}\right) E\{\|\boldsymbol{\varepsilon}(n)\|^2\}\big|_{\mu=1} + \frac{N\sigma_u^4}{(N\sigma_u^2 + \Delta)^2}\, \frac{\sigma_n^2}{\sigma_u^2}. \tag{6.45} \]
For Δ ≥ 0 the system distance converges for n → ∞ toward
\[ \lim_{n\to\infty} E\{\|\boldsymbol{\varepsilon}(n)\|^2\}\big|_{\mu=1} = \frac{N\sigma_u^2}{N\sigma_u^2 + 2\Delta}\, \frac{\sigma_n^2}{\sigma_u^2}. \tag{6.46} \]
As in the previous section, three simulations with only regularization control are presented. In Figure 6.14 the convergences with the same boundary conditions (except for the choice of the control parameters) as in the previous example are depicted.
The simulation runs were performed with the choices Δ = 0, Δ = 1000, and Δ = 4000. According to these values and approximation (6.46), final misadjustments of the adaptive filter of −30 dB, −34.8 dB, and −39.5 dB should be achievable. The initial convergence speed (see recursion (6.45)) can be stated as
\[ 10 \log_{10}\!\left(1 - \frac{N\sigma_u^4 + 2\sigma_u^2\Delta}{(N\sigma_u^2 + \Delta)^2}\right) \frac{\mathrm{dB}}{\mathrm{iteration}}. \tag{6.47} \]
As in Figure 6.13, the expected final misadjustments and the convergence speeds are also depicted in Figure 6.14.
6.2.6 Concluding Remarks
In this section the convergence properties of the NLMS algorithm in the presence of measurement noise were investigated. Approximations for the speed of convergence, as well as for the final misadjustment of the adaptive filter as a function of the control parameters, were derived.

Figure 6.14 Three convergences without step-size control but with different choices for the regularization parameter (Δ = 0, Δ = 1000, and Δ = 4000) are depicted. The same boundary conditions (white noise for the excitation and the measurement noise, 30 dB signal-to-noise ratio, N = 1000) as in Figure 6.13 were used. A final misadjustment of the adaptive filter smaller than the signal-to-noise ratio can be achieved only if the regularization parameter is chosen larger than 0. As in the case of step-size control, this leads to a decreased speed of convergence.

It was shown that with parameter sets which lead to an optimal initial convergence speed (e.g., μ = 1, Δ = 0), the final misadjustment cannot be smaller than the inverse of the signal-to-noise ratio.
Especially for small signal-to-noise ratios, this might not be sufficient for adequate system identification. Reducing the step-size or increasing the regularization parameter leads to smaller final misadjustments, but at the same time to a reduced convergence speed.
A fast speed of convergence as well as good steady-state behavior can be achieved only if time-variant control parameters are used. If only step-size control (Δ = 0) is implemented, one should start with a large step-size (μ = 1). The smaller the system distance becomes, the more the step-size should be reduced. Similar strategies should be applied if only regularization control or hybrid control is implemented. In the next section, optimal choices for the step-size and the regularization parameter in dependence on the convergence state and the signal-to-noise ratio will be derived.
6.3
In the previous section, two competing demands were made on the step-size and the regularization parameter:
In order to achieve a fast initial convergence or a fast readaptation after system changes, a large step-size (μ = 1) and a small regularization parameter (Δ = 0) should be used.
To achieve a small final misadjustment, ||ε(n)||² → 0, a small step-size (μ → 0) and/or a large regularization parameter (Δ → ∞) is necessary.
Both requirements cannot be fulfilled with fixed (time-invariant) control parameters. Therefore, a time-variant step-size and a time-variant regularization parameter,
\[ \mu \rightarrow \mu(n), \qquad \Delta \rightarrow \Delta(n), \]
as well as optimal choices for them are introduced in this section. In most system designs, only step-size or only regularization control is used. For this reason, in the first two subsections optimal choices for both control parameters are derived. As suggested in the last section, the optimization criterion will be the minimization of the expected system distance.
Both control strategies can easily be exchanged, as shown in Subsection 6.3.3. Nevertheless, their practical implementations often differ. Even if in most cases only step-size control is implemented, in some situations control by regularization or a mixture of both is a good choice as well. Therefore, in the last part of this subsection, some hints concerning the choice of the control structure are given.
6.3.1

If only step-size control is applied (Δ(n) = 0), the NLMS filter update is
\[ \mathbf{w}(n+1) = \mathbf{w}(n) + \mu(n)\, \frac{e(n)\,\mathbf{u}(n)}{\|\mathbf{u}(n)\|^2}. \tag{6.48} \]
Supposing that the system is time-invariant (h(n + 1) = h(n)), the recursion of the expected system distance can be denoted as follows (see Eq. 6.17, Sec. 6.2.2, for Δ = 0):
\[ E\{\|\boldsymbol{\varepsilon}(n+1)\|^2\} = E\{\|\boldsymbol{\varepsilon}(n)\|^2\} - 2\mu(n)\, E\!\left\{\frac{e(n)\,e_{\mathrm{u}}(n)}{\|\mathbf{u}(n)\|^2}\right\} + \mu^2(n)\, E\!\left\{\frac{e^2(n)}{\|\mathbf{u}(n)\|^2}\right\}. \tag{6.49} \]
For determining an optimal step-size [47], the cost function, that is, the squared norm of the system mismatch vector, should decrease (to be precise, should not increase) on average in every iteration step:
\[ E\{\|\boldsymbol{\varepsilon}(n+1)\|^2\} - E\{\|\boldsymbol{\varepsilon}(n)\|^2\} \le 0. \tag{6.50} \]
Inserting the recursive expression for the expected system distance (Eq. 6.49) into relation (6.50) leads to
\[ \mu^2(n)\, E\!\left\{\frac{e^2(n)}{\|\mathbf{u}(n)\|^2}\right\} - 2\mu(n)\, E\!\left\{\frac{e(n)\,e_{\mathrm{u}}(n)}{\|\mathbf{u}(n)\|^2}\right\} \le 0. \tag{6.51} \]
6:51
eneu n
E
kunk2
:
0 m n 2 2
e n
E
kunk2
6:52
The largest decrease of the system distance is achieved in the middle of the defined interval. To prove this statement, we differentiate the expected system distance at time index n + 1. Setting this derivative to zero leads (due to the quadratic form of the cost function) to the optimal step-size:
\[ \frac{\partial E\{\|\boldsymbol{\varepsilon}(n+1)\|^2\}}{\partial \mu(n)}\bigg|_{\mu(n)=\mu_{\mathrm{opt}}(n)} = 0. \tag{6.53} \]
We assume that the step-sizes at different time instants are uncorrelated. Therefore the expected system distance at time index n does not depend on the step-size μ(n) (only on μ(n − 1), μ(n − 2), . . .). Using this assumption and the recursive expression (6.49), we obtain
\[ 2\mu_{\mathrm{opt}}(n)\, E\!\left\{\frac{e^2(n)}{\|\mathbf{u}(n)\|^2}\right\} - 2\, E\!\left\{\frac{e(n)\,e_{\mathrm{u}}(n)}{\|\mathbf{u}(n)\|^2}\right\} = 0, \]
\[ \mu_{\mathrm{opt}}(n)\, E\!\left\{\frac{e^2(n)}{\|\mathbf{u}(n)\|^2}\right\} = E\!\left\{\frac{e(n)\,e_{\mathrm{u}}(n)}{\|\mathbf{u}(n)\|^2}\right\}, \]
\[ \mu_{\mathrm{opt}}(n) = \frac{E\!\left\{\dfrac{e(n)\,e_{\mathrm{u}}(n)}{\|\mathbf{u}(n)\|^2}\right\}}{E\!\left\{\dfrac{e^2(n)}{\|\mathbf{u}(n)\|^2}\right\}}. \tag{6.54} \]
If we assume that the squared norm of the excitation vector ||u(n)||² can be approximated by a constant, and that the signals n(n) and u(n) (and therefore also n(n) and e_u(n)) are uncorrelated, the optimal step-size can be simplified as follows:
\[ \mu_{\mathrm{opt}}(n) \approx \frac{E\{e_{\mathrm{u}}^2(n)\}}{E\{e^2(n)\}}. \tag{6.55} \]
In the absence of measurement noise (n(n) = 0), the distorted error signal e(n) equals the undistorted error signal e_u(n), and the optimal step-size is 1. If the adaptive filter is well adjusted to the impulse response of the unknown system, the power of the residual error signal e_u(n) is very small. In the presence of measurement noise, the power of the distortion n(n), and therefore also the power of the distorted error signal e(n), increases. In this case, the numerator of Eq. (6.54) is visibly smaller than the denominator, resulting in a step-size close to 0, so the filter is not changed, or only marginally, in such situations. Both examples show that Eq. (6.55) is (at least for these two boundary cases) a useful approximation.
To show the advantages of a time-variant step-size control, the simulation example of Section 6.2.4 (see Fig. 6.13) is repeated. This time a fourth convergence curve, where the step-size is estimated by
\[ \hat{\mu}_{\mathrm{opt}}(n) = \frac{\overline{e_{\mathrm{u}}^2}(n)}{\overline{e^2}(n)}, \tag{6.56} \]
is added. The terms \(\overline{e_{\mathrm{u}}^2}(n)\) and \(\overline{e^2}(n)\) denote short-term smoothed powers (first-order IIR filters) of the squared error signals:
\[ \overline{e_{\mathrm{u}}^2}(n) = \gamma\, \overline{e_{\mathrm{u}}^2}(n-1) + (1-\gamma)\, e_{\mathrm{u}}^2(n), \tag{6.57} \]
\[ \overline{e^2}(n) = \gamma\, \overline{e^2}(n-1) + (1-\gamma)\, e^2(n). \tag{6.58} \]
The time constant was set to γ = 0.995. The resulting step-size μ̂_opt(n) is depicted in the lower diagram of Figure 6.15. At the beginning of the simulation the short-term power of the undistorted error is much larger than that of the measurement noise. Therefore the step-size μ̂_opt(n) is close to 1, and a very fast initial convergence (comparable to the case Δ = 0 and μ = 1) can be achieved (see the upper diagram of Fig. 6.15). The better the filter converges, the smaller the undistorted error becomes. With decreasing error power, the influence of the measurement noise in Eq. (6.58) increases. This leads to a decrease of the step-size parameter μ(n).
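The estimator (6.56)–(6.58) can be sketched as a small class. The class name, the floor value, and the clipping at 1 are our illustrative choices; note also that in practice e_u(n) itself is not measurable and must be estimated (see Sect. 6.5):

```python
class StepSizeEstimator:
    """Pseudo-optimal step-size, Eq. (6.56), with first-order IIR smoothing
    of the squared error signals, Eqs. (6.57)/(6.58)."""

    def __init__(self, gamma=0.995, floor=1e-12):
        self.gamma = gamma
        self.floor = floor      # avoids division by zero at start-up
        self.p_eu = 0.0         # smoothed power of the undistorted error
        self.p_e = 0.0          # smoothed power of the distorted error

    def update(self, e, e_u):
        g = self.gamma
        self.p_eu = g * self.p_eu + (1.0 - g) * e_u * e_u
        self.p_e = g * self.p_e + (1.0 - g) * e * e
        return min(self.p_eu / max(self.p_e, self.floor), 1.0)
```

Without measurement noise (e = e_u) the estimate tends to 1; when the distorted error is dominated by noise it tends toward 0, matching the two boundary cases discussed for Eq. (6.55).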
Figure 6.15 Convergence using a time-variant step-size. To show the advantages of time-variant step-size control, the simulation example of Section 6.2.4 is repeated. This time a fourth convergence curve, where the step-size is estimated as proposed in Eq. (6.56), is added. The resulting step-size is depicted in the lower diagram. If we compare the three curves with fixed step-sizes (dotted lines) and the convergence curve with the time-variant step-size (solid line), the advantage of a time-variant step-size control is clearly visible.
Due to the reduction of the step-size, the system distance can be further reduced. If we compare the convergence curves with fixed step-sizes and the curve with a time-variant step-size (see the upper diagram of Fig. 6.15), the advantage of a time-variant control is clearly visible. A step-size control based on Eq. (6.55) is able to achieve a fast initial convergence (and also a fast readaptation after system changes) as well as a good steady-state performance.
Unfortunately, neither the undistorted error signal e_u(n) nor its power is accessible in most real system identification problems. In Section 6.5, methods for estimating the power of this signal, and therefore also the optimal step-size, are presented for the application of acoustic echo control.
6.3.2

If only regularization control is applied (μ(n) = 1), the NLMS filter update is
\[ \mathbf{w}(n+1) = \mathbf{w}(n) + \frac{e(n)\,\mathbf{u}(n)}{\|\mathbf{u}(n)\|^2 + \Delta(n)}, \tag{6.59} \]
and the iteration of the system mismatch vector becomes
\[ \boldsymbol{\varepsilon}(n+1) = \boldsymbol{\varepsilon}(n) - \frac{e(n)\,\mathbf{u}(n)}{\|\mathbf{u}(n)\|^2 + \Delta(n)}. \tag{6.60} \]
6:60
As in the previous section, we approximate the squared norm of the excitation vector ||u(n)||² by a constant, and we suppose the signals n(n) and u(n) (and therefore also n(n) and e_u(n)) to be orthogonal. The derivative of the expected system distance with respect to the regularization parameter then simplifies to
\[ \frac{\partial E\{\|\boldsymbol{\varepsilon}(n+1)\|^2\}}{\partial \Delta(n)} = \frac{2\, E\{e_{\mathrm{u}}^2(n)\}}{(\|\mathbf{u}(n)\|^2 + \Delta(n))^2} - \frac{2\, \|\mathbf{u}(n)\|^2\, E\{e^2(n)\}}{(\|\mathbf{u}(n)\|^2 + \Delta(n))^3}. \tag{6.63} \]
Setting this derivative to zero,
\[ E\{e_{\mathrm{u}}^2(n)\}\, \left(\|\mathbf{u}(n)\|^2 + \Delta_{\mathrm{opt}}(n)\right) = \|\mathbf{u}(n)\|^2\, E\{e^2(n)\}, \tag{6.64} \]
and solving for the regularization parameter yields
\[ \Delta_{\mathrm{opt}}(n) = \frac{\left(E\{e^2(n)\} - E\{e_{\mathrm{u}}^2(n)\}\right)\|\mathbf{u}(n)\|^2}{E\{e_{\mathrm{u}}^2(n)\}}. \tag{6.65} \]
Due to the orthogonality of the distortion n(n) and the excitation signal u(n), the difference of the two expectations can be simplified:
\[ \Delta_{\mathrm{opt}}(n) = \frac{E\{n^2(n)\}\, \|\mathbf{u}(n)\|^2}{E\{e_{\mathrm{u}}^2(n)\}}. \tag{6.66} \]
If the excitation signal is white noise and the excitation vector and the system mismatch vector are assumed to be uncorrelated, the power of the undistorted error signal can be simplified:
\[ E\{e_{\mathrm{u}}^2(n)\} = E\{\boldsymbol{\varepsilon}^{\mathrm{T}}(n)\, \mathbf{u}(n)\, \mathbf{u}^{\mathrm{T}}(n)\, \boldsymbol{\varepsilon}(n)\} \tag{6.67} \]
\[ \approx E\{\boldsymbol{\varepsilon}^{\mathrm{T}}(n)\, E\{\mathbf{u}(n)\,\mathbf{u}^{\mathrm{T}}(n)\}\, \boldsymbol{\varepsilon}(n)\} = \sigma_u^2\, E\{\|\boldsymbol{\varepsilon}(n)\|^2\}. \tag{6.68} \]
With ||u(n)||² ≈ Nσ_u², the pseudo-optimal regularization parameter finally becomes
\[ \Delta_{\mathrm{opt}}(n) \approx N\, \frac{E\{n^2(n)\}}{E\{\|\boldsymbol{\varepsilon}(n)\|^2\}}. \tag{6.69} \]
As in the last subsection, the control based on approximation (6.69) will be compared with fixed regularization control approaches via a simulation. The power of the measurement noise was estimated using a first-order infinite impulse response (IIR) smoothing filter:
\[ \overline{n^2}(n) = \gamma\, \overline{n^2}(n-1) + (1-\gamma)\, n^2(n). \tag{6.70} \]
The time constant γ was set to γ = 0.995. Using this power estimation, an estimate of the optimal regularization parameter was computed as
\[ \hat{\Delta}_{\mathrm{opt}}(n) = N\, \frac{\overline{n^2}(n)}{\|\boldsymbol{\varepsilon}(n)\|^2}. \tag{6.71} \]
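A sketch of this regularization control (the class name is ours; the system distance ||ε(n)||² is passed in as an argument because its estimation is deferred to Section 6.5):

```python
class RegularizationEstimator:
    """Pseudo-optimal regularization, Eqs. (6.70)/(6.71)."""

    def __init__(self, N, gamma=0.995):
        self.N = N              # filter length
        self.gamma = gamma      # IIR smoothing constant
        self.noise_pow = 0.0    # smoothed measurement-noise power, Eq. (6.70)

    def update(self, n_sample, system_distance):
        g = self.gamma
        self.noise_pow = g * self.noise_pow + (1.0 - g) * n_sample ** 2
        return self.N * self.noise_pow / system_distance   # Eq. (6.71)
```

Only the measurement-noise power has to be tracked here, which is exactly the practical advantage over step-size control discussed in Subsection 6.3.3.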
6.3.3

If only step-size control or only regularization control is applied (see Eq. 6.48 and Eq. 6.59), the control parameters can easily be exchanged. The comparison of both update terms,
\[ \mu(n)\, \frac{e(n)\,\mathbf{u}(n)}{\|\mathbf{u}(n)\|^2} = \frac{e(n)\,\mathbf{u}(n)}{\|\mathbf{u}(n)\|^2 + \Delta(n)}, \tag{6.72} \]
leads to the equivalent step-size
\[ \mu(n) = \frac{\|\mathbf{u}(n)\|^2}{\|\mathbf{u}(n)\|^2 + \Delta(n)} \tag{6.73} \]
and, conversely, to the equivalent regularization parameter
\[ \Delta(n) = \|\mathbf{u}(n)\|^2\, \frac{1 - \mu(n)}{\mu(n)}. \tag{6.74} \]
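Equations (6.73) and (6.74) are inverse mappings of each other, which a quick sketch makes explicit (function names and the example values are ours):

```python
def mu_from_delta(delta, u_norm2):
    """Equivalent step-size for a given regularization, Eq. (6.73)."""
    return u_norm2 / (u_norm2 + delta)

def delta_from_mu(mu, u_norm2):
    """Equivalent regularization for a given step-size, Eq. (6.74)."""
    return u_norm2 * (1.0 - mu) / mu

# round trip, e.g. ||u(n)||^2 = 1000, Delta(n) = 400:
mu = mu_from_delta(400.0, 1000.0)     # 1000 / 1400
back = delta_from_mu(mu, 1000.0)      # 400 again
```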
these algorithms for numerically stabilizing the solution. As a second step, this stabilization can be utilized for control purposes. But also for the scalar inversion case, as in NLMS control, regularization might be superior to step-size control in some situations.
For the pseudo-optimal step-size, the short-term power of two signals has to be estimated: σ_e²(n) and σ_{e_u}²(n). If first-order IIR smoothing of the squared values, as presented in Subsection 6.3.1, or a rectangular window over the squared values is utilized for estimating the short-term power, the estimation process has an inherent inertia. After a sudden increase or decrease of the signal power, the estimation methods follow only with a certain delay.
On the other hand, control by regularization, as proposed in Eq. (6.71), requires only the power of the measurement noise. In applications with time-invariant measurement noise power but time-variant excitation power, control by regularization should be preferred. In Figure 6.17 a simulation with stationary measurement noise is presented. The excitation signal changes its power every 1000 iterations in order to have signal-to-noise ratios of −20 dB and 20 dB, respectively. The excitation signal and the measurement noise are depicted in the upper two diagrams of Figure 6.17.
The impulse responses of the adaptive filter and of the unknown system are causal and finite. The order of both was N − 1 = 999. After 1000 iterations the power of the excitation signal is enlarged by 40 dB. Both power estimations of the step-size control need to follow this increase. The power of the distorted error starts from a larger initial value than the power estimation of the undistorted error. During the first few iterations the resulting step-size does not reach values close to 1 (the optimal value). As a consequence, the convergence speed is reduced. Due to the stationary behavior of the measurement noise, the regularization control (especially the power estimation for the measurement noise) does not have inertia problems.
After 2000 iterations the power of the excitation signal is decreased by 40 dB. Both power estimations utilized in the step-size computation decrease their values during the first few following iterations by nearly the same amount. This leads to a step-size which is a little too large, resulting again in reduced convergence speed.
In the lowest diagram of Figure 6.17 the system distances resulting from step-size as well as from regularization control are depicted. The superior behavior of regularization control in this example is clearly visible. The advantages will disappear if the measurement noise also shows nonstationary behavior.
6.3.4 Concluding Remarks
Figure 6.17 Regularization versus step-size control. The excitation signal and the measurement noise used in this simulation are depicted in the upper two diagrams. The excitation signal varies its power every 1000 iterations in order to have signal-to-noise ratios of −20 dB and 20 dB, respectively. In the lowest diagram, the system distances resulting from step-size and regularization control according to Eq. (6.56) and Eq. (6.71) are depicted. The superior behavior of regularization control in this example is clearly visible. The advantages will disappear if the measurement noise also shows nonstationary behavior.
In the next section, we will derive control methods based on different estimation and
detection principles for the application of acoustic echo control. This should serve as
an example of how signal and system properties can be exploited to develop robust
control mechanisms for the NLMS algorithm. Before we start with the description of
the control methods, a brief outline of acoustic echo control and the related signals
and systems is presented in this section.
6.4.1
The problem of acoustic echo arises wherever a loudspeaker and a microphone are placed such that the microphone picks up the signal radiated by the loudspeaker and its reflections at the boundaries of the enclosure [2, 4, 16]. In the case of telecommunication systems, the users are annoyed by listening to their own speech delayed by the round-trip time of the system. If both conversation partners use telephones with hands-free capabilities, the electro-acoustic circuit may furthermore become unstable and produce howling.
To avoid these problems, an adaptive filter can be placed parallel to the loudspeaker-enclosure-microphone (LEM) system (see Fig. 6.18). If one succeeds in matching the impulse response w(n) of the filter exactly with the impulse response h(n) of the LEM system, the signals u(n) and e(n) are perfectly decoupled, without any disturbing effects for the users of the electro-acoustic system.
For the application of hands-free telephones, the measurement noise consists of two components: the signal produced by the local speaker and background noise. Sources of background noise in offices can be air-conditioning systems or computer fans. In contrast to the speech component of the measurement noise, the latter type of signal can be modeled as a stationary signal.
6.4.2 LEM Systems
the enclosure, the reflection properties of its boundaries, and the positions of objects (especially the loudspeaker and the microphone) within the enclosure. Depending on the application, it may be possible to design this system such that the reverberation time (defined as the time necessary for a 60 dB decay of the sound energy after switching off the sound source) is small, resulting in a short impulse response. Examples of this solution are telecommunication studios. On the other hand, electronic means are the only tools to provide hands-free communication out of ordinary office rooms or cars, for example.
In general, the acoustic coupling within an enclosure is formed by a direct path between the loudspeaker and the microphone and a very large number of echo paths. The impulse response can be described by a sequence of delta impulses delayed proportionally to the geometrical lengths of the related paths. The reflectivity of the boundaries of the enclosure and the path length determine the impulse amplitude [1]. The reverberation time of an office is typically on the order of a few hundred milliseconds; that of the interior of a car is a few tens of milliseconds. The upper two parts of Figure 6.19 show impulse responses of LEM systems measured in an office and in a car. The microphone signals have been sampled at a rate of 8 kHz. These impulse responses are highly sensitive to any changes within the LEM system. This is explained by the fact that, assuming a sound velocity of 343 m/s and an 8 kHz sampling frequency, the distance traveled between two sampling instants is 4.3 cm. Therefore, a 4.3 cm change in the length of an echo path moves the related impulse by one sampling interval. Thus, the need for an adaptive echo cancellation filter is evident.
The order N − 1 of the filter should be chosen in dependence on the expected reverberation time of the LEM system. If the coefficients of the adaptive filter match
Figure 6.19 Measured impulse responses and maximum echo reduction. In the two upper diagrams, two impulse responses of LEM systems are depicted. The top one was measured in an office with a reverberation time of about 300 ms. The middle one is the impulse response of the passenger cabin of a car (BMW 520) with a reverberation time of about 60 ms. In the lowest diagram the maximal echo reduction according to Eq. (6.76) is depicted.
the first N coefficients of the impulse response of the LEM system,
\[ w_i(n) = h_i(n) \quad \text{for } i \in \{0, \ldots, N-1\},\; n \ge n_0, \tag{6.75} \]
the maximal echo attenuation can be computed as a function of the filter order N − 1:
\[ \frac{E\{e^2(n,N)\}}{E\{y^2(n)\}}\bigg|_{n \ge n_0,\; w_i(n)=h_i(n),\; i \in \{0,\ldots,N-1\}} = \frac{E\!\left\{\left(\sum_{i=0}^{\infty} h_i(n)\,u(n-i) - \sum_{i=0}^{N-1} h_i(n)\,u(n-i)\right)^{\!2}\right\}}{E\!\left\{\left(\sum_{i=0}^{\infty} h_i(n)\,u(n-i)\right)^{\!2}\right\}} = \frac{E\!\left\{\left(\sum_{i=N}^{\infty} h_i(n)\,u(n-i)\right)^{\!2}\right\}}{E\!\left\{\left(\sum_{i=0}^{\infty} h_i(n)\,u(n-i)\right)^{\!2}\right\}}. \tag{6.76} \]
For white-noise excitation, this ratio reduces to the tail energy of the impulse response divided by its total energy:
\[ \frac{E\{e^2(n,N)\}}{E\{y^2(n)\}}\bigg|_{n \ge n_0} \approx \frac{\sum_{i=N}^{\infty} h_i^2(n)}{\sum_{i=0}^{\infty} h_i^2(n)}. \tag{6.77} \]
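For a synthetic, exponentially decaying impulse response (an illustrative stand-in for the measured LEM responses of Figure 6.19), the tail-energy ratio (6.77) can be evaluated as follows:

```python
import numpy as np

def max_echo_attenuation_db(h, N):
    """Maximal echo reduction, Eq. (6.77): energy of the uncancelled tail
    h[N:] relative to the total energy, in dB (white-noise excitation)."""
    h = np.asarray(h, dtype=float)
    return 10.0 * np.log10(np.sum(h[N:] ** 2) / np.sum(h ** 2))

# illustrative impulse response with exponential decay
h = 0.97 ** np.arange(2000)
att_500 = max_echo_attenuation_db(h, 500)
att_1000 = max_echo_attenuation_db(h, 1000)
```

Longer cancellers leave less tail energy, so the achievable attenuation grows with the filter order, as in the lowest diagram of Figure 6.19.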
In the lowest part of Figure 6.19 this quantity is depicted. To achieve an echo attenuation of 45 dB, as recommended by the International Telecommunication Union (see Sec. 6.4.5), a filter order of about 1600 in offices and 500 in cars is required. The adaptation of such high-order filters places very high demands on the computational power of the utilized hardware.
The logarithmic decay of the impulse responses can be exploited when a step-size matrix is used. In [28, 29, 40], exponentially weighted step-size matrices are investigated. For the first coefficients, large step-sizes are used, while the updates for coefficients with large indices are weighted with a small parameter. Especially during the initial convergence and after room changes, a significant increase of the adaptation speed can be achieved using these types of matrix step-sizes.
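The exponentially weighted step-size idea can be sketched as an NLMS variant with a diagonal step-size matrix whose entries decay along the filter. This is only a minimal sketch: the function name, the decay constant, and the synthetic impulse response below are illustrative choices, not taken from [28, 29, 40].

```python
import numpy as np

def ew_nlms_update(w, u_vec, d, mu0=0.8, decay=0.999, eps=1e-6):
    """One NLMS update with an exponentially weighted (diagonal) step-size
    matrix: large steps for the leading coefficients, smaller steps for the
    decaying tail of the room impulse response."""
    e = d - np.dot(w, u_vec)                   # a priori error signal
    mu = mu0 * decay ** np.arange(len(w))      # exponential step-size profile
    w_new = w + mu * e * u_vec / (np.dot(u_vec, u_vec) + eps)
    return w_new, e

# Illustrative system identification run with a synthetic decaying impulse response.
rng = np.random.default_rng(0)
N = 64
h = np.exp(-0.05 * np.arange(N)) * rng.standard_normal(N)
w = np.zeros(N)
u = rng.standard_normal(5000)
for n in range(N - 1, len(u)):
    u_vec = u[n - N + 1:n + 1][::-1]           # [u(n), u(n-1), ..., u(n-N+1)]
    d = np.dot(h, u_vec)                       # noise-free echo signal
    w, _ = ew_nlms_update(w, u_vec, d)
print(np.linalg.norm(h - w) < 0.1 * np.linalg.norm(h))  # converged
```

With the decay constant set close to 1, the profile only mildly de-emphasizes the tail; smaller values put almost all adaptation effort into the early coefficients, which pays off after room changes.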
In most offices and cars, high-frequency absorbing materials are used (e.g., carpets, curtains), leading to a faster decay of the high-frequency components of the echo signal. In Figure 6.20, a time-frequency analysis of the impulse response of an office is shown.

If subband processing is used, these properties can be exploited. In the high-frequency subbands, the filter orders can be reduced, and the unused memory and computing power can be used for enlarged echo cancellation filters in the low-frequency subbands.
Figure 6.20 Time-frequency analysis of the impulse response of an office. Dark colors belong to time-frequency areas with large energy; light colors represent low-energy areas. Due to high-frequency absorbing materials, which can be found in most offices, echoes decay faster at high than at low frequencies.
6.4.3
Speech Signals
Figure 6.21 Example of a speech sequence. In the top part a 5-s sequence of a speech signal
is depicted in the time domain. The signal was sampled at 8 kHz. The middle diagram shows
the mean power spectral density of the entire sequence (periodogram averaging). In the
bottom diagram, a time-frequency analysis is depicted. Dark colors represent areas with high
energy; light colors display areas with low energy.
6.4.4
Background Noise
Besides the signal of the local speaker and the echo signal, the microphone also picks up local background noise. The background noise can be interpreted as a second component of measurement noise. In the case of a hands-free telephone used in an office, the noise of personal computer (PC) fans or air conditioning might disturb the system identification process. If someone phones from a car, engine, wind, or rolling noise might be sources of the disturbing signal.
In contrast to speech signals, most background noises show nearly stationary behavior. To show the typical power spectral densities of background noises, two signals were analyzed. The results are depicted in Figure 6.22. The first analyzed signal was background noise measured in a car travelling on a motorway at 100 km/h. The upper curve in Figure 6.22 shows the average power spectral density of this car noise. Secondly, the noise of a PC fan and an air conditioning system, both recorded in an office, was measured. The estimated average power spectral density is also depicted.
Unless the far-end excitation signal u(n) has the same spectral envelope as the background noise, an adaptive filter structure which allows the use of different control parameters at different frequencies should be preferred.
6.4.5
Regulations
Besides the physical boundary conditions mentioned above, there are some administrative restrictions. The characteristics of hands-free telephone systems are specified by the International Telecommunication Union (ITU) as well as by the European Telecommunication Standards Institute (ETSI). The most severe restrictions for signal processing are the tolerable delays for front-end processing: only 2 ms [26] are allowed for stationary telephones and only 39 ms [11] for mobile telephones (GSM). Furthermore, an echo attenuation of about 45 dB in
Figure 6.22 Average power spectral densities of typical background noises. The upper curve shows the average power spectral density of background noise measured in a car on a motorway at 100 km/h. The lower curve shows the power spectral density of noise produced by a PC fan and an air conditioning system.
the case of single-talk and 30 dB in the case of double-talk (remote and local speakers are talking) is required.
Due to the severe delay restriction for stationary phones, only fullband structures or hybrid structures are applicable. Filter bank systems (consisting of an analysis and a synthesis part) as well as overlapping discrete Fourier transforms (DFTs) introduce a delay considerably larger than 2 ms. Therefore, at least the convolution part has to be implemented in fullband. Hybrid structures allow the adaptation to be performed in a domain other than the convolution. In these mixed processing structures the adaptation (but not the convolution) process is delayed by a few sampling instants. The fullband filter impulse response is computed via dedicated transformations [10, 33].
6.4.6
Concluding Remarks
The aim of this section was to introduce general aspects of acoustic echo control with emphasis on the statistical properties of the involved signals and systems. It was stated that these properties should strongly influence the choice of the processing and control structure.
If only the characteristics of speech signals are considered, subband or frequency-domain implementations should be preferred. These structures are also the best choice from the point of view of computational complexity if a large system order is required. For hands-free telephones used in office or car environments this is certainly true. The only drawback of these structures is the delay they introduce.
In the next section several detection and estimation methods will be presented.
Most of them use the signal and system properties presented in this section.
6.5
According to the adaptation rule of the NLMS algorithm, the coefficients w_i(n) are modified intensively if the error signal e(n) is rather large. In acoustic echo cancellation a large error signal can have two reasons:

- After an abrupt change of the system h(n), the adaptive filter w(n) is no longer matched. Therefore, a good estimate d̂(n) of the desired signal d(n) is not possible, leading to a large error signal. In these situations the filter w(n) should be readapted as quickly as possible, using a large step-size and a small regularization parameter.

- An increase in measurement noise due to activity of the local speaker also leads to an increase in the error signal. In those situations, the adaptation steps should be reduced in order to preserve the convergence state already reached. A small step-size or a large regularization parameter should be used.
Distinguishing between these two situations is a very challenging task. For estimation of the optimal step-size according to Eq. 6.55, it is necessary to estimate the power of the undisturbed error signal, which is not accessible. Since the LEM system and the adaptive FIR filter form a parallel structure, the signal e_u(n) (see Eq. 6.18) can be written as

$$e_u(n) = u^T(n)\,\varepsilon(n). \tag{6.78}$$

Assuming that the excitation signal and the system mismatch vector ε(n) = h(n) − w(n) are statistically independent, the power of the undisturbed error signal can be approximated by

$$E\{e_u^2(n)\} \approx E\{u^2(n)\}\, E\{\|\varepsilon(n)\|^2\}. \tag{6.79}$$

The second factor, the expected system distance,

$$\beta(n) = E\{\|\varepsilon(n)\|^2\}, \tag{6.80}$$

indicates the echo coupling
Figure 6.23 Model of the parallel arrangement of the LEM system and the echo cancellation filter. To estimate the power of the undisturbed error signal, the parallel structure of the adaptive filter plus the LEM system is modeled as a coupling factor.
after the echo cancellation. Figure 6.23 illustrates the idea of replacing the parallel
structure by a coupling factor. With this notation, the optimal step-size can be
written as:
$$\mu(n) = \frac{E\{e_u^2(n)\}}{E\{e^2(n)\}} = \frac{\beta(n)\, E\{u^2(n)\}}{E\{e^2(n)\}}. \tag{6.81}$$

The power of the error signal consists of the power of the undisturbed error signal and the power of the local signals:

$$E\{e^2(n)\} = E\{e_u^2(n)\} + E\{n^2(n)\}. \tag{6.82}$$
The sum of all local signals n(n) consists of a (nearly) stationary background noise n_s(n) and nonstationary local speech n_n(n). If orthogonality of the latter two signals is assumed, E{n²(n)} should be at least as large as the power of the local background noise E{n_s²(n)}. This power can easily be estimated using techniques known from noise reduction (see Sec. 6.5.3). To reduce the influence of estimation errors, a maximum function can be applied:
$$E\{n^2(n)\} \approx \max\big\{E\{n_s^2(n)\},\; E\{e^2(n)\} - E\{u^2(n)\}\,\beta(n)\big\}. \tag{6.83}$$
The optimal regularization parameter can be expressed in terms of the coupling factor and the power of the local signals as

$$\Delta(n) = N\, \frac{E\{n^2(n)\}}{\beta(n)}. \tag{6.84}$$

6.5.1
For approximation of the optimal control parameters according to Eq. 6.81 and Eq. 6.84, several detection and estimation methods are introduced in this section. Even if not all of them can be utilized only for acoustic echo control (see the previous section), a significant number exploit the statistical and physical properties of the signals and systems involved in acoustic echo control. All of the detectors and estimators presented here can be grouped into five classes:
- Schemes for short-term power estimation
- Schemes for estimating the local background noise
- Basic principles for estimating the power of the undisturbed error signal e_u(n) and a coupling factor β(n)
- Principles for the detection of local speech activity
- Principles for detecting enclosure dislocations, called rescue detectors
Details about real-time implementation, computational complexity, and reliability
can be found in the corresponding references.
6.5.2
For estimation of the short-term power of a signal, a first-order IIR filter can be utilized, as indicated in Figure 6.24.
To be able to detect rising signal powers (especially of the error signal e(n)) very fast, different smoothing constants for rising and falling signal edges (γ_r < γ_f) have been proposed:

$$\overline{u^2}(n) = (1 - \gamma(n))\, u^2(n) + \gamma(n)\, \overline{u^2}(n-1) \tag{6.85}$$

with

$$\gamma(n) = \begin{cases} \gamma_r & \text{if } u^2(n) > \overline{u^2}(n-1), \\ \gamma_f & \text{otherwise}. \end{cases} \tag{6.86}$$
Figure 6.24 First-order IIR filter.
It should be mentioned that the above estimation contains a bias due to the different
smoothing constants. When comparing the powers of different signals or computing
their ratio, the knowledge of this bias is not necessary if the same method was used
for both power estimations.
Besides taking the squared input signal [6], the absolute value [25, 39] may be used. Its advantage is the reduced dynamic range. Especially for fixed-point implementations with only a limited amount of processing power and memory, smoothing of the magnitude is often preferred.
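The smoothing with separate attack and release constants described above can be sketched as follows; the constants γ_r and γ_f below are illustrative, not values from the text.

```python
import numpy as np

def short_term_power(u, gamma_r=0.6, gamma_f=0.995):
    """Short-term power estimate with different smoothing constants for rising
    and falling signal edges (cf. Eqs. 6.85 and 6.86): gamma_r < gamma_f, so
    rising powers are tracked quickly and falling powers slowly."""
    p = np.zeros(len(u))
    prev = 0.0
    for n, x in enumerate(u):
        g = gamma_r if x * x > prev else gamma_f   # Eq. 6.86
        prev = (1.0 - g) * x * x + g * prev        # Eq. 6.85
        p[n] = prev
    return p

# A burst of unit power: the estimate rises fast and decays slowly.
sig = np.concatenate([np.zeros(100), np.ones(100), np.zeros(100)])
p = short_term_power(sig)
print(p[120] > 0.9, p[210] > 0.1)
```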
Besides other methods for estimating the short-term power of a signal, the weighted squared norm of the excitation vector should be mentioned here. The squared norm ‖u(n)‖² can also be computed recursively:

$$\frac{1}{N}\, \|u(n)\|^2 = \frac{1}{N}\, \|u(n-1)\|^2 + \frac{1}{N}\big(u^2(n) - u^2(n-N)\big). \tag{6.87}$$
The squared norm of the excitation vector is already computed within the NLMS algorithm. To use this method for other signals, however, the memory demand seems prohibitive. For speech signals and appropriately chosen time constants γ_r, γ_f and filter orders N, the presented short-term power estimators do not really differ.
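The recursive norm update of Eq. 6.87 can be sketched as follows; up to rounding, it reproduces the mean of the last N squared samples exactly.

```python
import numpy as np

def recursive_norm_power(u, N):
    """Normalized squared norm ||u(n)||^2 / N, updated recursively as in
    Eq. 6.87: add the newest squared sample, drop the one leaving the window."""
    p = np.zeros(len(u))
    acc = 0.0
    for n, x in enumerate(u):
        acc += x * x / N
        if n >= N:
            acc -= u[n - N] ** 2 / N
        p[n] = acc
    return p

rng = np.random.default_rng(1)
u = rng.standard_normal(500)
p = recursive_norm_power(u, N=32)
# Cross-check against the direct windowed mean at n = 100 (samples 69..100).
print(abs(p[100] - np.mean(u[69:101] ** 2)) < 1e-10)
```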
Figure 6.25 shows a speech sequence in the top diagram as well as the three
mentioned short-term power estimations in the lower three diagrams. The maximal
difference between the estimation methods is below 4 dB.
6.5.3
Figure 6.25 Examples of short-term power estimations. In the top diagram a typical
example of a speech sequence is depicted. The lower three diagrams show the results of
different short-term power estimations: smoothing the squared signal (second diagram),
smoothing the absolute value of the signal (third diagram), and the squared norm of the signal
vector (vector length 1000).
Typically, the background noise can be modeled as a weakly stationary process for at least the duration of the observation interval.

In practice, two basic schemes are often applied. The first approach smoothes either the short-term power estimate or the instantaneous values y²(n) or |y(n)|, respectively, with a first-order IIR filter. The time constants are set according to the result of a local speech activity detector. If local speech is detected, the background noise estimation is stopped. Otherwise, the smoothing is performed in such a way that a decrease of the short-term power is followed much faster than an increase.
The second scheme is called the minimum statistic [31, 32]. In this approach a minimum search over the last N_MS values of the short-term power of the microphone signal is performed. As in the first approach, the search length N_MS is chosen such that an interval of a few seconds is covered. Usually a minimum search requires a large amount of memory and the application of sorting algorithms. The method described in [31] varies the interval order slowly over time, leading to a reduced computational load and a large reduction of the required memory.
We will not go further into the details of background noise estimation. The
interested reader is referred to the cited references.
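The core of the minimum statistic, a minimum search over the last N_MS short-term power values, can be sketched with a monotonic queue, which avoids sorting entirely; bias compensation and the slowly varying interval order of [31] are omitted in this sketch.

```python
from collections import deque
import numpy as np

def minimum_statistic(power, n_ms=200):
    """Noise-floor estimate as the minimum of the short-term power over the
    last n_ms values. A monotonic deque gives O(1) work per sample instead of
    sorting; bias compensation as in [31, 32] is omitted."""
    out = np.empty(len(power))
    q = deque()                          # (index, value), values increasing
    for n, v in enumerate(power):
        while q and q[-1][1] >= v:
            q.pop()                      # dominated entries can never be the minimum
        q.append((n, v))
        if q[0][0] <= n - n_ms:
            q.popleft()                  # entry has left the search window
        out[n] = q[0][1]
    return out

# Speech bursts raise the short-term power, but the minimum tracks the noise floor.
p = np.full(1000, 0.1)
p[300:350] = 1.0
floor = minimum_statistic(p)
print(floor[340] == 0.1, floor[600] == 0.1)
```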
6.5.4
For the determination of the control parameters according to Eqs. 6.81 and 6.84, knowledge of the coupling factor or the system distance, respectively, is necessary. In general, the system distance

$$\|\varepsilon(n)\|^2 = \|h(n) - w(n)\|^2 \tag{6.88}$$

cannot be computed directly, since the LEM impulse response h(n) is not known. If an artificial delay is introduced in front of the LEM system, the leading coefficients of the LEM impulse response are zero, and the corresponding components of the system mismatch vector become directly observable:

$$\varepsilon_i(n) = -w_i(n), \quad i \in \{0, \ldots, N_D - 1\}. \tag{6.89}$$

Assuming that the adaptive algorithm spreads the filter mismatch evenly over all coefficients, the system distance can be estimated by extrapolating the known part of the mismatch vector:

$$\|\varepsilon(n)\|^2 \approx \frac{N}{N_D}\, \|w_D(n)\|^2 = \frac{N}{N_D} \sum_{i=0}^{N_D - 1} w_i^2(n), \tag{6.90}$$
where N denotes the length of the adaptive filter. The vector w_D(n) consists of the first N_D coefficients of the adaptive filter vector:

$$w_D(n) = [w_0(n), w_1(n), \ldots, w_{N_D-1}(n)]^T. \tag{6.91}$$
The general structure of this estimation method is depicted in Figure 6.26. A step-size or regularization control based on the estimation of the system distance with the delay coefficients generally shows good performance. When the power of the error signal increases due to local speech activity, the step-size is reduced and the regularization parameter is increased, as the denominator in Eq. 6.81 increases. Thus divergence of the filter is avoided.
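The delay-coefficients estimate of Eqs. 6.90 and 6.91 can be sketched as follows; the sizes are illustrative, and the mismatch is chosen perfectly even so that the extrapolation is exact.

```python
import numpy as np

def system_distance_estimate(w, n_d):
    """Estimate ||h - w||^2 from the first n_d coefficients of the adaptive
    filter (cf. Eqs. 6.90 and 6.91): the artificial delay makes the leading
    LEM coefficients zero, so w_0 .. w_{n_d - 1} expose the local mismatch,
    which is extrapolated to all N coefficients."""
    w_d = w[:n_d]                              # delay coefficients, Eq. 6.91
    return (len(w) / n_d) * np.dot(w_d, w_d)   # Eq. 6.90

N, n_d = 256, 32
eps = 0.01 * np.ones(N)                        # evenly spread mismatch h - w
h = np.concatenate([np.zeros(n_d), np.ones(N - n_d)])  # leading zeros: artificial delay
w = h - eps
est = system_distance_estimate(w, n_d)
print(abs(est - np.dot(eps, eps)) < 1e-12)     # exact for an even mismatch
```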
If the additional delay in the signal path is not tolerable, a two-filter structure according to [5] may be used instead. However, the determination of the control parameters according to this method may lead to a freezing of the adaptation when the LEM system changes. This phenomenon can be observed in the lowest diagram of Figure 6.27, after 60,000 iterations. Freezing occurs because a change in the LEM system also leads to an increase in the power of the error signal and consequently to a reduced step-size or, respectively, an increased regularization parameter. Thus, a new adaptation of the filter and the delay coefficients is prevented and the system freezes. To avoid this, an additional detector for LEM changes is required that resets either the delay coefficients or the control parameters such that the filter can readapt.
Figure 6.26 General structure of the system distance estimation based on delay coefficients. If an additional artificial delay is introduced into the LEM system, this delay is also modeled by the adaptive filter. Utilizing the known property of adaptive algorithms to spread the filter mismatch evenly over all filter coefficients, the known part of the system mismatch vector can be extrapolated according to Eq. 6.90 for estimating the system distance.
Figure 6.27 Simulation example for the estimation of the system distance or the coupling factor. White noise was used for the excitation as well as for the local signal (see the two upper diagrams). Double-talk took place between iterations 30,000 and 40,000. After 60,000 iterations, an enclosure dislocation was simulated (a book was placed midway between the loudspeaker and the microphone).
The coupling factor can be estimated as the ratio of the power of the undisturbed error signal and the power of the excitation signal,

$$\beta_P(n) = \frac{\overline{e_u^2}(n)}{\overline{u^2}(n)}, \tag{6.92}$$

and updated recursively whenever remote single-talk is detected:

$$\beta_P(n) = \begin{cases} \gamma\, \beta_P(n-1) + (1 - \gamma)\, \dfrac{e^2(n)}{\overline{u^2}(n)} & \text{during remote single-talk}, \\[1mm] \beta_P(n-1) & \text{otherwise}. \end{cases} \tag{6.93}$$
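The gated recursive update of Eq. 6.93 is tiny; a minimal sketch, where the single-talk flag would come from one of the detectors described in this section and the power values are illustrative:

```python
def update_coupling(beta_prev, e_pow, u_pow, remote_single_talk, gamma=0.99):
    """Recursive coupling-factor estimate (cf. Eq. 6.93): update the ratio of
    error power to excitation power during remote single-talk, freeze it
    otherwise (e.g., during double-talk)."""
    if remote_single_talk:
        return gamma * beta_prev + (1.0 - gamma) * e_pow / u_pow
    return beta_prev

beta = 1.0
for _ in range(2000):                           # remote single-talk: adapt
    beta = update_coupling(beta, e_pow=0.5, u_pow=2.0, remote_single_talk=True)
frozen = update_coupling(beta, e_pow=50.0, u_pow=2.0, remote_single_talk=False)
print(abs(beta - 0.25) < 1e-3, frozen == beta)  # converged, then frozen
```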
This measure has to be calculated for different delays l due to the time delay of the loudspeaker-microphone path. The parameter L_C has to be chosen such that the time delay of the direct path between the loudspeaker and the microphone falls into the interval [0, L_C]. Based on the assumption that the direct echo signal is maximally correlated with the excitation signal, the open-loop correlation measure has its maximum at that delay. In contrast, no delay has to be considered for the closed-loop
correlation measure:
$$\rho_{CL}(n) = \frac{\sum_{m=0}^{N_C-1} \hat d(n-m)\, y(n-m)}{\sum_{m=0}^{N_C-1} \big|\hat d(n-m)\, y(n-m)\big|} = \frac{\sum_{m=0}^{N_C-1} \hat d(n-m)\, \big[d(n-m) + n(n-m)\big]}{\sum_{m=0}^{N_C-1} \big|\hat d(n-m)\, \big[d(n-m) + n(n-m)\big]\big|}. \tag{6.94}$$
This is due to the fact that both signals are synchronous if a sufficiently adjusted echo-cancelling filter is present. Both correlation values have to be calculated over a limited number of samples N_C, where a larger number ensures better estimation quality. However, there is a trade-off between the estimation quality and the detection delay. The latter can lead to instability.
A decision for remote single-talk can be easily generated by comparing the
correlation value with a predetermined threshold. In Figure 6.28, simulation results
for the correlation values are shown. It is clear that the closed-loop structure ensures
more reliable detection. However, in cases of misadjusted adaptive filters, this
detector provides false estimations (e.g., at the beginning, or after local dislocation
at sample 60,000).
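The closed-loop correlation measure of Eq. 6.94 normalizes the cross-correlation of d̂(n) and y(n) to the interval [−1, 1]; a sketch on synthetic frames, with frame length and signal powers chosen purely for illustration:

```python
import numpy as np

def closed_loop_correlation(d_hat, y):
    """Closed-loop correlation measure over one frame (cf. Eq. 6.94): close
    to 1 during remote single-talk with a well-adjusted echo canceller, since
    d_hat and y are then nearly synchronous. Assumes the frame is nonzero."""
    prod = d_hat * y
    return np.sum(prod) / np.sum(np.abs(prod))

rng = np.random.default_rng(3)
d = rng.standard_normal(256)                          # echo component
rho_st = closed_loop_correlation(d, d + 0.05 * rng.standard_normal(256))
rho_dt = closed_loop_correlation(d, d + 3.0 * rng.standard_normal(256))
print(rho_st > 0.9, rho_dt < 0.9)                     # single-talk vs. double-talk
```

Thresholding this value gives the simple remote single-talk decision described above.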
Another possible way to detect remote single-talk is to compare the complex cepstra of two signals. The complex cepstrum x̂(n) of a signal x(n) is defined as the inverse z-transform of the logarithm of the normalized z-transform of the signal x(n) [37]:
$$\log \frac{X(z)}{X(0)} = \sum_{i=1}^{\infty} \hat x(i)\, z^{-i}, \tag{6.95}$$

$$X(z) = \sum_{i=1}^{\infty} x(i)\, z^{-i}. \tag{6.96}$$
The cepstrum exists if the quantity log(X(z)/X(0)) fulfills all conditions of a z-transform of a stable series.
The cepstral distance measure is defined in [18] with a focus on the problem of determining the similarity between two signals. A modified, truncated version adapted to acoustic echo control problems can also be applied:
$$d_c^2(n) = \sum_{i=0}^{N_{cep}-1} \big(\hat c_y(i, n) - \hat c_{\hat d}(i, n)\big)^2. \tag{6.97}$$
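A sketch of the truncated cepstral distance of Eq. 6.97. For simplicity it uses the real cepstrum (inverse FFT of the log magnitude spectrum) instead of the complex cepstrum of Eqs. 6.95 and 6.96; this is a common practical substitution, not the book's exact definition.

```python
import numpy as np

def truncated_cepstral_distance(x, y, n_cep=16):
    """Truncated cepstral distance (cf. Eq. 6.97) between two signal frames,
    using the real cepstrum as a practical stand-in for the complex cepstrum."""
    def cep(s):
        spec = np.abs(np.fft.rfft(s)) + 1e-12   # avoid log(0)
        return np.fft.irfft(np.log(spec))[:n_cep]
    diff = cep(x) - cep(y)
    return np.dot(diff, diff)

rng = np.random.default_rng(4)
w = rng.standard_normal(1024)
colored = np.convolve(w, [1.0, 0.9], mode="same")   # spectrally shaped copy
print(truncated_cepstral_distance(w, w) == 0.0,
      truncated_cepstral_distance(w, colored) > 0.0)
```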
Figure 6.28 Simulation example for detecting remote single-talk with distance measure principles. Three methods (a closed- and an open-loop correlation analysis as well as a cepstral analysis) are depicted in the lower three diagrams. Speech signals were used for the excitation as well as for the local signal. Double-talk occurred between iterations 30,000 and 40,000. At iteration 60,000, the impulse response of the LEM system was changed, leading to detection problems in the closed-loop correlation analysis.
Figure 6.29 Two-filter scheme (reference and shadow) for detecting enclosure dislocations. Both filters are controlled independently. If one filter produces an error power much smaller than that of the other, either the filter coefficients can be exchanged or the parameters of the control mechanism can be reinitialized to enable convergence.
The different convergence behavior of the two filters can be used to develop a detection mechanism: if the error signal of the shadow filter falls below the error signal of the reference filter for several iterations, an enclosure dislocation is detected (in Fig. 6.29, t_s(n) describes this detection result). The step-size is enlarged to enable the adaptation of the reference filter toward the new LEM impulse response.
In Figure 6.30, simulation results for this detector are shown. In the top graph, the powers of the error signals of both the reference and the shadow filter are pictured. Due to the fact that the number of coefficients of the shadow filter is smaller than that of the reference filter, a faster convergence of the shadow filter is evident. However, a drawback of the decreased number of coefficients is the lower level of echo attenuation. After 60,000 iterations, when an enclosure dislocation takes place, fast convergence of the shadow filter can be observed, whereas the reference filter converges only slowly. Therefore an enclosure dislocation is detected (second graph in Fig. 6.30), which leads to a readjustment of the reference filter. At the beginning of the simulation, enclosure dislocations are also detected. However, this conforms with the requirements of the detector, because the beginning of the adaptation can also be interpreted as an enclosure dislocation due to the misadjustment of the filter.
A second detection scheme analyzes power ratios separately in different frequency bands. The aim of this detector is to distinguish between two reasons for increasing echo signal power: changes of the LEM impulse response or local speech activity. In [30], it was shown that a typical change of the room impulse response (e.g., caused by movements of the local speaker) mainly affects the higher frequencies of the difference transfer function H(e^{jΩ}) − W(e^{jΩ}) corresponding to the system mismatch vector ε(n) = h(n) − w(n). The reason for this characteristic is that movements of the local speaker may cause phase shifts of up to 180 degrees for
Figure 6.30 Simulation examples for the detection of enclosure dislocations. Stationary noise with the same spectral characteristics as speech (linear predictive analysis of order 40) was used for the excitation signal as well as for the local signal. Double-talk takes place during iterations 30,000 and 40,000. At iteration 60,000 the impulse response of the LEM system was changed. For both methods (shadow filter and separate highpass and lowpass coupling analyses), the detection results as well as the main analysis signals are depicted.
the high-frequency components, whereas activity of the local speaker also affects the low-frequency range. To distinguish the two causes, the lowpass and highpass power ratios

$$q_{LP}(n) = \frac{\int_0^{\Omega_g} S_{ee}(\Omega, n)\, d\Omega}{\int_0^{\Omega_g} S_{yy}(\Omega, n)\, d\Omega} \qquad \text{and} \qquad q_{HP}(n) = \frac{\int_{\Omega_g}^{\pi} S_{ee}(\Omega, n)\, d\Omega}{\int_{\Omega_g}^{\pi} S_{yy}(\Omega, n)\, d\Omega} \tag{6.98}$$
are analyzed, where the short-time power spectral densities are calculated by recursive averaging of the squared short-time spectra. The cutoff frequency Ω_g should be chosen close to 700 Hz. A structure for the detector is proposed in Figure 6.31.
Figure 6.31 Highpass and lowpass coupling analyses for detection of enclosure
dislocations. Movements of persons mostly change the high-frequency characteristics of
the LEM system, whereas activity of the local speaker also affects the low-frequency range.
This relationship can be used to differentiate between increasing error powers due to double-talk or to enclosure dislocations.
There are different ways to finally generate the information about local dislocations. In [30], local dislocations are detected by processing differential values of q_LP(n) and q_HP(n) to detect a change of the LEM transfer function. However, if the peak indicating this change is not detected clearly, the detection of the LEM change is missed entirely. Another approach is based only on the current value of a slightly smoothed quotient q_LP(n) [3]. Our approach is to average the quotient q_HP(n)/q_LP(n) by summing over the last 8000 samples. This procedure considerably increases the reliability of the detector but introduces a delay in the detection of enclosure dislocations.
Simulation results are depicted in Figure 6.30. In the third graph, the lowpass and highpass power ratios, q_LP(n) and q_HP(n), respectively, are shown. It can be observed that both ratios rise by close to 5 dB at 30,000 iterations during the double-talk period. In contrast, when a local dislocation occurs after 60,000 samples, there is a clear increase in the highpass power ratio, whereas the lowpass power ratio is subject to only a small increase. The fourth graph shows the detection result of the sliding window for the quotient of the two power ratios. The enclosure dislocation is detected reliably, but with a small delay.
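The band-wise power ratios of Eq. 6.98 can be sketched per frame with periodograms standing in for recursively smoothed power spectral densities (a simplification for illustration; the 700 Hz cutoff follows the text):

```python
import numpy as np

def band_power_ratios(e_frame, y_frame, fs=8000.0, f_g=700.0):
    """Lowpass and highpass power ratios q_LP and q_HP (cf. Eq. 6.98) for one
    frame, with periodograms standing in for the smoothed spectral densities."""
    see = np.abs(np.fft.rfft(e_frame)) ** 2
    syy = np.abs(np.fft.rfft(y_frame)) ** 2
    lp = np.fft.rfftfreq(len(e_frame), d=1.0 / fs) <= f_g
    return np.sum(see[lp]) / np.sum(syy[lp]), np.sum(see[~lp]) / np.sum(syy[~lp])

# A residual error concentrated at 3 kHz (as after an enclosure dislocation)
# raises the highpass ratio far more than the lowpass ratio.
rng = np.random.default_rng(5)
y = rng.standard_normal(2048)
e = 0.3 * np.cos(2 * np.pi * 3000.0 / 8000.0 * np.arange(2048))
q_lp, q_hp = band_power_ratios(e, y)
print(q_hp > 10 * q_lp)
```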
6.5.7
Having described some of the most important detection principles in the previous
section, we will now present an overview of the possibilities for combining these
detectors into an entire step-size or regularization control unit.
In Figure 6.32, possible combinations for building a complete step-size control
unit are depicted. The system designer has several choices, which differ in
computational complexity, memory requirements, reliability, dependence on some
types of input signals, and robustness in the face of finite word length effects.
Most of the proposed step-size control methods are based on estimates of the short-term power of the excitation and error signals. Estimating these quantities is relatively simple. The estimation of the current amount of echo attenuation is much more complicated. This quantity was introduced at the beginning of this section as the echo coupling β(n), which is an estimate of the norm of the system mismatch vector ‖ε(n)‖². Reliable estimation of this quantity is required not only for estimating the power of the undisturbed error signal e_u²(n) but also for the interaction of the echo cancellation with other echo-suppressing parts of a hands-free telephone, that is, loss control and postfiltering [20].
Using the delay coefficients method for estimating the system distance has the advantage that no remote single-talk detection is required. Furthermore, the tail of the LEM impulse response, which is not cancelled because of the limited order of the adaptive filter, does not affect this method. The disadvantage of this method is the artificial delay which is necessary to generate the zero-valued coefficients of the LEM impulse response. If ITU-T or ETSI recommendations [11, 26] concerning the delay have to be fulfilled, the coupling factor estimation should be preferred or a two-filter scheme has to be implemented. A second drawback of the delay
6.5.8
Concluding Remarks
The aim of this section was to show how the specific properties of the system to be identified and of the involved signals can be exploited to build a robust and reliable adaptation control. For all necessary estimation and detection schemes, the system designer has several possibilities to choose from. A compromise between reliability, computational complexity, memory requirements, and signal delay always has to be found.
Figure 6.34 Simulation example of an entire adaptation control unit. Speech signals were used for the excitation as well as for the local distortion (see the top two diagrams). After 62,000 iterations a book was placed between the loudspeaker and the microphone. In the third diagram, the real and estimated system distances are depicted. The lowest two diagrams show the step-size and the regularization parameter.
With the delay-coefficient estimate β_D(n) of the coupling factor and the estimated background noise power, the step-size and the regularization parameter can be computed as

$$\mu(n) = \frac{\overline{u^2}(n)\, \beta_D(n)}{\max\big\{\overline{e^2}(n) - \overline{n_s^2}(n),\; \overline{u^2}(n)\, \beta_D(n)\big\}} \tag{6.99}$$

and

$$\Delta(n) = N\, \frac{\overline{n_s^2}(n)}{\beta_D(n)}. \tag{6.100}$$
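Equations 6.99 and 6.100 combine the estimates of this section into the final control parameters; a minimal sketch with illustrative power values:

```python
def control_parameters(u_pow, e_pow, ns_pow, beta_d, N):
    """Step-size and regularization parameter (cf. Eqs. 6.99 and 6.100) from
    short-term powers of excitation and error, the background-noise power
    estimate, and the delay-coefficient coupling estimate beta_d."""
    est_eu = u_pow * beta_d                    # estimated undisturbed error power
    mu = est_eu / max(e_pow - ns_pow, est_eu)  # Eq. 6.99, step-size in (0, 1]
    delta = N * ns_pow / beta_d                # Eq. 6.100
    return mu, delta

# Quiet local end: full step-size. Local speech active: step-size shrinks.
mu_quiet, delta = control_parameters(u_pow=1.0, e_pow=0.02, ns_pow=0.01,
                                     beta_d=0.01, N=1024)
mu_talk, _ = control_parameters(u_pow=1.0, e_pow=1.0, ns_pow=0.01,
                                beta_d=0.01, N=1024)
print(mu_quiet == 1.0, mu_talk < 0.02, delta == 1024.0)
```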
6.7
REFERENCES
1. J. B. Allen and D. A. Berkley, Image Method for Efficiently Simulating Small-Room Acoustics, J. Acoust. Soc. Am., vol. 65, pp. 943–950, 1979.
2. J. Benesty, T. Gänsler, D. R. Morgan, M. M. Sondhi, and S. L. Gay, Advances in Network and Acoustic Echo Cancellation, Springer, Berlin, 2001.
3. C. Breining, A Robust Fuzzy Logic-Based Step Gain Control for Adaptive Filters in Acoustic Echo Cancellation, IEEE Trans. on Speech and Audio Processing, vol. 9, no. 2, pp. 162–167, Feb. 2001.
4. C. Breining, P. Dreiseitel, E. Hänsler, A. Mader, B. Nitsch, H. Puder, T. Schertler, G. Schmidt, and J. Tilp, Acoustic Echo Control, IEEE Signal Processing Magazine, vol. 16, no. 4, pp. 42–69, 1999.
5. T. Burger and U. Schultheiss, A Robust Acoustic Echo Canceller for a Hands-Free Voice-Controlled Telecommunication Terminal, Proc. of the EUROSPEECH 93, Berlin, vol. 3, pp. 1809–1812, Sept. 1993.
6. T. Burger, Practical Application of Adaptation Control for NLMS-Algorithms Used for Echo Cancellation with Speech Signals, Proc. IWAENC 95, Røros, Norway, 1995.
7. R. E. Crochiere and L. R. Rabiner, Multirate Digital Signal Processing, Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1983.
8. J. Deller, J. Hansen, and J. Proakis, Discrete-Time Processing of Speech Signals, IEEE Press, New York, 1993.
9. P. S. R. Diniz, Adaptive Filtering: Algorithms and Practical Implementations, Kluwer Academic Publishers, Boston, 1997.
10. M. Dörbecker and P. Vary, Reducing the Delay of an Acoustic Echo Canceller with Subband Adaptation, Proc. of the IWAENC 95, International Workshop on Acoustic Echo and Noise Control, Røros, Norway, pp. 103–106, 1995.
11. ETS 300 903 (GSM 03.50), Transmission Planning Aspects of the Speech Service in the GSM Public Land Mobile Network (PLMN) System, ETSI, France, March 1999.
12. P. Eykhoff, System Identification: Parameter and State Estimation, John Wiley & Sons, Chichester, England, 1974.
13. A. Feuer and E. Weinstein, Convergence Analysis of LMS Filters with Uncorrelated Gaussian Data, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-33, no. 1, pp. 222–230, Feb. 1985.
14. R. Frenzel and M. Hennecke, Using Prewhitening and Stepsize Control to Improve the Performance of the LMS Algorithm for Acoustic Echo Compensation, Proc. of the ISCAS-92, IEEE International Symposium on Circuits and Systems, vol. 4, pp. 1930–1932, San Diego, CA, 1992.
15. T. Gänsler, M. Hansson, C.-J. Ivarsson, and G. Salomonsson, A Double-Talk Detector Based on Coherence, IEEE Trans. on Communications, vol. 44, no. 11, pp. 1421–1427, 1996.
16. S. L. Gay and J. Benesty (eds.), Acoustic Signal Processing for Telecommunications, Kluwer, Boston, MA, 2000.
17. G. Glentis, K. Berberidis, and S. Theodoridis, Efficient Least Squares Adaptive Algorithms for FIR Transversal Filtering, IEEE Signal Processing Magazine, vol. 16, no. 4, pp. 13–41, July 1999.
18. A. H. Gray and J. D. Markel, Distance Measures for Speech Processing, IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. ASSP-24, no. 5, pp. 380–391, 1976.
19. Y. Haneda, S. Makino, J. Kojima, and S. Shimauchi, Implementation and Evaluation of an Acoustic Echo Canceller Using Duo-Filter Control System, Proc. EUSIPCO 96, Trieste, Italy, vol. 2, pp. 1115–1118, 1996.
20. E. Hänsler and G. Schmidt, Hands-Free Telephones: Joint Control of Echo Cancellation and Postfiltering, Signal Processing, vol. 80, no. 11, pp. 2295–2305, Nov. 2000.
21. E. Hänsler, The Hands-Free Telephone Problem: An Annotated Bibliography, Signal Processing, vol. 27, no. 3, pp. 259–271, 1992.
22. E. Hänsler, The Hands-Free Telephone Problem: An Annotated Bibliography Update, Annales des Télécommunications, Special Issue on Acoustic Echo Control, no. 49, pp. 360–367, 1994.
23. S. Haykin, Adaptive Filter Theory, 3rd Edition, Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1996.
24. P. Heitkämper and M. Walker, Adaptive Gain Control and Echo Cancellation for Hands-Free Telephone Systems, Proc. EUROSPEECH 93, Berlin, pp. 1077–1080, Sept. 1993.
25. P. Heitkämper, An Adaptation Control for Acoustic Echo Cancellers, IEEE Signal Processing Letters, vol. 4, no. 6, pp. 170–172, 1997.
26. ITU-T Recommendation G.167, General Characteristics of International Telephone Connections and International Telephone Circuits: Acoustic Echo Controllers, Helsinki, Finland, March 1993.
27. A. Mader, H. Puder, and G. U. Schmidt, Step-Size Control for Acoustic Echo Cancellation Filters: An Overview, Signal Processing, vol. 80, no. 9, pp. 1697–1719, Sept. 2000.
28. S. Makino and Y. Kaneda, Exponentially Weighted Step-Size Projection Algorithm for Acoustic Echo Cancellers, IEICE Trans. Fundamentals, vol. E75-A, no. 11, pp. 1500–1507, 1992.
29. S. Makino, Y. Kaneda, and N. Koizumi, Exponentially Weighted Step-Size NLMS Adaptive Filter Based on the Statistics of a Room Impulse Response, IEEE Trans. Speech and Audio Processing, vol. 1, no. 1, pp. 101–108, 1993.
30. J. Marx, Akustische Aspekte der Echokompensation in Freisprecheinrichtungen, VDI-Fortschritt-Berichte, Reihe 10, no. 400, Düsseldorf, 1996.
31. R. Martin, An Efficient Algorithm to Estimate the Instantaneous SNR of Speech Signals, Proc. EUROSPEECH 93, Berlin, pp. 1093–1096, Sept. 1993.
32. R. Martin, Spectral Subtraction Based on Minimum Statistics, Signal Processing VII: Theories and Applications, Conference Proceedings, pp. 1182–1185, 1994.
33. R. Merched, P. Diniz, and M. Petraglia, A New Delayless Subband Adaptive Filter Structure, IEEE Trans. on Signal Processing, vol. 47, no. 6, pp. 1580–1591, June 1999.
34. W. Mikhael and F. Wu, Fast Algorithms for Block FIR Adaptive Digital Filtering, IEEE Trans. on Circuits and Systems, vol. 34, pp. 1152–1160, Oct. 1987.
35. B. Nitsch, The Partitioned Exact Frequency Domain Block NLMS Algorithm, a Mathematically Exact Version of the NLMS Algorithm Working in the Frequency Domain, International Journal of Electronics and Communications, vol. 52, pp. 293–301, Sept. 1998.
36. K. Ochiai, T. Araseki, and T. Ogihara, Echo Canceler with Two Echo Path Models, IEEE Trans. on Communications, vol. COM-25, no. 6, pp. 589–595, 1977.
37. A. V. Oppenheim and R. W. Schafer, Digital Signal Processing, Prentice-Hall, Inc., London, 1975.
38. H. Puder, Single Channel Noise Reduction Using Time-Frequency Dependent Voice Activity Detection, Proc. IWAENC 99, Pocono Manor, Pennsylvania, pp. 68–71, Sept. 1999.
39. T. Schertler and G. U. Schmidt, Implementation of a Low-Cost Acoustic Echo Canceller, Proc. IWAENC 97, London, pp. 49–52, 1997.
40. T. Schertler, Selective Block Update of NLMS Type Algorithms, 32nd Annual Asilomar Conf. on Signals, Systems, and Computers, Conference Proceedings, pp. 399–403, Pacific Grove, California, Nov. 1998.
41. G. U. Schmidt, Step-Size Control in Subband Echo Cancellation Systems, Proc. IWAENC 99, Pocono Manor, Pennsylvania, pp. 116–119, 1999.
42. W.-J. Song and M.-S. Park, A Complementary Pair LMS Algorithm for Adaptive Filtering, Proc. ICASSP 97, Munich, vol. 3, pp. 2261–2264, 1997.
43. B. Widrow and S. Stearns, Adaptive Signal Processing, Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1985.
44. P. P. Vaidyanathan, Multirate Digital Filter Banks, Polyphase Networks, and Applications: A Tutorial, Proc. of the IEEE, vol. 78, no. 1, pp. 56–93, Jan. 1990.
45. P. P. Vaidyanathan, Multirate Systems and Filter Banks, Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1992.
46. S. Yamamoto, S. Kitayama, J. Tamura, and H. Ishigami, An Adaptive Echo Canceller with Linear Predictor, Trans. of the IECE of Japan, vol. 62, no. 12, pp. 851–857, 1979.
47. S. Yamamoto and S. Kitayama, An Adaptive Echo Canceller with Variable Step Gain Method, Trans. of the IECE of Japan, vol. E-65, no. 1, pp. 1–8, 1982.
48. H. Yasukawa and S. Shimada, An Acoustic Echo Canceller Using Subband Sampling and Decorrelation Methods, IEEE Trans. Signal Processing, vol. 41, pp. 926–930, 1993.
49. H. Ye and B.-X. Wu, A New Double-Talk Detection Algorithm Based on the Orthogonality Theorem, IEEE Trans. on Communications, vol. 39, no. 11, pp. 1542–1545, 1991.
AFFINE PROJECTION ALGORITHMS
STEVEN L. GAY
Bell Labs, Lucent, Murray Hill, New Jersey
7.1 INTRODUCTION
The APA and its more efficient implementations have been applied to many problems. It is especially useful in applications involving speech and acoustics, because acoustic problems are often modeled by long finite impulse response (FIR) filters and are often excited by speech, which can be decorrelated with a relatively low-order prediction filter. The most natural application is in the acoustic echo cancellation of voice [6, 38, 14, 41]. More recently, the APA and its descendants have debuted in multichannel acoustic echo cancellation [50, 37, 38, 23]. It is also useful in network echo cancellation [47], a problem that also has long adaptive FIR filters.

The APA has also been used in equalizers for data communications applications [19, 54], active noise control [28], and neural network training algorithms [18].
7.2 THE APA
The APA is an adaptive FIR filter. An adaptive filter attempts to predict the most recent outputs, {d(n), d(n-1), ..., d(n-N+1)}, of an unknown system, w_sys, from the most recent system inputs, {u(n), u(n-1), ..., u(n-L-N+2)}, and the previous system estimate, w(n-1). This arrangement is shown in Figure 7.1.

The two equations that define a relaxed and regularized APA are as follows. First, the system prediction error is calculated:

    e(n) = d(n) - U^t(n) w(n-1),                                    (7.1)

and then the coefficient vector is updated:

    w(n) = w(n-1) + μ U(n)[U^t(n)U(n) + δI]^(-1) e(n),              (7.2)

where the superscript t denotes transpose, I denotes the identity matrix, and the following definitions are made:
1. u(n) is the excitation signal and n is the time index.

Figure 7.1  The adaptive filtering arrangement.

2.

    u(n) = [u(n), u(n-1), ..., u(n-L+1)]^t                          (7.3)

is the L-length excitation or tap-delay-line vector.

3.

    a(n) = [u(n), u(n-1), ..., u(n-N+1)]^t                          (7.4)

is the N-length excitation vector.

4.

    U(n) = [u(n), u(n-1), ..., u(n-N+1)] = [ a(n)^t
                                             a(n-1)^t
                                             ...
                                             a(n-L+1)^t ]           (7.5)

is the L-by-N excitation matrix.

5.

    w(n) = [w_0(n), w_1(n), ..., w_{L-1}(n)]^t                      (7.6)

is the L-length adaptive coefficient vector, where w_i(n) is the ith adaptive tap weight or coefficient at time n.

6.

    w_sys = [w_{0,sys}, w_{1,sys}, ..., w_{L-1,sys}]^t              (7.7)

is the L-length system impulse response vector, where w_{i,sys} is the ith tap weight or coefficient.

7. y(n) is the measurement noise signal. In the language of echo cancellation, it is the near-end signal, which consists of the near-end talker's voice and/or background noise.

8.

    y(n) = [y(n), y(n-1), ..., y(n-N+1)]^t                          (7.8)

is the N-length measurement noise vector.

9.

    d(n) = [d(n), d(n-1), ..., d(n-N+1)]^t                          (7.9)

is the N-length desired or system output vector. Its elements consist of the echo plus any additional signal added in the echo path.
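As a concrete illustration of these definitions, the short Python sketch below (the helper names are ours, not the chapter's) builds u(n), a(n), and U(n) from a sample sequence and checks the structure of the excitation matrix: its rows are the vectors a(n-i)^t while its columns are the vectors u(n-j).

```python
def tap_vector(u, n, length):
    """Return [u(n), u(n-1), ..., u(n-length+1)]; samples before time 0 are zero."""
    return [u[n - k] if n - k >= 0 else 0.0 for k in range(length)]

def excitation_matrix(u, n, L, N):
    """L-by-N matrix U(n) whose columns are u(n), u(n-1), ..., u(n-N+1) (eq. 7.5)."""
    cols = [tap_vector(u, n - j, L) for j in range(N)]
    # transpose the list of columns into a list of rows
    return [[cols[j][i] for j in range(N)] for i in range(L)]

u = [float(k + 1) for k in range(20)]   # toy excitation signal u(k) = k + 1
L, N, n = 4, 3, 10
U = excitation_matrix(u, n, L, N)

# Row i of U(n) is a(n-i)^t, the N-length excitation vector delayed by i.
for i in range(L):
    assert U[i] == tap_vector(u, n - i, N)
# Column j of U(n) is u(n-j), the L-length tap-delay-line vector.
for j in range(N):
    assert [U[i][j] for i in range(L)] == tap_vector(u, n - j, L)
```

The double role of U(n), as a bank of delayed tap vectors and as a stack of short excitation vectors, is what the fast algorithms later in the chapter exploit.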
10.

    e(n) = d(n) - u^t(n) w(n)                                       (7.10)

is the error signal.

The APA update is derived by minimizing the norm of the coefficient update vector,

    r(n) = w(n) - w(n-1),                                           (7.11)

under the constraint that the new coefficients yield an N-length a posteriori error vector, defined as

    e_1(n) = d(n) - U^t(n) w(n),                                    (7.13)

that is element by element a factor of (1-μ) smaller than the N-length a priori error vector,

    e(n) = d(n) - U^t(n) w(n-1):                                    (7.14)

    e_1(n) = (1-μ) e(n) = e(n) - U^t(n) r(n).                       (7.15)

Equivalently,

    U^t(n) r(n) = μ e(n),                                           (7.16)

whose minimum-norm solution is

    r(n) = μ U(n)[U^t(n) U(n)]^(-1) e(n).                           (7.17)
For N = 1, the matrix U(n) reduces to the single vector u(n), and (7.1) and (7.2) become

    e(n) = d(n) - u^t(n) w(n-1)                                     (7.18)

    w(n) = w(n-1) + μ u(n) e(n) / [u^t(n) u(n) + δ],                (7.19)

which is the familiar NLMS algorithm. Thus, we see that APA is a generalization of NLMS.
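The N = 1 special case is easy to exercise numerically. The following Python sketch (a minimal illustration we add here, not code from the chapter) runs NLMS on a known 4-tap system with white excitation and no measurement noise, and checks that the coefficient estimate converges to the true system:

```python
import random

def nlms_identify(w_sys, n_iter=2000, mu=0.5, delta=1e-3, seed=1):
    """Identify w_sys with NLMS: w(n) = w(n-1) + mu*u(n)*e(n)/(u^t(n)u(n)+delta)."""
    rng = random.Random(seed)
    L = len(w_sys)
    w = [0.0] * L
    buf = [0.0] * L                          # tap-delay line u(n), ..., u(n-L+1)
    for _ in range(n_iter):
        buf = [rng.gauss(0.0, 1.0)] + buf[:-1]
        d = sum(ws * x for ws, x in zip(w_sys, buf))           # desired signal (noiseless)
        e = d - sum(wi * x for wi, x in zip(w, buf))           # a priori error, eq. (7.18)
        norm = sum(x * x for x in buf) + delta
        w = [wi + mu * x * e / norm for wi, x in zip(w, buf)]  # update, eq. (7.19)
    return w

w_sys = [1.0, -0.5, 0.25, 0.1]
w_hat = nlms_identify(w_sys)
err = sum((a - b) ** 2 for a, b in zip(w_sys, w_hat)) ** 0.5
assert err < 1e-6
```

With colored excitation (e.g., speech) NLMS converges much more slowly, which is precisely the gap the APA closes by projecting over N directions at once.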
7.3
We now show that the APA as expressed in (7.1) and (7.2) indeed represents a projection onto an affine subspace. In Figure 7.2a we show the projection of a vector, w(n-1), onto a linear subspace, S, where we have a space dimension of L = 3 and a subspace dimension of L - N = 2. Note that an (L - N)-dimensional linear subspace is a subspace spanned by any linear combination of L - N vectors. One of those combinations is where all of the coefficients are 0; so, a linear subspace always includes the origin. Algebraically, we represent the projection as

    g = Q w(n-1),                                                   (7.20)

where

    Q = V [ 0 0 0
            0 1 0
            0 0 1 ] V^t                                             (7.21)

and V is a unitary matrix (i.e., a rotation matrix). In general, the diagonal matrix in (7.21) has N 0s and L - N 1s along its diagonal.

Figure 7.2  (a) Projection onto an affine subspace. (b) Relaxed projection onto an affine subspace.
Figure 7.2b shows a relaxed projection. Here, g ends up only partway between w(n-1) and S. The relaxed projection is still represented by (7.20), but with

    Q = V [ 1-μ 0 0
            0   1 0
            0   0 1 ] V^t.                                          (7.22)

An affine subspace is a linear subspace offset from the origin, and projection onto it may be written as

    g = Q w(n-1) + f,                                               (7.23)

where f is in the null space of Q; that is, Qf equals an all-zero vector. Figure 7.3b shows a relaxed projection onto the affine subspace. As before, μ = 1/3.

Manipulating (7.1), (7.2), and (7.9) and assuming that y(n) = 0 and δ = 0, we can express the APA tap update as

    w(n) = [I - μ U(n)(U^t(n)U(n))^(-1) U^t(n)] w(n-1)
           + μ U(n)(U^t(n)U(n))^(-1) U^t(n) w_sys.                  (7.24)

Define

    Q(n) = I - μ U(n)(U^t(n)U(n))^(-1) U^t(n).                      (7.25)

Figure 7.3  (a) Projection onto an affine subspace. (b) Relaxed projection onto an affine subspace.
Q(n) has the decomposition

    Q(n) = V(n) diag(1-μ, ..., 1-μ, 1, ..., 1) V(n)^t,

and if μ = 1,

    Q(n) = V(n) diag(0, ..., 0, 1, ..., 1) V(n)^t,                  (7.26)

where there are N 0s and L - N 1s in the diagonal matrix. Similarly, define
    P(n) = μ U(n)(U^t(n)U(n))^(-1) U^t(n)                           (7.27)
         = V(n) diag(μ, ..., μ, 0, ..., 0) V(n)^t,                  (7.28)

and if μ = 1,

    P(n) = V(n) diag(1, ..., 1, 0, ..., 0) V(n)^t,                  (7.29)
where there are N 1s and L - N 0s in the diagonal matrix. That is, Q(n) and P(n) represent projection matrices onto orthogonal subspaces when μ = 1 and relaxed projection matrices when 0 < μ < 1. Note that the matrix Q(n) in (7.26) has the same form as in (7.21). Using (7.25) and (7.27) in (7.24), the APA coefficient vector update becomes

    w(n) = Q(n) w(n-1) + P(n) w_sys,                                (7.30)

which is the same form as the affine projection defined in (7.23), where now Q = Q(n) and f = P(n) w_sys. Thus, (7.1) and (7.2) represent the relaxed projection of the system impulse response estimate onto an affine subspace which is determined by (1) the excitation matrix U(n) (according to (7.25) and (7.27)) and (2) the true system impulse response, w_sys (according to (7.30)).
7.4 REGULARIZATION
Equation (7.30) gives us an intuitive feel for the convergence of w(n) to w_sys. Let us assume that μ = 1. We see that as N increases from 1 toward L, the contribution to w(n) from w(n-1) decreases because the nullity of Q(n) is increasing, while the contribution from w_sys increases because the rank of P(n) is increasing. In principle, when N = L, w(n) should converge to w_sys in one step, since Q(n) has a rank of 0 and P(n) a rank of L. In practice, however, we usually find that as N approaches L, the condition number of the matrix U^t(n)U(n) begins to grow. As a result, the inverse of U^t(n)U(n) becomes more and more dubious and must be replaced with either a regularized inverse or a pseudo-inverse. Either way, the useful (i.e., signal-based) rank of P(n) ends up being somewhat less than L. Still, for moderate values of N, even when the inverse of U^t(n)U(n) is regularized, the convergence of w(n) is quite impressive, as we shall demonstrate.

The inverse of U^t(n)U(n) can be regularized by adding the matrix δI prior to taking the inverse. The matrix I is the N-by-N identity matrix and δ is a small positive scalar. Where U^t(n)U(n) may have eigenvalues close to zero, creating problems for the inverse, U^t(n)U(n) + δI has δ as its smallest possible eigenvalue, which, if large enough, yields a well-behaved inverse. The regularized APA tap update is then

    w(n) = w(n-1) + μ U(n)[U^t(n)U(n) + δI]^(-1) e(n).              (7.31)

Defining the coefficient error vector as

    Δw(n) = w(n) - w_sys                                            (7.32)

and using (7.1) and (7.27), we can express the coefficient error update as

    Δw(n) = [I - P(n)] Δw(n-1) + μ U(n)[U^t(n)U(n) + δI]^(-1) y(n).  (7.33)
The excitation matrix U(n) can be expanded using the singular value decomposition (SVD) to

    U(n) = F(n) S(n) V(n)^t,                                        (7.37)

where F(n) and V(n) are unitary and S(n) carries the singular values ρ_i of U(n) along its diagonal. Multiplying (7.39) from the left by F(n)^t and defining a rotated coefficient error vector,

    Δg(n) = F(n)^t Δw(n),                                           (7.40)
Each element of Δg(n) has its own convergence gain factor, the ith one being

    τ_Δg(ρ_i) = 1 - μ ρ_i²/(ρ_i² + δ) = [(1-μ)ρ_i² + δ]/(ρ_i² + δ),  (7.42)

while the noise term of the update,

    μ U(n)[U^t(n)U(n) + δI]^(-1) y(n),                              (7.43)

has, for each mode, the noise amplification factor

    τ_z(ρ_i) = μ ρ_i/(ρ_i² + δ).                                    (7.46)

The maximum of τ_z is 1/(2√δ), occurring at ρ_i = √δ. For ρ_i ≫ √δ, τ_z ≈ 1/ρ_i, and for ρ_i ≪ √δ, τ_z ≈ ρ_i/δ.
Figures 7.4 and 7.5 show the shape of τ_Δg and τ_z as a function of ρ_i for a fixed regularization, δ. In Figure 7.5, δ = 300, and in Figure 7.4, δ = 0. In both figures the step-size is μ = 0.98.

In both figures τ_Δg is at or approaches 1 - μ, and τ_z behaves as 1/ρ_i for large ρ_i. A τ_Δg < 0 dB means that the coefficient error would decrease for this mode if the noise were sufficiently small. We will return to this thought in the next section.

In Figure 7.4, where there is no regularization (δ = 0), the noise amplification factor, τ_z, approaches infinity, and the coefficient error convergence factor, τ_Δg, remains very small as the excitation singular value, ρ_i, approaches zero. This means that for modes with little excitation, the effect of noise on the coefficient error increases without bound as the modal excitation singular value approaches zero. Contrast this with the behavior of τ_z and τ_Δg when δ = 300, as in Figure 7.5. The noise amplification factor, τ_z, becomes much smaller, and τ_Δg approaches 0 dB as ρ_i drops below √δ ≈ 17.3. This means that for modes with little excitation, the effect of noise on the coefficient error is suppressed, as is any change in the coefficients.
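The two modal factors in (7.42) and (7.46) are simple enough to tabulate directly. The Python sketch below (an illustration we add here, not code from the chapter) evaluates them for μ = 0.98 and δ = 300 and confirms the limiting behavior just described:

```python
def tau_dg(rho, mu, delta):
    """Modal coefficient error convergence factor, eq. (7.42)."""
    return ((1.0 - mu) * rho**2 + delta) / (rho**2 + delta)

def tau_z(rho, mu, delta):
    """Modal noise amplification factor, eq. (7.46)."""
    return mu * rho / (rho**2 + delta)

mu, delta = 0.98, 300.0
sqrt_delta = delta ** 0.5                      # about 17.3

# tau_z peaks at rho = sqrt(delta) with value mu / (2*sqrt(delta)).
peak = tau_z(sqrt_delta, mu, delta)
assert abs(peak - mu / (2.0 * sqrt_delta)) < 1e-12
assert tau_z(sqrt_delta / 2, mu, delta) < peak
assert tau_z(sqrt_delta * 2, mu, delta) < peak

# For rho >> sqrt(delta): tau_dg -> 1 - mu and tau_z ~ mu / rho.
assert abs(tau_dg(1e4, mu, delta) - (1.0 - mu)) < 1e-4
assert abs(tau_z(1e4, mu, delta) - mu / 1e4) < 1e-8

# For rho << sqrt(delta): tau_dg -> 1 (0 dB), so poorly excited modes barely move.
assert tau_dg(0.1, mu, delta) > 0.999
```

For μ near unity the peak noise amplification is approximately 1/(2√δ), which is the value quoted in the text.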
Figure 7.4  The nonregularized modal convergence and noise amplification factors as a function of the modal input signal magnitude.

Figure 7.5  The regularized modal convergence and noise amplification factors as a function of the modal input signal magnitude.
7.5
One may prove the convergence of an adaptive filtering algorithm if it can be shown that each iteration, or coefficient update, is a contraction mapping on the norm of the coefficient error vector. That is, the norm of Δw(n) should always be less than or equal to the norm of Δw(n-1). In this section we show that this indeed is a property of APA when there is no noise [6, 32], and, when noise is present, we show the conditions under which the contraction mapping continues to hold [6]. We begin by rewriting the coefficient error update, Δw(n), as the sum of two mutually orthogonal parts,

    Δw(n) = [I - P̃(n)] Δw(n-1)
            + [P̃(n) - P(n)] Δw(n-1) + μ U(n)[U^t(n)U(n) + δI]^(-1) y(n),   (7.47)

where

    P̃(n) = U(n)[U^t(n)U(n)]^(-1) U^t(n)                             (7.48)

is the unrelaxed, unregularized projection matrix.
Multiplying from the left by F(n)^t and applying (7.41), we can write the ith element of Δg(n), Δg_i(n), as

    Δg_i(n) = [1 - k_i μ ρ_i(n)²/(ρ_i(n)² + δ)] Δg_i(n-1)
              + k_i μ [ρ_i(n)/(ρ_i(n)² + δ)] z_i(n),                (7.50)

where z_i(n) is the ith element of the rotated noise vector z(n) = V^t(n) y(n), and

    k_i = 1 for 1 ≤ i ≤ N,   k_i = 0 for N < i ≤ L.                 (7.51)
A contraction mapping requires that the inequality

    0 ≤ ‖Δg_i(n-1)‖ - ‖Δg_i(n)‖                                     (7.53)

holds. It will be instructive, however, to also consider the case where a slight amount of expansion, denoted by a small positive number, C_G (the G stands for growth), is allowed. This is expressed by the inequality

    -C_G < ‖Δg_i(n-1)‖ - ‖Δg_i(n)‖.                                 (7.54)

We will refer to this as an expansion control mapping. This approach will allow us to investigate the behavior of the contraction mapping of APA when the excitation singular value is very small. Note that by simply setting C_G = 0 we once again get the contraction mapping requirement. Using (7.50) and (7.52), we write the requirement for the expansion control for mode i as
    -C_G < ‖Δg_i(n-1)‖ - ‖[1 - k_i μ ρ_i(n)²/(ρ_i(n)² + δ)] Δg_i(n-1)
                           + k_i μ [ρ_i(n)/(ρ_i(n)² + δ)] z_i(n)‖.   (7.55)
From now on, we will drop the use of the k_i with the understanding that we are only concerned with the ith mode, where 1 ≤ i ≤ N. Assuming that Δg_i(n-1) > 0 (assuming otherwise yields the same result) and then manipulating (7.55), we obtain

    -μ ρ_i(n) Δg_i(n-1) - C_G [ρ_i(n)² + δ]/ρ_i(n)
        < μ z_i(n) <
    [(2-μ) ρ_i(n)² + 2δ] Δg_i(n-1)/ρ_i(n) + C_G [ρ_i(n)² + δ]/ρ_i(n).   (7.56)

First, let us consider the case where ρ_i(n)² ≫ δ. By dropping small terms, we may simplify inequality (7.56) to

    -μ ρ_i(n) Δg_i(n-1) - C_G ρ_i(n) < μ z_i(n)
        < (2-μ) ρ_i(n) Δg_i(n-1) + C_G ρ_i(n).                      (7.57)
Concentrating on the right-hand inequality, the more restrictive of the two, and considering the case where C_G = 0, we see that as long as the noise signal magnitude is smaller than the residual echo magnitude for a given mode, the inequality is upheld, implying that there is a contraction mapping on the coefficient error for that mode. Allowing some expansion, C_G > 0, we see that the noise can be larger than the residual error by C_G ρ_i(n)/μ. Since we have assumed that ρ_i(n) is large and we know that 0 < μ ≤ 1, then for a little bit of expansion we gain a great deal of leeway in additional noise.

The inequalities of (7.57) also provide insight into the situation where there is no regularization and the modal excitation is very small. Then the noise power must also be very small so as not to violate either the contraction mapping or expansion control constraints.
If, however, we allow regularization and ρ_i(n)² ≪ δ, inequality (7.56) becomes

    -C_G δ/(μ ρ_i(n)) < z_i(n) < [2 Δg_i(n-1) + C_G] δ/(μ ρ_i(n)),   (7.58)

or, equivalently,

    -C_G < [μ ρ_i(n)/δ] z_i(n) < 2 Δg_i(n-1) + C_G.                 (7.59)

In the inequalities of (7.59), as ρ_i(n) gets smaller, the noise term in the middle becomes vanishingly small, meaning that C_G, the noise expansion control constant, may also become arbitrarily small.
Of course, one may look at both sets of inequalities, (7.57) and (7.58), and conclude that decreasing μ would have the same effect as increasing δ. But if one also observes the coefficient error term, one sees that there is a greater price paid in terms of slowing the coefficient error convergence when μ is lowered as opposed to increasing δ. Recall the modal coefficient error reduction factor of (7.42),

    τ_Δg(ρ_i) = [(1-μ)ρ_i² + δ]/(ρ_i² + δ).                         (7.60)

For modes where ρ_i(n)² ≫ δ, a small μ will slow the modal coefficient error convergence by making τ_Δg ≈ 1. On the other hand, a μ close to unity will speed the convergence by making τ_Δg ≈ δ/ρ_i(n)², a very small value, given our assumption.
The inequalities (7.57) and (7.58) show that the regularization parameter plays little part in the noise term for those modes with large singular values but heavily influences the noise term for those modes with small singular values. So, in analyzing the effect of the regularization parameter, it is useful to focus attention on the lesser excited modes. Accordingly, we observe that the maximum allowable noise magnitude is directly proportional to the regularization parameter, δ:

    max |z_i(n)| = C_G δ/(μ ρ_i(n)).                                (7.61)

Therefore, if the noise level increases, the regularization level should increase by the same factor to maintain the same degree of regularization.
7.6
Using the matrix inversion lemma, we can show the connection between APA and RLS. The matrix inversion lemma states that if the nonsingular matrix A can be written as

    A = B + CD,                                                     (7.62)

then its inverse is

    A^(-1) = B^(-1) - B^(-1) C (I + D B^(-1) C)^(-1) D B^(-1).      (7.63)

Consider the a priori error vector at time n,

    e(n) = d(n) - U^t(n) w(n-1),                                    (7.66)

and the a posteriori error vector at time n-1,

    e_1(n-1) = d(n-1) - U^t(n-1) w(n-1).                            (7.67)

Using (7.2), the a posteriori error vector can be written as

    e_1(n-1) = [I - μ U^t(n-1)U(n-1)(U^t(n-1)U(n-1) + δI)^(-1)] e(n-1).   (7.68)
From (7.68), when δ is much smaller than the smallest eigenvalue of U^t(n-1)U(n-1), we may use the approximation

    e_1(n-1) ≈ (1-μ) e(n-1).                                        (7.69)

Recognizing that the lower N-1 elements of (7.66) are the same as the upper N-1 elements of (7.67), we see that we can use (7.69) to express e(n) as

    e(n) = d(n) - U^t(n) w(n-1)
         = [ d(n) - u^t(n) w(n-1)
             (1-μ) ẽ(n-1)         ]
         = [ e(n)
             (1-μ) ẽ(n-1) ],                                        (7.70)

where ẽ(n-1) denotes the upper N-1 elements of e(n-1). Then, for μ = 1,

    e(n) = [ e(n)
             0    ].                                                (7.71)

Substituting (7.71) into (7.2) with μ = 1 gives the tap update

    w(n) = w(n-1) + U(n)[U^t(n)U(n) + δI]^(-1) [ e(n)
                                                 0    ].            (7.72)

Equation (7.72) is very similar to RLS. The difference is that the matrix which is inverted is a regularized, rank-deficient form of the usual estimated autocorrelation matrix. If we let δ = 0 and N = n, (7.72) becomes the growing windowed RLS.
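The property behind (7.71), namely that with μ = 1 the a posteriori errors over the N-sample window are (essentially) zero, can be checked with a single exact APA step. The sketch below (our own Python illustration, N = 2, a closed-form 2-by-2 solve, and an almost-zero δ) verifies it on fixed data:

```python
# One APA step (mu = 1, tiny delta) on fixed data; afterwards the a posteriori
# error vector e1(n) = d(n) - U^t(n) w(n) should be (nearly) zero.
u0 = [1.0, 0.5, -0.3, 0.8]            # u(n)
u1 = [0.5, -0.3, 0.8, 0.2]            # u(n-1)
w_sys = [0.9, -0.2, 0.4, 0.05]
w = [0.0, 0.0, 0.0, 0.0]              # w(n-1)
d = [sum(a * b for a, b in zip(w_sys, c)) for c in (u0, u1)]
e = [di - sum(a * b for a, b in zip(w, c)) for di, c in zip(d, (u0, u1))]

delta = 1e-12
r00 = sum(a * a for a in u0) + delta
r11 = sum(a * a for a in u1) + delta
r01 = sum(a * b for a, b in zip(u0, u1))
det = r00 * r11 - r01 * r01
z0 = (r11 * e[0] - r01 * e[1]) / det
z1 = (r00 * e[1] - r01 * e[0]) / det
w = [wi + a * z0 + b * z1 for wi, a, b in zip(w, u0, u1)]   # mu = 1 update

e1 = [di - sum(a * b for a, b in zip(w, c)) for di, c in zip(d, (u0, u1))]
assert all(abs(x) < 1e-9 for x in e1)
```

With a nonnegligible δ the a posteriori errors would no longer be exactly zero, which is the trade-off the regularization section analyzed.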
7.7
First, we write the relaxed and regularized affine projection algorithm in a slightly different form:

    e(n) = d(n) - U^t(n) w(n-1)                                     (7.73)

    z(n) = [U^t(n) U(n) + δI]^(-1) e(n)                             (7.74)

    w(n) = w(n-1) + μ U(n) z(n).                                    (7.75)
For the matrix solve in (7.74), K_inv is about 7. One way to reduce this computational complexity is to update the coefficients only once every N sample periods [9], reducing the average complexity (over N sample periods) to 2L + K_inv N multiplies per sample period. This is known as the partial rank algorithm (PRA). Simulations indicate that when very highly colored excitation signals are used, the convergence of PRA is somewhat inferior to that of APA. For speech excitation, however, we have found that PRA achieves almost the same convergence as APA. The main disadvantage of PRA is that its computational complexity is bursty. So, depending on the speed of the implementing technology, there is often a delay in the generation of the error vector, e(n). As will be shown below, FAP performs a complete N-dimensional APA update each sample period with 2L + O(N) multiplies per sample, without delay.
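The three update equations (7.73)-(7.75) can be exercised directly. The self-contained Python sketch below (our own illustration, not the chapter's code) runs a regularized APA with projection order N = 2, using a closed-form 2-by-2 solve for (7.74), on a colored AR(1) excitation of the kind that slows NLMS down:

```python
import random

def apa_n2_identify(w_sys, n_iter=3000, mu=1.0, delta=1e-4, seed=7):
    """Relaxed/regularized APA with projection order N = 2 (eqs. 7.73-7.75)."""
    rng = random.Random(seed)
    L = len(w_sys)
    w = [0.0] * L
    hist = [0.0] * (L + 1)                 # enough history for u(n) and u(n-1)
    x = 0.0
    for _ in range(n_iter):
        x = 0.9 * x + rng.gauss(0.0, 1.0)  # AR(1) colored excitation
        hist = [x] + hist[:-1]
        u0, u1 = hist[:L], hist[1:]        # columns of U(n): u(n), u(n-1)
        d0 = sum(a * b for a, b in zip(w_sys, u0))
        d1 = sum(a * b for a, b in zip(w_sys, u1))
        # (7.73): a priori error vector e(n) = d(n) - U^t(n) w(n-1)
        e0 = d0 - sum(a * b for a, b in zip(w, u0))
        e1 = d1 - sum(a * b for a, b in zip(w, u1))
        # (7.74): z(n) = (U^t U + delta I)^(-1) e(n), closed form for N = 2
        r00 = sum(a * a for a in u0) + delta
        r11 = sum(a * a for a in u1) + delta
        r01 = sum(a * b for a, b in zip(u0, u1))
        det = r00 * r11 - r01 * r01
        z0 = (r11 * e0 - r01 * e1) / det
        z1 = (r00 * e1 - r01 * e0) / det
        # (7.75): w(n) = w(n-1) + mu * U(n) z(n)
        w = [wi + mu * (a * z0 + b * z1) for wi, a, b in zip(w, u0, u1)]
    return w

w_sys = [0.8, -0.4, 0.2, -0.1]
w_hat = apa_n2_identify(w_sys)
err = sum((a - b) ** 2 for a, b in zip(w_sys, w_hat)) ** 0.5
assert err < 1e-4
```

The per-sample cost of this direct form grows with N²; FAP's contribution, developed next, is to reach the same update with only O(N) work beyond the two length-L convolutions.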
7.7.1
Earlier, we justified the approximation in relation (7.69) on the assumption that the regularization factor δ would be much smaller than the smallest eigenvalue of U^t(n)U(n). In this section we examine the situation where that assumption does not hold, yet we would like to use relation (7.69) anyway. This case arises, for instance, when N is selected to be in the neighborhood of 50, speech is the excitation signal, and the near-end background noise signal energy is larger than the smaller eigenvalues of U^t(n)U(n).
We begin by rewriting (7.68) slightly:

    e_1(n-1) = [I - μ U^t(n-1)U(n-1)(U^t(n-1)U(n-1) + δI)^(-1)] e(n-1).   (7.76)

The matrix U^t(n-1)U(n-1) has the similarity decomposition

    U^t(n-1) U(n-1) = V(n-1) Λ(n-1) V(n-1)^t.                       (7.77)

Defining the a priori and a posteriori modal error vectors,

    e'(n-1) = V(n-1)^t e(n-1)                                       (7.78)

and

    e'_1(n-1) = V(n-1)^t e_1(n-1),                                  (7.79)

respectively, we can multiply (7.76) from the left by V(n-1)^t and show that the ith a posteriori modal error vector element, e'_{1,i}(n-1), can be found from the ith a priori element as

    e'_{1,i}(n-1) = [1 - μ λ_i(n-1)/(λ_i(n-1) + δ)] e'_i(n-1),      (7.80)

so that

    e'_{1,i}(n-1) ≈ (1-μ) e'_i(n-1)   for λ_i(n-1) ≫ δ,
    e'_{1,i}(n-1) ≈ e'_i(n-1)          for λ_i(n-1) ≪ δ.            (7.81)
Assume that δ is chosen to be approximately equal to the power of y(n). Then, for those modes where λ_i(n-1) ≪ δ, e'_i(n-1) is mainly dominated by the background noise, and little can be learned about w_sys from it. So, suppressing these modes by multiplying them by (1-μ) will attenuate somewhat the background noise's effect on the overall echo path estimate. Applying this to (7.81) and multiplying from the left by V(n-1), we have

    e_1(n-1) ≈ (1-μ) e(n-1),                                        (7.82)

and from this, (7.70). From (7.76) we see that approximation (7.70) becomes an equality when δ = 0, but then the inverse in (7.76) is not regularized. Simulations show that by making adjustments in δ, the convergence performance of APA with and without approximation (7.76) can be equated. We call (7.82) the FAP approximation, as it is key to providing the algorithm's low complexity. Further justification of it is given in Section 7.7.7.

The complexity of (7.76) is L operations to calculate e(n) and N-1 operations to update (1-μ)e(n-1). For the case where μ = 1, the N-1 operations are obviously unnecessary.
7.7.2
In many problems of importance, the overall system output that is observed by the user is the error signal. In such cases, it is permissible to maintain any form of w(n) that is convenient as long as the first sample of e(n) is not modified in any way. This is the basis of FAP. The fidelity of e(n) is maintained at each sample period, but w(n) is not. Another vector, ŵ(n), is maintained, where only the last column of U(n) is weighted and accumulated into ŵ(n) in each sample period [10]. Thus, the computational complexity of the tap weight update process is no more complex than NLMS: L multiplications.
One can express the current echo path estimate, w(n), in terms of the original echo path estimate, w(0), and the subsequent U(i)'s and z(i)'s:

    w(n) = w(0) + μ Σ_{i=0}^{n-1} U(n-i) z(n-i)                     (7.83)

         = w(0) + μ Σ_{i=0}^{n-1} Σ_{j=0}^{N-1} u(n-j-i) z_j(n-i)   (7.84)

         = w(0) + μ Σ_{i=0}^{n-1} Σ_{j=0}^{N-1} u(n-j-i) Π_1(j+i) z_j(n-i),   (7.85)

where

    Π_1(j+i) = 1 for 0 ≤ j+i ≤ n-1, and 0 elsewhere.                (7.86)
Substituting k = j + i gives

    w(n) = w(0) + μ Σ_{j=0}^{N-1} Σ_{k=j}^{n-1+j} u(n-k) Π_1(k) z_j(n-k+j)   (7.87)

         = w(0) + μ Σ_{j=0}^{N-1} Σ_{k=j}^{n-1} u(n-k) z_j(n-k+j).            (7.88)
Now we break the second summation into two parts, one from k = j to k = N-1 and one from k = N to k = n-1, with the result

    w(n) = w(0) + μ Σ_{j=0}^{N-1} Σ_{k=j}^{N-1} u(n-k) z_j(n-k+j)
                + μ Σ_{k=N}^{n-1} Σ_{j=0}^{N-1} u(n-k) z_j(n-k+j),   (7.89)
where we have also changed the order of summations in the second double sum. Directing our attention to the first double sum, let us define a second window as

    Π_2(k-j) = 1 for 0 ≤ k-j, and 0 elsewhere.                      (7.90)

Without altering the result, we can use this window in the first double sum and begin the second summation in it at k = 0 rather than k = j:

    Σ_{j=0}^{N-1} Σ_{k=0}^{N-1} u(n-k) Π_2(k-j) z_j(n-k+j)
        = Σ_{j=0}^{N-1} Σ_{k=j}^{N-1} u(n-k) z_j(n-k+j).            (7.91)
Now we again exchange the order of summations and use the window, Π_2(k-j), to change the end of the second summation to j = k rather than j = N-1:

    Σ_{k=0}^{N-1} Σ_{j=0}^{k} u(n-k) z_j(n-k+j)
        = Σ_{k=0}^{N-1} Σ_{j=0}^{N-1} u(n-k) Π_2(k-j) z_j(n-k+j).   (7.92)
Using (7.91) and (7.92) in (7.89), we obtain

    w(n) = w(0) + μ Σ_{k=0}^{N-1} u(n-k) Σ_{j=0}^{k} z_j(n-k+j)
                + μ Σ_{k=N}^{n-1} u(n-k) Σ_{j=0}^{N-1} z_j(n-k+j).   (7.93)
We define the first term and the second pair of summations on the right side of (7.93) as

    ŵ(n) = w(0) + μ Σ_{k=N}^{n-1} u(n-k) Σ_{j=0}^{N-1} z_j(n-k+j),   (7.94)

so that

    w(n) = ŵ(n) + μ Σ_{k=0}^{N-1} u(n-k) Σ_{j=0}^{k} z_j(n-k+j).    (7.95)

The inner sums of (7.95) can be collected into the N-length vector

    E(n) = [ z_0(n)
             z_1(n) + z_0(n-1)
             ...
             z_{N-1}(n) + z_{N-2}(n-1) + ... + z_0(n-N+1) ],        (7.96)

whose kth element is E_k(n) = Σ_{j=0}^{k} z_j(n-k+j).
E(n) can be maintained recursively:

    E(n) = z(n) + [ 0
                    Ẽ(n-1) ],                                       (7.97)

where Ẽ(n-1) denotes the upper N-1 elements of E(n-1). The last element of E(n),

    E_{N-1}(n) = Σ_{j=0}^{N-1} z_j(n-N+1+j),                        (7.99)

is the weight used in the recursive update of ŵ(n):

    ŵ(n) = ŵ(n-1) + μ u(n-N+1) E_{N-1}(n).                          (7.100)

Writing (7.95) at time n-1 in matrix form gives

    w(n-1) = ŵ(n-1) + μ Ū(n-1) Ē(n-1),                              (7.101)

where Ū(n-1) = [u(n-1), ..., u(n-N+1)] and Ē(n-1) contains the first N-1 elements of E(n-1). The a priori error vector is then

    e(n) = d(n) - U^t(n) w(n-1) = [ e(n)
                                    (1-μ) ẽ(n-1) ].                 (7.102)
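The equivalence between the direct double sum for the accumulated weights and the shift-and-add recursion (7.97) can be checked numerically. The Python sketch below (our own illustration; the z_j(n) are filled with random values) builds the vector with elements E_k(n) = Σ_{j=0}^{k} z_j(n-k+j) both ways and confirms agreement:

```python
import random

def direct_E(z_hist, n, N):
    """E_k(n) = sum_{j=0}^{k} z_j(n-k+j), computed directly from the double sum."""
    return [sum(z_hist[n - k + j][j] for j in range(k + 1)) for k in range(N)]

def recursive_E(z_hist, n, N):
    """E(n) = z(n) + [0; upper N-1 elements of E(n-1)], iterated from time 0."""
    E = [0.0] * N
    for t in range(n + 1):
        shifted = [0.0] + E[:-1]          # [0; E_0(t-1), ..., E_{N-2}(t-1)]
        E = [z + s for z, s in zip(z_hist[t], shifted)]
    return E

random.seed(3)
N, n = 4, 12
z_hist = [[random.random() for _ in range(N)] for _ in range(n + 1)]
d_E = direct_E(z_hist, n, N)
r_E = recursive_E(z_hist, n, N)
assert all(abs(a - b) < 1e-12 for a, b in zip(d_E, r_E))
```

This is why FAP can carry the whole window of projection weights forward in O(N) work per sample instead of recomputing the sums.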
Unfortunately, w(n-1) is not readily available to us. But we can use (7.101) in the first element of (7.102) to get

    e(n) = d(n) - u^t(n) ŵ(n-1) - μ u^t(n) Ū(n-1) Ē(n-1)            (7.103)
         = ê(n) - μ r̃^t(n) Ē(n-1),                                  (7.104)

where

    ê(n) = d(n) - u^t(n) ŵ(n-1)                                     (7.105)

and where the correlation vector r̃(n) = Ū^t(n-1) u(n) can be updated recursively as

    r̃(n) = r̃(n-1) + u(n) ã(n) - u(n-L) ã(n-L),                      (7.106)

with ã(n) = [u(n-1), u(n-2), ..., u(n-N+1)]^t.
7.7.3

Define

    z(n) = R(n)^(-1) e(n),                                          (7.107)

where

    R(n) = U^t(n) U(n) + δI,                                        (7.108)

and let a(n) and b(n) denote the respective optimum forward and backward linear predictors for R(n), and let E_a(n) and E_b(n) denote their respective prediction error energies. Also, define R̄(n) and R̃(n) as the upper-left and lower-right N-1 by N-1 matrices within R(n), respectively. Then, given the identities

    R(n)^(-1) = [ 0  0^t
                  0  R̃(n)^(-1) ] + (1/E_a(n)) a(n) a(n)^t           (7.109)

and

    R(n)^(-1) = [ R̄(n)^(-1)  0
                  0^t         0 ] + (1/E_b(n)) b(n) b(n)^t,          (7.110)

and defining

    z̃(n) = R̃(n)^(-1) ẽ(n),                                          (7.112)

multiplying (7.109) from the right by e(n) and using (7.107) gives

    z(n) = [ 0
             z̃(n) ] + (1/E_a(n)) a(n) a(n)^t e(n).                   (7.113)

Similarly, multiplying (7.110) from the right by e(n) and using (7.107) and (7.112),

    z(n) = [ z̄(n)
             0    ] + (1/E_b(n)) b(n) b(n)^t e(n),                    (7.114)

or, rearranging,

    [ z̄(n)
      0    ] = z(n) - (1/E_b(n)) b(n) b(n)^t e(n).                    (7.115)
7.7.4 FAP
The FAP algorithm with regularization and relaxation is given in Table 7.1. Step 1 is of complexity 10N when the FTF (fast transversal filter, an FRLS technique) is used. Steps 3 and 9 are both of complexity L; steps 2, 6, and 7 are each of complexity 2N; and steps 4, 5, 8, and 10 are each of complexity N, for a total complexity of 2L + 20N.
TABLE 7.1  FAP with Regularization and Relaxation

Step    Computation                                                 Equation Reference
0       Initialization: E_a(0) = E_b(0) = δ,
        a(0) = [1, 0^t]^t, b(0) = [0^t, 1]^t
1       Use sliding windowed FRLS to update
        E_a(n), E_b(n), a(n), and b(n)                              See Appendix
2       r̃(n) = r̃(n-1) + u(n)ã(n) - u(n-L)ã(n-L)                    (7.106)
3       ê(n) = d(n) - u^t(n)ŵ(n-1)                                  (7.105)
4       e(n) = ê(n) - μ r̃^t(n)Ē(n-1)                                (7.103)
5       e(n) = [e(n); (1-μ)ẽ(n-1)]                                  (7.70)
6       z(n) = [0; z̃(n)] + (1/E_a(n)) a(n)a(n)^t e(n)               (7.113)
7       [z̄(n); 0] = z(n) - (1/E_b(n)) b(n)b(n)^t e(n)               (7.114)
8       E(n) = z(n) + [0; Ẽ(n-1)]                                    (7.97)
9       ŵ(n) = ŵ(n-1) + μ u(n-N+1)E_{N-1}(n)                        (7.100)
10      z̃(n+1) = (1-μ) z̄(n)                                         (7.117)
For μ = 1, steps 5 through 8 of Table 7.1 collapse into the single update

    E(n) = [ 0
             Ẽ(n-1) ] + [e(n)/E_a(n)] a(n).                          (7.118)
FAP without relaxation is shown in Table 7.2. Here, steps 3 and 6 are still of complexity L, step 2 is of complexity 2N, and steps 4 and 5 are of complexity N. Taking into account the sliding windowed FTF, we now have a total complexity of 2L + 14N.
7.7.5 Simulations
Figure 7.6 shows a comparison of the convergence of NLMS, FTF, and FAP coefficient error magnitudes. The excitation signal was speech sampled at 8 kHz; the system impulse response, of length L = 1000, was fixed; and the white Gaussian additive noise, y(n), was 30 dB down from the system output. Soft initialization was used for both algorithms. For FTF, E_a(0) and E_b(0) were both set to 2σ_u² (where σ_u² is the average power of u(n)), and λ, the forgetting factor, was set to (3L-1)/(3L). For FAP, E_a(0) and E_b(0) were set to δ = 20σ_u², and N was 50. FAP converges at roughly the same rate as FTF, with about 2L complexity versus 7L complexity, respectively. Both FAP and FTF converge faster than NLMS.
TABLE 7.2  FAP Without Relaxation (μ = 1)

Step    Computation                                                 Equation Reference
0       Initialization: E_a(0) = E_b(0) = δ,
        a(0) = [1, 0^t]^t, b(0) = [0^t, 1]^t
1       Use sliding windowed FRLS to update
        E_a(n), E_b(n), a(n), and b(n)                              See Appendix
2       r̃(n) = r̃(n-1) + u(n)ã(n) - u(n-L)ã(n-L)                    (7.106)
3       ê(n) = d(n) - u^t(n)ŵ(n-1)                                  (7.105)
4       e(n) = ê(n) - r̃^t(n)Ē(n-1)                                  (7.103)
5       E(n) = [0; Ẽ(n-1)] + (e(n)/E_a(n)) a(n)                     (7.118)
6       ŵ(n) = ŵ(n-1) + u(n-N+1)E_{N-1}(n)                          (7.100)
Figure 7.6  Comparison of coefficient error for FAP, FTF, and NLMS with speech as excitation.
7.7.6 Numerical Considerations
FAP uses the sliding window technique to update and downdate data in its implicit regularized sample correlation matrix and cross-correlation vector. Errors introduced by finite arithmetic in practical implementations of the algorithm therefore cause the correlation matrix and cross-correlation vector to take random walks with respect to their infinite precision counterparts. A stabilized sliding windowed FRLS algorithm [11] has been introduced, with complexity 14N multiplications per sample period (rather than 10N for nonstabilized versions). However, even this algorithm is stable only for stationary signals, a class of signals which certainly does not include speech. Another approach, which is very straightforward and rather elegant for FAP, is to periodically start a new sliding window in parallel with the old sliding window and, when the data are the same in both processes, replace the old sliding-window-based parameters with the new ones. Although this increases the sliding-window-based parameter calculations by about 50 percent on average (assuming that the restarting is done every L - N sample periods), the overall cost is small, since only those parameters with computational complexity proportional to N are affected. The overall complexity is only 2L + 21N for FAP without relaxation and 2L + 30N for FAP with relaxation. Since this approach is basically a periodic restart, it is numerically stable for all signals.

Figure 7.7
7.7.7
We now explore the effect of the FAP approximation of (7.82) on the noise term of the coefficient update. Returning to the noise term of the APA update as expressed in (7.43), we have

    T_{y,APA} = μ U(n)[U^t(n)U(n) + δI]^(-1) y(n).                   (7.119)

FAP has a similar update, except that the noise vector is weighted with the diagonal matrix

    D_μ = diag{1, (1-μ), ..., (1-μ)^(N-1)},                          (7.120)
which gives

    T_{y,FAP} = μ U(n)[U^t(n)U(n) + δI]^(-1) D_μ y(n).               (7.121)

The norm of T_{y,FAP} can be upper bounded by using the Schwartz inequality,

    ‖T_{y,FAP}‖ ≤ μ ‖U(n)[U^t(n)U(n) + δI]^(-1)‖ ‖D_μ y(n)‖.         (7.122)

Bounding each element of y(n) by its mean absolute value, the sum of the weights in D_μ is the geometric series

    Σ_{i=0}^{N-1} (1-μ)^i = [1 - (1-μ)^N]/μ.                         (7.124)

Taking the ratio of the FAP to the APA noise term upper bounds, we get

    ‖T_{y,FAP}‖_MA / ‖T_{y,APA}‖_MA = [1 - (1-μ)^N]/(Nμ).            (7.126)

This expression represents the proportional decrease in noise due to the FAP approximation compared to APA. As mentioned above, to maintain the same level of regularization, the FAP regularization must be multiplied by the same factor. Thus,

    δ_F = {[1 - (1-μ)^N]/(Nμ)} δ_A,                                  (7.127)

where δ_F and δ_A denote the FAP and APA regularization parameters, respectively.
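The scaling factor in (7.126) and (7.127) is easy to sanity-check numerically. In the Python sketch below (our own illustration), the ratio equals 1/N when μ = 1 and approaches 1 as μ → 0, so the FAP approximation attenuates the noise term most strongly for μ near unity:

```python
def fap_apa_noise_ratio(mu, N):
    """Ratio of FAP to APA noise-term upper bounds, eq. (7.126)."""
    return (1.0 - (1.0 - mu) ** N) / (N * mu)

N = 50
# mu = 1: D_mu keeps only the first noise sample, so the ratio is 1/N.
assert abs(fap_apa_noise_ratio(1.0, N) - 1.0 / N) < 1e-15
# mu -> 0: D_mu approaches the identity and the ratio approaches 1.
assert abs(fap_apa_noise_ratio(1e-6, N) - 1.0) < 1e-3
# The matched FAP regularization of eq. (7.127) scales the same way.
delta_A = 300.0
delta_F = fap_apa_noise_ratio(1.0, N) * delta_A
assert abs(delta_F - delta_A / N) < 1e-9
```

In other words, for μ = 1 the FAP regularization δ_F should be set about N times smaller than the APA value δ_A to obtain matching behavior.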
Figure 7.8

Figure 7.9  The eigenvalues of R_uu, the N-by-N excitation covariance matrix of the experiment of Figure 7.8, along with the noise and regularization levels.
7.8
This section discusses block exact methods for FAP [33] and APA. Block exact methods were first introduced by Benesty and Duhamel [55]. They are designed to give the so-called block adaptive filtering algorithms, whose coefficients are updated only once every M samples (the block length), the same convergence properties as per-sample algorithms, those whose coefficients are updated every sample period. The advantage of block algorithms is that, since the coefficients remain stationary over the block length, fast convolution techniques may be used in both the error calculation and the coefficient update. The disadvantage of block algorithms is that, because the coefficients are updated less frequently, they are slower to converge. The block exact methods eliminate this disadvantage.

In this section we consider block exact FAP updates for a block size of length M. The goal of the block exact version is to produce the same joint-process error sequence, e(n) = d(n) - u^t(n)w(n), as the per-sample version of Table 7.1. First, consider the calculation of the FAP joint-process error signal, ê(n), and begin with an example block size of M = 3. At sample period n - 2,
    ê(n-2) = d(n-2) - u^t(n-2) ŵ(n-3).                              (7.128)

At sample period n - 1,

    ê(n-1) = d(n-1) - u^t(n-1) ŵ(n-2)                               (7.129)
           = d(n-1) - u^t(n-1) ŵ(n-3)
             - μ u^t(n-1) u(n-N-1) E_{N-1}(n-2).                    (7.130)

The FAP coefficient vector update from sample period n-3 to n can be written as

    ŵ(n) = ŵ(n-3) + μ [u(n-N+1) E_{N-1}(n)
                       + u(n-N) E_{N-1}(n-1)
                       + u(n-N-1) E_{N-1}(n-2)].                    (7.131)
Stacking the M = 3 errors and using (7.128)-(7.131), we have

    ê_3(n) = [ d(n)
               d(n-1)
               d(n-2) ] - [ u^t(n)
                            u^t(n-1)
                            u^t(n-2) ] ŵ(n-3)
             - μ [ 0  u^t(n)u(n-N)  u^t(n)u(n-N-1)
                   0  0             u^t(n-1)u(n-N-1)
                   0  0             0                ] [ E_{N-1}(n)
                                                         E_{N-1}(n-1)
                                                         E_{N-1}(n-2) ].   (7.133)

Defining e_3(n) as the first two terms on the right side of (7.133) and

    r_i(n) = u^t(n) u(n-i),                                          (7.134)

we can write

    ê_3(n) = e_3(n) - μ [ 0  r_N(n)  r_{N+1}(n)
                          0  0       r_N(n-1)
                          0  0       0          ] [ E_{N-1}(n)
                                                    E_{N-1}(n-1)
                                                    E_{N-1}(n-2) ].   (7.135)
Collecting the block of update weights into the vector

    F(n) = [E_{N-1}(n), E_{N-1}(n-1), ..., E_{N-1}(n-M+1)]^t,        (7.138)

define the M-by-M selection matrix J_i by

    [J_i]_{jk} = 1 if j = k ≥ i, and 0 otherwise, for 0 ≤ j, k < M.   (7.139)

So, J_i has its ith through (M-1)th diagonal elements equal to unity and all others zero. For M = 3,

    J_0 = diag(1, 1, 1),   J_1 = diag(0, 1, 1),                      (7.140)

and

    J_2 = diag(0, 0, 1).                                             (7.141)

Also, define

    d_M(n) = [d(n), d(n-1), ..., d(n-M+1)]^t,                        (7.142)

    U_M(n) = [u(n), u(n-1), ..., u(n-M+1)],                          (7.143)

    a_{N,M}(n) = [u(n-N), u(n-N-1), ..., u(n-N-M+2)]^t.              (7.144)
Now we can write the block algorithm for arbitrary block size, M. It is shown in Table 7.3.

TABLE 7.3  Block Exact FAP

Step    Computation
1       Initially, ŵ(n-M), r_{N,M}(n-M), and E_{N-1}(n-M) are
        available from the previous block
2       e_M(n) = d_M(n) - U_M^t(n) ŵ(n-M)
3       for i = M-1 down to 0
4           r_{N,M}(n-i) = r_{N,M}(n-i-1) + u(n-i) a_{N,M}(n-i)
                           - u(n-i-L) a_{N,M}(n-i-L)
5           ê(n-i) = e(n-i) - μ r^t_{N,M}(n-i) J_{i+1} F(n)
6           Calculate E(n-i) and e(n-i) using Table 7.1, steps 1, 2,
            4 through 8, and 10
7       end of for-loop
8       ŵ(n) = ŵ(n-M) + μ U_M(n-N+1) F(n)
Note that, in step 5, that part of F(n) which has yet to be calculated at step i lies in the null space of J_{i+1}, so there is no problem of needing a value that is not yet available. The complexity of steps 2 and 8 is each about 2ML multiply/adds. Steps 3 through 7 have about 2.5M² + 20MN multiplies and/or adds. So the average complexity per sample period is 2L + 2.5M + 20N multiplies and/or adds. We can reduce the complexity of steps 2 and 8 by applying fast convolution techniques using either fast FIR filtering (FFF) or the FFT. For example, consider the use of the FFF method; then, using the complexity formulas given by Benesty and Duhamel (with r = log_2 M and R = L/M), if L = 1024 and M = 32, the BE-FAP average complexity for steps 2 and 8 would be 577 multiplications and 996 additions, compared to 2048 multiplications and additions for the comparable FAP calculation. Letting N = M, the remaining calculations (steps 3 through 7) of BE-FAP amount to an average of about 720 multiplies per sample. For standard FAP, the remaining complexity is about 640 multiplications. So, whereas FAP would have a complexity of 2048 + 640 = 2688 multiplies per sample, BE-FAP can achieve a lower complexity of 577 + 720 = 1297 multiplies per sample.

Rombouts and Moonen [27] have introduced sparse block exact FAP and APA algorithms. The idea is to change the constraint on the optimization problem from making the N most recent sample periods' a posteriori errors zero to making N of every kth sample period's a posteriori errors zero. So, instead of dealing with sample periods {n, ..., n-N+1}, one deals with sample periods {n, n-k, ..., n-k(N-1)}. The advantage is that, since speech is only correlated over a relatively short time, the excitation vectors of the new X(n) are less correlated with each other, so X^t(n)X(n) needs less regularization and the algorithm will achieve faster convergence.
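The sparse selection of sample periods is just a strided gather over the recent past. A tiny Python sketch (the helper is hypothetical, not from [27]) shows the index pattern:

```python
def sparse_periods(n, N, k):
    """Sample periods {n, n-k, ..., n-k(N-1)} used by sparse block exact FAP/APA."""
    return [n - k * i for i in range(N)]

assert sparse_periods(100, 4, 1) == [100, 99, 98, 97]   # ordinary APA window
assert sparse_periods(100, 4, 5) == [100, 95, 90, 85]   # sparse window, stride k = 5
```

Spacing the constraint times k samples apart lowers the correlation between the excitation vectors entering X(n), which is exactly why the sparse variant tolerates less regularization.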
7.9
Very often, many channels of network echo cancellers are grouped together at VoIP (voice over Internet protocol) gateways. Because VoIP increases the round-trip delay of the voice signal, greater echo canceller performance, as measured by ERLE (echo return loss enhancement), is required to prevent the user from being annoyed by echo. In addition, price pressure requires that the cost in memory and multiply/accumulate cycles be lower than in previous implementations. In [13] both of these requirements are addressed. The complexity of the coefficient updates is lowered by updating only part of the coefficients in each sample period. On the other hand, the convergence is accelerated by using affine projections to update the selected coefficients.
Let us break up the coefcient vector, wn, and the coefcient update vector,
rn, into M blocks of length N, where we dene M and N such that L MN,
wn w0 nt ; w1 nt ; . . . ; wM1 nt t
7:145
rn r0 nt ; r1 nt ; . . . ; rM1 nt t
7:146
and
0 ; . . . ; 0 ; ri n ; 0 ; . . . ; 0 ;
t
t t
7:147
where in (7.147) we use the fact that the update vector is zero the ith block, the one
that is to be updated. It is also useful to dene data blocks:
Un U0 nt ; U1 nt ; . . . ; UM1 nt t :
7:148
Recall that in APA we minimized the length of the coefficient update vector, r(n),
under the constraint that the a posteriori error vector was zero. We do the same here,
but we restrict ourselves to updating only the block of coefficients that yields the
smallest update subvector, r_i(n). First, let us derive the update vector for an
arbitrary block i. The ith cost function is

C_i = d r_i(n)^t r_i(n) + ||e_1(n)||^2,    (7.149)

where

e_1(n) = e(n) - U(n)^t r(n)    (7.150)
       = e(n) - U_i(n)^t r_i(n),    (7.151)

and in the last step we have used (7.147).
Using (7.151) in (7.149) we may take the derivative of C_i with respect to r_i(n), set
it equal to 0, and solve for r_i(n), yielding

r_i(n) = U_i(n)[U_i(n)^t U_i(n) + d I]^{-1} e(n).    (7.152)

As we stated earlier, we update the block that has the smallest update size. That is,
we seek

i = arg min_{0<=j<M} ||r_j(n)||    (7.153)
  = arg min_{0<=j<M} e(n)^t [U_j(n)^t U_j(n) + d I]^{-1} e(n),    (7.154)

where in the last step we assumed that d was small enough to ignore. The coefficient
update can be expressed as

w_i(n) = w_i(n-1) + m U_i(n)[U_i(n)^t U_i(n) + d I]^{-1} e(n).    (7.155)
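The selection-and-update step of (7.152)-(7.155) can be sketched numerically. The following is a minimal illustration with our own toy dimensions and variable names (it is not the code from [13]); with step size 1 and a tiny regularization, the a posteriori error of the chosen block is driven nearly to zero.

```python
import numpy as np

# Illustrative sketch of the selective-partial-update block affine projection
# step (7.152)-(7.155). All sizes and names are our own toy choices:
# L = M*N coefficients split into M blocks, projection order P.
rng = np.random.default_rng(0)
L, M, P = 8, 4, 2
N = L // M
delta, mu = 1e-4, 1.0

w = np.zeros(L)                            # stacked coefficient blocks
U = rng.standard_normal((L, P))            # data matrix U(n), rows in blocks
d = rng.standard_normal(P)                 # desired-signal vector
e = d - U.T @ w                            # a priori error vector e(n)

# Candidate updates r_i(n) = U_i [U_i^t U_i + delta I]^{-1} e(n), (7.152)
blocks = [U[i * N:(i + 1) * N, :] for i in range(M)]
r = [Ui @ np.linalg.solve(Ui.T @ Ui + delta * np.eye(P), e) for Ui in blocks]

# Update only the block with the smallest update norm, (7.153)-(7.155)
i = int(np.argmin([np.linalg.norm(ri) for ri in r]))
w[i * N:(i + 1) * N] += mu * r[i]
```

Note that minimizing the update norm tends to select a well-excited block, which is also why the a posteriori error it leaves behind is small.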
7.10

CONCLUSIONS

This chapter has discussed the APA and its fast implementations, including FAP.
We have shown that APA is an algorithm that bridges the well-known NLMS and
RLS adaptive filters. We discussed APA's convergence properties and its
performance in the presence of noise. In particular, we discussed appropriate
methods of regularization.
When the length of the adaptive filter is L and the dimension of the affine
projection (performed each sample period) is N, FAP's complexity is either 2L +
14N or 2L + 20N, depending on whether the relaxation parameter is 1 or smaller,
respectively. Usually N << L. We showed that even though FAP entails an
approximation that is not entirely valid under regularization, the same convergence
as for APA may be obtained by adjusting the regularization factor by a
predetermined scalar value. Simulations demonstrate that FAP converges as fast as
the more complex and memory-intensive FRLS methods when the excitation signal
is speech. The implicit correlation matrix inverse of FAP is regularized, so the
algorithm is easily stabilized for even highly colored excitation.
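As a quick arithmetic check of the complexity figures just quoted (the function name and example sizes are our own), the per-sample multiply counts for typical echo-canceller dimensions are:

```python
# Quick arithmetic check of the FAP per-sample multiply counts quoted above
# (illustrative only; the function name and the sample sizes are ours).
def fap_multiplies(L, N, relaxation_is_one=True):
    """2L + 14N when the relaxation parameter is 1, else 2L + 20N."""
    return 2 * L + (14 if relaxation_is_one else 20) * N

L, N = 1024, 10
c_relax_one = fap_multiplies(L, N)          # 2*1024 + 14*10
c_relax_lt1 = fap_multiplies(L, N, False)   # 2*1024 + 20*10
```

Because N << L, both counts are dominated by the 2L term.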
7.11
APPENDIX
In this appendix we derive an N-length sliding windowed fast recursive-least-squares
algorithm (SW-FRLS). The FRLS algorithms usually come in two parts.
One is the Kalman gain part, and the other is the joint process estimation part. For
FAP we only need to consider the Kalman gain part since that gives us the forward
and backward prediction vectors and energies. However, for completeness, we
derive both parts in this appendix. Therefore, let us say that a desired signal, d_N(n), is
generated from

d_N(n) = h_sys^t u(n) + y(n).    (7.156)

With an exponential window, the sample correlation matrix and cross-correlation
vector obey the rank-one recursions

R(n) = SUM_{i=0}^{inf} l^i u(n-i)u(n-i)^t = l R(n-1) + u(n)u(n)^t    (7.158)

and

r_du(n) = SUM_{i=0}^{inf} l^i d(n-i)u(n-i) = l r_du(n-1) + d(n)u(n).    (7.159)
If a rectangular window is used, then one can apply the sliding window technique to
update the matrix using a rank-two approach. That is,

R(n) = SUM_{i=0}^{L-1} u(n-i)u(n-i)^t = R(n-1) + u(n)u(n)^t - u(n-L)u(n-L)^t    (7.160)

and

r_du(n) = SUM_{i=0}^{L-1} d(n-i)u(n-i) = r_du(n-1) + d(n)u(n) - d(n-L)u(n-L).    (7.161)

Let

B(n) = [u(n), u(n-L)],    (7.162)

d(n) = [d(n), d(n-L)]^t,    (7.163)

and

J = [ 1   0
      0  -1 ].    (7.164)

Then we can write (7.160) as

R(n) = R(n-1) + B(n)JB(n)^t    (7.165)

and (7.161) as

r_du(n) = r_du(n-1) + B(n)Jd(n).    (7.166)

Let

P(n) = R(n)^{-1}.    (7.167)
Using this in (7.165) and applying the matrix inversion lemma, we have
P(n) = P(n-1) - P(n-1)B(n)[I + JB(n)^t P(n-1)B(n)]^{-1} JB(n)^t P(n-1).    (7.168)

We now define the two-by-two likelihood matrices. The first is found in the
denominator of (7.168):

V(n) = JB(n)^t P(n-1)B(n)    (7.169)
     = JB(n)^t K_0(n)    (7.170)
     = JK_0(n)^t B(n),    (7.171)
where
K_0(n) = P(n-1)B(n) = [k_0(n), k_0(n-L)].    (7.172)

Here the a priori Kalman gain matrix, K_0(n), has been used. It is composed of two a
priori Kalman gain vectors defined as

k_0(n) = P(n-1)u(n)    (7.173)

and

k_0(n-L) = P(n-1)u(n-L).    (7.174)
The notation in (7.174) is slightly misleading in that one may think that k_0(n-L)
should equal P(n-L-1)u(n-L) in order to maintain complete consistency with
(7.173). We permit this inconsistency, however, for the sake of simplified notation
and trust that it will not cause a great deal of difficulty. In a similar fashion, the a
posteriori Kalman gain vectors are

k_1(n) = P(n)u(n)    (7.175)

and

k_1(n-L) = P(n)u(n-L),    (7.176)

and the a posteriori Kalman gain matrix is

K_1(n) = P(n)B(n) = [k_1(n), k_1(n-L)].    (7.177)
The second likelihood variable matrix takes into account the entire inverted matrix
in (7.168):

Q(n) = [I + JB(n)^t P(n-1)B(n)]^{-1}    (7.178)
     = [I + V(n)]^{-1}.    (7.179)

From (7.179),

Q(n)^{-1} = I + V(n),    (7.180)

so that

V(n) = Q(n)^{-1} - I    (7.181)

and

I = Q(n)Q(n)^{-1} = Q(n) + Q(n)V(n),    (7.182)

or

Q(n)V(n) = V(n)Q(n) = I - Q(n).    (7.183)
Thus, (7.179) through (7.183) show the relationships between the two likelihood
matrices.
We now examine the relationship between the a priori and a posteriori Kalman
gain matrices. From (7.168), (7.172), and (7.178) it is clear that
P(n) = P(n-1) - K_0(n)Q(n)JK_0(n)^t.    (7.184)

Multiplying from the right by B(n) and using (7.171), (7.172), and (7.177), we get

K_1(n) = K_0(n) - K_0(n)Q(n)V(n)    (7.185)
       = K_0(n)Q(n).    (7.186)
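The identities (7.165), (7.168)/(7.184), and (7.186) are easy to verify numerically. The sketch below is our own sanity check with toy dimensions: it slides a rectangular window by one sample via the rank-two update, propagates the inverse with the matrix-inversion-lemma form, and confirms that the resulting K_1 equals K_0 Q.

```python
import numpy as np

# Numerical sanity check (illustrative, not from the book) of the sliding-
# window identities: R(n) = R(n-1) + B J B^t, the inverse update
# (7.168)/(7.184), and the gain relation K1 = K0 Q of (7.186).
rng = np.random.default_rng(1)
N, Lwin = 4, 12
X = rng.standard_normal((N, Lwin + 1))        # columns play the role of u(n-i)

R_old = X[:, 1:] @ X[:, 1:].T + 1e-3 * np.eye(N)   # window ending at n-1
P_old = np.linalg.inv(R_old)

B = np.column_stack((X[:, 0], X[:, Lwin]))    # B(n) = [u(n), u(n-L)], (7.162)
J = np.diag([1.0, -1.0])                      # (7.164)

R_new = R_old + B @ J @ B.T                   # (7.165): slide the window
V = J @ B.T @ P_old @ B                       # (7.169)
Q = np.linalg.inv(np.eye(2) + V)              # (7.179)
P_new = P_old - P_old @ B @ Q @ J @ B.T @ P_old    # (7.168)/(7.184)

K0 = P_old @ B                                # (7.172)
K1 = P_new @ B                                # (7.177)
err_inv = np.abs(P_new @ R_new - np.eye(N)).max()
err_gain = np.abs(K1 - K0 @ Q).max()
```

Both residuals are at machine-precision level, confirming that the rank-two inverse update and the gain relation are exact algebraic identities.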
Now we explore the methods of efficiently updating the a posteriori and a priori
Kalman gain vectors from sample period to sample period. We start with the
identities

P(n) = [ 0     0^t
         0   P~(n) ] + (1/E_a(n)) a(n)a(n)^t    (7.187)

     = [ P-(n)  0
         0^t    0 ] + (1/E_b(n)) b(n)b(n)^t,    (7.188)

where

a(n) is the N-length forward prediction vector,
E_a(n) is the forward prediction error energy,
b(n) is the N-length backward prediction vector, and
E_b(n) is the backward prediction error energy.

We recognize that

P~(n) = P-(n-1).    (7.189)
In the same manner, the tilde and bar quantities derived below provide the
bridge from sample period n-1 to n. First, we derive a few additional definitions.
Implicitly define B~(n) and B-(n) as

B(n) = [ u(n), u(n-L)
         B~(n)        ] = [ B-(n)
                            u(n-N+1), u(n-N+1-L) ].    (7.190)

This naturally leads us to define the tilde and bar Kalman gain matrices,

K~_0(n) = P~(n-1)B~(n),    (7.191)

K-_0(n) = P-(n-1)B-(n),    (7.192)

K~_1(n) = P~(n)B~(n),    (7.193)

and

K-_1(n) = P-(n)B-(n).    (7.194)
Multiplying P(n-1) from the right with B(n) and then using (7.187) and (7.188) at
sample n-1 rather than n, we get the relationship between the a priori Kalman gain
matrix and its tilde and bar versions:

K_0(n) = P(n-1)B(n)
       = [ 0, 0
           K~_0(n) ] + (1/E_a(n-1)) a(n-1)e_{0,a}(n)^t    (7.195)

       = [ K-_0(n)
           0, 0    ] + (1/E_b(n-1)) b(n-1)e_{0,b}(n)^t,    (7.196)

where

e_{0,a}(n) = B(n)^t a(n-1)    (7.197)

and

e_{0,b}(n) = B(n)^t b(n-1)    (7.198)

are the a priori forward and backward linear prediction errors, respectively. The a
posteriori prediction errors are

e_{1,a}(n) = B(n)^t a(n)    (7.199)

and

e_{1,b}(n) = B(n)^t b(n).    (7.200)
Relationships similar to (7.195) and (7.196) can be found for the a posteriori Kalman
gain matrix using identities (7.187) and (7.188) for P(n), yielding

K_1(n) = P(n)B(n)
       = [ 0, 0
           K~_1(n) ] + (1/E_a(n)) a(n)e_{1,a}(n)^t    (7.201)

       = [ K-_1(n)
           0, 0    ] + (1/E_b(n)) b(n)e_{1,b}(n)^t.    (7.202)

We can see the relationships between the linear prediction errors, the expected
squared prediction errors, and the first and last Kalman gain matrix elements by first
equating the first coefficients in (7.195) and (7.201), yielding
[k_{0,1}(n), k_{0,1}(n-L)] = (1/E_a(n-1)) e_{0,a}(n)^t    (7.203)

and

[k_{1,1}(n), k_{1,1}(n-L)] = (1/E_a(n)) e_{1,a}(n)^t,    (7.204)

and then equating the last coefficients in (7.196) and (7.202), yielding

[k_{0,N}(n), k_{0,N}(n-L)] = (1/E_b(n-1)) e_{0,b}(n)^t    (7.205)

and

[k_{1,N}(n), k_{1,N}(n-L)] = (1/E_b(n)) e_{1,b}(n)^t.    (7.206)
The likelihood matrices also have tilde and bar counterparts. Starting with (7.169) in
a straightforward manner, we define

V~(n) = JB~(n)^t P~(n-1)B~(n) = JB~(n)^t K~_0(n),    (7.207)

V-(n) = JB-(n)^t P-(n-1)B-(n) = JB-(n)^t K-_0(n),    (7.208)

Q~(n) = [I + V~(n)]^{-1},    (7.209)

and

Q-(n) = [I + V-(n)]^{-1}.    (7.210)
Also, (7.180) through (7.183) hold for their tilde and bar counterparts. For example,
the counterparts for (7.183) are

Q~(n)V~(n) = V~(n)Q~(n) = I - Q~(n)    (7.211)

and

Q-(n)V-(n) = V-(n)Q-(n) = I - Q-(n).    (7.212)

In addition, (7.186) holds true for the tilde and bar versions. For example,

K~_1(n) = K~_0(n)Q~(n).    (7.213)
The relationship between V(n) and its tilde and bar variants can be seen by first
multiplying (7.195) and (7.196) from the left by JB(n)^t, yielding

V(n) = V~(n) + (1/E_a(n-1)) Je_{0,a}(n)e_{0,a}(n)^t    (7.214)

and

V(n) = V-(n) + (1/E_b(n-1)) Je_{0,b}(n)e_{0,b}(n)^t.    (7.216)

Since Q(n)^{-1} = I + V(n) and Q-(n)^{-1} = I + V-(n), (7.216) gives

Q(n)^{-1} = Q-(n)^{-1} + (1/E_b(n-1)) Je_{0,b}(n)e_{0,b}(n)^t    (7.219)

or

Q-(n)^{-1} = Q(n)^{-1} - (1/E_b(n-1)) Je_{0,b}(n)e_{0,b}(n)^t.    (7.220)

Inverting (7.220),

Q-(n) = [I - (1/E_b(n-1)) Q(n)Je_{0,b}(n)e_{0,b}(n)^t]^{-1} Q(n),    (7.221)

giving us a useful relationship between Q(n) and Q-(n).
We now find a relationship between Q(n) and Q~(n). Multiplying (7.201) by
JB(n)^t from the left and using (7.170), (7.186), and (7.183) gives

I - Q(n) = [I - Q~(n)] + (1/E_a(n)) Je_{1,a}(n)e_{1,a}(n)^t,    (7.222)

so that

Q(n) = Q~(n) - (1/E_a(n)) Je_{1,a}(n)e_{1,a}(n)^t,    (7.223)

the relationship between Q(n) and Q~(n).
Similarly, we can show another relationship between Q(n) and Q-(n), starting
from (7.202) and using the same steps we used to derive (7.223):

Q(n) = Q-(n) - (1/E_b(n)) Je_{1,b}(n)e_{1,b}(n)^t.    (7.225)
The expected forward prediction error energy, E_a(n), update can be derived by first
multiplying (7.160) from the right by a(n-1):

R(n)a(n-1) = [R(n-1) + B(n)JB(n)^t]a(n-1)
           = E_a(n-1)[1, 0, ..., 0]^t + B(n)Je_{0,a}(n).    (7.227)

Multiplying from the left by P(n) and using the fact, from (7.187), that the first
column of P(n) is a(n)/E_a(n), we get

a(n-1) = (E_a(n-1)/E_a(n)) a(n) + K_1(n)Je_{0,a}(n).    (7.228)

Equating the first elements on both sides yields

1 = E_a(n-1)/E_a(n) + [k_{1,1}(n), k_{1,1}(n-L)]Je_{0,a}(n).    (7.229)

Using (7.204) in (7.229) and rearranging gives the energy update

E_a(n) = E_a(n-1) + e_{0,a}(n)^t Je_{1,a}(n).    (7.230)
We now derive the update for the forward linear predictor, a(n), using the a priori
prediction errors and the a posteriori tilde Kalman gain matrix. Using (7.229) solved
for E_a(n-1)/E_a(n) in (7.228) yields

a(n-1) = (1 - [k_{1,1}(n), k_{1,1}(n-L)]Je_{0,a}(n)) a(n) + K_1(n)Je_{0,a}(n)    (7.231)

       = a(n) + [ 0, 0
                  K~_1(n) ] Je_{0,a}(n),    (7.232)

where we have used (7.204) and (7.201). Solving for a(n), we have the result

a(n) = a(n-1) - [ 0, 0
                  K~_1(n) ] Je_{0,a}(n).    (7.233)
The a posteriori forward linear prediction errors can be found from the a priori
forward prediction errors using Q~(n). First, using (7.207), (7.213), and (7.211), we
have

JB~(n)^t K~_1(n) = I - Q~(n).    (7.234)

Multiplying (7.233) from the left by JB(n)^t and using (7.234), we find

Je_{1,a}(n) = Je_{0,a}(n) - [I - Q~(n)]Je_{0,a}(n) = Q~(n)Je_{0,a}(n).    (7.235)

We can find another relation between Q(n) and Q~(n). From (7.235) we write

Je_{0,a}(n) = Q~(n)^{-1}Je_{1,a}(n) = [I + V~(n)]Je_{1,a}(n).    (7.236)
Combining (7.236) with (7.214) and (7.180),

Q(n)^{-1} = Q~(n)^{-1}[I + (1/E_a(n-1)) Je_{1,a}(n)e_{0,a}(n)^t],    (7.237)

so that

Q(n) = [I + (1/E_a(n-1)) Je_{1,a}(n)e_{0,a}(n)^t]^{-1} Q~(n).    (7.238)

The forward linear prediction vector can also be updated using the a posteriori
prediction errors and the a priori tilde Kalman gain matrix. Using (7.213) we can
write (7.233) as follows:

a(n) = a(n-1) - [ 0, 0
                  K~_0(n) ] Q~(n)Je_{0,a}(n)    (7.239)

     = a(n-1) - [ 0, 0
                  K~_0(n) ] Je_{1,a}(n).    (7.240)
The backward predictor relations follow in the same fashion:

b(n-1) = (E_b(n-1)/E_b(n)) b(n) + K_1(n)Je_{0,b}(n)    (7.241)

and, equating the last elements,

1 = E_b(n-1)/E_b(n) + [k_{1,N}(n), k_{1,N}(n-L)]Je_{0,b}(n),    (7.242)

which yield the energy update

E_b(n) = E_b(n-1) + e_{1,b}(n)^t Je_{0,b}(n).    (7.243)

The backward predictor updates are

b(n) = b(n-1) - [ K-_1(n)
                  0, 0    ] Je_{0,b}(n)    (7.245)

     = b(n-1) - [ K-_0(n)
                  0, 0    ] Je_{1,b}(n),    (7.247)

where Je_{1,b}(n) = Q-(n)Je_{0,b}(n), or, solving (7.241) with (7.242),

b(n) = (1 - [k_{1,N}(n), k_{1,N}(n-L)]Je_{0,b}(n))^{-1} [b(n-1) - K_1(n)Je_{0,b}(n)].    (7.248)
We now relate the a posteriori residual echo to the a priori residual echo. This is
done merely for completeness; the FAP algorithm generates its own residual echo
based on the longer vector u(n). We begin by writing the a priori desired signal
estimate:

d^_0(n)^t = [d^_0(n), d^_0(n-L)]    (7.249)
          = w(n-1)^t B(n)    (7.250)
          = r(n-1)^t P(n-1)B(n)    (7.251)
          = r(n-1)^t K_0(n).    (7.252)

Proceeding in the same way for the a posteriori estimate and subtracting from
d(n)^t, one finds

d^_1(n)^t = d(n)^t - e_0(n)^t Q(n),    (7.256)

where

e_0(n) = d(n) - B(n)^t w(n-1)    (7.257)

is the a priori residual echo vector.
Noting that

Q(n)^t = JQ(n)J,    (7.260)

we can write

Je_1(n) = Q(n)Je_0(n).    (7.261)

The echo canceller coefficient update can be found from the solution of the
least-squares problem:

w(n) = P(n)r(n).    (7.262)
TABLE 7.4 The Rectangular Windowed Fast Kalman Algorithm

Part 1: Kalman Gain Calculations                             Equation Reference
1.  e_{0,a}(n) = B(n)^t a(n-1)                               7.197
2.  a(n) = a(n-1) - [0, 0; K~_1(n)] Je_{0,a}(n)              7.233
3.  e_{1,a}(n) = B(n)^t a(n)                                 7.199
4.  E_a(n) = E_a(n-1) + e_{0,a}(n)^t Je_{1,a}(n)             7.230
5.  K_1(n) = [0, 0; K~_1(n)] + (1/E_a(n)) a(n)e_{1,a}(n)^t   7.201
6.  e_{0,b}(n) = B(n)^t b(n-1)                               7.198
7a. extract the last coefficients, [k_{1,N}(n), k_{1,N}(n-L)]
7b. x = (1 - [k_{1,N}(n), k_{1,N}(n-L)]Je_{0,b}(n))^{-1}     7.248
8.  b(n) = x[b(n-1) - K_1(n)Je_{0,b}(n)]                     7.248
9.  K_0(n) = [K-_0(n); 0, 0] + (1/E_b(n-1)) b(n-1)e_{0,b}(n)^t   7.196

Part 2: Joint Process Extension
10. e_0(n) = d(n) - B(n)^t w(n-1)                            7.257
11. w(n) = w(n-1) + K_1(n)Je_0(n)                            7.265
Substituting (7.166) and (7.184) into (7.262) and simplifying leads to the recursion

w(n) = w(n-1) + K_1(n)Je_0(n).    (7.265)

Using (7.186) and (7.261), we can express the coefficient update alternatively as

w(n) = w(n-1) + K_0(n)Je_1(n).    (7.266)
We are now ready to write the FRLS algorithms. The rectangular windowed fast
Kalman algorithm is shown in Table 7.4, and the sliding windowed stabilized fast
transversal filter algorithm is shown in Table 7.5. The algorithms are separated
TABLE 7.5 The Sliding Windowed Stabilized Fast Transversal Filter Algorithm

Part 1: Kalman Gain Calculations                             Equation Reference
1.  e_{0,a}(n) = B(n)^t a(n-1)                               7.197
2.  Je_{1,a}(n) = Q~(n)Je_{0,a}(n)                           7.235
3.  [k_{0,1}(n), k_{0,1}(n-L)] = (1/E_a(n-1)) e_{0,a}(n)^t   7.203
4.  E_a(n) = E_a(n-1) + e_{0,a}(n)^t Je_{1,a}(n)             7.230
5.  K_0(n) = [0, 0; K~_0(n)] + (1/E_a(n-1)) a(n-1)e_{0,a}(n)^t   7.195
6.  e_{0,b}(n)^t = E_b(n-1)[k_{0,N}(n), k_{0,N}(n-L)]        7.205
7.  [K-_0(n); 0, 0] = K_0(n) - (1/E_b(n-1)) b(n-1)e_{0,b}(n)^t   7.196
8.  Q(n) = Q~(n) - (1/E_a(n)) Je_{1,a}(n)e_{1,a}(n)^t        7.223
9.  Q-(n) = [I - (1/E_b(n-1)) Q(n)Je_{0,b}(n)e_{0,b}(n)^t]^{-1} Q(n)   7.221
10. Je_{1,b}(n) = Q-(n)Je_{0,b}(n)
11. E_b(n) = E_b(n-1) + e_{1,b}(n)^t Je_{0,b}(n)             7.243
12. a(n) = a(n-1) - [0, 0; K~_0(n)] Je_{1,a}(n)              7.240
13. b(n) = b(n-1) - [K-_0(n); 0, 0] Je_{1,b}(n)              7.246

Part 2: Joint Process Extension
14. e_0(n) = d(n) - B(n)^t w(n-1)                            7.257
15. Je_1(n) = Q(n)Je_0(n)                                    7.261
16. w(n) = w(n-1) + K_0(n)Je_1(n)                            7.266
into their Kalman gain and joint process extension parts. Only the Kalman gain
parts are used in the FAP algorithms. The joint process extensions are given for
completeness.
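Before the fast O(N) forms of Tables 7.4 and 7.5 are trusted, it helps to see the slow O(N^2) sliding-window RLS they accelerate. The sketch below is our own toy example (system, window length, and names are assumptions, not the book's code): it slides the window with the rank-two inverse update and recovers the system via the joint-process solution w(n) = P(n)r(n).

```python
import numpy as np

# Illustrative slow-form sliding-window RLS, exercising the identities the
# SW-FRLS derivation above accelerates: the window slides via the rank-two
# updates (7.165)-(7.166) and (7.184), and the joint-process solution is
# w(n) = P(n)r(n), (7.262). Setup is our own toy example.
rng = np.random.default_rng(2)
N, Lwin, T = 3, 10, 60
h_sys = rng.standard_normal(N)                 # unknown system
x = rng.standard_normal(T + N)
u = lambda n: x[n:n + N][::-1]                 # excitation vector u(n)
d = lambda n: float(h_sys @ u(n))              # noiseless desired signal

delta = 1e-2                                   # small ridge for startup
J = np.diag([1.0, -1.0])                       # (7.164)
n0 = Lwin - 1                                  # first full window ends here
R = delta * np.eye(N) + sum(np.outer(u(n0 - i), u(n0 - i)) for i in range(Lwin))
P = np.linalg.inv(R)                           # P(n) = R(n)^{-1}, (7.167)
r = sum(d(n0 - i) * u(n0 - i) for i in range(Lwin))

for n in range(n0 + 1, T):
    B = np.column_stack((u(n), u(n - Lwin)))   # B(n), (7.162)
    dv = np.array([d(n), d(n - Lwin)])         # d(n), (7.163)
    K0 = P @ B                                 # a priori gain matrix, (7.172)
    Q = np.linalg.inv(np.eye(2) + J @ B.T @ K0)     # (7.178)
    P = P - K0 @ Q @ J @ K0.T                  # (7.184)
    r = r + B @ J @ dv                         # (7.166)

w = P @ r                                      # (7.262)
```

The fast algorithms replace the explicit P(n) propagation here with the forward/backward-predictor recursions of the tables, at O(N) cost per sample.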
REFERENCES
1. K. Ozeki, T. Umeda, An Adaptive Filtering Algorithm Using an Orthogonal Projection
to an Affine Subspace and Its Properties, Electronics and Communications in Japan,
Vol. 67-A, No. 5, 1984.
2. B. Widrow, S. D. Stearns, Adaptive Signal Processing, Prentice-Hall, Inc., Englewood
Cliffs, N.J., 1985.
3. J. M. Cioffi, T. Kailath, Fast, Recursive-Least-Squares Transversal Filters for Adaptive
Filtering, IEEE Trans. on Acoustics, Speech, and Signal Proc., Vol. ASSP-32, No. 2,
April 1984.
4. S. J. Orfanidis, Optimum Signal Processing: An Introduction, Macmillan, New York,
1985.
5. S. L. Gay, A Fast Converging, Low Complexity Adaptive Filtering Algorithm, Third
Intl. Workshop on Acoustic Echo Control, 7-8 Sept. 1993, Plestin les Grèves, France.
6. S. L. Gay, Fast Projection Algorithms with Application to Voice Excited Echo
Cancellers, Ph.D. Dissertation, Rutgers University, Piscataway, N.J., Oct. 1994.
7. M. Tanaka, Y. Kaneda, S. Makino, Reduction of Computation for High-Order
Projection Algorithm, 1993 Electronics Information Communication Society Autumn
Seminar, Tokyo, Japan (in Japanese).
8. J. M. Cioffi, T. Kailath, Windowed Fast Transversal Filters Adaptive Algorithms with
Normalization, IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. ASSP-33,
No. 3, June 1985.
9. S. G. Kratzer, D. R. Morgan, The Partial-Rank Algorithm for Adaptive
Beamforming, SPIE, Vol. 564, Real Time Signal Processing VIII, 1985.
10. Y. Maruyama, A Fast Method of Projection Algorithm, Proc. 1990 IEICE Spring
Conf., B-744, 1990.
11. D. T. M. Slock, T. Kailath, Numerically Stable Fast Transversal Filters for Recursive
Least Squares Adaptive Filtering, IEEE Trans. on Signal Processing, Vol. 39, No. 1, Jan.
1991.
12. R. D. DeGroat, D. Begusic, E. M. Dowling, D. A. Linebarger, Spherical Subspace and
Eigen Based Affine Projection Algorithms, Proc. of IEEE Intl. Conf. on Acoustics,
Speech and Signal Processing, Vol. 3, pp. 2345-2348, 1997.
13. K. Dogancay, O. Tanrikulu, Adaptive Filtering Algorithms with Selective Partial
Updates, IEEE Trans. on Circuits and Systems II: Analog and Digital Signal
Processing, Vol. 48, No. 8, Aug. 2001.
14. A. Ben Rabaa, R. Tourki, Acoustic Echo Cancellation Based on a Recurrent Neural
Network and a Fast Affine Projection Algorithm, Proc. of the 24th Annual Conf. of the
IEEE Industrial Electronics Society, Vol. 3, pp. 1754-1757, 1998.
15. M. Muneyasu, T. Hinamoto, A New 2-D Adaptive Filter Using Affine Projection
Algorithm, Proc. of the ISCAS 1998, Vol. 5, pp. 90-93, 1998.
46. H. Ding, A Stable Fast Affine Projection Adaptation Algorithm Suitable for Low-Cost
Processors, Proc. of IEEE Intl. Conf. on Acoustics, Speech and Signal Processing, Vol.
1, pp. 360-363, 2000.
47. T. Gansler, J. Benesty, S. L. Gay, M. M. Sondhi, A Robust Proportionate Affine
Projection Algorithm for Network Echo Cancellation, Proc. of IEEE Intl. Conf. on
Acoustics, Speech and Signal Processing, Vol. 2, pp. 793-796, 2000.
48. M. Rupp, A Family of Adaptive Filter Algorithms with Decorrelating Properties, IEEE
Trans. on Signal Proc., Vol. 46, No. 3, March 1998.
49. S. Werner, J. A. Apolinario, Jr., M. L. R. de Campos, The Data-Selective Constrained
Affine-Projection Algorithm, Proc. of the Intl. Conference on Acoustics, Speech, and
Signal Processing, Vol. 6, pp. 3745-3748, 2001.
50. J. Benesty, P. Duhamel, Y. Grenier, A Multichannel Affine Projection Algorithm with
Applications to Multichannel Acoustic Echo Cancellation, IEEE Signal Processing
Letters, Vol. 3, No. 2, Feb. 1996.
51. S. G. Sankaran, A. A. Beex, Convergence Analysis Results for the Class of Affine
Projection Algorithms, Proc. of the Intl. Symposium on Circuits and Systems, Vol. 3, pp.
251-254, 1999.
52. S. G. Sankaran, A. A. Beex, Convergence Behavior of Affine Projection Algorithms,
IEEE Trans. on Signal Proc., Vol. 48, No. 4, April 2000.
53. N. J. Bershad, D. Linebarger, S. McLaughlin, A Stochastic Analysis of the Affine
Projection Algorithm for Gaussian Autoregressive Inputs, Proc. of IEEE Intl. Conf. on
Acoustics, Speech and Signal Processing, Vol. 6, pp. 3837-3840, 2001.
54. R. A. Soni, K. A. Gallivan, W. K. Jenkins, Affine Projection Methods in Fault Tolerant
Adaptive Filtering, Proc. of IEEE Intl. Conf. on Acoustics, Speech and Signal
Processing, Vol. 3, pp. 1685-1688, 1999.
55. J. Benesty, P. Duhamel, A Fast Exact Least Mean Square Adaptive Algorithm, Proc. of
IEEE Intl. Conf. on Acoustics, Speech and Signal Processing, 1990.
PROPORTIONATE
ADAPTATION: NEW
PARADIGMS IN ADAPTIVE
FILTERS
ZHE CHEN
Communications Research Lab, McMaster University, Hamilton, Ontario, Canada
STEVEN L. GAY
Bell Laboratories, Lucent Technologies, Murray Hill, New Jersey
and
SIMON HAYKIN
Communications Research Lab, McMaster University, Hamilton, Ontario, Canada
8.1 INTRODUCTION

8.1.1 Motivation
In 1960, two classic papers were published on adaptive filter theory. One concerns
Bernard Widrow's least-mean-square (LMS) filter in the signal processing area [33];
the other deals with the Kalman filter, named after Rudolph E. Kalman [23], in the
control area. Although the two are rooted in different backgrounds, they
quickly attracted worldwide attention and have survived the test of time for over
forty years [34, 17].
The design of adaptive, intelligent, robust, and fast-converging algorithms is
central to adaptive filter theory. Intelligent means that the learning algorithm is able
to incorporate some prior knowledge of the specific problem at hand. This chapter is
an effort aimed at this goal.
8.1.2
A new kind of normalized LMS (NLMS) algorithm, called proportionate
normalized least mean square (PNLMS) [10], has been developed at Bell Laboratories
for the purpose of echo cancellation. The novelty of the PNLMS algorithm lies in the
fact that an adaptive individual learning rate is assigned to each tap weight of the
filter according to some criterion, thereby attaining faster convergence [12, 10, 3].
Based on the PNLMS algorithm and its variants PNLMS++, SR-PNLMS, and
PAPA (see [3] for a complete introduction and nomenclature), the idea can be
extended to derive some new learning paradigms for adaptive filters, which we call
proportionate adaptation [17]. Proportionate adaptation means that learning the
sparseness of the solution from the incoming data is a key feature of the algorithm.
The merits of proportionate adaptation are twofold: first, the weight coefficients are
assigned different learning rates, which are adjusted adaptively in the learning
process; second, the learning rates are proportional to the magnitudes of the
coefficients.
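The two merits just listed can be seen in a small simulation. The following sketch is an assumed PNLMS-style update written from the description above (constants rho and delta_p guard inactive taps and startup; all names are ours, not the original implementation): taps with larger magnitudes receive proportionally larger gains.

```python
import numpy as np

# Sketch of a PNLMS-style update (assumed form, following the description
# above, not the original Bell Labs code): each tap gets an individual gain
# proportional to |w_k|, so large (active) taps adapt faster on a sparse path.
def pnlms_update(w, u, d, mu=0.5, rho=0.01, delta_p=0.01, eps=1e-8):
    e = d - w @ u                              # a priori error
    gamma = np.maximum(rho * max(delta_p, np.abs(w).max()), np.abs(w))
    g = gamma / gamma.mean()                   # normalized per-tap gains
    w_new = w + mu * g * u * e / (u @ (g * u) + eps)
    return w_new, e

rng = np.random.default_rng(3)
N = 64
h = np.zeros(N); h[5] = 1.0; h[20] = -0.5      # sparse echo path
w = np.zeros(N)
for _ in range(2000):
    u = rng.standard_normal(N)
    w, e = pnlms_update(w, u, d=float(h @ u))
```

In the noiseless, white-input setting above, the filter converges to the sparse path; the per-tap gains g concentrate on the two active taps early in adaptation.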
8.1.3
The chapter is organized as follows: Section 8.2 briefly describes the PNLMS
algorithm, some established theoretical results, sparse regularization, and physical
interpretation, as well as some newly proposed proportionate adaptation paradigms.
Section 8.3 examines the relationship between proportionate adaptation and Kalman
filtering with a time-varying learning rate matrix. In Section 8.4, some recursive
proportionate adaptation paradigms are developed based on Kalman filter theory and
the quasi-Newton method. Some applications and discussions are presented in
Sections 8.5 and 8.6, respectively, followed by concluding remarks in Section 8.7.
Notations
Throughout this chapter, only real-valued data are considered. We denote u(n) as the
N-dimensional (N-by-1) input vector, w(n) = [w_1(n), ..., w_k(n), ..., w_N(n)]^T as the
N-by-1 tap-weight vector of the filter, w_o as the desired (optimal) weight vector, y(n)
as the estimated response, and d(n) as the desired response, which can be represented
by d(n) = u^T(n)w_o + eps(n) and d(n) = y(n) + e(n), where e(n) is the prediction
(innovation) error and eps(n) is the unknown noise disturbance.1 The superscript T
denotes the transpose of a matrix or vector, tr(.) denotes the trace of a matrix,
m (m(n)) is a constant (time-varying) learning rate scalar, m(n) is a time-varying
positive-definite learning rate matrix, and I is an identity matrix. The other notations
will be given wherever necessary. A full list of notations is given at the end of the
chapter (Appendix G).

1 Note that our assumption of real-valued data in this chapter can be readily extended to the complex
case with little effort.
8.2.2 PNLMS Algorithm

In PNLMS, each tap weight is assigned its own gain g_k(n), obtained by normalizing
a set of proportionate factors gamma_k(n):

g_k(n) = gamma_k(n) / ((1/N) SUM_{l=1}^{N} gamma_l(n)),    (8.1)

where

gamma_k(n) = max{rho max{delta, |w_1(n-1)|, ..., |w_N(n-1)|}, |w_k(n-1)|}.    (8.2)

Collecting the gains in the diagonal matrix G(n) = diag{g_1(n), ..., g_N(n)}, the
PNLMS weight update is

w(n) = w(n-1) + (m G(n) / (u^T(n)G(n)u(n) + a)) u(n)e(n)
     = w(n-1) + m(n)u(n)e(n),    (8.5)

where the time-varying learning rate matrix is

m(n) = m G(n) / (u^T(n)G(n)u(n) + a).    (8.4)
Proposition 1 The PNLMS algorithm is the a posteriori form of Kalman
smoothing.

Proof: The PNLMS update may be written as

w(n) = w(n-1) + (m G(n) / (u^T(n)G(n)u(n) + a)) u(n)[d(n) - u^T(n)w(n-1)]    (8.7)

     = w(n-1) + m G(n)u(n)[d(n) - u^T(n)w(n)] / (a + (1 - m)u^T(n)G(n)u(n)),    (8.8)

where the second step in Eq. (8.8) uses the matrix inverse lemma.2 We may thus
write

w(n) = w(n-1) + K(n)u(n)[d(n) - u^T(n)w(n-1)],

where

K(n) = m(n) - m(n)u(n)[m(n)u(n)]^T / (1 + u^T(n)m(n)u(n)).    (8.9)
Proposition 2 Suppose that the vectors m^{1/2}(n)u(n) are exciting and that
0 < u^T(n)m(n)u(n) < 1 for all n (a condition governed by the largest
eigenvalue of the positive-definite learning rate matrix m(n)). Then the PNLMS
algorithm is H-infinity optimal.

Proof: The proof is given in Appendix B. See also Appendix A for some
preliminary background on the H-infinity norm and H-infinity filtering.

Proposition 3 The H-infinity optimality of Proposition 2 extends to the PNLMS++
algorithm, which alternates PNLMS and NLMS updates.

Proof: The proof is similar to that of Proposition 2. The essence of the proof is to
distinguish two components of the weight update equation and treat them
differently, one for PNLMS and the other for NLMS.
Proposition 4 The a posteriori form of PNLMS is risk-sensitive optimal.

Proof: The risk-sensitive criterion is the exponential quadratic cost

z^(n) = arg min log E[exp(||eps||_2^2 / 2)].    (8.11)

2 Given the matrix A and vector B, (A + BB^T)^{-1} = A^{-1} - A^{-1}B(I + B^T A^{-1}B)^{-1}B^T A^{-1}.
In the following, we show that the motivation of PNLMS is closely related to prior
knowledge of the weight parameters in the context of Bayesian theory. Observing the
entries of the matrix G(n), which distinguishes PNLMS from NLMS, the elements
are proportional to the L1 norm of the weighted weight vector w (for the purpose of
regularization). To simplify the analysis, we assume that a = 0 and that the inputs
u(n) are not all zero simultaneously. The diagonal elements of the matrix G(n) are of
the form (neglecting the time index)

g_k = L1{rho |w_1|, ..., rho |w_N|, |w_k|}.    (8.12)

Note that the product term rho delta in Eq. (8.2) is purposely neglected in Eq. (8.12)
for ease of analysis. Since rho < 1, at the initialization stage it is expected that g_k is
proportional to the absolute value of w_k, namely, g_k ~ |w_k|.
From the Bayesian perspective, suppose that we know a priori the probability
density function of w as p(w) (e.g., w is sparsely distributed). Hence we may
attribute the value of g_k by the negative logarithmic probability:4

g_k = -ln p(w_k).    (8.13)

3 The conditional joint probability of the optimal weight w_o(n) and eps(n) given the observations is
p(w_o(n), eps(0), ..., eps(n) | d(0), ..., d(n)) ~ exp(-(1/2) SUM_n |d(n) - z^(n)|^2
- (w_o(n) - w_0)^T m^{-1}(n)(w_o(n) - w_0)).
4 The diagonal matrix implies that the individual components are stochastically independent.
Prior           p(w)           g_k
Uniform         constant       1
Gaussian        exp(-w^2)      w_k^2
Laplacian       exp(-|w|)      |w_k|
Cauchy          1/(1 + w^2)    ln(1 + w_k^2)
Supergaussian   1/cosh(w)      ln cosh(w_k)
Thus the PNLMS weight update can be viewed as a weighted instantaneous gradient
descent,

w(n) = w(n-1) - m G(n) grad^_w E(w(n-1)),    (8.14)

where grad^ denotes the instantaneous gradient, which approximates the true gradient
operator in a limiting case. Equation (8.14) states that different tap-weight elements
are assigned different scaling coefficients along their search directions in the
parameter space.
On the other hand, the weighted instantaneous gradient is related to regularization
theory. It is well established that Bayesian theory can handle prior knowledge of
unknown parameters to implement regularization [5]. The advantage of this
sparse regularization lies in its direct efficiency, since the constraint of the weight
distribution is imposed on the gradient descent instead of as an extra complexity term
in the loss function.
8.2.5 Physical Interpretation

8.2.5.1 Langevin Equation  Studying the LMS filter in the context of the
Langevin equation was first established in [17] (chap. 5). We can also use the
Langevin equation to analyze the PNLMS algorithm.
As developed in [17], the number of natural modes constituting the transient
response of the LMS filter is equal to the number of adjustable parameters in the
filter. Similarly, for the PNLMS algorithm with multiple step-size parameters, the kth
(k = 1, 2, ..., N) natural mode of the PNLMS filter is given as

y_k(n+1) = [1 - m_k(n) l_k] y_k(n) + f_k(n),    (8.15)

5 From a computation point of view, this is an efficient choice among all of the non-Gaussian priors.
where l_k is the eigenvalue of the diagonalized correlation matrix of the input and
f_k(n) is a driving force accounting for an unknown stochastic disturbance. From Eq.
(8.15), it follows that

Delta y_k(n) = y_k(n+1) - y_k(n) = -m_k(n) l_k y_k(n) + f_k(n),    (8.16)

which is a discrete-time counterpart of the first-order (massless) Langevin equation

xi dw/dt = -grad^_w E(w) + f(t).    (8.17)

Comparing this with the equation of motion of a particle of mass m in a medium
with friction coefficient xi,

m d^2 w/dt^2 + xi dw/dt = -grad^_w E(w),    (8.18)

it is obvious that Eq. (8.17) is a special case of Eq. (8.18) for a massless particle. By
discretizing Eq. (8.18) (i.e., dt -> Delta t, dw -> w(t + Delta t) - w(t)) and assuming
Delta t = 1 for simplicity, after some rearrangement we obtain the following
difference equation [29]:

w(n+1) = w(n) - (1/(m + xi)) grad^_w E(w(n)) + (m/(m + xi)) [w(n) - w(n-1)].

By comparison, we obtain the learning rate mu = 1/(m + xi) and momentum
eta = m/(m + xi); the momentum is zero when the mass of the particle is m = 0 [29].
In light of the above analogy, the PNLMS algorithm offers a nice physical
interpretation. Physically speaking, the convergence of the adaptive algorithm is
achieved when the potential energy E(w(n)) -> 0 and the velocity of the particle
approaches zero (i.e., dw/dt -> 0; the weights will not be adjusted). The diagonal-
element learning rate m_k is related to the friction coefficient xi_k in a medium along
a particular direction, which can be nonuniform but isotropic, or nonuniform and
anisotropic. xi_k can be proportional to its distance from the starting origin (hence
|w_k|) or to the velocity of the particle (hence |Delta w_k|). The PNLMS algorithm
belongs to the first case. Intuitively, one may put more energy (a bigger step size)
into the particle along the direction which has a bigger friction coefficient.
8.2.6

The proportionate idea can also be realized by adapting the individual learning rates
directly via a multiplicative (exponentiated) gradient rule:

m(n+1) = m(n) exp(-m- grad^_m E|_{m = m(t)}),    (8.19)

where m- is a meta-learning-rate parameter. In the spirit of PNLMS, the learning
rates are kept proportionate to the weight magnitudes:

m_k(n) = |w_k(n)| / phi_w ~ |w_k(n)|,    (8.21)

where phi_w is a proper normalizing factor. From Eq. (8.21), it follows by virtue of
the derivative chain rule that

grad E_m(m(t, n)) = phi_w sgn(w_k(n)) grad E_w(w_k(n)).    (8.22)

Using the instantaneous gradient of the normalized error cost,

grad^ E_w(w_k(n)) = -u_k(n)e(n) / (||u(n)||^2 + a),    (8.23)

we may rewrite the resulting per-coefficient update, Eq. (8.24), in the matrix form

m(n+1) = m(n) exp(m- u(n) Phi_w sgn(w(n)) e(n)),    (8.25)

where Phi_w collects the normalizing factors and the exponential acts elementwise;
we refer to this as the PANLMS algorithm.
There is no need to worry about the situation where all coefficients are zero;
hence the regularization parameters rho and delta in PNLMS are avoided.
The multiplicative update of the diagonal learning rate matrix m(n) in (8.25)
can also be substituted for by an additive form:

m(n+1) = m(n) + diag{m- u(n) Phi_w sgn(w(n)) e(n)},

which we refer to as the PANLMS-II algorithm, in contrast to the previous
form.
Since, in the limit of convergence, lim_{n->inf} E(n) = 0, the m_k(n) should also be
asymptotically zero in the limit (see the proof in Appendix C), whereas in PNLMS
the m_k(n) are not all guaranteed to decrease to zero when convergence is achieved
as n -> inf; hence the convergence process will be somewhat unstable. To alleviate
this problem, the adaptation of PNLMS and PANLMS may employ a learning rate
annealing schedule [18] by multiplying a term kappa(n), e.g., kappa(n) =
1/(n/tau + 1), which satisfies lim_{n->inf} kappa(n) -> 0.
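The annealing factor just suggested is simple to inspect. The sketch below uses the quoted schedule kappa(n) = 1/(n/tau + 1); the time constant tau is our own choice for illustration.

```python
# The learning-rate annealing factor suggested above, kappa(n) = 1/(n/tau + 1),
# which decays monotonically toward zero as n grows. The time constant tau
# here is our own illustrative choice.
def kappa(n, tau=100.0):
    return 1.0 / (n / tau + 1.0)

vals = [kappa(n) for n in (0, 100, 1000)]   # kappa(0) = 1, kappa(tau) = 1/2
```

Multiplying the learning rates by this factor forces them to zero in the limit, addressing the instability noted above.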
8.2.7

The same construction can be applied with other error nonlinearities, e.g.,

m(n+1) = m(n) exp(m- u(n) Phi_w sgn(w(n)) e(n)),    (8.26)

where the diagonal learning rate matrix is updated similarly to the PANLMS (or
PANLMS-II) algorithm.
8.3 PROPORTIONATE ADAPTATION AND KALMAN FILTERING
Consider the state-space (random-walk) model

w(n+1) = w(n) + q_1(n),    (8.28)

d(n) = u^T(n)w(n) + q_2(n),    (8.29)

where q_1(n) and q_2(n) are the process and measurement noises with covariances
Q_1(n) and Q_2(n), respectively. The Kalman filter recursions are7

P(n+1) = P(n) + Q_1(n) - K(n+1)u^T(n+1)[P(n) + Q_1(n)],    (8.31)

K(n+1) = [P(n) + Q_1(n)]u(n+1) / (Q_2(n+1) + u^T(n+1)[P(n) + Q_1(n)]u(n+1)),    (8.32)

w(n+1) = w(n) + K(n+1)[d(n+1) - u^T(n+1)w(n)].    (8.33)

Identifying the Kalman gain with a learning rate matrix applied to u(n+1) gives

m(n+1) = [P(n) + Q_1(n)] / (Q_2(n+1) + u^T(n+1)[P(n) + Q_1(n)]u(n+1)).    (8.34)

That is, updating the learning rate matrix by gradient descent is equivalent to
updating the Kalman gain in the Kalman filter, which is dependent on the
covariances of the state error and the process noise [8].
At this point in the discussion, several remarks are in order:
As indicated in Proposition 1, the PNLMS algorithm is actually the a posteriori
form of Kalman smoothing, which is consistent with the result presented here.
As observed in Eq. (8.33), when the covariances P(n) and Q_1(n) increase,
the update of w(n) also increases; in stochastic gradient descent
algorithms, an increase of the learning rate likewise increases the update.
7 If the process noise is assumed to be zero, the term P(n) + Q_1(n) in Eqs. (8.31) and (8.32) reduces to
P(n).
8.4
8.4.1
In contrast to off-line learning, on-line learning offers a way to optimize the expected
risk directly, whereas batch learning optimizes the empirical risk given a finite
sample drawn from a known or unknown probability distribution [4]. The
estimated parameter set {w(n)} is a Markovian process in the on-line learning
framework. Proving the convergence of an on-line learning algorithm toward a
minimum of the expected risk provides an alternative to the proofs of consistency of
the learning algorithm in off-line learning [4].
8.4.2
m(n) = m(n-1) - [m(n-1)u(n)][m(n-1)u(n)]^T / (1 + u^T(n)m(n-1)u(n)),    (8.35)

w(n) = w(n-1) + m(n)u(n)e(n),    (8.36)

where m(n)u(n) in Eq. (8.36) plays the role of the Kalman gain. The form of Eq.
(8.35) is similar to that of Eq. (8.9) and Eq. (8.34). The learning rate matrix m(0) can
be initialized to be an identity matrix or some other form according to prior
knowledge of the correlation matrix of the input.
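The recursion (8.35)-(8.36) is easy to exercise numerically. The sketch below is our own toy illustration (names and sizes are assumptions): with m(0) = I, (8.35) is exactly the Sherman-Morrison update of the inverse of I plus the accumulated input correlation, which the code checks directly.

```python
import numpy as np

# Illustrative sketch of the RPNLMS-style recursion (8.35)-(8.36): the
# learning rate matrix mu is updated by a rank-one correction and mu(n)u(n)
# plays the role of an a priori Kalman gain. With mu(0) = I this equals the
# inverse of R(n) = I + sum u u^T, which we verify against direct
# accumulation. The toy system and sizes are our own.
rng = np.random.default_rng(4)
N, T = 8, 400
h = rng.standard_normal(N)                # unknown weight vector
mu = np.eye(N)                            # mu(0) = I
w = np.zeros(N)
R = np.eye(N)                             # direct accumulation, for checking
for _ in range(T):
    u = rng.standard_normal(N)
    e = float((h - w) @ u)                # a priori error e(n)
    mu_u = mu @ u
    mu = mu - np.outer(mu_u, mu_u) / (1.0 + u @ mu_u)   # (8.35)
    w = w + (mu @ u) * e                  # (8.36)
    R = R + np.outer(u, u)
```

The rank-one form keeps the per-sample cost at O(N^2) instead of the O(N^3) of an explicit matrix inverse, which is the practical point of the recursion.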
8.4.2.1 MSE and H-infinity Optimality  An optimal filter is one that is best in a
certain sense [1]. For instance, (1) the LMS filter is asymptotically optimal in the
mean-squared-error (MSE) sense under the assumption that the input components are
statistically independent and an appropriate step-size parameter is chosen; (2) the
LMS filter is H-infinity optimal in the sense that it minimizes the maximum energy
gain from the disturbances to the predicted (a priori) error; and (3) the Kalman filter
is optimal in the sense that it provides a solution that minimizes the instantaneous
MSE for the linear filter under the Gaussian assumption; under the Gaussian
assumption, it is also a maximum-likelihood (ML) estimator. For the RPNLMS
filter,8,9 we have the following:

Proof: See Appendix D.

8 It should be noted that RPNLMS is actually a misnomer, since it doesn't really update proportionately;
it is so called because the form of the updating learning rate matrix is similar to Eq. (8.9). Actually, it can
be viewed as a filter with an a priori Kalman gain.
9 In [21], a computationally efficient calculation scheme for gain matrices was proposed which allows fast
and recursive estimation of m(n)u(n), namely, the a priori Kalman gain.
Proposition 7 Suppose that the vectors m^{1/2}(n)u(n) are exciting and
0 < u^T(n)m(n)u(n) < 1. Given the proper initialization m(0) = I, the RPNLMS
algorithm is H-infinity optimal in the sense that it is among the family of minimax
filters.

Proof: A sketch of the proof is given as follows. At time index n = 1,

m(1) = I - u(1)u^T(1) / (1 + ||u(1)||^2)

and

u^T(1)m(1)u(1) = ||u(1)||^2 - u^T(1)u(1)u^T(1)u(1) / (1 + ||u(1)||^2)
              = ||u(1)||^2 / (1 + ||u(1)||^2) < 1.

Generally, we have

u^T(n)m(n)u(n) <= ||u(n)||^2 / (1 + ||u(n)||^2) < 1.

Thus the condition in Proposition 7 is always satisfied. It is easy to check that the
exciting condition also holds. The rest of the procedure to prove the H-infinity
optimality of RPNLMS is similar to that of PNLMS and is omitted here.
8.4.2.2 Comparison of the RPNLMS and RLS Filters  It is interesting to
compare the RPNLMS and RLS filters, since they have many common features in
adaptation.10 In particular, in a state-space formulation, the RLS filter is described
by [22]

w(n) = w(n-1) + (P(n-1)u(n) / (1 + u^T(n)P(n-1)u(n))) [d(n) - u^T(n)w(n-1)],    (8.37)

P(n) = P(n-1) - P(n-1)u(n)u^T(n)P(n-1) / (1 + u^T(n)P(n-1)u(n)).    (8.38)

10 An equivalence discussion between the RLS filter and the Kalman filter is given in [17, 22].
8.4.3

We may also apply the proportionate adaptation principle to the affine projection
filter (Chapter 7, this volume), resulting in a new proportionate affine projection
adaptation (PAPA) paradigm:12

m(n) = m(n-1) - m(n-1)U(n)[m(n-1)U(n)]^T / (m + tr{U^T(n)m(n-1)U(n)}),    (8.39)

w(n) = w(n-1) + m(n)U(n)e(n),    (8.40)

where U(n) = [u(n), ..., u(n-m+1)] collects the current and m-1 past input vectors.
The learning rate update may also be written as

m(n+1) = m(n) - [(1/m) SUM_{t=0}^{m-1} m(n)u(n-t)[m(n)u(n-t)]^T]
               / [(1/m) SUM_{t=0}^{m-1} (1 + u^T(n-t)m(n)u(n-t))]    (8.41)

       = m(n) - (1/m) m(n)U(n)[m(n)U(n)]^T / (1 + (1/m) tr{U^T(n)m(n)U(n)}),

which is actually an averaged version of Eq. (8.35), given the current and m-1 past
observations.

11 This analogy can be understood by comparing the LMS and NLMS algorithms with the learning rate
scalar.
12 In the original PAPA algorithm [3], m(n) = m G(n) / (u^T(n)G(n)u(n) + a), where G(n) is defined in
the same way as in the PNLMS algorithm.
Replacing the input vector by a general gradient vector g(n) leads to a quasi-Newton
form:

m(n) = m(n-1) - [m(n-1)g(n)][m(n-1)g(n)]^T / (1 + g^T(n)m(n-1)g(n)),    (8.42)

w(n) = w(n-1) + m(n)g(n)e(n).    (8.43)

A sign-sign variant is

w(n) = w(n-1) + m(n) sgn(u(n)) sgn(e(n)).    (8.46)
8.5 APPLICATIONS

8.5.1 Adaptive Equalization
The rst computer experiment is taken from [17] on adaptive equalization. The
purpose of this toy problem is to verify the fast convergence of our proposed
proportionate adaptation paradigms compared to the other stochastic gradient
algorithms, including PNLMS and PNLMS. The equalizer has N 11 taps, and
the impulse response of channel is described by the raised cosine
8
< 1 1 cos 2p n 2 ; n 1; 2; 3
W
hn 2
:
0;
otherwise;
where W controls the amount of amplitude distortion produced by the channel (and also the eigenvalue spread of the correlation matrix of the tap inputs). In our experiments, W = 3.1 and a signal-to-noise ratio (SNR) of 30 dB are used. For comparison, various learning rate parameters were chosen for all of the algorithms, but only the best results are reported here. The experimental curve was obtained by ensemble-averaging the squared value of the prediction error over 100 independent trials, as shown in Figure 8.1.
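This experiment can be approximated with a short NumPy sketch. The BPSK source, the plain NLMS equalizer, the decision delay of 7, and the 20-trial ensemble average below are illustrative assumptions rather than the exact setup of [17]:

```python
import numpy as np

def raised_cosine_channel(W):
    """Raised-cosine channel taps h(1..3); W sets the amplitude distortion."""
    n = np.arange(1, 4)
    return 0.5 * (1.0 + np.cos(2.0 * np.pi * (n - 2) / W))

def equalizer_trial(W=3.1, N=11, mu=0.5, snr_db=30, T=600, delay=7, seed=0):
    """One NLMS equalization run; returns the squared-error curve."""
    rng = np.random.default_rng(seed)
    h = raised_cosine_channel(W)
    s = rng.choice([-1.0, 1.0], size=T)              # BPSK training symbols
    x = np.convolve(s, h)[:T]                        # channel output
    x += rng.standard_normal(T) * 10 ** (-snr_db / 20)
    w = np.zeros(N)
    e2 = np.zeros(T - N)
    for n in range(N, T):
        u = x[n - N:n][::-1]                         # equalizer tap inputs
        e = s[n - delay] - u @ w                     # error vs delayed symbol
        w += mu * e * u / (u @ u + 1e-6)             # NLMS update
        e2[n - N] = e * e
    return e2

# ensemble-average the squared error over independent trials
curve = np.mean([equalizer_trial(seed=k) for k in range(20)], axis=0)
print(curve[:20].mean() > curve[-100:].mean())       # error decays as w converges
```

Swapping the NLMS update for a proportionate one (as in the sketch shown earlier for PNLMS) reproduces the kind of convergence comparison reported in Figure 8.1.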
8.5.2 Decision-Feedback Equalization

Figure 8.2 A schematic diagram of DFE. (a) Channel model; (b) training phase: the dashed box of the soft decision is nonexistent for the linear equalizer; (c) testing phase: the hard decision is always used for DFEs trained with both linear and nonlinear algorithms.
TABLE 8.2 Properties of Channels B, C, D, G, and S.

q(x(n)) = {  1,     if x(n) > 1
          { -1,     if x(n) < -1
          {  x(n),  if |x(n)| <= 1.     (8.48)
β(n) = β(n-1) + μ ψ(n) e(n),     (8.49)

where

ψ(n) = (1/2) u^T(n)w(n) [1 - tanh^2(β(n-1) u^T(n)w(n))]
and μ is a predefined small real-valued step-size parameter (in our experiments μ = 0.05). The input u(n) here is an augmented vector u(n) = [u(n), ..., u(n - N1 + 1), y(n-1), ..., y(n - N2)]^T (N = N1 + N2), where N1 and N2 are the lengths of the tap weights of the feedforward and feedback filters, respectively. The numbers of tap weights of the feedforward and feedback filters, as well as the decision delays, are summarized in Table 8.3. The learning (convergence) curves are averaged over 100 independent trials using 1000 training symbols at an SNR of 14 dB. The convergence results for the time-invariant linear Channels B and C using different algorithms are shown in Figures 8.3a and 8.4a, respectively. In order to observe the evolution of the slope parameter β, the trajectories of β(n) are plotted in Figure 8.3b for all of the channels. As shown, without exception, they increase as convergence is approached.
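The monotone growth of the slope parameter can be reproduced with a toy sketch of the update (8.49). The three-tap setup and the assumption of an already-converged weight vector are illustrative; only the update rule follows the text:

```python
import numpy as np

def slope_update(beta, w, u, target, mu_beta=0.05):
    """Adapt the soft-decision slope beta by the stochastic gradient rule of
    Eq. (8.49); psi(n) includes the 1/2 factor given in the text."""
    x = u @ w
    y = np.tanh(beta * x)                               # soft decision output
    e = target - y
    psi = 0.5 * x * (1.0 - np.tanh(beta * x) ** 2)
    return beta + mu_beta * psi * e

# with correct weights, beta grows so tanh(beta * x) hardens toward +/-1 symbols
rng = np.random.default_rng(2)
w = np.array([1.0, 0.0, 0.0])        # assume an already-trained equalizer
beta = 1.0
for _ in range(3000):
    s = rng.choice([-1.0, 1.0])      # transmitted symbol
    u = np.array([s, rng.standard_normal(), rng.standard_normal()])
    beta = slope_update(beta, w, u, s)
print(beta > 1.0)   # -> True
```

Since the residual error e(n) has the same sign as ψ(n) once the weights are correct, β(n) increases monotonically, matching the trajectories in Figure 8.3b.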
The convergence results for the time-invariant nonlinear Channel S are shown in Figure 8.4b.

TABLE 8.3 Numbers of input (feedforward) taps, feedback taps, tap weights, and decision delays of the DFEs for each channel.

Figure 8.3 (a) Learning curves of DFE for Channel B; (b) time evolution of the slope parameter β for Channels B, C, D, G, and S.

Figure 8.5 gives the BER curves of Channels B, D, S, and G, with
different equalizers. The BER is calculated over 10,000 test data, averaged over 100 independent trials after training on 100-symbol data sequences, with the SNR varying from 4 to 14 dB. For the nonlinear channels, the NRPNLMS-DFE13 with minimal parameters outperforms all of the linear equalizers and even performs better than many neural equalizers, including the decision-feedback recurrent neural equalizer (DFRNE), the decision-feedback Elman network, and the decision-feedback recurrent multilayer perceptron (RMLP), with lower BER, much less algorithmic complexity and CPU time, and a much lower memory requirement.
The NRPNLMS algorithm can also be used for time-variant channels, which can be modeled by varying the coefficients of the impulse response h(n). In particular,

13 In the training phase, the NRPNLMS-DFE can be regarded as an RPNLMS-DFE passing a zero-mean soft nonlinearity of the hyperbolic tangent type.
Figure 8.4 (a) Learning curves of DFE for Channel C; (b) learning curves of DFE for
Channel S.
H(z) = Σ_{i=0}^{N-1} a_i(n) z^{-i}.     (8.50)
The coefficients are functions of time, and they are modeled as zero-mean Gaussian random processes with user-defined variance. The time-variant coefficients a_i(n) are generated by using a second-order Markov model in which white Gaussian noise (zero mean, variance σ^2) drives a second-order Butterworth lowpass filter (LPF). In MATLAB14 language, this can be written using the functions butter and filter as follows:

[B,A]=butter(2,fs/Fs);
Ai=ai+filter(B,A,sigma*randn(1,1000));

where B and A are the numerator and denominator of the LPF, respectively; fs/Fs is the normalized cutoff frequency, with fs being the fading rate (the smaller fs is, the slower the fading) and Fs being the sampling rate; ai is the fixed coefficient, and Ai is the corresponding time-varying 1000-length vector of values of a_i at different moments. The choice of fs in our experiments is 0.1 to 0.5 Hz (0.1 corresponds to slow fading, whereas 0.5 corresponds to fast fading); a typical choice of Fs in our experiments is 2400 bits/s.
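A Python counterpart of the MATLAB snippet, using SciPy's `butter` and `lfilter` (the specific ai, sigma, and fs values below are illustrative):

```python
import numpy as np
from scipy import signal

def time_varying_coeff(ai, fs, Fs, sigma, T=1000, seed=0):
    """Python counterpart of the MATLAB snippet: a_i(n) = ai + LPF(white noise),
    where the 2nd-order Butterworth LPF has normalized cutoff fs/Fs."""
    rng = np.random.default_rng(seed)
    B, A = signal.butter(2, fs / Fs)          # [B,A] = butter(2, fs/Fs)
    return ai + signal.lfilter(B, A, sigma * rng.standard_normal(T))

# slow fading: fs = 0.5, Fs = 2400, fixed coefficient ai = 0.5
Ai = time_varying_coeff(ai=0.5, fs=0.5, Fs=2400.0, sigma=0.01)
print(Ai.shape, abs(Ai.mean() - 0.5) < 0.1)
```

Both `butter` implementations normalize the cutoff the same way (relative to the Nyquist frequency), so the two snippets generate statistically equivalent coefficient trajectories.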
Only the NRPNLMS algorithm with an adaptive slope parameter is investigated here. A three-tap forward filter, a two-tap feedback filter (i.e., N = 5 in total), and
Figure 8.5 BER curves of Channels B, D, S, and G with different equalizers.
Figure 8.6 Left: convergence curves of time-variant slow-fading and fast-fading channels using NRPNLMS with an adaptive slope parameter. Right: BER of time-variant slow-fading and fast-fading channels.
a decision delay of two samples are used in the experiments. The results of convergence and BER are shown in Figure 8.6. More experiments on time-variant multipath channels, including wireless channels, will be reported elsewhere.
8.5.3 Echo Cancellation

In telecommunications, echoes are generated electrically due to impedance mismatches at points along the transmission medium and are thus called line or network echoes [3]. In particular, the echoes become objectionable because of the delay, especially in long-distance connections. To alleviate this problem and improve conversation quality, the first echo canceler using the LMS algorithm was developed at Bell Labs in the 1960s [30]. Nowadays in the echo cancellation industry, the NLMS filter is still popular due to its simplicity. Recently, there has been some progress in the echo cancellation area, where the idea of proportionate adaptation (originally the PNLMS algorithm) originated (see [3, 13] for an overview).
First, a simple network echo cancellation problem is studied. A schematic diagram of echo cancellation with a double-talk detector (DTD) is shown in Figure 8.7. In the experiments, the far-end speech (i.e., the input excitation signal) is 16-bit PCM coded and lies in the range [-32768, 32767]; the sampling rate is 8 kHz. The normalized measured echo path impulse response is shown in Figure 8.8b, which can be viewed as a noisy version of the real impulse response. White Gaussian noise with an SNR of 30 dB is added to the near-end speech. The length of the tap weight vector (i.e., the impulse response) is N = 200. A variety of recursive adaptive filter algorithms of interest are investigated, including NLMS, PNLMS, PNLMS++ (double update), PANLMS, PANLMS-II, and RPNLMS. The parameters of the PNLMS and PNLMS++ algorithms are chosen as δ = 0.01, ρ = 5/N = 0.025, α = 0.001. The learning rate scalar μ is 0.2 for NLMS and PNLMS and 0.8 for PNLMS++; for the PANLMS and PANLMS-II algorithms, μ = 0.1 and μ(0) = I. The initial tap weights are set
Figure 8.7 A schematic diagram of echo cancellation with a double-talk detector (DTD).

to zero. The misalignment is defined as

||w_o - w(n)||^2 / ||w_o||^2.
The misalignment curves are shown in Figure 8.8c. As observed, the performance of the proposed PANLMS and PANLMS-II algorithms is almost identical, and both are better than NLMS, PNLMS, and PNLMS++. Among the algorithms tested, RPNLMS achieves the best performance, though at the cost of increased computational complexity and memory requirements, especially when N is large.
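The misalignment measure used in these curves is straightforward to compute; a minimal sketch (the five-tap example vector is illustrative):

```python
import numpy as np

def misalignment_db(w_true, w):
    """Normalized misalignment ||w_o - w(n)||^2 / ||w_o||^2, in dB."""
    return 10.0 * np.log10(np.sum((w_true - w) ** 2) / np.sum(w_true ** 2))

w_o = np.array([0.0, 1.0, -0.5, 0.0, 0.1])
print(misalignment_db(w_o, np.zeros(5)))        # 0 dB before adaptation starts
print(misalignment_db(w_o, 0.99 * w_o) < -30)   # -> True once nearly converged
```

Because the measure is normalized by the echo-path energy, curves for different echo paths (as in the tracking experiment below) remain directly comparable.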
Figure 8.8 (a) Far-end speech; (b) normalized measured echo path impulse response; (c)
misalignment.
Figure 8.9 (a) Impulse responses of the two echo paths; (b) misalignment curves.
Second, we consider the echo-path change situation in order to study the tracking performance of the proposed algorithms. Figure 8.9a illustrates the different impulse responses of two echo paths. In the first 4 s, the first echo path is used; after 4 s, the echo path is changed abruptly. The misalignment curves of the proposed algorithms are shown in Figure 8.9b. As shown, the newly developed proportionate adaptation algorithms also exhibit very good tracking performance. It should be noted that, compared to the PANLMS and PANLMS-II algorithms, the tracking performance (for the second echo path) of the RPNLMS algorithm is worse due to the time-decreasing nature of μ(n). Hence, it is favorable to reinitialize the learning rate matrix once a change in the echo path is detected.
We also consider the double-talk situation.15 The design of an efficient DTD is essential in network echo cancellation. Although many advanced DTD algorithms (e.g., cross-correlation or coherence methods) exist, a simple DTD called the Geigel algorithm [3, 13], with threshold T = 2, is used in the experiment. In addition, in order to handle the divergence problem due to the presence of a near-end speech signal, some robust variants of the proportionate adaptation paradigms, based on robust statistics [20], were developed [3, 11]. For clarity of illustration, only the results of the robust PNLMS, robust PNLMS++, and robust PANLMS-II algorithms are shown here.16 In particular, the robust PANLMS-II algorithm is described as
w(n) = w(n-1) + [μ(n)u(n) / (||u(n)||^2 + α)] ψ(|e(n)|/s(n)) sgn[e(n)] s(n),     (8.51)

ψ(|e(n)|/s(n)) = min(|e(n)|/s(n), k_0),     (8.52)

s(n+1) = λ_s s(n) + [(1 - λ_s)/β] ψ(|e(n)|/s(n)) s(n).     (8.53)

15 Namely, it happens when the far-end and near-end speakers speak simultaneously.
16 A detailed study and investigation are given in [6].
Figure 8.10 (a) Far-end speech; (b) near-end speech; (c) the rectangles indicate where the
double-talk is detected; (d) misalignment curves.
8.6 DISCUSSION

8.6.1 Complexity

8.6.2 Robustness

17 There exists a trade-off between computational complexity and memory requirements. Here we sacrifice memory by storing intermediate results to reduce the computation cost.
18 As mentioned before, a fast calculation scheme [21] with linear complexity (see Appendix E) allows RPNLMS to be implemented more efficiently.
TABLE: Comparison of the LMS, NLMS, PNLMS, PNLMS++, PANLMS, PANLMS-II, RPNLMS, SR-RPNLMS, EG, PAEG, RLS, and Kalman filters in terms of H-infinity optimality, robustness, computational complexity, memory requirement, and convergence rate.

Note: Computational complexity is measured over one complete iteration. The order of computation is denoted in terms of the number of FLOPS: A + M + D + E + S, where A denotes addition, M denotes multiplication, D denotes division, E denotes exponentiation, and S denotes sorting. UB, upper bounded.
Convergence behavior and tracking are two important performance measures for adaptive filters. Convergence behavior is a transient phenomenon, whereas tracking is a steady-state phenomenon.
8.6.4 Loss Function

The squared loss function (L2 norm) is widely used in the adaptive filtering community due to its simplicity for optimization, though it is not necessarily the best or the only choice. A general error metric for adaptive filters was studied in [19]. Since the loss function is essentially related to the noise density in regularization theory [5], we may consider using different loss functions, especially in the context of stochastic robustness.
8.7 CONCLUDING REMARKS

APPENDIX A:
Definition 1 (The H-infinity Norm [15]) Let h2 denote the vector space of square-summable, real-valued causal sequences with inner product <{f(n)}, {g(n)}> = Σ_{n=0}^∞ f(n)g(n). Let T be a transfer operator that maps an input sequence {u(n)} to an output sequence {y(n)}. Then the H-infinity norm of T is defined as

||T||_∞ = sup_{u ≠ 0, u ∈ h2} ||y||_2 / ||u||_2,

where ||u||_2 denotes the h2 norm of the causal sequence {u(n)}. In other words, the H-infinity norm is the maximum energy gain from the input u to the output y.
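For an LTI operator, the maximum energy gain equals the peak magnitude of the frequency response, which gives a quick way to evaluate the H-infinity norm of an FIR transfer operator numerically (the two-tap example is illustrative):

```python
import numpy as np

def hinf_norm_fir(h, ngrid=4096):
    """For an LTI transfer operator, the H-infinity norm equals the peak of
    |H(e^{j*omega})|; evaluate it on a dense frequency grid for FIR taps h."""
    H = np.fft.rfft(h, n=ngrid)    # frequency response samples on [0, pi]
    return np.max(np.abs(H))

h = np.array([0.5, 0.5])           # two-tap averager
print(round(hinf_norm_fir(h), 4))  # peak gain is 1.0, attained at omega = 0
```

The adaptive filters considered in the appendix are time-varying, so the H-infinity norm there is defined directly through the energy-gain ratio rather than through a frequency response, but the LTI case is a useful sanity check on the definition.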
and

sup_{w_o, e ∈ h2} ||e_p||_2^2 / [μ^{-1} |w_o - w(0)|^2 + ||e*||_2^2],     (A.1)

inf_F sup_{w_o, e ∈ h2} ||e||_2^2 / [(w_o - w(0))^T μ^{-1} (w_o - w(0)) + ||e*||_2^2].     (A.2)
where g(n) = ∂f(w(n-1))/∂w_o. We then have the following suboptimal algorithm: if the g(n) are exciting in the sense that

lim_{N→∞} Σ_{n=0}^{N} g^T(n)μ(n)g(n) = ∞     (A.3)

and

0 < g^T(n)μ(n)g(n) < 1,  i.e.,  0 < μ(n) < [g(n)g^T(n)]^{-1},     (A.4)

then, expanding f about w(n-1) and neglecting the second-order term involving the Hessian ∂^2 f(w(n-1))/∂w_o^2, we obtain

f(w(n)) ≈ f(w(n-1)) + g^T(n)[w_o - w(n-1)].     (A.5)

Furthermore, we have the following proposition:

APPENDIX B:
First, note that μ(n) defined by Eq. (8.6) always satisfies condition (1) of Proposition 2, since μ(n) > 0, and we have

u^T(n)μ(n)u(n) = μ u^T(n)G(n)u(n) / [u^T(n)G(n)u(n) + α] < 1     (B.1)

by virtue of α > 0 and 0 < μ ≤ 1.
Second, we want to show that for any time step n, the H-infinity minimax problem formulated in Eq. (A.1) is always satisfied for the PNLMS algorithm. From Definition 3, for all w_o, w(0), and μ(n) and for all nonzero e ∈ h2, one should find an estimate w(n) such that

||e||_2^2 / [(w_o - w(0))^T μ^{-1}(n) (w_o - w(0)) + ||e*||_2^2] < γ^2.     (B.2)
Equivalently, one must show that for all w_o ≠ w(0), the estimate w(n) always guarantees J(n) > 0, where

J(n) = (w_o - w(0))^T μ^{-1}(n) (w_o - w(0)) + Σ_{n=0}^{∞} |d(n) - u^T(n)w_o|^2 - γ^{-2} Σ_{n=0}^{∞} |u^T(n)w_o - u^T(n)w(n-1)|^2.

Since J(n) is quadratic with respect to w_o,21 it has a minimum over w_o provided that the following Hessian matrix is positive definite, namely,

∂^2 J(n)/∂w_o^2 = μ^{-1}(n) + (1 - γ^{-2}) Σ_{n=0}^{∞} u(n)u^T(n) > 0.     (B.3)

Suppose γ < 1, so that 1 - γ^{-2} < 0. By the exciting condition,

Σ_{n=0}^{∞} |u_k(n)|^2 = ∞,     (B.4)

which implies that the kth diagonal entry of the Hessian matrix in Eq. (B.3) is negative:

μ_k^{-1}(n) + (1 - γ^{-2}) Σ_{n=0}^{∞} |u_k(n)|^2 < 0.     (B.5)

Hence, μ^{-1}(n) + (1 - γ^{-2}) Σ_{n=0}^{∞} u(n)u^T(n) cannot be positive definite, and Eq. (B.3) is violated. Therefore γ_opt ≥ 1. We now attempt to prove that γ_opt is indeed equal to 1. For this purpose, we consider the case of γ = 1. Equation (B.3) then reduces to μ(n) > 0, which is always true from the conditions of Proposition 2.
Now that we have guaranteed that for γ = 1 the quadratic form J(n) has a minimum over w_o, the next step is to show that the estimate given by the PNLMS algorithm at each time step n always guarantees J(n) > 0 for the same choice γ = 1.

21 Note that although μ(n) is time-varying and data-dependent, this does not invalidate the quadratic property of J(n) with respect to w_o.
For n = 1, J(1) > 0 follows directly from condition (2) of Proposition 2 (Eq. (B.6)); for n = 2, J(2) can be written as a quadratic form in [w_o - w(0); d(1) - u^T(1)w(0)] (Eq. (B.7)). Observing that the second matrix of the last equality in Eq. (B.7) is positive definite by virtue of condition (2) of Proposition 2, it follows that J(2) > 0. This argument can be continued to show that J(n) > 0 for all n ≥ 3, which then states that, if the conditions of Proposition 2 are satisfied, then γ_opt = 1 and the PNLMS algorithm achieves it. Hence, the H-infinity norm is

sup_{w_o, e ∈ h2} Σ_{n=0}^{∞} |e(n)|^2 / [(w_o - w(0))^T μ^{-1}(n) (w_o - w(0)) + Σ_{n=0}^{∞} |e*(n)|^2] = 1.     (B.8)
APPENDIX C: CONVERGENCE OF THE LEARNING RATE MATRIX

e_fi^2(n) = e_pi^2(n) (1 - Σ_{k=1}^{N} μ_k(n))^2.     (C.1)

Rearranging the terms and taking the limit of both sides of Eq. (C.1),

lim_{n→∞} e_fi^2(n)/e_pi^2(n) = lim_{n→∞} (1 - tr μ(n))^2.     (C.2)

In the limit of convergence, the left-hand side equals 1. Thus, the right-hand side should also be 1, and it follows that

lim_{n→∞} Σ_{k=1}^{N} μ_k(n) = 0.     (C.3)

Since the μ_k(n) are nonnegative, this implies

lim_{n→∞} μ_k(n) = 0,   k = 1, ..., N.     (C.4)

In the special case of a time-varying learning rate scalar, where μ(n) = μ(n)I, the above derivation still holds. For the PNLMS algorithm, however, we cannot generally ensure that lim_{n→∞} Σ_{k=1}^{N} μ g_k(n) = 0, by recalling Eq. (8.2) and Eq. (8.6), where lim_{n→∞} g_k(n) ≠ 0.
APPENDIX D:

The proof follows the idea presented in [32]. First, we want to prove that the RPNLMS algorithm is optimal in minimizing the cumulative quadratic instantaneous error Σ_n |e(n)|^2 / 2. Denote the optimal learning rate matrix by μ_o(n). In particular, we have the following form:

w(n) = w(n-1) + μ_o(n)u(n)[d(n) - u^T(n)w(n-1)],     (D.1)

and the optimal μ_o(n) is supposed to approximate the inverse of the Hessian [32]. In the linear case, the Hessian is approximately represented by H(n) = H(n-1) + u(n)u^T(n). According to the matrix inversion lemma, we have

H^{-1}(n) = H^{-1}(n-1) - H^{-1}(n-1)u(n)u^T(n)H^{-1}(n-1) / [1 + u^T(n)H^{-1}(n-1)u(n)].     (D.2)

Since E|w~(n)|^2 = tr E[w~(n)w~^T(n)], where w~(n) denotes the weight error vector, minimizing E|w~(n)|^2 is equivalent to minimizing the trace of the following matrix:

η(n) = σ^{-2} E[w~(n)w~^T(n)].     (D.3)
Substituting the weight update (D.1) into the definition of η(n) and writing c(n) = 1 + u^T(n)η(n-1)u(n) yields

η(n) = η(n-1) - η(n-1)u(n)[η(n-1)u(n)]^T / c(n)
       + c(n) [μ_o(n)u(n) - η(n-1)u(n)/c(n)] [μ_o(n)u(n) - η(n-1)u(n)/c(n)]^T,     (D.4)

whose trace,

tr η(n) = tr η(n-1) - ||η(n-1)u(n)||^2 / c(n) + c(n) ||μ_o(n)u(n) - η(n-1)u(n)/c(n)||^2,     (D.5)

is minimized by choosing μ_o(n)u(n) = η(n-1)u(n)/c(n); in particular,

μ_o(n) = η(n-1) / [1 + u^T(n)η(n-1)u(n)],     (D.6)

and hence

μ_o(n) = μ_o(n-1) - μ_o(n-1)u(n)[μ_o(n-1)u(n)]^T / [1 + u^T(n)μ_o(n-1)u(n)],     (D.7)

which is essentially the RPNLMS algorithm. Here μ_o(n-1)u(n) plays the role of the Kalman gain K(n). Thus far, the proof is completed.
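The equivalence between the rank-one recursion (D.7) and the inverse Hessian can be checked numerically via the matrix inversion lemma (the dimensions and random data below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
N, T = 5, 50
mu = np.eye(N)                         # mu(0) = I, i.e., H(0) = I
U = rng.standard_normal((T, N))        # regressor vectors u(1), ..., u(T)

# rank-one recursion of Eq. (D.7), one regressor at a time
for u in U:
    mu = mu - np.outer(mu @ u, mu @ u) / (1.0 + u @ mu @ u)

# direct inverse of the accumulated Hessian: (mu(0)^-1 + sum_n u(n)u^T(n))^-1
direct = np.linalg.inv(np.eye(N) + U.T @ U)
print(np.allclose(mu, direct))   # -> True: (D.7) tracks the inverse Hessian exactly
```

Each step of the loop is one application of the Sherman-Morrison identity, which is why the recursion reproduces the batch inverse exactly (up to rounding).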
APPENDIX E:

In [21], a fast and computationally efficient scheme was proposed to calculate the a priori Kalman gain of the form

K(n) = [Σ_{j=0}^{n} x(j)x^T(j)]^{-1} x(n),

where x(j) can be an m-by-1 vector or, more generally, an mp-by-1 vector such that x(j+1) is obtained from x(j) by introducing p new elements and deleting p old ones. In particular, the scheme can be used straightforwardly to implement the RPNLMS and NRPNLMS algorithms (where p = 1 and thus mp = N).
The fast algorithm, similar in spirit to Levinson's algorithm in the linear estimation (prediction) literature (see, e.g., [17]), is summarized in the following generic lemma [21]:
Lemma 1 Let

x(n) = [z^T(n-1), ..., z^T(n-m)]^T.

Then the quantity

K(n) = [Σ_{j=1}^{n} x(j)x^T(j) + δI]^{-1} x(n)

can be computed recursively by

e(n) = z(n) - A^T(n-1)x(n),     (E.1)
A(n) = A(n-1) + K(n)e^T(n),     (E.2)
e'(n) = z(n) - A^T(n)x(n),     (E.3)
S(n) = S(n-1) + e'(n)e^T(n),     (E.4)
K~(n) = [S^{-1}(n)e'(n); K(n) - A(n)S^{-1}(n)e'(n)],     (E.5)

and, repartitioning the extended gain vector K~(n) into its first mp and last p entries,

K~(n) = [m(n); ν(n)],     (E.6)
k(n) = z(n-m) - D^T(n-1)x(n+1),     (E.7)
D(n) = D(n-1) + K(n+1)k^T(n),     (E.8)
K(n+1) = m(n) + D(n)ν(n).     (E.9)
Here K, x, and m are mp-by-1; z, k, e, e', and ν are p-by-1; A and D are mp-by-p; S is p-by-p; and the extended gain K~ is p(m+1)-by-1.

APPENDIX F:
CONVERGENCE ANALYSIS
The convergence analysis of the LMS algorithm can be addressed in the framework of stochastic approximation [4]. For a learning rate scalar, we have

Lemma 2 In order to guarantee the convergence w(n) → w_o, it is necessary for the learning rate μ(n) to satisfy

Σ_{n=1}^{∞} μ^2(n) < ∞   and   Σ_{n=1}^{∞} μ(n) = ∞,     (F.1a)

E||∇^(w(n))||^2 ≤ a + b (w(n) - w_o)^T (w(n) - w_o),   a ≥ 0, b ≥ 0.     (F.1b)

The convergence analysis of the proportionate adaptation paradigms with time-varying learning rate matrices can be carried out similarly by using the quasi-martingale convergence theorem (see, e.g., [4]). Without presenting the proof here, we give the following theorem:

Theorem 1 In the case of on-line proportionate adaptation, almost sure (a.s.) convergence is guaranteed only when the following conditions hold:

Σ_{n=1}^{∞} λ_max^2(μ(n)) < ∞   and   Σ_{n=1}^{∞} λ_min(μ(n)) = ∞,     (F.2a)

E||∇^(w(n))||^2 ≤ a + b (w(n) - w_o)^T (w(n) - w_o),   a ≥ 0, b ≥ 0.     (F.2b)
APPENDIX G: NOTATIONS

Table of the symbols used in this chapter (among them d, E, e, e_pi, e_fi, e_p, e_f, f, G(n), g_k, γ, H(n), h(n), H(z), I, K, N, N(0, σ^2), sgn, tanh, tr, u, U, w, w(0), w_o, w~(n), y, μ, μ(n), μ_o, η, λ_max, λ_min, α, β, δ, ρ) and their descriptions.
Acknowledgments

S.H. and Z.C. are supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada. Z.C. is the recipient of the IEEE Neural Networks Society 2002 Summer Research Grant, and he would like to express his thanks for the summer internship and financial support provided by Bell Labs, Lucent Technologies. The authors also acknowledge Dr. Anders Eriksson (Ericsson Company, Sweden) for providing some data and help in the earlier investigation on echo cancellation. The results on decision-feedback equalization presented here are partially based on collaborative work with Dr. Antonio C. de C. Lima. The experiments on network echo cancellation were partially done at Bell Labs; the authors thank Drs. Thomas Gansler and Jacob Benesty for some helpful discussions.
REFERENCES

1. B. D. O. Anderson and J. B. Moore, Optimal Filtering. Englewood Cliffs, NJ: Prentice-Hall, 1979.
2. W.-P. Ang and B. Farhang-Boroujeny, A new class of gradient adaptive step-size LMS algorithms, IEEE Transactions on Signal Processing, 49, 805-809 (2001).
3. J. Benesty, T. Gansler, D. R. Morgan, M. M. Sondhi, and S. L. Gay, Advances in Network and Acoustic Echo Cancellation. New York: Springer-Verlag, 2001.
4. L. Bottou, On-line learning and stochastic approximation, in D. Saad, ed., On-line Learning in Neural Networks. Cambridge: Cambridge University Press, 1998, pp. 9-42.
STEADY-STATE DYNAMIC
WEIGHT BEHAVIOR IN
(N)LMS ADAPTIVE FILTERS
A. A. (LOUIS) BEEX
DSPRL, ECE, Virginia Tech, Blacksburg, Virginia
and
JAMES R. ZEIDLER
SPAWAR Systems Center, San Diego, California
University of California, San Diego, La Jolla, California
9.1
INTRODUCTION
Nonlinear effects were demonstrated to be a fundamental property of least-mean-squares (LMS) adaptive filters in the early work on adaptive noise cancellation applications with sinusoidal interference [38]. The fundamental adaptive filter configuration for noise canceling is shown in Figure 9.1. The adaptive filter adjusts the weights w_m, which are used to form the instantaneous linear combination of the signals that reside in the tapped delay line at its input.
It was established [38, 19] that when the primary input to an LMS adaptive noise canceler (ANC), d(n), contains a sinusoidal signal of frequency ω_d and the reference input, r(n), contains a sinusoidal signal of a slightly different frequency, ω_r, the weights of the LMS ANC will converge to a time-varying solution which modulates the reference signal at ω_r and heterodynes it to produce an output signal y(n) which consists of a sinusoidal signal at ω_d to match the frequency in the desired signal. This was shown to produce a notch filter with a bandwidth that is controlled by the product of the adaptive step size of the LMS algorithm and the filter order. It was shown that by selecting the appropriate step size, the resulting notch bandwidth can be significantly less than that of a conventional linear filter of the same order. Since these effects cannot be predicted from classical linear systems analysis, several authors [34, 8] have also described these nonlinear phenomena as non-Wiener effects.
Least-Mean-Square Adaptive Filters, Edited by Simon Haykin and Bernard Widrow.
ISBN 0-471-21570-8 © 2003 John Wiley & Sons, Inc.

Figure 9.1 The adaptive noise-canceling configuration.
inputs. This result contradicts the traditional assumption [18, 28, 38, 39] that the misadjustment noise of the LMS filter (i.e., the difference between the MSE of the WF and the LMS filter) represents the loss in performance associated with the adaptive estimation process.
It was recognized [31] that the improvements in MSE due to nonlinear effects are bounded by the MSE for an infinite-length WF that includes contributions of the past and present values of the reference signal r(n) and past values of the desired response d(n). The analysis is based on constraining the processes {d(n)} and {r(n)} to be jointly wide-sense-stationary (WSS) so that the WF is time invariant. The infinite past of the WSS process is not available to the finite-length Wiener filter but is available to the infinite-length WF and may be available to the LMS adaptive filter. It is shown that there are often substantial performance improvements for an LMS filter over a finite-length WF of the same order, but that the performance is always bounded by that of an optimal WF of infinite order, operating on the past desired and present and past reference inputs.
The performance of the LMS and exponentially weighted recursive-least-squares (RLS) estimators was compared [32] for both the noise-canceling and the interference-contaminated equalizer applications where nonlinear effects had been observed in the LMS algorithm. The exponentially weighted RLS estimator did not exhibit enhanced performance for these cases. It is important to note that the LMS algorithm does not make any a priori assumptions on the temporal correlation of the input model. The LMS filter selects from a manifold of potential weight vector solutions to minimize MSE based solely on its present state and the current desired and reference input data. The method by which these solutions are achieved will be described in detail in this chapter for several cases in which nonlinear effects are observed in LMS filters.
It was previously shown [23] that the improved tracking performance of the LMS algorithm relative to the exponentially weighted RLS algorithm [9, 26, 27] results from the fact that the correlation estimate used by the algorithm does not match the true temporal correlation of the data. An extended RLS algorithm [23], which incorporates estimates of the chirp rate into the state space model, can provide tracking performance superior to that of both the LMS and exponentially weighted RLS algorithms for the tracking of a chirped narrowband signal in noise. Likewise, for the noise-canceling applications considered in [19], it would be possible to introduce an extended RLS estimator that estimates the frequencies of the primary and reference inputs and incorporates those estimates into the filtering process. Such approaches could provide performance much closer to the optimal bounds that are given below, provided that the state space model used accurately describes the input data. There are many applications, however, where there are underlying uncertainties and nonstationarities in the input processes that do not allow an accurate state space model to be defined. The advantage of the LMS estimator for such cases is that it is not necessary to know the statistics of the input processes a priori.
In this chapter, we will begin by introducing three scenarios in Section 9.2 where nonlinear effects are observed in LMS filters and one in which they are not easily observed (wideband ANC). These four scenarios provide useful comparisons of the magnitude of the effects which can be expected under different conditions, and will be considered throughout the chapter as we develop the mechanisms that produce nonlinear effects. These scenarios are also used to illustrate what is required to realize performance that approaches the optimal bounds, as provided by an infinite-length WF which has access to all the present and past of the reference signal and all the past of the desired response.
Much of the previous work on nonlinear effects has focused on the behavior of the LMS filter for sinusoidal inputs. The performance here will be obtained for both deterministic sinusoids and stochastic first-order autoregressive AR(1) inputs, so that the effect of signal bandwidth on the adaptive filter performance can be described and so that the results are applicable to a larger set of adaptive filter applications.
We will focus on the use of the normalized LMS (NLMS) algorithm rather than LMS, so that we can utilize the noise normalization properties of NLMS to simplify the performance comparisons. In addition, the affine projection and minimum-norm least-squares interpretations of the NLMS algorithm [20] provide a useful model to define how the information from the past errors couples to the current error in the weight update. It is important to realize that there is generally a manifold of weight vector solutions that minimize MSE. This issue is also addressed in Section 9.3 in the context of the NLMS algorithm. A linear time-invariant (LTI) transfer function model for the NLMS algorithm is defined in Section 9.3.2.
The performance evaluations for finite- and infinite-horizon causal WFs are analyzed for reference-only, for desired-only, and for two-channel LTI Wiener filters in Section 9.4. The absolute bounds [31] are defined, and necessary conditions for achieving performance improvements are delineated. It is only when there is a significant difference in the performance bounds for the two-channel Wiener filter and the reference-only WF that nonlinear performance enhancements may be observable.
Section 9.4 establishes the conditions in which nonlinear performance enhancements are possible; in Section 9.5 we address the mechanisms by which they may be achieved. It is shown that it is possible to define a time-varying (TV) single-channel, reference-only Wiener filter which has exactly the same performance as the two-channel LTI WF defined in Section 9.4. This solution is based on a simple rotation or linking sequence that connects the samples of the desired process and the samples of the reference process. It is shown that the linking sequence is not in general unique, corresponding to the nonuniqueness of the weight vector solution represented by the manifold of possible solutions defined in Section 9.3.
Section 9.6 proves that there is an exact rotational linking sequence between the reference and desired inputs for the deterministic sinusoidal ANC applications defined in Section 9.2, and illustrates that this allows an accurate determination of the adaptive TV weight behavior of the NLMS filter. In addition, the minimum-norm interpretation of the NLMS algorithm forces the new weight vector to be the one that differs minimally from the current solution. This condition resolves the ambiguity in the solutions. The key issue in realizing the potential performance improvements delineated in Section 9.4 is shown to be whether the filter is able to track the temporal variations defined by the single-channel TV WF.
Section 9.7 extends these results to stochastic AR(1) inputs and shows that the properties of the linking sequences between desired and reference processes for the exponential case still hold approximately for the AR(1) case. The approximation inherent in this class of inputs is defined by the stochastic component of the AR(1) model. It is shown that the stochastic component becomes especially important at the zero crossings of the reference process. The result of the emergence of a driving term in the difference equations that represent these processes is that abrupt and significant changes in the individual weight values can be produced over time, as the NLMS filter selects an update that is the minimum-norm variation within the manifold of possible weight vector solutions. It is shown that the key issue in realizing potential improvements is the tracking of the temporal variations defined by the single-channel TV WF.
In Section 9.8 the linking sequence approach is applied to the adaptive linear prediction (ALP) application and the narrowband interference-contaminated equalization (AEQ) application. The auxiliary channel for the ALP case consists of the most recent past values of the desired process. In the equalization application, the auxiliary channel contains the interference signal itself or an estimate of the latter. Time-varying equivalent filters are derived for the corresponding two-channel scenarios. In ALP the equivalent filter can be interpreted as a combination of variable-step predictors of the desired signal. In AEQ the equivalent filter consists of a combination of variable-step predictors of the interference at the center tap.
Finally, in Section 9.9, we indicate the conditions that must be satisfied for nonlinear effects to be a significant factor in NLMS adaptive filter performance. The first necessary condition is that there be a significant difference in performance between the reference-only WF and the two-channel WF using all present and past reference inputs and all past desired inputs (ANC), or all recent past inputs (ALP), or the center-tap interference input (AEQ). The second requirement is that the adaptive filter be capable of tracking the temporal variations of the equivalent reference-only TV WF. In Section 9.9 we show that both of these necessary requirements are satisfied simultaneously for ANC scenarios using various signal-to-noise ratios, bandwidths, frequency differences, and model orders. We also show that a wide WF performance gap alone is not sufficient for the adaptive filter to realize a performance gain over the reference-only WF. We illustrate that in the ALP scenario, more of the Wiener filter performance gap is realized by the adaptive filter when the signal is more narrowband. In the AEQ case the TV nature is such that almost the entire Wiener filter performance gap is realized when the auxiliary choice approximates what is practically realizable.
9.2

In this section we summarize the conditions for which nonlinear effects have been observed previously. Four different scenarios have been selected for illustration: (1) wideband ANC applications, where nonlinear effects are not easily observed; (2) narrowband ANC applications, where nonlinear effects dominate performance; (3)
9.2.1

Figure 9.2 ANC configuration.

Figure 9.3 Signal generator for the desired and reference processes.
process, the two processes must have something in common. We will generate the desired and reference processes according to the signal generator illustrated in Figure 9.3.
For purposes of illustration, the system functions H_d(z) and H_r(z) each have a single pole and are described as follows:

H_d(z) = 1 / (1 - p_d z^{-1}),
H_r(z) = 1 / (1 - p_r z^{-1}).     (9.1)

These systems are driven by the same unit-variance, zero-mean, white noise process {v_0(n)}, thus generating the related AR(1) stochastic processes {d~(n)} and {r~(n)}, with powers (1 - |p_d|^2)^{-1} and (1 - |p_r|^2)^{-1}, respectively [24]. These AR(1) generating systems are therefore governed by the following difference equations:

d~(n) = p_d d~(n-1) + v_0(n),
r~(n) = p_r r~(n-1) + v_0(n).     (9.2)
The desired and reference stochastic processes {d(n)} and {r(n)}, respectively, that
form the inputs to the ANC are noisy versions of the AR(1) processes {d̃(n)} and
{r̃(n)} as a result of the addition of independent, zero-mean white noise to each:

d(n) = d̃(n) + v_d(n),
r(n) = r̃(n) + v_r(n).    (9.3)
The specific poles and measurement noise levels are now specified to complete the
parameterized scenario for Figures 9.2 and 9.3:

p_d = 0.4 e^{jπ/3},
p_r = 0.4 e^{jπ/5},
SNR_d = 60 dB,
SNR_r = 60 dB.    (9.4)
The final parameter to be chosen is M, the number of delays in the AF tapped delay
line in Figure 9.2. For clarity of illustration we select M = 3. The details of the
necessary evaluations for adaptive filtering will be provided in Section 9.3.1 and for
Wiener filtering in Section 9.4.2; for now, we are interested in illustrating the
behavior of the AF in comparison to that of the corresponding WF. The M-tap AF
and WF will be denoted AF(M) and WF(M). For the above scenario, the theoretical
minimum mean square error (MMSE) for the WF(3) and the actual errors for the
WF(3) and AF(3) implementations are shown in Figure 9.4. The difference between
WF(3) and MMSE WF(3) is that the former refers to a finite data realization of the
three-tap Wiener filter implementation, as illustrated in Figure 9.2, while the latter
refers to the theoretical expectation for the performance of such a three-tap WF,
based on perfect knowledge of the statistical descriptions of the processes involved
[the AF solutions are computed using Eqns. (9.6) and (9.7), and the WF solutions
are computed using Eqns. (9.26) through (9.31), to be developed later]. We see that the
WF(3) produces errors close to its theoretically expected performance and that
the AF(3) does almost as well. What looks like excess MSE in the latter case
is commonly attributed to the small variations of the steady-state AF weights.
The behavior of the AF(3) weights, relative to the constant WF(3) weights of
[1, 0.1236 − 0.1113j, 0.0719 − 0.0296j], is shown in Figure 9.5. We note that
the AF(3) weights vary in random fashion about their theoretically expected, and
constant, WF(3) weight values. In this scenario, the AF produces Wiener
Figure 9.4 Error behavior for NLMS AF(3) (μ = 1) and WF(3) [scenario in Eqn. (9.4)]:
p_d = 0.4e^{jπ/3}, p_r = 0.4e^{jπ/5}, SNR_d = SNR_r = 60 dB.
Figure 9.5 Real and imaginary part of weights for NLMS (μ = 1) and WF [scenario in Eqn.
(9.4)]: p_d = 0.4e^{jπ/3}, p_r = 0.4e^{jπ/5}, SNR_d = SNR_r = 60 dB.
behavior, that is, weight and MSE behavior one would reasonably expect from the
corresponding WF.
9.2.2
Relative to the above scenario, we change the parameters in two significant ways:
the desired and reference signals are made narrowband, and their center frequencies
are moved closer together. To this end, the signal generator parameters are modified as
follows:

p_d = 0.99 e^{jπ/3},
p_r = 0.99 e^{j(π/3 − 0.052π)},
SNR_d = 20 dB,
SNR_r = 20 dB,    (9.5)

and the corresponding experiment is executed. The number of delays in the AF
tapped delay line is kept at three, that is, M = 3. The resulting error behavior is
represented in Figure 9.6. Note that not only is the AF(3) error generally less than the
WF(3) error, it also falls well below what is theoretically expected for the
corresponding WF(3). This performance aspect, of the AF performing better than
the corresponding WF, is surprising. The explanation of this behavior will be given
in detail in the later sections of this chapter.
The AF(3) weight behavior for the narrowband ANC scenario, together with the
constant WF(3) weight vector solution of [0.6587 − 0.0447j, 0.1277 − 0.0482j,
0.5399 − 0.3701j], is shown in Figure 9.7. We note here that the AF(3) weights are
varying in a somewhat random yet decidedly semiperiodic fashion, and that this
variation is at most only vaguely centered on the constant weights of the
corresponding WF(3). Since the AF error is less than that for the corresponding WF,
and because of the time-varying weight behavior, the AF behavior for this scenario
is termed non-Wiener. Such non-Wiener behaviors had originally been observed
when closely spaced sinusoids were used as inputs to the ANC [19].
The non-Wiener effects were observed in the narrowband ANC scenario and the
effects investigated in terms of pole radii (bandwidth), pole angle difference
(spectral overlap), and signal-to-noise ratio (SNR) [31]. A prediction for the
performance in the narrowband ANC scenario was derived on the basis of a
Figure 9.6 Error behavior for NLMS (μ = 1) and WF [scenario in Eqn. (9.5)]:
p_d = 0.99e^{jπ/3}, p_r = 0.99e^{j(π/3−0.052π)}, SNR_d = SNR_r = 20 dB.
9.2.3
Nonlinear behavior was also observed in the AEQ scenario [33] depicted in Figure
9.8, where x(n) is a wideband quadrature phase shift keyed (QPSK) signal, i(n) is a
narrowband interference, and v_r(n) is an additive, zero-mean, white noise.
The delay D in the signal path ensures that the filter output is compared to the
signal value at the center tap [D = (M − 1)/2]. After training of the AEQ, the error
signal can be derived by comparison with the output from the decision device.
However, as our purpose here is to demonstrate the occurrence of nonlinear effects,
we will compare the estimated signal constellations when using the WF and when
using the AF in training mode in the presence of strong narrowband AR(1)
Figure 9.7 Real and imaginary part of weights for NLMS (μ = 1) and WF [scenario in Eqn.
(9.5)]: p_d = 0.99e^{jπ/3}, p_r = 0.99e^{j(π/3−0.052π)}, SNR_d = SNR_r = 20 dB.
interference. For strict comparison, the WF and AF are operating on the same
realization. The AR(1) pole is located at 0.9999 exp(jπ/3). Adaptive filter
performance is again computed using Eqns. (9.6) and (9.7), and WF performance is
computed using Eqns. (9.26) through (9.31). The respective results are shown in Figure 9.9,
for an SNR of 25 dB, a signal-to-interference ratio (SIR) of 20 dB, a filter length M
of 51 (D = 25), and NLMS step-sizes μ = 0.1, 0.8, and 1.2. Step-size is an
important factor in optimizing performance; a step-size of 0.8 is close to optimal for
this scenario [33], while a very small step-size elicits WF-like results. For signal
power at 0 dB, the NLMS (μ = 0.1, 0.8, 1.2) AF(51) produced MSE of −12.83,
−16.01, and −15.11 dB, respectively, while WF(51) produced MSE of −11.09 dB.
The WF(51) MMSE is −11.34 dB for this case. We see that the AF(51) equalized
symbols for the larger step-sizes are more tightly clustered around the true symbol
values (at the cross-hairs) than the WF(51) equalized symbols. Correspondingly, as
borne out by the MSE, the AF errors are more tightly clustered around zero,
thereby demonstrating the nonlinear effect in this AEQ scenario.
Nonlinear effects in LMS adaptive equalizers were investigated for the LMS as
well as NLMS algorithms for a variety of SIR and SNR values [33]. The latter
investigation included deriving the corresponding WF, expressions for the optimal
step-size parameter, and results for sinusoidal and AR(1) interference processes,
with and without the use of decision feedback.
9.2.4
Nonlinear behavior was observed as well in the ALP scenario depicted in Figure
9.10, where, in particular, the process to be predicted was a chirped signal [21].
To demonstrate the nonlinear effect in the ALP scenario, we will use an AR(1)
process in observation noise, as provided by {d(n)} in Figure 9.3. The AR(1) pole
p_d = 0.95 exp(jπ/3), SNR_d = 20 dB, and the prediction lag D = 10. The number of
delays in the AF tapped delay line is again three, that is, M = 3. In Figure 9.11
the NLMS AF(3) and WF(3) performance, operating on the same 10 process
Figure 9.9 Equalization performance for WF(51) (a) and NLMS AF(51) (μ = 0.1, 0.8, 1.2
in (b), (c), and (d), respectively): p_i = 0.9999e^{jπ/3}, SNR = 25 dB, SIR = 20 dB.
Figure 9.10 ALP scenario.
realizations, is compared to the theoretical MMSE for the corresponding WF. The
experiment is repeated at different step-sizes. We observe that in the given scenario
the NLMS AF(3) outperforms the corresponding WF(3) for a wide range of step-sizes,
thereby illustrating the existence of non-Wiener effects in the ALP scenario.
This limited set of experiments suggests that the nonlinear effects are more
pronounced for larger step-sizes.
Results for chirped signals, elaborating on the effects of chirp rate and bandwidth,
have been reported [21]. The latter also provides an estimate of the performance that
may be expected using the transfer function approach in the un-chirped signal
domain.
9.3
We have seen that NLMS performance can be better than the performance of the
corresponding finite-length WF due to the nonlinear or non-Wiener effect illustrated
in Section 9.2. The intriguing question we now address is how this performance
improvement is achieved. To begin, we establish notation and briefly review some
of the well-known interpretations of the NLMS algorithm that will be used here. An
indicator for NLMS performance can be found in the transfer function approach to
modeling the behavior of NLMS, which is rooted in adaptive noise cancellation of
sinusoidal interference [19] as well as of colored processes [12]. The LTI transfer
function model for NLMS is derived so that we can later compare the performance
estimate it provides to the performance of the NLMS algorithm for several of the
scenarios described in Section 9.2.
9.3.1 Projection and Minimum-Norm Least-Squares Interpretations of NLMS
Using the setups in Figures 9.2, 9.8, and 9.10 for environments that are not known a
priori, or that vary over time, the WF is replaced with an AF that uses the same
inputs. In the ANC and ALP applications, the noisy version of the desired signal is
used as the desired signal, and the error between the desired signal and its estimate
(the AF output y(n)) is used for AF weight adaptation. In the AEQ scenario the
signal of interest (the QPSK signal) serves as the desired signal during the training
period.
The nonlinear effects in the AF environment have been observed [33] when using
the NLMS algorithm, which is summarized here as follows:

e(n) = d(n) − w^H(n)u(n),
w(n + 1) = w(n) + μ (e*(n) / (u^H(n)u(n))) u(n),    (9.6)

u(n) = [r(n), r(n − 1), …, r(n − M + 1)]^T.    (9.7)
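For concreteness, the recursion of Eqns. (9.6) and (9.7) can be coded in a few lines. The sketch below is a generic complex-valued NLMS implementation under our own naming; the small regularization constant eps is an implementation safeguard against division by zero and is not part of Eqn. (9.6):

```python
import numpy as np

def nlms(d, r, m_taps, mu, eps=1e-12):
    """Reference-only NLMS, Eqns. (9.6)-(9.7).

    d: desired samples; r: reference samples; m_taps: M; mu: step-size.
    Returns the error sequence e(n) and the final weight vector w.
    """
    w = np.zeros(m_taps, dtype=complex)
    u = np.zeros(m_taps, dtype=complex)       # u(n) = [r(n) ... r(n-M+1)]^T
    e = np.empty(len(d), dtype=complex)
    for n in range(len(d)):
        u = np.roll(u, 1)                     # shift the tapped delay line
        u[0] = r[n]
        e[n] = d[n] - np.vdot(w, u)           # np.vdot conjugates w: w^H u
        w = w + mu * np.conj(e[n]) * u / (np.vdot(u, u).real + eps)
    return e, w
```

With a white reference and a noiseless linear desired signal, the weights converge to the generating taps, which makes a convenient sanity check of the implementation.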
(9.8)

(9.9)

(9.10)

Figure 9.12
In this situation, we can write the following for the observed desired signal:

d(n) = w_TI^H u(n) + v(n),    (9.11)

with the corresponding filter estimate ŵ^H(n)u(n).
9.3.2
In Section 9.2 we showed that NLMS AF performance could be better than the
performance of the corresponding WF. This phenomenon was often, though by no
means always, associated with large step-sizes. We now show the derivation of this
LTI transfer function model for NLMS, which has the attractive feature that it is
equally valid for large and small step-sizes. The LTI transfer function model has
been a reasonably good indicator for NLMS performance in ANC, AEQ, and ALP
scenarios [21, 32, 33].
Starting from an initial weight vector w(0), repeated application of Eqn. (9.6)
leads to the following expression for the weight vector w(n):

w(n) = w(0) + μ Σ_{i=0}^{n−1} (e*(i) / (u^H(i)u(i))) u(i).    (9.12)
the first equality in Eqn. (9.6), w(n) is then a function of all the previously
encountered values of the desired signal. The transfer function model for NLMS is based on
Eqn. (9.12), so that we can reasonably expect this model to account, more or less,
for the fact that NLMS uses all previously encountered values of the desired and
reference processes.
With y(n), the output of the adaptive filter, given by

y(n) = w^H(n)u(n),    (9.13)

we find

y(n) = w^H(0)u(n) + μ Σ_{i=0}^{n−1} (e(i) / (u^H(i)u(i))) u^H(i)u(n).    (9.14)
The approximation facilitating the derivation of the LTI transfer function model for
NLMS is

u^H(i)u(n) = Σ_{j=0}^{M−1} r(n − j) r*(i − j) ≈ M r_r(n − i),    (9.15)

where r_r(m) denotes the ensemble-average correlation of the reference process
{r(n)} at lag m. The latter results from the ergodicity assumption, so that time
averages can be replaced by ensemble averages. For large M this approximation
appears to be more valid than for small M. Nevertheless, as will be shown in Section
9.8, the approximation in Eqn. (9.15) is useful for reasonably small M also, in the
sense that the resulting model for NLMS behavior produces a good indicator of
NLMS performance.
Using the first equality in Eqn. (9.6) in the left-hand side of Eqn. (9.14), and
substituting Eqn. (9.15) in the right-hand side of Eqn. (9.14) twice, yields

d(n) − e(n) = w^H(0)u(n) + μ Σ_{i=0}^{n−1} e(i) M r_r(n − i) / (M r_r(0)).    (9.16)

Noting that the denominator under the summation is constant, and defining
t(n) = d(n) − w^H(0)u(n) as the excitation, produces the following difference equation as
governing the NLMS error process:

e(n) + (μ / r_r(0)) Σ_{i=0}^{n−1} e(i) r_r(n − i) = d(n) − w^H(0)u(n)
= t(n).    (9.17)
Replacing the summation index by m = n − i gives

e(n) + (μ / r_r(0)) Σ_{m=1}^{n} e(n − m) r_r(m) = t(n),    (9.18)

which in the steady-state limit becomes

e(n) + (μ / r_r(0)) Σ_{m=1}^{∞} e(n − m) r_r(m) = t(n).    (9.19)
Equation (9.19) is recognized as a difference equation describing a causal, all-pole,
LTI system of infinite order, with numerator polynomial equal to 1. The
denominator of the corresponding system function is given by the following
polynomial:

D_NLMS(z) = 1 + (μ / r_r(0)) Σ_{m=1}^{∞} r_r(m) z^{−m}.    (9.20)
The difference equation in Eqn. (9.19) therefore represents the NLMS error process
as the result of the LTI system H_NLMS(z) driven by the process t(n). The NLMS system
function is given by

H_NLMS(z) = [1 + (μ / r_r(0)) Σ_{m=1}^{∞} r_r(m) z^{−m}]^{−1}
= [1 + μ Σ_{m=1}^{∞} (r_r(m) / r_r(0)) z^{−m}]^{−1}
= [1 + μ Σ_{m=1}^{∞} r̄_r(m) z^{−m}]^{−1},    (9.21)

where r̄_r(m) denotes the normalized reference ACF.
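The model of Eqn. (9.21) is easy to evaluate numerically by truncating the infinite sum once the ACF has died out. The following sketch (our own naming, not from the chapter) does exactly that:

```python
import numpy as np

def h_nlms(omega, mu, acf, r0):
    """Evaluate the LTI model H_NLMS(e^{j*omega}) of Eqn. (9.21).

    acf[m-1] holds r_r(m) for m = 1..len(acf); the infinite sum is
    truncated at that lag, which is adequate once the ACF has decayed.
    omega may be a scalar or an array of radian frequencies.
    """
    omega = np.atleast_1d(np.asarray(omega, dtype=float))
    m = np.arange(1, len(acf) + 1)
    # denominator 1 + (mu/r_r(0)) * sum_m r_r(m) e^{-j*omega*m}
    denom = 1.0 + (mu / r0) * np.sum(
        np.asarray(acf)[None, :] * np.exp(-1j * np.outer(omega, m)), axis=1)
    return 1.0 / denom
```

A quick sanity check: for a white reference process, r_r(m) = 0 for m ≥ 1, so the model reduces to H_NLMS = 1 at all frequencies, that is, the model predicts no coloration of the excitation; for a lowpass AR(1) reference the model attenuates the excitation at low frequencies.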
The corresponding model-based estimate of the steady-state NLMS error power is

J_NLMS = (1 / 2π) ∫_{−π}^{π} |H_NLMS(e^{jω})|² S_t(e^{jω}) dω,    (9.22)
where S_t(e^{jω}) is the spectral density of the process {t(n)} defined in Eqn. (9.17).
Alternatively,

J_NLMS = (1 / 2πj) ∮_{|z|=1} H_NLMS(z) H*_NLMS(1/z*) S_t(z) z^{−1} dz,    (9.23)

which can be interpreted in terms of auto- and cross-correlations when the integrand
is rational [1, 14].
9.4
In Section 9.3, Eqn. (9.12), we saw that the NLMS weight vector w(n) was an
implicit function of past values of d(n), r(n), and w(n). Note that the NLMS AF does
not have direct access to all these causally past values; for example, the value of
d(n − 1) is embedded in w(n) and is no longer directly available at n and beyond.
Consequently, the NLMS AF is constrained in its use of the causally past values of
d(n) and r(n), and as a result, its performance is limited relative to that of a filter
having full access to the past. In this section, we look at what is possible in terms of
estimation performance when an LTI estimator has full access to the causal past of
the desired process and the reference process, as well as to the present of the
reference process, thereby reaching the bounds defined earlier [31]. The latter will
provide absolute bounds on the performance of AFs when used in a WSS
environment. We will show in Section 9.5 how the NLMS AF is able to access
information from the past values of d(n) and achieve performance which exceeds
that of the reference-only WF while always being bounded by the performance of
the WF that uses all past values of d(n) and r(n).
The goal of the estimation scenarios of interest is to provide the best estimate
based on given information or measurements. In each case, the filter output can be
seen as having been produced by the input process {d(n), u(n)}, that is, a joint or
multichannel process. In the WF case, statistical information about the joint process
is used for the design of the filter, which then operates on the samples of (or perhaps
only a subset of) the joint process. In the AF case, the filter operates on the samples
of the joint process (although differently on its different subsets) while
simultaneously redesigning itself on the basis of those same samples. This
multichannel view may not directly represent the most commonly encountered
implementation of AFs, but it will afford us insights into the limits of performance
encountered in the AF scenarios above in their usual single-channel implementation.
Also, using different multichannel views for the ANC, ALP, and AEQ scenarios,
we will be able to advance explanations for the observed nonlinear or non-Wiener
effects. In addition, we will show that, in some scenarios, multichannel
AF implementations may provide performance gain that cannot otherwise be
obtained.
9.4.1

(9.25)
where the last equality results from the measurement noise process {v_d(n)} being
white, zero-mean, and independent of the noise processes {v_0(n)} and {v_r(n)}.
In a Gaussian scenario, the driving and measurement noises are all Gaussian in
addition to having the above properties. The optimal filter for estimating d(n) is then
in fact an LTI filter [37]. The latter, being truly optimal in the MMSE sense, that is, the
best of all possible operations on the joint process samples, whether that operation
is linear or nonlinear, produces an absolute bound on the performance of the AF.
This performance bound was recognized by Quirk et al. [31], who also showed that it
may be approached by the performance of NLMS in specific ANC scenarios.
The optimal filter, and the corresponding absolute bound, can be derived using
spectral factorization [31]. In practice, we are often interested in designing the best
causal linear filter operating on a finite horizon, as expressed by L and M, the number
of tapped delay line stages for the desired and reference processes, respectively. It
can be shown [22, 36] that the performance of the finite horizon causal linear filter,
as its order increases, converges to the performance of the infinite horizon causal
linear filter, which in a Gaussian scenario is the best performance possible. We will
therefore concentrate next on the design and performance of the optimal
(multichannel) finite horizon causal linear filter, which provides the opportunity
to make a practical trade-off between performance and computational effort.
9.4.2

w_WF(L,M) = R^{−1} p.    (9.26)

In general, the appropriate partitions in the following definitions are used in order to
yield the single-channel reference-only (L = 0, M > 0), single-channel desired-only
(L > 0, M = 0), or multichannel desired + reference (L > 0, M > 0) WFs:

R = E{u(n)u^H(n)},  p = E{u(n)d*(n)},    (9.27)

u(n) = [d(n − 1)^T  r(n)^T]^T,    (9.28)

d(n − 1) = [d(n − 1), d(n − 2), …, d(n − L)]^T,
r(n) = [r(n), r(n − 1), …, r(n − M + 1)]^T.    (9.29)
The output of the corresponding (multichannel) WFs can then be written as follows:

d̂(n) = w_WF(L,M)^H u(n),    (9.30)

where the values of L and M indicate which of the partition(s) of u(n) in Eqn. (9.28)
is active. The performance of these finite horizon WFs, expressed in terms of MMSE
in estimating d(n), is evaluated from

MMSE_WF(L,M) = r_d(0) − w_WF(L,M)^H p,    (9.31)
9.4.3
Figure 9.15 ACF magnitude (a) and CCF magnitude (b) versus lag for the wideband ANC
scenario: p_d = 0.4e^{jπ/3}, p_r = 0.4e^{jπ/5}, SNR_d = SNR_r = 60 dB.
present and past values. Also note that the effect of the added measurement noise is
very small and is observed at lag 0 only. The reference signal ACF magnitude (not
its phase) is again the same as for the desired signal. For this narrowband process, it
takes many lags for the correlation between process samples to vanish. In fact, for
both the ACF and the CCF, the magnitude is exponential, with the factor equal to the
magnitude of the poles used in the generating processes. We see that the correlation
between adjacent desired signal values is approximately 0.99, that is, very high. The
magnitude of the CCF between the reference and desired processes is only
approximately 0.065 at zero lag, indicating that there is not much (statistical)
information in r(n) about d(n). Consequently, the WF(M, 0) performance seen in
Figure 9.16 is rapidly better than that of WF(0, M), that is, a reversal relative to the
wideband ANC scenario.
We note that MMSE WF(0, 3) is about 17 dB, as reflected in Figure 9.6. For this
scenario, the reference-only WF will need an extremely high order before its
performance will approach that of the desired + reference WF. In the present
scenario, a big improvement in performance results from incorporating the first
desired signal sample; this is the reverse of what we saw in the wideband ANC
scenario. The performance advantage of WF(M, M) over WF(0, M) is immediate
Figure 9.17 ACF magnitude (a) and CCF magnitude (b) for the narrowband ANC scenario:
p_d = 0.99e^{jπ/3}, p_r = 0.99e^{j(π/3−0.052π)}, SNR_d = SNR_r = 20 dB.
and not easily mitigated by an increase in the filter order. The latter performance
behavior provides good arguments for the use of filters that include the desired
channel. Recall that in Figure 9.6 we saw an AF(0, 3) performance improvement of
about 6 dB, performance still well above the bound of less than 1 dB indicated in
Figure 9.16 for M = 3. This indicates that the AF is accessing at least some, but not
all, of the information present in the desired signal channel. The mechanism by
which this is accomplished will be described in Section 9.5.
The performance perspective in Figure 9.16 shows that, for scenarios such as
this one, where there is strong temporal correlation in {d(n)}, it would be more
beneficial to use past desired process values than past (and present) reference values.
Furthermore, the results again indicate that the AF is accessing some of the
information available to the two-channel WF. However, this does not yet explain the
mechanism responsible for the nonlinear effects that have been observed when using
a reference-only adaptive filter.
9.4.3.3 Adaptive Linear Predictor The WF performance behavior for the ALP
scenario specified in Section 9.2.4 is shown in Figure 9.18, while Figure 9.19 shows
Figure 9.19 ACF and CCF magnitudes for the ALP scenario: p_d = 0.95e^{jπ/3},
SNR_d = 20 dB, D = 10.
the magnitude of the ACF for the desired process and the magnitude of the CCF
between the reference and desired processes for the ALP scenario specified.
Recall that, in the ALP scenario, the reference process is a delayed version of the
desired process, so that r(n) in Eqn. (9.29) is actually equal to d(n − D), as defined in
Eqn. (9.29). The ACFs of the desired and reference processes are therefore exactly
the same. The ACF and CCF die out exponentially according to the pole magnitude
of 0.95, and the peak in the CCF magnitude at lag m = 10 reflects the pure delay of
10 samples in the reference signal relative to the desired signal. There is a strong
correlation between d(n) and d(n − 1), so that adding a past desired signal sample
provides immediate information about d(n). This is reflected in the sharp increase in
WF(M, 0) performance at the addition of the very first desired channel tap, as seen in
Figure 9.18. Since {d(n)} is essentially an AR(1) process [24], adding past values
beyond d(n − 1) provides almost no additional information. This explains the
apparent performance saturation at 0.75 dB; that is, the performance bound is
reached almost immediately.
The MMSE WF(0, 3) performance is about 8.3 dB, corresponding to the level
indicated in Figure 9.11. Note that the performance of the reference-only WF has
saturated at 8.3 dB and will not approach the performance of the desired-only or
desired + reference WF at any filter order. The latter is explained by the fact that
increasing the order, or number of taps, in WF(0, M) only adds further delayed
samples to the reference vector in Eqn. (9.29). Figure 9.19b shows that, as far as the
reference channel is concerned, r(n), which equals d(n − D), contains most
information about d(n) but is correlated with it only at the 0.6 level (see lag m = 0),
hence the saturation at about 8.3 dB rather than at 0.75 dB.
With respect to Figure 9.19b, note that as we change the prediction lag D, the
CCF magnitude peaks at the corresponding lag location. Simultaneously, the CCF
between r(n) = d(n − D) and d(n), at lag 0, decreases (increases) as the
prediction lag D increases (decreases). A decrease in the latter CCF implies that
there is less information about d(n) in r(n) = d(n − D), and therefore the
performance of the reference-only WF decreases. Figure 9.20 shows the WF performance as
it changes with prediction lag D. As prediction lag D increases, the performance gap
between the reference-only WF and the desired + reference WF increases. The latter
means that, potentially, there is more to gain from the nonlinear effects in the AF as
the prediction lag increases. The NLMS ALP performance for the ALP scenario of
Section 9.2.4 is evaluated for M = 3, for five different realizations, and for
prediction lags of 10 and 30, and is shown in Figure 9.21. Figure 9.21a shows that
the NLMS reference-only AF gains a maximum of about 2 to 2.5 dB over the
corresponding WF for prediction lag D = 10. For prediction lag D = 30, the NLMS
AF gains a maximum of about 3 to 4 dB. As hypothesized earlier, we gain relatively
more in performance as the prediction lag increases.
In both of the latter cases, and in the results in Figure 9.11, we saw AF
performance down to about the 5.5 dB level, well above the absolute bound of about
0.75 dB indicated in Figures 9.18 and 9.20. Again, this performance behavior
indicates that some, but not all, of the information from the desired channel is being
accessed in corresponding AF applications.
Figure 9.22 Two-channel AEQ configuration for the Section 9.2.3 scenario: p_i = 0.9999e^{jπ/3},
SNR = 25 dB, SIR = 20 dB.
Figure 9.23 shows the WF performance behavior for the AEQ scenarios, using
the interference process as the second channel. The WF(0, 51) MMSE of −11.34 dB
corresponds to the WF constellation in Figure 9.9a. The AF(0, 51) MSE that was
realized for μ = 0.8 was −16 dB, corresponding to the AF constellation in Figure
9.9c. Note in Figure 9.23 that the latter is in the direction of the two-channel WF
performance of −25 dB. The latter performance bound is associated with using the
interference signal itself in the second channel, since the AF task in the AEQ
scenario is to reject the interference in the process of producing an equalized
estimate of the desired QPSK signal. We see in Figure 9.23 that WF(M, 0) does not
produce any improvement in performance at any order. This situation corresponds to
using the interference signal, by itself, to estimate the desired QPSK signal.
Considering the independence of the interference and the QPSK signal, no improvement in
performance can be expected.
Each of the above examples illustrates that an appropriate two-channel WF
always performs at least as well as, and often better than, the reference-only WF. The
absolute two-channel WF bound, MMSE WF(∞, ∞), indicates the best possible
performance that can be achieved by a two-channel NLMS AF implementation in
WSS scenarios. For analysis of the different contributions to NLMS error, the LTI
transfer function model for a two-channel NLMS filter is developed next.
9.4.4
With the AF input vector defined as in Eqn. (9.28), the derivation of the two-channel
LTI transfer function proceeds exactly as in Eqns. (9.12) through (9.14). Equation
(9.15) is replaced by

u^H(i)u(n) = Σ_{j=0}^{M−1} r(n − j) r*(i − j) + Σ_{j=0}^{L−1} d(n − 1 − j) d*(i − 1 − j)
≈ M r_r(n − i) + L r_d(n − i),    (9.32)
where r_r(m) and r_d(m) denote the correlations, at lag m, of the reference and desired
(or auxiliary) processes, respectively (the relationship is valid, in general, for any
two channels that make up the AF input vector). Using Eqn. (9.32) in Eqns. (9.16)
through (9.20) results in the LTI transfer function model for two-channel NLMS:

H_NLMS(z) = [1 + (μ / (M r_r(0) + L r_d(0))) Σ_{m=1}^{∞} (M r_r(m) + L r_d(m)) z^{−m}]^{−1}
= [1 + (μ / (M r_r(0) + L r_d(0))) (M Σ_{m=1}^{∞} r_r(m) z^{−m} + L Σ_{m=1}^{∞} r_d(m) z^{−m})]^{−1}.    (9.33)

The choice L = 0 in Eqn. (9.33) directly yields the single-channel NLMS model
in Eqn. (9.21). To evaluate the NLMS transfer function model requires the
evaluation of the ACF of the reference process {r(n)} and the auxiliary (desired or
interference in our earlier scenarios) process, and their respective strictly causal
z-transforms.
9.4.5
As our earlier scenarios have been based on AR(1) processes, and as the ACFs for
the desired process and the delayed desired process are equal, we will explicitly
derive the NLMS transfer function model for the ANC and ALP scenarios.
The AR(1) reference process ACF, for the process defined in the signal generator
portion of Section 9.2.1, is given by [24]

r_r(m) = (1 / (1 − |p_r|²)) p_r^{|m|} + c_r² δ(m),    (9.34)

where the scaling constant c_r is determined by the SNR (in dB) as follows:

c_r² = r̃_r(0) 10^{−SNR_r/10} = 10^{−SNR_r/10} / (1 − |p_r|²).    (9.35)
From Eqn. (9.34), and the analogy in generating {d(n)} and {r(n)}, it now follows that

r_r(m) = (1 / (1 − |p_r|²)) p_r^{|m|} [1 − δ(m)] + ((1 + 10^{−SNR_r/10}) / (1 − |p_r|²)) δ(m),

r_d(m) = (1 / (1 − |p_d|²)) p_d^{|m|} [1 − δ(m)] + ((1 + 10^{−SNR_d/10}) / (1 − |p_d|²)) δ(m),    (9.36)

where the first term in brackets equals 1 for all values of m except for m = 0, where
the term is zero. For the summation terms in the NLMS transfer function we only
need the correlation function values for strictly positive lag m, resulting in
Σ_{m=1}^{∞} r_r(m) z^{−m} = (1 / (1 − |p_r|²)) p_r z^{−1} / (1 − p_r z^{−1}),  |z| > |p_r|,

Σ_{m=1}^{∞} r_d(m) z^{−m} = (1 / (1 − |p_d|²)) p_d z^{−1} / (1 − p_d z^{−1}),  |z| > |p_d|.    (9.37)
With r_r(0) and r_d(0) given by the constants in the second right-hand side terms in
Eqn. (9.36), also substituting Eqn. (9.37) in Eqn. (9.33), and some careful algebra,
we find the following explicit expression for the NLMS transfer function applicable
to the ANC and ALP scenarios:

H_NLMS(z) = (1 − p_r z^{−1})(1 − p_d z^{−1}) /
[1 + (μγ_rd + μγ_dr − p_r − p_d) z^{−1} + (p_r p_d − μγ_rd p_d − μγ_dr p_r) z^{−2}],

γ_rd = M p_r (1 − |p_d|²) β^{−1},
γ_dr = L p_d (1 − |p_r|²) β^{−1},
β = M(1 + 10^{−SNR_r/10})(1 − |p_d|²) + L(1 + 10^{−SNR_d/10})(1 − |p_r|²).    (9.38)
Associated with the NLMS transfer function is the choice of the driving term in
Eqn. (9.17). A common choice for the starting weight vector is the all-zero vector.
We see from the right-hand side of Eqn. (9.17) that this corresponds to driving the
NLMS difference equation with the desired signal, d(n). Alternatively, we could
argue that our interest is in the steady-state performance of NLMS, and that we
should therefore start NLMS in the steady state. In this second case, a reasonable
weight vector to start from is the optimal Wiener weight. Substituting the latter in
the NLMS driving term, defined in Eqn. (9.17), yields the Wiener error as the driving
term for the NLMS difference equation. In Section 9.8, we will refer to the latter
choice as yielding the MSE performance estimate from the transfer function model
for NLMS.
Note from Eqn. (9.33) that when μ → 0, the NLMS transfer function
H_NLMS(z) → 1. If the driving term is the Wiener error, then the NLMS error will still
be the Wiener error; that is, we get the expected behavior for small step-size.
Having derived an explicit expression for modeling NLMS performance in the
ANC and ALP scenarios, we will now outline the procedure by which the
corresponding MSE estimate is evaluated. Working backward from Eqn. (9.23), the
Wiener error is the driving process for the LTI NLMS filter that then generates the
modeled steady-state NLMS error. This process is illustrated in Figure 9.24.

Figure 9.24
Figure 9.25
Note that the only dependence on NLMS step-size resides in the LTI transfer
function. The NLMS error process consists of the additive contributions due to the
independent processes {v_0(n)}, {v_d(n)}, and {v_r(n)}, corresponding to the input
process and the measurement noise processes on the desired and reference
processes, respectively. In order to calculate more readily the individual
contributions to the modeled NLMS error, we note that all systems in Figure 9.24 are
LTI, so that the equivalent diagram, Figure 9.25, applies.
The individual contributions to the modeled NLMS MSE can therefore be
evaluated from the ACF and CCF of the corresponding contributions to the
processes {e_1(n)} and {e_2(n)} as follows:

r_e(0) = r_{e1}(0) + r_{e2}(0) + r_{e1e2}(0) + r*_{e2e1}(0).    (9.39)

Equation (9.39) is needed only in the evaluation of the contribution due to the input
process {v_0(n)}. The contribution due to the measurement noise on the desired
process involves only the corresponding component of {e_1(n)}, and the contribution
due to the measurement noise on the reference process involves only the
corresponding component of {e_2(n)}. Note that all systems in Figure 9.25 are
autoregressive moving-average (ARMA) systems, and that each of the driving noises is
zero-mean and white. Consequently, the individual contributions to the modeled
NLMS MSE can be evaluated using the Sylvester matrix based approach [1, 14].
For this MSE estimate from the transfer function model for NLMS to apply to
reference-only NLMS, the WF partition corresponding to the desired channel is set
to zero and the WF partition corresponding to the reference channel is replaced by
the reference-only WF. The resulting input to the LTI NLMS filter, which is
obtained by setting L = 0 in Eqn. (9.33), is now the reference-only WF error.
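Operationally, the pipeline of Figure 9.24 amounts to filtering the excitation through the truncated all-pole model of Eqn. (9.19) and measuring the output power. The following is a minimal sketch (names are our own, and the truncation lag k is an implementation choice, not part of the model):

```python
import numpy as np

def modeled_nlms_mse(t, mu, acf, k=200):
    """Drive the LTI NLMS model (Figure 9.24) with the excitation t(n).

    t:   the driving sequence (e.g., the WF error);
    acf: acf[m] holds r_r(m) for m = 0, 1, ...;
    k:   truncation lag for the all-pole model of Eqn. (9.19).
    Returns the modeled steady-state MSE as the average output power.
    """
    a = mu * np.asarray(acf)[1:k + 1] / acf[0]     # normalized ACF taps
    e = np.zeros(len(t), dtype=complex)
    for n in range(len(t)):
        # e(n) = t(n) - (mu/r_r(0)) * sum_m r_r(m) e(n-m), per Eqn. (9.19)
        past = sum(a[m - 1] * e[n - m] for m in range(1, min(n, len(a)) + 1))
        e[n] = t[n] - past
    return float(np.mean(np.abs(e) ** 2))
```

For a white reference process the feedback taps vanish and the routine simply returns the power of the excitation, consistent with H_NLMS = 1 in that case.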
9.5
In Section 9.2.2, Figure 9.7, we noted that time-varying behavior of the NLMS
weights occurred when demonstrating nonlinear effects in the ANC scenario. In this
9.5.1
Assuming that we have solved for the optimal TI two-channel WF, using reference
and desired inputs, the estimate that such a filter produces is given by

d̂(n) = w_WF(L,M)^H u(n) = w_WF(L,M)^H [d(n − 1)^T  r(n)^T]^T.    (9.40)
For the sake of illustration, let's assume that L = M − 1, so that the number of taps in the desired signal channel is one less than the number of taps in the reference channel. This is not an actual restriction, and we show in Section 9.5.3 how to remove it.
Next, we define the rotation, or linking, sequence {ρ(n)}, which expresses the connection between the samples of the desired (or auxiliary) process and the samples of the reference process:

    d(n) = ρ(n) r(n).        (9.41)

Applying Eqn. (9.41) to each element of the desired channel vector gives

    d(n−1) = [0  D_ρ(n−1)] r(n),   D_ρ(n−1) = diag(ρ(n−1), ..., ρ(n−L)).        (9.42)
    d̂(n) = [w_d,WF(L,M)^H  w_r,WF(L,M)^H] [d(n−1); r(n)]
         = w_d,WF(L,M)^H [0  D_ρ(n−1)] r(n) + w_r,WF(L,M)^H r(n)
         = (w_d,WF(L,M)^H [0  D_ρ(n−1)] + w_r,WF(L,M)^H) r(n)        (9.43)
         = w_TVWF(0,M)^H(n) r(n).

In the final step shown in Eqn. (9.43), we have thus defined w_TVWF(0,M)(n), a time-varying (reference-only) WF that is equivalent to the optimal TI two-channel WF w_WF(L,M). The latter is TI, but uses both the desired and the reference input channels. The newly defined equivalent reference-only WF is TV, because of the term involving D_ρ(n−1). Note that both filters, in the first and last lines of Eqn. (9.43), produce exactly the same estimate.
Note from Eqn. (9.43) that D_ρ(n−1) represents the only time-varying aspect of the equivalent filter. Now, if the reference-only AF manages to effectively track this TV equivalent to the optimal desired + reference TI WF, then the AF may indeed capture some of the performance advantage of the two-channel TI WF over the corresponding reference-only WF.
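The substitution that produces Eqn. (9.43) can be checked numerically. For noise-free complex exponentials, replacing d(n−1) by (d(n−1)/r(n)) r(n) converts an arbitrary two-channel filter into a reference-only TV filter with the identical estimate; all amplitudes, phases, frequencies, and weights below are hypothetical:

```python
import numpy as np

# hypothetical noise-free narrowband pair: complex exponentials
wd, wr = 2*np.pi*0.20, 2*np.pi*0.15
d = lambda k: 2.0*np.exp(1j*(0.3 + wd*k))   # desired process
r = lambda k: 3.0*np.exp(1j*(0.9 + wr*k))   # reference process

# arbitrary two-channel weights: L = 1 desired tap, M = 2 reference taps
w_d = np.array([0.4 - 0.2j])                # acts on d(n-1)
w_r = np.array([1.0 + 0.5j, -0.3 + 0.1j])   # acts on r(n), r(n-1)

k = 17
est_2ch = np.conj(np.concatenate([w_d, w_r])) @ np.array([d(k-1), r(k), r(k-1)])

# substitute d(n-1) = rho(n) r(n) with rho(n) = d(n-1)/r(n): reference-only TV filter
rho = d(k-1)/r(k)
w_tv = np.concatenate([w_d*np.conj(rho), np.zeros(1)]) + w_r
est_tv = np.conj(w_tv) @ np.array([r(k), r(k-1)])
```

Both estimates coincide at every time index, illustrating that the TV reference-only filter lies on the manifold of filters producing the two-channel WF estimate.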
9.5.2 Alternative Single-Channel TV Equivalent to the Two-Channel LTI WF
The above TV WF equivalent is not unique unless L = M = 1. In this section, we show that under the same assumption as in Section 9.5.1, that is, L = M − 1, a different choice for linking the elements of the desired (auxiliary) channel vector and the elements of the reference channel vector leads to an alternative TV WF equivalent. The original and alternative produce the same estimates but exhibit distinctly different weight vector behavior. In Section 9.6 we will show how NLMS resolves the ambiguity of picking a weight vector from a manifold of equivalent solutions.
The alternative rotation sequence {κ(n)} is defined as follows:

    d(n−1) = κ(n) r(n).        (9.44)

Applying Eqn. (9.44) to each element of the desired channel vector gives

    d(n−1) = [D_κ(n)  0] r(n),   D_κ(n) = diag(κ(n), ..., κ(n−L+1)).        (9.45)
    d̂(n) = [w_d,WF(L,M)^H  w_r,WF(L,M)^H] [d(n−1); r(n)]
         = (w_d,WF(L,M)^H [D_κ(n)  0] + w_r,WF(L,M)^H) r(n)        (9.46)
         = w̃_TVWF(0,M)^H(n) r(n).
Consequently, the last equality in Eqn. (9.46) defines w̃_TVWF(0,M)(n), an alternative TV equivalent filter that also produces the optimal TI two-channel WF estimate in terms of only r(n), the reference-input partition of u(n).
The optimal two-channel WF is LTI and uses both the desired and reference channels as input, while the equivalent (and equally optimal) filter is TV and uses only the reference channel as input. Again, D_κ(n) in Eqn. (9.46) represents the only TV aspect of this TV WF equivalent. Note that the first weight vector element of w_TVWF(0,M)(n) in Eqn. (9.43) is always constant, while for w̃_TVWF(0,M)(n) in Eqn. (9.46) it is the last weight vector element that is always constant. However, while exhibiting different weight vector behavior, these reference-only TV WF alternatives are equivalent in that each of these weight vectors lies in the manifold of filters that produce the same WF estimate.
9.5.3 Nonuniqueness of TV WF Equivalents
We have shown two equivalent TV WFs in Sections 9.5.1 and 9.5.2. These alternatives were associated with different ways to link a particular element of the desired channel vector d(n−1) with a particular element of the reference channel vector r(n). After the initial choice linking, for example, d(n−1) with r(n−1) by using ρ(n−1), we used delayed versions of the rotation sequence to link the correspondingly delayed elements of the desired and reference channel vectors.
For the purpose of finding TV equivalent filters, we can generally define rotation sequences that link a particular element of the desired (auxiliary) channel vector with any particular element of the reference channel vector, and then use that same linking sequence to substitute for the other elements in the desired (auxiliary) channel vector with the corresponding element of the reference channel vector. Such
a linkage gives, for the element d(n−L), the following possibilities:

    d(n−L) = ρ^(L)(n) r(n)
    d(n−L) = ρ^(L−1)(n−1) r(n−1)
      ⋮
    d(n−L) = ρ^(L−M+1)(n−M+1) r(n−M+1).        (9.47)
For each of the L taps in the desired channel, M rotation sequences were defined, one for each tap in the reference channel, thereby removing the earlier restriction of assuming that L = M − 1. These linking sequences are uniquely defined by their superscript, which reflects the shift between the reference sequence and the desired sequence that define it. The linking sequences defined in Eqns. (9.41) and (9.44) are now recognized as ρ^(0)(n) and ρ^(1)(n), respectively.
Let's first take a closer look at how these linking sequences may be useful in narrowband scenarios, with processes governed by Eqn. (9.2). The linking sequence ρ^(0)(n) indicates how to operate on r(n) to get d(n), while ρ^(1)(n+1) indicates how to operate on r(n+1) to get d(n). In Figure 9.26 some of the linking sequences are illustrated.
Let's assume that, at time n, we have the following linking relationship between the desired signal d(n) and the reference signal r(n):

    d(n) = ρ^(0)(n) r(n).        (9.48)
Based on the propagation dictated by the AR(1) processes as in Eqn. (9.2), we have at time n+1 the following relationships for the desired signal d(n+1) and the reference signal r(n+1):

    d(n+1) = d̃(n+1) + v_d(n+1)
           = p_d d̃(n) + v_0(n+1) + v_d(n+1)
           ≈ p_d d(n)
    r(n+1) = r̃(n+1) + v_r(n+1)        (9.49)
           = p_r r̃(n) + v_0(n+1) + v_r(n+1)
           ≈ p_r r(n),

where in the next-to-last step of each, we have used the fact that the driving noise is small relative to the AR(1) process itself when its pole radius is close to 1 and the fact that the measurement noise is small. For the narrowband ANC scenario in Section 9.2, these assumptions are reasonable, other than close to zero-crossings, where they are no longer valid. For purely exponential processes the relationships in Eqn. (9.49) are in fact exact.

Figure 9.26
Consequently, the following approximate relationship between ρ^(0)(n+1) and ρ^(0)(n) is valid most of the time in the narrowband ANC case:

    ρ^(0)(n+1) = d(n+1)/r(n+1)
               ≈ (p_d d(n))/(p_r r(n))
               = p_d p_r^{−1} ρ^(0)(n)        (9.50)
               = (|p_d|/|p_r|) e^{j(ω_d−ω_r)} ρ^(0)(n).
Similarly, for ρ^(1)(n+1) we have

    ρ^(1)(n+1) = d(n)/r(n+1)
               ≈ (p_d d(n−1))/(p_r r(n))
               = p_d p_r^{−1} ρ^(1)(n)        (9.51)
               = (|p_d|/|p_r|) e^{j(ω_d−ω_r)} ρ^(1)(n).
Note that, under the narrowband assumptions above, all the linking sequences for
different shifts behave the same way. Under these circumstances, the behavior of the
TV aspects in Eqns. (9.43) and (9.46) is actually the same (though operating on a
different dimension of the weight vector).
As illustrated in Figure 9.26, there is also an approximate relationship between the different linking sequences:

    ρ^(1)(n+1) = d(n)/r(n+1)
               ≈ d(n)/(p_r r(n))        (9.52)
               = p_r^{−1} ρ^(0)(n).
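For purely exponential (noise-free) processes the relationships in Eqns. (9.50)-(9.52) are exact, which is easy to confirm numerically; pole angles and initial conditions below are hypothetical:

```python
import numpy as np

# unit-radius poles: pure exponential (noise-free) processes, values hypothetical
pd_, pr_ = np.exp(1j*2*np.pi*0.20), np.exp(1j*2*np.pi*0.15)
d = lambda k: 2.0*np.exp(1j*0.3)*pd_**k     # desired exponential
r = lambda k: 3.0*np.exp(1j*0.9)*pr_**k     # reference exponential

rho0 = lambda k: d(k)/r(k)      # rho^(0)(n): links d(n) to r(n)
rho1 = lambda k: d(k-1)/r(k)    # rho^(1)(n): links d(n-1) to r(n)

k = 7
lhs50, rhs50 = rho0(k+1), (pd_/pr_)*rho0(k)   # Eqn (9.50), exact here
lhs51, rhs51 = rho1(k+1), (pd_/pr_)*rho1(k)   # Eqn (9.51), exact here
lhs52, rhs52 = rho1(k+1), rho0(k)/pr_         # Eqn (9.52), exact here
```

With driving and measurement noise present, the same identities hold only approximately, and fail near zero-crossings of r(n), as noted above.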
Assuming that L and M are chosen large enough, enumerating all possible linking sequences involved in substituting for the desired channel vector elements with reference channel vector elements, we find the set from ρ^(L)(n) to ρ^(−(M−2))(n). For each of these linkages, we can determine the corresponding TV WF equivalent. Each of these TV WF equivalents operates on the same reference vector, is a reference-only filter, and produces the same estimate as the unique and optimal LTI two-channel WF. Consequently, any linear combination of any of these TV WF equivalent filters will produce that same optimal estimate, as long as the sum of linear combination weights equals 1. The TV WF equivalents are nonunique and make up the manifold of solutions that produces the optimal WF estimate.
We will next use the above linkage relationships to establish the exact and approximate, respectively, time-varying WF targets for NLMS in the exponential and narrowband AR(1) ANC scenarios. The question we then address is whether there is a specific target solution determined by the multitude of possibilities indicated above. In Section 9.6 we will provide the answer to that question for the class of WSS sinusoidal processes. The narrowband AR(1) process case will be addressed in Section 9.7.
9.6
Having seen the multitude of alternative TV WF equivalents to the optimal LTI two-channel WF, we now illustrate the above findings by applying them in the context of WSS exponential processes. The specific context of the ANC scenario in Figure 9.1 will be used.

9.6.1

Referring back to the signal generator in Figure 9.3, the noise-free desired and reference processes, {d̃(n)} and {r̃(n)}, respectively, are now governed by the homogeneous difference equations corresponding to Eqn. (9.2):

    d̃(n) = p_d d̃(n−1)
    r̃(n) = p_r r̃(n−1).        (9.53)
The WSS exponential processes are the zero-input responses of these systems, starting from the appropriate random initial conditions. The frequencies and amplitudes of the complex sinusoids are assumed fixed. The following parameterization then applies:

    p_d = e^{jω_d}
    p_r = e^{jω_r}
    d̃(0) = A_d e^{jφ_d},  φ_d ~ U[0, 2π)        (9.54)
    r̃(0) = A_r e^{jφ_r},  φ_r ~ U[0, 2π)
    φ_d, φ_r statistically independent.

For the noiseless desired and reference processes this leads to the following explicit expressions:

    d(n) = A_d e^{jφ_d} e^{jω_d n}
    r(n) = A_r e^{jφ_r} e^{jω_r n}.        (9.55)
In the two-channel ANC scenario, our goal is to estimate the desired process from its past and from causally past values of the reference signal. For the sake of simplicity, first select L = 1 and M = 1. The estimate for d(n) is then written as follows:

    d̂(n) = w_WF(1,1)^H u(n) = w_WF(1,1)^H [d(n−1); r(n)].        (9.56)
We are seeking the LTI two-channel WF solution that produces the desired signal d(n). To that end, we note from Eqn. (9.55) the following:

    d(n) = A_d e^{jφ_d} e^{jω_d n}
         = A_d e^{jφ_d} e^{jω_d(n−1)} e^{jω_d}        (9.57)
         = e^{jω_d} d(n−1).

Using the latter to substitute for d(n−1) in Eqn. (9.56) produces the LTI WF solution:

    d̂(n) = w_WF(1,1)^H [e^{−jω_d} d(n); r(n)]
          = [e^{jω_d}  0] [e^{−jω_d} d(n); r(n)]        (9.58)
          = d(n).
We are interested in the reference-only equivalent to the LTI WF solution. We use the linking sequence ρ^(1)(n), as defined in Eqn. (9.47), to substitute for d(n−1) with r(n). This leads to the following, which is a special case of Eqn. (9.46):

    d̂(n) = w_WF(1,1)^H [d(n−1); r(n)]
          = [e^{jω_d}  0] [ρ^(1)(n) r(n); r(n)]
          = e^{jω_d} ρ^(1)(n) r(n)        (9.59)
          = w_TVWF(0,1)^H(n) r(n).
The linking sequence ρ^(1)(n) evaluates explicitly as

    ρ^(1)(n) = d(n−1)/r(n)
             = (A_d e^{jφ_d} e^{jω_d(n−1)})/(A_r e^{jφ_r} e^{jω_r n})        (9.60)
             = (A_d/A_r) e^{j(φ_d−φ_r)} e^{j(ω_d−ω_r)n} e^{−jω_d}.
Substituting the latter in w_TVWF(0,1)(n), as defined in the last equality of Eqn. (9.59), yields the following explicit expression for the equivalent TV WF:

    w_TVWF(0,1)(n) = (e^{jω_d} ρ^(1)(n))^H
                   = (A_d/A_r) e^{−j(φ_d−φ_r)} e^{−j(ω_d−ω_r)n}.        (9.61)
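Equation (9.61) can be verified directly: applying the Hermitian of this scalar TV weight to the reference process reproduces the desired process sample for sample. Amplitudes, phases, and frequencies below are hypothetical:

```python
import numpy as np

Ad, Ar = 2.0, 3.0                    # hypothetical amplitudes
phid, phir = 0.3, 0.9                # hypothetical initial phases
wd, wr = 2*np.pi*0.20, 2*np.pi*0.15  # hypothetical frequencies
n = np.arange(25)
d = Ad*np.exp(1j*(phid + wd*n))      # desired sinusoid, Eqn (9.55)
r = Ar*np.exp(1j*(phir + wr*n))      # reference sinusoid, Eqn (9.55)

# Eqn (9.61): scalar TV reference-only weight
w_tv = (Ad/Ar)*np.exp(-1j*(phid - phir))*np.exp(-1j*(wd - wr)*n)
est = np.conj(w_tv)*r                # w^H(n) r(n)
```

The estimate equals d(n) exactly at every n, confirming that the TV weight corrects both the amplitude/phase mismatch and the frequency offset between the two channels.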
How does this relate to AF behavior and performance? Recall that NLMS, for step-size equal to 1, adjusts the NLMS weight vector so that the a posteriori error equals 0. This means that the a posteriori weight vector w_AF(0,1)(n+1) produces the desired signal:

    d(n) = w_AF(0,1)^H(n+1) r(n)
    w_AF(0,1)^H(n+1) r(n) = d̂(n)        (9.62)
                          = (A_d/A_r) e^{j(φ_d−φ_r)} e^{j(ω_d−ω_r)n} r(n).
The final equality comes from comparing with Eqn. (9.59) and substituting Eqn. (9.61), thereby producing the unique correspondence between the optimal TV WF weight vector at time n and the AF weight vector at time n+1. In this example, we can then write an explicit expression for the weight behavior of NLMS, with step-size equal to 1, because the a posteriori weight vector in one iteration equals the a priori weight vector in the next iteration:

    w_AF(0,1)(n) = (A_d/A_r) e^{−j(φ_d−φ_r)} e^{−j(ω_d−ω_r)(n−1)}.        (9.63)
The one-step lag of the AF weight vector behind its TV target gives rise to the a priori estimation error associated with the AF, which tends to vanish as the frequency difference vanishes.
Recall that at iteration n we have the a priori weight vector w_AF(0,1)(n), and that the a posteriori weight vector w_AF(0,1)(n+1) follows from the weight update equation:

    w_AF(0,1)(n+1) = w_AF(0,1)(n) + (μ/(u^H(n) u(n))) e*(n) u(n)        (9.65)
                   = w_AF(0,1)(n) + (μ/(r^H(n) r(n))) [d(n) − w_AF(0,1)^H(n) r(n)]* r(n).
Using Eqn. (9.65) to substitute for the steady-state NLMS weight vector in the weight update equation,

    w_AF(0,1)(n+1) = w_AF(0,1)(n) + (μ/(r^H(n) r(n))) e*(n) r(n),        (9.66)

leads to the steady-state gain factor

    γ e^{jψ} = μ/(1 − (1 − μ) e^{−j(ω_d−ω_r)})        (9.67)

and the corresponding steady-state a priori error

    e(n) = [(1 − e^{−j(ω_d−ω_r)})/(1 − (1 − μ) e^{−j(ω_d−ω_r)})] d(n).        (9.68)
This expression for the steady-state error is valid for any realization; the factor that converts the desired signal into the error signal depends only on the frequency difference of the sinusoidal signals, and not in any way on their phases. Consequently, the mean-square value of the error in Eqn. (9.68) is also the MSE for the WSS sinusoidal process case. Note that for any nonzero step-size, the error goes to zero as the frequency difference goes to zero. The latter corresponds to the AF weight vector becoming the nonzero constant that corrects for the amplitude and phase difference between any set of sinusoidal process realizations. Also, keeping ω_r fixed and sweeping ω_d, notch filter behavior is observed [19], but without any restrictions on the desired and/or reference channel frequencies. An interesting observation is that for μ → 1 and ω_d − ω_r → π, we find e(n) → 2d(n); that is, the worst-case result of using NLMS is a 6 dB increase in error power over not having filtered at all.
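The steady-state factor in Eqn. (9.68), with the sign convention used above, can be checked against a direct simulation of a single-tap reference-only NLMS filter on complex sinusoids; step-size, amplitudes, phases, and frequencies below are hypothetical:

```python
import numpy as np

mu = 0.5                                   # hypothetical step-size
wd, wr = 2*np.pi*0.21, 2*np.pi*0.16        # hypothetical frequencies
n = np.arange(400)
d = 2.0*np.exp(1j*(0.7 + wd*n))            # desired complex sinusoid
r = 3.0*np.exp(1j*(1.1 + wr*n))            # reference complex sinusoid

w = 0.0 + 0.0j                             # single-tap reference-only NLMS
errs = np.zeros(len(n), dtype=complex)
for k in range(len(n)):
    errs[k] = d[k] - np.conj(w)*r[k]                     # a priori error
    w = w + mu*r[k]*np.conj(errs[k])/np.abs(r[k])**2     # NLMS update

delta = wd - wr
beta = (1 - np.exp(-1j*delta))/(1 - (1 - mu)*np.exp(-1j*delta))  # Eqn (9.68)
```

The ratio e(n)/d(n) converges geometrically (with factor |1 − μ|) to the constant beta, so the steady-state error factor is indeed phase-independent, as claimed above.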
9.6.2 Alternative Equivalent TV WF
In the previous section M = L = 1, and there was only one way to link the past desired value with the present reference value, resulting in a unique TV WF equivalent. Correspondingly, the NLMS weight vector solution was derived in straightforward fashion. For ML > 1 multiple TV WF equivalents exist.
For the same scenario as defined in Section 9.6.1, but with M = 2, there are now two elements in the reference vector partition (two taps in the reference channel delay line). In complete accordance with the linking sequence in Eqn. (9.60) we have already derived the following TV WF equivalent, as in Eqn. (9.59). It has merely been rewritten with the additional (here inactive) dimension corresponding to the second reference vector dimension:
    d̂(n) = w_WF(1,2)^H [d(n−1); r(n); r(n−1)]
          = [e^{jω_d}  0  0] [ρ^(1)(n) r(n); r(n); r(n−1)]
          = [e^{jω_d} ρ^(1)(n)  0] [r(n); r(n−1)]        (9.69)
          = w̃_TVWF(0,2)^H(n) r(n).
There is now a second way to link the desired channel with the reference channel, namely, by using the linking sequence ρ^(0)(n). The alternative to the development resulting in Eqn. (9.69) is the following, a special case of Eqn. (9.43):

    d̂(n) = w_WF(1,2)^H [d(n−1); r(n); r(n−1)]
          = [e^{jω_d}  0  0] [ρ^(0)(n−1) r(n−1); r(n); r(n−1)]
          = [0  e^{jω_d} ρ^(0)(n−1)] [r(n); r(n−1)]        (9.70)
          = ẁ_TVWF(0,2)^H(n) r(n).
Any linear combination of the two equivalents, with combination weights (1−α) and α, is again a TV WF equivalent:

    w_TVWF(0,2)(n) = (1−α) [(e^{jω_d} ρ^(1)(n))*; 0] + α [0; (e^{jω_d} ρ^(0)(n−1))*]
                   = a*(n) [(1−α); α e^{−jω_r}],        (9.71)
    a(n) = (A_d/A_r) e^{j(φ_d−φ_r)} e^{j(ω_d−ω_r)n}.
Uniqueness Resolved

Equation (9.71) provides the set of target solutions for the NLMS algorithm. For the present scenario this set is complete, since knowledge of d(n−1) is sufficient to completely determine the desired d(n). Actually, knowing d(n−l) for any positive l is sufficient, as following the above procedure, for any of these choices, leads to the same solution set given in Eqn. (9.71).
Recall that NLMS can be interpreted as finding the new weight vector that minimally differs from the current one. From Eqn. (9.71) we can write for the weight vector increment

    w_TVWF(0,2)(n+1) − w_TVWF(0,2)(n) = (a*(n+1) − a*(n)) [(1−α); α e^{−jω_r}].        (9.72)

The only part that depends on α is the vector on the right. The norm of this vector is minimized by the choice α = 0.5. Substituting α = 0.5 in Eqn. (9.71), incorporating the effect of μ, as given in Eqn. (9.67), and accounting for the AF always lagging one step behind its target gives the following expression for the a priori steady-state weight vector behavior of the reference-only NLMS AF with two input taps:
    w_AF(0,2)(n) = γ e^{−jψ} (A_d/A_r) e^{−j(φ_d−φ_r)} e^{−j(ω_d−ω_r)(n−1)} [0.5; 0.5 e^{−jω_r}],        (9.73)

where γ e^{jψ} is as given in Eqn. (9.67). Figure 9.27 shows the actual a posteriori weight vector behavior and the behavior in steady state, as governed by Eqn. (9.73) one step advanced, for NLMS step-size μ of 1 and 0.1, with A_d = 2, ω_d = π/3 + 0.05·2π, A_r = 3, ω_r = π/3, and random initial phases. We see that
Figure 9.27 Actual (x) and theoretical (o) NLMS weight behavior for μ = 1 [(a) and (c)] and μ = 0.1 [(b) and (d)]; A_d = 2, ω_d = π/3 + 0.05·2π, A_r = 3, and ω_r = π/3.
Figure 9.27 (continued)
steady state is reached more quickly and that the weight vector elements have larger amplitudes when μ = 1 (a, c) than when μ = 0.1 (b, d). Also, the steady-state weight behavior given in Eqn. (9.73) is verified by the actual NLMS result.
In Figure 9.28, using Eqn. (9.68), the corresponding actual and theoretical MSE behaviors are shown for μ = 1 (a) and μ = 0.1 (b). For small step-size, NLMS cannot follow the changes in the TV WF target weight vector and the steady-state error is therefore larger.
The steady-state output of the AF follows from Eqn. (9.73), using Eqn. (9.55):

    y(n) = w_AF(0,2)^H(n) u(n)
         = γ e^{jψ} (A_d/A_r) e^{j(φ_d−φ_r)} e^{j(ω_d−ω_r)(n−1)} [0.5  0.5 e^{jω_r}] [r(n); r(n−1)]
         = γ e^{jψ} A_d e^{jφ_d} e^{jω_d(n−1)} e^{jω_r}        (9.74)
         = γ e^{jψ} e^{jω_r} d(n−1).
In the next-to-last equality, we see that the steady-state output of the AF consists of a single frequency component, at frequency ω_d, confirming Glover's original heterodyning interpretation [19]. Any other frequency components in the AF output vanish, as they result from AF transient behavior. Recall from Eqn. (9.67) that μ → 1 produces γ e^{jψ} → 1, resulting in steady-state AF output y(n) = e^{jω_r} d(n−1), showing that the AF adjusts the desired signal from the previous step.
The above readily generalizes to the use of more reference vector elements (or delay line taps). For every increment in M, an element is added to the vector in the right-hand side of Eqn. (9.71). As a result of the next higher indexed and one sample further delayed rotation sequence, using Eqn. (9.52), each addition of an element contains an extra factor of e^{−jω_r}. The latter expresses a rotation of the added weight relative to the earlier weights. The multiple solutions, expressed by the corresponding elements, are all weighted equally to produce the minimum norm weight vector increment. Figure 9.29 shows the weight vector behavior for a 10-tap filter and μ = 1, with otherwise the same parameters as above. In Figure 9.29 we see that the weight with index 1 and the weight with index 7 have the same behavior. Weights with indices 2 and 8 also behave the same way, and so on. There is periodicity over the weight index, with period equal to 6. This corresponds to the factor e^{−jω_r} in the weight vector, as in this example ω_r = π/3. We also observe in each weight a period of 20 over the time index, due to ω_d − ω_r = 0.05·2π, corresponding to the e^{−j(ω_d−ω_r)n} term that all weight vector elements have in common. These periodic weight behaviors were originally observed by Glover [19].
Figure 9.30 shows a close-up of the weight-vector behavior in the 10-tap case. NLMS starts to adapt at time index 10. Starting from the zero vector, it only takes one iteration to get into steady-state behavior, because μ = 1. As explained, we see the behavior of only six different weight vector elements because of the periodicity over the weight index.
Figure 9.28
Figure 9.29 Weight behavior for 10-tap NLMS (μ = 1) in the sinusoidal ANC scenario: A_d = 2, ω_d = π/3 + 0.05·2π, A_r = 3, and ω_r = π/3.
Figure 9.30 Close-up of actual (x) and theoretical (o) weight behavior for 10-tap NLMS (μ = 1) in the sinusoidal ANC scenario: A_d = 2, ω_d = π/3 + 0.05·2π, A_r = 3, and ω_r = π/3.
Figure 9.31 shows the corresponding result for μ = 0.1. Only the real part of the weights is presented, as the imaginary part behaves similarly. From the small step-size result, we noted earlier that the transient behavior takes longer. However, after 100 iterations, the actual NLMS behavior and the theoretical steady-state behavior have become indistinguishable.
The error behavior for the two-tap filters above was shown in Figure 9.28. For the 10-tap filters this behavior is simply delayed by eight samples, corresponding to the delayed start of weight vector adaptation. That this is so can be seen from generalizing the following transition from the one-tap to the two-tap filter, based on Eqns. (9.65), (9.55), and (9.73):
    d̂_AF(0,1)(n) = w_AF(0,1)^H(n) r(n)
                 = γ e^{jψ} (A_d/A_r) e^{j(φ_d−φ_r)} e^{j(ω_d−ω_r)(n−1)} r(n)
                 = γ e^{jψ} (A_d/A_r) e^{j(φ_d−φ_r)} e^{j(ω_d−ω_r)(n−1)} [0.5 r(n) + 0.5 e^{jω_r} r(n−1)]
                 = γ e^{jψ} (A_d/A_r) e^{j(φ_d−φ_r)} e^{j(ω_d−ω_r)(n−1)} [0.5  0.5 e^{jω_r}] [r(n); r(n−1)]        (9.75)
                 = w_AF(0,2)^H(n) r(n)
                 = d̂_AF(0,2)(n).
The a priori steady-state estimates, and therefore the corresponding errors, are the same for the one-tap and two-tap AF. Consequently, the system function from the error signal to the AF output remains the same and is independent of M, the number of taps in the AF delay line.
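The equality of the one-tap and two-tap estimates in Eqn. (9.75) rests only on r(n) = e^{jω_r} r(n−1) for a complex sinusoid, so splitting any single weight into the pair [0.5, 0.5 e^{−jω_r}] leaves the estimate unchanged; the common weight factor and frequency below are hypothetical stand-ins for the scale terms in Eqn. (9.73):

```python
import numpy as np

wr = 2*np.pi*0.15                            # hypothetical reference frequency
r = lambda k: 3.0*np.exp(1j*(0.9 + wr*k))    # reference complex sinusoid
c = 0.8 - 0.3j                               # arbitrary common weight factor
k = 11

one_tap = np.conj(c)*r(k)                    # single-tap estimate
w2 = c*np.array([0.5, 0.5*np.exp(-1j*wr)])   # weight split across two taps
two_tap = np.conj(w2) @ np.array([r(k), r(k-1)])
```

The same splitting works for any number of taps, which is why the error behavior is independent of M while the individual weight amplitudes shrink as M grows.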
In Figure 9.27 we observe that the amplitudes of the real and imaginary components of the steady-state weights for the two-tap filter are 0.34 and 0.11 for μ = 1 and μ = 0.1, respectively. In Figures 9.30 and 9.31, for the 10-tap filter, these amplitudes have dropped to 0.062 and 0.021. Recall that NLMS minimizes the norm of the weight vector increment from one iteration to the next. While the a priori estimation error remains the same, as the number of taps used is increased the norm of the weight vector increment decreases.
9.7
We now test the TV optimal equivalent hypothesis for AR(1) processes. This represents a widening of the bandwidth relative to the WSS exponential processes, as well as the emergence of a driving term in the difference equations representing these processes. While for the exponential processes the linear prediction error is zero, this is no longer the case for AR(1) processes. While the stochastic nature of the input processes makes it difficult to describe the weight dynamics exactly, it will be seen that the TV WF equivalents still describe the observed NLMS weight behavior well.
Figure 9.31 Close-ups of the real part of actual (x) and theoretical (o) weight behavior for 10-tap NLMS (μ = 0.1) in the sinusoidal ANC scenario: A_d = 2, ω_d = π/3 + 0.05·2π, A_r = 3, and ω_r = π/3.
We now return to the ANC scenario in Section 9.2.2, with pole radii of 0.99 and frequencies 18° apart. The optimal performance of the reference, desired, and two-channel finite horizon optimal WFs was shown in Figure 9.16 for various horizons. We saw that when both reference and desired inputs are used, the performance rapidly approaches a limit, which is in fact the limit achieved by the corresponding infinite horizon WF. The best LTI WF in the Gaussian scenario is a filter that operates on a somewhat limited past of both the desired and reference inputs.
In order to demonstrate more easily the TV equivalent models, we use the scenario from Section 9.2.2, but with the SNR increased to 80 dB. For later reference, we thus have the following parameterization:
    p_d = 0.99 e^{jπ/3}
    p_r = 0.99 e^{j(π/3 + 0.05·2π)}
    SNR_d = 80 dB        (9.76)
    SNR_r = 80 dB.
Figure 9.32 shows a close-up of the WF performance for this scenario. We see that when there is (nearly) no observation noise, the optimal filter only requires two past values of d(n) and r(n) to reach a performance equal to that for the infinite horizon filter. In fact, in the truly noiseless scenario, optimal performance is obtained with one past value of d(n) and two past values of r(n). The latter is a direct result of the generating processes being AR(1) (only one past value is needed for its optimal prediction). The addition of observation noise causes the (noisy) desired and reference processes to become more and more ARMA [24]. Consequently, the equivalent AR processes approach being of infinite order. Depending on the SNRs, we can approximate these processes reasonably well with AR(p) processes of high enough order. Using the analytical technique described in Section 9.4.2 (resulting in Figure 9.16, for example), we can readily determine how much of a WF horizon is needed to get within an acceptable margin of the optimal performance.
For the scenarios reflected in Figures 9.16, 9.18, and 9.23, we observed that there is often a substantial performance gap between the reference-only WF and the desired-only or desired + reference WF. In this section we will outline the conditions under which the AF performance can approach that of the optimal desired + reference WF. We will first derive approximate reference-only TV WF equivalents to the two-channel WF, as discussed in Section 9.5. Due to the misadjustment and lag variance associated with the use of AF filtering techniques, only part of the performance gap can be bridged. Furthermore, there may not be any performance advantage when the time variations are too fast to be tracked by an AF.
While the overall optimal performance in Figure 9.32 is reached for L = 2, M = 2, the optimal MMSE is actually reached for L = 1, M = 2 (as indicated by the
Figure 9.32 Optimal WF performance for the (nearly) noiseless ANC scenario: p_d = 0.99 e^{jπ/3}, p_r = 0.99 e^{j(π/3+0.05·2π)}, SNR_d = SNR_r = 80 dB.
MMSE WF(L, 0) behavior). By making the SNR very high, we thus have a truly optimal WF at very low orders with which to demonstrate the optimal TV WF equivalents.
Now that we have established a scenario for which the optimal WF is w_WF(1,2) and this filter is LTI, we expect nice behavior (convergence to a tight neighborhood of the optimal LTI WF) from the corresponding AF, w_AF(1,2). Figure 9.33 shows the learning curve for the latter, together with the error behavior for the optimal filter, w_WF(1,2). Note that the AF does almost as well as the optimal WF. The discrepancy between the two is known as the misadjustment error, which for μ = 1 is generally close to 3 dB.
The weight behavior of the AF(1, 2) and WF(1, 2) filters is shown in Figure 9.34. The weight vector for WF(1, 2) is [0.4950 0.8574j  1  0.2058 0.9684j]. We see that the adaptive filter weights are almost indistinguishable from those of the WF. Only if we zoom in, as in Figure 9.35, do we see that the AF weights are actually varying somewhat. The random fluctuation behavior of the weights is responsible for the excess MSE seen in Figure 9.33. One might say that we get nice, desirable behavior. The NLMS AF converges to (a neighborhood of) the optimal solution in
its quest for minimizing the error under the constraint of minimal weight vector increments. The latter is eminently compatible with the existence of an LTI solution in this case.

Figure 9.33 Learning curves for the WF and AF: p_d = 0.99 e^{jπ/3}, p_r = 0.99 e^{j(π/3+0.05·2π)}, SNR_d = SNR_r = 80 dB.
9.7.2
Figure 9.34 Real (a) and imaginary (b) components of the AF(1, 2) and WF(1, 2) weights: p_d = 0.99 e^{jπ/3}, p_r = 0.99 e^{j(π/3+0.05·2π)}, SNR_d = SNR_r = 80 dB.
Figure 9.35
The first TV equivalent filter for this situation follows directly from Eqn. (9.43) by using L = 1 and M = 2. This gives us the following result:

    d̂(n) = w_WF(L,M)^H u(n)
          = [w_d,WF(L,M)^H  w_r,WF(L,M)^H] [d(n−1); r(n)]
          = (w_d,WF(L,M)^H [0  D_ρ(n−1)] + w_r,WF(L,M)^H) r(n)
          = (w_d,WF(1,2)^H [0  ρ^(0)(n−1)] + w_r,WF(1,2)^H) r(n)        (9.77)
          = ẁ_TVWF(0,2)^H(n) r(n).

Note that ẁ_TVWF(0,2)(n) has a first component that is constant, as it comes from the LTI WF(1, 2) exclusively. The second component, the one depending on ρ^(0)(n−1), is the only (potentially) TV component.
Figure 9.37 Real (a) and imaginary (b) components of the AF(0, 2) weights: p_d = 0.99 e^{jπ/3}, p_r = 0.99 e^{j(π/3+0.05·2π)}, SNR_d = SNR_r = 80 dB.
The second TV equivalent filter follows directly from Eqn. (9.46) and gives the following result:

    d̂(n) = w_WF(L,M)^H u(n)
          = [w_d,WF(L,M)^H  w_r,WF(L,M)^H] [d(n−1); r(n)]
          = (w_d,WF(L,M)^H [D_κ(n)  0] + w_r,WF(L,M)^H) r(n)
          = (w_d,WF(1,2)^H [ρ^(1)(n)  0] + w_r,WF(1,2)^H) r(n)        (9.78)
          = w̃_TVWF(0,2)^H(n) r(n).

Note that w̃_TVWF(0,2)(n) has a second component that is constant, as it comes from the LTI WF(1, 2) exclusively. Now the first component, the one depending on ρ^(1)(n), is the only (potentially) TV component.
Combining the results from Eqns. (9.77) and (9.78), and using Eqn. (9.52) to substitute for ρ^(0)(n−1) in Eqn. (9.77), we can now state the set of (approximate) TV WF equivalents that describes the manifold from which NLMS determines the a posteriori weight vector:

    w_TVWF(0,2)(n) = α (w_d,WF(1,2)* [0; ρ^(0)*(n−1)] + w_r,WF(1,2))
                     + (1−α) (w_d,WF(1,2)* [ρ^(1)*(n); 0] + w_r,WF(1,2))
                   = w_d,WF(1,2)* [(1−α) ρ^(1)*(n); α ρ^(0)*(n−1)] + w_r,WF(1,2)        (9.79)
                   = w_d,WF(1,2)* ρ^(1)*(n) [(1−α); α |p_r| e^{−jω_r}] + w_r,WF(1,2).
The first term on the right-hand side in Eqn. (9.79) is TV, following the behavior of the rotation sequence ρ^(1)(n). Both vector elements vary with the difference frequency when the approximation in Eqn. (9.51) is valid, and the second weight is offset by an angle corresponding to the reference frequency when the approximation in Eqn. (9.52) is valid.
Referring back to Figure 9.37, we see both of these weight vector behaviors. Note that our derivation was subject to approximations holding most of the time, a condition based on measurement noise being locally negligible with respect to the signal values; this pertains in particular to the reference signal values, as those show up in the denominator of our linking sequences. Note how the regularity of the TV weight behavior in Figure 9.37 is temporarily lost near sample 4950, where WF(0, 2) does temporarily better than AF(0, 2), as seen in Figure 9.36. When the signal is small relative to the noise, Eqn. (9.51) loses its validity and the semiperiodic weight behavior is disturbed, as reflected in the interval around sample 4925. Furthermore, in this example, a very short reference vector is being used in the reference-only AF (containing only two reference channel samples), which can easily cause a rather small reference vector norm for some instants. As a consequence, the NLMS weight update produces temporarily large disturbances of the weight vector.
In order to find the a posteriori target weight vector for NLMS from the manifold of solutions described in Eqn. (9.79), we next evaluate the weight vector increment:

    w_TVWF(0,2)(n+1) − w_TVWF(0,2)(n) = w_d,WF(1,2)* (ρ^(1)*(n+1) − ρ^(1)*(n)) [(1−α); α |p_r| e^{−jω_r}].        (9.80)

Assuming the rotation sequence difference to be constant, the norm squared of the weight vector increment has the following proportionality:

    ‖w_TVWF(0,2)(n+1) − w_TVWF(0,2)(n)‖² ∝ |1−α|² + |α|² |p_r|².        (9.81)

Writing the right-hand side in terms of the real and imaginary parts of α, and minimizing with respect to both, yields the optimal linear combination coefficient:

    α_opt = 1/(1 + |p_r|²).        (9.82)
Substituting in Eqn. (9.79) produces the a posteriori weight vector target for NLMS:

    w_TVWF(0,2)(n) = w_d,WF(1,2)* ρ^(1)*(n) (|p_r|/(1 + |p_r|²)) [|p_r|; e^{−jω_r}] + w_r,WF(1,2).        (9.83)
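The minimization behind Eqn. (9.82) can be confirmed with a direct scan of the cost in Eqn. (9.81); the minimizer is real, since the imaginary part of α only adds to the cost, so a scan over real α in [0, 1] suffices:

```python
import numpy as np

# cost from Eqn (9.81): |1 - a|^2 + |a|^2 * |pr|^2, scanned over real a
pr_mag = 0.99                                # pole radius from the scenario
alphas = np.linspace(0.0, 1.0, 100001)
cost = (1.0 - alphas)**2 + (alphas**2)*(pr_mag**2)
a_opt_scan = alphas[np.argmin(cost)]
a_opt_closed = 1.0/(1.0 + pr_mag**2)         # Eqn (9.82)
```

For |p_r| = 1, as in the exponential case, this reduces to α_opt = 0.5, consistent with the choice made below Eqn. (9.72).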
Comparing the weight vector in Eqn. (9.71) to that in Eqn. (9.83), we note that the former is explicit in terms of the parameters reflecting the exponential scenario, while the latter is implicit, as it contains the rotation sequence ρ^(1)(n). The latter determines the behavior of the TV aspect of the a posteriori weight vector target for NLMS. The stochastic nature of the temporal behavior of the linking sequence exemplifies the main difference between the deterministic and the stochastic narrowband WSS cases.
Figure 9.38 shows the behavior of the actual NLMS update (solid, varying) together with that of the hypothesized target model (dotted, varying) in Eqn. (9.83) and the reference portion of the LTI w_WF(1,2) = [0.4950 0.8574j  1  0.2058 0.9684j]^T (gray, constant). The latter indicates the values around which the TV weights vary, according to Eqn. (9.83), which, unlike in the exponential case, are now generally nonzero. We note that the hypothesized weight vector behavior, as
Figure 9.38 Real (a) and imaginary (b) components of the NLMS and hypothesized NLMS weights: p_d = 0.99 e^{jπ/3}, p_r = 0.99 e^{j(π/3+0.05·2π)}, SNR_d = SNR_r = 80 dB.
predicted from the manifold of TV equivalent WFs, follows the actual NLMS behavior quite well. While there appear to be discrepancies between the two from time to time, this is attributed to the relationships used in the derivation being approximate and valid most of the time. Note that NLMS for step-size μ = 1 produces an a posteriori estimate equal to the desired value (which is slightly noisy), while the hypothesized model aims to produce the Wiener solution provided by the two-channel LTI WF. Figure 9.39 shows these respective estimates. The estimates from the TV AF and WF equivalent filter are indistinguishable (* and · coincide) and nearly equal to the desired value (o), while the NLMS estimate is strictly equal to the desired value (because μ = 1). The a posteriori weights track the TV WF equivalent. More importantly, most of the time, these a posteriori weights are still relatively close to the TV WF equivalent weights at the next iteration (as seen in Fig. 9.38), resulting in an a priori error that is small relative to that produced by the LTI WF weights (as seen in Fig. 9.36).
Figure 9.40 shows the norm of the weight change vector during steady state for
the various solutions that were considered. The optimal TV WF, as expressed in
Eqn. (9.83), is observed to have a weight vector increment norm smaller than either
one of its two constituents, as given in Eqns. (9.77) and (9.78). Moreover, linearly
combining the latter, as in Eqn. (9.79), and numerically nding a to yield the
minimum of either the max, min, mean, or median of the norm of the weight vector
increments over the steady-state interval all yielded a very close to 0.5 and nearly
indistinguishable weight vector solutions.
As in the exponential case, the addition of more taps in the reference channel creates additional weight solutions, with the TV aspect modified by |p_r|e^{jω_r}, that is, shifted and with slightly smaller amplitudes. We can observe the shifting in Figure 9.7, where M = 3. In the latter case the SNRs were 20 dB, illustrating that it is the validity of Eqns. (9.51) and (9.52) in the vicinity of zero crossings, more than the SNR, that determines the weight behavior.
9.7.3
The narrowband scenario in Sections 9.7.1 and 9.7.2 supports the notion that it is the slowly varying TV equivalent optimal solution that is being tracked. It is relatively simple,
then, to hypothesize a very similar scenario in which the TV equivalent solution
varies rapidly. If we choose the following scenario for Figures 9.2 and 9.3,
p_d = 0.99 e^{jπ/3}
p_r = 0.99 e^{j(π/3 + 0.50·2π)}
SNR_d = 80 dB
SNR_r = 80 dB,          (9.84)
then the optimal WF performance graph looks as it does in Figure 9.41. Note that, although we have changed the pole angle difference dramatically (from 18° to 180°), there is still a large performance gap between the reference-only and two-channel WFs, so that one might benefit from the possible nonlinear effect of using an AF in
Figure 9.39 NLMS and hypothesized NLMS a posteriori estimates (a) and close-up (b): p_d = 0.99e^{jπ/3}, p_r = 0.99e^{j(π/3 + 0.05·2π)}, SNR_d = SNR_r = 80 dB; desired (o), AF(0, M) (*), WF(0, M) (solid), WFTV(0, M)opt (.).
Figure 9.40 Weight vector increment norms for various TV equivalents: p_d = 0.99e^{jπ/3}, p_r = 0.99e^{j(π/3 + 0.05·2π)}, SNR_d = SNR_r = 80 dB; the two constituent TV equivalents (solid gray, dotted gray) and their affine combination w_WFTV(0,2) (black).
this scenario. The manifold of TV equivalent filters is still described by Eqn. (9.83). The linking sequence is still defined as before and, for this scenario, specifically evaluates as follows from Eqn. (9.51):
r^{(1)}(n) = (|p_d| / |p_r|) e^{j(ω_d − ω_r)} r^{(1)}(n−1)
           = e^{−j0.50·2π} r^{(1)}(n−1)
           = −r^{(1)}(n−1).          (9.85)
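The sign flip in Eqn. (9.85) is easy to verify numerically. The sketch below is a noise-free stand-in for the 80 dB SNR case: it treats the two processes as pure damped exponentials p_d^n and p_r^n (an assumption of this sketch, not the chapter's AR(1) realizations) and takes the linking sequence as their sample-by-sample ratio:

```python
import numpy as np

pd = 0.99 * np.exp(1j * np.pi / 3)
pr = 0.99 * np.exp(1j * (np.pi / 3 + 0.50 * 2 * np.pi))

n = np.arange(20)
d = pd ** n          # noise-free narrowband limit: pure damped exponentials
r = pr ** n
k = d / r            # sample-by-sample linking sequence

ratios = k[1:] / k[:-1]
print(np.allclose(ratios, -1.0))   # True: the linking sequence flips sign each step
```

Because |p_d| = |p_r|, the ratio has constant magnitude, so the entire time variation is the per-sample rotation e^{j(ω_d − ω_r)} = −1.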
Substituting the latter in Eqn. (9.83) yields the following TV WF equivalent relative to some arbitrary steady-state time index n₀:

w_WFTV(0,2)(n) = w_d^{WF(1,2)} (−1)^{n−n₀} / |p_r| + [ |p_r| / (1 − |p_r|² e^{jω_r}) ] w_r^{WF(1,2)}.          (9.86)
The first term on the right-hand side is seen to change maximally from iteration to iteration.

Figure 9.41 Optimal WF performance for the fast TV scenario: p_d = 0.99e^{jπ/3}, p_r = 0.99e^{j(π/3 + 0.50·2π)}, SNR_d = SNR_r = 80 dB.

An example of the performance of the reference-only WF and reference-only NLMS AF is shown in Figure 9.42. In the latter, we now see that the TI
WF(0, 2) still performs close to its theoretical bound but that the AF(0, 2), while exhibiting the same overall error behavior, now has an error that is approximately 6 dB larger than that for the TI WF. Recall that this is the worst-case expectation for the exponential case with a frequency difference of π; that is, the behavior of the a priori error for the (nearly) noiseless narrowband AR(1) case is, for each iteration, close to that for the corresponding exponential scenario. Comparing with Figure 9.36, we note that the performance advantage of AF over WF has flipped into a comparable disadvantage.
Figure 9.43 shows that the two-channel AF performance is still Wiener-like and
similar to that in Figure 9.33. We observe that only the convergence rate seems to
have been affected, not the steady-state performance.
In Figure 9.44 the real part of the AF(1, 2) weights is shown, together with a zoomed version, as are the WF(1, 2) weights. The imaginary part of the weight vector behaves the same way. The WF(1, 2) weight vector for this scenario is [0.4950+0.8574j, 1, 0.4950+0.8574j]; that is, its first and last components are the same.
Figure 9.42 Reference-only WF and NLMS AF performance for the fast TV scenario: p_d = 0.99e^{jπ/3}, p_r = 0.99e^{j(π/3 + 0.50·2π)}, SNR_d = SNR_r = 80 dB.

An indication of weight vector tracking, corresponding to Figure 9.38, is now reflected in Figure 9.45. The weight vector for WF(0, 2) is [1, 0.5+0.8660j] and
therefore is actually close to w_r^{WF(1,2)}, the reference portion of WF(1, 2). The NLMS does not appear to track the optimal solution well in an absolute sense, since the a posteriori weight vector is not close to the hypothesized TV WF equivalent. However, the a posteriori NLMS weight vector still falls in the required manifold, as inferred from Figure 9.46, where it produces the desired a posteriori estimate. The difference between the actual and hypothesized weight vector behavior is transient in nature. Simultaneously, the a priori error has become large, as seen in Figure 9.42, because the a posteriori AF weight vector at time n is no longer close to the optimal TV target at time n + 1, as it lags behind by one time interval. The reference-only WF now performs better than its AF counterpart because the latter is subject to a large lag error, while the former is not. The key difference between NLMS for the scenario in Eqn. (9.76) versus the scenario in Eqn. (9.84) lies in the a priori weight vectors and the corresponding errors. While the a posteriori behaviors, in Figures 9.39 and 9.46, respectively, are very similar, the a priori errors are very different, as
shown in Figures 9.36 and 9.42, respectively. Figure 9.38 shows that the weights at time n are generally close to the weights at time n + 1 and vary about the reference portion of the Wiener solution, while in Figure 9.45 the weights at time n are not near the reference portion of the two-channel Wiener solution (and, in this case, also not near the WF(0, 2) solution).

Figure 9.43 AF(1, 2) and WF(1, 2) performance for the fast TV scenario: p_d = 0.99e^{jπ/3}, p_r = 0.99e^{j(π/3 + 0.50·2π)}, SNR_d = SNR_r = 80 dB.

Furthermore, the TV portion of w_TVWF(0,2) changes
its direction by 180° from one sample to the next. While the NLMS weight behavior
has the same features as its target solution, it is not tracking that target very well. The
fact that NLMS inherently lags one sample behind, since its tracking takes place a
posteriori, limits the parameterizations of the ANC scenario over which MSE
performance improvement can be observed.
9.8
After our detailed treatment of the non-Wiener behavior in the ANC cases of
Sections 9.5, 9.6, and 9.7, we can now more readily address the nonlinear effects
question for the ALP and AEQ cases. The major distinction from the ANC case lies
in the use of different auxiliary and/or reference processes. In the ALP case the
auxiliary vector contains the immediate past of the desired signal (as in the ANC
case), while the reference vector contains the far past of the desired signal. We have
seen in Section 9.4.5 that this had no impact on the form of the transfer function
model for NLMS. In the AEQ case the auxiliary vector contains the interference
signal, which is totally uncorrelated with the desired signal, and the reference vector
Figure 9.44 Real (a) and zoomed real (b) components of the AF(1, 2) and WF(1, 2) weights for the fast TV scenario of Figure 9.43.
Figure 9.45 Real (a) and imaginary (b) components of the AF(0, 2) weights for the fast TV scenario: p_d = 0.99e^{jπ/3}, p_r = 0.99e^{j(π/3 + 0.50·2π)}, SNR_d = SNR_r = 80 dB; w_AF(0,2)(n) (solid, varying), w_WFTV(0,2)(n) (dotted, varying), w_r^{WF(1,2)} (gray, constant).
Figure 9.46 A posteriori NLMS and hypothesized NLMS estimates for the fast TV scenario: p_d = 0.99e^{jπ/3}, p_r = 0.99e^{j(π/3 + 0.50·2π)}, SNR_d = SNR_r = 80 dB; desired (o), AF(0, 2) (*), WF(0, 2) (solid), w_WFTV(0,2) (.).
9.8.1
In the ALP scenario the input vector to the two-channel WF is as follows [2]:

u(n) = [ a(n) ; r(n) ].          (9.87)
The auxiliary vector a(n) is defined on the basis of the immediate past of the desired signal, while the reference vector contains the far past of the desired signal. This choice for the auxiliary vector is based on the knowledge that the best causal predictor for d(n) uses its most recent past:

a(n) = [ d(n−1) ; d(n−2) ; … ; d(n−L) ] ≡ d(n−1),

r(n) = [ d(n−D) ; d(n−D−1) ; … ; d(n−D−M+1) ] ≡ d(n−D).          (9.88)
At very high SNR, from Eqns. (9.2) and (9.3), the following relationship holds for an AR(1) desired process:

d(n) = p_d d(n−1) + v₀(n).          (9.89)
We recognize the first term on the right-hand side of Eqn. (9.89) to be the best one-step predictor for d(n) on the basis of its immediate past. That estimate engenders an MSE equal to the variance of v₀(n). If we use Eqn. (9.89) to replace d(n−1) on its right-hand side, we find the best two-step predictor, which engenders a larger MSE than the one-step predictor. Assuming that L = 1 and M = 2 in Eqn. (9.88), the
desired data can be written as having the following structure:

d(n) = p_d d(n−1) + v₀(n)
     = d̂(n) + v₀(n)
     = [ p_d  0  0 ] [ d(n−1) ; d(n−D) ; d(n−D−1) ] + v₀(n)
     = w_ar^H u(n) + v₀(n).          (9.90)
Since the variance of v₀(n) is the lowest possible MSE, a two-channel WF, of the form implied by the first right-hand term in Eqn. (9.90), would converge to the solution w_ar or its equivalent (if multiple solutions exist that produce the same MSE performance).
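The claim that the one-step predictor attains the innovation variance, and that the two-step predictor does worse, can be checked with a short Monte Carlo sketch (the pole value and unit innovation power are illustrative choices, not the chapter's scenario parameters):

```python
import numpy as np

rng = np.random.default_rng(1)
pd = 0.95 * np.exp(1j * np.pi / 3)       # AR(1) pole (illustrative value)
N = 200_000
# unit-variance complex innovation v0(n)
v = (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)

d = np.empty(N, dtype=complex)
d[0] = v[0]
for n in range(1, N):
    d[n] = pd * d[n - 1] + v[n]          # d(n) = p_d d(n-1) + v0(n)

e1 = d[1:] - pd * d[:-1]                 # one-step prediction error = v0(n)
e2 = d[2:] - pd**2 * d[:-2]              # two-step prediction error
print(np.var(e1))                        # ~1.0 (the innovation variance)
print(np.var(e2) > np.var(e1))           # True: two-step MSE is larger
```

The two-step error is v₀(n) + p_d v₀(n−1), with variance (1 + |p_d|²) times the innovation variance, which is the extra MSE the reference-only predictor must overcome.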
The earlier linking sequence concept will be used in order to see how a reference-only ALP can approach the performance associated with the optimal predictor. Based on Eqns. (9.87) and (9.88), the following linking sequences between desired and reference signals are of interest in the present case:
k^{(D−1)}(n−D) = d(n−1) / d(n−D),
k^{(D)}(n−D−1) = d(n−1) / d(n−D−1).          (9.91)
These linking sequences can be used to rewrite the optimal predictor from Eqn. (9.90):

d̂(n) = [ p_d  0  0 ] [ k^{(D−1)}(n−D) d(n−D) ; d(n−D) ; d(n−D−1) ]
      = [ p_d k^{(D−1)}(n−D)  0 ] [ d(n−D) ; d(n−D−1) ]
      = w̲_TVWF(0,2)^H(n) r(n).          (9.92)

Note that the end result represents a TV filter due to the linking sequence.
Alternatively, the optimal predictor can be rewritten as follows:

d̂(n) = [ p_d  0  0 ] [ k^{(D)}(n−D−1) d(n−D−1) ; d(n−D) ; d(n−D−1) ]
      = [ 0  p_d k^{(D)}(n−D−1) ] [ d(n−D) ; d(n−D−1) ]
      = w̄_TVWF(0,2)^H(n) r(n).          (9.93)
Consequently, the optimal predictor for the chosen scenario can be written as an affine linear combination of the above two TV equivalents to the optimal Wiener predictor:

d̂(n) = [ α w̲_TVWF(0,2)(n) + (1−α) w̄_TVWF(0,2)(n) ]^H r(n)
      = p_d [ α k^{(D−1)}(n−D)  (1−α) k^{(D)}(n−D−1) ] [ d(n−D) ; d(n−D−1) ]
      = w_TVWF(0,2)^H(n) r(n).          (9.94)
The particular behavior of this optimal predictor for the desired data, which can be interpreted as the closest thing to the structure of the desired data (meaning the lowest-MSE-producing model of any kind), depends on the behavior of the linking sequences.
Let h^{(D)}(n−D−1) denote the prediction error associated with predicting d(n−1) based on d(n−D−1), that is, a D-step predictor. The linking sequence behavior can then be written as follows:

k^{(D−1)}(n−D) = d(n−1) / d(n−D) = p_d^{D−1} + h^{(D−1)}(n−D) / d(n−D),
k^{(D)}(n−D−1) = d(n−1) / d(n−D−1) = p_d^{D} + h^{(D)}(n−D−1) / d(n−D−1).          (9.95)
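Eqn. (9.95) says each linking sequence hovers around a constant power of the pole, with excursions driven by the prediction error divided by the reference sample. A rough simulation is consistent with that; the pole, delay D, and use of the median (the ratio blows up near the zero crossings of d) are illustrative choices of this sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
pd = 0.99 * np.exp(1j * np.pi / 3)        # narrowband AR(1) pole (illustrative)
D = 3
N = 50_000
v = (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)

d = np.empty(N, dtype=complex)
d[0] = v[0] / np.sqrt(1 - abs(pd) ** 2)   # start near stationarity
for n in range(1, N):
    d[n] = pd * d[n - 1] + v[n]

# k^(D-1)(n-D) = d(n-1)/d(n-D): hovers around p_d^(D-1) except near
# small |d(n-D)| (the "zero crossings" of the narrowband process)
k = d[D - 1:-1] / d[: -D]                 # d(n-1)/d(n-D) for successive n
print(np.median(np.abs(k - pd ** (D - 1))))   # small for a narrowband process
```

The more narrowband the process (pole radius closer to one), the larger the typical |d(n−D)| relative to the prediction-error power, and the closer the linking sequence stays to its constant part.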
Substitution into the TV weight vector manifold, implied by the final equality in Eqn. (9.94), yields

w_TVWF(0,2)(n) = p_d* [ α k^{(D−1)}(n−D) ; (1−α) k^{(D)}(n−D−1) ]*
              = [ α ( p_d^D + p_d h^{(D−1)}(n−D)/d(n−D) ) ; (1−α) ( p_d^{D+1} + p_d h^{(D)}(n−D−1)/d(n−D−1) ) ]*.          (9.96)
With the reference-only input vector implied by Eqn. (9.94), that is,

r(n) = [ d(n−D) ; d(n−D−1) ],          (9.97)

the TI component of the weight vector manifold in Eqn. (9.96), obtained for α = 1 in the noise-free limit, is the D-step predictor

[ p_d^D ; 0 ]*.          (9.98)
The minimum-norm interpretation of NLMS, together with the time-varying nature of the structure that underlies the desired data, as given in Eqn. (9.94), gives NLMS the possibility of achieving a better predictor by combining D-step and (D+1)-step linear predictors, along the lines presented in Section 9.7.2 for the ANC case, in addition to the attempted tracking of the TV component of the data structure. Recall that, due to the equivalences above, the w_TVWF(0,2)(n) filter achieves the same minimal MSE as the TI WF(1, 2). However, the AF(0, 2) that aims to track w_TVWF(0,2)(n) will always be one step behind due to its a posteriori correction, and therefore will incur a tracking error in addition to misadjustment.
For the ALP scenario in Section 9.2.4, we showed in Section 9.4.3.3 that a substantial gap exists between the reference-only WF and the two-channel WF performance. The results in Figure 9.21 demonstrated the existence of nonlinear effects. For step-size μ = 0.7, which seems to be near-optimal for this scenario, the absolute errors of the WF(0, 2) and AF(0, 2) filters are compared in Figure 9.47 over the steady-state interval from iteration 4700 to 5000. We see that while the WF(0, 2) error fluctuates about its theoretically expected value, the AF(0, 2) error is generally less. The performance improvement realized by AF(0, 2) over this interval is 3.99 dB. For comparison, the performance improvement over MMSE WF(0, 2) realized by AF(1, 2) is 5.01 dB, while MMSE WF(1, 2) for this case is 7.53 dB better than MMSE WF(0, 2). Note that the AF(1, 2) performance suffers in this comparison because, for the step-size of 0.7, it incurs misadjustment error.
The behavior of the real and imaginary parts of the AF(0, 2) weights is shown in Figures 9.48a and 9.48b, respectively, for the same interval and realization reflected in Figure 9.47.
It is evident from Figures 9.47 and 9.48 that the performance improvement of AF(0, 2) over WF(0, 2) is paired with dynamic weight behavior. The TV WF equivalent to WF(1, 2), as given in Eqn. (9.96), suggests that further performance improvement would be obtained with AF(0, 2) if the TV aspect of TVWF(0, 2) were reduced. The latter resides in the prediction error variance, which can be reduced by making the process more narrowband. Repeating the above experiment with a pole radius of 0.99 rather than 0.95 produces an AF(0, 2) performance improvement of 3.93 dB over WF(0, 2). While this is slightly less than in the previous case, MMSE WF(1, 2) is now only 7.04 dB better than MMSE WF(0, 2), so that a larger fraction of the possible performance improvement has actually been realized. Figure 9.49 shows the error comparison. The AF(0, 2) error is observed to generally be less than the WF(0, 2) error, which conforms nicely to its expected value. The weight behaviors for this more narrowband ALP case are shown in Figures 9.50a and 9.50b. We observe that the time variation of the AF(0, 2) weights is less than it was for the earlier, wider-band ALP example. This behavior confirms that relatively
Figure 9.47 Absolute errors of the WF(0, 2) and AF(0, 2) filters over the steady-state interval for the ALP scenario.
9.8.2
As argued in Section 9.4.3.4, in the AEQ scenario the input vector to the two-channel WF is as follows:

u(n) = [ a(n) ; r(n) ].          (9.99)

The auxiliary vector a(n) is defined on the basis of the interference signal, while the reference vector contains the desired signal (QPSK in our example) additively contaminated by narrowband AR(1) interference and white Gaussian measurement noise. Recall that the interference is strong relative to the desired signal and that the measurement noise is weak relative to the desired signal. Our interest is in the center tap:
Figure 9.48 Real (a) and imaginary (b) parts of the AF(0, 2) weights for the ALP scenario.

a(n) = [ i(n−D+L̃) ; … ; i(n−D) ; … ; i(n−D−L̃) ],
r(n) = [ r(n−D+M̃) ; … ; r(n−D) ; … ; r(n−D−M̃) ].          (9.100)
The number of taps in the auxiliary and reference channels now satisfies the relations L = 2L̃ + 1 and M = 2M̃ + 1, respectively.
The choice of the interference as the auxiliary vector is based on the knowledge that the best estimate for the desired signal, x(n−D), results from removing the interference signal i(n−D) from the observed reference signal r(n−D). The latter reveals that the best structure to represent the underlying desired data is a two-channel structure [3, 4]:

x̂(n−D) = [ 0  −1  0   0  1  0 ] [ i(n) ; r(n) ]
        = −i(n−D) + r(n−D)
        = x(n−D) + v(n−D).          (9.101)
Note that in this ideal case, only the center elements of the auxiliary and reference
vectors are used. While this model is useful for guiding our direction, it is not
directly usable in practice, as the interference channel is not measurable in the AEQ
application. Nevertheless, the corresponding two-channel WF will provide an upper
bound on attainable performance, as it did in the ANC and ALP cases.
Figure 9.49 Error comparison of WF(0, 2) and AF(0, 2) for the narrowband ALP example.
Based on Eqn. (9.101), we can write the following structure for the desired signal:

x(n−D) = [ −0.9968  0  0.9968  0 ] [ i(n−D) ; r(n−D+1) ; r(n−D) ; r(n−D−1) ] + ε₁(n−D)
       = w_ir^H u(n) + ε₁(n−D).          (9.102)
The structure given in Eqn. (9.102) is the two-channel WF for the AEQ scenario of the previous sections, with p_i = 0.9999e^{jπ/3}, SNR = 25 dB, and SIR = −20 dB. For simplicity of representation, we have chosen L = 1 and M = 3. Note that the desired signal structure in Eqn. (9.102) is of the form of that in Eqn. (9.10), and it has been verified that the corresponding AF(1, 3) yields the corresponding target for small NLMS step-size, that is, weights converging to the TI WF(1, 3) weights and performance approaching the optimal MSE performance.
As was done for the ANC and ALP cases, we define a set of linking sequences in order to derive a WF equivalent to the above that uses reference inputs only. As
Figure 9.50 (a) Real part of AF(0, 2) weights for the narrowband ALP scenario.
dictated by the structure in Eqn. (9.102), the following linking sequences are defined:

ℓ^{(1)}(n−D+1) = i(n−D) / r(n−D+1),
ℓ^{(0)}(n−D) = i(n−D) / r(n−D),
ℓ^{(−1)}(n−D−1) = i(n−D) / r(n−D−1).          (9.103)
For any set of affine combination coefficients α = [α₁ α₀ α₋₁]^T, the interference sample can then be written as

i(n−D) = α₁ ℓ^{(1)}(n−D+1) r(n−D+1) + α₀ ℓ^{(0)}(n−D) r(n−D) + α₋₁ ℓ^{(−1)}(n−D−1) r(n−D−1),          (9.104)

Figure 9.50 (b) Imaginary part of AF(0, 2) weights for the narrowband ALP scenario.

provided the coefficients form an affine combination:

α^T [ 1 ; 1 ; 1 ] = α₁ + α₀ + α₋₁ = 1.          (9.105)
Substituting for i(n−D) in Eqn. (9.102), using Eqn. (9.104), yields the equivalent reference-only WF structure for the desired signal:

x(n−D) = w_TVWF(0,3)^H(n) r(n) + ε₁(n−D),

w_TVWF(0,3)(n) = 0.9968 [ −α₁ ℓ^{(1)}(n−D+1) ; 1 − α₀ ℓ^{(0)}(n−D) ; −α₋₁ ℓ^{(−1)}(n−D−1) ]*,          (9.106)

r(n) = [ r(n−D+1) ; r(n−D) ; r(n−D−1) ].          (9.107)
Using the AR(1) model for the interference, i(n−D) = p_i i(n−D−1) + v_i(n−D), the linking sequence toward the past evaluates as

ℓ^{(−1)}(n−D−1) = i(n−D) / r(n−D−1)
               = [ p_i i(n−D−1) + v_i(n−D) ] / [ i(n−D−1) + x(n−D−1) + v_r(n−D−1) ]
               = p_i + h^{(−1)}(n−D−1).          (9.108)

The same AR(1) relation can be used to write the past in terms of the future and an innovation term:

ℓ^{(1)}(n−D+1) = i(n−D) / r(n−D+1)
              = p_i^{−1} [ i(n−D+1) − v_i(n−D+1) ] / [ i(n−D+1) + x(n−D+1) + v_r(n−D+1) ]
              = p_i^{−1} + h^{(1)}(n−D+1).          (9.109)

Similarly, for the center tap,

ℓ^{(0)}(n−D) = i(n−D) / r(n−D) = 1 + h^{(0)}(n−D).          (9.110)
All three linking sequences have thus been written as a constant contaminated by a noise process, so that the TVWF(0, 3) in Eqn. (9.106) can be written in corresponding terms:

w_TVWF(0,3)(n) = 0.9968 [ −α₁ p_i^{−1} − α₁ h^{(1)}(n−D+1) ; 1 − α₀ − α₀ h^{(0)}(n−D) ; −α₋₁ p_i − α₋₁ h^{(−1)}(n−D−1) ]*.          (9.111)
Note that the constant terms in the above weight vector undergo a rotation that depends on the pole of the interference process. Generalizing the above, allowing M to increase, results in a TI weight component proportional to the pole of the interference process raised to a power equal to the distance of the element from the center tap. The effect of such a component, operating on the corresponding element of the reference vector, constitutes an estimate of the interference signal at the center tap. In fact, for a particular set of affine combination coefficients, the TI component coincides with the WF(0, M) solution.
In Figure 9.51 the WF(0, 51) weights are shown, together with the AF(0, 51) weights during steady-state iterations 5000 through 10,000. As in Section 9.2.3, an NLMS step-size of 0.8 is used.
We observe in Figure 9.51 that the weights do not change much over 5000 steady-state iterations (a uniformly spaced subset from 5000 successive iterations is overlaid). However, the AF(0, 51) weights do not coincide with the WF(0, 51) weights. As reported in Section 9.2.3, the performance of the AF is almost 5 dB better than the performance of the TI WF. If the experiment is repeated, the same behavior is observed, albeit centered about a different weight vector solution [4]. The latter shows that different solutions are appropriate, depending on the particular realization. An AF can converge to these appropriate solutions and track them. Recall that the step-size is a large 0.8, appropriate for tracking but not so appropriate for converging to a TI solution.
In Figures 9.52a and 9.52b the dynamic behavior of the real part of the AF(0, 51) weights is shown. The weights are seen to be changing in a slow, random-walk-like fashion, as predicted by the reference-only WF equivalent in Eqn. (9.111).
Figure 9.51 (a), (b) Real and imaginary parts of the WF(0, 51) and AF(0, 51) weights for the AEQ scenario.
While the very slowly varying weight behavior, for any given realization, almost suggests that a TI solution could be appropriate, using a time-averaged weight vector associated with one realization on a different realization generally results in MSE higher than that for the WF(0, 51) weights. Furthermore, as the best performance is realized at a large step-size, we must again reach the conclusion that it is the TV nature of NLMS that facilitates the tracking of the TV nature of the structure that underlies the desired data.
9.9 CONDITIONS FOR NONLINEAR EFFECTS IN
ANC APPLICATIONS
We now address the fundamental question as to the requisite conditions that lead to a significant nonlinear response when using an NLMS AF. In some applications the nonlinear effects are beneficial; in others they are not. As we have shown, however, the nonlinear effects can totally dominate performance in realistic conditions, and it is important to be able to predict when such nonlinear behavior is likely to occur.
9.9.1
Figure 9.51 (c) Zoomed-in view of real part of weights for the AEQ scenario.

In the sinusoidal scenarios treated in Section 9.6, the reference-only Wiener solution is the all-zero weight vector. In that case the MSE equals σ_d², the power in the desired signal. Consequently, any deviation in MSE from the desired signal power constitutes a nonlinear effect. The MSE in the exponential scenarios is completely governed by Eqn. (9.68). We define the normalized MSE σ̃_e² as follows:

σ̃_e² = σ_e² / σ_d² = |1 − e^{j(ω_d − ω_r)}|² / |1 − (1 − μ) e^{j(ω_d − ω_r)}|².          (9.112)
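Eqn. (9.112) is easy to evaluate directly. The sketch below checks two properties quoted in the surrounding discussion: for the 18° pole-angle difference, the normalized MSE decreases monotonically as μ grows toward 1, and at μ = 1 the reduction below σ_d² is about 10 dB (the step-size grid is an illustrative choice):

```python
import numpy as np

def normalized_mse(mu, dw):
    """sigma_e^2 / sigma_d^2 from Eqn. (9.112); dw = omega_d - omega_r."""
    num = abs(1 - np.exp(1j * dw)) ** 2
    den = abs(1 - (1 - mu) * np.exp(1j * dw)) ** 2
    return num / den

mu = np.linspace(0.05, 1.0, 20)
dw = 0.05 * 2 * np.pi                          # 18-degree frequency difference
nm = normalized_mse(mu, dw)
print(np.all(np.diff(nm) < 0))                 # True: MSE falls as mu grows toward 1
print(10 * np.log10(normalized_mse(1.0, dw)))  # about -10 dB at mu = 1
```

At μ = 1 the denominator is 1, so the normalized MSE reduces to |1 − e^{j(ω_d−ω_r)}|² = 2(1 − cos(ω_d − ω_r)), which shrinks as the pole angles approach each other.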
Figure 9.52 (a) Real part of center tap of AF(0, 51) for the AEQ scenario. (b) Real part of off-center taps of AF(0, 51) for the AEQ scenario.
For the ANC application, we have shown that the reference-only AF may
outperform the reference-only WF when there is a substantial gap in performance
between the reference-only WF and the two-channel WF (as analyzed in Section
9.4) and the TV equivalent to the two-channel LTI WF is substantially similar from
one time index to the next. The latter is a tracking condition, meaning that the a
priori AF weight vector is substantially in the direction of the a posteriori AF weight
vector (which is targeting the TV equivalent WF). The question now is whether we can predict when both of the former conditions will be satisfied. The analysis in Section 9.4 indicated that, generally, the processes need to be narrowband. The more narrowband the processes are, the better the MSE estimate defined by the transfer function model of Section 9.3.2 predicts the performance of NLMS. Consequently, in the narrowband ANC scenario, we may be able to use the MSE estimate from the transfer function model to determine when the reference-only AF is likely to outperform its LTI WF counterpart.
Each of the subsequent figures shows the same type of information for a variety of ANC scenarios. First, the solid black line at the bottom of each plot indicates the theoretical limit of performance, min MSE, which equals lim_{L,M→∞} MSE_WF(L, M). Above that are two sets of four graphs. The bottom set of four pertains to two-channel filters and the top set of four pertains to reference-only filters. The constant gray dot-dash line in the top set and the constant solid gray line in the bottom set show, respectively, the theoretical MSE expected for the M-tap reference-only WF, WF(0, M), and for the two-channel WF, WF(L, M). The gray symbols with bars indicate the mean and the 80 percent occurrence interval of the estimated MSE achieved by the designed WF(0, M) and WF(L, M) for 10 different realizations. Similarly, the black symbols and bars indicate the mean and the 80 percent occurrence interval of the estimated MSE achieved by AF(0, M) and AF(L, M) for the same realizations. The black nonconstant dotted and solid curves correspond to the MSE estimate evaluated according to the LTI model for reference-only NLMS (Section 9.3.2) and two-channel NLMS (Sections 9.4.4 and 9.4.5).
Figure 9.54a shows the results from 10 experiments for the scenario in Eqn. (9.76). Figure 9.54b shows the results for a comparable scenario after changing the SNRs to 20 dB. The MSE estimate from the transfer function model is shown to provide a good indication of the performance of the reference-only NLMS AF(0, 2), and an even better one for the two-channel NLMS AF(1, 2), for this case. The number of data points used in all of these simulation runs was 5000, explaining the relatively high MSE results for AF(1, 2) for small step-sizes, since the filter has not had sufficient time to converge at an SNR of 80 dB for μ < 0.5. The final 300 iterations were used to obtain the results for estimated MSE. Note that 5000 iterations provides for convergence at an SNR of 20 dB even at the smaller step-sizes. The performance of WF(0, 2) and WF(1, 2) is very close to their respective theoretically expected values. The nonlinear effects in AF(0, 2) are accurately predicted by the MSE estimate derived from the LTI transfer function model for NLMS AF(0, 2). From the difference between MMSE WF(0, 2) (top dot-dash line) and the MSE estimate for AF(0, 2) from the transfer function model (top dotted curve), we observe a maximum reduction in MSE (occurring for μ = 1), due to nonlinear effects, of about 9 dB. This figure is only slightly less than the 10 dB MSE reduction for the corresponding sinusoidal case (fifth curve from the bottom in Figure 9.53).
Note in Figure 9.54a, where SNR = 80 dB, that the MSE estimate variations are much larger for WF(0, 2) and AF(0, 2) than for WF(1, 2) and AF(1, 2). In the latter cases, the data pretty much satisfy the (1, 2) structure, thereby yielding MSE close to the minimum possible (with the higher result for AF(1, 2) due to misadjustment). For the (0, 2) cases the data no longer fit the model, which forces the error, engendered by the wrong model, to be higher. In Figure 9.54b, with SNR = 20 dB, both the
Figure 9.54 Simulation results and transfer function MSE estimates for the ANC scenario: p_d = 0.99e^{jπ/3}, p_r = 0.99e^{j(π/3 + 0.05·2π)}. (a) SNR_d = SNR_r = 80 dB. (b) SNR_d = SNR_r = 20 dB.
(1, 2) and (0, 2) variations are larger than in Figure 9.54a, due to the increased measurement noise; however, the increase is a relatively larger fraction of the randomness in the (1, 2) case.
Figure 9.55 provides a comparison of results that illustrate the effects of signal bandwidth, pole angle separation, and filter order. The signal bandwidth is reduced by approximately a factor of 10, the pole angle difference is decreased from 18° to 3°, and results are obtained for both (1, 2)- and (10, 25)-tap filters. Figure 9.55a reflects a narrower bandwidth than Figure 9.54b. The MSE estimate from the transfer function model shows similar reductions for both, about 9 dB, suggesting that MSE may not be very sensitive to bandwidth directly. In Figure 9.55b the frequency difference is smaller than in Figure 9.55a. We observe that the MSE estimate from the transfer function model is a good indicator of NLMS AF(0, 2) behavior and that the actual nonlinear effect comes within a few decibels of the lower bound MMSE. Another observation is that the nonlinear effect seems to saturate at approximately 10 dB and does so over a wide range of step-sizes. The maximum MSE reduction for the reference-only AF (over the reference-only WF) is approximately 17 dB, which is far short of the maximum 25 dB MSE reduction in the comparable exponential case. However, the latter would violate the absolute lower bound on MSE in the AR(1) situation. Another interesting observation linking the performance in the exponential scenario to the performance in the AR(1) scenario is that the shape of the MSE performance curves in Figures 9.54 and 9.55 is similar to the comparable ones for the exponential case in Figure 9.53.
The effect of increased orders, comparing Figures 9.55a and 9.55b with Figures 9.55c and 9.55d, seems to be mostly confined to the improved theoretical and actual performance of the (10, 25)-tap AF and WF. In each case, the absolute lower-bound performance is approximated more closely. The MSE estimate from the transfer function model again provides a good indicator of NLMS performance at both sets of filter orders and for both the single- and two-channel NLMS filters. An interesting observation, in the higher-order cases in Figures 9.55c and 9.55d, is that the transfer-function-model-based MSE estimate tends to overestimate the AF(10, 25) performance.
Figure 9.56 shows simulation results for the maximally TV scenario (pole angle difference of 180°) of Eqn. (9.84) and for SNRs of 80 and 20 dB. Note here how the reference-only MSE estimate from the transfer function model successfully indicates that NLMS performance will be worse than the corresponding WF performance. Recall that the transfer function model for MSE, as shown in Section 9.4.5, is entirely based on LTI system blocks. In the transfer function development there is no obvious connection to any TV behaviors. Again we observe that the nonlinear effect on MSE, an increase in this case, follows the shape of the curve for the exponential case, shown in Figure 9.53, for the corresponding parameterization. In fact, in this case, its magnitude is the same as well.
Figure 9.57 shows the performance results for the scenario in Eqn. (9.76) for (1, 2)-tap filters, but with a frequency difference of only 1.8°. At this small frequency difference, the MSE estimate from the transfer function model is saturated in Figure 9.57a. Figure 9.57b shows that when the bandwidth of the desired and reference processes is decreased, the saturation level of the MSE estimate from the transfer function model drops to about 25 dB below σ_d².
Figure 9.55 Simulation results for (1, 2)-tap [(a) and (b)] and (10, 25)-tap [(c) and (d)] filters: SNR_d = SNR_r = 20 dB. (a) p_d = 0.999e^{jπ/3}, p_r = 0.999e^{j(π/3 + 0.05·2π)}; (b) p_d = 0.999e^{jπ/3}, p_r = 0.999e^{j(π/3 + 2π/120)}; (c) p_d = 0.999e^{jπ/3}, p_r = 0.999e^{j(π/3 + 0.05·2π)}; (d) p_d = 0.999e^{jπ/3}, p_r = 0.999e^{j(π/3 + 2π/120)}.
Figure 9.55 (continued)
Figure 9.56 Order (1, 2) simulation results and TF MSE for the ANC scenario: p_d = 0.99e^{jπ/3}, p_r = 0.99e^{j(π/3 + 0.50·2π)}. (a) SNR_d = SNR_r = 80 dB; (b) SNR_d = SNR_r = 20 dB.
Figure 9.57 Simulation results and TF MSE for the modified Eqn. (9.76) scenario: SNR_d = SNR_r = 80 dB. (a) p_d = 0.99e^{jπ/3}, p_r = 0.99e^{j(π/3 + 0.005·2π)}; (b) p_d = 0.999e^{jπ/3}, p_r = 0.999e^{j(π/3 + 0.005·2π)}.
Note in Figure 9.57b that no AF convergence transients are observed, unlike with the earlier results at 80 dB SNR. The difference between the two behaviors lies in the starting weight vector. All earlier adaptive filters were started with the all-zero vector, while for illustrative purposes AF(1, 2) was started at WF(1, 2) to generate Figure 9.57b.
From the above simulation results, we observed that the MSE estimate from the transfer function model in the reference-only case tends to have the same behavior with step-size as the MSE for the exponential case, shown in Section 9.8.1. While in the noise-free exponential case the absolute lower bound on MSE equals zero, in the AR(1) case it is always strictly positive. In the AR(1) case the MSE estimates from the transfer function model, and actual performance, saturate at some level above the absolute lower bound for these WSS scenarios. The saturation phenomenon becomes more prominent as the (pole) frequency difference gets smaller. The level at which saturation occurs drops with reduction of the bandwidth of the reference and desired processes. The MSE performance results for the exponential case, together with the absolute lower bound on MSE, constitute a good indicator of performance for the reference-only AF.
9.9.3
As in the ANC case, a substantial gap between the reference-only and two-channel
WF performances is necessary for the reference-only AF to realize some of that
performance advantage. This was found to be the case in the examples provided in
Section 9.4.3.3.
For the examples used in Section 9.8.1, the performance can be summarized along the lines of Section 9.9.2. Figure 9.58 shows the various minimum, realized, and estimated MSEs for the ALP scenario of Section 9.8.1. We observe that the AF(0, 2) MSE performance is very much in line with the AF(0, 3) performance seen in Section 9.4.3.3 (Figure 9.21). A large gap is seen here between the WF(0, 2) and WF(1, 2) MMSEs. This condition is suggestive of AF(0, 2) performance improvement as long as the TV aspects of the equivalent TV WF(0, 2) can be tracked successfully. Clearly, some of the performance potential is being realized; in fact, about 2 dB out of the possible 7 dB.
In Section 9.8.1 we argued that the TV nature of the equivalent TV WF(0, 2) could be slowed by making the process more narrowband. Figure 9.59 shows the MSE performance for the narrowband ALP example of Section 9.8.1. We note in this case that, while the same absolute level of performance is reached, a larger fraction of the potential performance is now realized in going from WF(0, 2) to AF(0, 2). Approximately 4 dB of the maximum possible improvement of 7 dB is realized. As explained earlier, this is commensurate with a reduction in time variation for the reference-only equivalent WF.
(Figure 9.58: the minimum, realized, and estimated MSEs for the ALP scenario of Section 9.8.1.)

9.9.4

In Section 9.2.3 we showed that an AF(0, 51) could realize performance improvement over a WF(0, 51) in a narrowband interference-contaminated AEQ application. In Section 9.4.3.3 it was shown that an idealized two-channel WF could perform better than the WF(0, 51) for that scenario. The numerical results obtained
in Section 9.2.3 indeed reflected a performance improvement in AF(0, 51) over
WF(0, 51). The AF(0, 51) performance did not approach the performance of the
idealized two-channel WF.
In the idealized two-channel WF, the auxiliary channel contained the interference
signal itself. Consequently, the interference was provided to the WF(51, 51) without
error. In a somewhat more realistic scenario, the interference must be estimated,
thereby incurring estimation error. Since the interference model is known, its best
estimate is derived from its value at a tap next to the center tap by means of a one-step predictor. The latter would theoretically incur the innovation variance as prediction error variance. Therefore, in addition to simply subtracting the interference at the center tap, the observation noise variance is increased by the interference's innovation variance (both are white processes). This leads to a more
realistic performance bound, referred to as the ideal interference predictor. While,
again, the interference itself is not available for such a one-step predictor, the SIR
and SNR combine to make the interference the component that dominates the
reference signal. It seems not unreasonable, then, to substitute the reference signal
for use in interference prediction.
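This prediction-error floor is easy to see numerically. The sketch below uses hypothetical AR(1) parameters (not taken from the chapter) and checks that the optimal one-step predictor of an AR(1) process incurs exactly the innovation variance as its prediction error variance:

```python
import numpy as np

rng = np.random.default_rng(7)
a, innovation_std = 0.95, 0.3      # hypothetical AR(1) pole and innovation level
n = 500_000

w = innovation_std * rng.standard_normal(n)    # white innovation sequence
s = np.empty(n)
s[0] = w[0]
for i in range(1, n):
    s[i] = a * s[i - 1] + w[i]     # narrowband AR(1) "interference" process

pred = a * s[:-1]                  # optimal one-step predictor of s(n)
err_var = np.mean((s[1:] - pred) ** 2)
print(round(err_var, 3))           # close to the innovation variance 0.3**2 = 0.09
```

Because the prediction residual is exactly the innovation sequence, no predictor operating on past values alone can do better than this variance floor.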
The corresponding amendment of Figure 9.23 is shown in Figure 9.60.
(Figure 9.59: MSE performance for the narrowband ALP example of Section 9.8.1.)

9.10 SUMMARY
We have shown that nonlinear TV effects in AFs originate from the error feedback used in the weight update, as the error reflects the discrepancy between the desired data and the current model for that desired data. These TV effects have been shown to become prominent when applied in narrowband WSS ANC, ALP, and AEQ scenarios, particularly in cases where the spectral content of the reference and desired inputs to the filter are dissimilar. For such scenarios, it was shown that it is
For the ANC and ALP scenarios, we have shown that a manifold of optimal TV
reference-only WFs exists that forms the target for the a posteriori NLMS weight
vector. When the corresponding TV reference-only WF target weight vector can be
tracked reasonably well by NLMS, that is, when it is slowly TV, the AF may realize
a priori performance gain over the reference-only WF, which is TI. The conditions
under which nonlinear effects exist, as well as their magnitude, are given for
exponential ANC scenarios. For narrowband AR(1) ANC scenarios, we indicate
when prominent nonlinear effects can be expected. In the exponential ANC scenario
the linking sequence has constant amplitude and linear phase, while in the AR(1)
ANC scenario the linking sequence is at times nearly constant with linear phase.
Under this condition, the weight behavior is nearly periodic. It is also shown that the
linking sequence for the AR(1) input is subject to random fluctuations, which
become especially pronounced near zero crossings of the reference signal. The MSE
estimate provided by the linear TI transfer function model for NLMS provides a
good indication of performance.
The TV nonlinear effects observed in the narrowband interference-contaminated
AEQ scenario can be explained by the existence of a two-channel WF where the
auxiliary channel contains values of the narrowband interference. This forms an
upper bound on performance since the AF must generate interference channel
estimates solely from present and past values of the reference signal. In this case, the
nonlinear response is again shown to be associated with TV weight behavior.
However, there is now a TI component to the weights that dominates their
magnitudes.
ACKNOWLEDGMENT
The authors wish to express their sincere thanks to Ms. Rachel Goshorn of SSC for her gracious help, expertise, and effort in producing many of the figures in this chapter.

The first author acknowledges the support provided by the National Research Council, in awarding him a Senior Research Associateship at SPAWAR Systems Center, San Diego, during his Fall 2001 sabbatical there.
REFERENCES
1. A. A. (Louis) Beex, "Efficient generation of ARMA cross-covariance sequences," IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '85), pp. 327–330, March 26–29, 1985, Tampa, FL.
2. A. A. (Louis) Beex and James R. Zeidler, "Non-linear effects in adaptive linear prediction," Fourth IASTED International Conference on Signal and Image Processing (SIP 2002), pp. 21–26, August 12–14, 2002, Kauai, Hawaii.
3. A. A. (Louis) Beex and James R. Zeidler, "Data structure and non-linear effects in adaptive filters," 14th International Conference on Digital Signal Processing (DSP 2002), pp. 659–662, July 1–3, 2002, Santorini, Greece.
21. J. Han, J. R. Zeidler, and W. H. Ku, "Nonlinear effects of the LMS predictor for chirped input signals," EURASIP Appl. Signal Processing, Special Issue on Nonlinear Signal and Image Processing, Part II, pp. 21–29, January 2002.
22. M. Hayes, Statistical Digital Signal Processing and Modeling. Wiley, 1996.
23. S. Haykin, A. Sayed, J. R. Zeidler, P. Wei, and P. Yee, "Tracking of linear time-variant systems by extended RLS algorithms," IEEE Trans. Signal Processing, 45, 1118–1128, May 1997.
24. S. M. Kay, Modern Spectral Estimation: Theory and Applications. Prentice-Hall, 1988.
25. S. M. Kuo and D. R. Morgan, Active Noise Control Systems: Algorithms and DSP Implementations. New York: Wiley, 1996.
26. O. Macchi and N. J. Bershad, "Adaptive recovery of a chirped sinusoidal signal in noise: I. Performance of the RLS algorithm," IEEE Trans. Acoust., Speech, Signal Processing, ASSP-39, 583–594, March 1991.
27. O. Macchi, N. J. Bershad, and M. Mboup, "Steady state superiority of LMS over RLS for time-varying line enhancer in noisy environment," IEE Proc. F, 138, 354–360, August 1991.
28. J. E. Mazo, "On the independence theory of equalizer convergence," Bell Syst. Tech. J., 58, 963–993, May/June 1979.
29. D. R. Morgan and J. Thi, "A multi-tone pseudo-cascade filtered-X LMS adaptive notch filter," IEEE Trans. Signal Processing, 41, 946–956, February 1993.
30. S. Olmos and P. Laguna, "Steady-state MSE convergence of LMS adaptive filters with deterministic reference inputs with applications to biomedical signals," IEEE Trans. Signal Processing, 48, 2229–2241, August 2000.
31. K. J. Quirk, L. B. Milstein, and J. R. Zeidler, "A performance bound of the LMS estimator," IEEE Trans. Information Theory, 46, 1150–1158, May 2000.
32. M. Reuter, K. Quirk, J. Zeidler, and L. Milstein, "Nonlinear effects in LMS adaptive filters," Proceedings of Symposium 2000 on Adaptive Systems for Signal Processing, Communications and Control, pp. 141–146, 1–4 October 2000, Lake Louise, Alberta, Canada.
33. M. Reuter and J. R. Zeidler, "Nonlinear effects in LMS adaptive equalizers," IEEE Trans. Signal Processing, 47, 1570–1579, June 1999.
34. M. J. Shensa, "Non-Wiener solutions of the adaptive noise canceler with a noisy reference," IEEE Trans. Acoust., Speech, Signal Processing, ASSP-28, 468–473, August 1980.
35. D. T. M. Slock, "On the convergence behavior of the LMS and the normalized LMS algorithms," IEEE Trans. Signal Processing, ASSP-41, 2811–2825, September 1993.
36. S. A. Tretter, Introduction to Discrete-Time Signal Processing. Wiley, 1976.
37. H. L. Van Trees, Detection, Estimation, and Modulation Theory. Wiley, 1967.
38. B. Widrow, J. Glover, J. McCool, J. Kaunitz, C. Williams, R. Hearn, J. Zeidler, E. Dong, Jr., and R. Goodin, "Adaptive noise canceling: principles and applications," Proc. IEEE, 63, 1692–1716, December 1975.
39. B. Widrow, J. M. McCool, M. G. Larimore, and C. R. Johnson, Jr., "Stationary and nonstationary learning characteristics of the LMS adaptive filter," Proc. IEEE, 64, 1151–1162, August 1976.
10 ERROR WHITENING WIENER FILTERS: THEORY AND ALGORITHMS
10.1 INTRODUCTION
The mean-squared error (MSE) criterion has been the workhorse of linear optimization theory due to the simple and analytically tractable structure of linear least squares [16, 23]. In adaptive filter theory, the Wiener-Hopf equations are more commonly used owing to the extension of least squares to functional spaces proposed by Wiener [16, 23]. However, for finite impulse response filters (vector spaces) the two solutions coincide. There are a number of reasons behind the widespread use of the Wiener filter: Firstly, the Wiener solution provides the best possible filter weights in the least squares sense; secondly, there exist simple and elegant optimization algorithms like least mean squares (LMS), normalized least mean squares (NLMS), and recursive least squares (RLS) to find or closely track the Wiener solution in a sample-by-sample fashion, suitable for on-line adaptive signal processing applications [16]. There are also a number of important properties that help us understand the statistical behavior of the Wiener solution, namely, the orthogonality of the error signal to the input vector space and the whiteness of the prediction error signal for stationary inputs, provided that the filter is long enough [16, 23].
Least-Mean-Square Adaptive Filters, Edited by Simon Haykin and Bernard Widrow.
ISBN 0-471-21570-8 © 2003 John Wiley & Sons, Inc.

However, in a number of applications of practical importance, the error sequence produced by the Wiener filter is not white. One of the most important is the case of noisy inputs. In fact, it has long been recognized that these MSE-based filter optimization approaches are unable to produce the optimal weights associated with the noise-free input due to the biasing of the input covariance matrix [autocorrelation in the case of finite impulse response (FIR) filters] by the additive noise [33, 11]. Since noise is always present in real-world signals, the optimal filter weights offered by the MSE criterion
and associated algorithms are inevitably inaccurate; this might hinder the performance of engineering systems that require robust parameter estimates.

There are several techniques to suppress the bias in the MSE-based solutions in the presence of noisy training data [7, 42, 38, 18]. Total least squares (TLS) is one of the popular methods due to its principled way of eliminating the effect of noise on the optimal weight vector solution [17, 19, 20]. Major drawbacks of TLS are the requirements for accurate model order estimation, an identical noise variance in the input and desired signals, and the singular value decomposition (SVD) computations that severely limit its practical applicability [20, 33, 9, 11]. Total least squares is known to perform poorly when these assumptions are not satisfied [42, 33]. Another important class of algorithms that can effectively eliminate noise in the input data is subspace Wiener filtering [16, 23, 31]. Subspace approaches try to minimize the effect of noise on the solution by projecting the input data vector onto a lower-dimensional space that spans the input signal space. Traditional Wiener filtering algorithms are then applied to the projected inputs, which exhibit an improved signal-to-noise ratio (SNR). Many subspace algorithms are present in the literature; to mention all of them is beyond the scope of this chapter. The drawbacks of these methods include the need for proper model order estimation, increased computational requirements, and a sufficiently small noise power so that signal and noise can be discriminated during subspace dimensionality selection [31].
In this chapter, we will present a completely different approach to produce a (partially) white noise sequence at the output of Wiener filters in the presence of noisy inputs. We will approach the problem by introducing a new adaptation criterion that enforces zero autocorrelation of the error signal beyond a certain lag, hence the name error whitening Wiener filters (EWWF). Since we want to preserve the on-line properties of the adaptation algorithms, we propose to expand the error autocorrelation around a lag larger than the filter length using a Taylor series. Hence, instead of an error signal we end up with an error vector, with as many components as the terms kept in the Taylor series expansion. A schematic diagram of the proposed adaptation structure is depicted in Figure 10.1. The properties of this solution are very interesting, since it contains the Wiener solution as a special case, and for the case of two error terms, the same analytical tools developed for the Wiener filter can be applied with minor modifications. Moreover, when the input signal is contaminated with additive white noise, the EWWF produces the optimal
solution for the noise-free input signal, with the same computational complexity as the Wiener solution.

(Figure 10.1: Schematic diagram of the proposed adaptation structure.)
The organization of this chapter is as follows: First, we will present the motivation behind using the autocorrelation of the residual error signal in supervised training of Wiener filters. This will clearly demonstrate the reasoning behind the selected performance function, which will be called the error whitening criterion (EWC). Second, an analytical investigation of the mathematical properties of the EWWF and the optimal filter weight estimates will be presented. The optimal selection of parameters will be followed by demonstrations of the theoretical expectations on the noise-rejecting properties of the proposed solution through Monte Carlo simulations performed using analytical calculations of the necessary autocorrelation functions. Next, we will derive the recursive error whitening (REW) algorithm that finds the proposed error whitening Wiener filter solution using sample-by-sample updates in a fashion similar to the well-known RLS algorithm. This type of recursive algorithm requires $O(n^2)$ complexity in the number of weights. Finally, we address the issues with the development of a gradient-based algorithm for the EWWF. We will derive a gradient-based LMS-type update algorithm for the weights that will converge to the vicinity of the desired solution using stochastic updates. Theoretical bounds on the step size to guarantee convergence and comparisons with MSE counterparts will be provided.
10.2
The classical Wiener solution yields a biased estimate of the reference filter weight vector in the presence of input noise. This problem arises due to the contamination of the input signal autocorrelation matrix with that of the additive noise. If a signal is contaminated with additive white noise, only the zero-lag autocorrelation is biased by the amount of the noise power. Autocorrelations at all other lags still remain at their original values. This observation rules out MSE as a good optimization criterion for this case. In fact, since the error power is the value of the error autocorrelation function at zero lag, the optimal weights will be biased because they depend on the input autocorrelation values at zero lag. The fact that the autocorrelation at nonzero lags is unaffected by the presence of noise will prove useful in determining an unbiased estimate of the filter weights.
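This observation is easy to verify numerically. The following sketch (hypothetical signals, not from the chapter) colors white noise with a short FIR filter, adds white observation noise, and compares sample autocorrelations of the clean and noisy signals at a few lags:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Colored "clean" signal: white noise through a short FIR filter.
h = np.array([1.0, 0.5, 0.25])
e = rng.standard_normal(n)
clean = np.convolve(e, h, mode="full")[:n]
noise = 0.7 * rng.standard_normal(n)         # white, power 0.49
noisy = clean + noise

def autocorr(x, lag):
    """Sample autocorrelation E[x(n) x(n-lag)]."""
    if lag == 0:
        return np.mean(x * x)
    return np.mean(x[lag:] * x[:-lag])

# Lag 0 is shifted by (approximately) the noise power ...
bias = autocorr(noisy, 0) - autocorr(clean, 0)
print(round(bias, 2))                        # close to 0.49

# ... while higher lags are statistically unaffected.
for lag in (1, 2, 3):
    print(round(autocorr(noisy, lag) - autocorr(clean, lag), 2))
```

Only the zero-lag value moves by the noise power; the remaining lags differ from the clean values only by sampling fluctuation.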
10.2.1
The question that arises is, what lag should be used to obtain the true weight vector in the presence of white input noise? Let us consider the autocorrelation of the training error at nonzero lags. Suppose noisy training data of the form $(x(t), d(t))$ are provided, where $x(t) = \tilde{x}(t) + v(t)$ and $d(t) = \tilde{d}(t) + u(t)$, with $\tilde{x}(t)$ being the sample of the noise-free input vector at time $t$ (time is assumed to be continuous), $v(t)$ being the additive white noise vector on the input vector, $\tilde{d}(t)$ being the noise-free desired output, and $u(t)$ being the additive white noise on the desired output. Suppose that the true weight vector of the reference filter that generated the data is $w_T$ (moving
The observation that constraining the higher lags of the error autocorrelation function to zero yields unbiased weight solutions is quite significant. Moreover, the algorithmic structure of this new solution and the lag-zero MSE solution are still very similar. The noise-free case helps us understand why this similarity occurs. Suppose that the desired signal is generated by the following equation: $\tilde{d}(t) = \tilde{x}^T(t)\,w_T$, where $w_T$ is the true weight vector. Now multiply both sides by $\tilde{x}(t-\Delta)$ from the left and then take the expected value of both sides to yield $E[\tilde{x}(t-\Delta)\tilde{d}(t)] = E[\tilde{x}(t-\Delta)\tilde{x}^T(t)]\,w_T$. Similarly, we can obtain $E[\tilde{x}(t)\tilde{d}(t-\Delta)] = E[\tilde{x}(t)\tilde{x}^T(t-\Delta)]\,w_T$. Adding the corresponding sides of these two equations yields

$$E[\tilde{x}(t)\tilde{d}(t-\Delta) + \tilde{x}(t-\Delta)\tilde{d}(t)] = E[\tilde{x}(t)\tilde{x}^T(t-\Delta) + \tilde{x}(t-\Delta)\tilde{x}^T(t)]\,w_T. \qquad (10.2)$$
Now that we have described the structure of the solution, let us address the issue of training this new class of optimum filters that we call error whitening Wiener filters (EWWF). Adaptation exploits the sensitivity of the error autocorrelation with respect to the weight vector of the adaptive filter. We will formulate the solution in continuous time first for the sake of simplicity. If the support of the impulse response of the adaptive filter is of length $m$, we evaluate the derivative of the error autocorrelation function at lag $\Delta$ with respect to the weights, where $\Delta \ge m$. Assuming that the noises in the input and desired output are uncorrelated with each other and with the input signal, we get

$$\frac{\partial \rho_e(\Delta)}{\partial w} = \frac{\partial E[e(t)e(t-\Delta)]}{\partial w} = -2E[\tilde{x}(t)\tilde{x}^T(t-\Delta)]\,(w_T - w). \qquad (10.3)$$

The identity in (10.3) immediately tells us that the sensitivity of the error autocorrelation with respect to the weight vector becomes zero, that is, $\partial \rho_e(\Delta)/\partial w = 0$, if
(10.5)
Analyzing (10.5), we note another advantage of the Taylor series expansion, because the familiar MSE is part of the expansion. Note also that as one forces $\Delta \to L$, the MSE term will disappear and only the lag-$L$ error autocorrelation will remain. On the other hand, as $\Delta \to 0$, only the MSE term will prevail in the autocorrelation function approximation. Introducing more terms in the Taylor expansion will bring in error autocorrelation constraints from lags $iL$.
10.2.4 The EWC
$$J(w) = E[e^2(n)] + \beta E[\dot e^2(n)]. \qquad (10.6)$$

With the difference approximation $\dot e(n) \approx e(n) - e(n-L)$, this becomes

$$J(w) = (1 + 2\beta)E[e^2(n)] - 2\beta E[e(n)e(n-L)], \qquad (10.7)$$
which has the same form as (10.5). Note that when $\beta = 0$, we recover the MSE in (10.6) and (10.7). Similarly, we would have to select $\Delta = L$ in order to make the first-order expansion identical to the exact value of the error autocorrelation function. Substituting the identity $1 + 2\beta = (L - \Delta)/L$ and using $\Delta = L$, we observe that $\beta = -1/2$ eliminates the MSE term from the criterion. Interestingly, this value will appear in the following discussion, when we optimize $\beta$ in order to reduce the bias in the solution introduced by input noise.
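The equivalence of the two forms of the criterion can be checked directly. The sketch below (an arbitrary white error sequence, with the derivative ė(n) taken as the difference e(n) − e(n−L)) evaluates (10.6) and (10.7) numerically and also confirms that β = −1/2 leaves only the lag-L error autocorrelation:

```python
import numpy as np

rng = np.random.default_rng(1)
e = rng.standard_normal(100_000)       # an arbitrary error sequence
beta, L = -0.5, 4                      # beta = -1/2 is the MSE-cancelling choice

e0, eL = e[L:], e[:-L]                 # e(n) and e(n - L)
e_dot = e0 - eL                        # difference approximation of the derivative

J_6 = np.mean(e0**2) + beta * np.mean(e_dot**2)                    # form (10.6)
J_7 = (1 + 2*beta) * np.mean(e0**2) - 2*beta * np.mean(e0 * eL)    # form (10.7)

print(abs(J_6 - J_7) < 1e-3)           # the two forms agree (up to edge effects)

# With beta = -1/2 the E[e^2] term cancels: only the lag-L error
# autocorrelation survives in (10.7).
print(abs(J_6 - np.mean(e0 * eL)) < 1e-3)
```

The tiny discrepancy between the two forms comes only from the L boundary samples of the finite record.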
The same criterion can also be obtained by considering performance functions of the form

$$J(w) = E\left\|\,\big[\,e(n),\ \sqrt{\beta}\,\dot e(n),\ \sqrt{\gamma}\,\ddot e(n),\ \ldots\,\big]^T\right\|_2^2, \qquad (10.8)$$

where the coefficients $\beta$, $\gamma$, and so on are assumed to be positive. Note that (10.8) is the $L_2$ norm of a vector of criteria. The components of this vector consist of $e(n)$, $\dot e(n)$, $\ddot e(n)$, and so on. Due to the equivalence provided by the difference approximations for the derivative, these terms constrain the error autocorrelation at lags $iL$, as well as the error power, as seen in (10.8). The number of terms included in the Taylor series approximation for the error autocorrelation determines how many constraints are present in the vector of criteria. Therefore, the EWWF utilizes an error vector (see Fig. 10.1) instead of the scalar error signal utilized in the conventional Wiener filter. Our aim is to force the error signal as close as possible to becoming white (at lags exceeding the filter length), but these multiple lag options have not yet been investigated.
10.3
10.3.1
Suppose that noise-free training data of the form $(\tilde x(n), \tilde d(n))$, generated by a linear system with weight vector $w_T$ through $\tilde d(n) = \tilde x^T(n)\,w_T$, are provided. Assume without loss of generality that the adaptive filter and the reference filter are of the same length. This is possible since it is possible to pad $w_T$ with zeros if it is shorter than the adaptive filter. Therefore, the input vector $\tilde x(n) \in \mathbb{R}^m$, the weight vector $w_T \in \mathbb{R}^m$, and the desired output $\tilde d(n) \in \mathbb{R}$. The quadratic form in (10.6) defines the specific EWC we are interested in, and its unique stationary point gives the optimal solution for the EWWF. If $\beta \ge 0$, then this stationary point is a minimum. Otherwise, the Hessian of (10.6) might have mixed-sign eigenvalues or even all-negative eigenvalues. We demonstrate this fact with sample performance surfaces obtained for two-tap FIR filters using $\beta = -1/2$. For three differently colored training data, we obtain the EWC performance surfaces shown in Figure 10.2. In each row, the MSE performance surface, the EWC cost contour plot, and the EWC performance surface are shown for the corresponding training data. The eigenvalue pairs of the Hessian matrix of (10.6) are $(2.35, -0.30)$, $(6.13, 5.21)$, and $(-4.08, -4.14)$ for these representative cases in Figure 10.2. Clearly, it is possible for (10.6) to have a stationary point that is a minimum, a saddle point, or a maximum, and we start to see the differences brought about by the EWC. The performance surface is a weighted sum of paraboloids, which will complicate gradient-based adaptation but will not affect search algorithms utilizing curvature information.
10.3.2
Theorem 10.1 The stationary point of the EWC criterion (10.6) is given by

$$w_* = (\tilde R + \beta \tilde S)^{-1}(\tilde P + \beta \tilde Q), \qquad (10.9)$$

where we defined $\tilde R = E[\tilde x(n)\tilde x^T(n)]$, $\tilde S = E[\dot{\tilde x}(n)\dot{\tilde x}^T(n)]$, $\tilde P = E[\tilde x(n)\tilde d(n)]$, and $\tilde Q = E[\dot{\tilde x}(n)\dot{\tilde d}(n)]$.
Figure 10.2 The MSE performance surfaces, the EWC contour plot, and the EWC performance surface for three different training data sets and two-tap adaptive FIR filters.
Proof Substituting the proper variables in (10.6), we obtain the following explicit expression for $J(w)$:

$$J(w) = E[\tilde d^2(n)] + \beta E[\dot{\tilde d}^2(n)] + w^T(\tilde R + \beta \tilde S)w - 2(\tilde P + \beta \tilde Q)^T w. \qquad (10.10)$$

Equating the gradient of (10.10) with respect to $w$ to zero yields the stationary point

$$\Rightarrow\ w_* = (\tilde R + \beta \tilde S)^{-1}(\tilde P + \beta \tilde Q). \qquad (10.11)$$

Note that selecting $\beta = 0$ in (10.6) reduces the criterion to MSE and the optimal solution, given in (10.9), reduces to the Wiener solution. Thus, the Wiener filter is a special case of the EWWF solution (though not optimal for noisy inputs, as we will show later).
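As a quick numerical sanity check of Theorem 10.1 (a sketch with a made-up two-tap system; ẋ(n) and ḋ(n) are approximated by the differences x(n) − x(n−L) and d(n) − d(n−L)), the stationary point estimated from noise-free data recovers the true weight vector for any β, with β = 0 giving the Wiener solution:

```python
import numpy as np

rng = np.random.default_rng(2)
n, L = 50_000, 2
w_T = np.array([1.0, -0.6])            # hypothetical true weights

# AR(1) input so that all the involved matrices are well conditioned.
e = rng.standard_normal(n)
x = np.empty(n)
x[0] = e[0]
for i in range(1, n):
    x[i] = 0.5 * x[i - 1] + e[i]

X = np.column_stack([x[1:], x[:-1]])   # input vectors [x(n), x(n-1)]
d = X @ w_T                            # noise-free desired signal

Xdot, ddot = X[L:] - X[:-L], d[L:] - d[:-L]   # difference approximations
R, P = X.T @ X / len(d), X.T @ d / len(d)
S, Q = Xdot.T @ Xdot / len(ddot), Xdot.T @ ddot / len(ddot)

for beta in (0.0, -0.5, 0.3):
    w_star = np.linalg.solve(R + beta * S, P + beta * Q)
    print(beta, np.allclose(w_star, w_T, atol=1e-6))   # recovers w_T for any beta
```

In the noise-free case the recovery is exact (up to floating-point effects) because $P = Rw_T$ and $Q = Sw_T$ hold sample-by-sample.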
Corollary 1
(10.12)
10.3.3
Now, suppose that we are given noisy training data $(x(n), d(n))$, where $x(n) = \tilde x(n) + v(n)$ and $d(n) = \tilde d(n) + u(n)$. The additive noises on both signals are zero-mean, and uncorrelated with each other and with the input and desired signals. Assume that the additive noise $u(n)$ on the desired output is white (in time), and let the autocorrelation matrices of $v(n)$ be $V = E[v(n)v^T(n)]$ and $V_L = E[v(n-L)v^T(n) + v(n)v^T(n-L)]$. Under these circumstances, we have to estimate the necessary matrices to evaluate (10.9) using noisy data. These matrices evaluated using noisy data, $R$, $S$, $P$, and $Q$, will become (see Appendix B for details)

$$R = E[x(n)x^T(n)] = \tilde R + V,$$
$$S = E[(x(n)-x(n-L))(x(n)-x(n-L))^T] = 2(\tilde R + V) - \tilde R_L - V_L, \qquad (10.14)$$
$$P = E[x(n)d(n)] = \tilde P, \qquad Q = E[(x(n)-x(n-L))(d(n)-d(n-L))] = \tilde Q,$$

where $\tilde R_L = E[\tilde x(n)\tilde x^T(n-L) + \tilde x(n-L)\tilde x^T(n)]$ denotes the symmetrized lag-$L$ autocorrelation matrix of the noise-free input. The corresponding optimal weight vector obtained with the noisy data is then

$$w_* = (R + \beta S)^{-1}(P + \beta Q). \qquad (10.15)$$
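The practical consequence, that β = −1/2 removes the noise-induced bias, can be illustrated with a sketch (hypothetical two-tap system and noise powers; per-entry white input noise so that V_L = 0, and derivatives replaced by lag-L differences). The Wiener solution (β = 0) estimated from the noisy data is visibly biased, while the EWC solution at β = −1/2 lands near the true weights:

```python
import numpy as np

rng = np.random.default_rng(3)
n, L = 1_000_000, 2
w_T = np.array([1.0, -0.6])                 # hypothetical true weights

e = rng.standard_normal(n)
xc = np.empty(n)
xc[0] = e[0]
for i in range(1, n):
    xc[i] = 0.5 * xc[i - 1] + e[i]          # clean AR(1) input
Xc = np.column_stack([xc[1:], xc[:-1]])

X = Xc + rng.standard_normal(Xc.shape)      # white noise added to the input ...
d = Xc @ w_T + 0.5 * rng.standard_normal(len(Xc))  # ... and to the desired signal

m = len(d)
Xdot, ddot = X[L:] - X[:-L], d[L:] - d[:-L]
R, P = X.T @ X / m, X.T @ d / m
S, Q = Xdot.T @ Xdot / (m - L), Xdot.T @ ddot / (m - L)

w_wiener = np.linalg.solve(R, P)                     # beta = 0
w_ewc = np.linalg.solve(R - 0.5 * S, P - 0.5 * Q)    # beta = -1/2

print(np.linalg.norm(w_wiener - w_T))   # large: the Wiener weights are biased
print(np.linalg.norm(w_ewc - w_T))      # much smaller: bias largely removed
```

The residual EWC error here is purely finite-sample fluctuation; it shrinks with the record length, while the Wiener bias does not.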
10.4
10.4.1
An important question regarding the behavior of the optimal solution obtained using the EWC criterion is the relationship between the residual error signal and the input vector. In the case of MSE, we know that the Wiener solution results in an error orthogonal to the input signal, that is, $E[e(n)x(n)] = 0$ [16, 23]. Similarly, we can determine what the EWC criterion will achieve.

Lemma 3 At the optimal solution of EWC, the error and the input random processes satisfy $\beta E[e(n)x(n-L) + e(n-L)x(n)] = (1+2\beta)E[e(n)x(n)]$ for all $L \ge m$.

Proof We know that the optimal solution of EWC for any $L \ge m$ is obtained when the gradient of the cost function with respect to the weights is zero. Therefore,

$$\frac{\partial J}{\partial w} = -2E[e(n)x(n)] - 2\beta E[(e(n)-e(n-L))(x(n)-x(n-L))] = -2\big\{(1+2\beta)E[e(n)x(n)] - \beta E[e(n)x(n-L) + e(n-L)x(n)]\big\} = 0. \qquad (10.16)$$
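Lemma 3 can be checked numerically on synthetic data (a hypothetical regression setup; derivatives again replaced by lag-L differences): solving for the EWC stationary point and evaluating both sides of the stated identity gives matching vectors:

```python
import numpy as np

rng = np.random.default_rng(4)
n, L, beta = 200_000, 3, -0.4     # any beta, with L >= the filter length of 3

# Arbitrary (hypothetical) linear-regression data with 3-tap input vectors.
u = rng.standard_normal(n + 2)
X = np.column_stack([u[2:], u[1:-1], u[:-2]])
d = X @ np.array([0.7, 0.2, -0.1]) + 0.3 * rng.standard_normal(n)

Xdot, ddot = X[L:] - X[:-L], d[L:] - d[:-L]
R, P = X.T @ X / n, X.T @ d / n
S, Q = Xdot.T @ Xdot / (n - L), Xdot.T @ ddot / (n - L)
w = np.linalg.solve(R + beta * S, P + beta * Q)    # EWC optimal weights

e = d - X @ w
lhs = beta * np.mean(e[L:, None] * X[:-L] + e[:-L, None] * X[L:], axis=0)
rhs = (1 + 2 * beta) * np.mean(e[:, None] * X, axis=0)
print(np.allclose(lhs, rhs, atol=1e-3))
```

Note that for β ≠ 0 the plain orthogonality E[e(n)x(n)] = 0 no longer holds; it is replaced by this lagged balance condition.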
Another interesting property that the EWWF solution exhibits is its relationship with entropy. Notice that when $\beta < 0$, the optimization rule tries to minimize MSE, yet it simultaneously tries to maximize the separation between samples of errors. We could regard the sample separation as an estimate of the error entropy. In fact, the entropy estimation literature is full of methods based on sample separations [39, 5, 21, 3, 24, 2, 40]. Specifically, the case $\beta = -1/2$ finds the perfect balance between entropy and MSE that allows us to eliminate the effect of noise on the solution. Recall that the Gaussian density displays maximum entropy among distributions of fixed variance. In light of this fact, the aim of the EWWF could be understood as finding the minimum error variance solution while keeping the error close to Gaussian. Note that, due to the central limit theorem, the error signal will be closely approximated by a Gaussian density when there is a large number of taps.
10.4.3
Model order selection is another important issue in adaptive filter theory. The purpose of an adaptive filter is to find the right balance between approximating the training data as accurately as possible and generalizing to unseen data with precision [6]. One major cause of poor generalization is known to be excessive model complexity [6]. Under these circumstances, the designer's aim is to determine the least complex adaptive system (which translates into a smaller number of weights in the case of linear systems) that minimizes the approximation error. Akaike's information criterion [1] and Rissanen's minimum description length [36] are two important theoretical results regarding model order selection. Such methods require the designer to evaluate an objective function, which is a combination of MSE and the filter length or the filter weights, using different lengths of adaptive filters. The EWC criterion successfully determines the length of the true filter (assumed FIR), even in the presence of additive noise, provided that the trained adaptive filter is sufficiently long. In the case of an adaptive filter longer than the reference filter, the additional taps will decay to zero, indicating that a smaller filter is sufficient to model the data. This is exactly what we would like an automated regularization algorithm to achieve: determining the proper length of the filter without requiring external discrete modifications of this parameter. Therefore, EWC extends the
The effect of the cost function's free parameter $\beta$ on the accuracy of the solution (compared to the true weight vector that generated the training data) is another crucial issue. In fact, it is possible to determine the dynamics of the weight error as a function of $\beta$. This result is provided in the following lemma.
Lemma 4 (The Effect of $\beta$ on the EWWF) In the noisy training data case, the derivative of the error vector between the optimal EWC solution and the true weight vector, that is, $\hat\varepsilon_* = \hat w_* - w_T$, with respect to $\beta$ is given by

$$\frac{\partial \hat\varepsilon_*}{\partial \beta} = -\big[(1+2\beta)(R+V) - \beta R_L\big]^{-1}\big[(2R - R_L)\hat\varepsilon_* - R_L w_T\big]. \qquad (10.17)$$
the noisy training data case. This section demonstrates these theoretical results in
numerical case studies with Monte Carlo simulations.
Given the scheme depicted in Figure 10.3, it is possible to determine the true analytic auto- and cross-correlations of all signals of interest in terms of the filter coefficients and the noise powers. Suppose that $\xi$, $v$, and $u$ are zero-mean white noise signals with powers $\sigma_x^2$, $\sigma_v^2$, and $\sigma_u^2$, respectively. Suppose that the coloring filter $h$ and the mapping filter $w$ are unit norm. Under these conditions, we obtain

$$E[\tilde x(n)\tilde x(n+\Delta)] = \sigma_x^2 \sum_{j=0}^{M} h_j h_{j+\Delta}, \qquad (10.18)$$

$$E[(\tilde x(n)+\tilde v(n))(\tilde x(n+\Delta)+\tilde v(n+\Delta))] = \begin{cases} \sigma_x^2 + \sigma_v^2, & \Delta = 0, \\ E[\tilde x(n)\tilde x(n+\Delta)], & \Delta \ne 0, \end{cases} \qquad (10.19)$$

$$E[(\tilde x(n)+\tilde v(n))\,\hat d(n+\Delta)] = \sigma_v^2 w_\Delta + \sum_{l=0}^{N} w_l\, E[\tilde x(n)\tilde x(n+\Delta-l)]. \qquad (10.20)$$
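Equation (10.18) is straightforward to verify against a simulated realization; the sketch below (a made-up unit-norm coloring filter, not the one used in the chapter's experiments) compares the analytic autocorrelation with sample estimates:

```python
import numpy as np

rng = np.random.default_rng(5)
sigma_x2 = 1.0
h = np.array([0.6, 0.5, 0.4, 0.35])
h = h / np.linalg.norm(h)              # unit-norm coloring filter, as assumed

# Analytic autocorrelation of the colored signal, eq. (10.18):
def rho_analytic(delta):
    return sigma_x2 * sum(h[j] * h[j + delta] for j in range(len(h) - delta))

# Empirical check on a long realization.
xi = rng.standard_normal(2_000_000)
x = np.convolve(xi, h, mode="full")[: len(xi)]

for delta in range(len(h)):
    emp = np.mean(x[delta:] * x[: len(x) - delta])
    print(delta, round(rho_analytic(delta), 4), round(emp, 4))
```

The same convolution-of-taps argument underlies (10.20), where the mapping filter w plays the role of the second filter.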
For each combination of SNR from $\{-10\ \mathrm{dB}, 0\ \mathrm{dB}, 10\ \mathrm{dB}\}$, $\beta$ from $\{-0.5, -0.3, 0, 0.1\}$, $m$ from $\{2, \ldots, 10\}$, and $L$ from $\{m, \ldots, 20\}$, we have performed 100 Monte Carlo simulations using randomly selected 30-tap FIR coloring and $m$-tap mapping filters. The length of the mapping filters and that of the adaptive filters were selected to be equal in every case. In all simulations, we used an input signal power of $\sigma_x^2 = 1$, and the noise powers $\sigma_v^2 = \sigma_u^2$ are determined from the given SNR using $\mathrm{SNR} = 10\log_{10}(\sigma_x^2/\sigma_v^2)$. The matrices $R$, $S$, $P$, and $Q$, which are necessary to evaluate the optimal solution given by (10.15), are then evaluated analytically using (10.18), (10.19), and (10.20). The results obtained are summarized in Figure 10.4 and Figure 10.5, where for the three SNR levels selected, the average squared error norm for the optimal solutions (in reference to the true weights) is given as a function of $L$ and $m$ for different $\beta$ values. In Figure 10.4, we present the average normalized weight vector error norm obtained using EWC at different SNR levels and using different $\beta$ values as a function of the correlation lag $L$ that is used in the

Figure 10.3 Demonstration scheme with coloring filter $h$, true mapping filter $w$, and the uncorrelated white signals $\xi$, $\tilde v$, and $\hat u$.
criterion. The filter length is 10 in these results. From the theoretical analysis, we know that if the input autocorrelation matrix is invertible, then the solution accuracy should be independent of the autocorrelation lag $L$. The results of the Monte Carlo simulations presented in Figure 10.4 conform to this fact. As expected, the optimal choice of $\beta = -1/2$ determined the correct filter weights exactly.

Another set of results, presented in Figure 10.5, shows the effect of filter length on the accuracy of the solutions provided by the EWC criterion. The optimal value of $\beta = -1/2$ always yields the perfect solution, whereas the accuracy of the optimal weights degrades as this parameter is increased towards zero (i.e., as the weights approach the Wiener solution). An interesting observation from Figure 10.5 is that for SNR levels below zero, the accuracy of the solutions using suboptimal $\beta$ values increases, whereas for SNR levels above zero, the accuracy decreases when the filter length is increased. For zero SNR, on the other hand, the accuracy seems to be roughly unaffected by the filter length.
The Monte Carlo simulations performed in the preceding examples utilized the exact coloring filter and the true filter coefficients to obtain the analytical solutions. In our final case study, we demonstrate the performance of the batch solution of the EWC criterion obtained from sample estimates of all the relevant auto- and cross-correlation matrices. In these Monte Carlo simulations, we utilize 10,000 samples corrupted with white noise at various SNR levels. The results of these Monte Carlo simulations are summarized in the histograms shown in Figure 10.6. Each subplot of Figure 10.6 corresponds to experiments performed using SNR levels of $-10$ dB, 0 dB, and 10 dB for each column and adaptive filter lengths of 4 taps, 8 taps, and 12

Figure 10.4 The average squared error norm of the optimal weight vector as a function of autocorrelation lag $L$ for various $\beta$ values and SNR levels.
Figure 10.5 The average squared error norm of the optimal weight vector as a function of filter length $m$ for various $\beta$ values and SNR levels.

taps for each row, respectively. For each combination of SNR and filter length, we performed 50 Monte Carlo simulations using the MSE ($\beta = 0$) and EWC ($\beta = -1/2$) criteria. The correlation lag is selected to be equal to the filter length in all simulations due to Theorem 10.2. Clearly, Figure 10.6 demonstrates the superiority of the EWC in rejecting noise that is present in the training data. Note that in all subplots (i.e., for all combinations of filter length and SNR), EWC achieves a smaller average error norm than MSE. The discrepancy between the performances of the two solutions intensifies with increasing filter length. Next, we demonstrate the error-whitening property of the proposed EWC solutions.
From (10.1) we can expect that the error autocorrelation function will vanish at lags greater than or equal to the length of the reference filter if the weight vector is identical to the true weight vector. For any other value of the weight vector, the error autocorrelation fluctuates at nonzero values. A four-tap reference filter is identified with a four-tap adaptive filter using noisy training data (hypothetical) at an SNR level of 0 dB. The autocorrelation functions of the error signals corresponding to the MSE solution and the EWC solution are shown in Figure 10.7. Clearly, the EWC criterion determines a solution that forces the error autocorrelation function to zero at lags greater than or equal to the filter length (partial whitening of the error).
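The partial-whitening effect can be reproduced in a small sketch (a hypothetical two-tap system rather than the chapter's four-tap example): the error autocorrelation of the EWC solution at lags ≥ L is driven close to zero, while the MSE solution leaves clearly nonzero error correlation:

```python
import numpy as np

rng = np.random.default_rng(6)
n, L = 400_000, 2
w_T = np.array([1.0, -0.6])                  # hypothetical true weights

e0 = rng.standard_normal(n)
xc = np.empty(n)
xc[0] = e0[0]
for i in range(1, n):
    xc[i] = 0.5 * xc[i - 1] + e0[i]          # clean AR(1) input
Xc = np.column_stack([xc[1:], xc[:-1]])

X = Xc + rng.standard_normal(Xc.shape)       # noisy input
d = Xc @ w_T + 0.5 * rng.standard_normal(len(Xc))   # noisy desired signal

m = len(d)
Xdot, ddot = X[L:] - X[:-L], d[L:] - d[:-L]
R, P = X.T @ X / m, X.T @ d / m
S, Q = Xdot.T @ Xdot / (m - L), Xdot.T @ ddot / (m - L)

w_mse = np.linalg.solve(R, P)                        # Wiener (beta = 0)
w_ewc = np.linalg.solve(R - 0.5 * S, P - 0.5 * Q)    # EWC (beta = -1/2)

def err_autocorr(w, lag):
    e = d - X @ w
    return np.mean(e[lag:] * e[:-lag])

for lag in (L, L + 1, L + 2):
    print(lag, round(err_autocorr(w_mse, lag), 3), round(err_autocorr(w_ewc, lag), 3))
```

The MSE residual stays correlated because its biased weights leave a colored signal component in the error; the EWC residual at lags ≥ L is whitened down to sampling noise.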
Finally, we address the order selection capability and demonstrate how the EWC
criterion can be used to determine the correct lter order, even with noisy data,
provided that the given input desired output pair is a moving average process. For
this purpose, we determine the theoretical Wiener and EWC (with b 1=2 and
Figure 10.6 Histograms of the weight error norms (dB) obtained in 50 Monte Carlo simulations using 10,000 samples of noisy data using MSE (empty bars) and EWC with β = -1/2 (full bars). The subfigures in each row use filters with 4, 8, and 12 taps, respectively. The subfigures in each column use noisy samples at -10, 0, and 10 dB SNR, respectively.
Figure 10.7 Error autocorrelation function for MSE (dotted) and EWC (solid) solutions.
L = m, where m is the length of the adaptive filter) solutions for a randomly selected pair of coloring filter h and mapping filter w at different adaptive filter lengths. The noise level is selected to be -20 dB, and the length of the true mapping filter is 5. We
know from our theoretical analysis that if the adaptive filter is longer than the reference filter, the EWC will yield the true weight vector padded with zeros. This will not change the MSE of the solution. Thus, if we plot the MSE of the EWC versus the length of the adaptive filter, starting from the length of the actual filter, the MSE of the EWC solution will remain flat, whereas the Wiener solution will keep decreasing the MSE, contaminating the solution by learning the noise in the data. Figure 10.8a shows the MSE of the Wiener solution as well as the EWC obtained for different lengths of the adaptive filter using the same training data described above. Note (in the zoomed-in portion) that the MSE of the EWC remains constant starting from 5, which is the filter order that generated the data. On the other hand, if we were to decide on the filter order by looking at the MSE of the Wiener solution, we would select a model order of 4, since the gain in MSE is insignificantly small compared to the previous steps from this point on.
Figure 10.8b shows the norm of the weight vector error for the solutions obtained using the EWC and MSE criteria, which confirms that the true weight vector is indeed attained with the EWC criterion once the proper model order is reached.
This section aimed at experimentally demonstrating the theoretical concepts set forth in the preceding sections of the chapter. We have demonstrated with numerous Monte Carlo simulations that the analytical solution of the EWC criterion eliminates the effect of noise completely if the proper value is used for β. We have also demonstrated that the batch solution of EWC (estimated from a finite number of samples) outperforms MSE in the presence of noise, provided that a sufficient
Figure 10.8 Model order selection using the EWC criterion: (a) MSE, $E[e^2(n)]$, of the EWWF (solid) and the Wiener solutions (dotted) versus filter length. (b) Norm of the weight vector error as a function of filter length for the EWWF (solid) and Wiener solutions (dotted).
number of samples is given so that the noise autocorrelation matrices diminish as required by the theory.
Although we have presented a complete theoretical investigation of the proposed criterion and its analytical solution, in practice, on-line algorithms that operate on a sample-by-sample basis to determine the desired solution are equally valuable. Therefore, in the sequel, we will focus on designing computationally efficient on-line algorithms to solve for EWC in a fashion similar to the well-known LMS and RLS algorithms. In fact, we aim to come up with algorithms that have the same computational complexity as these two widely used algorithms. The advantage of the new algorithms will be their ability to provide better estimates of the model weights when the training data are contaminated with white noise.
10.6 The Recursive Error Whitening (REW) Algorithm
In this section, we will present an on-line recursive algorithm to estimate the optimal solution for the EWC. Given the estimate of the filter tap weights at time instant (n - 1), the goal is to determine the best set of tap weights at the next iteration n that would track the optimal solution. This algorithm, which we call recursive error whitening (REW), is similar to recursive least squares (RLS). The strongest motivation behind proposing the REW algorithm is that it is truly a fixed-point-type algorithm that tracks, at each iteration, the optimal solution.
This tracking nature results in the faster convergence of the REW algorithm [34]. This, however, comes at an increase in the computational cost. The REW algorithm is O(m²) in complexity (the same as the RLS algorithm), and this is a substantial increase in complexity when compared with the simple gradient methods that will be discussed in a later section. We know that the optimal solution for the EWC is given by
$\mathbf{w}_* = (\mathbf{R} + \beta\mathbf{S})^{-1}(\mathbf{P} + \beta\mathbf{Q}).$   (10.21)
Defining $\mathbf{T}(n)$ as the sample estimate of $\mathbf{R} + \beta\mathbf{S}$ and $\mathbf{V}(n)$ as the sample estimate of $\mathbf{P} + \beta\mathbf{Q}$, the corresponding rank-two recursions can be written as

$\mathbf{T}(n) = \mathbf{T}(n-1) + \mathbf{B}(n)\mathbf{D}^T(n), \qquad \mathbf{V}(n) = \mathbf{V}(n-1) + \mathbf{B}(n)[d(n) \quad d(n) - \beta d(n-L)]^T,$   (10.23)

where $\mathbf{D}(n) = [\mathbf{x}(n) \quad \mathbf{x}(n) - \beta\mathbf{x}(n-L)]$ and $\mathbf{B}(n) = [2\beta\mathbf{x}(n) - \beta\mathbf{x}(n-L) \quad \mathbf{x}(n)]$, so that the weight estimate at time n is

$\mathbf{w}(n) = \mathbf{T}^{-1}(n)\mathbf{V}(n).$   (10.26)
We will define a gain matrix analogous to the gain vector in the RLS case [23] as

$\boldsymbol{\kappa}(n) = \mathbf{T}^{-1}(n-1)\mathbf{B}(n)\left[\mathbf{I}_{2\times 2} + \mathbf{D}^T(n)\mathbf{T}^{-1}(n-1)\mathbf{B}(n)\right]^{-1}.$   (10.27)

Using the above definition, the recursive estimate for the inverse of $\mathbf{T}(n)$ becomes

$\mathbf{T}^{-1}(n) = \mathbf{T}^{-1}(n-1) - \boldsymbol{\kappa}(n)\mathbf{D}^T(n)\mathbf{T}^{-1}(n-1).$   (10.28)
Once again, the above equation is analogous to the Riccati equation for the RLS algorithm. Multiplying (10.27) from the right by $[\mathbf{I}_{2\times 2} + \mathbf{D}^T(n)\mathbf{T}^{-1}(n-1)\mathbf{B}(n)]$, we obtain

$\boldsymbol{\kappa}(n)\left[\mathbf{I}_{2\times 2} + \mathbf{D}^T(n)\mathbf{T}^{-1}(n-1)\mathbf{B}(n)\right] = \mathbf{T}^{-1}(n-1)\mathbf{B}(n)$
$\boldsymbol{\kappa}(n) = \mathbf{T}^{-1}(n-1)\mathbf{B}(n) - \boldsymbol{\kappa}(n)\mathbf{D}^T(n)\mathbf{T}^{-1}(n-1)\mathbf{B}(n) = \mathbf{T}^{-1}(n)\mathbf{B}(n).$   (10.29)
In order to derive an update equation for the filter weights, we substitute the recursive estimate for $\mathbf{V}(n)$ in (10.26):

$\mathbf{w}(n) = \mathbf{T}^{-1}(n)\mathbf{V}(n-1) + \mathbf{T}^{-1}(n)\left[(1+2\beta)d(n)\mathbf{x}(n) - \beta d(n)\mathbf{x}(n-L) - \beta d(n-L)\mathbf{x}(n)\right].$   (10.30)

Substituting (10.28) for the first $\mathbf{T}^{-1}(n)$ and using $\mathbf{T}^{-1}(n)\mathbf{B}(n) = \boldsymbol{\kappa}(n)$ from (10.29), the update reduces after some algebra to

$\mathbf{w}(n) = \mathbf{w}(n-1) - \boldsymbol{\kappa}(n)\mathbf{D}^T(n)\mathbf{w}(n-1) + \boldsymbol{\kappa}(n)\begin{bmatrix} d(n) \\ d(n) - \beta d(n-L)\end{bmatrix}.$   (10.33)
Note that the product $\mathbf{D}^T(n)\mathbf{w}(n-1)$ is nothing but the vector of filter outputs $[y(n) \quad y(n) - \beta y(n-L)]^T$, where $y(n) = \mathbf{x}^T(n)\mathbf{w}(n-1)$ and $y(n-L) = \mathbf{x}^T(n-L)\mathbf{w}(n-1)$. The a priori error vector is defined as

$\boldsymbol{\varepsilon}(n) = \begin{bmatrix} d(n) - y(n) \\ d(n) - y(n) - \beta\left(d(n-L) - y(n-L)\right)\end{bmatrix} = \begin{bmatrix} e(n) \\ e(n) - \beta e(n-L)\end{bmatrix}.$   (10.34)
Using all the above definitions, we can formally state the weight update equation for the REW algorithm as

$\mathbf{w}(n) = \mathbf{w}(n-1) + \boldsymbol{\kappa}(n)\boldsymbol{\varepsilon}(n).$   (10.35)
The overall complexity of (10.35) is O(m²), which is comparable to the complexity of the RLS algorithm. Unlike the stochastic gradient algorithms, which are easily affected by the eigenspread of the input data and the type of the stationary point solution (minimum, maximum, or saddle), the REW algorithm is immune to these problems. This is because it inherently makes use of more information about the performance surface by computing the inverse of the Hessian matrix $\mathbf{R} + \beta\mathbf{S}$. A summary of the REW algorithm is given in Table 10.1.
The convergence analysis of the REW algorithm is similar to that of the RLS algorithm, which is dealt with in detail in [23]. In this chapter, we will not dwell further on the convergence issues of the REW algorithm. The REW algorithm as given by (10.35) works for stationary data only. For nonstationary data, tracking becomes an important issue. This can be handled by including a forgetting factor in the estimation of $\mathbf{T}(n)$ and $\mathbf{V}(n)$. This generalization of the REW algorithm with a forgetting factor is trivial and very similar to the exponentially weighted RLS (EWRLS) algorithm [23].
The instrumental variables (IV) method, proposed as an extension to least squares (LS), has a similar recursive algorithm for solving the problem of parameter estimation in white noise [43]. This method requires choosing a set of instruments that are uncorrelated with the noise in the input. Specifically, the IV method computes the solution $\mathbf{w} = \left(E[\mathbf{x}_k\mathbf{x}^T_{k-D}]\right)^{-1}E[\mathbf{x}_{k-D}d_k]$, where D is the chosen lag for the instrument vector. Notice that there is a similarity between the IV solution and the recursive EWC solution $\mathbf{w} = \mathbf{R}_L^{-1}\mathbf{P}_L$. However, the EWC formulation is based on
TABLE 10.1 Summary of the REW Algorithm

$\mathbf{D}(n) = [\mathbf{x}(n) \quad \mathbf{x}(n) - \beta\mathbf{x}(n-L)]$ and $\mathbf{B}(n) = [2\beta\mathbf{x}(n) - \beta\mathbf{x}(n-L) \quad \mathbf{x}(n)]$
$\boldsymbol{\kappa}(n) = \mathbf{T}^{-1}(n-1)\mathbf{B}(n)\left[\mathbf{I}_{2\times 2} + \mathbf{D}^T(n)\mathbf{T}^{-1}(n-1)\mathbf{B}(n)\right]^{-1}$
$y(n) = \mathbf{x}^T(n)\mathbf{w}(n-1)$ and $y(n-L) = \mathbf{x}^T(n-L)\mathbf{w}(n-1)$
$\boldsymbol{\varepsilon}(n) = \begin{bmatrix} d(n) - y(n) \\ d(n) - y(n) - \beta(d(n-L) - y(n-L))\end{bmatrix} = \begin{bmatrix} e(n) \\ e(n) - \beta e(n-L)\end{bmatrix}$
$\mathbf{w}(n) = \mathbf{w}(n-1) + \boldsymbol{\kappa}(n)\boldsymbol{\varepsilon}(n)$
$\mathbf{T}^{-1}(n) = \mathbf{T}^{-1}(n-1) - \boldsymbol{\kappa}(n)\mathbf{D}^T(n)\mathbf{T}^{-1}(n-1)$
the error, whereas the IV method does not have an associated error cost function. Also, the Toeplitz structure of $\mathbf{R}_L$ can be exploited to derive fast-converging (and robust) minor-components-based recursive EWC algorithms [44].
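As a concrete sketch, the REW recursion of Table 10.1 takes only a few lines. The rank-two factorization B(n), D(n) used below is our reading of the (partially garbled) recursions and may differ from the authors' exact bookkeeping; the two-tap system, coloring filter, and initialization constant delta are arbitrary choices:

```python
import numpy as np

def rew(x, d, m, L, beta=-0.5, delta=1.0):
    """Recursive error whitening (REW) sketch, following (10.27)-(10.35)."""
    Tinv = np.eye(m) / delta              # T^{-1}(0) = (1/delta) I
    w = np.zeros(m)
    for n in range(L + m - 1, len(x)):
        xn = x[n - np.arange(m)]          # tap vector x(n)
        xL = x[n - L - np.arange(m)]      # tap vector x(n-L)
        D = np.column_stack([xn, xn - beta * xL])
        B = np.column_stack([2 * beta * xn - beta * xL, xn])
        A = Tinv @ B
        K = A @ np.linalg.inv(np.eye(2) + D.T @ A)   # gain matrix kappa(n)
        Tinv = Tinv - K @ (D.T @ Tinv)               # inverse recursion (10.28)
        y, yL = xn @ w, xL @ w
        err = np.array([d[n] - y, (d[n] - y) - beta * (d[n - L] - yL)])
        w = w + K @ err                              # weight update (10.35)
    return w

rng = np.random.default_rng(0)
N, m, L = 5000, 2, 2
w_true = np.array([1.0, -0.7])
x = np.convolve(rng.standard_normal(N), [1.0, 0.4, 0.8])[:N]  # colored input
d = np.convolve(x, w_true)[:N]                                # clean desired signal
w_hat = rew(x, d, m, L)
```

On clean data the recursion reproduces w(n) = T^{-1}(n)V(n) exactly, so the estimate lands on the true weights up to the small initialization bias introduced by delta.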
10.6.1
The REW algorithm can be used effectively to solve the system identification problem in noisy environments. As we have seen before, by setting the value of β = -0.5, noise immunity can be gained for parameter estimation. We generated purely white Gaussian random noise of length 50,000 samples and added this to a colored input signal. The white noise signal is uncorrelated with the input signal. The noise-free, colored input signal was filtered by the unknown reference filter, and this formed the desired signal for the adaptive filter. Since the noise in the desired signal would be averaged out for both the RLS and REW algorithms, we decided to use the clean desired signal itself. This will bring out only the effects of input noise on the filter estimates. Also, the noise added to the clean input is uncorrelated with the desired signal. In the experiment, we varied the SNR in the range -10 dB to 10 dB. The number of desired filter coefficients was also varied from 4 to 12. We then performed 100 Monte Carlo runs and computed the normalized error vector norm given by
$\text{error} = 20\log_{10}\frac{\|\mathbf{w}_T - \mathbf{w}_*\|}{\|\mathbf{w}_T\|},$   (10.36)
where $\mathbf{w}_*$ is the weight vector estimated by the REW algorithm with β = -0.5 after 50,000 iterations, or one complete presentation of the input data, and $\mathbf{w}_T$ is the true weight vector. In order to show the effectiveness of the REW algorithm, we performed Monte Carlo runs using the RLS algorithm on the same data to estimate the filter coefficients. Figure 10.9 shows a histogram plot of the normalized error vector norm given in (10.36). The solid bars show the REW results, and the unfilled bars denote the results of RLS. It is clear that the REW algorithm is able to perform better than the RLS at various SNR and tap length settings. In the high-SNR cases, there is not much of a difference between the RLS and REW results. However, under noisy circumstances, the reduction in the parameter estimation error with REW is orders of magnitude higher when compared with RLS. Also, the RLS algorithm results in a rather useless zero weight vector, that is, $\mathbf{w} = \mathbf{0}$, when the SNR is lower than -10 dB.
10.6.2
Figure 10.9 Histogram plots showing the normalized error vector norm for the REW and RLS algorithms.
Figure 10.10 Performance of the REW algorithm with (a) SNR = 0 dB and (b) SNR = -10 dB over various β values.
figure), and this clearly gives us the minimum estimation error. For β = 0 (indicated by a "o" in the figure), the REW algorithm reduces to the regular RLS, giving a fairly significant estimation error. Next, the parameter β is set to -0.5 and the SNR to 0 dB, and the weight tracks are estimated for the two algorithms. Figure 10.11 shows the averaged weight tracks for both the REW and RLS algorithms over 50 Monte Carlo trials. Asterisks on the plots indicate the true parameters. The tracks for the RLS algorithm are smoother, but they converge to wrong values, which we have observed quite consistently. The weight tracks for the REW algorithm are noisier than those of the RLS, but they eventually converge to values very close to the true weights.
We have observed that the weight tracks for the REW algorithm can be quite
noisy in the initial stages of adaptation. This may be attributed to the poor initial estimate of the inverse matrix $\mathbf{T}^{-1}(n)$.
Figure 10.11 Averaged weight tracks for the REW and RLS algorithms.

10.7

Consider the MSE and the lag-L error-difference power as functions of the weights:

$J_{\mathrm{MSE}}(\mathbf{w}) = E[e^2(n)],$   (10.37)
$J(\mathbf{w}) = E[e^2(n)] + \beta E[\dot{e}^2(n)],$   (10.38)

where $\dot{e}(n) = e(n) - e(n-L)$ and $\dot{\mathbf{x}}(n) = \mathbf{x}(n) - \mathbf{x}(n-L)$.
It is easy to see that both $E[e^2(n)]$ and $E[\dot{e}^2(n)]$ have parabolic performance surfaces, as their Hessians have positive eigenvalues. However, the value of β can invert the performance surface of $E[\dot{e}^2(n)]$. For β > 0, the stationary point is always a global minimum, and the gradient of (10.38) can be written as the sum of the individual gradients as follows:

$\frac{\partial J(\mathbf{w})}{\partial \mathbf{w}} = 2(\mathbf{R} + \beta\mathbf{S})\mathbf{w} - 2(\mathbf{P} + \beta\mathbf{Q}) = 2(\mathbf{R}\mathbf{w} - \mathbf{P}) + 2\beta(\mathbf{S}\mathbf{w} - \mathbf{Q}).$   (10.39)

The corresponding steepest-descent update is

$\mathbf{w}(n+1) = \mathbf{w}(n) - \eta\,\frac{\partial J(\mathbf{w})}{\partial \mathbf{w}}\bigg|_{\mathbf{w} = \mathbf{w}(n)}.$   (10.40)
Thus we can write the weight update for the stochastic EWC-LMS algorithm for β > 0 as

$\mathbf{w}(n+1) = \mathbf{w}(n) + \eta(n)\left[e(n)\mathbf{x}(n) + \beta\dot{e}(n)\dot{\mathbf{x}}(n)\right],$   (10.41)

and, writing $\beta = -|\beta|$ for negative β,

$\mathbf{w}(n+1) = \mathbf{w}(n) + \eta(n)\left[e(n)\mathbf{x}(n) - |\beta|\dot{e}(n)\dot{\mathbf{x}}(n)\right],$   (10.42)

where $\eta(n)$ is again a small step-size. However, there is no guarantee that the above update rules will be stable for all choices of step-sizes. Although (10.41) and (10.42) are identical, we will use |β| in the update (10.42) to analyze the convergence of the algorithm specifically for β < 0. The reason for the separate analysis is that the convergence characteristics of (10.41) and (10.42) are very different.
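For β > 0, the update (10.41), combined with a step-size below the bound derived in (10.49), is straightforward to sketch (clean data, arbitrary two-tap system and coloring filter; the step is taken as half the bound):

```python
import numpy as np

rng = np.random.default_rng(0)
N, m, L, beta = 30_000, 2, 2, 0.5        # beta > 0 case
w_true = np.array([1.0, -0.7])
x = np.convolve(rng.standard_normal(N), [1.0, 0.4, 0.8])[:N]
d = np.convolve(x, w_true)[:N]

w = np.zeros(m)
for n in range(L + m - 1, N):
    xn = x[n - np.arange(m)]
    xnL = x[n - L - np.arange(m)]
    e = d[n] - xn @ w                    # a priori error e(n)
    eL = d[n - L] - xnL @ w              # a priori error e(n-L)
    xdot, edot = xn - xnL, e - eL
    u = e * xn + beta * edot * xdot      # instantaneous update direction
    denom = u @ u
    if denom > 1e-12:
        eta = (e * e + beta * edot * edot) / denom   # half the bound in (10.49)
        w = w + eta * u
```

Choosing the step this way keeps the weight error norm nonincreasing at every iteration, which is exactly what the bound is designed to guarantee.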
Theorem 10.3 The stochastic EWC algorithms asymptotically converge in the mean to the optimal solution given by

$\mathbf{w}_* = (\mathbf{R} + \beta\mathbf{S})^{-1}(\mathbf{P} + \beta\mathbf{Q}), \qquad \beta > 0,$
$\mathbf{w}_* = (\mathbf{R} - |\beta|\mathbf{S})^{-1}(\mathbf{P} - |\beta|\mathbf{Q}), \qquad \beta < 0.$   (10.43)

We will first consider the update equation in (10.41), which is the stochastic EWC-LMS algorithm for β > 0. Without loss of generality, we will assume that the input
vectors $\mathbf{x}(n)$ and their corresponding desired responses $d(n)$ are noise-free. The mean update vector $\bar{h}(\mathbf{w}(n))$ is given by

$\frac{d\mathbf{w}(t)}{dt} = \bar{h}(\mathbf{w}(n)) = E\left[e(n)\mathbf{x}(n) + \beta\dot{e}(n)\dot{\mathbf{x}}(n)\right] = -\left[(\mathbf{R}\mathbf{w}(n) - \mathbf{P}) + \beta(\mathbf{S}\mathbf{w}(n) - \mathbf{Q})\right].$   (10.44)

The stationary point of the ordinary differential equation (ODE) in (10.44) is given by

$\mathbf{w}_* = (\mathbf{R} + \beta\mathbf{S})^{-1}(\mathbf{P} + \beta\mathbf{Q}).$   (10.45)
Define the weight error vector

$\boldsymbol{\xi}(n) = \mathbf{w}_* - \mathbf{w}(n).$   (10.46)

Its squared norm evolves as

$\|\boldsymbol{\xi}(n+1)\|^2 = \|\boldsymbol{\xi}(n)\|^2 - 2\eta(n)\boldsymbol{\xi}^T(n)\left[e(n)\mathbf{x}(n) + \beta\dot{e}(n)\dot{\mathbf{x}}(n)\right] + \eta^2(n)\left\|e(n)\mathbf{x}(n) + \beta\dot{e}(n)\dot{\mathbf{x}}(n)\right\|^2.$   (10.47)

Imposing the condition that $\|\boldsymbol{\xi}(n+1)\|^2 < \|\boldsymbol{\xi}(n)\|^2$ for all n, we get an upper bound on the time-varying step-size parameter $\eta(n)$, which is given by

$\eta(n) < \frac{2\,\boldsymbol{\xi}^T(n)\left[e(n)\mathbf{x}(n) + \beta\dot{e}(n)\dot{\mathbf{x}}(n)\right]}{\left\|e(n)\mathbf{x}(n) + \beta\dot{e}(n)\dot{\mathbf{x}}(n)\right\|^2}.$   (10.48)
Simplifying the above equation using the facts that $\boldsymbol{\xi}^T(n)\mathbf{x}(n) = e(n)$ and $\boldsymbol{\xi}^T(n)\dot{\mathbf{x}}(n) = \dot{e}(n)$, we get

$\eta(n) < \frac{2\left[e^2(n) + \beta\dot{e}^2(n)\right]}{\left\|e(n)\mathbf{x}(n) + \beta\dot{e}(n)\dot{\mathbf{x}}(n)\right\|^2},$   (10.49)

which is a more practical upper bound on the step-size, as it can be directly estimated from the input and outputs. As an observation, we note that if β = 0, then the bound in (10.49) reduces to

$\eta(n) < \frac{2}{\|\mathbf{x}(n)\|^2},$   (10.50)

which, when included in the update equation, reduces to a variant of the normalized LMS (NLMS) algorithm. In general, if the step-size parameter is chosen according to the bound given by (10.49), then the norm of the error vector $\boldsymbol{\xi}(n)$ is a monotonically decreasing sequence.
10.7.2
For β < 0, the mean update ODE becomes

$\frac{d\mathbf{w}(t)}{dt} = -\left[(\mathbf{R}\mathbf{w}(n) - \mathbf{P}) - |\beta|(\mathbf{S}\mathbf{w}(n) - \mathbf{Q})\right].$   (10.51)

As before, the stationary point of this ODE is

$\mathbf{w}_* = (\mathbf{R} - |\beta|\mathbf{S})^{-1}(\mathbf{P} - |\beta|\mathbf{Q}).$   (10.52)
The eigenvalues of $\mathbf{R} - |\beta|\mathbf{S}$ decide the nature of the stationary point. If they are all positive, then we have a global minimum; if they are all negative, we have a global maximum. In these two cases, the stochastic gradient algorithm in (10.42) with a proper fixed-sign step-size would converge to the stationary point, which would be stable. However, we know that the eigenvalues of $\mathbf{R} - |\beta|\mathbf{S}$ can also take both positive and negative values, resulting in a saddle stationary point. Thus, the underlying dynamical system would have both stable and unstable modes, making it impossible for the algorithm in (10.42) with a fixed-sign step-size to converge. This is well known in the literature [22]. However, as will be shown next, this difficulty can be removed in our case by appropriately utilizing the sign of the update equation (remember that this is the only stationary point of the quadratic performance surface). The general idea is to use a vector step-size (one step-size per weight) having both positive and negative values. One unrealistic way (for an on-line algorithm) to achieve this goal is to estimate the eigenvalues of $\mathbf{R} - |\beta|\mathbf{S}$. Alternatively, we can derive the conditions on the step-size for guaranteed
convergence. As before, we will define the error vector at time instant n as $\boldsymbol{\xi}(n) = \mathbf{w}_* - \mathbf{w}(n)$. The norm of the error vector at time instant n + 1 is given by

$\|\boldsymbol{\xi}(n+1)\|^2 = \|\boldsymbol{\xi}(n)\|^2 - 2\eta(n)\left[\boldsymbol{\xi}^T(n)e(n)\mathbf{x}(n) - |\beta|\boldsymbol{\xi}^T(n)\dot{e}(n)\dot{\mathbf{x}}(n)\right] + \eta^2(n)\left\|e(n)\mathbf{x}(n) - |\beta|\dot{e}(n)\dot{\mathbf{x}}(n)\right\|^2.$   (10.53)

Taking expectations on both sides yields

$E\|\boldsymbol{\xi}(n+1)\|^2 = E\|\boldsymbol{\xi}(n)\|^2 - 2\eta(n)E\left[\boldsymbol{\xi}^T(n)\left(e(n)\mathbf{x}(n) - |\beta|\dot{e}(n)\dot{\mathbf{x}}(n)\right)\right] + \eta^2(n)E\left\|e(n)\mathbf{x}(n) - |\beta|\dot{e}(n)\dot{\mathbf{x}}(n)\right\|^2.$   (10.54)
The mean of the error vector norm will monotonically decay to zero over time; that is, $E\|\boldsymbol{\xi}(n+1)\|^2 < E\|\boldsymbol{\xi}(n)\|^2$ if and only if the step-size satisfies the following inequality:

$|\eta(n)| < \frac{2\left|E[\boldsymbol{\xi}^T(n)e(n)\mathbf{x}(n)] - |\beta|\,E[\boldsymbol{\xi}^T(n)\dot{e}(n)\dot{\mathbf{x}}(n)]\right|}{E\left\|e(n)\mathbf{x}(n) - |\beta|\dot{e}(n)\dot{\mathbf{x}}(n)\right\|^2}.$   (10.55)

Writing the noisy data as $\mathbf{x}(n) = \tilde{\mathbf{x}}(n) + \mathbf{v}(n)$ and $d(n) = \tilde{d}(n) + u(n)$, the first numerator term expands as

$\boldsymbol{\xi}^T(n)e(n)\mathbf{x}(n) = (\mathbf{w}_* - \mathbf{w}(n))^T\left[\tilde{d}(n) + u(n) - \mathbf{w}^T(n)(\tilde{\mathbf{x}}(n) + \mathbf{v}(n))\right]\left[\tilde{\mathbf{x}}(n) + \mathbf{v}(n)\right],$   (10.56)–(10.57)

and taking the expectation and collecting the surviving terms defines the positive quantity $J_{\mathrm{MSE}} = E[e^2(n)]$, evaluated at $\mathbf{w}(n)$.   (10.58)

Similarly, we have

$\boldsymbol{\xi}^T(n)\dot{e}(n)\dot{\mathbf{x}}(n) = (\mathbf{w}_* - \mathbf{w}(n))^T\big[\tilde{d}(n) + u(n) - \mathbf{w}^T(n)(\tilde{\mathbf{x}}(n) + \mathbf{v}(n)) - \tilde{d}(n-L) - u(n-L) + \mathbf{w}^T(n)(\tilde{\mathbf{x}}(n-L) + \mathbf{v}(n-L))\big]\big[\tilde{\mathbf{x}}(n) + \mathbf{v}(n) - \tilde{\mathbf{x}}(n-L) - \mathbf{v}(n-L)\big],$   (10.59)–(10.60)

whose expectation likewise defines the positive quantity $J_{\mathrm{ENT}} = E[\dot{e}^2(n)]$.   (10.61)
Using (10.57) and (10.60) in (10.55), we get an expression (10.62) for the upper bound on the step-size. This expression is not usable in practice as an upper bound because it depends on the optimal weight vector. However, for β = -0.5, the upper bound on the step-size reduces to

$|\eta(n)| < \frac{2\left|J_{\mathrm{MSE}} - 0.5\,J_{\mathrm{ENT}}\right|}{E\left\|e(n)\mathbf{x}(n) - 0.5\,\dot{e}(n)\dot{\mathbf{x}}(n)\right\|^2}.$   (10.63)
From (10.58) and (10.61), we know that $J_{\mathrm{MSE}}$ and $J_{\mathrm{ENT}}$ are positive quantities. However, $J_{\mathrm{MSE}} - 0.5J_{\mathrm{ENT}}$ can be negative. Also, note that this upper bound is computed by evaluating the right-hand side of (10.63) with the current weight vector $\mathbf{w}(n)$. Thus, as expected, it is very clear that the step-size at the nth iteration can take either positive or negative values based on $J_{\mathrm{MSE}} - 0.5J_{\mathrm{ENT}}$; therefore, $\mathrm{sgn}(\eta(n))$ must be the same as $\mathrm{sgn}(J_{\mathrm{MSE}} - 0.5J_{\mathrm{ENT}})$ evaluated at $\mathbf{w}(n)$. Intuitively speaking, the term $J_{\mathrm{MSE}} - 0.5J_{\mathrm{ENT}}$ is the EWC cost computed with the current weights $\mathbf{w}(n)$ and β = -0.5, which tells us where we are on the performance surface, and the sign tells us which way to go to reach the stationary point. It also means that the lower bound on the step-size is not positive, as in traditional gradient algorithms. In general, if the step-size we choose satisfies (10.62), then the mean error vector norm decreases asymptotically, that is, $E\|\boldsymbol{\xi}(n+1)\|^2 < E\|\boldsymbol{\xi}(n)\|^2$, and eventually becomes zero, which implies that $\lim_{n\to\infty} E[\mathbf{w}(n)] = \mathbf{w}_*$. Thus, the weight vector $E[\mathbf{w}(n)]$ converges asymptotically to $\mathbf{w}_*$, which is the only stationary point of the ODE in (10.51). We conclude that knowledge of the eigenvalues is not needed to implement gradient descent on the EWC performance surface, but (10.63) is still not appropriate for a simple LMS-type algorithm.
10.7.3
As mentioned before, computing $J_{\mathrm{MSE}} - 0.5J_{\mathrm{ENT}}$ at the current weight vector would require reusing the entire past data at every iteration. As an alternative, we can extract the curvature at the operating point and include that information in the gradient algorithm. By doing so, we obtain the following stochastic algorithm:

$\mathbf{w}(n+1) = \mathbf{w}(n) + \eta\,\mathrm{sgn}\!\left(\mathbf{w}^T(n)\left[\mathbf{R}(n) - |\beta|\mathbf{S}(n)\right]\mathbf{w}(n)\right)\left[e(n)\mathbf{x}(n) - |\beta|\dot{e}(n)\dot{\mathbf{x}}(n)\right],$   (10.64)

where $\mathbf{R}(n)$ and $\mathbf{S}(n)$ are the estimates of $\mathbf{R}$ and $\mathbf{S}$, respectively, at the nth time instant.
Corollary Given any quadratic surface $J(\mathbf{w})$, the following gradient algorithm converges to its stationary point:

$\mathbf{w}(n+1) = \mathbf{w}(n) - \eta\,\mathrm{sgn}\!\left(\mathbf{w}^T(n)\mathbf{H}\mathbf{w}(n)\right)\frac{\partial J}{\partial \mathbf{w}(n)}.$   (10.65)
Proof Without loss of generality, suppose that we are given a quadratic surface of the form $J(\mathbf{w}) = \mathbf{w}^T\mathbf{H}\mathbf{w}$, where $\mathbf{H} \in \mathbb{R}^{m\times m}$ and $\mathbf{w} \in \mathbb{R}^{m\times 1}$. $\mathbf{H}$ is restricted to be symmetric; therefore, it is the Hessian matrix of this quadratic surface. The gradient of the performance surface with respect to the weights, evaluated at the point $\mathbf{w}_0$, is $\partial J/\partial \mathbf{w}_0 = 2\mathbf{H}\mathbf{w}_0$, and the stationary point of $J(\mathbf{w})$ is the origin. Since the performance surface is quadratic, any cross section passing through the stationary point is a parabola. Consider the cross section of $J(\mathbf{w})$ along the line defined by the local gradient that passes through the point $\mathbf{w}_0$. In general, the Hessian matrix of this surface can be positive or negative definite; it might as well have mixed eigenvalues. The unique stationary point of $J(\mathbf{w})$, which makes its gradient zero, can be reached by moving along the direction of the local gradient. The important issue is the selection of the sign, that is, whether to move along or against the gradient direction to reach the stationary point. The decision can be made by observing the local curvature of the cross section of $J(\mathbf{w})$ along the gradient direction. The performance surface cross section along the gradient direction at $\mathbf{w}_0$ is

$J(\mathbf{w}_0 - 2\eta\mathbf{H}\mathbf{w}_0) = \mathbf{w}_0^T(\mathbf{I} - 2\eta\mathbf{H})^T\mathbf{H}(\mathbf{I} - 2\eta\mathbf{H})\mathbf{w}_0 = \mathbf{w}_0^T\left(\mathbf{H} - 4\eta\mathbf{H}^2 + 4\eta^2\mathbf{H}^3\right)\mathbf{w}_0.$   (10.66)
From this, we deduce that the local curvature of the parabolic cross section at $\mathbf{w}_0$ is $4\mathbf{w}_0^T\mathbf{H}^3\mathbf{w}_0$. If the performance surface is locally convex, this curvature is positive. If the performance surface is locally concave, this curvature is negative. Also, note that $\mathrm{sgn}(4\mathbf{w}_0^T\mathbf{H}^3\mathbf{w}_0) = \mathrm{sgn}(\mathbf{w}_0^T\mathbf{H}\mathbf{w}_0)$. Thus, the update equation with the curvature information included converges to the stationary point regardless of the nature of the eigenvalues. Replacing the expectations in the sign term with their instantaneous values gives the EWC-LMS algorithm for β < 0:

$\mathbf{w}(n+1) = \mathbf{w}(n) + \eta(n)\,\mathrm{sgn}\!\left(e^2(n) - |\beta|\dot{e}^2(n)\right)\left[e(n)\mathbf{x}(n) - |\beta|\dot{e}(n)\dot{\mathbf{x}}(n)\right].$   (10.67)
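The sign-correction idea in the corollary is easy to verify on a toy indefinite quadratic (a self-contained sketch; the matrix H, step-size, and iteration counts are arbitrary choices, not taken from the chapter):

```python
import numpy as np

# Indefinite quadratic J(w) = w^T H w: the only stationary point is a saddle at the origin
H = np.array([[1.0, 0.0],
              [0.0, -0.5]])

def descend(steps, use_sign):
    w = np.array([2.0, 3.0])
    eta = 0.05
    for _ in range(steps):
        grad = 2.0 * H @ w
        s = np.sign(w @ H @ w) if use_sign else 1.0  # curvature sign along the gradient
        w = w - eta * s * grad
    return w

w_signed = descend(50_000, use_sign=True)
w_plain = descend(50, use_sign=False)
```

With the curvature sign included, the iterate contracts toward the saddle at the origin; with a fixed positive sign, the negative-curvature mode grows without bound, mirroring the argument in the proof.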
The experimental setup is the same as the one we used to test the REW algorithm. We varied the SNR between -10 dB and 10 dB and changed the number of filter parameters from 4 to 12. We set β = -0.5 and used the update equation in (10.67) for the EWC-LMS algorithm. A time-varying step-size magnitude was chosen in accordance with the upper bound given by (10.63) without the expectation operators. This greatly reduces the computational burden but makes the algorithm noisier. However, since we are using 50,000 samples for estimating the parameters, we can expect the errors to average out over the iterations. For the LMS algorithm, we chose the step-size that gave the least error in each trial. A total of 100 Monte Carlo trials were performed, and histograms of the normalized error vector norms were plotted. Figure 10.12 shows the error histograms for both the LMS and EWC-LMS algorithms. The EWC-LMS algorithm performs significantly better than the LMS algorithm at low SNR values. Their performances are on par for SNRs greater than 20 dB. Figure 10.13 shows a sample comparison between the stochastic and recursive algorithms for 0 dB SNR and four filter taps. Interestingly, the performance of the EWC-LMS algorithm is better than that of the REW algorithm in the presence of noise. Similarly, the LMS algorithm is much better than the RLS algorithm. This tells us that the stochastic algorithms reject more noise than the fixed-point algorithms. Researchers have made this observation before, although no concrete arguments exist to account for the smartness of the adaptive algorithms [35]. Similar conclusions can be drawn in our case for EWC-LMS and REW.
Figure 10.12 Histogram plots showing the normalized error vector norm for EWC-LMS and LMS algorithms.
Figure 10.13 Sample comparison between the stochastic and recursive algorithms for 0 dB SNR and four filter taps.
10.7.5
Figure 10.14 Contour plot of the EWC cost function with noisy input data.

Figure 10.14 shows the contour plot of the EWC cost function with noisy input data. Clearly, the Hessian of this
performance surface has both positive and negative eigenvalues, thus making the stationary point an undesirable saddle point. On the same plot, we have shown the weight tracks of the EWC-LMS algorithm in (10.67) with β = -0.5. Also, we have used a fixed value of 0.001 for the step-size. From the figure, it is clear that the EWC-LMS algorithm converges stably to the saddle point solution, which is theoretically unstable when a single-sign step-size is used. Note that due to the constant step-size, there is misadjustment in the final solution. Although no analytical expressions for the misadjustment are derived in this chapter, we have done some preliminary work on estimating the misadjustment and excess error for EWC-LMS [32, 33].
In Figure 10.15, we show the individual weight tracks for the EWC-LMS algorithm. The weights converge to the vicinity of the true filter parameters, which are 0.2 and 0.5, respectively, within 1000 samples. In order to see if the algorithm in (10.67) converges to the saddle point solution in a robust manner, we ran the same experiment using different initial conditions on the contours. Figure 10.16 shows a few plots of the weight tracks originating from different initial values over the contours of the performance surface. In every case, the algorithm converged to the saddle point in a stable manner. Note that the misadjustment in each case is almost the same. Finally, in order to see the effect of reducing the SNR, we repeated the experiment with 0 dB SNR. Figure 10.17 (left) shows the weight tracks over the contour, and we can see that there is more misadjustment now. However, we have observed that by using smaller step-sizes, the misadjustment can be controlled to be within acceptable limits. Figure 10.17 (right) shows the weight tracks when the algorithm is used without the sign information for the step-size. Note that convergence is not achieved in this case, which substantiates our previous argument that a fixed-sign step-size will never converge to a saddle point.
Figure 10.15 Weight tracks.
Figure 10.16 Contour plot with weight tracks for different initial values for the weights.

10.8 Conclusions
Mean square error has been the criterion of choice in many function approximation tasks, including adaptive filter optimization. There are alternatives and enhancements to MSE that have been proposed in order to improve the robustness of learning algorithms in the presence of noisy training data. In FIR filter adaptation, noise present in the input signal is especially problematic, since MSE cannot eliminate this factor. A powerful enhancement technique called total least squares, on the one hand, fails to work if the noise levels in the input and output signals are not equal. The alternative method of subspace Wiener filtering, on the other hand, requires the noise power to be strictly smaller than the signal power to improve SNR.
Figure 10.17 Contour plot with weight tracks for the EWC-LMS algorithm with sign information (left) and without sign information (right) (0 dB SNR and two filter taps case).
APPENDIX A
This appendix aims to achieve an understanding of the relationship between entropy and sample differences. In general, the parametric family describing the error probability density function (pdf) in supervised learning is not analytically available. In such circumstances, nonparametric approaches such as Parzen windowing [29] can be employed. Given the independent and identically distributed (i.i.d.) samples $\{e(1), \ldots, e(N)\}$ of a random variable e, the Parzen window estimate for the underlying pdf $f_e(\cdot)$ is obtained by

$\hat{f}_e(x) = \frac{1}{N}\sum_{i=1}^{N} \kappa_\sigma(x - e(i)),$   (A.1)

where $\kappa_\sigma(\cdot)$ is the kernel function, which itself is a pdf, and σ is the kernel size that controls the width of each window. Typically, Gaussian kernels are preferred, but other kernel functions such as the Cauchy density or the members of the generalized Gaussian family can be employed.
Shannon's entropy for a random variable e with pdf $f_e(\cdot)$ is defined as [37]

$H(e) = -\int_{-\infty}^{\infty} f_e(x)\log f_e(x)\,dx,$   (A.2)

which can be estimated from the samples by

$\hat{H}(e) = -\frac{1}{N}\sum_{j=1}^{N}\log\left(\frac{1}{N}\sum_{i=1}^{N}\kappa_\sigma(e(j) - e(i))\right).$   (A.3)

This estimator uses the sample mean approximation for the expected value and the Parzen window estimator for the pdf. Viola proposed a similar entropy estimator, in which he suggested dividing the samples into two subsets: one for estimating the pdf, the other for evaluating the sample mean [41]. In order to approximate a stochastic entropy estimator, we approximate the expectation by evaluating the argument at the most recent sample, e(k). In order to estimate the pdf, we use the L previous samples. The stochastic entropy estimator then becomes

$\hat{H}(e) = -\log\left(\frac{1}{L}\sum_{i=1}^{L}\kappa_\sigma(e(k) - e(k-i))\right).$   (A.4)
For supervised training of an ADALINE (or an FIR filter) with weight vector $\mathbf{w} \in \mathbb{R}^m$, given the input (vector)-desired training sequence $(\mathbf{x}(n), d(n))$, where $\mathbf{x}(n) \in \mathbb{R}^m$ and $d(n) \in \mathbb{R}$, the instantaneous error is given by $e(n) = d(n) - \mathbf{w}^T(n)\mathbf{x}(n)$. The stochastic gradient of the error entropy with respect to the weights
becomes

$\frac{\partial \hat{H}}{\partial \mathbf{w}} = \frac{\displaystyle\sum_{i=1}^{L}\kappa'_\sigma(e(n) - e(n-i))\,(\mathbf{x}(n) - \mathbf{x}(n-i))}{\displaystyle\sum_{i=1}^{L}\kappa_\sigma(e(n) - e(n-i))}.$   (A.6)

We easily note that the expression in (A.6) is also a stochastic gradient for the cost function $J = E[(e(n) - e(n-L))^2]/(2\sigma^2)$.
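The stochastic estimator (A.4) is simple to implement with a Gaussian kernel (a small sketch; the kernel size and window length are arbitrary choices):

```python
import numpy as np

def gaussian_kernel(u, sigma):
    # Gaussian pdf with standard deviation sigma, evaluated at u
    return np.exp(-0.5 * (u / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def stochastic_entropy(e, k, L, sigma=0.5):
    # (A.4): minus log of the Parzen density at e(k),
    # estimated from the L previous samples e(k-1), ..., e(k-L)
    window = e[k - L:k]
    return -np.log(np.mean(gaussian_kernel(e[k] - window, sigma)))

# An error sample close to the recent past gets a small instantaneous
# entropy value; one far from the recent past gets a large one.
e = np.zeros(11)
h_near = stochastic_entropy(e, 10, 10)
e[10] = 5.0
h_far = stochastic_entropy(e, 10, 10)
```

An error sample that falls far from the recent error samples is "surprising" and receives a large instantaneous entropy value, which is what entropy-based adaptation penalizes.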
APPENDIX B
Consider the correlation matrices R, S, P, and Q estimated from noisy data. For R, we write

$\mathbf{R} = E[\mathbf{x}(n)\mathbf{x}^T(n)] = E\left[(\tilde{\mathbf{x}}(n) + \mathbf{v}(n))(\tilde{\mathbf{x}}(n) + \mathbf{v}(n))^T\right]$
$= E\left[\tilde{\mathbf{x}}(n)\tilde{\mathbf{x}}^T(n) + \tilde{\mathbf{x}}(n)\mathbf{v}^T(n) + \mathbf{v}(n)\tilde{\mathbf{x}}^T(n) + \mathbf{v}(n)\mathbf{v}^T(n)\right]$
$= E[\tilde{\mathbf{x}}(n)\tilde{\mathbf{x}}^T(n)] + E[\mathbf{v}(n)\mathbf{v}^T(n)] = \tilde{\mathbf{R}} + \mathbf{V}.$   (B.1)
For S, we obtain

$\mathbf{S} = E\left[\mathbf{x}(n)\mathbf{x}^T(n) + \mathbf{x}(n-L)\mathbf{x}^T(n-L) - \mathbf{x}(n)\mathbf{x}^T(n-L) - \mathbf{x}(n-L)\mathbf{x}^T(n)\right]$
$= 2\mathbf{R} - E\left[(\tilde{\mathbf{x}}(n) + \mathbf{v}(n))(\tilde{\mathbf{x}}(n-L) + \mathbf{v}(n-L))^T + (\tilde{\mathbf{x}}(n-L) + \mathbf{v}(n-L))(\tilde{\mathbf{x}}(n) + \mathbf{v}(n))^T\right]$
$= 2(\tilde{\mathbf{R}} + \mathbf{V}) - E\left[\tilde{\mathbf{x}}(n)\tilde{\mathbf{x}}^T(n-L) + \tilde{\mathbf{x}}(n-L)\tilde{\mathbf{x}}^T(n)\right] - E\left[\mathbf{v}(n)\mathbf{v}^T(n-L) + \mathbf{v}(n-L)\mathbf{v}^T(n)\right]$
$= 2(\tilde{\mathbf{R}} + \mathbf{V}) - (\tilde{\mathbf{R}}_L + \mathbf{V}_L).$   (B.2)
For P, the cross terms vanish similarly:

$\mathbf{P} = E[\mathbf{x}(n)d(n)] = E\left[(\tilde{\mathbf{x}}(n) + \mathbf{v}(n))(\tilde{d}(n) + u(n))\right] = E[\tilde{\mathbf{x}}(n)\tilde{d}(n)] = \tilde{\mathbf{P}}.$   (B.3)
Finally, for Q,

$\mathbf{Q} = E\left[(\mathbf{x}(n) - \mathbf{x}(n-L))(d(n) - d(n-L))\right]$
$= E\left[\mathbf{x}(n)d(n) - \mathbf{x}(n)d(n-L) - \mathbf{x}(n-L)d(n) + \mathbf{x}(n-L)d(n-L)\right]$
$= 2\tilde{\mathbf{P}} - E\left[(\tilde{\mathbf{x}}(n) + \mathbf{v}(n))(\tilde{d}(n-L) + u(n-L)) + (\tilde{\mathbf{x}}(n-L) + \mathbf{v}(n-L))(\tilde{d}(n) + u(n))\right]$
$= 2\tilde{\mathbf{P}} - E\left[\tilde{\mathbf{x}}(n)\tilde{d}(n-L) + \tilde{\mathbf{x}}(n-L)\tilde{d}(n)\right] = 2\tilde{\mathbf{P}} - \tilde{\mathbf{P}}_L.$   (B.4)
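The two facts that drive this appendix, namely that the zero-lag correlation is inflated by the input-noise variance (B.1) while correlations at lag L are untouched by white noise (B.2), can be checked with a scalar simulation (arbitrary coloring filter and noise level):

```python
import numpy as np

rng = np.random.default_rng(3)
N, L = 200_000, 2
x_clean = np.convolve(rng.standard_normal(N), [1.0, 0.4, 0.8])[:N]  # colored clean signal
v = 0.7 * rng.standard_normal(N)   # white input noise, variance 0.49
x = x_clean + v

def lag_corr(a, b, lag):
    # Sample estimate of E[a(n) b(n - lag)]
    return np.mean(a[lag:] * b[:-lag])

r0_noisy = np.mean(x * x)                  # inflated by the noise variance (B.1)
r0_clean = np.mean(x_clean * x_clean)
rL_noisy = lag_corr(x, x, L)               # unaffected by white noise (B.2)
rL_clean = lag_corr(x_clean, x_clean, L)
```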
APPENDIX C
Recall that the optimal solution of EWC satisfies (10.9), which is equivalently

$E\left[(1 + 2\beta)e(n)\mathbf{x}(n) - \beta\left(e(n)\mathbf{x}(n-L) + e(n-L)\mathbf{x}(n)\right)\right] = \mathbf{0},$   (C.1)

or, rearranging,

$E\left[e(n)\mathbf{x}(n) + \beta\left(2e(n)\mathbf{x}(n) - e(n)\mathbf{x}(n-L) - e(n-L)\mathbf{x}(n)\right)\right] = \mathbf{0}.$   (C.2)

Note that the combination of x-values that multiply β forms an estimate of the acceleration of the input vector $\mathbf{x}(n)$. Specifically, for β = -1/2, the term that multiplies e(n) becomes a single-step prediction for the input vector $\mathbf{x}(n)$ (assuming zero velocity and constant acceleration), according to Newtonian mechanics. Thus, the optimal solution of the EWC criterion tries to decorrelate the error signal from the predicted next input vector.
Acknowledgments
This work is partially supported by NSF Grant ECS-9900394.
REFERENCES
1. H. Akaike, A New Look at the Statistical Model Identication, IEEE Trans. Automatic
Control, vol. 19, pp. 716 723, 1974.
2. C. Beck and F. Schlogl, Thermodynamics of Chaotic Systems. Cambridge University
Press, Cambridge, 1993.
3. J. Beirlant and M. C. A. Zuijlen, The Empirical Distribution Function and Strong Laws
for Functions of Order Statistics of Uniform Spacings, Journal of Multivariate Analysis,
vol. 16, pp. 300 317, 1985.
4. A. Benveniste, M. Metivier, and P. Priouret, Adaptive Algorithms and Stochastic
Approximations. Springer-Verlag, Berlin, 1990.
5. P. J. Bickel and L. Breiman, Sums of Functions of Nearest Neighbor Distances, Moment
Bounds, Limit Theorems and a Goodness-of-Fit Test, Annals of Statistics, vol. 11,
pp. 185 214, 1983.
6. C. Bishop, Neural Networks for Pattern Recognition. Clarendon Press, Oxford,
1995.
7. J. A. Cadzow, Total Least Squares, Matrix Enhancement, and Signal Processing,
Digital Signal Processing, vol. 4, pp. 21 39, 1994.
8. M. Chansarkar and U. B. Desai, A Robust Recursive Least Squares Algorithm, IEEE
Trans. Signal Processing, vol. 45, pp. 1726 1735, 1997.
9. B. de Moor, Total Least Squares for Afnely Structured Matrices and the Noisy
Realization Problem, IEEE Trans. Signal Processing, vol. 42 , pp. 3104 3113, 1994.
10. S. C. Douglas and W. Pan, Exact Expectation Analysis of the LMS Adaptive Filter,
IEEE. Trans. Signal Processing, vol. 43, pp. 2863 2871, 1995.
11. S. C. Douglas, Analysis of an Anti-Hebbian Adaptive FIR Filtering Algorithm, IEEE
Trans. Circuits and SystemsII: Analog and Digital Signal Processing, vol. 43, pp. 777
780, 1996.
12. D. Erdogmus, Information Theoretic Learning: Renyis Entropy and its Applications to
Adaptive System Training, Ph.D. dissertation, University of Florida, Gainesville, FL,
2002.
13. D. Erdogmus and J. C. Principe, An On-Line Adaptation Algorithm for Adaptive System
Training with Minimum Error Entropy: Stochastic Information Gradient, Proceedings
of ICA01, pp. 7 12, San Diego, CA, 2001.
14. D. Erdogmus and J. C. Principe, Generalized Information Potential Criterion for
Adaptive System Training, to appear in IEEE Trans. Neural Networks, vol. 13, no. 5, pp.
1035 1044, Sept. 2002.
15. D. Erdogmus, J. C. Principe, and K. E. Hild II, Do Hebbian Synapses Estimate
Entropy?, accepted by NNSP02, pp. 199 208, Martigny, Switzerland, Sept. 2002.
16. B. Farhang-Boroujeny, Adaptive Filters: Theory and Applications, Wiley, New York,
1998.
17. D. Z. Feng, Z. Bao, and L. C. Jiao, Total Least Mean Squares Algorithm, IEEE Trans.
Signal Processing, vol. 46, pp. 2122 2130, 1998.
18. K. Gao, M. O. Ahmad, and M. N. S. Swamy, A Constrained Anti-Hebbian Learning
Algorithm for Total Least Squares Estimation with Applications to Adaptive FIR and IIR
Filtering, IEEE Trans. Circuits and Systems Part 2, vol. 41, pp. 718 729, 1994.
488
19. G. H. Golub and C. F. van Loan, An Analysis of the Total Least Squares Problem,
SIAM J. Numerical Analysis, vol. 17, pp. 883893, 1979.
20. G. H. Golub and C. F. van Loan, Matrix Computations, Johns Hopkins University Press,
Baltimore, 1989.
21. P. Hall, Limit Theorems for Sums of General Functions of m-Spacings, Mathematical
Proceedings of the Cambridge Philosophical Society, vol. 96 pp. 517 532, 1984.
22. S. Haykin, Neural Networks: A Comprehensive Foundation, Macmillan, New York,
1994.
23. S. Haykin, Adaptive Filter Theory, Prentice-Hall, Upper Saddle River, NJ, 1996.
24. L. F. Kozachenko and N. N. Leonenko, Sample Estimate of Entropy of a Random
Vector, Problems of Information Transmission, vol. 23, pp. 95 101, 1987.
25. H. J. Kushner and D. S. Clark, Stochastic Approximation Methods for Constrained and
Unconstrained Systems, Springer-Verlag, New York, 1978.
26. P. Lemmerling, Structured Total Least Squares: Analysis, Algorithms, and Applications,
Ph.D. dissertation, Katholeike University, Leuven, Belgium, 1999.
27. L. Ljung, Analysis of Recursive Stochastic Algorithms, IEEE Trans. Automatic
Control, vol. AC-22, pp. 551 575, 1977.
28. M. Mueller, Least-Squares Algorithms for Adaptive Equalizers, Bell Systems Technical
Journal, vol. 60, pp. 1905 1925, 1981.
29. E. Parzen, On Estimation of a Probability Density Function and Mode, in Time Series
Analysis Papers, Holden-Day, San Diego, CA, 1967.
30. J. C. Principe, N. Euliano, and C. Lefebvre, Neural and Adaptive Systems: Fundamentals
Through Simulations, Wiley, New York, 1999.
31. Y. N. Rao, Algorithms for Eigendecomposition and Time Series Segmentation, M.S.
thesis, University of Florida, Gainesville, FL, 2000.
32. Y. N. Rao, D. Erdogmus, and J. C. Principe, Error Whitening Criterion for Adaptive
Filtering, under review, IEEE Trans. Signal Processing, Oct. 2002.
33. Y. N. Rao and J. C. Principe, Efficient Total Least Squares Method for System Modeling
Using Minor Component Analysis, Proc. IEEE Workshop on Neural Networks for
Signal Processing XII, pp. 259–258, Sep. 2002.
34. P. A. Regalia, Adaptive IIR Filtering in Signal Processing and Control. Marcel Dekker,
New York, 1995.
35. M. Reuter, K. Quirk, J. Zeidler, and L. Milstein, Non-Linear Effects in LMS Adaptive
Filters, Proceedings of IEEE 2000 AS-SPCC, pp. 141–146, October 2000.
36. J. Rissanen, Stochastic Complexity in Statistical Inquiry, World Scientific, London, 1989.
37. C. E. Shannon and W. Weaver, The Mathematical Theory of Communication, University
of Illinois Press, Urbana, 1964.
38. H. C. So, Modified LMS Algorithm for Unbiased Impulse Response Estimation in
Nonstationary Noise, IEE Electronics Letters, vol. 35, pp. 791–792, 1999.
39. F. P. Tarasenko, On the Evaluation of an Unknown Probability Density Function, the
Direct Estimation of the Entropy from Independent Observations of a Continuous
Random Variable, and the Distribution-Free Entropy Test of Goodness-of-Fit,
Proceedings of the IEEE, vol. 56, pp. 2052–2053, 1968.
40. A. B. Tsybakov and E. C. van der Meulen, Root-n Consistent Estimators of Entropy for
Densities with Unbounded Support, Scandinavian Journal of Statistics, vol. 23,
pp. 75–83, 1994.
41. P. Viola, N. Schraudolph, and T. Sejnowski, Empirical Entropy Manipulation for
Real-World Problems, Proceedings of NIPS 95, pp. 851–857, 1995.
42. A. Yeredor, The Extended Least Squares Criterion: Minimization Algorithms and
Applications, IEEE Trans. Signal Processing, vol. 49, pp. 74–86, 2000.
43. T. Söderström and P. Stoica, System Identification, Prentice-Hall, London, UK, 1989.
44. Y. N. Rao, D. Erdogmus, G. Y. Rao, and J. C. Principe, Fast Error Whitening Algorithms
for System Identification and Control, submitted to IEEE Workshop on Neural Networks
for Signal Processing, April 2003.
INDEX
Acoustic echo cancellation FIR filter,
151
Acoustic echo control, 209
Active tap detection: heuristics, 162
Adaptive equalization, 309, 367
Adaptive linear combiners, 1
Adaptive linear predictor, 364
Adaptive plant identification, 3
Adaptive process, 49, 61
Adaptive process (small step-size), 38
Affine projection algorithms (APA),
241, 242
APA as a contraction mapping, 252
block exact APA, 269
block fast affine projection: summary,
272
Almost-sure convergence, 95
Alternate single-channel time-varying
equivalent to the two-channel LTI
Wiener filter, 375
Analysis of autocorrelation of the error
signal, 447
Asymptotic behavior of learning rate
matrix, 326
Asymmetry of the probability
distribution, 100
Maximum likelihood estimators, 110
Mean-square convergence, 93
Misadjustment, 8
Misadjustment due to gradient noise,
17
Misadjustment due to lag, 17
Mixed H2/H∞ problems, 141
Model-order selection, 457
Motivation for error-whitening Wiener
filters, 447
MSE learning curve time constants, 10
MSE optimality of the RPNLMS
algorithm, 327
Multichannel adaptive or Wiener
filtering scenario, 357
Weight-error correlations, 46
Weight-error correlations with delay, 74
Weight tracks and convergence, 480
Wideband adaptive noise canceller
(ANC) scenario, 340
Wiener solution, 4