Abstract—This paper presents a precise analysis of the critical path of the least-mean-square (LMS) adaptive filter for deriving its architectures for high-speed and low-complexity implementation. It is shown that the direct-form LMS adaptive filter has nearly the same critical path as its transpose-form counterpart, but provides much faster convergence and lower register complexity. From the critical-path evaluation, it is further shown that no pipelining is required for implementing a direct-form LMS adaptive filter in most practical cases, and that it can be realized with a very small adaptation delay in cases where a very high sampling rate is required. Based on these findings, this paper proposes three structures of the LMS adaptive filter: (i) Design 1 with no adaptation delay, (ii) Design 2 with only one adaptation delay, and (iii) Design 3 with two adaptation delays. Design 1 involves the minimum area and the minimum energy per sample (EPS). The best of the existing direct-form structures requires 80.4% more area and 41.9% more EPS than Design 1. Designs 2 and 3 involve slightly more EPS than Design 1 but offer nearly twice and thrice the maximum usable frequency (MUF) at the cost of 55.0% and 60.6% more area, respectively.
Index Terms—Adaptive filters, least mean square (LMS) algorithms, LMS adaptive filter, critical-path optimization.
I. INTRODUCTION
Adaptive digital filters find wide application in several digital signal processing (DSP) areas, e.g., noise and echo cancellation, system identification, channel estimation, and channel equalization. The tapped-delay-line finite-impulse-response (FIR) filter whose weights are updated by the famous Widrow-Hoff least-mean-square (LMS) algorithm [1] may be considered the simplest known adaptive filter. The LMS adaptive filter is popular not only due to its low complexity, but also due to its stability and satisfactory convergence performance [2]. Because of its many important applications of current relevance and the increasing constraints on area, time, and power complexity, efficient implementation of the LMS adaptive filter remains important.
To implement the LMS algorithm, one has to update the
filter weights during each sampling period using the estimated
error, which equals the difference between the current filter
output and the desired response. The weights of the LMS
adaptive filter during the nth iteration are updated according
to the following equations.
wn+1 = wn + μ en xn   (1a)
where
Manuscript submitted on xx, xx, xxxx. (Corresponding author: S. Y. Park.)
The authors are with the Institute for Infocomm Research, Singapore,
138632 (e-mail: pkmeher@i2r.a-star.edu.sg; sypark@i2r.a-star.edu.sg).
Copyright (c) 2013 IEEE. Personal use of this material is permitted.
However, permission to use this material for any other purposes must be
obtained from the IEEE by sending an email to pubs-permissions@ieee.org.
Fig. 1. Structure of the conventional LMS adaptive filter.
en = dn − yn   (1b)
yn = wnT xn   (1c)

where xn and wn are the input vector and the weight vector at the nth iteration, given respectively by

xn = [xn, xn−1, . . . , xn−N+1]T
wn = [wn(0), wn(1), . . . , wn(N−1)]T

and where dn is the desired response, yn is the filter output of the nth iteration, en denotes the error computed during the nth iteration, which is used to update the weights, μ is the convergence factor or step size, which is usually assumed to be a positive number, and N is the number of weights used in the LMS adaptive filter. The structure of a conventional LMS adaptive filter is shown in Fig. 1.
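For reference, the recursion (1a)-(1c) can be sketched directly in software. The following Python sketch is illustrative only; the function name and the test parameters below are our own choices, not from the paper:

```python
import numpy as np

def lms_filter(x, d, N, mu):
    """LMS adaptive filter per (1a)-(1c): returns output y, error e,
    and the final weight vector w."""
    w = np.zeros(N)                       # w_n, updated every iteration
    xpad = np.concatenate([np.zeros(N - 1), x])
    y = np.zeros(len(x))
    e = np.zeros(len(x))
    for n in range(len(x)):
        xn = xpad[n + N - 1 :: -1][:N]    # x_n = [x_n, ..., x_{n-N+1}]^T
        y[n] = w @ xn                     # (1c): y_n = w_n^T x_n
        e[n] = d[n] - y[n]                # (1b): e_n = d_n - y_n
        w = w + mu * e[n] * xn            # (1a): w_{n+1} = w_n + mu e_n x_n
    return y, e, w
```

Driving this with white noise against the output of a known FIR system makes the weights converge to that system's impulse response, which is the system-identification setting used throughout the paper.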
Since all weights are updated concurrently in every cycle to compute the output according to (1), direct-form realization of the FIR filter is a natural candidate for implementation. However, the direct-form LMS adaptive filter is often believed to have a long critical path due to the inner-product computation required to obtain the filter output. This belief is mainly based on the assumption that an arithmetic operation starts only after the complete input operand words are available/generated. For example, in the existing literature on implementation of LMS adaptive filters, it is assumed that the addition in a multiply-add operation (shown in Fig. 2) can proceed only after completion of the multiplication, and with this assumption, the critical path of the multiply-add operation (TMA) becomes TMULT + TADD, where TMULT and TADD are the times required for a multiplication and an addition, respectively. Under such an assumption, the critical path of the direct-form LMS adaptive filter (without pipelining) can be estimated as T = 2TMULT + (N + 1)TADD. Since this critical-path estimate is quite high, it could exceed the sample period required in many practical situations, and calls for a reduction of critical-path delay by pipelined implementation. However, the conventional LMS algorithm does not
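Under that complete-operand assumption, the estimate T = 2TMULT + (N + 1)TADD grows linearly in N. A small Python sketch makes the magnitude concrete; the delay values are assumptions taken from the 0.13-μm, L = 16 synthesis figures quoted later (TMULT ≈ 4.51 ns, TADD ≈ 2.03 ns), used here only for illustration:

```python
def naive_critical_path(N, t_mult, t_add):
    """Conventional critical-path estimate of the non-pipelined
    direct-form LMS filter: T = 2*T_MULT + (N + 1)*T_ADD."""
    return 2.0 * t_mult + (N + 1) * t_add

T_MULT, T_ADD = 4.51, 2.03    # assumed delays in ns
for N in (8, 16, 32):
    T = naive_critical_path(N, T_MULT, T_ADD)
    print(f"N = {N:2d}: T = {T:5.2f} ns, max sample rate = {1e3 / T:5.1f} MHz")
```

With these assumed delays, even N = 32 caps the sample rate near 13 MHz, which is why pipelining is usually assumed to be necessary.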
TABLE I
SYNTHESIS RESULTS OF COMPONENT DELAYS (ns) USING TSMC 0.13-μm AND 90-nm CMOS TECHNOLOGY LIBRARIES

| Technology | L  | TFAC | TFAS | TADD | TMULT | TMA  |
|------------|----|------|------|------|-------|------|
| 0.13-μm    | 8  | 0.21 | 0.14 | 1.08 | 2.43  | 2.79 |
| 0.13-μm    | 16 | 0.23 | 0.18 | 2.03 | 4.51  | 4.90 |
| 90-nm      | 8  | 0.10 | 0.1  | 0.74 | 1.54  | 1.83 |
| 90-nm      | 16 | –    | 0.16 | 1.42 | 3.00  | 3.29 |
MEHER AND PARK: CRITICAL-PATH ANALYSIS AND LOW-COMPLEXITY IMPLEMENTATION OF THE LMS ADAPTIVE ALGORITHM
[Fig. 3: structure of the delayed LMS (DLMS) adaptive filter, comprising an error-computation block and a weight-update block separated by m adaptation delays, with delayed error en−m and input xn−m.]

[Fig. 6: learning curves (mean squared error in dB versus iteration number) of the direct-form LMS and the direct-form DLMS with m = 5 and m = 10, for N = 32.]
Direct-form adaptive filters with different values of adaptation delay are simulated for a system identification problem, where the system is defined by a bandpass filter with impulse response given by

hn = sin(ωH(n − 7.5)) / (π(n − 7.5)) − sin(ωL(n − 7.5)) / (π(n − 7.5))   (3)

where ωH and ωL are the high and low cutoff frequencies, respectively. Fig. 6 shows the learning curves for identification of this system.
The weights of the delayed LMS (DLMS) adaptive filter are updated by

wn+1 = wn + μ en−m xn−m   (2a)

where

en−m = dn−m − yn−m   (2b)

and

yn = wnT xn.   (2c)

For the transpose-form LMS adaptive filter, the filter output and the error are given by

yn = Σ (k = 0 to N−1) wn−k(k) xn−k   (4a)
en = dn − yn.   (4b)
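To make the delay inherent in (4a) concrete, the following Python sketch keeps a short history of weight vectors so that tap k uses wn−k(k). The weight update used here follows the plain transpose-form recursion wn+1(k) = wn(k) + μ en xn−k; this recursion, the function name, and the test parameters are illustrative assumptions:

```python
import numpy as np

def transpose_form_lms(x, d, N, mu):
    """Transpose-form LMS: the output (4a) combines, at tap k, the weight
    value w_{n-k}(k) that was updated k cycles earlier, so the filter is
    inherently a delayed LMS."""
    xpad = np.concatenate([np.zeros(N - 1), x])
    w = np.zeros(N)
    hist = [w.copy() for _ in range(N)]   # hist[k] = weight vector k cycles ago
    y = np.zeros(len(x))
    for n in range(len(x)):
        xv = xpad[n + N - 1 :: -1][:N]    # [x_n, x_{n-1}, ..., x_{n-N+1}]
        y[n] = sum(hist[k][k] * xv[k] for k in range(N))   # (4a)
        e = d[n] - y[n]                   # (4b)
        w = w + mu * e * xv               # assumed transpose-form update
        hist = [w.copy()] + hist[:-1]     # age the weight history
    return y, w
```

Despite the stale per-tap weights, the filter still converges to the unknown system for a small enough step size, only more slowly than the direct form.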
[Fig. 7: structure of the transpose-form DLMS adaptive filter, with its error-computation block and weight-update block.]
[Fig. 8(a): learning curves of the direct-form LMS, direct-form DLMS, and transpose-form LMS for N = 16.]

The weights of the transpose-form DLMS adaptive filter are updated by

wn+1(k) = wn(k) + μ en xn−k   (5)
but the equation to compute the filter output remains the same as that of (4a). The structure of the transpose-form DLMS adaptive filter is shown in Fig. 7.
It is noted that in (4a), the weight values used to compute the filter output yn at the nth cycle are updated at different cycles, such that the (k+1)th weight value wn−k(k) was updated k cycles back, for k = 0, 1, . . . , N − 1. The transpose-form LMS is, therefore, inherently a delayed LMS and consequently provides slower convergence performance. To compare the convergence performance of LMS adaptive filters of different configurations, we have simulated the direct-form LMS, direct-form DLMS, and transpose-form LMS for the same system identification problem, where the system is defined by (3), using the same simulation configuration. The learning curves thus obtained for filter lengths N = 16, 32, and 48 are shown in Fig. 8. We find that the direct-form LMS adaptive filter provides much faster convergence than the transpose-form LMS adaptive filter in all cases. The direct-form DLMS adaptive filter with delay 5 also provides faster convergence than the transpose-form LMS adaptive filter without any delay. However, the residual mean-square error is found to be nearly the same in all cases.
From Fig. 7, it can be further observed that the transpose-form LMS involves significantly higher register complexity than the direct-form implementation, since it requires an additional signal-path delay line for weight updating, and the registers on the adder line used to compute the filter output are at least twice the size of those in the delay line of the direct-form LMS adaptive filter.
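The qualitative effect of adaptation delay on convergence can be reproduced with a short Monte-Carlo sketch. It does not use the paper's bandpass system of (3) (an arbitrary 8-tap system and step size are assumed instead), but it shows the same trend as Fig. 6: the total squared error accumulated during adaptation grows with the adaptation delay m:

```python
import numpy as np

def dlms_sq_error(h, mu, m, n_iter, seed):
    """Squared-error curve of the direct-form DLMS (adaptation delay m)
    identifying the FIR system h; m = 0 reduces to the plain LMS."""
    rng = np.random.default_rng(seed)
    N = len(h)
    x = rng.standard_normal(n_iter + N)
    w = np.zeros(N)
    pending = []                          # (e_j, x_j) pairs waiting m cycles
    sq = np.zeros(n_iter)
    for n in range(n_iter):
        xv = x[n + N - 1 :: -1][:N]       # [x_n, ..., x_{n-N+1}]
        e = h @ xv - w @ xv               # e_n = d_n - y_n with d_n = h^T x_n
        sq[n] = e * e
        pending.append((e, xv))
        if len(pending) > m:              # e_{n-m}, x_{n-m} now available
            em, xm = pending.pop(0)
            w = w + mu * em * xm          # delayed update (2a)
    return sq

def mean_total_sq(m, trials=20):
    """Total squared error, averaged over independent input realizations."""
    h = np.ones(8) / 8                    # assumed test system
    return np.mean([dlms_sq_error(h, 0.01, m, 600, s).sum()
                    for s in range(trials)])
```

With these assumed parameters, the zero-delay filter accumulates the least total squared error, mirroring the ordering of the learning curves.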
[Fig. 8(b) and (c): learning curves of the direct-form LMS, direct-form DLMS, and transpose-form LMS for N = 32 and N = 48, with mean squared error (dB) plotted against iteration number.]

If the error computation and the weight updating are performed in the same cycle, the critical path of the LMS adaptive filter is

T = CERROR + CUPDATE   (6)

whereas, if they are performed in two separate pipeline stages,

T = max{CERROR, CUPDATE}   (7)

where CERROR and CUPDATE are the critical-path delays of the error-computation block and the weight-update block, respectively. Using (6) and (7), we discuss in the following the critical paths of direct-form and transpose-form LMS adaptive filters.
[Fig. 4(a): gate-level model of the inner-product computation, with the multipliers (MULT-1, MULT-2) followed by an array of full adders (FA) and half adders (HA) accumulating the 2W-bit product words.]
[Fig. 4(b) and (c): four-term multiply-add structure (MULT-1 to MULT-4 with adders ADD-1 and ADD-2) and gate-level models of the adder cells, where the half-adder computes CO = A·B and S = A ⊕ B, the full-adder computes CO = A·B + CI·(A ⊕ B) and S = A ⊕ B ⊕ CI, and the three-input XOR computes Z = A1 ⊕ A2 ⊕ A3.]
Since each adder in the chain can proceed as soon as the carry and sum bits of the preceding cell are generated, the delay contributed by each addition on the critical path amounts to only

TADD = TFAC + TFAS.   (8)

TABLE II
SYNTHESIS RESULTS OF INNER-PRODUCT COMPUTATION TIME (ns) USING TSMC 0.13-μm AND 90-nm CMOS TECHNOLOGY LIBRARIES

| Inner-Product Length N | 2    | 4    | 8    | 16   | 32   |
|------------------------|------|------|------|------|------|
| 0.13-μm, L = 8         | 2.79 | 3.19 | 3.60 | 4.00 | 4.40 |
| 0.13-μm, L = 16        | 4.90 | 5.30 | 5.71 | 6.11 | 6.51 |
| 90-nm, L = 8           | 1.83 | 2.10 | 2.37 | 2.64 | 2.90 |
| 90-nm, L = 16          | 3.29 | 3.56 | 3.82 | 4.09 | 4.36 |

Critical path: TMUL + (log2 N)(TFAD + 2TXOR). TMUL: delay from port A0 to port P2W−1 in the multiplier. TFAD: delay from port A to port CO in the full adder. TXOR: delay from port A3 to port Z in the 2-input XOR.

Similarly, the addition of the second-level adder (ADD-3) (and hence the inner-product computation of length 4) is completed in time TMULT + 2TADD. In general, an inner product of length N (shown in Fig. 4) involves a delay of

TIP = TMULT + ⌈log2 N⌉ TADD.   (9)

In order to validate (9), we show in Table II the time required for the computation of inner products of different lengths for word-lengths 8 and 16 using TSMC 0.13-μm and 90-nm process libraries. Using the multiplication time and the time required for carry-and-sum generation in a 1-bit full adder, obtained from Table I, we find that the results shown in Table II are in conformity with those given by (9). The critical path of the error-computation block therefore amounts to

CERROR = TMULT + (⌈log2 N⌉ + 1) TADD.   (10)

For the computation of the weight-update unit shown in Fig. 5, if we assume the step size μ to be a power-of-2 fraction, so that multiplication by μ reduces to a shift, the critical path amounts to

CUPDATE = TMULT + TADD.   (11)

Using (6), (10), and (11), we can find the critical path of the non-pipelined direct-form LMS adaptive filter to be

T = 2TMULT + (⌈log2 N⌉ + 2) TADD.   (12)
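As a sanity check, (8)-(12) can be evaluated numerically. The sketch below assumes the 90-nm, L = 16 figures (TMULT ≈ 3.00 ns and an effective per-level add delay TADD = TFAC + TFAS ≈ 0.29 ns) and compares the prediction of (9) with the synthesized inner-product times of Table II; the constants are read off the tables and are not part of the formulas themselves:

```python
import math

T_MULT = 3.00                  # multiplier delay, 90-nm, L = 16 (ns)
T_ADD = 0.29                   # effective add delay T_FAC + T_FAS per (8)

def t_ip(N):
    """Inner-product delay per (9): T_IP = T_MULT + ceil(log2 N) * T_ADD."""
    return T_MULT + math.ceil(math.log2(N)) * T_ADD

def c_error(N):
    """Error-computation critical path per (10)."""
    return T_MULT + (math.ceil(math.log2(N)) + 1) * T_ADD

def t_direct(N):
    """Non-pipelined direct-form critical path per (12), i.e.
    C_ERROR + C_UPDATE with C_UPDATE = T_MULT + T_ADD from (11)."""
    return 2 * T_MULT + (math.ceil(math.log2(N)) + 2) * T_ADD

# Synthesized inner-product times from Table II (90-nm, L = 16).
measured = {2: 3.29, 4: 3.56, 8: 3.82, 16: 4.09, 32: 4.36}
```

With these assumed delays, the predictions of (9) track the synthesized values to within about 0.1 ns over the whole range of N.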
From (13) and (17), we can find that the critical path of the transpose-form DLMS adaptive filter is nearly the same as that of the direct-form implementation in which weight updating and error computation are performed in two separate pipeline stages.
[Fig. 10: proposed structure with zero adaptation delay (Design 1). The input samples from the tapped delay line and the weight registers wn(0), . . . , wn(N−1) feed N multipliers through 2:1 multiplexors; the products are routed by de-multiplexors either to the adder tree that produces yn, followed by the subtractor that computes en from the desired response dn, or to the adders that update the weights with the products μen xn−k.]
Fig. 10, which consists of N multipliers. The input samples are fed to the multipliers from a common tapped delay line. The N weight values (stored in N registers) and the estimated error value μen (after right-shifting by a fixed number of locations to realize multiplication by the step size μ) are fed to the multipliers as the other input through a 2:1 multiplexer. Apart from this, the proposed structure requires N adders for modification of the N weights, and an adder tree to add the outputs of the N multipliers for computation of the filter output. Also, it requires a subtractor to compute the error value and N 2:1 de-multiplexors to move the product values either towards the adder tree or towards the weight-update circuit. All the multiplexors and de-multiplexors are controlled by a clock signal.
The registers in the delay line are clocked at the rising edge of the clock pulse and remain unchanged for a complete clock period, since the structure is required to take one new sample in every clock cycle. During the first half of each clock period, the weight values stored in the different registers are fed to the multipliers through the multiplexors to compute the filter output. The product words are then fed to the adder tree through the de-multiplexors. The filter output is computed by the adder tree, and the error value is computed by the subtractor. Then the computed error value is right-shifted to obtain μen and is broadcast to all N multipliers of the weight-update circuits. Note that the LMS adaptive filter requires at least one delay at a suitable location to break the recursive loop. A delay could be inserted either after the adder tree, after the en computation, or after the μen computation. If the delay is placed just after the adder tree, then the critical path shifts to the weight-updating circuit and gets increased by TADD. Therefore, we should place the delay after computation of en or μen, but preferably after the μen computation to reduce the register width.
The first half-cycle of each clock period ends with the computation of en, and during the second half-cycle, the μen value is fed to the multipliers through the multiplexors to calculate μen xn and de-multiplexed out to be added to the stored weight values to produce the new weights according to (2a). The computation during the second half of a clock period is completed once a new set of weight values is computed. The updated weight values are used in the first half-cycle of the next clock cycle for computation of the filter output and for subsequent error estimation. When the next cycle begins, the weight registers are also updated by the new weight values. Therefore, the weight registers are also clocked at the rising edge of each clock pulse.
The time required for error computation is more than that of weight updating. The system clock period could be less if we just performed these operations one after the other in every cycle. This is possible since all the register contents also change only once at the beginning of a clock cycle, but we cannot exactly determine when the error computation is over and when the weight updating is completed. Therefore, we need to perform the error computation during the first half-cycle and the weight updating during the second half-cycle. Accordingly, the clock period of the proposed structure is twice the critical-path delay of the error-computation block CERROR, which we can find using (14) as

T = 2[TMULT + 2TMUX + (⌈log2 N⌉ + 1)TADD].   (18)
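The two-phase operation described above can be modeled cycle by cycle. In this Python sketch (the function name, the shift amount, and the test system are our own assumptions), each call performs one clock period: the first half computes yn and en with the current weights, and the second half updates the weights with μen xn, where the power-of-2 step size is realized as a right shift. Functionally this is the zero-delay LMS recursion:

```python
import numpy as np

def design1_clock_cycle(w, xv, d_n, mu_shift):
    """One clock period of the zero-adaptation-delay structure.
    First half-cycle: multipliers see (x, w); the adder tree gives y_n
    and the subtractor gives e_n.  Second half-cycle: multipliers see
    (x, mu*e_n) and the products update the weight registers."""
    y_n = w @ xv                          # adder-tree output
    e_n = d_n - y_n                       # subtractor output
    mu_e = e_n * 2.0 ** (-mu_shift)       # right shift realizes mu = 2**-shift
    w_next = w + mu_e * xv                # weight-update adders
    return w_next, y_n, e_n
```

Iterating this cycle over an input stream reproduces the plain LMS weight trajectory, confirming that the half-cycle multiplexing introduces no adaptation delay.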
[Fig. 11: proposed structure with one adaptation delay (Design 2), where the error en−1 computed for the (n−1)th input and desired response dn−1 is used to update the weights.]
TABLE III
COMPARISON OF HARDWARE AND TIME COMPLEXITIES OF DIFFERENT ARCHITECTURES

[Table III compares the existing DLMS architectures with the proposed designs in terms of: critical-path delay, e.g., 2TMULT + (⌈log2 N⌉ + 2)TADD for the non-pipelined direct form, 2TMULT + [⌈log2 (N + 1)⌉ + 1]TADD, TMULT + (⌈log2 N⌉ + 1)TADD, and 2[TMULT + 2TMUX + (⌈log2 N⌉ + 1)TADD] for proposed Design 1; adaptation delay m, equal to log2 N + 1, log2 N + 3, N/4 + 3, and 5 for the existing structures and 0, 1, and 2 for proposed Designs 1, 2, and 3; number of adders (2N for most designs, with 2N + 1 and 10N + 2 exceptions); number of multipliers (2N for most designs, N or 0 for multiplier-shared designs); and number of delays (ranging from 2N − 1 up to 7N + 2 and 2N + 14 + E for the existing structures, and 2N, 2N + 1, and 2.5N + 2 for the proposed designs).]

E = 24, 40, and 48 for N = 8, 16, and 32, respectively. Proposed Design 1 needs an additional N MUX and N DMUX. It is assumed in all these structures that multiplication with the step size does not need a multiplier.
TABLE IV
PERFORMANCE COMPARISON OF DLMS ADAPTIVE FILTER CHARACTERISTICS BASED ON SYNTHESIS USING TSMC 90-nm LIBRARY

| Design | N | DAT (ns) | MUF (MHz) | Adaptation Delay | Area (sq.um) | ADP (sq.um·ns) | PCMUF (mW) | Normalized Power (mW) | EPS (mW·ns) |
|---|---|---|---|---|---|---|---|---|---|
| Existing structure | 8 | 3.14 | 318 | 6 | 48595 | 152588 | 7.16 | 1.21 | 22.25 |
| | 16 | 3.14 | 318 | 7 | 97525 | 306228 | 14.21 | 2.42 | 44.16 |
| | 32 | 3.14 | 318 | 8 | 196017 | 615493 | 28.28 | 4.82 | 87.89 |
| Existing structure | 8 | 3.06 | 326 | 5 | 50859 | 155628 | 8.53 | 1.40 | 25.88 |
| | 16 | 3.06 | 326 | 7 | 102480 | 313588 | 17.13 | 2.82 | 51.94 |
| | 32 | 3.06 | 326 | 11 | 206098 | 630659 | 34.35 | 5.65 | 104.13 |
| Existing structure | 8 | 3.27 | 305 | 5 | 50717 | 165844 | 6.49 | 1.16 | 20.97 |
| | 16 | 3.27 | 305 | 5 | 102149 | 334027 | 12.39 | 2.23 | 40.00 |
| | 32 | 3.27 | 305 | 5 | 205270 | 671232 | 22.23 | 4.04 | 71.66 |
| Existing structure | 8 | 2.71 | 369 | – | 50357 | 136467 | 8.51 | 1.25 | 22.88 |
| | 16 | 2.81 | 355 | – | 95368 | 267984 | 14.27 | 2.19 | 39.70 |
| | 32 | 2.91 | 343 | – | 185158 | 538809 | 26.44 | 4.22 | 76.05 |
| Proposed Design 1 (no adaptation delay) | 8 | 7.54 | 132 | 0 | 26335 | 198565 | 1.99 | 0.79 | 14.48 |
| | 16 | 8.06 | 124 | 0 | 53088 | 427889 | 3.73 | 1.58 | 28.84 |
| | 32 | 8.60 | 116 | 0 | 108618 | 934114 | 7.00 | 3.16 | 58.37 |
| Proposed Design 2 (one adaptation delay) | 8 | 3.75 | 266 | 1 | 41131 | 154241 | 4.00 | 0.83 | 14.74 |
| | 16 | 4.01 | 249 | 1 | 82639 | 331382 | 7.43 | 1.65 | 29.23 |
| | 32 | 4.27 | 234 | 1 | 166450 | 710741 | 13.96 | 3.31 | 58.42 |
| Proposed Design 3 (two adaptation delays) | 8 | 3.31 | 302 | 2 | 42729 | 141432 | 5.02 | 0.91 | 16.42 |
| | 16 | 3.31 | 302 | 2 | 85664 | 283547 | 9.92 | 1.81 | 32.39 |
| | 32 | 3.31 | 302 | 2 | 171979 | 569250 | 19.93 | 3.65 | 65.10 |

DAT: data arrival time, MUF: maximum usable frequency, ADP: area-delay product, PCMUF: power consumption at maximum usable frequency, normalized power: power consumption at 50 MHz, and EPS: energy per sample.
80.4% more area and 41.9% more EPS than proposed Design 1. Proposed Design 3 involves relatively fewer adaptation delays and provides a similar MUF as the structures of [10] and [5]. It involves slightly less ADP and provides around 16% to 26% savings in EPS over the others. Proposed Designs 2 and 3 involve slightly more EPS than proposed Design 1 but offer nearly twice and thrice the MUF, at the cost of 55.0% and 60.6% more area, respectively. However, proposed Design 1 could be the preferred choice over proposed Designs 2 and 3 in most communication applications, since it provides adequate speed performance and involves significantly less area and EPS.
REFERENCES
[1] B. Widrow and S. D. Stearns, Adaptive Signal Processing. Englewood Cliffs, NJ: Prentice-Hall, 1985.
[2] S. Haykin and B. Widrow, Least-Mean-Square Adaptive Filters. Hoboken, NJ: Wiley-Interscience, 2003.
[3] G. Long, F. Ling, and J. G. Proakis, "The LMS algorithm with delayed coefficient adaptation," IEEE Trans. Acoust., Speech, Signal Process., vol. 37, no. 9, pp. 1397–1405, Sep. 1989.
[4] M. D. Meyer and D. P. Agrawal, "A modular pipelined implementation of a delayed LMS transversal adaptive filter," in Proc. IEEE International Symposium on Circuits and Systems, May 1990, pp. 1943–1946.
[5] L. D. Van and W. S. Feng, "An efficient systolic architecture for the DLMS adaptive filter and its applications," IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 48, no. 4, pp. 359–366, Apr. 2001.
[6] L.-K. Ting, R. Woods, and C. F. N. Cowan, "Virtex FPGA implementation of a pipelined adaptive LMS predictor for electronic support measures receivers," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 13, no. 1, pp. 86–99, Jan. 2005.