Professional Documents
Culture Documents
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 20, NO. 2, FEBRUARY 2012
ACKNOWLEDGMENT
The authors would like to thank National Chip Implementation
Center (CIC), Taiwan for technical support in simulations. The authors
would also like to thank Y.-R. Cho and S.-W. Chen for their assistance
in simulations and layouts.
REFERENCES
[1] H. Kawaguchi and T. Sakurai, A reduced clock-swing flip-flop
(RCSFF) for 63% power reduction, IEEE J. Solid-State Circuits, vol.
33, no. 5, pp. 807811, May 1998.
[2] A. G. M. Strollo, D. De Caro, E. Napoli, and N. Petra, A novel high
speed sense-amplifier-based flip-flop, IEEE Trans. Very Large Scale
Integr. (VLSI) Syst., vol. 13, no. 11, pp. 12661274, Nov. 2005.
[3] H. Partovi, R. Burd, U. Salim, F. Weber, L. DiGregorio, and D. Draper,
Flow-through latch and edge-triggered flip-flop hybrid elements, in
IEEE Tech. Dig. ISSCC, 1996, pp. 138139.
[4] F. Klass, C. Amir, A. Das, K. Aingaran, C. Truong, R. Wang, A. Mehta,
R. Heald, and G. Yee, A new family of semi-dynamic and dynamic flip
flops with embedded logic for high-performance processors, IEEE J.
Solid-State Circuits, vol. 34, no. 5, pp. 712716, May 1999.
[5] S. D. Naffziger, G. Colon-Bonet, T. Fischer, R. Riedlinger, T. J.
Sullivan, and T. Grutkowski, The implementation of the Itanium 2
microprocessor, IEEE J. Solid-State Circuits, vol. 37, no. 11, pp.
14481460, Nov. 2002.
[6] J. Tschanz, S. Narendra, Z. Chen, S. Borkar, M. Sachdev, and V. De,
Comparative delay and energy of single edge-triggered and dual edge
triggered pulsed flip-flops for high-performance microprocessors, in
Proc. ISPLED, 2001, pp. 207212.
[7] B. Kong, S. Kim, and Y. Jun, Conditional-capture flip-flop for statistical power reduction, IEEE J. Solid-State Circuits, vol. 36, no. 8, pp.
12631271, Aug. 2001.
[8] N. Nedovic, M. Aleksic, and V. G. Oklobdzija, Conditional precharge
techniques for power-efficient dual-edge clocking, in Proc. Int. Symp.
Low-Power Electron. Design, Monterey, CA, Aug. 1214, 2002, pp.
5659.
[9] P. Zhao, T. Darwish, and M. Bayoumi, High-performance and low
power conditional discharge flip-flop, IEEE Trans. Very Large Scale
Integr. (VLSI) Syst., vol. 12, no. 5, pp. 477484, May 2004.
[10] C. K. Teh, M. Hamada, T. Fujita, H. Hara, N. Ikumi, and Y. Oowaki,
Conditional data mapping flip-flops for low-power and high-performance systems, IEEE Trans. Very Large Scale Integr. (VLSI) Systems,
vol. 14, pp. 13791383, Dec. 2006.
[11] S. H. Rasouli, A. Khademzadeh, A. Afzali-Kusha, and M. Nourani,
Low power single- and double-edge-triggered flip-flops for high speed
applications, Proc. Inst. Electr. Eng.Circuits Devices Syst., vol. 152,
no. 2, pp. 118122, Apr. 2005.
[12] H. Mahmoodi, V. Tirumalashetty, M. Cooke, and K. Roy, Ultra low
power clocking scheme using energy recovery and clock gating, IEEE
Trans. Very Large Scale Integr. (VLSI) Syst., vol. 17, pp. 3344, Jan.
2009.
[13] P. Zhao, J. McNeely, W. Kaung, N. Wang, and Z. Wang, Design of
sequential elements for low power clocking system, IEEE Trans. Very
Large Scale Integr. (VLSI) Syst., to be published.
[14] Y.-H. Shu, S. Tenqchen, M.-C. Sun, and W.-S. Feng, XNOR-based
double-edge-triggered flip-flop for two-phase pipelines, IEEE Trans.
Circuits Syst. II, Exp. Briefs, vol. 53, no. 2, pp. 138142, Feb. 2006.
[15] V. G. Oklobdzija, Clocking and clocked storage elements in a multigiga-hertz environment, IBM J. Res. Devel., vol. 47, pp. 567584, Sep.
2003.
I. INTRODUCTION
Due to the explosive growth of multimedia application, the demand
for high-performance and low-power digital signal processing (DSP)
is getting higher and higher. Finite-impulse response (FIR) digital filters are one of the most widely used fundamental devices performed
in DSP systems, ranging from wireless communications to video and
image processing. Some applications need the FIR filter to operate at
high frequencies such as video processing, whereas some other applications request high throughput with a low-power circuit such as multiple-input multiple-output (MIMO) systems used in cellular wireless
communication. Furthermore, when narrow transition-band characteristics are required, the much higher order in the FIR filter is unavoidable. For example, a 576-tap digital filter is used in a video ghost canceller for broadcast television, which reduces the effect of multipath
signal echoes. On the other hand, parallel and pipelining processing are
two techniques used in DSP applications, which can both be exploited
to reduce the power consumption. Pipelining shortens the critical path
by interleaving pipelining latches along the datapath, at the price of
increasing the number of latches and the system latency, whereas parallel processing increase the sampling rate by replicating hardware so
that multiple inputs can be processed in parallel and multiple outputs
are generated at the same time, at the expense of increased area. Both
techniques can reduce the power consumption by lowering the supply
voltage, where the sampling speed does not increase. In this paper, parallel processing in the digital FIR filter will be discussed. Due to its
linear increase in the hardware implementation cost brought by the increase of the block size L, the parallel processing technique loses its
advantage in practical implementation. There have been a few papers
Manuscript received July 30, 2010; revised September 20, 2010, October 22,
2010; accepted November 20, 2010. Date of publication December 30, 2010;
date of current version January 18, 2012.
The authors are with Department of Electrical and Computer Engineering,
Illinois Institute of Technology, Chicago, IL 60616 USA (e-mail: ytsao@iit.edu;
kchoi@ece.iit.edu).
Digital Object Identifier 10.1109/TVLSI.2010.2095892
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 20, NO. 2, FEBRUARY 2012
367
N=L
N N=L
L
L N
Consider an
form as
y(n) =
N01
i=0
h(i)x(n 0 i);
n = 0; 1; 2; . . . ; 1
(1)
Y0 = H0 X0 + z02 H1 X1
Y1 = (H0 + H1 )(X0 + X1 ) 0 H0 X0 0 H1 X1 :
(5)
N=
N=
N=
L = 3)
B. 3 2 3 FFA (
Y0 = H0 X0 0 z03 H2 X2 + z03
2 [(H1 + H2 )(X1 + X2 ) 0 H1 X1 ]
L01
L01
L01
Y1 = [(H0 + H1 )(X0 + X1 ) 0 H1 X1 ]
Yp (zL )z0p = Xq (zL )z0q Hr (zL )z0r (2)
0 (H0 X0 0 z03 H2 X2 )
p=0
q=0
r=0
Y2 = [(H0 + H1 + H2 )(X0 + X1 + X2 )]
1 z 0k x(Lk + q);Hr
X
=
=
where
q
k
=0
0 [(H0 + H1 )(X0 + X1 ) 0 H1 X1 ]
(N=L)01 0k
1 z 0k x(Lk + p);
z x(Lk + r);Yp
=
k=0
k=0
0 [(H1 + H2 )(X1 + X2 ) 0 H1 X1 ]:
(6)
for p; q; r = 0; 1; 2; . . . ; L 0 1. From this FIR filtering equation, it
2
shows that the traditional FIR filter will require L -FIR subfilter
The hardware implementation of (6) requires six length-N=3 FIR subblocks of length N=L for implementation.
filter blocks, three preprocessing and seven postprocessing adders, and
three N multipliers and 2N + 4 adders, which has reduced approxiA. 2 2 2 FFA (L = 2)
mately one third over the traditional three-parallel filter hardware cost.
The implementation obtained from (6) is shown in Fig. 2.
Y0 = H0 X0 + z02 H1 X1 ;
Y1 = H0 X1 + H1 X0:
(4)
N=
FOR
To utilize the symmetry of coefficients, the main idea behind the proposed structures is actually pretty intuitive, to manipulate the polyphase
decomposition to earn as many subfilter blocks as possible which contain symmetric coefficients so that half the number of multiplications
in the single subfilter block can be reused for the multiplications of
whole taps, which is similar to the fact that a set of symmetric coefficients would only require half the filter length of multiplications in a
single FIR filter. Therefore, for an -tap -parallel FIR filter the total
368
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 20, NO. 2, FEBRUARY 2012
N= L
L = 2)
A. 2 2 2 Proposed FFA (
Y0 =
1
2
[(
H0 + H1 )(X0 + X1 )
H0 0 H1 )(X0 0 X1 )] 0 H1 X1
+(
z02 H1 X1 ;
(7)
Fig. 5. Proposed three-parallel FIR filter implementation.
to the proposed two-parallel FIR filter structure, and the top two
subfilter blocks will be as
Fig. 6. Comparison of subfilter blocks between existing FFA and the proposed
FFA three-parallel FIR structures.
(8)
As can be seen from the example above, two of three subfilter blocks
from the proposed two-parallel FIR filter structure, 0 0 1 and 0 +
1 , are with symmetric coefficients now, as (8), which means the subfilter block can be realized by Fig. 4, with only half the amount of multipliers required. Each output of multipliers responds to two taps. Note
that the transposed direct-form FIR filter is employed. Compared to
the existing FFA two-parallel FIR filter structure, the proposed FFA
structure leads to one more subfilter block which contains symmetric
coefficients. However, it comes with the price of the increase of amount
of adders in preprocessing and postprocessing blocks. In this case, two
additional adders are required for = 2.
H H
Fig. 7. Comparison of subfilter blocks between existing FFA and the proposed
FFA four-parallel FIR structures.
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 20, NO. 2, FEBRUARY 2012
369
B. 3 2 3 Proposed FFA (L = 3)
With the similar approach, from (6), a three-parallel FIR filter can
also be written as (9). Fig. 5 shows implementation of the proposed
three-parallel FIR filter. When the number of symmetric coefficients
is the multiple of 3, the proposed three-parallel FIR filter structure
presented in (9) enables four subfilter blocks with symmetric coefficients in total, whereas the existing FFA parallel FIR filter structure
has only two ones out of six subfilter blocks. A comparison figure is
shown in Fig. 6, where the shadow blocks stand for the subfilter blocks
which contain
Y0 = 2 [(H0 + H1 )(X0 + X1 )
+ (H0 0 H1 )(X0 0 X1 )] 0 H1 X1
03
+ z f(H0 + H1 + H2 )(X0 + X1 + X2 )
0 (H0 + H2 )(X0 + X2 )
0 21 [(H0 + H1 )(X0 + X1 )
0 (H0 0 H1 )(X0 0 X1 )] 0 H1 X1g
1
Y1 = 2 [(H0 + H1 )(X0 + X1 )
0 (H0 0 H1 )(X0 0 X1 )]
03 1 [(H0 + H2 )(X0 + X2 )
+z
2
+(H0 0 H2 )(X0 0 X2 )]
0 12 [(H0 + H1 )(X0 + X1 )
1
L;
N=
H0 0 H1 )(X0 0 X1 )] + H1 X1
N
N=
+ (
Y2 = 21 [(H0 + H2 )(X0 + X2 )
0 (H0 0 H2 )(X0 0 X2 )] + H1 X1
N=
(9)
When an -parallel FIR filter comes with a set of symmetric coefficients of length
the number of required multipliers for the proposed
parallel FIR filter structures is provided by (10) and (11).
N;
370
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 20, NO. 2, FEBRUARY 2012
TABLE I
COMPARISON OF PROPOSED AND THE EXISTING FFA STRUCTURES NUMBER
OF REQUIRED MULTIPLIERS (M.), REDUCED MULTIPLIERS (R.M.), NUMBER
OF REQUIRED ADDERS IN SUBFILTER SECTION (SUB.), NUMBER OF REQUIRED
ADDERS IN PRE/POSTPROCESSING BLOCKS (PRE/POST.), AND NUMBER OF
THE INCREASED ADDERS (I.A.)
TABLE II
COMPARISON OF STRUCTURES FOR A 144-TAP FIR FILTER NUMBER OF
REQUIRED MULTIPLIERS (M.), NUMBER OF REQUIRED ADDERS (A.), NUMBER
OF REQUIRED DELAY ELEMENTS (D.)
TABLE III
COMPARISON OF AREA
TABLE IV
COMPARISON OF POWER
TABLE V
COMPARISON OF CRITICAL PATH DELAY
Asub =
Case 1:
When
r
i=1
Li is even;
M=
r
i=1
Li
i=1
Mi 0 S2 :
(10)
Case 2:
When
r
i=1
Li is odd;
M = rN Li Mi 0 S2
i=1
i=1
r
Li 0 1 : (11)
Li is the small parallel block size such as (2 2 2) or (3 2 3) FFA.
r is the number of FFAs used. Mi is the number of subfilter blocks
r
i=1
r
i=1
Mi
r
i=1
Li 0 1 :
(12)
A comparison between the proposed and the existing FFA structures for
even symmetric coefficients with different length under different level
of parallelism is summarized in Table I. Also, a comparison between
the proposed structures and other structures for a 144-tap FIR filter with
parallel block 4 and 8 is shown in Table II.
V. IMPLEMENTATION AND EXPERIMENTAL RESULT
The proposed FFA structures and the existing FFA structures are implemented in Verilog HDL with filter length of 24 and 72, word length
16-bit and 32-bit, respectively. Two sets of the ideal low-pass FIR filter
symmetric coefficients of length 24 and 72 are generated by MATLAB
using Remez Exchange algorithm. The maximum absolute difference
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 20, NO. 2, FEBRUARY 2012
(MAD) algorithm introduced in [1], [2] is used for coefficients quantization. The subfilter is based on canonical signed digit (CSD) structure
and Carry-Save adders are used. Tables III, IV, and V show the results
of area, power, and critical path delay, synthesized by Design Compiler
[10] with 45-nm technology.
VI. CONCLUSION
In this paper, we have presented new parallel FIR filter structures,
which are beneficial to symmetric convolutions when the number of
taps is the multiple of 2 or 3. Multipliers are the major portions in hardware consumption for the parallel FIR filter implementation. The proposed new structure exploits the nature of even symmetric coefficients
and save a significant amount of multipliers at the expense of additional adders. Since multipliers outweigh adders in hardware cost, it is
profitable to exchange multipliers with adders. Moreover, the number
of increased adders stays still when the length of FIR filter becomes
large, whereas the number of reduced multipliers increases along with
the length of FIR filter. Consequently, the larger the length of FIR filters is, the more the proposed structures can save from the existing FFA
structures, with respect to the hardware cost. Overall, in this paper, we
have provided new parallel FIR structures consisting of advantageous
polyphase decompositions dealing with symmetric convolutions comparatively better than the existing FFA structures in terms of hardware
consumption.
REFERENCES
[1] D. A. Parker and K. K. Parhi, Low-area/power parallel FIR digital
filter implementations, J. VLSI Signal Process. Syst., vol. 17, no. 1,
pp. 7592, 1997.
[2] J. G. Chung and K. K. Parhi, Frequency-spectrum-based low-area
low-power parallel FIR filter design, EURASIP J. Appl. Signal
Process., vol. 2002, no. 9, pp. 444453, 2002.
[3] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation. New York: Wiley, 1999.
[4] Z.-J. Mou and P. Duhamel, Short-length FIR filters and their use in
fast nonrecursive filtering, IEEE Trans. Signal Process., vol. 39, no.
6, pp. 13221332, Jun. 1991.
[5] J. I. Acha, Computational structures for fast implementation of L-path
and L-block digital filters, IEEE Trans. Circuit Syst., vol. 36, no. 6, pp.
805812, Jun. 1989.
[6] C. Cheng and K. K. Parhi, Hardware efficient fast parallel FIR filter
structures based on iterated short convolution, IEEE Trans. Circuits
Syst. I, Reg. Papers, vol. 51, no. 8, pp. 14921500, Aug. 2004.
[7] C. Cheng and K. K. Parhi, Furthur complexity reduction of parallel
FIR filters, in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS 2005),
Kobe, Japan, May 2005.
[8] C. Cheng and K. K. Parhi, Low-cost parallel FIR structures with
2-stage parallelism, IEEE Trans. Circuits Syst. I, Reg. Papers, vol.
54, no. 2, pp. 280290, Feb. 2007.
[9] I.-S. Lin and S. K. Mitra, Overlapped block digital filtering, IEEE
Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 43, no. 8,
pp. 586596, Aug. 1996.
[10] Design Compiler User Guide, ver. B-2008.09, Synopsys Inc., Sep.
2008.
371
I. INTRODUCTION
Design of area- and power-efficient high-speed data path logic systems are one of the most substantial areas of research in VLSI system
design. In digital adders, the speed of addition is limited by the time
required to propagate a carry through the adder. The sum for each bit
position in an elementary adder is generated sequentially only after the
previous bit position has been summed and a carry propagated into the
next position.
The CSLA is used in many computational systems to alleviate the
problem of carry propagation delay by independently generating multiple carries and then select a carry to generate the sum [1]. However,
the CSLA is not area efficient because it uses multiple pairs of Ripple
Carry Adders (RCA) to generate partial sum and carry by considering
carry input Cin = 0 and Cin = 1, then the final sum and carry are
selected by the multiplexers (mux).
The basic idea of this work is to use Binary to Excess-1 Converter
(BEC) instead of RCA with Cin = 1 in the regular CSLA to achieve
lower area and power consumption [2][4]. The main advantage of this
BEC logic comes from the lesser number of logic gates than the n-bit
Full Adder (FA) structure. The details of the BEC logic are discussed
in Section III.
This brief is structured as follows. Section II deals with the delay
and area evaluation methodology of the basic adder blocks. Section III
presents the detailed structure and the function of the BEC logic. The
SQRT CSLA has been chosen for comparison with the proposed design as it has a more balanced delay, and requires lower power and
area [5], [6]. The delay and area evaluation methodology of the regular
and modified SQRT CSLA are presented in Sections IV and V, respectively. The ASIC implementation details and results are analyzed in
Section VI. Finally, the work is concluded in Section VII.
Manuscript received May 12, 2010; revised October 28, 2010; accepted December 15, 2010. Date of publication January 24, 2011; date of current version
January 18, 2012.
The authors are with the School of Electronics Engineering, VIT University,
Vellore 632 014, India (e-mail: ramkumar.b@vit.ac.in; kittur@vit.ac.in).
Color versions of one or more of the figures in this paper are available online
at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TVLSI.2010.2101621