
IMPLEMENTATION OF CHANNEL DEMODULATOR FOR DAB SYSTEM


Chien-Ming Wu, Ming-Der Shieh, Hsin-Fu Lo, and Min-Hsiung Hu

Department of Electronic Engineering, National Yunlin University of Science & Technology, Taiwan
Graduate School of Engineering Science & Technology, National Yunlin University of Science & Technology, Taiwan
Division of Design Service, National Science Council Chip Implementation Center (CIC), Taiwan

ABSTRACT
This paper describes the VLSI implementation of the Fast Fourier
Transform (FFT) for the Eureka-147 Digital Audio Broadcasting
(DAB) system. We emphasize how to minimize the hardware
requirement and manage the memory efficiently to meet the DAB
requirement. Implementation results demonstrate the applicability of
our work, which features a modular design, consumes less
silicon area, and facilitates extension to high-transmission-rate
applications. The core size of the resulting chip implementation is
2086×1806 μm² based on the TSMC 0.35 μm 1P4M CMOS
process. Performance evaluation reveals that our design for the
targeted channel demodulator outperforms previous solutions.

1. INTRODUCTION
The Digital Audio Broadcasting (DAB) system, described in the
European Eureka-147 standard [1], offers high-quality audio
services, supports multimedia data for mobile reception, and might
replace the traditional radio system. Basically, two strategies are
employed to implement the DAB receiver: the DSP-based
architecture [2, 3] and the ASIC-based implementation [4, 5]. The
former offers maximum flexibility, ease of use, and simple
programming, but it provides only limited processing capability.
In contrast, the ASIC-based implementation has the potential to
support real-time symbol decoding at low cost.
Figure 1 shows an overview of the DAB system, in which
ISO/MPEG coding is adopted for source coding and COFDM
(Coded Orthogonal Frequency Division Multiplexing) for channel
coding and modulation [1]. After convolutional coding, the
generated codewords are interleaved in frequency for the fast
information channel and in both time and frequency for the main
service channel, and then the OFDM modulation is performed. In
this paper, we focus on the design and implementation of the
channel demodulator, which essentially performs a Fast Fourier
Transform (FFT). In general, two basic types of FFT architectures
can be found in the literature: the pipelined architecture, with each
stage consisting of a butterfly unit [6, 7], and the single butterfly
architecture [5, 8], which employs just one radix-r butterfly unit. The
main concern is the trade-off between hardware overhead and speed
requirement.
Although the pipelined architecture can provide a higher
throughput rate than the single butterfly implementation, we are still
interested in the single butterfly architecture because of the
specifications of the channel demodulator as well as the hardware
considerations in implementing DAB receivers. For the
single butterfly implementation, a basic problem is how
to efficiently manage memory read/write accesses in order
to increase the throughput rate. The common solutions include: (1)
Use a high-radix implementation to reduce the total number of
memory accesses at the expense of increased arithmetic
complexity, i.e., the hardware requirement of a high-radix butterfly
unit. (2) Partition the memory into several banks to allow
concurrent accesses to multiple data with a more complicated
addressing scheme, which might lead to a larger routing area.
In this paper, we describe the design and implementation of the
FFT for the DAB channel demodulator. We show
how to use the conflict-free memory addressing arrangement in
[9] to minimize the hardware requirement and to meet the DAB
requirement. Implementation results demonstrate the applicability of
our work to the targeted channel demodulator and its advantages
over previous solutions [5, 7] in terms of hardware requirement.
The rest of this paper is organized as follows: Section 2 reviews the
background and our previous work [9] related to this paper. Section
3 describes the resulting architecture and design of the FFT processor.
Then, the corresponding chip implementation and performance
evaluation are shown in Section 4. Finally, Section 5 concludes this
work.

0-7803-7761-3/03/$17.00 ©2003 IEEE
[Figure 1: block diagram; legible labels include convolutional coding, interleaving, OFDM, transmitter, and noise and reflection]

Figure 1. An overview of the DAB system [5].

2. PRELIMINARY RESULTS
The N-point Discrete Fourier Transform (DFT) of a sequence
x(k) is defined as

$$X(n) = \sum_{k=0}^{N-1} x(k)\, W^{nk} \qquad (1)$$

where $n = 0, 1, \ldots, N-1$ and $W = e^{-j2\pi/N}$. From Eq. (1), we know
that $N^2$ multiplications and $N(N-1)$ additions are needed to directly
perform the required computations. By applying the FFT, the
computational complexity can be reduced to $O(N \log N)$.
If the number of sampled points is a power of the radix r, then it
is easy to compute the DFT by using a radix-r FFT algorithm. In
such a case, the N-point DFT can be decomposed into a set of
recursively related r-point transforms. Decimation in time (DIT)
and decimation in frequency (DIF) are the two basic classes of FFT
algorithms [10]. Specifically, the DIT FFT algorithm is based on
decomposing the input sequence x(k) into successively smaller
subsequences. The DIF FFT algorithm decomposes the
output sequence X(n) into smaller subsequences in the same way.
Figure 2 shows a DIT 8-point FFT algorithm, in which the data in
each stage can be processed based on the so-called butterfly units.
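
As a worked illustration of the DIT decomposition, the radix-2 case can be written out explicitly. This is the standard textbook identity rather than a formula taken from the paper; it assumes N is even, and the names E(n) and O(n) for the two half-length transforms are introduced here only for illustration. Splitting x(k) into its even- and odd-indexed subsequences gives

$$X(n) = \underbrace{\sum_{k=0}^{N/2-1} x(2k)\, W^{2nk}}_{E(n)} \;+\; W^{n} \underbrace{\sum_{k=0}^{N/2-1} x(2k+1)\, W^{2nk}}_{O(n)}$$

Because $W^{2} = e^{-j2\pi/(N/2)}$, both E(n) and O(n) are (N/2)-point DFTs, and because $W^{N/2} = -1$ the two halves of the output share the same products:

$$X(n) = E(n) + W^{n} O(n), \qquad X(n + N/2) = E(n) - W^{n} O(n), \qquad n = 0, 1, \ldots, N/2 - 1$$

For N = 2048, direct evaluation of the DFT needs $N^2 = 4{,}194{,}304$ complex multiplications, whereas the radix-2 FFT has $\log_2 N = 11$ stages of $N/2 = 1024$ butterflies each, i.e. 11,264 butterflies with one complex multiplication per butterfly.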

Figure 2. The data flow graph of the DIT 8-point FFT computation.

In general, an N-point FFT computation requires $(N/r) \log_r N$
radix-r butterfly computations, and either the pipelined architecture
or the single butterfly architecture can be selected for a dedicated
application. For the single butterfly implementation, this implies
$2N \log_r N$ memory accesses, which are the main bottleneck for fast
FFT computation. Therefore, we need an efficient memory
management strategy to overcome this problem, i.e., to reduce the
number of memory accesses or to increase the memory bandwidth.
In our previous work [9], we presented a set of simple but
efficient equations to partition the memory into a number of
memory banks such that the equivalent memory bandwidth can be
increased with simple interconnection networks.
Let m be the number of stages of the FFT
computation; then m can be computed by

$$m = \log_r N \qquad (2)$$

Following the notation of the conventional number system, it is
assumed that the original memory address $(d)_r$ is expressed in
unsigned radix-r representation defined as

$$(d)_r = (d_{m-1}, d_{m-2}, \ldots, d_2, d_1, d_0)_r \qquad (3)$$

where $d_i$ is an integer and $0 \le d_i \le r-1$. Consequently, a feasible
solution to partition the memory into r banks can be easily obtained
as shown in Eq. (4), which implies that the original address $(d)_r$ will
be placed in bank number $B(d, r)$. The correctness of Eq.
(4) is assured by observing that, for a given butterfly index, the
equation contains the distinguishable variable at each stage.

$$B(d, r) = (d_{m-1} + d_{m-2} + \cdots + d_2 + d_1 + d_0) \bmod r \qquad (4)$$

Finally, we consider the mapping of $(d)_r$ into one of the address
locations of the selected bank $B(d, r)$. To simplify the hardware
implementation, the assigned address $BA(d, r)$ in the bank $B(d, r)$ is
obtained by discarding the least significant digit of the original
address. Equation (5) causes no conflict because two
original addresses that differ only in the least significant digit
are distributed into different banks by Eq. (4), since
$0 \le d_0 \le r-1$.

$$BA(d, r) = (d_{m-1}, d_{m-2}, \ldots, d_2, d_1)_r \qquad (5)$$

3. FFT DESIGN AND IMPLEMENTATION

Figure 3 depicts the block diagram of the single butterfly
architecture for our FFT processor. It operates on a 24.576 MHz
clock and consists of a simple radix-2 DIT butterfly unit, a
single-port FFT RAM, a coefficient ROM, a control unit, and an
address-generate unit (AGU). All variables are complex and the internal
datapath widths are either 8 or 16 bits. The details of the VLSI
realization are described in the following subsections.

Figure 3. Block diagram of the FFT processor.

3.1 Memory Arrangement

For memory arrangement, we first have to decide whether the
ping-pong mode or the in-place mode is to be applied to store the
intermediate values when implementing the FFT RAM. The main
disadvantage of the former is that twice as much memory is
required in comparison with the in-place operation, although its control
circuit is simple. For in-place scheduling, exactly one memory space is
needed for storing the intermediate values, and the old
values are immediately overwritten by the newly computed values.
This is an important feature for the realization of long FFTs because
the area for storing the large amount of intermediate results
occupies a significant fraction of the available chip area. For this
reason, we consider only in-place schemes in this work. Basically,
the memory addresses of the in-place schedule can be generated
with little hardware overhead based on the cyclic rotation property
[11].
The lower hardware cost of the single butterfly
architecture is achieved at the price of a lower throughput
rate than the pipelined version. According to operational mode I
defined in the Eureka-147 standard, a 2048-point
FFT operation should be completed within 1.25 ms. Under such a
circumstance, it is not possible to complete the desired FFT
operation based on the radix-2 solution without memory partitioning,
given the chosen operating frequency of 24.576 MHz. In order to
make the single butterfly architecture meet the DAB requirement,
memory partitioning becomes a cost-effective solution. In our
implementation, the single-port FFT RAM is divided into r = 2
banks to meet the timing requirement, and the in-place scheduling
scheme is applied to save memory space.
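
The timing argument above can be checked with a short back-of-the-envelope calculation. The sketch below is not taken from the paper: it assumes the FFT is purely memory-bound, that each single-port bank serves exactly one read or write per clock cycle, and that accesses are spread evenly over the banks. The function names are hypothetical.

```python
F_CLK_HZ = 24.576e6    # clock frequency stated in the text
N_POINTS = 2048        # FFT length for operational mode I
DEADLINE_S = 1.25e-3   # completion time required for mode I


def num_stages(n, r):
    """Number of radix-r stages, m = log_r(N), computed with integers."""
    m = 0
    while n > 1:
        assert n % r == 0, "N must be a power of the radix"
        n //= r
        m += 1
    return m


def memory_accesses(n, r):
    """Total reads plus writes for a single butterfly unit: 2 * N * log_r(N)."""
    return 2 * n * num_stages(n, r)


def fft_cycles(n, r, banks):
    """ASSUMPTION: one access per single-port bank per cycle, evenly spread."""
    return memory_accesses(n, r) // banks


def fft_time_s(n, r, banks, f_clk_hz):
    """Memory-bound execution time under the assumption above."""
    return fft_cycles(n, r, banks) / f_clk_hz


if __name__ == "__main__":
    for banks in (1, 2):
        t = fft_time_s(N_POINTS, 2, banks, F_CLK_HZ)
        verdict = "meets" if t <= DEADLINE_S else "misses"
        print(f"{banks} bank(s): {fft_cycles(N_POINTS, 2, banks)} cycles, "
              f"{t * 1e3:.3f} ms, {verdict} the 1.25 ms budget")
```

Under these assumptions a single bank needs 45056 cycles (about 1.83 ms) and misses the budget, while two banks need 22528 cycles (about 0.92 ms), which is consistent with the choice of r = 2 banks described above.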
The address-generate unit shown in Figure 4 is designed to
generate addresses for the two memory banks and the coefficient ROM.
The butterfly counter sequentially generates the required
butterfly indices at stage one. The two barrel shifters first
concatenate their indices, respectively, with the current butterfly
index and then emulate the right-rotation property of the addresses at
the present stage specified by the stage counter. Finally, the MUX
distributes the addresses based on Eqs. (2)-(5) such that the
output of each barrel shifter is directed to the correct
memory bank. For the radix-2 implementation, the control signal
Bank-index is derived by performing a bit-wise XOR operation on
the original address according to Eq. (4).
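
The bank assignment of Eqs. (3)-(5) and its XOR form for radix 2 can be sketched in a few lines. This is an illustrative model, not the authors' hardware: the function names are hypothetical, and `butterfly_pairs` assumes the conventional in-place radix-2 indexing, in which the two operands of a butterfly differ in exactly one address bit, rather than the rotation-based address generator of Figure 4.

```python
def digits(d, r, m):
    """Radix-r digits of address d, least significant first (Eq. 3)."""
    out = []
    for _ in range(m):
        out.append(d % r)
        d //= r
    return out


def bank(d, r, m):
    """Bank number B(d, r): sum of the address digits modulo r (Eq. 4)."""
    return sum(digits(d, r, m)) % r


def bank_address(d, r):
    """Address inside the bank, BA(d, r): drop the least significant digit (Eq. 5)."""
    return d // r


def xor_bank(d):
    """Radix-2 special case: the digit sum modulo 2 is the XOR of all address bits."""
    return bin(d).count("1") & 1


def butterfly_pairs(m, stage):
    """ASSUMPTION: operand addresses of a stage differ only in bit (stage - 1)."""
    bit = 1 << (stage - 1)
    return [(a, a | bit) for a in range(1 << m) if not a & bit]


if __name__ == "__main__":
    m = 11  # 2048 points, radix 2
    conflicts = sum(
        bank(a, 2, m) == bank(b, 2, m)
        for stage in range(1, m + 1)
        for a, b in butterfly_pairs(m, stage)
    )
    print("bank conflicts over all stages:", conflicts)
```

Because the two operands of a radix-2 butterfly differ in a single bit, their bit parities differ, so they always land in different banks and can be accessed in the same cycle. Dropping the least significant digit keeps the mapping one-to-one, since the bank number recovers the discarded digit.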
In addition, the contents of the coefficient ROM and the
corresponding addressing rules can be easily decided by following
the data flow graph of the DIT FFT computation. Note that we only
need to store half the twiddle coefficients due to their
symmetry. Let the radix-2 twiddle coefficient $W^p = e^{-j2\pi p/N}$ be
stored in the pth ROM address. Then, the ROM contents can be
accessed based on the current butterfly index BI and the present
stage number t according to the following equations. Let the binary
representation of the current butterfly index be given by

$$BI = (b_{m-2}, b_{m-3}, \ldots, b_2, b_1, b_0)_2 \qquad (6)$$

where $m = \log_2 N$ is the number of stages for the radix-2
implementation. From the data flow graph, the elements $b_i$ of BI
can be used as variables in conjunction with the value t to generate
proper ROM addresses. Specifically, we first generate a vector from
the present t value based on Eq. (7), and then the desired ROM
address p(BI, t) can be computed by using the vector as a mask to
filter out unwanted $b_i$'s according to Eq. (8).

$$2^{t-1} - 1 = [q_{m-2}, q_{m-3}, \ldots, q_1, q_0], \quad t = 1, 2, \ldots, m \qquad (7)$$

Equation (7) can be easily implemented by resetting a shift register
and then shifting in a "one" from the least significant bit each time the
stage advances. Eq. (8) represents the masked output of
the bit reversal of the current butterfly index. In both cases, the
implementation cost is almost negligible.

Figure 4. The block diagram of the address-generate unit.

3.2 Butterfly Unit

The butterfly unit is the core of an FFT processor and determines
the achievable clock speed and the resulting throughput. In this work,
the butterfly unit was designed with the simple radix-2 DIT-FFT
algorithm. As shown in Figure 5, the arithmetic operations consist
of calculating a pair of complex values, A'=A+BW and B'=A-BW,
from a pair of complex inputs, A and B, and the twiddle coefficient
W.

Figure 5. The arithmetic of the radix-2 DIT-FFT algorithm.

For a butterfly unit without pipelining, the critical
path is the sum of the memory read operation, the arithmetic
operation (multiplication and addition of complex numbers), and the
memory write operation. To reduce the critical path delay, we divide
the whole operation of the butterfly unit into (s+2) steps
(the first step for the memory read operation, the following s steps for
the arithmetic operation, and the last step for the memory write operation),
as indicated in Figure 6. Because of the in-place computation, we have
to schedule the tasks assigned to the pipelined butterfly unit such
that no control hazard occurs during memory accesses. A control
hazard (see Figure 7(a)) results from the conflict when the butterfly
unit intends to access more than two data in the same memory bank.
Figure 7(b) shows the schedule that eliminates the control hazard,
provided that only single-port memory is available in the
implementation. The arrangement of Figure 7(b) results in only 50%
hardware utilization of the pipelined butterfly unit. In contrast,
100% hardware utilization can be achieved if dual-port memory
is employed in the design. Note that the area occupied by the
memory module is proportional not only to the number of stored
data but also to the number of ports. Obviously,
the chip area of a dual-port memory is much larger than that of a
single-port memory.
Since we use a 24.576 MHz clock in our FFT processor, the
arithmetic operation can be finished within one clock cycle (s = 1).
Each butterfly operation thus takes only three clock cycles, one each
for the memory read operation, the arithmetic operation, and the memory write
operation. In addition, only 50% hardware utilization is achieved
because single-port memory is employed in our design to reduce
the hardware cost.

Figure 6. Radix-2 DIT pipelined butterfly unit.

Figure 7. (a) The control hazard. (b) The schedule that resolves the control
hazard.
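
The butterfly arithmetic A'=A+BW, B'=A-BW and the half-size twiddle table can be exercised with a small software model. This is a minimal floating-point sketch, not the 8/16-bit fixed-point datapath of the chip: it assumes bit-reversed input ordering with natural-order output, and it indexes the table directly by stride instead of using the masked bit-reversal addressing of Eqs. (6)-(8). All function names are hypothetical.

```python
import cmath


def butterfly(a, b, w):
    """Radix-2 DIT butterfly: A' = A + B*W, B' = A - B*W."""
    t = b * w
    return a + t, a - t


def twiddle_rom(n):
    """Half-size table: W^p = exp(-j*2*pi*p/N) for p = 0 .. N/2 - 1."""
    return [cmath.exp(-2j * cmath.pi * p / n) for p in range(n // 2)]


def bit_reverse(i, m):
    """Reverse the m-bit binary representation of i."""
    return int(format(i, "0{}b".format(m))[::-1], 2)


def fft_inplace(x):
    """Radix-2 DIT FFT that overwrites each operand pair with its results.

    ASSUMPTION: N is a power of two with N >= 2; input is reordered by bit
    reversal first, so the output appears in natural order.
    """
    n = len(x)
    m = n.bit_length() - 1
    assert n >= 2 and 1 << m == n
    rom = twiddle_rom(n)
    data = [x[bit_reverse(i, m)] for i in range(n)]
    for stage in range(1, m + 1):
        half = 1 << (stage - 1)   # distance between the two operands
        step = n >> stage         # ROM stride for this stage
        for start in range(0, n, 2 * half):
            for j in range(half):
                a, b = start + j, start + j + half
                data[a], data[b] = butterfly(data[a], data[b], rom[j * step])
    return data


def dft(x):
    """Direct evaluation of the DFT definition, used only as a reference."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * i * k / n) for k in range(n))
            for i in range(n)]
```

Only N/2 table entries are ever addressed, because the largest index used is (N/2 - 1); the other half of the twiddle factors is covered by the subtraction in the butterfly, since $W^{p+N/2} = -W^p$.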

4. CHIP REALIZATION AND COMPARISON


All the modules in our design have been successfully
implemented based on the TSMC (Taiwan Semiconductor
Manufacturing Company) 0.35 μm 1P4M CMOS process and
simulated using Synopsys and Cadence tools. Based on the
specifications of the DAB channel demodulator, the resulting FFT
processor is capable of completing the four operational modes
(mode I: 2048 points, mode II: 512 points, mode III: 256 points,
and mode IV: 1024 points) with a clock frequency of 24.576 MHz.
The corresponding physical layout is shown in Figure 8; it
includes 2×1024×16 SRAMs (two banks, each containing 1024×16
bits) and 2×1024×8 ROMs (one for the real part and another for the
imaginary part). In terms of 2-input NAND gates, the total
gate count is 4351, excluding the memories used. The
resulting core size of the chip implementation is about 2086×1806
μm² and the overall chip size including I/O pads is 2856×2594 μm².

Figure 8. The layout of the developed FFT processor.


We compare the performance of our implementation with the
following FFT implementations: the pipelined architecture [7] and
the single butterfly architecture [5]. The circuit complexities of
these designs are compiled in Table 1. The pipelined architecture in
[7] might be the preferred choice for high-speed applications, but it
is not suitable for the DAB system. The memory
bandwidth problem of [5] is solved by introducing a more
complicated structure (the radix-4 butterfly unit) and utilizing more
memory resources. Note that the operating frequency of [5] is
12.288 MHz. By taking advantage of efficient memory partitioning and
employing the pipelined butterfly unit, our design reduces the
required area while still meeting the DAB specifications.
For DAB applications, our design outperforms
Delaruelle's work [5] in terms of hardware requirement.

Results show that our implementation has the potential to
consume less silicon area and to facilitate extension to high
transmission rate requirements.

REFERENCES
[1] ETS 300 401, "Radio broadcasting system: Digital audio broadcasting
(DAB) to mobile, portable and fixed receivers," ETSI, 2nd edition, May
1997.
[2] J. A. Huisken, F. V. Lax, A. Delaruelle, and N. J. L. Philips,
"Specification, partitioning and design of a DAB channel decoder," in
Proc. VLSI Signal Processing Workshop, pp. 21-29, 1993.
[3] M. Bolle, D. Clawin, K. Gieske, F. Hofmann, T. Mlasko, M. J. Ruf, and
G. Spreitz, "The receiver engine chipset for digital audio broadcasting,"
in Proc. URSI Int. Symp. Signals, Systems, and Electronics, pp. 338-342,
1998.
[4] A. Delaruelle, J. Huisken, J. V. Loan, and F. Welten, "A chip set for a
digital audio broadcasting channel decoder," in Proc. IEEE Custom
Integrated Circuits Conf., pp. 13.4.1-13.4.4, 1995.
[5] A. Delaruelle, J. Huisken, J. van Laan, and F. Welten, "A channel
demodulator IC for digital audio broadcasting," in Proc. IEEE Custom
Integrated Circuits Conf., pp. 47-50, 1994.
[6] S. He and M. Torkelson, "Design and implementation of a 1024-point
pipeline FFT processor," in Proc. IEEE Custom Integrated Circuits Conf.,
pp. 131-134, 1998.
[7] E. Bidet, D. Castelain, C. Joanblanq, and P. Senn, "A fast single-chip
implementation of 8192 complex point FFT," IEEE J. Solid-State
Circuits, vol. 30, no. 3, pp. 300-305, March 1995.
[8] E. Cetin, Richard C. S. Morling, and I. Kale, "An extensible complex fast
Fourier transform processor chip for real-time spectrum analysis and
measurement," IEEE Trans. Instrumentation and Measurement, vol. 47,
no. 1, pp. 95-99, Feb. 1998.
[9] H. F. Lo, M. D. Shieh, and C. M. Wu, "Design of an efficient FFT
processor for DAB system," in Proc. IEEE Int. Symp. Circuits and
Systems, pp. 654-657, 2001.
[10] E. O. Brigham, The Fast Fourier Transform and Its Applications.
Prentice-Hall Inc., 1988.
[11] M. Biver, H. Kaeslin, and C. Tommasini, "In-place updating of path
metrics in Viterbi decoders," IEEE J. Solid-State Circuits, vol. 24, pp.
1158-1159, Aug. 1989.
Table 1. Comparisons of different implementations

E. Bidet
171
No. of butterfly

unit

logy, radix-r

A. Delaruelle

Gate counts of
arithmetic

components
Memory size
No. of clock
cycles
N = 2048

Sub"'

8160*( log: -1)

+896* log:
2048 (dual- ort)

I , radix-4

I I . radix-2

CM

4 *log: Adder'"
4*log:

Proposed

151

3*( logy -1) CM"'


Arithmetic
components

5. CONCLUSION
To date, much effort has been devoted to the
development of low-cost DAB products. Among the techniques needed to
build a DAB receiver, the FFT is one of the key components and
is very suitable for ASIC implementation. This paper explores
efficient solutions for hardware implementations of the FFT
processor such that they fit the specification of the Eureka-147
standard under limited hardware resources. All the functional
blocks are designed, simulated, and verified using the Synopsys and
Cadence software, and the final layout is ready for VLSI fabrication
based on the 0.35 μm TSMC process and the Compass cell library.

ETS 300 401, "Radio broadcasting system: Digital audio broadcasting


(DAB)to mobile. portable and fixed receivers", ETSI, 2'edition.. May

1 CM
1 Adder
Sub

4 Adder

4 Sub

9156

4 Registe

2954

2x2048
4xA,")

2458

1 I264

22528

Note: (1) C M %bit complex-number multiplier, (2) A d d 16-bits


N
and ( 5 ) A2 =
adder, (3) S u b 16-bit subtractor. (4) A , = --log:,
4

